**Status:** ✅ Completed
**Date:** October 21, 2025
**Task:** B1.2 - Create simple PDF text extractor (proof of concept)
This is a proof-of-concept PDF text and code extractor built for Skill Seeker. It demonstrates the feasibility of extracting documentation content from PDF files using PyMuPDF (fitz).
- **Font-based detection:** Analyzes font properties to find monospace fonts typically used for code.
- **Indent-based detection:** Identifies code blocks by consistent indentation patterns.
- **Pattern-based detection:** Uses regex to find common code structures.
- **Language detection:** Supports detection of 19 programming languages.
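As an illustration of the font-based approach, the sketch below operates on the dictionary that PyMuPDF's `page.get_text("dict")` returns, treating spans whose font name contains a monospace hint as code. The hint list and grouping logic are assumptions for this sketch; the POC's actual heuristics may differ.

```python
# Sketch of font-based code detection over PyMuPDF's get_text("dict")
# structure. The monospace name hints below are an assumption.
MONO_HINTS = ("mono", "courier", "consolas", "menlo")

def is_monospace(font_name):
    """Heuristic: a font whose name contains a monospace hint."""
    return any(hint in font_name.lower() for hint in MONO_HINTS)

def extract_font_code_blocks(page_dict):
    """Group consecutive all-monospace lines into code blocks.

    `page_dict` is the dict returned by page.get_text("dict").
    """
    blocks, current = [], []
    current_font = ""

    def flush():
        if current:
            blocks.append({"code": "\n".join(current),
                           "detection_method": "font",
                           "font": current_font})
            current.clear()

    for block in page_dict.get("blocks", []):
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            if spans and all(is_monospace(s["font"]) for s in spans):
                current_font = spans[0]["font"]
                current.append("".join(s["text"] for s in spans))
            else:
                flush()
    flush()
    return blocks
```

A prose line between two monospace runs ends the current block, so interleaved text and code on one page produce separate samples.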
```bash
pip install PyMuPDF
python3 -c "import fitz; print(fitz.__doc__)"
```
```bash
# Extract from PDF (print to stdout)
python3 cli/pdf_extractor_poc.py input.pdf

# Save to JSON file
python3 cli/pdf_extractor_poc.py input.pdf --output result.json

# Verbose mode (shows progress)
python3 cli/pdf_extractor_poc.py input.pdf --verbose

# Pretty-printed JSON
python3 cli/pdf_extractor_poc.py input.pdf --pretty

# Extract Python documentation
python3 cli/pdf_extractor_poc.py docs/python_guide.pdf -o python_extracted.json -v

# Extract with verbose and pretty output
python3 cli/pdf_extractor_poc.py manual.pdf -o manual.json -v --pretty

# Quick test (print to screen)
python3 cli/pdf_extractor_poc.py sample.pdf --pretty
```
```json
{
  "source_file": "input.pdf",
  "metadata": {
    "title": "Documentation Title",
    "author": "Author Name",
    "subject": "Subject",
    "creator": "PDF Creator",
    "producer": "PDF Producer"
  },
  "total_pages": 50,
  "total_chars": 125000,
  "total_code_blocks": 87,
  "total_headings": 45,
  "total_images": 12,
  "languages_detected": {
    "python": 52,
    "javascript": 20,
    "sql": 10,
    "shell": 5
  },
  "pages": [
    {
      "page_number": 1,
      "text": "Plain text content...",
      "markdown": "# Heading\nContent...",
      "headings": [
        {
          "level": "h1",
          "text": "Getting Started"
        }
      ],
      "code_samples": [
        {
          "code": "def hello():\n    print('Hello')",
          "language": "python",
          "detection_method": "font",
          "font": "Courier-New"
        }
      ],
      "images_count": 2,
      "char_count": 2500,
      "code_blocks_count": 3
    }
  ]
}
```
Each page contains:

- `page_number` - 1-indexed page number
- `text` - Plain text content
- `markdown` - Markdown-formatted content
- `headings` - Array of heading objects
- `code_samples` - Array of detected code blocks
- `images_count` - Number of images on page
- `char_count` - Character count
- `code_blocks_count` - Number of code blocks found

Each code sample includes:

- `code` - The actual code text
- `language` - Detected language (or 'unknown')
- `detection_method` - How it was found ('font', 'indent', or 'pattern')
- `font` - Font name (if detected by font method)
- `pattern_type` - Type of pattern (if detected by pattern method)

Detection accuracy:

- Font-based detection: ⭐⭐⭐⭐⭐ (Best)
- Indent-based detection: ⭐⭐⭐⭐ (Good)
- Pattern-based detection: ⭐⭐⭐ (Fair)
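Since font-based detection is the most reliable of the three, a downstream consumer might keep only font-detected samples. A minimal filter over the output shape documented above (the field names come from that schema):

```python
def high_confidence_samples(result):
    """Collect only code samples found via the font method,
    the most reliable of the three detection approaches.

    `result` follows the extractor's documented JSON shape.
    """
    samples = []
    for page in result["pages"]:
        for sample in page["code_samples"]:
            if sample["detection_method"] == "font":
                samples.append(sample)
    return samples
```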
Language detection is based on keyword patterns, not AST parsing.
Performance was tested on various PDF sizes. Memory usage: ~50-200 MB, depending on PDF size and image content.
| Feature | Web Scraper | PDF Extractor POC |
|---|---|---|
| Content source | HTML websites | PDF files |
| Code detection | CSS selectors | Font/indent/pattern |
| Language detection | CSS classes + heuristics | Pattern matching |
| Structure | Excellent | Good |
| Links | Full support | Not supported |
| Images | Referenced | Counted only |
| Categories | Auto-categorized | Not implemented |
| Output format | JSON | JSON (compatible) |
Integration targets: `pdf_scraper.py` CLI tool, `doc_scraper.py` (`scrape_pdf`).

Run the extractor:
```bash
python3 cli/pdf_extractor_poc.py test.pdf -o test_result.json -v --pretty
```
Verify output:

- `total_code_blocks` > 0
- `languages_detected` includes expected languages
- Check `code_samples` for accuracy

Recommended test PDFs:

- Good PDF: well-formatted with monospace code
- Poor PDF: scanned or badly formatted
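The verification steps above can be scripted. A minimal check against a saved output file (field names follow the documented output format; the function name is ours):

```python
import json

def verify_extraction(path):
    """Run basic sanity checks on an extractor output file.

    Returns a list of problem descriptions; empty means all checks passed.
    """
    with open(path) as f:
        result = json.load(f)
    problems = []
    if result.get("total_code_blocks", 0) <= 0:
        problems.append("no code blocks detected")
    if not result.get("languages_detected"):
        problems.append("no languages detected")
    return problems
```

Run it on `test_result.json` after the CLI invocation above; a non-empty list suggests the PDF is scanned or badly formatted.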
```python
from cli.pdf_extractor_poc import PDFExtractor

# Create extractor
extractor = PDFExtractor('docs/manual.pdf', verbose=True)

# Extract all pages
result = extractor.extract_all()

# Access data
print(f"Total pages: {result['total_pages']}")
print(f"Code blocks: {result['total_code_blocks']}")
print(f"Languages: {result['languages_detected']}")

# Iterate pages
for page in result['pages']:
    print(f"\nPage {page['page_number']}:")
    print(f"  Code blocks: {page['code_blocks_count']}")
    for code in page['code_samples']:
        print(f"  - {code['language']}: {len(code['code'])} chars")
```
```python
from cli.pdf_extractor_poc import PDFExtractor

extractor = PDFExtractor('input.pdf')

# Override language detection
def custom_detect(code):
    if 'SELECT' in code.upper():
        return 'sql'
    return extractor.detect_language_from_code(code)

# Use in extraction
# (requires modifying the class to support custom detection)
```
To add language detection for a new language, edit `detect_language_from_code()`:

```python
patterns = {
    # ... existing languages ...
    'newlang': [r'pattern1', r'pattern2', r'pattern3'],
}
```
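One plausible way such a pattern table can drive detection is regex scoring: count hits per language and return the best match. This is a sketch with an illustrative mini-table, not the POC's actual implementation:

```python
import re

# Illustrative pattern table in the same shape as the POC's;
# only three languages shown.
PATTERNS = {
    "python": [r"\bdef \w+\(", r"\bimport \w+", r"\bself\b"],
    "sql": [r"\bSELECT\b", r"\bFROM\b", r"\bWHERE\b"],
    "javascript": [r"\bfunction \w+\(", r"\bconst \w+", r"=>"],
}

def detect_language(code):
    """Return the language whose patterns match most often, else 'unknown'."""
    scores = {
        lang: sum(1 for p in pats if re.search(p, code))
        for lang, pats in PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Keyword scoring like this is what makes the detection "Fair" rather than exact: prose that happens to contain keywords can score above zero.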
To add a new detection method, create a method like:

```python
def detect_code_blocks_by_newmethod(self, page):
    """Detect code using the new method."""
    code_blocks = []
    # ... your detection logic ...
    return code_blocks
```
Then add it to `extract_page()`:

```python
newmethod_code_blocks = self.detect_code_blocks_by_newmethod(page)
all_code_blocks = (font_code_blocks + indent_code_blocks
                   + pattern_code_blocks + newmethod_code_blocks)
```
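Since several methods can flag the same region, combining their results may call for deduplication. A sketch keyed on normalized code text, keeping the first (earliest-listed, i.e. most reliable) method's entry; the POC's actual merge strategy may differ:

```python
def merge_code_blocks(*block_lists):
    """Concatenate detection results, dropping duplicate code bodies.

    Pass block lists in reliability order (font, indent, pattern, ...)
    so the surviving entry records the strongest detection method.
    """
    seen, merged = set(), []
    for blocks in block_lists:
        for block in blocks:
            key = block["code"].strip()
            if key and key not in seen:
                seen.add(key)
                merged.append(block)
    return merged
```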
This POC successfully demonstrates the feasibility of extracting text, headings, and code blocks from PDF documentation using PyMuPDF.
Ready for B1.3: The foundation is solid. Next step is adding page chunking and handling large PDFs.
**POC Completed:** October 21, 2025
**Next Task:** B1.3 - Add PDF page detection and chunking