Comprehensive guide to advanced PDF extraction features (Priority 2 & 3).
Skill Seeker's PDF extractor now includes powerful advanced features for handling complex PDF scenarios:
- Priority 2 Features (More PDF Types): OCR for scanned PDFs, password-protected PDFs, and table extraction
- Priority 3 Features (Performance Optimizations): parallel page processing and intelligent caching
Extract text from scanned PDFs using Optical Character Recognition.
# Install Tesseract OCR engine
# Ubuntu/Debian
sudo apt-get install tesseract-ocr
# macOS
brew install tesseract
# Install Python packages
pip install pytesseract Pillow
# Basic OCR
python3 cli/pdf_extractor_poc.py scanned.pdf --ocr
# OCR with other options
python3 cli/pdf_extractor_poc.py scanned.pdf --ocr --verbose -o output.json
# Full skill creation with OCR
python3 cli/pdf_scraper.py --pdf scanned.pdf --name myskill --ocr
📄 Extracting from: scanned.pdf
Pages: 50
OCR: ✅ enabled
Page 1: 245 chars, 0 code blocks, 2 headings, 0 images, 0 tables
OCR extracted 245 chars (was 12)
Page 2: 389 chars, 1 code blocks, 3 headings, 0 images, 0 tables
OCR extracted 389 chars (was 5)
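As the sample output suggests, OCR kicks in when a page yields little or no embedded text. Below is a minimal sketch of that fallback, assuming PyMuPDF, Pillow, and pytesseract; it is illustrative only and not necessarily the extractor's exact internals.

```python
# Sketch: OCR fallback for scanned pages (assumes PyMuPDF + Pillow + pytesseract).
import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_page(page, dpi=300):
    """Render a PDF page to an image and run Tesseract on it."""
    pix = page.get_pixmap(dpi=dpi)
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(img)

with fitz.open("scanned.pdf") as doc:
    for page in doc:
        text = page.get_text()
        if len(text.strip()) < 20:   # heuristic: probably a scanned page
            text = ocr_page(page)    # fall back to OCR
        print(len(text), "chars")
```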
- Use `--parallel` with OCR for faster processing
- Use `--verbose` to see OCR progress

Handle encrypted PDFs with password protection.
# Basic usage
python3 cli/pdf_extractor_poc.py encrypted.pdf --password mypassword
# With full workflow
python3 cli/pdf_scraper.py --pdf encrypted.pdf --name myskill --password mypassword
Encryption is detected automatically (checked via `doc.is_encrypted`).

📄 Extracting from: encrypted.pdf
🔐 PDF is encrypted, trying password...
✅ Password accepted
Pages: 100
Metadata: {...}
# Missing password
❌ PDF is encrypted but no password provided
Use --password option to provide password
# Wrong password
❌ Invalid password
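For reference, this is a minimal sketch of how an encrypted PDF is typically opened with PyMuPDF; the `--password` option presumably maps to something along these lines, but the exact flow in the extractor may differ.

```python
# Sketch: authenticating against an encrypted PDF with PyMuPDF.
import fitz  # PyMuPDF

doc = fitz.open("encrypted.pdf")
if doc.is_encrypted:
    # authenticate() returns 0 on failure, non-zero on success
    if not doc.authenticate("mypassword"):
        raise SystemExit("Invalid password")
print("Pages:", doc.page_count)
```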
Extract tables from PDFs and include them in skill references.
# Extract tables
python3 cli/pdf_extractor_poc.py data.pdf --extract-tables
# With other options
python3 cli/pdf_extractor_poc.py data.pdf --extract-tables --verbose -o output.json
# Full skill creation with tables
python3 cli/pdf_scraper.py --pdf data.pdf --name myskill --extract-tables
Table detection is based on the `find_tables()` method.

📄 Extracting from: data.pdf
Table extraction: ✅ enabled
Page 5: 892 chars, 2 code blocks, 4 headings, 0 images, 2 tables
Found table 0: 10x4
Found table 1: 15x6
✅ Extraction complete:
Tables found: 25
{
"tables": [
{
"table_index": 0,
"rows": [
["Header 1", "Header 2", "Header 3"],
["Data 1", "Data 2", "Data 3"],
...
],
"bbox": [x0, y0, x1, y1],
"row_count": 10,
"col_count": 4
}
]
}
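A rough sketch of how a structure like the one above can be produced, assuming PyMuPDF's `find_tables()` API (available in PyMuPDF 1.23+); the real extractor may differ in details such as per-page grouping.

```python
# Sketch: collect tables from every page into the JSON-like shape shown above.
import fitz  # PyMuPDF

def extract_tables(pdf_path):
    tables = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            for idx, table in enumerate(page.find_tables().tables):
                tables.append({
                    "table_index": idx,
                    "rows": table.extract(),      # list of rows (lists of cell text)
                    "bbox": list(table.bbox),
                    "row_count": table.row_count,
                    "col_count": table.col_count,
                })
    return tables
```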
Tables are automatically included in reference files when building skills:
## Data Tables
### Table 1 (Page 5)
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Data 1 | Data 2 | Data 3 |
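For illustration, a hypothetical helper that renders extracted rows as a Markdown table like the one above (not the tool's actual code):

```python
def rows_to_markdown(rows):
    """Render rows (first row = header) as a Markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("----------" for _ in header) + "|"]
    for row in body:
        lines.append("| " + " | ".join("" if c is None else str(c) for c in row) + " |")
    return "\n".join(lines)

print(rows_to_markdown([["Header 1", "Header 2", "Header 3"],
                        ["Data 1", "Data 2", "Data 3"]]))
```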
Process pages in parallel for 3x faster extraction.
# Enable parallel processing (auto-detects CPU count)
python3 cli/pdf_extractor_poc.py large.pdf --parallel
# Specify worker count
python3 cli/pdf_extractor_poc.py large.pdf --parallel --workers 8
# With full workflow
python3 cli/pdf_scraper.py --pdf large.pdf --name myskill --parallel --workers 8
📄 Extracting from: large.pdf
Pages: 500
Parallel processing: ✅ enabled (8 workers)
🚀 Extracting 500 pages in parallel (8 workers)...
✅ Extraction complete:
Total characters: 1,250,000
Code blocks found: 450
| Pages | Sequential | Parallel (4 workers) | Parallel (8 workers) |
|---|---|---|---|
| 50 | 25s | 10s (2.5x) | 8s (3.1x) |
| 100 | 50s | 18s (2.8x) | 15s (3.3x) |
| 500 | 4m 10s | 1m 30s (2.8x) | 1m 15s (3.3x) |
| 1000 | 8m 20s | 3m 00s (2.8x) | 2m 30s (3.3x) |
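As the tips below note, parallel processing is built on `concurrent.futures`. A minimal sketch of per-page parallelism under the assumption that pages are processed independently and each worker re-opens the file (not necessarily how the extractor schedules its work):

```python
# Sketch: parallel per-page text extraction with concurrent.futures.
# Each worker re-opens the PDF because fitz documents are not picklable.
import fitz  # PyMuPDF
from concurrent.futures import ProcessPoolExecutor

def extract_page(args):
    """Worker: open the PDF and extract text from one page."""
    pdf_path, page_number = args
    with fitz.open(pdf_path) as doc:
        return page_number, doc[page_number].get_text()

def extract_parallel(pdf_path, max_workers=8):
    with fitz.open(pdf_path) as doc:
        page_count = doc.page_count
    tasks = [(pdf_path, i) for i in range(page_count)]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        results = dict(pool.map(extract_page, tasks))
    return [results[i] for i in range(page_count)]  # restore page order

# On Windows/macOS, call extract_parallel() under `if __name__ == "__main__":`.
```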
- Set `--workers` equal to your CPU core count
- Use `--no-cache` for first-time processing
- Parallel processing uses `concurrent.futures` (Python 3.2+)

Intelligent caching of expensive operations for faster re-extraction.
# Caching enabled by default
python3 cli/pdf_extractor_poc.py input.pdf
# Disable caching
python3 cli/pdf_extractor_poc.py input.pdf --no-cache
Page 1: Using cached data
Page 2: Using cached data
Page 3: 892 chars, 2 code blocks, 4 headings, 0 images, 0 tables
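Conceptually, a page-level cache only needs to be keyed on the file's contents and the page number so that editing the PDF invalidates old entries. A simple illustrative sketch (the extractor's actual cache layout may differ):

```python
# Sketch: on-disk page cache keyed by file hash and page number (illustrative).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".pdf_cache")

def _file_key(pdf_path):
    """Hash the file contents so edits to the PDF invalidate the cache."""
    return hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()[:16]

def cached_page(pdf_path, page_number, compute):
    """Return cached page data if present; otherwise compute and store it."""
    CACHE_DIR.mkdir(exist_ok=True)
    entry = CACHE_DIR / f"{_file_key(pdf_path)}_{page_number}.json"
    if entry.exists():
        return json.loads(entry.read_text())
    data = compute(page_number)          # e.g. run text/table extraction
    entry.write_text(json.dumps(data))
    return data
```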
Extract everything as fast as possible:
python3 cli/pdf_scraper.py \
--pdf docs/manual.pdf \
--name myskill \
--extract-images \
--extract-tables \
--parallel \
--workers 8 \
--min-quality 5.0
OCR a scanned PDF with table extraction:

python3 cli/pdf_scraper.py \
--pdf docs/scanned.pdf \
--name myskill \
--ocr \
--extract-tables \
--parallel \
--workers 4
Process an encrypted PDF with all features enabled:

python3 cli/pdf_scraper.py \
--pdf docs/encrypted.pdf \
--name myskill \
--password mypassword \
--extract-images \
--extract-tables \
--parallel \
--workers 8 \
--verbose
| Configuration | Time | Speedup |
|---|---|---|
| Basic (sequential) | 4m 10s | 1.0x (baseline) |
| + Caching | 2m 30s | 1.7x |
| + Parallel (4 workers) | 1m 30s | 2.8x |
| + Parallel (8 workers) | 1m 15s | 3.3x |
| + All optimizations | 1m 10s | 3.6x |
| Feature | Time Impact | Memory Impact |
|---|---|---|
| OCR | +2-5s per page | +50MB per page |
| Table extraction | +0.5s per page | +10MB |
| Image extraction | +0.2s per image | Varies |
| Parallel (8 workers) | -66% total time | +8x memory |
| Caching | -50% on re-run | +100MB |
Problem: pytesseract not found
# Install pytesseract
pip install pytesseract
# Install Tesseract engine
sudo apt-get install tesseract-ocr # Ubuntu
brew install tesseract # macOS
Problem: Low OCR quality
Problem: Out of memory errors
# Reduce worker count
python3 cli/pdf_extractor_poc.py large.pdf --parallel --workers 2
# Or disable parallel
python3 cli/pdf_extractor_poc.py large.pdf
Problem: Not faster than sequential
Problem: Tables not detected
Use `--verbose` to see detection attempts.

Problem: Malformed table data
Use parallel processing:
python3 cli/pdf_scraper.py --pdf large.pdf --parallel --workers 8
Extract to JSON first, then build skill:
python3 cli/pdf_extractor_poc.py large.pdf -o extracted.json --parallel
python3 cli/pdf_scraper.py --from-json extracted.json --name myskill
Monitor system resources
Use OCR with parallel processing:
python3 cli/pdf_scraper.py --pdf scanned.pdf --ocr --parallel --workers 4
Test on sample pages first
Use --verbose to monitor OCR performance
Use environment variable for password:
export PDF_PASSWORD="mypassword"
python3 cli/pdf_scraper.py --pdf encrypted.pdf --password "$PDF_PASSWORD"
Clear your shell history after use so the password is not retained
Enable table extraction:
python3 cli/pdf_scraper.py --pdf data.pdf --extract-tables
Check table quality in output JSON
Manual review recommended for critical data
from pdf_extractor_poc import PDFExtractor
extractor = PDFExtractor(
pdf_path="input.pdf",
verbose=True,
chunk_size=10,
min_quality=5.0,
extract_images=True,
image_dir="images/",
min_image_size=100,
# Advanced features
use_ocr=True,
password="mypassword",
extract_tables=True,
parallel=True,
max_workers=8,
use_cache=True
)
result = extractor.extract_all()
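A hypothetical way to inspect the result, assuming it mirrors the JSON output format shown earlier (field names in the real extractor may differ):

```python
# Hypothetical usage: list extracted tables from the result dict,
# assuming the same structure as the JSON output format above.
for table in result.get("tables", []):
    rows, cols = table["row_count"], table["col_count"]
    print(f"Table {table['table_index']}: {rows}x{cols}, bbox={table['bbox']}")
```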
| Parameter | Type | Default | Description |
|---|---|---|---|
| `pdf_path` | str | required | Path to PDF file |
| `verbose` | bool | False | Enable verbose logging |
| `chunk_size` | int | 10 | Pages per chunk |
| `min_quality` | float | 0.0 | Min code quality (0-10) |
| `extract_images` | bool | False | Extract images to files |
| `image_dir` | str | None | Image output directory |
| `min_image_size` | int | 100 | Min image dimension |
| `use_ocr` | bool | False | Enable OCR |
| `password` | str | None | PDF password |
| `extract_tables` | bool | False | Extract tables |
| `parallel` | bool | False | Parallel processing |
| `max_workers` | int | CPU count | Worker threads |
| `use_cache` | bool | True | Enable caching |
- ✅ 6 Advanced Features implemented (Priority 2 & 3)
- ✅ 3x Performance Boost with parallel processing
- ✅ OCR Support for scanned PDFs
- ✅ Password Protection support
- ✅ Table Extraction from complex PDFs
- ✅ Intelligent Caching for faster re-runs
The PDF extractor now handles virtually any PDF scenario with maximum performance!