Status: ✅ Completed
Date: October 21, 2025
Tasks: B1.6 - Create pdf_scraper.py CLI tool, B1.8 - PDF config format
The PDF scraper (pdf_scraper.py) is a complete CLI tool that converts PDF documentation into Claude AI skills. It integrates all PDF extraction features (B1.1-B1.5) with the Skill Seeker workflow to produce packaged, uploadable skills.
- pdf_extractor_poc.py for extraction
- package_skill.py for packaging
# Create config file
cat > configs/my_manual.json <<EOF
{
  "name": "mymanual",
  "description": "My Manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150
  },
  "categories": {
    "getting_started": ["introduction", "setup"],
    "api": ["api", "reference", "function"],
    "tutorial": ["tutorial", "example", "guide"]
  }
}
EOF
# Run scraper
python3 cli/pdf_scraper.py --config configs/my_manual.json
Output:
🔍 Extracting from PDF: docs/manual.pdf
📄 Extracting from: docs/manual.pdf
Pages: 150
...
✅ Extraction complete
💾 Saved extracted data to: output/mymanual_extracted.json
🏗️ Building skill: mymanual
📋 Categorizing content...
✅ Created 3 categories
- Getting Started: 25 pages
- Api: 80 pages
- Tutorial: 45 pages
📝 Generating reference files...
Generated: output/mymanual/references/getting_started.md
Generated: output/mymanual/references/api.md
Generated: output/mymanual/references/tutorial.md
Generated: output/mymanual/references/index.md
Generated: output/mymanual/SKILL.md
✅ Skill built successfully: output/mymanual/
📦 Next step: Package with: python3 cli/package_skill.py output/mymanual/
# Quick conversion without config file
python3 cli/pdf_scraper.py --pdf manual.pdf --name mymanual --description "My Manual Docs"
Uses the default settings: chunk_size 10, min_quality 5.0, extract_images true, min_image_size 100.
# Step 1: Extract only (saves JSON)
python3 cli/pdf_extractor_poc.py manual.pdf -o manual_extracted.json --extract-images
# Step 2: Build skill from JSON (fast, can iterate)
python3 cli/pdf_scraper.py --from-json manual_extracted.json
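The two-step flow lends itself to a small wrapper that skips the slow extraction step when the JSON file already exists. The following is a minimal sketch, assuming only the CLI flags shown above; `build_commands` is a hypothetical helper and not part of the tool.

```python
from pathlib import Path


def build_commands(pdf_path: str, json_path: str, force: bool = False) -> list[list[str]]:
    """Return the CLI invocations needed to go from PDF to skill.

    Hypothetical helper: extraction is skipped when the JSON file
    already exists, so repeated builds only pay for step 2.
    """
    commands = []
    if force or not Path(json_path).exists():
        # Step 1: extract only (the slow part), flags as documented above
        commands.append([
            "python3", "cli/pdf_extractor_poc.py", pdf_path,
            "-o", json_path, "--extract-images",
        ])
    # Step 2: build the skill from the saved JSON
    commands.append(["python3", "cli/pdf_scraper.py", "--from-json", json_path])
    return commands
```

Each returned command list can be passed to `subprocess.run(cmd, check=True)` in order.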
Benefits:
{
  "name": "godot_manual",
  "description": "Godot Engine documentation from PDF manual",
  "pdf_path": "docs/godot_manual.pdf",
  "extract_options": {
    "chunk_size": 15,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 200
  },
  "categories": {
    "getting_started": [
      "introduction",
      "getting started",
      "installation",
      "first steps"
    ],
    "scripting": [
      "gdscript",
      "scripting",
      "code",
      "programming"
    ],
    "3d": [
      "3d",
      "spatial",
      "mesh",
      "shader"
    ],
    "2d": [
      "2d",
      "sprite",
      "tilemap",
      "animation"
    ],
    "api": [
      "api",
      "class reference",
      "method",
      "property"
    ]
  }
}
- name (string): Skill identifier
  - Example: "python_guide"
- pdf_path (string): Path to PDF file
  - Example: "docs/manual.pdf"
- description (string): Skill description
  - Default: "Documentation skill for {name}"
- extract_options (object): Extraction settings
  - chunk_size (number): Pages per chunk (default: 10)
  - min_quality (number): Minimum code quality 0-10 (default: 5.0)
  - extract_images (boolean): Extract images to files (default: true)
  - min_image_size (number): Minimum image dimension in pixels (default: 100)
- categories (object): Keyword-based categorization
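To make the defaults concrete, here is a rough sketch of how a config could be normalized before use. It assumes `name` and `pdf_path` are the only required fields (every example config includes both); `normalize_config` and `load_config` are hypothetical names, and the real loader in pdf_scraper.py may behave differently.

```python
import json

# Defaults copied from the field descriptions above
DEFAULT_EXTRACT_OPTIONS = {
    "chunk_size": 10,
    "min_quality": 5.0,
    "extract_images": True,
    "min_image_size": 100,
}


def normalize_config(raw: dict) -> dict:
    """Check required fields and fill in the documented defaults."""
    for field in ("name", "pdf_path"):  # assumed to be required
        if not raw.get(field):
            raise ValueError(f"missing required field: {field}")
    config = dict(raw)
    config.setdefault("description", f"Documentation skill for {raw['name']}")
    # User-supplied options override the defaults key by key
    config["extract_options"] = {
        **DEFAULT_EXTRACT_OPTIONS,
        **raw.get("extract_options", {}),
    }
    config.setdefault("categories", {})
    return config


def load_config(path: str) -> dict:
    """Read a JSON config file and normalize it."""
    with open(path, encoding="utf-8") as fh:
        return normalize_config(json.load(fh))
```

Merging key by key means a config that sets only `min_quality` still receives the default `chunk_size` and image settings.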
output/
├── mymanual_extracted.json          # Raw extraction data (B1.5 format)
└── mymanual/                        # Skill directory
    ├── SKILL.md                     # Main skill file
    ├── references/                  # Reference documentation
    │   ├── index.md                 # Category index
    │   ├── getting_started.md       # Category 1
    │   ├── api.md                   # Category 2
    │   └── tutorial.md              # Category 3
    ├── scripts/                     # Empty (for user scripts)
    └── assets/                      # Assets directory
        └── images/                  # Extracted images (if enabled)
            ├── mymanual_page5_img1.png
            └── mymanual_page12_img2.jpeg
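The two image filenames in the tree suggest a `{name}_page{N}_img{M}.{ext}` naming scheme. That pattern is inferred from those examples only, so treat it as an assumption. Under that assumption, a short sketch for grouping extracted images by source page could look like this (`images_by_page` is a hypothetical helper):

```python
import re
from collections import defaultdict

# Pattern inferred from the example filenames above (an assumption)
IMAGE_NAME = re.compile(
    r"^(?P<skill>.+)_page(?P<page>\d+)_img(?P<index>\d+)\.(?P<ext>\w+)$"
)


def images_by_page(filenames):
    """Group image filenames by the page number embedded in their names."""
    pages = defaultdict(list)
    for filename in filenames:
        match = IMAGE_NAME.match(filename)
        if match:  # files that do not follow the pattern are skipped
            pages[int(match.group("page"))].append(filename)
    return dict(pages)
```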
# Mymanual Documentation Skill
My Manual documentation
## When to use this skill
Use this skill when the user asks about mymanual documentation,
including API references, tutorials, examples, and best practices.
## What's included
This skill contains:
- **Getting Started**: 25 pages
- **Api**: 80 pages
- **Tutorial**: 45 pages
## Quick Reference
### Top Code Examples
**Example 1** (Quality: 8.5/10):
python
def initialize_system():
    config = load_config()
    setup_logging(config)
    return System(config)
**Example 2** (Quality: 8.2/10):
javascript
const app = createApp({
  data() {
    return { count: 0 }
  }
})
## Navigation
See `references/index.md` for complete documentation structure.
## Languages Covered
- python: 45 examples
- javascript: 32 examples
- shell: 8 examples
Each category gets its own reference file:
# Getting Started
## Installation
This guide will walk you through installing the software...
### Code Examples
bash
curl -O https://example.com/install.sh
bash install.sh
---
## Configuration
After installation, configure your environment...
### Code Examples
yaml
server:
  port: 8080
  host: localhost
---
If the PDF has detectable chapters (from B1.3):
Advantages:
Example chapters:
- chapter_1_introduction.md
- part_2_advanced_topics.md
If categories config is provided:
Advantages:
Scoring:
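The exact scoring rules are not reproduced here. As a rough illustration of keyword-based categorization, the sketch below assumes one point per case-insensitive keyword occurrence, with ties going to the category listed first. The `categorize` function and the `"other"` fallback are hypothetical and may not match the scraper's actual logic.

```python
def categorize(text: str, categories: dict[str, list[str]], fallback: str = "other") -> str:
    """Pick the category whose keywords occur most often in the text.

    Assumed scoring: one point per case-insensitive keyword occurrence;
    ties go to the category that appears first in the config.
    """
    lowered = text.lower()
    best_name, best_score = fallback, 0
    for name, keywords in categories.items():
        score = sum(lowered.count(keyword.lower()) for keyword in keywords)
        if score > best_score:  # strict comparison keeps the earlier category on ties
            best_name, best_score = name, score
    return best_name
```

Substring counting is deliberately simple; it also matches keywords inside longer words, which a production scorer would likely want to avoid.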
# 1. Create PDF config
cat > configs/api_manual.json <<EOF
{
  "name": "api_manual",
  "pdf_path": "docs/api.pdf",
  "extract_options": {
    "min_quality": 7.0,
    "extract_images": true
  }
}
EOF
# 2. Run PDF scraper
python3 cli/pdf_scraper.py --config configs/api_manual.json
# 3. Package skill
python3 cli/package_skill.py output/api_manual/
# 4. Upload to Claude (if ANTHROPIC_API_KEY set)
python3 cli/package_skill.py output/api_manual/ --upload
# Result: api_manual.zip ready for Claude!
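The same workflow can be driven from Python. This is a minimal sketch that mirrors the shell steps above and adds the upload step only when `ANTHROPIC_API_KEY` is set; `workflow_commands` is a hypothetical helper, and only the CLI flags shown above are used.

```python
import os


def workflow_commands(config_path: str, skill_dir: str, env=None) -> list[list[str]]:
    """List the CLI steps of the workflow above, in order.

    The upload step is appended only when ANTHROPIC_API_KEY is set.
    """
    env = os.environ if env is None else env
    commands = [
        ["python3", "cli/pdf_scraper.py", "--config", config_path],  # step 2
        ["python3", "cli/package_skill.py", skill_dir],              # step 3
    ]
    if env.get("ANTHROPIC_API_KEY"):
        # Step 4: upload, as in the shell example
        commands.append(["python3", "cli/package_skill.py", skill_dir, "--upload"])
    return commands
```

Passing `env` explicitly keeps the function easy to test; in normal use it falls back to the process environment.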
# After building, enhance with AI
python3 cli/enhance_skill_local.py output/api_manual/
# Or with API
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/enhance_skill.py output/api_manual/
| PDF Size | Pages | Extraction | Building | Total |
|---|---|---|---|---|
| Small | 50 | 30s | 5s | 35s |
| Medium | 200 | 2m | 15s | 2m 15s |
| Large | 500 | 5m | 45s | 5m 45s |
Extraction: PDF → JSON (CPU-intensive)
Building: JSON → Skill (fast, I/O-bound)
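The benchmark rows work out to about 0.6 seconds per page for extraction and roughly 0.075 to 0.1 seconds per page for building. As a back-of-the-envelope sketch, assuming cost scales linearly with page count (which the three data points suggest but do not prove), total time for other PDF sizes could be estimated like this:

```python
# (pages, extraction seconds, building seconds) from the table above
BENCHMARKS = [(50, 30, 5), (200, 120, 15), (500, 300, 45)]


def estimate_seconds(pages: int) -> float:
    """Linearly extrapolate total time from the average per-page cost."""
    extract_rate = sum(e / p for p, e, _ in BENCHMARKS) / len(BENCHMARKS)
    build_rate = sum(b / p for p, _, b in BENCHMARKS) / len(BENCHMARKS)
    return pages * (extract_rate + build_rate)
```

Actual timings will depend on hardware, image count, and PDF complexity, so treat the result as an order-of-magnitude guide only.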
Use --from-json for iteration
Adjust chunk size
Filter aggressively
- min_quality: Fewer low-quality code blocks
- min_image_size: Fewer small images
{
  "name": "python_reference",
  "description": "Python 3.12 Language Reference",
  "pdf_path": "python-3.12-reference.pdf",
  "extract_options": {
    "chunk_size": 20,
    "min_quality": 7.0,
    "extract_images": false
  },
  "categories": {
    "basics": ["introduction", "basic", "syntax", "types"],
    "functions": ["function", "lambda", "decorator"],
    "classes": ["class", "object", "inheritance"],
    "modules": ["module", "package", "import"],
    "stdlib": ["library", "standard library", "built-in"]
  }
}
{
  "name": "rest_api_docs",
  "description": "REST API Documentation",
  "pdf_path": "api_docs.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 200
  },
  "categories": {
    "authentication": ["auth", "login", "token", "oauth"],
    "users": ["user", "account", "profile"],
    "products": ["product", "catalog", "inventory"],
    "orders": ["order", "purchase", "checkout"],
    "webhooks": ["webhook", "event", "callback"]
  }
}
{
  "name": "django_docs",
  "description": "Django Web Framework Documentation",
  "pdf_path": "django-4.2-docs.pdf",
  "extract_options": {
    "chunk_size": 15,
    "min_quality": 6.5,
    "extract_images": true
  }
}
Note: No categories are defined, so the scraper uses chapter-based categorization.
Problem: Only "content" or "other" category
Possible causes:
Solution:
# Check extracted chapters
jq '.chapters' output/mymanual_extracted.json
# If empty, add keyword categories to config
# Or let it create single "content" category (OK for small PDFs)
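The same check can be scripted. This sketch assumes the top-level `chapters` key queried by the jq command holds a list; `chapter_count` is a hypothetical helper, and no other part of the extracted JSON format is assumed.

```python
import json


def chapter_count(extracted_json_path: str) -> int:
    """Return how many chapters the extraction step detected.

    A missing or null "chapters" key counts as zero chapters.
    """
    with open(extracted_json_path, encoding="utf-8") as fh:
        data = json.load(fh)
    return len(data.get("chapters") or [])
```

A result of zero indicates that keyword categories in the config are probably needed.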
Problem: Too many poor code examples
Solution: Increase the min_quality threshold:
{
  "extract_options": {
    "min_quality": 7.0
  }
}
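Conceptually, the threshold acts as a simple filter over scored code samples. The sketch below illustrates the idea; the `quality` key is a hypothetical field name for the 0-10 score, and the real extractor may store scores differently.

```python
def filter_code_samples(samples: list[dict], min_quality: float) -> list[dict]:
    """Keep only samples whose quality score meets the threshold.

    "quality" is an assumed field name; samples without a score are dropped.
    """
    return [s for s in samples if s.get("quality", 0.0) >= min_quality]
```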
Problem: No images in assets/images/
Solution: Enable image extraction and lower the size threshold:
{
  "extract_options": {
    "extract_images": true,
    "min_image_size": 50
  }
}
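To see why lowering the threshold recovers images, consider a size check along these lines. It assumes both width and height must reach `min_image_size`; the tool may instead compare only one side, so `keep_image` is an illustrative guess rather than the actual rule.

```python
def keep_image(width: int, height: int, min_image_size: int = 100) -> bool:
    """Return True when the image is large enough to be saved.

    Assumption: the smaller of the two dimensions must reach the threshold.
    """
    return min(width, height) >= min_image_size
```

Under this rule, an 80x300 pixel diagram is dropped at the default of 100 but kept once the threshold is lowered to 50.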
| Feature | Web Scraper | PDF Scraper |
|---|---|---|
| Input | HTML websites | PDF files |
| Crawling | Multi-page BFS | Single-file extraction |
| Structure detection | CSS selectors | Font/heading analysis |
| Categorization | URL patterns | Chapters/keywords |
| Images | Referenced | Embedded (extracted) |
| Code detection | `<pre><code>` | Font/indent/pattern |
| Language detection | CSS classes | Pattern matching |
| Quality scoring | No | Yes (B1.4) |
| Chunking | No | Yes (B1.3) |
The PDF scraper will be available through MCP:
# Future: MCP tool
result = mcp.scrape_pdf(
    config_path="configs/manual.json"
)
# Or direct
result = mcp.scrape_pdf(
    pdf_path="manual.pdf",
    name="mymanual",
    extract_images=True
)
Tasks B1.6 and B1.8 successfully implement:
B1.6 - PDF Scraper CLI:
B1.8 - PDF Config Format:
Impact:
Ready for B1.7: MCP tool integration
Tasks Completed: October 21, 2025
Next Task: B1.7 - Add MCP tool scrape_pdf