Version: 2.0 (Feature complete as of October 2025)
Unified multi-source scraping allows you to combine knowledge from multiple sources into a single comprehensive Claude skill. Instead of choosing between documentation, GitHub repositories, or PDF manuals, you can now extract and intelligently merge information from all of them.
The Problem: Documentation and code often drift apart over time. Official docs might be outdated, missing features that exist in code, or documenting features that have been removed. Separately scraping docs and code creates two incomplete skills.
The Solution: Unified scraping pulls from every source at once and reconciles the results into one skill.
Create a config file with multiple sources:
```json
{
  "name": "react",
  "description": "Complete React knowledge from docs + codebase",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface",
      "max_issues": 100
    }
  ]
}
```
```bash
python3 cli/unified_scraper.py --config configs/react_unified.json
```
The tool will scrape each source, detect conflicts, merge the results, and build the skill. Then package the result:

```bash
python3 cli/package_skill.py output/react/
```
```json
{
  "name": "skill-name",
  "description": "When to use this skill",
  "merge_mode": "rule-based|claude-enhanced",
  "sources": [
    {
      "type": "documentation|github|pdf",
      ...source-specific fields...
    }
  ]
}
```
```json
{
  "type": "documentation",
  "base_url": "https://docs.example.com/",
  "extract_api": true,
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": [],
    "exclude": ["/blog/"]
  },
  "categories": {
    "getting_started": ["intro", "tutorial"],
    "api": ["api", "reference"]
  },
  "rate_limit": 0.5,
  "max_pages": 200
}
```
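As an illustration of how the `url_patterns` block might be applied, here is a minimal sketch. The semantics are assumed (an empty `include` list allows everything; any `exclude` match rejects the URL) and `url_allowed` is a hypothetical helper, not the scraper's actual code:

```python
# Hypothetical URL filter driven by the url_patterns config above.
# Assumed semantics: empty include = allow all; any exclude match rejects.
def url_allowed(url: str, include: list, exclude: list) -> bool:
    if any(pat in url for pat in exclude):
        return False
    return not include or any(pat in url for pat in include)

patterns = {"include": [], "exclude": ["/blog/"]}
print(url_allowed("https://docs.example.com/api/ref", **patterns))    # True
print(url_allowed("https://docs.example.com/blog/post", **patterns))  # False
```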
```json
{
  "type": "github",
  "repo": "owner/repo",
  "github_token": "ghp_...",
  "include_issues": true,
  "max_issues": 100,
  "include_changelog": true,
  "include_releases": true,
  "include_code": true,
  "code_analysis_depth": "surface|deep|full",
  "file_patterns": [
    "src/**/*.js",
    "lib/**/*.ts"
  ]
}
```
Code Analysis Depth:
- `surface` (default): Basic structure, no code analysis
- `deep`: Extract class/function signatures, parameters, return types
- `full`: Complete AST analysis (expensive)

```json
{
  "type": "pdf",
  "path": "/path/to/manual.pdf",
  "extract_tables": false,
  "ocr": false,
  "password": "optional-password"
}
```
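To make the `deep` code-analysis level concrete, here is a minimal sketch of signature extraction with Python's `ast` module. This is illustrative only, not the scraper's actual implementation:

```python
import ast

# Extract function names, parameters, and return annotations from source code,
# roughly the information "deep" analysis collects.
def extract_signatures(code: str) -> dict:
    sigs = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            sigs[node.name] = {
                "params": [a.arg for a in node.args.args],
                "returns": ast.unparse(node.returns) if node.returns else None,
            }
    return sigs

source = '''
def move_local_x(self, delta: float, snap: bool = False) -> None:
    """Move node along local X axis."""
'''
print(extract_signatures(source))
# {'move_local_x': {'params': ['self', 'delta', 'snap'], 'returns': 'None'}}
```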
The unified scraper automatically detects 4 types of conflicts:
### 1. Missing in Documentation

**Severity**: Medium
**Description**: API exists in code but is not documented
Example:
```python
# Code has this method:
def move_local_x(self, delta: float, snap: bool = False) -> None:
    """Move node along local X axis"""

# But documentation doesn't mention it
```
Suggestion: Add documentation for this API
### 2. Missing in Code

**Severity**: High
**Description**: API is documented but not found in codebase
Example:
```python
# Docs say:
def rotate(angle: float) -> None: ...

# But code doesn't have this function
```
Suggestion: Update documentation to remove this API, or add it to codebase
### 3. Signature Mismatch

**Severity**: Medium-High
**Description**: API exists in both but signatures differ
Example:
```python
# Docs say:
def move_local_x(delta: float): ...

# Code has:
def move_local_x(delta: float, snap: bool = False): ...
```
Suggestion: Update documentation to match actual signature
### 4. Description Mismatch

**Severity**: Low
**Description**: Different descriptions/docstrings between sources
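The conflict checks described above can be sketched as a comparison over two API maps. The helper name, data shapes, and severity strings here are illustrative, not the scraper's internals:

```python
# Compare docs-derived and code-derived API maps and classify conflicts.
# Severity labels follow the conflict types described above.
def detect_conflicts(docs_apis: dict, code_apis: dict) -> list:
    conflicts = []
    for name in code_apis.keys() - docs_apis.keys():
        conflicts.append({"api": name, "type": "missing_in_docs", "severity": "medium"})
    for name in docs_apis.keys() - code_apis.keys():
        conflicts.append({"api": name, "type": "missing_in_code", "severity": "high"})
    for name in docs_apis.keys() & code_apis.keys():
        if docs_apis[name] != code_apis[name]:
            conflicts.append({"api": name, "type": "signature_mismatch",
                              "severity": "medium-high"})
    return conflicts

docs = {"rotate": "rotate(angle: float)",
        "move_local_x": "move_local_x(delta: float)"}
code = {"move_local_x": "move_local_x(delta: float, snap: bool = False)"}
print(sorted(c["type"] for c in detect_conflicts(docs, code)))
# ['missing_in_code', 'signature_mismatch']
```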
### Rule-Based Merging

Fast, deterministic merging using predefined rules:

- APIs found only in documentation are tagged `[DOCS_ONLY]`
- APIs found only in code are tagged `[UNDOCUMENTED]`

When to use: the default; fast, reproducible, and works well for most cases.
Example:
```bash
python3 cli/unified_scraper.py --config config.json --merge-mode rule-based
```
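A minimal sketch of this tagging behavior (a hypothetical helper inferred from the tags mentioned above; the real merger's internals are not shown):

```python
# Tag each API by where it was found: [DOCS_ONLY], [UNDOCUMENTED], or verified.
def merge_rule_based(docs_apis: set, code_apis: set) -> dict:
    merged = {}
    for name in docs_apis | code_apis:
        if name not in code_apis:
            merged[name] = "[DOCS_ONLY]"
        elif name not in docs_apis:
            merged[name] = "[UNDOCUMENTED]"
        else:
            merged[name] = "verified"
    return merged

tags = merge_rule_based({"rotate", "useState"}, {"useState", "move_local_x"})
print(tags["rotate"], tags["move_local_x"], tags["useState"])
# [DOCS_ONLY] [UNDOCUMENTED] verified
```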
### Claude-Enhanced Merging

AI-powered reconciliation using local Claude Code:

When to use: when conflicts need careful, human-quality reconciliation.
Example:
```bash
python3 cli/unified_scraper.py --config config.json --merge-mode claude-enhanced
```
The unified scraper creates this structure:
```
output/skill-name/
├── SKILL.md                  # Main skill file with merged APIs
├── references/
│   ├── documentation/        # Documentation references
│   │   └── index.md
│   ├── github/               # GitHub references
│   │   ├── README.md
│   │   ├── issues.md
│   │   └── releases.md
│   ├── pdf/                  # PDF references (if applicable)
│   │   └── index.md
│   ├── api/                  # Merged API reference
│   │   └── merged_api.md
│   └── conflicts.md          # Detailed conflict report
├── scripts/                  # Empty (for user scripts)
└── assets/                   # Empty (for user assets)
```
Example of a generated `SKILL.md`:

```markdown
# React

Complete React knowledge base combining official documentation and React codebase insights.

## 📚 Sources

This skill combines knowledge from multiple sources:

- ✅ **Documentation**: https://react.dev/
  - Pages: 200
- ✅ **GitHub Repository**: facebook/react
  - Code Analysis: surface
  - Issues: 100

## ⚠️ Data Quality

**5 conflicts detected** between sources.

**Conflict Breakdown:**
- missing_in_docs: 3
- missing_in_code: 2

See `references/conflicts.md` for detailed conflict information.

## 🔧 API Reference

*Merged from documentation and code analysis*

### ✅ Verified APIs

*Documentation and code agree*

#### `useState(initialValue)`
...

### ⚠️ APIs with Conflicts

*Documentation and code differ*

#### `useEffect(callback, deps?)`

⚠️ **Conflict**: Documentation signature differs from code implementation

**Documentation says:**
useEffect(callback: () => void, deps: any[])

**Code implementation:**
useEffect(callback: () => void | (() => void), deps?: readonly any[])

*Source: both*
```
---
React (documentation + GitHub, rule-based merge):

```json
{
  "name": "react",
  "description": "Complete React framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://react.dev/",
      "extract_api": true,
      "max_pages": 200
    },
    {
      "type": "github",
      "repo": "facebook/react",
      "include_code": true,
      "code_analysis_depth": "surface"
    }
  ]
}
```
Django (deep code analysis with targeted file patterns):

```json
{
  "name": "django",
  "description": "Complete Django framework knowledge",
  "merge_mode": "rule-based",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.djangoproject.com/en/stable/",
      "extract_api": true,
      "max_pages": 300
    },
    {
      "type": "github",
      "repo": "django/django",
      "include_code": true,
      "code_analysis_depth": "deep",
      "file_patterns": [
        "django/db/**/*.py",
        "django/views/**/*.py"
      ]
    }
  ]
}
```
Godot (three sources, Claude-enhanced merge):

```json
{
  "name": "godot",
  "description": "Complete Godot Engine knowledge",
  "merge_mode": "claude-enhanced",
  "sources": [
    {
      "type": "documentation",
      "base_url": "https://docs.godotengine.org/en/stable/",
      "extract_api": true,
      "max_pages": 500
    },
    {
      "type": "github",
      "repo": "godotengine/godot",
      "include_code": true,
      "code_analysis_depth": "deep"
    },
    {
      "type": "pdf",
      "path": "/path/to/godot_manual.pdf",
      "extract_tables": true
    }
  ]
}
```
```bash
# Basic usage
python3 cli/unified_scraper.py --config configs/react_unified.json

# Override merge mode
python3 cli/unified_scraper.py --config configs/react_unified.json --merge-mode claude-enhanced

# Use cached data (skip re-scraping)
python3 cli/unified_scraper.py --config configs/react_unified.json --skip-scrape
```
Validate a config:

```bash
python3 -c "
import sys
sys.path.insert(0, 'cli')
from config_validator import validate_config
validator = validate_config('configs/react_unified.json')
print(f'Format: {\"Unified\" if validator.is_unified else \"Legacy\"}')
print(f'Sources: {len(validator.config.get(\"sources\", []))}')
print(f'Needs API merge: {validator.needs_api_merge()}')
"
```
The unified scraper is fully integrated with MCP. The `scrape_docs` tool automatically detects unified vs legacy configs and routes to the appropriate scraper.
```json
// MCP tool usage
{
  "name": "scrape_docs",
  "arguments": {
    "config_path": "configs/react_unified.json",
    "merge_mode": "rule-based"  // Optional override
  }
}
```
The tool will detect the config format and route unified configs to `unified_scraper.py`.

Legacy configs still work! The system automatically detects legacy single-source configs and routes them to the original `doc_scraper.py`.
```json
// Legacy config (still works)
{
  "name": "react",
  "base_url": "https://react.dev/",
  ...
}
// Automatically detected as legacy format
// Routes to doc_scraper.py
```
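The detection heuristic could be as simple as checking for a top-level `sources` list. This is an assumption for illustration; the real check lives in `config_validator.py`:

```python
# Guess the config format: a top-level "sources" list marks the unified format.
# (Assumed heuristic, illustrating the routing described above.)
def detect_format(config: dict) -> str:
    return "unified" if isinstance(config.get("sources"), list) else "legacy"

print(detect_format({"name": "react", "sources": [{"type": "github"}]}))   # unified
print(detect_format({"name": "react", "base_url": "https://react.dev/"}))  # legacy
```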
Run integration tests:
```bash
python3 cli/test_unified_simple.py
```
Tests validate:
```
Unified Config
        ↓
ConfigValidator (validates format)
        ↓
UnifiedScraper.run()
        ↓
┌────────────────────────────────────┐
│ Phase 1: Scrape All Sources        │
│  - Documentation → doc_scraper     │
│  - GitHub → github_scraper         │
│  - PDF → pdf_scraper               │
└────────────────────────────────────┘
        ↓
┌────────────────────────────────────┐
│ Phase 2: Detect Conflicts          │
│  - ConflictDetector                │
│  - Compare docs APIs vs code APIs  │
│  - Classify by type and severity   │
└────────────────────────────────────┘
        ↓
┌────────────────────────────────────┐
│ Phase 3: Merge Sources             │
│  - RuleBasedMerger (fast)          │
│  - OR ClaudeEnhancedMerger (AI)    │
│  - Create unified API reference    │
└────────────────────────────────────┘
        ↓
┌────────────────────────────────────┐
│ Phase 4: Build Skill               │
│  - UnifiedSkillBuilder             │
│  - Generate SKILL.md with conflicts│
│  - Create reference structure      │
│  - Generate conflicts report       │
└────────────────────────────────────┘
        ↓
Unified Skill (.zip ready)
```
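The four phases above could be orchestrated roughly like this. Every helper here is a trivial stand-in stub, not the real module:

```python
# Stand-in orchestration mirroring the four phases in the diagram.
def scrape(src):                  # Phase 1: one scraper per source type
    return {"apis": {src["type"]}}

def find_conflicts(results):      # Phase 2: compare APIs across sources
    sources = list(results)
    return [] if len(sources) < 2 else [f"compare {sources[0]} vs {sources[1]}"]

def merge(results, conflicts, mode):  # Phase 3: rule-based or claude-enhanced
    return {"mode": mode,
            "apis": set().union(*(r["apis"] for r in results.values()))}

def run_unified(config):          # Phase 4: assemble the skill
    results = {s["type"]: scrape(s) for s in config["sources"]}
    conflicts = find_conflicts(results)
    return {"skill": config["name"],
            "merged": merge(results, conflicts, config["merge_mode"]),
            "conflicts": conflicts}

out = run_unified({"name": "react", "merge_mode": "rule-based",
                   "sources": [{"type": "documentation"}, {"type": "github"}]})
print(out["skill"], len(out["conflicts"]))  # react 1
```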
Rule-based merging is fast and works well for most cases. Only use Claude-enhanced if you need human-level oversight of conflict resolution.

`code_analysis_depth: "surface"` is usually sufficient. Deep analysis is expensive and rarely needed.

`max_issues: 100` is a good default. More than 200 issues rarely adds value.
```json
"file_patterns": [
  "src/**/*.js",   // Good: specific paths
  "lib/**/*.ts"
]

// Not recommended:
"file_patterns": ["**/*.js"]   // Too broad, slow
```
Always review `references/conflicts.md` to understand discrepancies between sources.
Possible causes:
- `extract_api: false` in the documentation source
- `include_code: false` in the GitHub source
- `code_analysis_depth` set too shallow

Solution: Ensure both sources have API extraction enabled.
Possible causes:
Solution: Review conflicts manually and adjust merge strategy
Possible causes:
- `code_analysis_depth: "full"` (very slow)

Solution:

- Use `"surface"` or `"deep"` analysis
- Adjust `rate_limit`

Planned features:
For issues, questions, or suggestions:
v2.0 (October 2025): Unified multi-source scraping feature complete