# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

This is a Python-based documentation scraper that converts ANY documentation website into a Claude skill. It's a single-file tool (`doc_scraper.py`) that scrapes documentation, extracts code patterns, detects programming languages, and generates structured skill files ready for use with Claude.

## Dependencies

```bash
pip3 install requests beautifulsoup4
```

## Core Commands

### Run with a preset configuration

```bash
python3 cli/doc_scraper.py --config configs/godot.json
python3 cli/doc_scraper.py --config configs/react.json
python3 cli/doc_scraper.py --config configs/vue.json
python3 cli/doc_scraper.py --config configs/django.json
python3 cli/doc_scraper.py --config configs/fastapi.json
```

### Interactive mode (for new frameworks)

```bash
python3 cli/doc_scraper.py --interactive
```

### Quick mode (minimal config)

```bash
python3 cli/doc_scraper.py --name react --url https://react.dev/ --description "React framework"
```

### Skip scraping (use cached data)

```bash
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
```

### Resume interrupted scrapes

```bash
# If scrape was interrupted
python3 cli/doc_scraper.py --config configs/godot.json --resume

# Start fresh (clear checkpoint)
python3 cli/doc_scraper.py --config configs/godot.json --fresh
```

### Large documentation (10K-40K+ pages)

```bash
# 1. Estimate page count
python3 cli/estimate_pages.py configs/godot.json

# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router

# 3. Generate router skill
python3 cli/generate_router.py configs/godot-*.json

# 4. Package multiple skills
python3 cli/package_multi.py output/godot*/
```

### AI-powered SKILL.md enhancement

```bash
# Option 1: During scraping (API-based, requires ANTHROPIC_API_KEY)
pip3 install anthropic
export ANTHROPIC_API_KEY=sk-ant-...
python3 cli/doc_scraper.py --config configs/react.json --enhance

# Option 2: During scraping (LOCAL, no API key - uses Claude Code Max)
python3 cli/doc_scraper.py --config configs/react.json --enhance-local

# Option 3: Standalone after scraping (API-based)
python3 cli/enhance_skill.py output/react/

# Option 4: Standalone after scraping (LOCAL, no API key)
python3 cli/enhance_skill_local.py output/react/
```

The LOCAL enhancement option (`--enhance-local` or `enhance_skill_local.py`) opens a new terminal with Claude Code, which analyzes the reference files and enhances SKILL.md automatically. This requires a Claude Code Max plan but no API key.

### MCP Integration (Claude Code)

```bash
# One-time setup
./setup_mcp.sh

# Then in Claude Code, use natural language:
"List all available configs"
"Generate config for Tailwind at https://tailwindcss.com/docs"
"Split configs/godot.json using router strategy"
"Generate router for configs/godot-*.json"
"Package skill at output/react/"
```

9 MCP tools available: list_configs, generate_config, validate_config, estimate_pages, scrape_docs, package_skill, upload_skill, split_config, generate_router

### Test with limited pages (edit config first)

Set `"max_pages": 20` in the config file to test with fewer pages.

## Architecture

### Single-File Design

The entire tool is contained in `doc_scraper.py` (~737 lines). It follows a class-based architecture with a single `DocToSkillConverter` class that handles:

- **Web scraping**: BFS traversal with URL validation
- **Content extraction**: CSS selectors for title, content, code blocks
- **Language detection**: Heuristic-based detection from code samples (Python, JavaScript, GDScript, C++, etc.)
- **Pattern extraction**: Identifies common coding patterns from documentation
- **Categorization**: Smart categorization using URL structure, page titles, and content keywords with scoring
- **Skill generation**: Creates SKILL.md with real code examples and categorized reference files

### Data Flow

1.
   **Scrape Phase**:
   - Input: Config JSON (name, base_url, selectors, url_patterns, categories, rate_limit, max_pages)
   - Process: BFS traversal starting from base_url, respecting include/exclude patterns
   - Output: `output/{name}_data/pages/*.json` + `summary.json`

2. **Build Phase**:
   - Input: Scraped JSON data from `output/{name}_data/`
   - Process: Load pages → Smart categorize → Extract patterns → Generate references
   - Output: `output/{name}/SKILL.md` + `output/{name}/references/*.md`

### Directory Structure

```
Skill_Seekers/
├── cli/                              # CLI tools
│   ├── doc_scraper.py                # Main scraping & building tool
│   ├── enhance_skill.py              # AI enhancement (API-based)
│   ├── enhance_skill_local.py        # AI enhancement (LOCAL, no API)
│   ├── estimate_pages.py             # Page count estimator
│   ├── split_config.py               # Large docs splitter (NEW)
│   ├── generate_router.py            # Router skill generator (NEW)
│   ├── package_skill.py              # Single skill packager
│   └── package_multi.py              # Multi-skill packager (NEW)
├── mcp/                              # MCP server
│   ├── server.py                     # 9 MCP tools (includes upload)
│   └── README.md
├── configs/                          # Preset configurations
│   ├── godot.json
│   ├── godot-large-example.json      # Large docs example (NEW)
│   ├── react.json
│   └── ...
├── docs/                             # Documentation
│   ├── CLAUDE.md                     # Technical architecture (this file)
│   ├── LARGE_DOCUMENTATION.md        # Large docs guide (NEW)
│   ├── ENHANCEMENT.md
│   ├── MCP_SETUP.md
│   └── ...
└── output/                           # Generated output (git-ignored)
    ├── {name}_data/                  # Raw scraped data (cached)
    │   ├── pages/                    # Individual page JSONs
    │   ├── summary.json              # Scraping summary
    │   └── checkpoint.json           # Resume checkpoint (NEW)
    └── {name}/                       # Generated skill
        ├── SKILL.md                  # Main skill file with examples
        ├── SKILL.md.backup           # Backup (if enhanced)
        ├── references/               # Categorized documentation
        │   ├── index.md
        │   ├── getting_started.md
        │   ├── api.md
        │   └── ...
        ├── scripts/                  # Empty (for user scripts)
        └── assets/                   # Empty (for user assets)
```

### Configuration Format

Config files in `configs/*.json` contain:

- `name`: Skill identifier (e.g., "godot", "react")
- `description`: When to use this skill
- `base_url`: Starting URL for scraping
- `selectors`: CSS selectors for content extraction
  - `main_content`: Main documentation content (e.g., "article", "div[role='main']")
  - `title`: Page title selector
  - `code_blocks`: Code sample selector (e.g., "pre code", "pre")
- `url_patterns`: URL filtering
  - `include`: Only scrape URLs containing these patterns
  - `exclude`: Skip URLs containing these patterns
- `categories`: Keyword-based categorization mapping
- `rate_limit`: Delay between requests (seconds)
- `max_pages`: Maximum pages to scrape
- `split_strategy`: (Optional) How to split large docs: "auto", "category", "router", "size"
- `split_config`: (Optional) Split configuration
  - `target_pages_per_skill`: Pages per sub-skill (default: 5000)
  - `create_router`: Create router/hub skill (default: true)
  - `split_by_categories`: Category names to split by
- `checkpoint`: (Optional) Checkpoint/resume configuration
  - `enabled`: Enable checkpointing (default: false)
  - `interval`: Save every N pages (default: 1000)

### Key Features

**Auto-detect existing data**: Tool checks for `output/{name}_data/` and prompts to reuse, avoiding re-scraping.

**Language detection**: Detects code languages from:

1. CSS class attributes (`language-*`, `lang-*`)
2. Heuristics (keywords like `def`, `const`, `func`, etc.)

**Pattern extraction**: Looks for "Example:", "Pattern:", "Usage:" markers in content and extracts following code blocks (up to 5 per page).
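
The two-step language detection described above (class attributes first, keyword heuristics second) could look roughly like the sketch below. This is an illustration only, not the actual `detect_language()` in `doc_scraper.py`: the function signature, the `KEYWORD_HINTS` table, the keyword-to-language mapping, and the `"unknown"` fallback are all assumptions.

```python
import re

# Hypothetical keyword hints; the real heuristics in doc_scraper.py may differ.
KEYWORD_HINTS = [
    ("python", re.compile(r"^\s*(def|import|from)\s", re.MULTILINE)),
    ("gdscript", re.compile(r"^\s*(func|extends)\s", re.MULTILINE)),
    ("javascript", re.compile(r"\b(const|let|function)\s")),
]

CLASS_PREFIXES = ("language-", "lang-")


def detect_language(css_classes, code):
    """Guess the language of a code block (illustrative sketch)."""
    # Step 1: trust an explicit class such as "language-python" or "lang-js".
    for cls in css_classes:
        for prefix in CLASS_PREFIXES:
            if cls.startswith(prefix) and len(cls) > len(prefix):
                return cls[len(prefix):]
    # Step 2: fall back to keyword heuristics; the first matching hint wins.
    for language, pattern in KEYWORD_HINTS:
        if pattern.search(code):
            return language
    return "unknown"  # assumed fallback label


print(detect_language(["language-python"], "x = 1"))
print(detect_language([], "func _ready():\n\tpass"))
```

Checking the class attribute first keeps the heuristics as a fallback, since a label supplied by the documentation site is usually more reliable than keyword matching.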

**Smart categorization**:

- Scores pages against category keywords (3 points for URL match, 2 for title, 1 for content)
- Threshold of 2+ for categorization
- Auto-infers categories from URL segments if none provided
- Falls back to "other" category

**Enhanced SKILL.md**: Generated with:

- Real code examples from documentation (language-annotated)
- Quick reference patterns extracted from docs
- Common pattern section
- Category file listings

**AI-Powered Enhancement**: Two scripts to dramatically improve SKILL.md quality:

- `enhance_skill.py`: Uses Anthropic API (~$0.15-$0.30 per skill, requires API key)
- `enhance_skill_local.py`: Uses Claude Code Max (free, no API key needed)
- Transforms generic 75-line templates into comprehensive 500+ line guides
- Extracts best examples, explains key concepts, adds navigation guidance
- Quality score: 9/10 (based on the steam-economy test)

**Large Documentation Support (NEW)**: Handle 10K-40K+ page documentation:

- `split_config.py`: Split large configs into multiple focused sub-skills
- `generate_router.py`: Create intelligent router/hub skills that direct queries
- `package_multi.py`: Package multiple skills at once
- 4 split strategies: auto, category, router, size
- Parallel scraping support for faster processing
- MCP integration for natural language usage

**Checkpoint/Resume (NEW)**: Never lose progress on long scrapes:

- Auto-saves every N pages (configurable, default: 1000)
- Resume with `--resume` flag
- Clear checkpoint with `--fresh` flag
- Saves on interruption (Ctrl+C)

## Key Code Locations

- **URL validation**: `is_valid_url()` doc_scraper.py:47-62
- **Content extraction**: `extract_content()` doc_scraper.py:64-131
- **Language detection**: `detect_language()` doc_scraper.py:133-163
- **Pattern extraction**: `extract_patterns()` doc_scraper.py:165-181
- **Smart categorization**: `smart_categorize()` doc_scraper.py:280-321
- **Category inference**: `infer_categories()` doc_scraper.py:323-349
- **Quick
reference generation**: `generate_quick_reference()` doc_scraper.py:351-370
- **SKILL.md generation**: `create_enhanced_skill_md()` doc_scraper.py:424-540
- **Scraping loop**: `scrape_all()` doc_scraper.py:226-249
- **Main workflow**: `main()` doc_scraper.py:661-733

## Workflow Examples

### First-time run (with scraping)

```bash
# 1. Scrape + Build
python3 cli/doc_scraper.py --config configs/godot.json
# Time: 20-40 minutes

# 2. Package
python3 cli/package_skill.py output/godot/
# Result: godot.zip
```

### Using cached data (fast iteration)

```bash
# 1. Use existing data
python3 cli/doc_scraper.py --config configs/godot.json --skip-scrape
# Time: 1-3 minutes

# 2. Package
python3 cli/package_skill.py output/godot/
```

### Creating a new framework config

```bash
# Option 1: Interactive
python3 cli/doc_scraper.py --interactive

# Option 2: Copy and modify
cp configs/react.json configs/myframework.json
# Edit configs/myframework.json
python3 cli/doc_scraper.py --config configs/myframework.json
```

### Large documentation workflow (40K pages)

```bash
# 1. Estimate page count (fast, 1-2 minutes)
python3 cli/estimate_pages.py configs/godot.json

# 2. Split into focused sub-skills
python3 cli/split_config.py configs/godot.json --strategy router --target-pages 5000
# Creates: godot-scripting.json, godot-2d.json, godot-3d.json, etc.

# 3. Scrape all in parallel (4-8 hours instead of 20-40!)
for config in configs/godot-*.json; do
  python3 cli/doc_scraper.py --config "$config" &
done
wait

# 4. Generate intelligent router skill
python3 cli/generate_router.py configs/godot-*.json

# 5. Package all skills
python3 cli/package_multi.py output/godot*/

# 6. Upload all .zip files to Claude
# Result: Router automatically directs queries to the right sub-skill!
```

**Time savings:** Parallel scraping reduces 20-40 hours to 4-8 hours

**See full guide:** [Large Documentation Guide](LARGE_DOCUMENTATION.md)

## Testing Selectors

To find the right CSS selectors for a documentation site:

```python
from bs4 import BeautifulSoup
import requests

url = "https://docs.example.com/page"
response = requests.get(url, timeout=10)
response.raise_for_status()  # Fail early on HTTP errors
soup = BeautifulSoup(response.content, 'html.parser')

# Try different selectors; select_one() returns None when nothing matches
print(soup.select_one('article'))
print(soup.select_one('main'))
print(soup.select_one('div[role="main"]'))
```

## Running Tests

**IMPORTANT: You must install the package before running tests**

```bash
# 1. Install package in editable mode (one-time setup)
pip install -e .

# 2. Run all tests
pytest

# 3. Run specific test files
pytest tests/test_config_validation.py
pytest tests/test_github_scraper.py

# 4. Run with verbose output
pytest -v

# 5. Run with coverage report
pytest --cov=src/skill_seekers --cov-report=html
```

**Why install first?**

- Tests import from `skill_seekers.cli`, which requires the package to be installed
- Modern Python packaging best practice (PEP 517/518)
- CI/CD automatically installs with `pip install -e .`
- `conftest.py` shows a helpful error if the package is not installed

**Test Coverage:**

- 391+ tests passing
- 39% code coverage
- All core features tested
- CI/CD tests on Ubuntu + macOS with Python 3.10-3.12

## Troubleshooting

**No content extracted**: Check `main_content` selector.
Common values: `article`, `main`, `div[role="main"]`, `div.content`

**Poor categorization**: Edit the `categories` section in the config with keywords specific to the documentation structure

**Force re-scrape**: Delete cached data with `rm -rf output/{name}_data/`

**Rate limiting issues**: Increase `rate_limit` value in config (e.g., from 0.5 to 1.0 seconds)

## Output Quality Checks

After building, verify quality:

```bash
cat output/godot/SKILL.md              # Should have real code examples
cat output/godot/references/index.md   # Should show categories
ls output/godot/references/            # Should have category .md files
```

## llms.txt Support

Skill_Seekers automatically detects llms.txt files before HTML scraping:

### Detection Order

1. `{base_url}/llms-full.txt` (complete documentation)
2. `{base_url}/llms.txt` (standard version)
3. `{base_url}/llms-small.txt` (quick reference)

### Benefits

- ⚡ 10x faster (< 5 seconds vs 20-60 seconds)
- ✅ More reliable (maintained by docs authors)
- 🎯 Better quality (pre-formatted for LLMs)
- 🚫 No rate limiting needed

### Example Sites

- Hono: https://hono.dev/llms-full.txt

If no llms.txt is found, the tool automatically falls back to HTML scraping.
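
The detection order above can be sketched as a simple first-match probe. The snippet below is a minimal illustration, not the tool's actual code: `llms_txt_candidates`, `detect_llms_txt`, and the injected `exists` callable are hypothetical names chosen so the logic can be exercised without network access.

```python
# Variants in the documented priority order.
LLMS_TXT_VARIANTS = ("llms-full.txt", "llms.txt", "llms-small.txt")


def llms_txt_candidates(base_url):
    """Build the candidate URLs for a base URL, most complete variant first."""
    root = base_url.rstrip("/")
    return [f"{root}/{name}" for name in LLMS_TXT_VARIANTS]


def detect_llms_txt(base_url, exists):
    """Return the first candidate URL for which exists(url) is true.

    `exists` is a caller-supplied callable (hypothetical here); returning
    None signals that the caller should fall back to HTML scraping.
    """
    for url in llms_txt_candidates(base_url):
        if exists(url):
            return url
    return None


# Offline demonstration with a stubbed availability check.
available = {"https://docs.example.com/llms.txt"}
print(detect_llms_txt("https://docs.example.com/", lambda url: url in available))
```

In a real run, `exists` might wrap an HTTP request such as `requests.head(url, timeout=10).ok`. Whether the tool probes with HEAD or GET is not stated here, so treat that as an assumption.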