As of this release, Skill Seeker supports asynchronous scraping, dramatically improving throughput when scraping documentation websites.
| Metric | Sync (Threads) | Async | Improvement |
|---|---|---|---|
| Pages/second | ~15-20 | ~40-60 | 2-3x faster |
| Memory per worker | ~10-15 MB | ~1-2 MB | 80-90% less |
| Max concurrent requests | ~50-100 | ~500-1000 | 10x more |
| CPU efficiency | GIL-limited | Full cores | Much better |
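The gap comes from the concurrency model: each thread carries its own stack and blocks while waiting on the network, while async multiplexes many in-flight requests on a single event loop. Here is a minimal sketch of the two styles, using the common `requests` and `aiohttp` libraries for illustration; this is not Skill Seeker's actual code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp
import requests

def fetch_all_threads(urls, workers=8):
    """Thread model: each worker blocks on I/O and holds its own stack."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: requests.get(u).text, urls))

async def fetch_all_async(urls, max_concurrent=8):
    """Async model: one event loop juggles many lightweight tasks."""
    sem = asyncio.Semaphore(max_concurrent)  # cap in-flight requests

    async def fetch(session, url):
        async with sem:
            async with session.get(url) as resp:
                return await resp.text()

    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))
```

Because an idle coroutine is just a suspended frame rather than an OS thread, concurrency can scale into the hundreds without the per-worker memory cost shown in the table.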
```bash
# Enable async mode with 8 workers for best performance
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8

# Quick mode with async
python3 cli/doc_scraper.py --name react --url https://react.dev/ --async --workers 8

# Dry run with async to test
python3 cli/doc_scraper.py --config configs/godot.json --async --workers 4 --dry-run
```
Add "async_mode": true to your config JSON:
```json
{
  "name": "react",
  "base_url": "https://react.dev/",
  "async_mode": true,
  "workers": 8,
  "rate_limit": 0.5,
  "max_pages": 500
}
```
Then run normally:

```bash
python3 cli/doc_scraper.py --config configs/react-async.json
```
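Internally, the scraper presumably branches on these keys when the config is loaded. Here is a hypothetical sketch of that dispatch; `async_mode` and `workers` are the real config keys, but the function names below are placeholders, not Skill Seeker's actual API:

```python
import asyncio
import json

async def scrape_all_async(config, workers):
    """Placeholder for the real async scraping loop."""
    print(f"async scrape of {config['base_url']} with {workers} workers")

def scrape_all_sync(config, workers):
    """Placeholder for the existing threaded scraping loop."""
    print(f"threaded scrape of {config['base_url']} with {workers} workers")

def run_from_config(path):
    """Hypothetical dispatch: honor the async_mode/workers config keys."""
    with open(path) as f:
        config = json.load(f)

    workers = config.get("workers", 4)
    if config.get("async_mode", False):
        asyncio.run(scrape_all_async(config, workers))  # new async path
    else:
        scrape_all_sync(config, workers)                # default threaded path
```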
Suggested starting points:

```bash
--async --workers 4                   # conservative starting point
--async --workers 8                   # large documentation sites
--async --workers 8 --no-rate-limit   # maximum speed; use carefully
```

Note: more workers isn't always better. Test with 4 workers, then 8, to find the optimal count for your use case.
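To find that optimum empirically, you can time runs at different worker counts. A hypothetical tuning harness (each iteration performs a real scrape, so point it at a small config):

```python
import subprocess
import time

# Hypothetical tuning harness: time a real scrape at each worker count.
for workers in (4, 8):
    start = time.monotonic()
    subprocess.run(
        ["python3", "cli/doc_scraper.py",
         "--config", "configs/react.json",
         "--async", "--workers", str(workers)],
        check=True,
    )
    print(f"{workers} workers: {time.monotonic() - start:.0f}s")
```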
**New Methods:**

- `async def scrape_page_async()` - async version of page scraping
- `async def scrape_all_async()` - async version of the scraping loop
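A sketch of how such methods could be structured, assuming an `aiohttp`-based event loop with a shared URL queue; the class shape, session handling, and link extraction here are illustrative, not Skill Seeker's actual implementation:

```python
import asyncio
import aiohttp

class AsyncDocScraper:
    """Illustrative sketch, not the real class."""

    def __init__(self, base_url, workers=8, rate_limit=0.5):
        self.base_url = base_url
        self.workers = workers
        self.rate_limit = rate_limit
        self.queue = asyncio.Queue()
        self.seen = set()

    async def scrape_page_async(self, session, url):
        """Fetch a single page; failures are reported, not fatal."""
        try:
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.text()
        except aiohttp.ClientError as exc:
            print(f"failed: {url}: {exc}")
            return None

    async def scrape_all_async(self):
        """Drain the URL queue with a pool of cooperative worker tasks."""
        await self.queue.put(self.base_url)
        async with aiohttp.ClientSession() as session:
            workers = [asyncio.create_task(self._worker(session))
                       for _ in range(self.workers)]
            await self.queue.join()   # wait until every queued URL is processed
            for w in workers:
                w.cancel()            # workers idle on queue.get(); stop them
            await asyncio.gather(*workers, return_exceptions=True)

    async def _worker(self, session):
        while True:
            url = await self.queue.get()
            try:
                if url not in self.seen:
                    self.seen.add(url)
                    html = await self.scrape_page_async(session, url)
                    # ...parse html, extract links, queue.put() new URLs...
                await asyncio.sleep(self.rate_limit)  # per-worker politeness delay
            finally:
                self.queue.task_done()
```

Note the per-worker sleep: with N workers the aggregate rate is roughly N requests per `rate_limit` seconds. The rate-limiting section below sketches a global alternative.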
**Key Technologies:** Python's native `async`/`await` coroutines.

**Backwards Compatibility:** Sync (threaded) mode remains the default; async is opt-in via the `--async` flag or `"async_mode": true` in the config. Existing configs and commands work unchanged.
**Sync Mode (Threads):**

```bash
python3 cli/doc_scraper.py --config configs/react.json --workers 8
# Time: ~45 minutes
# Pages/sec: ~18
# Memory: ~120 MB
```

**Async Mode:**

```bash
python3 cli/doc_scraper.py --config configs/react.json --async --workers 8
# Time: ~15 minutes (3x faster!)
# Pages/sec: ~55
# Memory: ~40 MB (66% less)
```
✅ Use async when:

- Scraping large documentation sites, where the 2-3x speedup compounds over many pages

❌ Don't use async when:

- Scraping small documentation sets, where sync mode is simpler and the speedup is negligible
Async mode respects rate limits just like sync mode:

```bash
# 0.5 second delay between requests (default)
--async --workers 8 --rate-limit 0.5

# No rate limiting (use carefully!)
--async --workers 8 --no-rate-limit
```
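One way to enforce a single global delay across all async workers is a shared, lock-protected limiter. This is a sketch of the general technique, not Skill Seeker's actual code:

```python
import asyncio
import time

class RateLimiter:
    """Enforce a minimum delay between requests across all async workers."""

    def __init__(self, delay=0.5):
        self.delay = delay
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self):
        if self.delay <= 0:            # --no-rate-limit behaviour
            return
        async with self._lock:
            remaining = self._last + self.delay - time.monotonic()
            if remaining > 0:
                await asyncio.sleep(remaining)
            self._last = time.monotonic()

# Usage inside a worker:
#   await limiter.wait()
#   async with session.get(url) as resp: ...
```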
Async mode supports checkpoints for resuming interrupted scrapes:

```json
{
  "async_mode": true,
  "checkpoint": {
    "enabled": true,
    "interval": 1000
  }
}
```
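A common implementation pattern for this kind of checkpointing is to periodically write crawl state to disk with an atomic rename. A sketch, where the file name and state fields are assumptions rather than Skill Seeker's actual on-disk format:

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"   # hypothetical file name

def save_checkpoint(seen_urls, pending_urls, pages_done):
    """Persist crawl state so an interrupted scrape can resume."""
    state = {
        "seen": sorted(seen_urls),
        "pending": list(pending_urls),
        "pages_done": pages_done,
    }
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic swap avoids torn files

def load_checkpoint():
    """Return saved state, or None if starting fresh."""
    if not os.path.exists(CHECKPOINT_FILE):
        return None
    with open(CHECKPOINT_FILE) as f:
        return json.load(f)

# In the scrape loop, save every `interval` pages:
#   if pages_done % interval == 0:
#       save_checkpoint(seen, pending, pages_done)
```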
Async mode includes comprehensive tests:

```bash
# Run async-specific tests
python -m pytest tests/test_async_scraping.py -v

# Run all tests
python cli/run_tests.py
```
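For reference, async tests typically rely on an async-aware plugin such as `pytest-asyncio` (an assumption about this suite's stack). A minimal example of the kind of property such tests can check:

```python
import asyncio
import pytest

# Requires the pytest-asyncio plugin for `async def` test support.
@pytest.mark.asyncio
async def test_workers_respect_concurrency_limit():
    sem = asyncio.Semaphore(4)
    active = 0
    peak = 0

    async def fake_fetch():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # simulate network I/O
            active -= 1

    await asyncio.gather(*(fake_fetch() for _ in range(20)))
    assert peak <= 4                    # never more than 4 in flight
```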
**Test Coverage:** see `tests/test_async_scraping.py` for the full async suite.
**Async slower than expected?** Reduce the worker count:

```bash
--async --workers 4  # Instead of 8
```

This can happen with small documentation sites, where per-request overhead outweighs the concurrency gains. Solution: use sync mode for small docs, async for large ones.

**Memory usage still high?** Async reduces memory per worker, but total usage still grows with the number of concurrent requests. Solution: use 4-6 workers instead of 8-10.
```bash
# Godot documentation (~1,600 pages)
python3 cli/doc_scraper.py \
    --config configs/godot.json \
    --async \
    --workers 8 \
    --rate-limit 0.3
# Result: ~12 minutes (vs 40 minutes sync)
```
```bash
# Django documentation with polite rate limiting
python3 cli/doc_scraper.py \
    --config configs/django.json \
    --async \
    --workers 4 \
    --rate-limit 1.0
# Still faster than sync, but respectful to the server
```
```bash
# Dry run to test async without actual scraping
python3 cli/doc_scraper.py \
    --config configs/react.json \
    --async \
    --workers 8 \
    --dry-run
# Preview URLs, test configuration
```
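Internally, a dry run presumably short-circuits before any network fetching or disk writes. A hypothetical sketch of that gate, with illustrative function and field names:

```python
def run_scrape(config, dry_run=False):
    """Hypothetical entry point showing where --dry-run would branch."""
    if dry_run:
        # Preview mode: report the plan, touch nothing.
        print(f"Would scrape {config['base_url']} "
              f"(max {config.get('max_pages', 'unlimited')} pages)")
        return
    # ...normal scraping path (fetch, parse, write output)...
```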
Further improvements to async mode are planned. See the `configs/` directory for example configurations.