Date: October 21, 2025
Task: B1.1 - Research PDF parsing libraries
Purpose: Evaluate Python libraries for extracting text and code from PDF documentation
After comprehensive research, PyMuPDF (fitz) is recommended as the primary library for Skill Seeker's PDF parsing needs, with pdfplumber as a secondary option for complex table extraction.
| Library | Speed | Text Quality | Code Detection | Tables | Maintenance | License |
|---|---|---|---|---|---|---|
| PyMuPDF | ⚡⚡⚡⚡⚡ Fastest (42ms) | High | Excellent | Good | Active | AGPL/Commercial |
| pdfplumber | ⚡⚡ Slower (2.5s) | Very High | Excellent | Excellent | Active | MIT |
| pypdf | ⚡⚡⚡ Fast | Medium | Good | Basic | Active | BSD |
| pdfminer.six | ⚡ Slow | Very High | Good | Medium | Active | MIT |
| pypdfium2 | ⚡⚡⚡⚡⚡ Very Fast (3ms) | Medium | Good | Basic | Active | Apache-2.0 |
Performance: ~42 milliseconds for a single page (60x faster than pdfminer.six)
Installation:

```bash
pip install PyMuPDF
```

Pros:
- Fastest full-featured option (~42 ms per page)
- Excellent code detection, good table support
- Actively maintained
Cons:
- AGPL license: source must be shared, or a commercial license purchased, for proprietary use
Code Example:

```python
import fitz  # PyMuPDF

# Extract text from entire PDF
def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Extract text from a single page
def extract_page_text(pdf_path, page_num):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    text = page.get_text()
    doc.close()
    return text

# Extract with markdown formatting. PyMuPDF itself does not accept
# "markdown" as a get_text() option; the companion package pymupdf4llm
# provides it (pip install pymupdf4llm).
def extract_as_markdown(pdf_path):
    import pymupdf4llm
    return pymupdf4llm.to_markdown(pdf_path)
```
Use Cases for Skill Seeker: primary extraction engine for PDF documentation (fast text, markdown, and image extraction)
Performance: ~2.5 seconds for a single page (slower but more precise)
Installation:

```bash
pip install pdfplumber
```

Pros:
- Very high text quality, excellent table extraction
- Permissive MIT license
Cons:
- ~60x slower than PyMuPDF (~2.5 s per page)
Code Example:

```python
import pdfplumber

# Extract text from PDF
def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # extract_text() returns None for pages with no text layer
            text += page.extract_text() or ''
        return text

# Extract tables
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

# Extract a specific region (e.g. a code block)
def extract_region(pdf_path, page_num, bbox):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.crop(bbox)
        return cropped.extract_text()
```
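pdfplumber's `crop()` takes a bounding box as `(x0, top, x1, bottom)` in PDF points, measured from the top-left corner of the page. As an illustrative helper (not part of pdfplumber), a box covering the lower half of a page could be computed like this:

```python
def lower_half_bbox(page_width, page_height):
    """Bounding box for the lower half of a page, in pdfplumber's
    (x0, top, x1, bottom) convention with a top-left origin.
    In practice the dimensions come from page.width / page.height."""
    return (0, page_height / 2, page_width, page_height)

# For a US-letter page (612 x 792 points):
# lower_half_bbox(612, 792) -> (0, 396.0, 612, 792)
```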
Use Cases for Skill Seeker: secondary engine, used as a fallback for complex table extraction
Performance: fast (~0.1 s for a single page)
Installation:

```bash
pip install pypdf
```

Pros:
- Permissive BSD license, actively maintained
Cons:
- Medium text quality, only basic table support
Code Example:

```python
from pypdf import PdfReader

# Extract text
def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
```
Use Cases for Skill Seeker: not a primary choice; PyMuPDF is faster with better code detection and table support
Performance: slow (~2.5 seconds for a single page)
Installation:

```bash
pip install pdfminer.six
```

Pros:
- Very high text quality, permissive MIT license
Cons:
- Slowest option tested (~60x slower than PyMuPDF), without pdfplumber's table extraction
Use Cases for Skill Seeker: not recommended; same ~2.5 s-per-page cost as pdfplumber with weaker table support
Performance: very fast (~3 ms for a single page - fastest tested)
Installation:

```bash
pip install pypdfium2
```

Pros:
- Fastest library tested (~3 ms per page), permissive Apache-2.0 license
Cons:
- Medium text quality, only basic table support
Use Cases for Skill Seeker: a speed-critical alternative, but medium text quality and basic table support rule it out as the primary choice
PyMuPDF requires AGPL compliance (source code must be shared) OR a commercial license for proprietary use. Since Skill Seeker is open source on GitHub, AGPL is acceptable.
Based on 2025 testing:
| Library | Time (single page) | Time (100 pages) |
|---|---|---|
| pypdfium2 | 0.003s | 0.3s |
| PyMuPDF | 0.042s | 4.2s |
| pypdf | 0.1s | 10s |
| pdfplumber | 2.5s | 250s |
| pdfminer.six | 2.5s | 250s |
Winner: pypdfium2 (speed) / PyMuPDF (features + speed balance)
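The timings above are machine-dependent. A minimal harness along these lines (the helper name and PDF path are illustrative; any extraction function from this document can be passed in) can reproduce them locally:

```python
import time

def benchmark(extract_fn, pdf_path, runs=3):
    """Return the best-of-N wall-clock time (seconds) for one extraction.
    Best-of-N reduces noise from disk caches and interpreter warm-up."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        extract_fn(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best

# Example: benchmark(extract_pdf_text, "docs/manual.pdf")
```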
Why:
- Best balance of speed (~42 ms per page) and features
- Excellent code detection; markdown output (via pymupdf4llm) preserves code blocks
- AGPL license is acceptable for an open-source project
Implementation Strategy:

```python
import fitz  # PyMuPDF
import pymupdf4llm  # companion package: pip install pymupdf4llm

def extract_pdf_documentation(pdf_path):
    """
    Extract documentation from PDF with code block detection.
    """
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        # Get plain text with layout-based ordering
        text = page.get_text("text")
        # Get markdown (preserves code blocks); pymupdf4llm takes
        # 0-based page numbers
        markdown = pymupdf4llm.to_markdown(pdf_path, pages=[page_num])
        # Get image references (for diagrams)
        images = page.get_images()
        pages.append({
            'page_number': page_num,
            'text': text,
            'markdown': markdown,
            'images': images,
        })
    doc.close()
    return pages
```
When to use: as a fallback when documentation contains complex tables that PyMuPDF extracts poorly
Implementation Strategy:

```python
import pdfplumber

def extract_pdf_tables(pdf_path):
    """
    Extract tables from PDF documentation.
    """
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
        return tables
```
PDFs don't have semantic "code block" markers like HTML. Detection strategies:
```python
# PyMuPDF can detect font changes
def detect_code_by_font(page):
    blocks = page.get_text("dict")["blocks"]
    code_blocks = []
    for block in blocks:
        if 'lines' in block:
            for line in block['lines']:
                for span in line['spans']:
                    font = span['font']
                    # Monospace fonts indicate code
                    if 'Courier' in font or 'Mono' in font:
                        code_blocks.append(span['text'])
    return code_blocks
```
```python
def detect_code_by_indent(text):
    lines = text.split('\n')
    code_blocks = []
    current_block = []
    for line in lines:
        # Code often has consistent indentation
        if line.startswith('    ') or line.startswith('\t'):
            current_block.append(line)
        elif current_block:
            code_blocks.append('\n'.join(current_block))
            current_block = []
    # Flush a block that runs to the end of the text
    if current_block:
        code_blocks.append('\n'.join(current_block))
    return code_blocks
```
```python
import re

def detect_code_by_pattern(text):
    # Look for common code patterns
    patterns = [
        r'(def \w+\(.*?\):)',         # Python functions
        r'(function \w+\(.*?\) \{)',  # JavaScript functions
        r'(class \w+:)',              # Python classes
        r'(import \w+)',              # Import statements
    ]
    code_snippets = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        code_snippets.extend(matches)
    return code_snippets
```
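The text-based heuristics can also be combined. As a stdlib-only sketch (the function name is illustrative, and the patterns and indentation threshold would need tuning on real documentation PDFs), a line is treated as code if it is indented or matches a known construct:

```python
import re

def detect_code_combined(text):
    """Merge the indentation and pattern heuristics: collect runs of
    consecutive lines that are either indented or match a common
    code construct."""
    pattern = re.compile(
        r"^\s*(def \w+\(|function \w+\(|class \w+|import \w+|from \w+ import )"
    )
    blocks, current = [], []
    for line in text.split("\n"):
        if line.startswith(("    ", "\t")) or pattern.match(line):
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:  # flush a block that runs to the end of the text
        blocks.append("\n".join(current))
    return blocks
```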
Goal: Proof of concept using PyMuPDF
Implementation Plan:
File: cli/pdf_extractor_poc.py

Dependencies:

```bash
pip install PyMuPDF
```
Expected Output:

```json
{
  "pages": [
    {
      "page_number": 1,
      "text": "...",
      "code_blocks": ["def main():", "import sys"],
      "images": []
    }
  ]
}
```
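Assembling that structure is straightforward with the standard library. A sketch (function names are illustrative; the per-page fields come from the extractors and detectors above):

```python
import json

def page_record(page_number, text, code_blocks, images):
    """One entry of the PoC's "pages" array (field names match the
    expected output above)."""
    return {
        "page_number": page_number,
        "text": text,
        "code_blocks": code_blocks,
        "images": images,
    }

def build_poc_output(records):
    """Serialize page records into the PoC's expected JSON document."""
    return json.dumps({"pages": records}, indent=2)
```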
Integration point: pdf_scraper.py CLI

For Skill Seeker's PDF documentation extraction:
Estimated Implementation Time:
License: AGPL (PyMuPDF) is acceptable for Skill Seeker (open source)
Research completed: ✅ October 21, 2025 Next task: B1.2 - Create simple PDF text extractor (proof of concept)