Date: October 21, 2025
Task: B1.1 - Research PDF parsing libraries
Purpose: Evaluate Python libraries for extracting text and code from PDF documentation
After comprehensive research, PyMuPDF (fitz) is recommended as the primary library for Skill Seeker's PDF parsing needs, with pdfplumber as a secondary option for complex table extraction.
| Library | Speed | Text Quality | Code Detection | Tables | Maintenance | License |
|---|---|---|---|---|---|---|
| PyMuPDF | ⚡⚡⚡⚡⚡ Fastest (42ms) | High | Excellent | Good | Active | AGPL/Commercial |
| pdfplumber | ⚡⚡ Slower (2.5s) | Very High | Excellent | Excellent | Active | MIT |
| pypdf | ⚡⚡⚡ Fast | Medium | Good | Basic | Active | BSD |
| pdfminer.six | ⚡ Slow | Very High | Good | Medium | Active | MIT |
| pypdfium2 | ⚡⚡⚡⚡⚡ Very Fast (3ms) | Medium | Good | Basic | Active | Apache-2.0 |
Performance: ~42 milliseconds for a single page (60x faster than pdfminer.six)
Installation:

```bash
pip install PyMuPDF
```

Pros:
- Fastest full-featured option (~42 ms per page)
- Excellent code detection, good table support
- Actively maintained
Cons:
- AGPL license: source must be shared, or a commercial license purchased, for proprietary use
Code Example:

```python
import fitz  # PyMuPDF

# Extract text from entire PDF
def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Extract text from a single page
def extract_page_text(pdf_path, page_num):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    text = page.get_text()
    doc.close()
    return text

# Extract with markdown formatting. PyMuPDF itself does not accept
# "markdown" as a get_text() option; the companion package pymupdf4llm
# provides it (pip install pymupdf4llm).
def extract_as_markdown(pdf_path):
    import pymupdf4llm
    return pymupdf4llm.to_markdown(pdf_path)
```
Use Cases for Skill Seeker: primary extraction engine for PDF documentation (fast text, markdown, and image extraction)
Performance: ~2.5 seconds for a single page (slower but more precise)
Installation:

```bash
pip install pdfplumber
```

Pros:
- Very high text quality, excellent table extraction
- Permissive MIT license
Cons:
- ~60x slower than PyMuPDF (~2.5 s per page)
Code Example:

```python
import pdfplumber

# Extract text from PDF
def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # extract_text() returns None for pages with no text layer
            text += page.extract_text() or ''
        return text

# Extract tables
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

# Extract a specific region (e.g. a code block)
def extract_region(pdf_path, page_num, bbox):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.crop(bbox)
        return cropped.extract_text()
```
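pdfplumber's `crop()` takes a bounding box as `(x0, top, x1, bottom)` in PDF points, measured from the top-left corner of the page. As an illustrative helper (not part of pdfplumber), a box covering the lower half of a page could be computed like this:

```python
def lower_half_bbox(page_width, page_height):
    """Bounding box for the lower half of a page, in pdfplumber's
    (x0, top, x1, bottom) convention with a top-left origin.
    In practice the dimensions come from page.width / page.height."""
    return (0, page_height / 2, page_width, page_height)

# For a US-letter page (612 x 792 points):
# lower_half_bbox(612, 792) -> (0, 396.0, 612, 792)
```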
Use Cases for Skill Seeker: secondary engine, used as a fallback for complex table extraction
Performance: fast (~0.1 s for a single page)
Installation:

```bash
pip install pypdf
```

Pros:
- Permissive BSD license, actively maintained
Cons:
- Medium text quality, only basic table support
Code Example:

```python
from pypdf import PdfReader

# Extract text
def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text
```
Use Cases for Skill Seeker: not a primary choice; PyMuPDF is faster with better code detection and table support
Performance: slow (~2.5 seconds for a single page)
Installation:

```bash
pip install pdfminer.six
```

Pros:
- Very high text quality, permissive MIT license
Cons:
- Slowest option tested (~60x slower than PyMuPDF), without pdfplumber's table extraction
Use Cases for Skill Seeker: not recommended; same ~2.5 s-per-page cost as pdfplumber with weaker table support
Performance: very fast (~3 ms for a single page - fastest tested)
Installation:

```bash
pip install pypdfium2
```

Pros:
- Fastest library tested (~3 ms per page), permissive Apache-2.0 license
Cons:
- Medium text quality, only basic table support
Use Cases for Skill Seeker: a speed-critical alternative, but medium text quality and basic table support rule it out as the primary choice
PyMuPDF requires AGPL compliance (source code must be shared) OR a commercial license for proprietary use. Since Skill Seeker is open source on GitHub, AGPL is acceptable.
Based on 2025 testing:
| Library | Time (single page) | Time (100 pages) |
|---|---|---|
| pypdfium2 | 0.003s | 0.3s |
| PyMuPDF | 0.042s | 4.2s |
| pypdf | 0.1s | 10s |
| pdfplumber | 2.5s | 250s |
| pdfminer.six | 2.5s | 250s |
Winner: pypdfium2 (speed) / PyMuPDF (features + speed balance)
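The timings above are machine-dependent. A minimal harness along these lines (the helper name and PDF path are illustrative; any extraction function from this document can be passed in) can reproduce them locally:

```python
import time

def benchmark(extract_fn, pdf_path, runs=3):
    """Return the best-of-N wall-clock time (seconds) for one extraction.
    Best-of-N reduces noise from disk caches and interpreter warm-up."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        extract_fn(pdf_path)
        best = min(best, time.perf_counter() - start)
    return best

# Example: benchmark(extract_pdf_text, "docs/manual.pdf")
```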
Why:
- Best balance of speed (~42 ms per page) and features
- Excellent code detection; markdown output (via pymupdf4llm) preserves code blocks
- AGPL license is acceptable for an open-source project
Implementation Strategy:

```python
import fitz  # PyMuPDF
import pymupdf4llm  # companion package: pip install pymupdf4llm

def extract_pdf_documentation(pdf_path):
    """
    Extract documentation from PDF with code block detection.
    """
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc):
        # Get plain text with layout-based ordering
        text = page.get_text("text")
        # Get markdown (preserves code blocks); pymupdf4llm takes
        # 0-based page numbers
        markdown = pymupdf4llm.to_markdown(pdf_path, pages=[page_num])
        # Get image references (for diagrams)
        images = page.get_images()
        pages.append({
            'page_number': page_num,
            'text': text,
            'markdown': markdown,
            'images': images,
        })
    doc.close()
    return pages
```
When to use: as a fallback when documentation contains complex tables that PyMuPDF extracts poorly
Implementation Strategy:

```python
import pdfplumber

def extract_pdf_tables(pdf_path):
    """
    Extract tables from PDF documentation.
    """
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
        return tables
```
PDFs don't have semantic "code block" markers like HTML. Detection strategies:
```python
# PyMuPDF can detect font changes
def detect_code_by_font(page):
    blocks = page.get_text("dict")["blocks"]
    code_blocks = []
    for block in blocks:
        if 'lines' in block:
            for line in block['lines']:
                for span in line['spans']:
                    font = span['font']
                    # Monospace fonts indicate code
                    if 'Courier' in font or 'Mono' in font:
                        code_blocks.append(span['text'])
    return code_blocks
```
```python
def detect_code_by_indent(text):
    lines = text.split('\n')
    code_blocks = []
    current_block = []
    for line in lines:
        # Code often has consistent indentation
        if line.startswith('    ') or line.startswith('\t'):
            current_block.append(line)
        elif current_block:
            code_blocks.append('\n'.join(current_block))
            current_block = []
    # Flush a block that runs to the end of the text
    if current_block:
        code_blocks.append('\n'.join(current_block))
    return code_blocks
```
```python
import re

def detect_code_by_pattern(text):
    # Look for common code patterns
    patterns = [
        r'(def \w+\(.*?\):)',         # Python functions
        r'(function \w+\(.*?\) \{)',  # JavaScript functions
        r'(class \w+:)',              # Python classes
        r'(import \w+)',              # Import statements
    ]
    code_snippets = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        code_snippets.extend(matches)
    return code_snippets
```
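The text-based heuristics can also be combined. As a stdlib-only sketch (the function name is illustrative, and the patterns and indentation threshold would need tuning on real documentation PDFs), a line is treated as code if it is indented or matches a known construct:

```python
import re

def detect_code_combined(text):
    """Merge the indentation and pattern heuristics: collect runs of
    consecutive lines that are either indented or match a common
    code construct."""
    pattern = re.compile(
        r"^\s*(def \w+\(|function \w+\(|class \w+|import \w+|from \w+ import )"
    )
    blocks, current = [], []
    for line in text.split("\n"):
        if line.startswith(("    ", "\t")) or pattern.match(line):
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:  # flush a block that runs to the end of the text
        blocks.append("\n".join(current))
    return blocks
```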
Goal: Proof of concept using PyMuPDF
Implementation Plan:
File: cli/pdf_extractor_poc.py

Dependencies:

```bash
pip install PyMuPDF
```
Expected Output:

```json
{
  "pages": [
    {
      "page_number": 1,
      "text": "...",
      "code_blocks": ["def main():", "import sys"],
      "images": []
    }
  ]
}
```
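Assembling that structure is straightforward with the standard library. A sketch (function names are illustrative; the per-page fields come from the extractors and detectors above):

```python
import json

def page_record(page_number, text, code_blocks, images):
    """One entry of the PoC's "pages" array (field names match the
    expected output above)."""
    return {
        "page_number": page_number,
        "text": text,
        "code_blocks": code_blocks,
        "images": images,
    }

def build_poc_output(records):
    """Serialize page records into the PoC's expected JSON document."""
    return json.dumps({"pages": records}, indent=2)
```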
Integration point: pdf_scraper.py CLI

For Skill Seeker's PDF documentation extraction:
Estimated Implementation Time:
License: AGPL (PyMuPDF) is acceptable for Skill Seeker (open source)
Research completed: ✅ October 21, 2025 Next task: B1.2 - Create simple PDF text extractor (proof of concept)