
PDF Parsing Libraries Research (Task B1.1)

Date: October 21, 2025
Task: B1.1 - Research PDF parsing libraries
Purpose: Evaluate Python libraries for extracting text and code from PDF documentation


Executive Summary

After comprehensive research, PyMuPDF (fitz) is recommended as the primary library for Skill Seeker's PDF parsing needs, with pdfplumber as a secondary option for complex table extraction.

Quick Recommendation:

  • Primary Choice: PyMuPDF (fitz) - Fast, comprehensive, well-maintained
  • Secondary/Fallback: pdfplumber - Better for tables, slower but more precise
  • Avoid: PyPDF2 (deprecated, merged into pypdf)

Library Comparison Matrix

| Library | Speed | Text Quality | Code Detection | Tables | Maintenance | License |
| --- | --- | --- | --- | --- | --- | --- |
| PyMuPDF | ⚡⚡⚡⚡⚡ Very fast (42ms) | High | Excellent | Good | Active | AGPL/Commercial |
| pdfplumber | ⚡⚡ Slower (2.5s) | Very High | Excellent | Excellent | Active | MIT |
| pypdf | ⚡⚡⚡ Fast | Medium | Good | Basic | Active | BSD |
| pdfminer.six | ⚡ Slow | Very High | Good | Medium | Active | MIT |
| pypdfium2 | ⚡⚡⚡⚡⚡ Fastest (3ms) | Medium | Good | Basic | Active | Apache-2.0 |

Detailed Analysis

1. PyMuPDF (fitz) ⭐ RECOMMENDED

Performance: ~42 milliseconds per page (about 60x faster than pdfminer.six)

Installation:

pip install PyMuPDF

Pros:

  • ✅ Extremely fast (C-based MuPDF backend)
  • ✅ Comprehensive features (text, images, tables, metadata)
  • ✅ Markdown output via the companion pymupdf4llm package
  • ✅ Can extract images and diagrams
  • ✅ Well-documented and actively maintained
  • ✅ Handles complex layouts well

Cons:

  • ⚠️ AGPL license (requires commercial license for proprietary projects)
  • ⚠️ Requires MuPDF binary installation (handled by pip)
  • ⚠️ Slightly larger dependency footprint

Code Example:

import fitz  # PyMuPDF

# Extract text from entire PDF
def extract_pdf_text(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

# Extract text from single page
def extract_page_text(pdf_path, page_num):
    doc = fitz.open(pdf_path)
    page = doc.load_page(page_num)
    text = page.get_text()
    doc.close()
    return text

# Extract with markdown formatting
# Note: markdown output comes from the companion pymupdf4llm package
# (pip install pymupdf4llm), not from page.get_text()
def extract_as_markdown(pdf_path):
    import pymupdf4llm
    return pymupdf4llm.to_markdown(pdf_path)

Use Cases for Skill Seeker:

  • Fast extraction of code examples from PDF docs
  • Preserving formatting for code blocks
  • Extracting diagrams and screenshots
  • High-volume documentation scraping
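
The use cases above include diagrams and screenshots. A minimal sketch of embedded-image extraction, assuming an illustrative output directory name:

import pathlib
import fitz  # PyMuPDF

def extract_pdf_images(pdf_path, out_dir="pdf_images"):
    """Save every embedded image in the PDF to out_dir."""
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    doc = fitz.open(pdf_path)
    saved = []

    for page_num, page in enumerate(doc, start=1):
        # get_images(full=True) lists embedded images; the first tuple entry is the xref
        for img_index, img in enumerate(page.get_images(full=True)):
            xref = img[0]
            info = doc.extract_image(xref)  # raw bytes plus the image extension
            out_path = pathlib.Path(out_dir) / f"page{page_num}_img{img_index}.{info['ext']}"
            out_path.write_bytes(info["image"])
            saved.append(str(out_path))

    doc.close()
    return saved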

2. pdfplumber ⭐ RECOMMENDED (for tables)

Performance: ~2.5 seconds per page (slower but more precise)

Installation:

pip install pdfplumber

Pros:

  • ✅ MIT license (fully open source)
  • ✅ Exceptional table extraction
  • ✅ Visual debugging tool
  • ✅ Precise layout preservation
  • ✅ Built on pdfminer (proven text extraction)
  • ✅ No binary dependencies

Cons:

  • ⚠️ Slower than PyMuPDF
  • ⚠️ Higher memory usage for large PDFs
  • ⚠️ Requires more configuration for optimal results

Code Example:

import pdfplumber

# Extract text from PDF
def extract_with_pdfplumber(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''
        for page in pdf.pages:
            # extract_text() can return None for pages with no text layer
            text += (page.extract_text() or '')
        return text

# Extract tables
def extract_tables(pdf_path):
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_tables = page.extract_tables()
            tables.extend(page_tables)
    return tables

# Extract specific region (for code blocks)
def extract_region(pdf_path, page_num, bbox):
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.crop(bbox)
        return cropped.extract_text()

Use Cases for Skill Seeker:

  • Extracting API reference tables from PDFs
  • Precise code block extraction with layout
  • Documentation with complex table structures
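
The visual debugging mentioned in the pros can help tune extraction settings. A minimal sketch, assuming illustrative file names (depending on the pdfplumber version, page.to_image() may require an optional rendering backend):

import pdfplumber

def debug_page_layout(pdf_path, page_num=0, out_path="debug_page.png"):
    """Render a page with detected words and table lines drawn on it."""
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        im = page.to_image(resolution=150)
        im.draw_rects(page.extract_words())  # outline every detected word
        im.debug_tablefinder()               # overlay the table-detection grid
        im.save(out_path)
    return out_path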

3. pypdf (formerly PyPDF2)

Performance: Fast (~0.1 seconds per page)

Installation:

pip install pypdf

Pros:

  • ✅ BSD license
  • ✅ Simple API
  • ✅ Can modify PDFs (merge, split, encrypt)
  • ✅ Actively maintained (PyPDF2 merged back)
  • ✅ No external dependencies

Cons:

  • ⚠️ Limited complex layout support
  • ⚠️ Basic text extraction only
  • ⚠️ Poor with scanned/image PDFs
  • ⚠️ No table extraction

Code Example:

from pypdf import PdfReader

# Extract text
def extract_with_pypdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

Use Cases for Skill Seeker:

  • Simple text extraction
  • Fallback when PyMuPDF licensing is an issue
  • Basic PDF manipulation tasks

4. pdfminer.six

Performance: Slow (~2.5 seconds per page)

Installation:

pip install pdfminer.six

Pros:

  • ✅ MIT license
  • ✅ Excellent text quality (preserves formatting)
  • ✅ Handles complex layouts
  • ✅ Pure Python (no binaries)

Cons:

  • ⚠️ Slowest option
  • ⚠️ Complex API
  • ⚠️ Poor documentation
  • ⚠️ Limited table support

Use Cases for Skill Seeker:

  • Not recommended (pdfplumber is built on this with better API)

5. pypdfium2

Performance: Very fast (~3 ms per page - fastest tested)

Installation:

pip install pypdfium2

Pros:

  • ✅ Extremely fast
  • ✅ Apache 2.0 license
  • ✅ Lightweight
  • ✅ Clean output

Cons:

  • ⚠️ Basic features only
  • ⚠️ Limited documentation
  • ⚠️ No table extraction
  • ⚠️ Newer/less proven

Use Cases for Skill Seeker:

  • High-speed basic extraction
  • Potential future optimization
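
Since there is no example above for this library, a minimal text-extraction sketch, assuming the current pypdfium2 API:

import pypdfium2 as pdfium

def extract_with_pypdfium2(pdf_path):
    """Fast, basic text extraction with pypdfium2."""
    pdf = pdfium.PdfDocument(pdf_path)
    text = ''
    for i in range(len(pdf)):
        textpage = pdf[i].get_textpage()
        text += textpage.get_text_range() + '\n'
    return text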

Licensing Considerations

Open Source Projects (Skill Seeker):

  • PyMuPDF: ✅ AGPL license is fine for open-source projects
  • pdfplumber: ✅ MIT license (most permissive)
  • pypdf: ✅ BSD license (permissive)

Important Note:

PyMuPDF requires AGPL compliance (source code must be shared) OR a commercial license for proprietary use. Since Skill Seeker is open source on GitHub, AGPL is acceptable.


Performance Benchmarks

Based on 2025 testing:

| Library | Time (single page) | Time (100 pages) |
| --- | --- | --- |
| pypdfium2 | 0.003s | 0.3s |
| PyMuPDF | 0.042s | 4.2s |
| pypdf | 0.1s | 10s |
| pdfplumber | 2.5s | 250s |
| pdfminer.six | 2.5s | 250s |

Winner: pypdfium2 (speed) / PyMuPDF (features + speed balance)
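
These figures are approximate and vary by document. A small harness along these lines (sample.pdf is a placeholder) can reproduce rough numbers locally:

import time
import fitz  # PyMuPDF
import pdfplumber

PDF_PATH = "sample.pdf"  # placeholder: any local documentation PDF

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def bench(label, fn, repeats=3):
    # Report the best of a few runs to smooth out noise
    best = min(timed(fn) for _ in range(repeats))
    print(f"{label}: {best:.3f}s")

def with_pymupdf():
    doc = fitz.open(PDF_PATH)
    _ = [page.get_text() for page in doc]
    doc.close()

def with_pdfplumber():
    with pdfplumber.open(PDF_PATH) as pdf:
        _ = [page.extract_text() for page in pdf.pages]

bench("PyMuPDF", with_pymupdf)
bench("pdfplumber", with_pdfplumber)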


Recommendations for Skill Seeker

Primary Approach: PyMuPDF (fitz)

Why:

  1. Speed - roughly 60x faster than pdfminer.six and pdfplumber
  2. Features - Text, images, markdown output, metadata
  3. Quality - High-quality text extraction
  4. Maintained - Active development, good docs
  5. License - AGPL is fine for open source

Implementation Strategy:

import fitz  # PyMuPDF
import pymupdf4llm  # pip install pymupdf4llm (markdown conversion helper)

def extract_pdf_documentation(pdf_path):
    """
    Extract documentation from PDF with code block detection
    """
    doc = fitz.open(pdf_path)
    pages = []

    for page_num, page in enumerate(doc):
        # Get plain text with layout info
        text = page.get_text("text")

        # Get markdown (preserves code blocks) via the pymupdf4llm helper
        markdown = pymupdf4llm.to_markdown(doc, pages=[page_num])

        # Get references to embedded images (for diagrams)
        images = page.get_images(full=True)

        pages.append({
            'page_number': page_num + 1,  # 1-based, matching the expected JSON output
            'text': text,
            'markdown': markdown,
            'images': images
        })

    doc.close()
    return pages

Fallback Approach: pdfplumber

When to use:

  • PDF has complex tables that PyMuPDF misses
  • Need visual debugging
  • License concerns (use MIT instead of AGPL)

Implementation Strategy:

import pdfplumber

def extract_pdf_tables(pdf_path):
    """
    Extract tables from PDF documentation
    """
    with pdfplumber.open(pdf_path) as pdf:
        tables = []
        for page in pdf.pages:
            page_tables = page.extract_tables()
            if page_tables:
                tables.extend(page_tables)
        return tables
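
Tying the primary and fallback approaches together, one possible hybrid wrapper (the function name is illustrative) uses PyMuPDF for text and only falls back to pdfplumber when tables are requested:

import fitz  # PyMuPDF
import pdfplumber

def extract_hybrid(pdf_path, extract_tables=False):
    """Text via PyMuPDF; tables via pdfplumber only when requested."""
    doc = fitz.open(pdf_path)
    result = {'text': ''.join(page.get_text() for page in doc), 'tables': []}
    doc.close()

    if extract_tables:
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                result['tables'].extend(page.extract_tables() or [])

    return result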

Code Block Detection Strategy

PDFs don't have semantic "code block" markers like HTML. Detection strategies:

1. Font-based Detection

# PyMuPDF can detect font changes
def detect_code_by_font(page):
    blocks = page.get_text("dict")["blocks"]
    code_blocks = []

    for block in blocks:
        if 'lines' in block:
            for line in block['lines']:
                for span in line['spans']:
                    font = span['font']
                    # Monospace fonts indicate code
                    if 'Courier' in font or 'Mono' in font:
                        code_blocks.append(span['text'])

    return code_blocks

2. Indentation-based Detection

def detect_code_by_indent(text):
    lines = text.split('\n')
    code_blocks = []
    current_block = []

    for line in lines:
        # Code often has consistent indentation
        if line.startswith('    ') or line.startswith('\t'):
            current_block.append(line)
        elif current_block:
            code_blocks.append('\n'.join(current_block))
            current_block = []

    # Flush a trailing block that runs to the end of the text
    if current_block:
        code_blocks.append('\n'.join(current_block))

    return code_blocks

3. Pattern-based Detection

import re

def detect_code_by_pattern(text):
    # Look for common code patterns
    patterns = [
        r'(def \w+\(.*?\):)',  # Python functions
        r'(function \w+\(.*?\) \{)',  # JavaScript
        r'(class \w+:)',  # Python classes
        r'(import \w+)',  # Import statements
    ]

    code_snippets = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        code_snippets.extend(matches)

    return code_snippets
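
The Conclusion recommends combining font and pattern matching. One possible way to merge the two heuristics, reusing detect_code_by_font and detect_code_by_pattern from above (the length threshold is arbitrary):

def detect_code_blocks(page, min_length=10):
    """Combine font-based and pattern-based detection on a PyMuPDF page."""
    font_hits = detect_code_by_font(page)
    pattern_hits = detect_code_by_pattern(page.get_text("text"))

    # De-duplicate while preserving order, dropping very short fragments
    seen = set()
    combined = []
    for snippet in font_hits + pattern_hits:
        snippet = snippet.strip()
        if len(snippet) >= min_length and snippet not in seen:
            seen.add(snippet)
            combined.append(snippet)
    return combined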

Next Steps (Task B1.2+)

Immediate Next Task: B1.2 - Create Simple PDF Text Extractor

Goal: Proof of concept using PyMuPDF

Implementation Plan:

  1. Create cli/pdf_extractor_poc.py
  2. Extract text from sample PDF
  3. Detect code blocks using font/pattern matching
  4. Output to JSON (similar to web scraper)

Dependencies:

pip install PyMuPDF

Expected Output:

{
  "pages": [
    {
      "page_number": 1,
      "text": "...",
      "code_blocks": ["def main():", "import sys"],
      "images": []
    }
  ]
}
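
Putting the plan together, a minimal sketch of the proof of concept (file layout and function names are assumptions; it reuses detect_code_by_pattern from the detection section above) that produces JSON in the shape shown:

import json
import sys
import fitz  # PyMuPDF

def extract_to_json(pdf_path):
    doc = fitz.open(pdf_path)
    pages = []
    for page_num, page in enumerate(doc, start=1):
        text = page.get_text("text")
        pages.append({
            'page_number': page_num,
            'text': text,
            'code_blocks': detect_code_by_pattern(text),
            'images': [img[0] for img in page.get_images(full=True)],  # xrefs only
        })
    doc.close()
    return {'pages': pages}

if __name__ == '__main__':
    print(json.dumps(extract_to_json(sys.argv[1]), indent=2))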

Future Tasks:

  • B1.3: Add page chunking (split large PDFs; see the sketch after this list)
  • B1.4: Improve code block detection
  • B1.5: Extract images/diagrams
  • B1.6: Create full pdf_scraper.py CLI
  • B1.7: Add MCP tool integration
  • B1.8: Create PDF config format
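
For B1.3, PyMuPDF can split a large PDF into page-range chunks with insert_pdf. A minimal sketch (the chunk size and output naming are arbitrary):

import fitz  # PyMuPDF

def chunk_pdf(pdf_path, pages_per_chunk=50):
    """Split a large PDF into smaller page-range documents."""
    doc = fitz.open(pdf_path)
    chunks = []
    for start in range(0, doc.page_count, pages_per_chunk):
        end = min(start + pages_per_chunk, doc.page_count) - 1
        chunk = fitz.open()  # new, empty PDF
        chunk.insert_pdf(doc, from_page=start, to_page=end)
        out_path = f"{pdf_path}.chunk_{start + 1}-{end + 1}.pdf"
        chunk.save(out_path)
        chunk.close()
        chunks.append(out_path)
    doc.close()
    return chunks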

Additional Resources

Example Use Cases:

  • Extracting API docs from PDF manuals
  • Converting PDF guides to markdown
  • Building skills from PDF-only documentation

Conclusion

For Skill Seeker's PDF documentation extraction:

  1. Use PyMuPDF (fitz) as primary library
  2. Add pdfplumber for complex table extraction
  3. Detect code blocks using font + pattern matching
  4. Preserve formatting with markdown output
  5. Extract images for diagrams/screenshots

Estimated Implementation Time:

  • B1.2 (POC): 2-3 hours
  • B1.3-B1.5 (Features): 5-8 hours
  • B1.6 (CLI): 3-4 hours
  • B1.7 (MCP): 2-3 hours
  • B1.8 (Config): 1-2 hours
  • Total: 13-20 hours for complete PDF support

License: AGPL (PyMuPDF) is acceptable for Skill Seeker (open source)


Research completed: ✅ October 21, 2025
Next task: B1.2 - Create simple PDF text extractor (proof of concept)