What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Migrate from pdfminer.six to PDF Oxide

A complete guide to switching from pdfminer (pdfminer.six) to PDF Oxide, covering every API you use today and how to replace it.

Why Switch from pdfminer?

There are four compelling reasons to migrate:

~30x faster: pdfminer is the slowest mainstream Python PDF library. PDF Oxide averages 0.8ms per page, while pdfminer takes tens of milliseconds, so batch jobs that took minutes can finish in seconds.
Actively maintained: pdfminer.six receives infrequent updates and has a large backlog of open issues. PDF Oxide is actively developed with regular releases.
All-in-one library: pdfminer only does text extraction. PDF Oxide also creates PDFs, edits them, renders pages to images, extracts images, handles forms, and converts to Markdown/HTML.
No configuration needed: pdfminer often needs manual LAParams tuning (word_margin, line_margin, char_margin) to get good results on complex layouts. PDF Oxide handles layout detection automatically.

Step 1: Install

pip install pdf_oxide
pip uninstall pdfminer.six  # optional
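
Before editing imports, it can help to confirm which of the two packages the active environment actually sees. The following is a minimal sketch using only the standard library. The helper name `is_installed` is illustrative, and the import names `pdf_oxide` and `pdfminer` are taken from the import lines used in this guide.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Report the state of both libraries in the current environment.
for name in ("pdf_oxide", "pdfminer"):
    status = "installed" if is_installed(name) else "not installed"
    print(f"{name}: {status}")
```

Keeping pdfminer installed during the migration lets you run both libraries side by side until your tests pass.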

Step 2: Replace Imports

# Before
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LAParams

# After
from pdf_oxide import PdfDocument

Step 3: API Mapping

Task	pdfminer	PDF Oxide
Extract text	`extract_text("file.pdf")`	`PdfDocument("file.pdf").extract_text(0)` (one page per call; loop over pages for the whole document)
Extract pages	`extract_pages("file.pdf")`	Page-by-page with `doc.extract_text(i)`
Layout analysis	`LAParams()` configuration	Built-in layout detection
Character positions	`LTChar` objects	`doc.extract_chars(0)`
Encrypted PDF	Limited (may fail on AES-256)	Full support
To Markdown	Not supported	`doc.to_markdown(0)`
Form fields	No high-level API	`doc.get_form_fields()`

Step 4: Common Pattern Changes

Basic Text Extraction

pdfminer’s extract_text processes the entire document at once by default. PDF Oxide gives you per-page control:

# pdfminer — entire document at once
from pdfminer.high_level import extract_text
text = extract_text("report.pdf")
print(text)

# PDF Oxide — per-page control
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
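
If the rest of your code expects one string for the whole document, as pdfminer's `extract_text` returns, a small wrapper can bridge the gap. This is a sketch rather than part of PDF Oxide: `extract_all_text` and `page_separator` are hypothetical names, and the only library calls it relies on are `page_count()` and `extract_text(i)` from the example above. The form-feed default mirrors how pdfminer's text output usually marks page ends. Treat that as an assumption and adjust it if your downstream code splits on something else.

```python
def extract_all_text(doc, page_separator="\f"):
    """Join per-page text into one string, like a whole-document extract.

    `doc` is any object with page_count() and extract_text(index),
    as shown for PdfDocument in this guide.
    """
    pages = [doc.extract_text(i) for i in range(doc.page_count())]
    return page_separator.join(pages)


# Usage (requires pdf_oxide):
#   from pdf_oxide import PdfDocument
#   text = extract_all_text(PdfDocument("report.pdf"))
```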

Layout Analysis

pdfminer often needs manual LAParams configuration for complex layouts. PDF Oxide handles it automatically:

# pdfminer — manual layout configuration
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

params = LAParams(
    word_margin=0.1,
    line_margin=0.5,
    char_margin=2.0,
    boxes_flow=0.5,
)
text = extract_text("report.pdf", laparams=params)

# PDF Oxide — automatic layout detection
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)  # Layout handled automatically

Character-Level Extraction

pdfminer uses a complex tree of layout objects. PDF Oxide returns a flat list:

# pdfminer — traverse layout tree
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            for line in element:
                for char in line:
                    if isinstance(char, LTChar):
                        print(f"{char.get_text()} at ({char.x0}, {char.y0})")

# PDF Oxide — flat character list
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for c in doc.extract_chars(0):
    print(f"{c.char} at ({c.x}, {c.y})")
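
Code that used pdfminer's layout tree to get text lines will need to regroup the flat list. The sketch below shows one simple way to do that. It assumes each character object exposes `.char`, `.x`, and `.y` as in the example above, and that y increases upward as in PDF's default coordinate space. This guide does not confirm PDF Oxide's convention, so flip `y_increases_upward` if lines come out in reverse order. The function name and the `y_tolerance` parameter are hypothetical.

```python
def group_chars_into_lines(chars, y_tolerance=2.0, y_increases_upward=True):
    """Group a flat character list into text lines by vertical position.

    Characters whose y differs from the first character of the current
    line by at most y_tolerance are treated as belonging to that line.
    """
    lines = []  # each entry is [reference_y, [chars on that line]]
    for c in sorted(chars, key=lambda ch: ch.y, reverse=y_increases_upward):
        if lines and abs(lines[-1][0] - c.y) <= y_tolerance:
            lines[-1][1].append(c)
        else:
            lines.append([c.y, [c]])
    # Within each line, order characters left to right.
    return [
        "".join(ch.char for ch in sorted(row, key=lambda ch: ch.x))
        for _, row in lines
    ]


# Usage (requires pdf_oxide):
#   for line in group_chars_into_lines(doc.extract_chars(0)):
#       print(line)
```

This ignores word spacing and multi-column layouts; for plain reading-order text, `doc.extract_text(i)` is the simpler choice.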

Encrypted PDFs

pdfminer has limited encryption support and may fail on AES-256 encrypted files:

# pdfminer — fails on many encrypted PDFs
from pdfminer.high_level import extract_text
text = extract_text("encrypted.pdf", password="password")
# May throw an error on AES-256 encrypted files

# PDF Oxide — full encryption support
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)  # Works with all encryption methods

Markdown Conversion (New Capability)

pdfminer has no Markdown support. PDF Oxide makes it easy to feed PDFs into LLM pipelines:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    print(md)
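
For LLM or RAG pipelines, the Markdown usually has to be split into chunks before embedding. The helper below is a generic sketch, not a PDF Oxide feature: `chunk_markdown` and `max_chars` are hypothetical names, and it works on any Markdown string, such as the output of `doc.to_markdown(i)`. It breaks only at blank lines so that headings, paragraphs, and tables stay intact.

```python
def chunk_markdown(markdown, max_chars=2000):
    """Split Markdown into chunks, breaking only at blank lines.

    Chunks stay within max_chars where possible; a single block longer
    than max_chars is kept whole rather than cut mid-paragraph.
    """
    chunks = []
    current = ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        candidate = f"{current}\n\n{block}" if current else block
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks


# Usage (requires pdf_oxide):
#   for i in range(doc.page_count()):
#       for chunk in chunk_markdown(doc.to_markdown(i)):
#           ...  # send chunk to your embedding or indexing step
```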

Form Field Extraction (New Capability)

pdfminer has no high-level API for form fields; reading them means walking the AcroForm dictionary by hand. PDF Oxide exposes them directly:

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")

Page Rendering (New Capability)

pdfminer has no rendering capability. PDF Oxide can render pages to images:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
png_bytes = doc.render_page(0, dpi=150)
with open("page.png", "wb") as f:
    f.write(png_bytes)

Key Differences

No LAParams tuning: PDF Oxide handles layout automatically. No need to configure word_margin, line_margin, etc.
Speed: pdfminer is the slowest mainstream Python PDF library. PDF Oxide is ~30x faster.
All-in-one: pdfminer only does extraction. PDF Oxide also creates, edits, and renders PDFs.

Step 5: Testing Your Migration

Run your existing test files through PDF Oxide and compare the output with what pdfminer produced for the same files:

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Verify text extraction
text = doc.extract_text(0)
print(text[:500])

# Verify page count
print(f"Pages: {doc.page_count()}")

# Verify form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
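
To make the comparison with pdfminer measurable, you can score how close the two outputs are. This is a minimal sketch using the standard library's `difflib`. The helper name `similarity` is illustrative, and any pass threshold you pick is a judgment call, since the two libraries will differ in whitespace and line breaks even when the words match.

```python
import difflib


def similarity(old_text, new_text):
    """Return a 0.0 to 1.0 ratio of how closely two extractions match.

    Texts are compared word by word, so differences in spacing and
    line breaks between the two libraries do not affect the score.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return matcher.ratio()


# Usage (requires both libraries installed):
#   from pdfminer.high_level import extract_text
#   from pdf_oxide import PdfDocument
#   old = extract_text("your-test-file.pdf")
#   doc = PdfDocument("your-test-file.pdf")
#   new = "\n".join(doc.extract_text(i) for i in range(doc.page_count()))
#   print(f"Similarity: {similarity(old, new):.2%}")
```

A low score does not necessarily mean worse output. Check a few files by eye to see whether the differences are reading order, hyphenation, or missing text.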

Related Guides

Getting Started with Python — installation guide
Extract Text from PDF — text extraction guide