What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF-to-Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower by mean extraction time).

Performance

PDF Oxide v0.3.11 delivers 0.8ms mean text extraction in Python across a 3,830-PDF corpus: 5.8× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on valid PDFs.

Benchmark Results

Corpus: 3,830 PDFs

Three independent public test suites combined:

Suite	PDFs	Source
veraPDF	2,907	PDF/A conformance test corpus
Mozilla pdf.js	897	Browser PDF rendering test suite
SafeDocs	26	DARPA SafeDocs malformed PDF corpus

Why this corpus is reliable

These are not hand-picked PDFs. Each suite is an established, peer-reviewed test corpus maintained by standards bodies or major open-source projects:

veraPDF is the official PDF/A conformance validator used by the PDF standards community for over 10 years. Its 2,907 test files are atomic — each tests exactly one PDF specification feature — spanning every PDF/A version (1A/1B, 2A/2B/2U, 3A/3B/3U, 4/4E/4F), PDF/UA accessibility (UA1, UA2), and both ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0). Licensed under CC BY 4.0.
Mozilla pdf.js powers PDF rendering in Firefox, processing billions of PDFs annually. Its 897 test files cover real-world rendering edge cases: complex multi-page documents, annotation types (border styles, highlights, carets, file attachments), form widgets, unusual font encodings, large stress-test documents, and content stream edge cases. 7 files are intentionally corrupted to test correct error rejection.
DARPA SafeDocs is a U.S. government-funded security research program focused on parser robustness. Its 26 files target the hardest edge cases: content stream cycles in Type3 fonts, dual startxref trailers, compacted PDF syntax, dialect variations, inline image edge cases, recursive font nesting, and encrypted PDFs with Unicode passwords (including UTF-16 LE). These files are designed to crash, hang, or exploit vulnerable parsers.

What this corpus covers

Every major PDF version: PDF 1.0 through PDF 2.0
Encryption & passwords: AES-256, RC4, Unicode passwords, UTF-16 LE encoding
Security edge cases: Recursive structures, content stream cycles, malformed trailers, fuzz-generated corruption
Font diversity: TrueType, CIDFont, Type1, Type3, CJK encodings, embedded subsets
Document complexity: Single-page fixtures to 10,000+ page documents, inline images, nested Form XObjects
Correct rejection: 7 intentionally broken files (missing PDF headers, invalid xref streams) — a library that “parses” these is less secure, not more reliable

Of the 3,830 files, 3,823 are valid PDFs. The 7 invalid files are test fixtures for error handling — PDF Oxide correctly rejects all 7.
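As a quick sanity check, the corpus figures quoted above can be recomputed from the per-suite counts. This is only arithmetic on the published numbers; the variable names are illustrative.

```python
# Suite sizes as listed in the corpus table.
suites = {"veraPDF": 2907, "Mozilla pdf.js": 897, "SafeDocs": 26}
intentionally_invalid = 7  # broken fixtures that a parser should reject

total = sum(suites.values())
valid = total - intentionally_invalid

print(f"total={total}, valid={valid}")
for name, count in suites.items():
    # Share of the combined corpus contributed by each suite.
    print(f"{name}: {count / total:.1%} of corpus")
```

The shares show that veraPDF dominates the corpus, so the overall mean is weighted heavily toward small, single-feature conformance files.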

Python Library Comparison

Mean text extraction time per PDF on the full 3,830-PDF corpus:

Library	Mean	p99	Pass Rate	License
PDF Oxide	0.8ms	9ms	100%	MIT
PyMuPDF	4.6ms	28ms	99.3%	AGPL-3.0
pypdfium2	4.1ms	42ms	99.2%	Apache-2.0
pymupdf4llm	55.5ms	280ms	99.1%	AGPL-3.0
pdftext	7.3ms	82ms	99.0%	GPL-3.0
pdfminer	16.8ms	124ms	98.8%	MIT
pdfplumber	23.2ms	189ms	98.8%	MIT
markitdown	108.8ms	378ms	98.6%	MIT
pypdf	12.1ms	97ms	98.4%	BSD-3

PDF Oxide is the fastest Python PDF library available. Unlike PyMuPDF, it uses the MIT license — no AGPL restrictions for commercial use.
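The speedup factors quoted on this page (5.8×, 15×, 69×) follow directly from the mean column. A small sketch that recomputes them from the table values, with nothing measured here:

```python
# Mean extraction time per PDF in milliseconds, copied from the table above.
mean_ms = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdfium2": 4.1,
    "pymupdf4llm": 55.5,
    "pdftext": 7.3,
    "pdfminer": 16.8,
    "pdfplumber": 23.2,
    "markitdown": 108.8,
    "pypdf": 12.1,
}

baseline = mean_ms["PDF Oxide"]

# Ratio of each library's mean time to the PDF Oxide mean.
speedup = {
    name: ms / baseline
    for name, ms in mean_ms.items()
    if name != "PDF Oxide"
}

for name, factor in sorted(speedup.items(), key=lambda kv: kv[1]):
    print(f"{name}: {factor:.2f}x the mean time of PDF Oxide")
```

These ratios compare means only; the p99 column gives a different picture of tail behaviour and is worth checking separately for latency-sensitive workloads.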

Rust Library Comparison

Library	Mean	p99	Pass Rate	Text Extraction
PDF Oxide	0.8ms	9ms	100%	Built-in, production-grade
oxidize_pdf	13.5ms	11ms	99.1%	Basic
unpdf	2.8ms	10ms	95.1%	Basic
pdf_extract	4.08ms	37ms	91.5%	Basic
lopdf	0.3ms	2ms	80.2%	No built-in extraction

lopdf is faster on the PDFs it can parse, but it fails on roughly 20% of the corpus (80.2% pass rate) and provides no text extraction: you must build font decoding and spacing analysis yourself. PDF Oxide is the only Rust crate in this comparison that combines a 100% pass rate with built-in text extraction.

Text Quality

PDF Oxide achieves 99.5% text parity compared to PyMuPDF and pypdfium2 on the full corpus. Quality was measured by comparing extracted text output character-by-character across all 3,823 valid PDFs.
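The exact parity metric is not specified here. As one plausible reading of "character-by-character", the sketch below scores two extractions with the standard library's `difflib.SequenceMatcher` and averages the scores per PDF. Both the metric and the helper names `text_parity` and `corpus_parity` are assumptions for illustration, not the benchmark's actual code.

```python
from difflib import SequenceMatcher


def text_parity(candidate: str, reference: str) -> float:
    """Return a 0..1 character-level similarity between two extractions."""
    if not candidate and not reference:
        return 1.0  # two empty extractions agree trivially
    # autojunk=False disables the popularity heuristic so long texts
    # are compared exactly.
    return SequenceMatcher(None, candidate, reference, autojunk=False).ratio()


def corpus_parity(pairs) -> float:
    """Average parity over (candidate, reference) pairs, one pair per PDF."""
    scores = [text_parity(c, r) for c, r in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    pairs = [("Hello world", "Hello world"), ("abcd", "abce")]
    print(f"corpus parity: {corpus_parity(pairs):.3f}")
```

`SequenceMatcher` is quadratic in the worst case, so a real comparison over thousands of long documents would likely need a cheaper metric or a per-page comparison.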

Per-Corpus Breakdown

Corpus	PDFs	PDF Oxide Mean	pypdfium2 Mean	PyMuPDF Mean
veraPDF	2,907	0.7ms	3.6ms	4.1ms
Mozilla pdf.js	897	1.1ms	5.8ms	6.2ms
SafeDocs	26	0.9ms	4.0ms	4.3ms

Optimization History: v0.3.5 → v0.3.8

v0.3.8 eliminated two critical O(n)-per-lookup bottlenecks (the first two items below) and added two smaller optimizations (the last two). Together they dropped the Python mean from 23.3ms to 0.8ms on the same corpus.

1. Bulk Page Tree Cache

Before: get_page() traversed the page tree from root for every uncached page. For sequential extraction of all pages, this was O(n) per page and O(n²) total.

After: On first page access, the entire page tree is walked once and all pages are cached in a HashMap<usize, Object>. Every subsequent access is O(1).

This is the fix that brought a 10,000-page PDF from 55 seconds to 332 milliseconds.
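The difference between the two strategies can be illustrated with a toy model. The sketch below is not the library's Rust implementation: `Node` and `PageTree` are hypothetical names, the tree is a flat list of leaves, and a visit counter stands in for the real traversal cost.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A page tree node: an intermediate node with kids, or a leaf page."""
    kids: list = field(default_factory=list)
    label: str = ""  # only meaningful on leaves


class PageTree:
    def __init__(self, root):
        self.root = root
        self.visits = 0      # nodes touched, to make the cost visible
        self._cache = None   # page index -> leaf node, built lazily

    def _walk(self, node):
        """Yield leaf pages in document order, counting every node touched."""
        self.visits += 1
        if not node.kids:
            yield node
        else:
            for kid in node.kids:
                yield from self._walk(kid)

    def get_page_uncached(self, index):
        """Old behaviour: restart from the root on every lookup, O(n) each."""
        for i, page in enumerate(self._walk(self.root)):
            if i == index:
                return page
        raise IndexError(index)

    def get_page(self, index):
        """New behaviour: one full walk on first access, then O(1) lookups."""
        if self._cache is None:
            self._cache = dict(enumerate(self._walk(self.root)))
        return self._cache[index]


n = 1000
root = Node(kids=[Node(label=f"page {i}") for i in range(n)])

slow = PageTree(root)
for i in range(n):
    slow.get_page_uncached(i)

fast = PageTree(root)
for i in range(n):
    fast.get_page(i)

print(f"uncached visits: {slow.visits}, cached visits: {fast.visits}")
```

The uncached counter grows quadratically with page count while the cached one grows linearly, which is the shape of improvement described for the 10,000-page document.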

2. Scan-for-Object Offset Cache

Before: When objects were missing from the xref table, scan_for_object() read the entire PDF file for each missing object. Tagged PDFs with hundreds of structure tree elements not in xref triggered hundreds of full file reads.

After: The file is scanned once and all object offsets are cached in a HashMap. Subsequent lookups are O(1).
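A minimal sketch of the same idea in Python, assuming a naive regex scan for `N G obj` headers. `ObjectScanner` and its methods are hypothetical names, and a real parser would also need to skip matches inside stream data and decide how to treat duplicate definitions.

```python
import re

# Matches an indirect object header such as b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


class ObjectScanner:
    def __init__(self, data: bytes):
        self.data = data
        self.full_scans = 0    # how many times the whole file was read
        self._offsets = None   # (object number, generation) -> byte offset

    def find_uncached(self, obj_num, gen=0):
        """Old behaviour: scan the whole file for every missing object."""
        self.full_scans += 1
        for m in OBJ_HEADER.finditer(self.data):
            if (int(m.group(1)), int(m.group(2))) == (obj_num, gen):
                return m.start()
        return None

    def find(self, obj_num, gen=0):
        """New behaviour: one scan builds the map, later lookups are O(1)."""
        if self._offsets is None:
            self.full_scans += 1
            self._offsets = {}
            for m in OBJ_HEADER.finditer(self.data):
                key = (int(m.group(1)), int(m.group(2)))
                # Keep the first occurrence so both methods agree in this sketch.
                self._offsets.setdefault(key, m.start())
        return self._offsets.get((obj_num, gen))


pdf = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)

old = ObjectScanner(pdf)
new = ObjectScanner(pdf)
for num in (1, 2, 99):
    old.find_uncached(num)
    new.find(num)

print(f"full scans without cache: {old.full_scans}, with cache: {new.full_scans}")
```

With hundreds of missing objects, the uncached path reads the file hundreds of times, while the cached path reads it once regardless of how many lookups follow.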

3. Single-Pass Text Extraction

Before: extract_spans() ran two passes over the page content — first to classify the document type (academic, newspaper, form, etc.), then to extract text.

After: The classification pass was eliminated entirely. Adaptive font-aware thresholds now produce equal or better results in a single pass.

4. Content Stream Pre-Allocation

Before: parse_content_stream() built the operator Vec starting from default capacity, causing repeated reallocations on large content streams.

After: The Vec is pre-allocated based on stream size (data.len() / 20), which estimates roughly one operator per 20 bytes.

Methodology

All benchmarks use the same methodology:

Each library processes all 3,830 PDFs using Python multiprocessing (one PDF per process)
60-second timeout per PDF — any PDF exceeding this is counted as a failure
Extracted text is saved to disk per library for quality comparison
Wall-clock time measured from file open to final text extraction
No warm-up runs, no caching between files
Single-thread per PDF

The benchmark harness runs all 18 libraries (3 Rust, 15 Python) on the same machine, same corpus, same conditions.
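The methodology above could be implemented roughly as follows. This is a sketch, not the actual harness: `time_one`, `summarize`, and the `extract` callable are hypothetical, the nearest-rank p99 is an assumption, and so is the choice to compute mean and p99 over successful runs only.

```python
import math
import multiprocessing as mp
import time

TIMEOUT_S = 60  # per-PDF limit stated in the methodology


def _worker(extract, path, conn):
    """Run one extraction in a child process and report wall-clock seconds."""
    start = time.perf_counter()
    try:
        extract(path)  # hypothetical: opens the file and extracts all text
        conn.send(("ok", time.perf_counter() - start))
    except Exception:
        conn.send(("error", time.perf_counter() - start))
    finally:
        conn.close()


def time_one(extract, path, timeout_s=TIMEOUT_S):
    """Return (status, seconds) for one PDF, using a fresh process per file."""
    parent_end, child_end = mp.Pipe(duplex=False)
    proc = mp.Process(target=_worker, args=(extract, path, child_end))
    proc.start()
    child_end.close()  # the parent keeps only the receiving end
    try:
        if parent_end.poll(timeout_s):
            result = parent_end.recv()
        else:
            result = ("timeout", float(timeout_s))
    except EOFError:
        # The child died without reporting, for example a native crash.
        result = ("error", float("nan"))
    finally:
        if proc.is_alive():
            proc.terminate()
        proc.join()
    return result


def summarize(results):
    """Mean and p99 in ms over successful runs, pass rate over all runs."""
    ok_ms = sorted(sec * 1000 for status, sec in results if status == "ok")
    if not ok_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}
    p99_index = math.ceil(99 * len(ok_ms) / 100) - 1  # nearest-rank percentile
    return {
        "mean_ms": sum(ok_ms) / len(ok_ms),
        "p99_ms": ok_ms[p99_index],
        "pass_rate": len(ok_ms) / len(results),
    }
```

A caller would collect `time_one(extract, path)` for every file in the corpus and pass the list to `summarize`. Timing inside the worker keeps process start-up out of the measurement, which matches "from file open to final text extraction".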

Reproducing the Benchmarks

The public test corpora are freely available:

veraPDF: github.com/veraPDF/veraPDF-corpus
Mozilla pdf.js: github.com/mozilla/pdf.js/tree/master/test/pdfs
SafeDocs: github.com/pdf-association/safedocs

Run the verification:

cargo run --release --example verify_corpus -- \
    /path/to/veraPDF-corpus \
    /path/to/pdfjs-test \
    /path/to/safedocs \
    --csv results.csv

Performance Characteristics

What PDF Oxide Is Fast At

Text extraction: The primary optimization target. Sub-millisecond for typical documents.
Sequential multi-page extraction: The page tree cache removes the per-page lookup cost, so extracting every page of a large document scales linearly with page count instead of quadratically.
Tagged PDFs: Structure tree traversal and object resolution are now cached.
Malformed PDFs: Lenient parsing with fallback strategies avoids expensive retries.

What Scales Linearly

Page count: Each page is processed independently, so 100 pages take roughly 100× as long as one page.
Content stream size: Parsing operators is linear in stream length.
Image extraction: Proportional to the number and size of images.

When to Expect Slower Results

Scanned PDFs with OCR: OCR processing (if enabled) is significantly slower than text extraction.
Rendering: Page rendering to images depends on content complexity and target DPI.
Heavily encrypted PDFs: AES-256 decryption adds overhead per stream.
PDFs with thousands of fonts: Font parsing is cached per document, but initial parsing scales with font count.

Next Steps

Changelog – full version history
Python Library Comparison – detailed comparison with PyMuPDF, pypdf, pdfplumber, pdfminer
Rust Library Comparison – detailed comparison with lopdf, pdf_extract, pdf-rs
Getting Started with Rust – installation and first extraction
Rust API Reference – complete API documentation