Performance
PDF Oxide v0.3.11 delivers 0.8ms mean text extraction in Python across a 3,830-PDF corpus: more than 5× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on valid PDFs.
Benchmark Results
Corpus: 3,830 PDFs
Three independent public test suites combined:
| Suite | PDFs | Source |
|---|---|---|
| veraPDF | 2,907 | PDF/A conformance test corpus |
| Mozilla pdf.js | 897 | Browser PDF rendering test suite |
| SafeDocs | 26 | DARPA SafeDocs malformed PDF corpus |
Why this corpus is reliable
These are not hand-picked PDFs. Each suite is an established, peer-reviewed test corpus maintained by standards bodies or major open-source projects:
- veraPDF is the official PDF/A conformance validator used by the PDF standards community for over 10 years. Its 2,907 test files are atomic (each tests exactly one PDF specification feature), spanning every PDF/A version (1A/1B, 2A/2B/2U, 3A/3B/3U, 4/4E/4F), PDF/UA accessibility (UA1, UA2), and both ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0). Licensed under CC BY 4.0.
- Mozilla pdf.js powers PDF rendering in Firefox, processing billions of PDFs annually. Its 897 test files cover real-world rendering edge cases: complex multi-page documents, annotation types (border styles, highlights, carets, file attachments), form widgets, unusual font encodings, large stress-test documents, and content stream edge cases. 7 files are intentionally corrupted to test correct error rejection.
- DARPA SafeDocs is a U.S. government-funded security research program focused on parser robustness. Its 26 files target the hardest edge cases: content stream cycles in Type3 fonts, dual startxref trailers, compacted PDF syntax, dialect variations, inline image edge cases, recursive font nesting, and encrypted PDFs with Unicode passwords (including UTF-16 LE). These files are designed to crash, hang, or exploit vulnerable parsers.
What this corpus covers
- Every major PDF version: PDF 1.0 through PDF 2.0
- Encryption & passwords: AES-256, RC4, Unicode passwords, UTF-16 LE encoding
- Security edge cases: Recursive structures, content stream cycles, malformed trailers, fuzz-generated corruption
- Font diversity: TrueType, CIDFont, Type1, Type3, CJK encodings, embedded subsets
- Document complexity: Single-page fixtures to 10,000+ page documents, inline images, nested Form XObjects
- Correct rejection: 7 intentionally broken files (missing PDF headers, invalid xref streams) — a library that “parses” these is less secure, not more reliable
Of the 3,830 files, 3,823 are valid PDFs. The 7 invalid files are test fixtures for error handling — PDF Oxide correctly rejects all 7.
Python Library Comparison
Mean text extraction time per PDF on the full 3,830-PDF corpus:
| Library | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
PDF Oxide is the fastest Python library in this benchmark. Unlike PyMuPDF, it uses the MIT license, so there are no AGPL restrictions for commercial use.
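The speedup figures quoted on this page follow directly from the Mean column. The short sketch below recomputes them; the dictionary simply copies the table values, and `speedup` is a helper name introduced here for illustration.

```python
# Mean extraction times in milliseconds, copied from the table above.
MEAN_MS = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdfium2": 4.1,
    "pymupdf4llm": 55.5,
    "pdftext": 7.3,
    "pdfminer": 16.8,
    "pdfplumber": 23.2,
    "markitdown": 108.8,
    "pypdf": 12.1,
}


def speedup(library, baseline="PDF Oxide"):
    """How many times longer `library` takes than `baseline`, by mean time."""
    return MEAN_MS[library] / MEAN_MS[baseline]


# Print the libraries from fastest to slowest with their ratio to PDF Oxide.
for name in sorted(MEAN_MS, key=MEAN_MS.get):
    print(f"{name:<12} {MEAN_MS[name]:>6.1f} ms  {speedup(name):>6.1f}x")
```

These ratios compare means only; the p99 column gives a separate view of tail latency.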
Rust Library Comparison
| Library | Mean | p99 | Pass Rate | Text Extraction |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in, production-grade |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |
lopdf is faster on the PDFs it can parse, but it fails on roughly 20% of the corpus and provides no text extraction: you must build font decoding and spacing analysis yourself. pdf_oxide is the only Rust crate in this comparison that combines a 100% pass rate with built-in text extraction.
Text Quality
PDF Oxide achieves 99.5% text parity compared to PyMuPDF and pypdfium2 on the full corpus. Quality was measured by comparing extracted text output character-by-character across all 3,823 valid PDFs.
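The exact parity formula is not spelled out here, so the following is only a sketch of one plausible character-level measure, using `difflib.SequenceMatcher` from the Python standard library. The function name `char_parity` and the choice of the `ratio()` metric are assumptions for illustration, not the benchmark's actual scoring code.

```python
from difflib import SequenceMatcher


def char_parity(reference: str, candidate: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical text.

    Assumption: uses SequenceMatcher.ratio(), i.e. 2 * matches / total length.
    autojunk=False disables the heuristic that ignores frequent characters,
    which would otherwise distort scores on long documents.
    """
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


def corpus_parity(pairs) -> float:
    """Mean parity over (reference, candidate) text pairs, one pair per PDF."""
    scores = [char_parity(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores)
```

`SequenceMatcher` has quadratic worst-case cost, so a real harness over thousands of long documents would likely need a cheaper comparison or chunking.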
Per-Corpus Breakdown
| Corpus | PDFs | PDF Oxide Mean | pypdfium2 Mean | PyMuPDF Mean |
|---|---|---|---|---|
| veraPDF | 2,907 | 0.7ms | 3.6ms | 4.1ms |
| Mozilla pdf.js | 897 | 1.1ms | 5.8ms | 6.2ms |
| SafeDocs | 26 | 0.9ms | 4.0ms | 4.3ms |
Optimization History: v0.3.5 → v0.3.8
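v0.3.8 eliminated two critical bottlenecks, each costing O(n) per lookup (items 1 and 2 below), which brought the Python mean from 23.3ms to 0.8ms across the same corpus. Items 3 and 4 are smaller optimizations made over the same version range.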
1. Bulk Page Tree Cache
Before: get_page() traversed the page tree from root for every uncached page. For sequential extraction of all pages, this was O(n) per page and O(n²) total.
After: On first page access, the entire page tree is walked once and all pages are cached in a HashMap<usize, Object>. Every subsequent access is O(1).
This is the fix that brought a 10,000-page PDF from 55 seconds to 332 milliseconds.
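The before/after difference can be sketched in a few lines. This is a Python illustration of the caching idea only: the library itself is written in Rust and uses a `HashMap<usize, Object>`, and the `Node` and `PageTree` classes here are hypothetical stand-ins, not part of its API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical page-tree node: no kids means it is a page leaf."""
    name: str
    kids: list = field(default_factory=list)


class PageTree:
    def __init__(self, root):
        self.root = root
        self.visits = 0      # nodes touched, to make the cost visible
        self._cache = None   # page index -> page node, filled on first access

    def _walk(self, node):
        """Yield page leaves in document order (depth-first)."""
        self.visits += 1
        if not node.kids:
            yield node
            return
        for kid in node.kids:
            yield from self._walk(kid)

    def get_page_uncached(self, index):
        """Old behaviour: traverse from the root on every call, O(n) each."""
        for i, page in enumerate(self._walk(self.root)):
            if i == index:
                return page
        raise IndexError(index)

    def get_page(self, index):
        """New behaviour: walk the whole tree once, then O(1) dict lookups."""
        if self._cache is None:
            self._cache = dict(enumerate(self._walk(self.root)))
        return self._cache[index]
```

Reading every page through `get_page_uncached` touches a number of nodes that grows quadratically with page count, while `get_page` touches each node exactly once no matter how many pages are read afterwards.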
2. Scan-for-Object Offset Cache
Before: When objects were missing from the xref table, scan_for_object() read the entire PDF file for each missing object. Tagged PDFs with hundreds of structure tree elements not in xref triggered hundreds of full file reads.
After: The file is scanned once and all object offsets are cached in a HashMap. Subsequent lookups are O(1).
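As a rough Python sketch of the same idea (the real implementation is in Rust), the functions below contrast one full scan per missing object with a single scan that records every offset. The regular expression and both function bodies are simplifications written for this example; a production scanner also has to avoid false matches inside stream data.

```python
import re

# Matches an indirect object header such as b"12 0 obj".
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def scan_for_object(data: bytes, obj_num: int, gen: int = 0):
    """Old behaviour: one full pass over the file for each missing object."""
    found = None
    for m in OBJ_HEADER.finditer(data):
        if int(m.group(1)) == obj_num and int(m.group(2)) == gen:
            found = m.start()  # keep the last definition seen in the file
    return found


def build_offset_cache(data: bytes) -> dict:
    """New behaviour: one pass records the offset of every object header.

    Later definitions overwrite earlier ones, matching scan_for_object above.
    """
    return {
        (int(m.group(1)), int(m.group(2))): m.start()
        for m in OBJ_HEADER.finditer(data)
    }
```

With the cache built once, resolving hundreds of structure tree objects becomes hundreds of dictionary lookups rather than hundreds of file scans.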
3. Single-Pass Text Extraction
Before: extract_spans() ran two passes over the page content — first to classify the document type (academic, newspaper, form, etc.), then to extract text.
After: The classification pass was eliminated entirely. Adaptive font-aware thresholds now produce equal or better results in a single pass.
4. Content Stream Pre-Allocation
Before: parse_content_stream() built the operator Vec starting from default capacity, causing repeated reallocations on large content streams.
After: The Vec is pre-allocated based on stream size (data.len() / 20), which estimates roughly one operator per 20 bytes.
Methodology
All benchmarks use the same methodology:
- Each library processes all 3,830 PDFs using Python multiprocessing (one PDF per process)
- 60-second timeout per PDF — any PDF exceeding this is counted as a failure
- Extracted text is saved to disk per library for quality comparison
- Wall-clock time measured from file open to final text extraction
- No warm-up runs, no caching between files
- Single thread per PDF
The benchmark harness runs all 18 libraries (3 Rust, 15 Python) on the same machine, same corpus, same conditions.
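The measurement loop described above can be approximated as follows. This is a minimal sketch, not the actual harness: `time_one` and `summarize` are names invented here, `extract` stands for whatever per-library extraction function is being timed, and the nearest-rank p99 and the choice to average only passing files are assumptions.

```python
import math
import multiprocessing as mp
import time

TIMEOUT_S = 60  # per-PDF limit from the methodology above


def _child(extract, path, conn):
    """Runs in a fresh process: time open-to-text and send milliseconds back."""
    start = time.perf_counter()
    extract(path)
    conn.send((time.perf_counter() - start) * 1000.0)


def time_one(extract, path, timeout=TIMEOUT_S):
    """Return elapsed milliseconds, or None on crash or timeout (a failure).

    `extract` must be a module-level function so it can be sent to the child
    process on platforms that use the "spawn" start method.
    """
    receiver, sender = mp.Pipe(duplex=False)
    proc = mp.Process(target=_child, args=(extract, path, sender))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():  # exceeded the limit: kill it and count a failure
        proc.terminate()
        proc.join()
        return None
    return receiver.recv() if receiver.poll() else None


def summarize(results):
    """Mean and p99 over passing files, plus pass rate over all files."""
    times = sorted(t for t in results if t is not None)
    rank = math.ceil(0.99 * len(times))  # nearest-rank percentile (assumed)
    return {
        "mean_ms": sum(times) / len(times),
        "p99_ms": times[rank - 1],
        "pass_rate": len(times) / len(results),
    }
```

A run over the corpus would call `time_one` once per file and pass the collected list to `summarize`. Timing inside the child keeps process start-up cost out of the measurement, which matches the "file open to final text extraction" window.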
Reproducing the Benchmarks
The public test corpora are freely available:
- veraPDF: github.com/veraPDF/veraPDF-corpus
- Mozilla pdf.js: github.com/mozilla/pdf.js/tree/master/test/pdfs
- SafeDocs: github.com/pdf-association/safedocs
Run the verification:
```shell
cargo run --release --example verify_corpus -- \
    /path/to/veraPDF-corpus \
    /path/to/pdfjs-test \
    /path/to/safedocs \
    --csv results.csv
```
Performance Characteristics
What PDF Oxide Is Fast At
- Text extraction: The primary optimization target. Sub-millisecond for typical documents.
- Sequential multi-page extraction: With the page tree cache, each page lookup is O(1), so extracting every page of a large document costs about the same per page as extracting a single one.
- Tagged PDFs: Structure tree traversal and object resolution are now cached.
- Malformed PDFs: Lenient parsing with fallback strategies avoids expensive retries.
What Scales Linearly
- Page count: Each page is processed independently. 100 pages takes roughly 100× one page.
- Content stream size: Parsing operators is linear in stream length.
- Image extraction: Proportional to the number and size of images.
When to Expect Slower Results
- Scanned PDFs with OCR: OCR processing (if enabled) is significantly slower than text extraction.
- Rendering: Page rendering to images depends on content complexity and target DPI.
- Heavily encrypted PDFs: AES-256 decryption adds overhead per stream.
- PDFs with thousands of fonts: Font parsing is cached per document, but initial parsing scales with font count.
Next Steps
- Changelog – full version history
- Python Library Comparison – detailed comparison with PyMuPDF, pypdf, pdfplumber, pdfminer
- Rust Library Comparison – detailed comparison with lopdf, pdf_extract, pdf-rs
- Getting Started with Rust – installation and first extraction
- Rust API Reference – complete API documentation