What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

vs Rust PDF Libraries

PDF Oxide compared with the most-used Rust PDF crates: lopdf, printpdf, pdf-rs, and pdf_extract. Each targets a different level of abstraction and a different set of use cases.

Summary

	PDF Oxide	lopdf	printpdf	pdf-rs	pdf_extract
API level	High-level	Low-level	Mid-level (creation)	Low-level (read)	Mid-level (read)
Read PDFs	Yes	Yes	No	Yes	Yes
Write PDFs	Yes	Yes	Yes	No	No
Text extraction	Yes (high-level)	Manual	No	Manual	Yes (basic)
Image extraction	Yes (high-level)	Manual	No	Manual	No
Form fields	Read + Write	Manual	No	Read only	No
PDF creation	Yes	Yes	Yes	No	No
Markdown/HTML input	Yes	No	No	No	No
Editing existing PDFs	Yes	Yes (low-level)	No	No	No
Annotations	Read + Write	Manual	No	Read only	No
Encryption	Read + Write	No	No	Partial (read, RC4 only)	No
PDF/A validation	Yes	No	No	No	No
Rendering	Yes (tiny-skia)	No	No	Partial	No
Python bindings	Yes	No	No	No	No
License	MIT	MIT	MIT	MIT	Apache-2.0

All libraries are permissively licensed. The differences are in scope and abstraction level.

Performance Comparison

Full Corpus Benchmark (3,830 PDFs)

Tested on the full 3,830-PDF corpus — three independent, publicly available test suites covering PDF specification compliance (veraPDF, 2,907 files), real-world browser rendering edge cases (Mozilla pdf.js, 897 files), and security/robustness stress tests including malformed structures and fuzz-generated corruption (DARPA SafeDocs, 26 files). See full corpus details.

Library	Mean	p99	Pass Rate	Text Extraction	Notes
PDF Oxide	0.8ms	9ms	100%	Built-in, production-grade	Unicode, CJK, reading order
oxidize_pdf	13.5ms	11ms	99.1%	Basic	48s max outlier
unpdf	2.8ms	10ms	95.1%	Basic	185 failures on full corpus
pdf_extract	4.08ms	37ms	91.5%	Basic	Missing complex layouts
lopdf	0.3ms	2ms	80.2%	No built-in extraction	Fails on 20% of PDFs

lopdf has the lowest mean time (0.3ms) on the PDFs it can parse, but it fails on roughly 20% of the corpus (80.2% pass rate) and provides no text extraction. You would need to build font decoding, CMap resolution, and spacing analysis yourself.

pdf_extract provides basic text extraction but has a 91.5% pass rate and struggles with complex layouts, CJK text, and tagged PDFs. oxidize_pdf is more reliable (99.1%) but is 17× slower than PDF Oxide on mean extraction time (13.5ms vs 0.8ms), with a 48-second worst-case outlier. unpdf fails on 185 PDFs of the full corpus (95.1% pass rate).

Among the crates benchmarked here, PDF Oxide is the only one that combines a 100% pass rate with production-grade text extraction.
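The ratios and failure shares quoted in this section can be re-derived from the table. The following is a minimal sketch in plain Rust (standard library only); the function names `slowdown`, `implied_failures`, and `corpus_size` are illustrative and not part of any crate. Pass rates in the table are rounded to one decimal place, so the implied failure counts are approximate.

```rust
/// How many times slower `other_ms` is than `baseline_ms`.
fn slowdown(other_ms: f64, baseline_ms: f64) -> f64 {
    other_ms / baseline_ms
}

/// Approximate number of failing files implied by a pass rate given in percent.
fn implied_failures(corpus_size: u32, pass_rate_pct: f64) -> u32 {
    (f64::from(corpus_size) * (100.0 - pass_rate_pct) / 100.0).round() as u32
}

/// Size of the combined corpus: veraPDF + Mozilla pdf.js + DARPA SafeDocs.
fn corpus_size() -> u32 {
    2_907 + 897 + 26
}
```

For example, `slowdown(13.5, 0.8)` gives roughly 16.9, which is the 17× figure quoted for oxidize_pdf, and `implied_failures(corpus_size(), 80.2)` gives 758 files for lopdf.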

API Design Comparison

PDF Oxide: High-Level, Task-Oriented

PDF Oxide provides purpose-built methods for common tasks. You work with text, images, and form fields — not PDF objects and dictionaries.

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;

// Text extraction -- one call
let text = doc.extract_text(0)?;
println!("{}", text);

// Styled spans with font metadata
let spans = doc.extract_spans(0)?;
for span in &spans {
    println!("'{}' font={} size={:.1}pt", span.text, span.font_name, span.font_size);
}

// Image extraction
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width, img.height, img.format);
}

// Form fields
let fields = doc.extract_form_fields()?;
for field in &fields {
    println!("{}: {:?}", field.name, field.value);
}

PDF creation is equally straightforward:

use pdf_oxide::api::Pdf;

// From Markdown
let pdf = Pdf::from_markdown("# Report\n\n| A | B |\n|---|---|\n| 1 | 2 |")?;
pdf.save("report.pdf")?;

// From HTML
let pdf = Pdf::from_html("<h1>Report</h1><p>Content here.</p>")?;
pdf.save("report.pdf")?;

lopdf: Low-Level Object Manipulation

lopdf gives you direct access to PDF objects, streams, and the cross-reference table. You must understand the PDF specification to use it effectively. There is no built-in text extraction — you navigate dictionaries and decode streams yourself.

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// Get page dictionary
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;

// Get content stream -- manual work
// lopdf dictionary keys are byte strings, hence the b"..." literal
let contents = page.get(b"Contents")?;
let stream = doc.get_object(contents.as_reference()?)?;

// To extract text you must:
// 1. Parse the content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
//
// lopdf does not provide any of this -- it is raw object access
println!("Document has {} objects", doc.objects.len());

lopdf is the right tool when you need to manipulate PDF structure directly: merging documents, rewriting object streams, or building specialized PDF processors.
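To give a sense of what step 1 in the lopdf example involves, here is a toy sketch of pulling text operands out of a content stream. It is deliberately naive and rests on several assumptions: the stream is already decompressed and passed in as text, only literal-string operands of the `Tj` operator are recognized, and `TJ` arrays, hex strings, octal escapes, fonts, and encodings are all ignored. The function name `extract_tj_strings` is hypothetical; it is not part of lopdf or PDF Oxide.

```rust
/// Collect operands of `Tj` operators written as literal strings, e.g. `(Hello) Tj`.
/// Toy parser: no hex strings, no `TJ` arrays, no font or encoding handling.
fn extract_tj_strings(content: &str) -> Vec<String> {
    let chars: Vec<char> = content.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '(' {
            i += 1;
            continue;
        }
        i += 1; // skip the opening parenthesis
        let mut depth = 1;
        let mut text = String::new();
        // Read until the matching ')', allowing nested and escaped parentheses.
        while i < chars.len() && depth > 0 {
            match chars[i] {
                '\\' if i + 1 < chars.len() => {
                    i += 1;
                    text.push(chars[i]); // keep the escaped character verbatim
                }
                '(' => {
                    depth += 1;
                    text.push('(');
                }
                ')' => {
                    depth -= 1;
                    if depth > 0 {
                        text.push(')');
                    }
                }
                c => text.push(c),
            }
            i += 1;
        }
        // In a content stream the operator follows its operand; accept only `Tj`.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() {
            j += 1;
        }
        if chars[j..].starts_with(&['T', 'j']) {
            out.push(text);
        }
    }
    out
}
```

Even this covers only the simplest case. The strings it returns are raw character codes, so steps 2 to 5 (font lookup, CMap/ToUnicode decoding, positioning, encodings) are still needed before the output is reliable text.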

printpdf: PDF Creation Only

printpdf is a creation-only library. It cannot read or parse existing PDFs. It provides a typed API for building PDF documents from scratch with text, images, and vector graphics.

use printpdf::*;

let (doc, page1, layer1) = PdfDocument::new(
    "Report", Mm(210.0), Mm(297.0), "Layer 1"
);

let current_layer = doc.get_page(page1).get_layer(layer1);

// Add text -- requires manual font loading
let font = doc.add_builtin_font(BuiltinFont::Helvetica)?;
current_layer.use_text("Hello World", 24.0, Mm(10.0), Mm(280.0), &font);

// Save
doc.save(&mut std::io::BufWriter::new(
    std::fs::File::create("output.pdf")?,
))?;

// Cannot read existing PDFs
// Cannot extract text, images, or form fields

printpdf is the right tool when you only need to generate new PDFs and want a clean, focused creation API.

pdf-rs: Low-Level PDF Reading

pdf-rs parses PDF structure into Rust types but provides minimal high-level functionality. You get typed access to PDF objects but must still handle text decoding, font resolution, and content stream parsing.

use pdf::file::FileOptions;

let file = FileOptions::cached().open("report.pdf")?;

// Access page objects
let page = file.get_page(0)?;
let media_box = page.media_box()?;
println!("Page size: {:?}", media_box);

// Content stream access -- low-level
if let Some(ref contents) = page.contents {
    // Returns raw operations -- you must interpret them
    // No built-in text assembly, font decoding, or layout analysis
}

// Cannot write or modify PDFs

pdf-rs is the right tool when you need a type-safe PDF parser for analysis, validation, or building a custom renderer.

Feature Comparison by Task

Text Extraction

Library	Built-in	Quality	Effort Required
PDF Oxide	Yes	Production-grade (Unicode, CJK, reading order)	One method call
pdf_extract	Yes	Basic (misses complex layouts)	One method call
lopdf	No	N/A	Thousands of lines of custom code
printpdf	No	N/A	Not possible (write-only)
pdf-rs	No	N/A	Significant custom code required

PDF Oxide handles CMap/ToUnicode decoding, font metric-based spacing, structure tree reading order, and ligature reconstruction. Implementing equivalent functionality on top of lopdf or pdf-rs requires thousands of lines of code and deep PDF specification knowledge.
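As an illustration of one piece of that work, the sketch below decodes character codes through a ToUnicode-style mapping. It is a simplified stand-in, not PDF Oxide's implementation: it assumes 2-byte codes, handles only single-character `bfchar` entries of the form `<src> <dst>`, and skips `bfrange`, multi-character destinations, and surrogate pairs. The names `parse_bfchar` and `decode` are hypothetical.

```rust
use std::collections::HashMap;

/// Parse `<src> <dst>` hex pairs from the body of a `beginbfchar` ... `endbfchar` section.
/// Lines that do not hold exactly two valid hex tokens are skipped.
fn parse_bfchar(body: &str) -> HashMap<u16, char> {
    let mut map = HashMap::new();
    for line in body.lines() {
        let hex: Vec<&str> = line
            .split_whitespace()
            .map(|tok| tok.trim_start_matches('<').trim_end_matches('>'))
            .collect();
        if let [src, dst] = hex.as_slice() {
            let code = u16::from_str_radix(src, 16);
            let unicode = u32::from_str_radix(dst, 16).ok().and_then(char::from_u32);
            if let (Ok(code), Some(ch)) = (code, unicode) {
                map.insert(code, ch);
            }
        }
    }
    map
}

/// Decode big-endian 2-byte character codes; unmapped codes become U+FFFD.
fn decode(codes: &[u8], map: &HashMap<u16, char>) -> String {
    codes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| map.get(&code).copied().unwrap_or('\u{FFFD}'))
        .collect()
}
```

A production decoder also has to pick the right font for each text run, fall back to the font's encoding when no ToUnicode map is present, and expand ligature glyphs into their component letters.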

PDF Creation

Library	Approach	Markdown/HTML Input	Tables	Barcodes
PDF Oxide	High-level + low-level	Yes	Yes	Yes
lopdf	Raw object construction	No	No	No
printpdf	Typed layer API	No	No	No
pdf-rs	N/A (read-only)	N/A	N/A	N/A

Encryption

Library	Read Encrypted	Write Encrypted	Algorithms
PDF Oxide	Yes	Yes	RC4-40, RC4-128, AES-128, AES-256
lopdf	No	No	–
printpdf	No	No	–
pdf-rs	Partial	No	RC4 only

Compliance

Library	PDF/A	PDF/X	PDF/UA
PDF Oxide	Validate + Convert	Validate	Validate
lopdf	No	No	No
printpdf	Partial (PDF/A-1b output)	No	No
pdf-rs	No	No	No

Dependency Footprint

Library	Dependencies	Compile Time	Binary Size
PDF Oxide	~40 (core)	~30s	~4 MB
lopdf	~15	~10s	~1 MB
printpdf	~20	~15s	~2 MB
pdf-rs	~25	~20s	~2 MB

PDF Oxide has more dependencies because it includes font parsing, image decoding, content stream interpretation, and encryption — features that the other libraries leave to the user or omit entirely. With all optional features (rendering, barcodes, office), the count rises to ~100.

Combining Libraries

Since all are permissively licensed, you can combine them in a single project:

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"        # Optional: raw object access for edge cases

Common patterns:

PDF Oxide + lopdf: Use PDF Oxide for extraction and creation, fall back to lopdf for edge cases requiring raw object manipulation.
PDF Oxide + printpdf: Use PDF Oxide for reading and printpdf for specialized creation workflows.
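The fallback pattern can be sketched independently of either crate. In the example below the two extractors are passed in as closures; in a real project the first might wrap PDF Oxide's `extract_text` and the second a custom lopdf-based routine, but that wiring is an assumption and is not shown. The helper name `extract_with_fallback` and the `String` error type are illustrative only.

```rust
/// Try `primary` first; if it fails, try `fallback`.
/// If both fail, the returned error reports both failures.
fn extract_with_fallback<P, F>(path: &str, primary: P, fallback: F) -> Result<String, String>
where
    P: Fn(&str) -> Result<String, String>,
    F: Fn(&str) -> Result<String, String>,
{
    primary(path).or_else(|first_err| {
        fallback(path).map_err(|second_err| {
            format!("primary failed: {first_err}; fallback failed: {second_err}")
        })
    })
}
```

Keeping the primary error in the final message makes it easier to see which documents needed the fallback path and why.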

Use Case Matrix

“I need to extract text from PDFs”

Crate	Suitable?	Notes
PDF Oxide	Yes	Best extraction quality, 100% pass rate, reading order, font metadata
pdf_extract	Partial	Basic extraction, 91.5% pass rate
lopdf	No	No text extraction
printpdf	No	Cannot read PDFs
pdf-rs	Partial	Basic parsing, no high-level text extraction

“I need to create PDFs”

Crate	Suitable?	Notes
PDF Oxide	Yes	High-level (Markdown/HTML) and low-level APIs
lopdf	Partial	Low-level object construction
printpdf	Yes	Clean creation API, no reading
pdf-rs	No	Read-only

“I need to edit existing PDFs”

Crate	Suitable?	Notes
PDF Oxide	Yes	DOM-like editing, annotations, forms
lopdf	Partial	Low-level object manipulation
printpdf	No	Cannot read PDFs
pdf-rs	No	Read-only

“I need the full lifecycle (extract + create + edit)”

Crate	Suitable?	Notes
PDF Oxide	Yes	Only crate covering all three
lopdf + printpdf	Partial	Two crates, no text extraction
pdf-rs + printpdf	Partial	Two crates, no editing

When to Use Each

Choose PDF Oxide if you need more than one PDF capability (extraction + creation, or extraction + editing) and want a single dependency that passed 100% of the 3,830-PDF benchmark corpus.

Choose lopdf if you need low-level PDF structure manipulation and are comfortable working with the PDF spec directly. Good for merging, splitting, and batch PDF processing.

Choose printpdf if you only create PDFs and never need to read them. The cleanest API for report and document generation.

Choose pdf-rs if you need a spec-compliant parser for PDF analysis or are building your own rendering pipeline.

Choose pdf_extract if you need basic text extraction and don’t require high reliability or complex layout support.

Performance Benchmarks – full corpus benchmark results
Getting Started with Rust – installation and first extraction
Rust API Reference – complete Rust API
vs Python PDF Libraries – Python ecosystem comparison