What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). It was benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which makes it well suited to LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

PDF Oxide vs lopdf

lopdf is a low-level Rust crate for direct PDF object manipulation. PDF Oxide is a high-level library with built-in text extraction, creation, and editing. They target fundamentally different use cases.

Key Differences

Abstraction level. lopdf gives you raw PDF objects — dictionaries, streams, and cross-reference tables. There is no text extraction, no font decoding, no image export. PDF Oxide provides purpose-built methods: extract_text(), extract_images(), to_markdown().

Reliability. lopdf fails to parse roughly 20% of the 3,830-PDF test corpus (an 80.2% pass rate). Of the PDFs it does parse, 57% produce empty output, because lopdf has no production-grade text extraction layer: you get the objects but often no usable text. PDF Oxide passes 100%.

Speed on parseable PDFs. lopdf is faster at raw object parsing: 0.3ms mean vs PDF Oxide’s 0.8ms. But lopdf does no text extraction work — you’d need to build font decoding, CMap resolution, spacing analysis, and reading order yourself.
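The reliability figures above can be combined with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and the inputs are the numbers quoted on this page (3,071 of 3,830 PDFs parsed, 57% of parsed PDFs returning empty output).

```rust
/// Share of `part` in `total`, as a percentage.
fn percent(part: u32, total: u32) -> f64 {
    100.0 * f64::from(part) / f64::from(total)
}

/// Estimated number of parsed PDFs that yield non-empty text,
/// given the fraction of parsed PDFs that come back empty.
fn estimated_with_text(parsed: u32, empty_fraction: f64) -> u32 {
    (f64::from(parsed) * (1.0 - empty_fraction)).round() as u32
}
```

With these inputs, `percent(3071, 3830)` is about 80.2, and `estimated_with_text(3071, 0.57)` lands at roughly 1,320 documents, or about a third of the corpus yielding text end to end.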

Quick Comparison

Feature	PDF Oxide	lopdf
API level	High-level	Low-level
Text extraction	Built-in (production-grade)	None
Pass rate (3,830 PDFs)	100%	80.2%
Mean parse time	0.8ms	0.3ms
Image extraction	Built-in	Manual (raw streams)
Form fields	Read + Write	Manual (raw dictionaries)
PDF creation	Yes (Markdown/HTML)	Yes (raw objects)
Markdown/HTML output	Yes	No
Encryption	Read + Write	No
Rendering	Yes	No
PDF/A validation	Yes	No
License	MIT	MIT

What lopdf Can’t Do

lopdf provides access to PDF objects, but text extraction requires interpreting those objects according to the PDF specification. Here’s what you’d need to build yourself:

Content stream parsing — parse PostScript-like operators (Tj, TJ, Tm, Tf, etc.)
Font resolution — look up /Font resources, resolve indirect references
CMap/ToUnicode decoding — convert glyph IDs to Unicode characters
Font metric spacing — calculate character widths from font descriptors
Text matrix transforms — apply Tm, Td, T* operators to position text
Reading order — determine the correct order for multi-column layouts
Ligature reconstruction — handle fi, fl, ffi ligatures
CJK encoding — decode Chinese, Japanese, Korean text encodings

This is thousands of lines of code and deep knowledge of ISO 32000. PDF Oxide handles all of it internally.
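To make one of these steps concrete, here is a minimal sketch of ToUnicode-style decoding with ligature expansion, using only the standard library. It assumes 2-byte big-endian character codes and a hand-written lookup table; `decode_with_to_unicode` and `sample_table` are hypothetical names, not part of lopdf or PDF Oxide, and a real decoder would first have to parse the CMap stream itself.

```rust
use std::collections::HashMap;

/// Decodes a string of 2-byte big-endian character codes through a
/// ToUnicode-style table. Unmapped codes become U+FFFD.
fn decode_with_to_unicode(codes: &[u8], table: &HashMap<u16, &str>) -> String {
    codes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| table.get(&code).copied().unwrap_or("\u{FFFD}"))
        .collect()
}

/// Hypothetical mapping, of the kind a subset font's `bfchar` section
/// might define. Codes are arbitrary glyph IDs, not ASCII.
fn sample_table() -> HashMap<u16, &'static str> {
    HashMap::from([
        (0x0001, "o"),
        (0x0002, "f"),
        (0x0003, "c"),
        (0x0004, "e"),
        (0x0005, "ffi"), // one ligature glyph expands to three characters
    ])
}
```

Decoding the bytes `[0, 1, 0, 5, 0, 3, 0, 4]` with this table gives "office": four glyphs become six characters because the ligature glyph maps to a multi-character string. Codes missing from the table fall back to the replacement character rather than being dropped silently.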

Side-by-Side Code

Text Extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// lopdf does not provide text extraction.
// You get access to PDF objects only:
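let page_id = doc.page_iter().next().unwrap(); // first page; panics on an empty document
let page = doc.get_dictionary(page_id)?;
let contents = page.get(b"Contents")?; // dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?; // assumes one stream, not an array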

// To get actual text, you must:
// 1. Parse content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)
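To give a sense of step 1 alone, the following is a deliberately small sketch that pulls the literal strings shown by `Tj`, `TJ`, `'` and `"` out of an already-decompressed content stream. It uses only the standard library, and the function names are hypothetical. It ignores hex strings, octal escapes, inline images, fonts, and positioning, so it is nowhere near a real extractor.

```rust
/// True for bytes that end a name or operator token.
fn is_boundary(b: u8) -> bool {
    b.is_ascii_whitespace() || b"()[]<>/".contains(&b)
}

/// Reads a literal string body starting just after its opening `(`.
/// Returns the text and the index just past the closing `)`.
fn read_literal(stream: &[u8], mut i: usize) -> (String, usize) {
    let mut out = Vec::new();
    let mut depth = 1; // nesting level of unescaped parentheses
    while i < stream.len() {
        match stream[i] {
            b'\\' if i + 1 < stream.len() => {
                // Keep the escaped byte verbatim (covers \( \) \\ only).
                out.push(stream[i + 1]);
                i += 2;
                continue;
            }
            b'(' => depth += 1,
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return (String::from_utf8_lossy(&out).into_owned(), i + 1);
                }
            }
            _ => {}
        }
        out.push(stream[i]);
        i += 1;
    }
    (String::from_utf8_lossy(&out).into_owned(), i)
}

/// Collects the string operands of text-showing operators, one entry
/// per operator. Strings inside a `TJ` array are joined together.
fn show_text_operands(stream: &[u8]) -> Vec<String> {
    let mut shown = Vec::new();
    let mut pending: Vec<String> = Vec::new(); // strings seen since the last operator
    let mut i = 0;
    while i < stream.len() {
        match stream[i] {
            b'(' => {
                let (text, next) = read_literal(stream, i + 1);
                pending.push(text);
                i = next;
            }
            b'/' => {
                // Skip a name such as /F1 so it is not mistaken for an operator.
                i += 1;
                while i < stream.len() && !is_boundary(stream[i]) {
                    i += 1;
                }
            }
            b if b.is_ascii_alphabetic() || b == b'\'' || b == b'"' => {
                let start = i;
                while i < stream.len() && !is_boundary(stream[i]) {
                    i += 1;
                }
                let is_show = matches!(&stream[start..i], b"Tj" | b"TJ" | b"'" | b"\"");
                if is_show && !pending.is_empty() {
                    shown.push(pending.concat());
                }
                pending.clear(); // operands belong to the operator just read
            }
            _ => i += 1, // numbers, brackets, whitespace
        }
    }
    shown
}
```

Run on `BT /F1 12 Tf 72 720 Td (Hello World) Tj ET`, this returns a single entry, "Hello World". The output is still raw bytes interpreted as UTF-8; for most real fonts those bytes are glyph codes that need the font's encoding or ToUnicode map before they become readable text.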

PDF Creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{dictionary, Document, Object, Stream};

let mut doc = Document::with_version("1.5");

// Reserve the page tree ID up front so the page can reference its parent
let pages_id = doc.new_object_id();

// Create font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create content stream (raw PDF content-stream operators)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

// Create page (each page needs a /Parent reference to the page tree)
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Wire up page tree under the reserved ID
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));

// Create the catalog and register it as the trailer's /Root
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);

doc.save("report.pdf")?;

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF will fail or produce undecrypted streams.

Reliability Comparison

Metric	PDF Oxide	lopdf
PDFs parsed successfully	3,823 / 3,823 (100%)	3,071 / 3,823 (80.2%)
PDFs with text output	3,823 / 3,823	~1,320 / 3,823 (estimated)
Encrypted PDF support	Yes	No
Malformed PDF recovery	Yes	No

lopdf’s 80.2% pass rate means it fails on roughly 1 in 5 PDFs. The failures occur on encrypted documents, PDFs with non-standard xref tables, and documents using cross-reference streams. PDF Oxide handles all of these with lenient parsing and fallback strategies.
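As an illustration of what a fallback strategy can look like, the sketch below first looks for the `startxref` pointer and, if it is missing or out of range, rebuilds an object index by scanning for `N G obj` headers. This is a generic technique written against the standard library only. It is not PDF Oxide's actual implementation, all names are hypothetical, and it assumes object headers start at the beginning of a line.

```rust
/// Where the object locations came from.
#[derive(Debug, PartialEq)]
enum XrefSource {
    /// Byte offset recorded after `startxref`.
    Table(usize),
    /// (object number, generation, byte offset) found by scanning.
    Rebuilt(Vec<(u32, u16, usize)>),
}

/// Finds the byte offset recorded after the last `startxref` keyword.
fn find_startxref(data: &[u8]) -> Option<usize> {
    let key = b"startxref";
    let pos = data.windows(key.len()).rposition(|w| w == key)?;
    let digits: String = data[pos + key.len()..]
        .iter()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .map(|&b| b as char)
        .collect();
    digits.parse().ok()
}

/// Fallback: scan line by line for `N G obj` headers.
fn scan_objects(data: &[u8]) -> Vec<(u32, u16, usize)> {
    let mut found = Vec::new();
    let mut offset = 0; // byte offset of the current line start
    for line in data.split(|&b| b == b'\n') {
        let text = String::from_utf8_lossy(line);
        let mut parts = text.split_whitespace();
        if let (Some(n), Some(g), Some("obj")) = (parts.next(), parts.next(), parts.next()) {
            if let (Ok(n), Ok(g)) = (n.parse(), g.parse()) {
                found.push((n, g, offset));
            }
        }
        offset += line.len() + 1; // +1 for the newline consumed by split
    }
    found
}

/// Prefers the recorded pointer, falls back to a full scan.
fn locate_objects(data: &[u8]) -> XrefSource {
    match find_startxref(data) {
        Some(offset) if offset < data.len() => XrefSource::Table(offset),
        _ => XrefSource::Rebuilt(scan_objects(data)),
    }
}
```

A strict parser stops at the first branch and reports an error when the pointer is absent or wrong. A lenient one treats that as a signal to try the slower scan, which is why truncated or hand-edited files can still be opened.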

When to Use Each

Choose PDF Oxide if:

You need text extraction, image extraction, or any content-level operation
You want a single crate for read + write + create
You need to handle all PDFs reliably (encrypted, malformed, complex)
You need Markdown/HTML output, rendering, or OCR
You want compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf if:

You need direct access to PDF objects for custom processing
You’re building a specialized PDF tool that works at the object level
You need to merge documents by manipulating object trees directly
Your PDFs are simple and well-formed (not encrypted, standard xref tables)

Combine both:

Use PDF Oxide for high-level operations and lopdf for edge cases requiring raw object access:

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"

Related Pages

Performance Benchmarks: full corpus results
vs Rust PDF Libraries: all Rust crates compared
Getting Started with Rust: installation and first extraction
