What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Extraction Profiles — Tune Space Detection per Document Type

Different PDFs bury their spaces differently. An arXiv paper uses tight, justified columns. An IRS form uses rigid cell alignment. A GDPR policy runs dense, justified paragraphs with minimal kerning. A single tj_offset_threshold that works for one of these will insert garbage spaces into another.
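To make that failure mode concrete, here is a toy sketch of threshold-based space detection over a PDF TJ array: strings interleaved with position adjustments in thousandths of a text-space unit, where more negative values widen the gap. The join_tj helper and both sample arrays are hypothetical illustrations, not PDF Oxide's internal algorithm. The sketch assumes a space is emitted whenever an adjustment is at or below the threshold.

```python
def join_tj(items, tj_offset_threshold):
    """Concatenate a TJ-style array, emitting a space for large adjustments.

    Assumption for illustration: an adjustment at or below the (negative)
    threshold counts as a word break; anything closer to zero is kerning.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif item <= tj_offset_threshold:
            parts.append(" ")
    return "".join(parts)


# Hypothetical tightly set paper: the real word gap is only -95 units.
paper = ["tight", -95, "columns"]
# Hypothetical form heading: letter-spaced at -100, real word gap at -200.
form = ["F", -100, "O", -100, "R", -100, "M", -200, "1040"]

for threshold in (-120, -80):
    print(threshold, repr(join_tj(paper, threshold)), repr(join_tj(form, threshold)))
```

With -120 the paper's two words merge. With -80 the form heading picks up a space after every letter. Neither value suits both inputs, which is the gap the per-document profiles are meant to close.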

ExtractionProfile ships nine pre-tuned parameter sets that map cleanly onto real document classes. Pass the profile to extract_text() / extract_words() and PDF Oxide applies the right word-margin ratio, TJ offset threshold, and adaptive-threshold toggle for that document style.

Binding coverage. Extraction profiles are currently exposed in Python (pdf_oxide.ExtractionProfile) and Rust (pdf_oxide::config::ExtractionProfile). The Node, WASM, Go, and C# bindings use the CONSERVATIVE default internally; to apply a different profile from those runtimes, invoke the Rust CLI (pdf-oxide extract --profile academic doc.pdf) or bridge through a Python / Rust step.

Quick Example

Python

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("paper.pdf")

# Academic papers: tight spacing, citation detection on
text = doc.extract_text(0, profile=ExtractionProfile.academic())
print(text)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::config::ExtractionProfile;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text_with_profile(0, ExtractionProfile::ACADEMIC)?;
println!("{}", text);

Available Profiles

Profile	Best for	TJ threshold	Word-margin ratio	Adaptive
`conservative()`	Default — general text, minimal false spaces	−120	0.10	off
`aggressive()`	PDFs that suppress spaces; fixes merged words	−80	0.20	off
`balanced()`	Mixed content	−100	0.15	off
`academic()`	arXiv papers, conference proceedings, tech reports	−105	0.12	on + citation / email detection
`policy()`	Legal, GDPR, government regulations	−110	0.18	on
`form()`	IRS forms, applications, questionnaires	−120	0.08	off
`government()`	Mixed government reports with tables	−105	0.14	off
`scanned_ocr()`	OCR output where coordinates are noisy	depends	depends	on
`adaptive()`	Let the extractor auto-tune from font statistics	depends	depends	on

When Each Profile Helps

Academic / conference papers — `academic()`

Tight typesetting, two-column layouts, embedded citations. Default settings often over-insert spaces inside ligatures (fi, ff) or under-insert between words where kerning is aggressive.

doc = PdfDocument("neurips-paper.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.academic())

The academic profile turns on adaptive thresholding plus citation / email detection so inline [1,2,3] refs and author@lab.edu emails survive cleanly.

IRS forms, applications — `form()`

Form PDFs care about column alignment more than about word boundaries. The form() profile uses a very tight word-margin ratio (0.08) so rigidly-aligned field labels don’t collapse into their values.

doc = PdfDocument("w2.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.form())

GDPR / policy / regulation — `policy()`

Justified paragraphs insert variable whitespace that breaks the default threshold. policy() uses a more generous word margin (0.18) plus adaptive thresholding to correctly read dense legal prose.

doc = PdfDocument("gdpr.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.policy())

Scanned OCR output — `scanned_ocr()`

When the page was OCR’d (Tesseract, PaddleOCR, Azure), character positions are noisy and kerning hints are absent. scanned_ocr() compensates with adaptive thresholding that re-reads font statistics per page.

doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0, profile=ExtractionProfile.scanned_ocr())

Let the library pick — `adaptive()`

If you don’t know the document class ahead of time, adaptive() samples font statistics on the first pass and picks thresholds before extracting. Slightly slower than a fixed profile but forgiving across mixed corpora.

from pathlib import Path

for pdf_path in Path("mixed_corpus/").glob("*.pdf"):
    doc = PdfDocument(str(pdf_path))
    text = doc.extract_text(0, profile=ExtractionProfile.adaptive())
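For intuition about what picking thresholds from sampled statistics can look like, the sketch below derives a gap threshold from the data instead of hard-coding one. It is an illustration only and does not describe how adaptive() is implemented: pick_gap_threshold, the sample gap lists, and their unit (hundredths of a point) are all hypothetical, and the heuristic assumes gaps fall into two clusters, intra-word and inter-word.

```python
def pick_gap_threshold(gaps):
    """Return the midpoint of the widest jump between sorted gap sizes.

    Illustration only: assumes two clusters of gaps (inside words and
    between words) and places the threshold between them.
    """
    ordered = sorted(gaps)
    if len(ordered) < 2:
        raise ValueError("need at least two gaps to find a jump")
    # Pair each jump with its midpoint; max() selects the widest jump.
    _, midpoint = max((b - a, (a + b) / 2) for a, b in zip(ordered, ordered[1:]))
    return midpoint


# Hypothetical glyph gaps in hundredths of a point.
clean_page = [20, 30, 10, 240, 20, 260, 30]
noisy_page = [60, 150, 40, 410, 120, 450]  # e.g. jittery OCR coordinates

print(pick_gap_threshold(clean_page))  # 135.0
print(pick_gap_threshold(noisy_page))  # 280.0
```

A fixed threshold of 135 would be fine for the clean page but would misread the noisy page's 150-unit intra-word gap as a word break. Re-deriving the threshold per input avoids that, at the cost of an extra pass over the gaps.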

Profile fields

Every profile exposes its tuning knobs so you can read or clone them:

Python

from pdf_oxide import ExtractionProfile

p = ExtractionProfile.academic()
print(p.name)                # "Academic"
print(p.word_margin_ratio)   # 0.12
print(p.tj_offset_threshold) # -105.0

# Inspect every preset
for profile in ExtractionProfile.all_profiles():
    print(profile.name, profile.word_margin_ratio)

Rust

use pdf_oxide::config::ExtractionProfile;

let p = ExtractionProfile::ACADEMIC;
println!("{} margin={} tj={}",
    p.name, p.word_margin_ratio, p.tj_offset_threshold);

Choosing a profile in production pipelines

If you ingest a mixed corpus (academic papers alongside IRS forms alongside web-scraped HTML exports), pick adaptive() as your default. It costs a few percent extra per page but eliminates the worst failures: merged words and missing spaces across columns.

If your corpus is homogeneous — you run a Title IX intake pipeline, a contract-review tool, or an arXiv crawler — pick the matching profile explicitly: you’ll get the best extraction quality and you’ll avoid the per-page sampling cost of adaptive().
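One way to combine the two recommendations is a small router: use the matching preset when the document class is known, and fall back to adaptive() otherwise. This is a minimal sketch. The class labels, PROFILE_BY_CLASS, profile_name_for, and extract_first_page are hypothetical names introduced for illustration; only the preset factory names and the extract_text(page, profile=...) call come from the examples above.

```python
from typing import Optional

# Hypothetical corpus labels mapped to the preset factory names listed above.
PROFILE_BY_CLASS = {
    "academic": "academic",
    "legal": "policy",
    "form": "form",
    "government": "government",
    "scanned": "scanned_ocr",
}


def profile_name_for(doc_class: Optional[str]) -> str:
    """Return the preset name for a known class, else fall back to adaptive."""
    key = (doc_class or "").strip().lower()
    return PROFILE_BY_CLASS.get(key, "adaptive")


def extract_first_page(path: str, doc_class: Optional[str] = None) -> str:
    """Extract page 0 with the routed profile (requires pdf_oxide)."""
    # Imported lazily so the routing logic stays testable without the library.
    from pdf_oxide import PdfDocument, ExtractionProfile

    factory = getattr(ExtractionProfile, profile_name_for(doc_class))
    return PdfDocument(path).extract_text(0, profile=factory())


print(profile_name_for("Form"))  # form
print(profile_name_for(None))    # adaptive
```

Keeping the mapping in one table makes it easy to audit which corpora pay the adaptive sampling cost and which use a fixed preset.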

Text Extraction — full extraction API
Reading Order (XY-cut) — column-aware reading order
OCR Scanned PDFs — when a profile isn’t enough
Python API Reference
