What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.
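
A common way to use OCR is as a fallback: try the embedded text layer first and run OCR only when a page yields little or no text. The sketch below illustrates that pattern against any object exposing extract_text() and extract_text_ocr(). It assumes both methods take a zero-based page index and return a string, which this page does not confirm for extract_text_ocr(). The text_or_ocr helper, the min_chars threshold, and the FakeScannedDoc stand-in are hypothetical names used only for illustration.

```python
from typing import Protocol


class SupportsOcr(Protocol):
    """Shape assumed for the document object (an assumption, not a confirmed API)."""

    def extract_text(self, page: int) -> str: ...
    def extract_text_ocr(self, page: int) -> str: ...


def text_or_ocr(doc: SupportsOcr, page: int, min_chars: int = 20) -> str:
    """Return the embedded text layer, falling back to OCR for near-empty pages."""
    text = doc.extract_text(page)
    if len(text.strip()) >= min_chars:
        return text
    # Little or no embedded text: the page is probably a scan, so run OCR.
    return doc.extract_text_ocr(page)


class FakeScannedDoc:
    """Stand-in for a scanned PDF so the sketch runs without any PDF library."""

    def extract_text(self, page: int) -> str:
        return ""

    def extract_text_ocr(self, page: int) -> str:
        return "text recovered by OCR"


print(text_or_ocr(FakeScannedDoc(), 0))
```

Checking the text layer first keeps OCR, which is far more expensive than plain extraction, off pages that do not need it.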

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).
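
For RAG pipelines, Markdown output is typically split into chunks at headings before embedding. The sketch below shows one minimal way to do that using only the standard library. It works on any Markdown string and does not call PDF Oxide; chunk_markdown and the sample text are illustrative names, not part of any library.

```python
import re
from typing import List, Tuple

# ATX headings: one to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_markdown(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks at ATX headings."""
    chunks: List[Tuple[str, str]] = []
    title = ""  # text before the first heading is stored under an empty title
    body: List[str] = []

    def flush() -> None:
        # Skip the empty chunk that would otherwise precede the first heading.
        if title or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2)
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Report\n\nQuarterly results are in.\n\n## Details\n\nMore text here.\n"
for heading, text in chunk_markdown(sample):
    print(f"{heading!r}: {text!r}")
```

This simple splitter treats any line starting with '#' as a heading, so '#' comment lines inside fenced code blocks would also start a new chunk; a production splitter would need to track fences.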

PDF Oxide vs pypdfium2

Both PDF Oxide and pypdfium2 are fast, natively-compiled Python PDF libraries. pypdfium2 wraps Google’s PDFium engine; PDF Oxide is built on a Rust core. The key difference is scope: pypdfium2 is primarily a reader and renderer, while PDF Oxide covers the full PDF lifecycle.

Key Differences

Speed. Both are fast, but PDF Oxide is about 5× faster on mean text extraction: 0.8ms vs 4.1ms (5.1×). Both are dramatically faster than pure-Python libraries.

Features. pypdfium2 is read-only with rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide passes 100% of valid PDFs. pypdfium2 passes 99.2%, with 31 failures out of 3,823 valid PDFs.

License. Both are permissive. PDF Oxide is MIT; pypdfium2 is Apache-2.0. No AGPL concerns with either.
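
The speed and reliability figures above can be re-derived from the numbers in the Benchmark Details table. The short check below is only arithmetic on those published values; the variable names are illustrative.

```python
# Published figures from the Benchmark Details table.
pdf_oxide_mean_ms = 0.8
pypdfium2_mean_ms = 4.1
valid_pdfs = 3823
pypdfium2_passed = 3792

speedup = pypdfium2_mean_ms / pdf_oxide_mean_ms   # ratio of mean extraction times
failures = valid_pdfs - pypdfium2_passed          # PDFs pypdfium2 did not pass
pass_rate = 100 * pypdfium2_passed / valid_pdfs   # percentage of valid PDFs passed

print(f"speedup: {speedup:.1f}x")
print(f"failures: {failures}")
print(f"pass rate: {pass_rate:.1f}%")
```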

Quick Comparison

Feature	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
Pass rate (3,830 PDFs)	100%	99.2%
License	MIT	Apache-2.0
Language	Rust + PyO3	C++ (PDFium)
Text extraction	Yes	Yes
Character positions	Yes	Yes
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Form fields	Read + Write	Read only
Encryption	Read + Write	Read only
Rendering	Yes	Yes
OCR	Built-in	No
Search	Regex + spatial	Yes

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdfium2:

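import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c  # raw PDFium constants (pypdfium2 v4 and later)

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
# Walk every page object and keep only the image objects.
for i, obj in enumerate(page.get_objects()):
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")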

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capabilities.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
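
The scale=150/72 argument in the pypdfium2 example comes from PDF's default unit of 1/72 inch (one point), so a scale of dpi/72 yields the requested DPI. The sketch below works through that arithmetic for a US Letter page; dpi_to_scale and rendered_size are illustrative helpers, not functions from either library, and a renderer may round the final pixel size slightly differently.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Scale factor that maps PDF points to pixels at the given DPI."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(dpi_to_scale(150))
print(rendered_size(612, 792, 150))
```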

Benchmark Details

Metric	PDF Oxide	pypdfium2
Mean extraction time	0.8ms	4.1ms
p99 extraction time	9ms	42ms
Pass rate (valid PDFs)	100% (3,823/3,823)	99.2% (3,792/3,823)

Both libraries use native code (Rust and C++ respectively), but PDF Oxide’s text extraction pipeline is optimized specifically for this task: single-pass extraction with pre-allocated buffers and cached page trees.
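
To reproduce metrics of this kind on your own corpus, a small timing harness is enough. The sketch below is not the harness behind the published numbers; it is a minimal example that takes any extraction callable and reports mean time, nearest-rank p99, and pass rate. The benchmark function, its return keys, and the fake extractor are hypothetical names chosen for illustration.

```python
import math
import time
from typing import Callable, Dict, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> Dict[str, float]:
    """Time extract(path) per file; an exception counts as a failure."""
    timings_ms = []
    total = 0
    failures = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000)

    timings_ms.sort()
    if timings_ms:
        mean_ms = sum(timings_ms) / len(timings_ms)
        # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
        rank = max(1, math.ceil(0.99 * len(timings_ms)))
        p99_ms = timings_ms[rank - 1]
    else:
        mean_ms = p99_ms = float("nan")
    pass_rate = 100 * (total - failures) / total if total else float("nan")
    return {"mean_ms": mean_ms, "p99_ms": p99_ms, "pass_rate": pass_rate}


def fake_extract(path: str) -> str:
    """Stand-in extractor so the sketch runs without a PDF library."""
    if "bad" in path:
        raise ValueError("cannot parse")
    return "text"


print(benchmark(fake_extract, ["a.pdf", "b.pdf", "bad.pdf", "c.pdf"]))
```

To compare real libraries, pass a function that opens the file and extracts its text with the library under test, and run each library over the same list of paths.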

See full benchmark methodology for corpus details.

Feature Completeness

The biggest difference between these libraries is scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle:

Capability	PDF Oxide	pypdfium2
Read and extract	Yes	Yes
Render pages	Yes	Yes
Create PDFs	Yes (Markdown, HTML, images)	No
Edit existing PDFs	Yes (text, images, annotations)	No
Fill form fields	Yes	No
Write encryption	Yes (AES-256)	No
Markdown/HTML output	Yes	No
OCR scanned pages	Yes (PaddleOCR via ONNX)	No
PDF/A validation	Yes	No

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability — creation, editing, form filling, or encryption — PDF Oxide is the single-library solution.

pypdfium2 License (Apache-2.0)

pypdfium2 is licensed under Apache-2.0, which allows commercial use. It wraps Google’s PDFium (the Chromium PDF engine), which carries its own BSD-style license. Both are permissive.

Key considerations:

Apache-2.0 — permissive, allows commercial use, requires attribution
PDFium dependency — binary includes Chromium’s PDFium engine (~15 MB)
Google’s release cycle — pypdfium2 depends on PDFium releases from the Chromium project
No Python API stability guarantee — the API follows PDFium’s C API closely

PDF Oxide is MIT licensed, which is comparably permissive and shorter than Apache-2.0. Like Apache-2.0, MIT requires that the copyright and permission notice be kept with copies of the software.

When to Use Each

Choose PDF Oxide if:

You need more than read/render (creation, editing, forms, encryption)
You want Markdown or HTML conversion
You want built-in OCR for scanned documents
You need the highest reliability (100% vs 99.2%)
Speed is critical and the 5× difference matters at scale

Choose pypdfium2 if:

You only need to read and render PDFs
You prefer PDFium’s specific rendering output
You want a smaller dependency footprint

Related Pages

Performance Benchmarks: full corpus results
vs Python PDF Libraries: all Python libraries compared
Getting Started with Python: installation and first extraction
