What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). It was benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation is needed: run `pip install pdf_oxide` and call `extract_text_ocr()`. It supports the PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires the separate pymupdf4llm package (69× slower).

PDF Oxide vs pypdf

PDF Oxide is 15× faster than pypdf with a higher pass rate, built-in rendering, and Markdown/HTML conversion. If you need more than basic PDF manipulation, PDF Oxide does in one library what pypdf requires multiple packages to achieve.

Why Consider PDF Oxide Over pypdf

Speed. pypdf is pure Python. PDF Oxide uses a Rust core compiled via PyO3, running directly in the Python process. Mean text extraction: 0.8ms vs 12.1ms — a 15× difference.

Reliability. PDF Oxide passes 100% of the 3,823 valid PDFs in the 3,830-file test corpus. pypdf passes 98.4% (3,762 of 3,823), failing on 61 valid PDFs.

Features. pypdf is primarily a PDF manipulation library (merge, split, rotate, encrypt) with basic text extraction. For rendering, Markdown output, or form creation, you need additional packages. PDF Oxide covers all of these in a single install.

Quick Comparison

Feature	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
Pass rate (3,830 PDFs)	100%	98.4%
License	MIT	BSD-3
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Partial
Image extraction	Yes	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes (Markdown/HTML/images)	Limited (merge only)
Form fields	Read + Write	Read + Write
Encryption	Read + Write	Read + Write
Rendering	Yes	No
OCR	Built-in	No
Search	Regex + spatial	No
Install size	~5 MB	~1 MB

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

Extract All Pages

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
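The two loops above differ mainly in how pages are addressed. As a minimal sketch, the adapters below hide that difference behind one iterator so downstream code can treat both libraries the same way. They rely only on the calls shown above (`page_count()`, `extract_text(i)`, `reader.pages`, `page.extract_text()`); the function names `iter_oxide_pages`, `iter_pypdf_pages`, and `join_pages` are hypothetical helpers, not part of either library.

```python
from typing import Iterable, Iterator, Tuple


def iter_oxide_pages(doc) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page number, text) from a PDF Oxide document.

    Uses only page_count() and extract_text(i), as in the example above.
    """
    for i in range(doc.page_count()):
        yield i + 1, doc.extract_text(i)


def iter_pypdf_pages(reader) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page number, text) from a pypdf PdfReader."""
    for number, page in enumerate(reader.pages, start=1):
        yield number, page.extract_text()


def join_pages(pages: Iterable[Tuple[int, str]]) -> str:
    """Concatenate pages using the separator the PDF Oxide loop prints."""
    return "\n".join(f"--- Page {n} ---\n{text}" for n, text in pages)
```

With either library, `join_pages(iter_oxide_pages(doc))` or `join_pages(iter_pypdf_pages(reader))` then produces one string for the whole document.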

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)
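When extracting images from many pages, it can help to write them into a dedicated directory and keep track of the files produced. The sketch below assumes each item is a mapping with a `"format"` string and `"data"` bytes, which is the shape used in the PDF Oxide example above; `save_images` and its `prefix` parameter are hypothetical names introduced for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Mapping


def save_images(images: Iterable[Mapping], out_dir: str, prefix: str = "image") -> List[Path]:
    """Write each image into out_dir and return the paths written.

    Assumes every item has a "format" string and "data" bytes.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for i, img in enumerate(images):
        # Normalize the extension so "PNG" and ".png" both become "png".
        ext = str(img["format"]).lower().lstrip(".")
        path = target / f"{prefix}_{i}.{ext}"
        path.write_bytes(img["data"])
        written.append(path)
    return written
```

A call such as `save_images(doc.extract_image_bytes(0), "out")` would then replace the explicit loop in the PDF Oxide example.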

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()
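The pypdf example above calls `decrypt()` unconditionally and ignores its result. A slightly more defensive variant is sketched below. It assumes that `reader.is_encrypted` is available and that `decrypt()` returns a falsy value when the password is rejected, which matches recent pypdf releases as far as I know but should be checked against the version you use. `open_text_pypdf` is a hypothetical helper name.

```python
def open_text_pypdf(reader, password: str, page_index: int = 0) -> str:
    """Extract one page's text from a pypdf PdfReader, decrypting if needed."""
    if reader.is_encrypted:
        # Assumption: decrypt() returns a falsy value for a rejected password.
        if not reader.decrypt(password):
            raise ValueError("password rejected")
    return reader.pages[page_index].extract_text()
```

Raising early gives a clear error message in place of whatever failure would otherwise surface later, when a page is first read.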

Markdown Conversion

PDF Oxide (built-in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

pypdf:

# pypdf has no Markdown conversion.
# You would need a separate tool chain.
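Because the Markdown output carries heading markers, a retrieval pipeline can chunk a document by section without re-parsing the PDF. The sketch below is library-independent and works on any Markdown string. It assumes ATX-style headings (`#` through `######`), which this page does not confirm as the output format of `to_markdown()`. `split_by_headings` is a hypothetical helper.

```python
import re
from typing import List, Tuple

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(md: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks.

    Text before the first heading is returned under an empty heading.
    """
    chunks: List[Tuple[str, str]] = []
    heading, body = "", []
    for line in md.splitlines():
        match = _HEADING.match(line)
        if match:
            # Close the previous chunk before starting a new one.
            if heading or any(part.strip() for part in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = match.group(2).strip(), []
        else:
            body.append(line)
    if heading or any(part.strip() for part in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

This simple version does not special-case fenced code blocks, so a `#` comment line inside a code fence would be treated as a heading.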

Benchmark Details

Metric	PDF Oxide	pypdf
Mean extraction time	0.8ms	12.1ms
p99 extraction time	9ms	97ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.4% (3,762/3,823)

pypdf’s pure-Python implementation means every operation runs in the interpreter. PDF Oxide’s Rust core handles parsing, font decoding, and text assembly natively, with only the final result crossing the Python boundary.

See the full benchmark methodology for corpus details.
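To reproduce figures of this kind on your own corpus, a small harness along the following lines may be enough. It is a sketch, not the methodology behind the numbers above: it assumes a failure means the extraction call raised an exception, it times only successful calls, and it uses a nearest-rank p99. `benchmark` is a hypothetical name, and `extract` is any callable that takes a file path.

```python
import math
import time
from typing import Callable, Dict, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> Dict[str, float]:
    """Time extract(path) per file; report mean and p99 in ms, plus pass rate."""
    timings_ms = []
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            # Assumption: any exception counts as a failure for that file.
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    if not timings_ms:
        return {"mean_ms": math.nan, "p99_ms": math.nan, "pass_rate": 0.0}
    timings_ms.sort()
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
    rank = max(1, math.ceil(0.99 * len(timings_ms)))
    return {
        "mean_ms": sum(timings_ms) / len(timings_ms),
        "p99_ms": timings_ms[rank - 1],
        "pass_rate": len(timings_ms) / total,
    }
```

For example, `benchmark(lambda p: PdfDocument(p).extract_text(0), paths)` and `benchmark(lambda p: PdfReader(p).pages[0].extract_text(), paths)` would time first-page extraction with each library on the same file list.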

Feature Gap

pypdf excels at PDF manipulation: merge, split, rotate, and encrypt. The features below are ones it lacks or covers only partially:

Feature	PDF Oxide	pypdf
Markdown conversion	`doc.to_markdown(0)`	Not available
HTML conversion	`doc.to_html(0)`	Not available
PDF creation from content	`Pdf.from_markdown()`, `Pdf.from_html()`	Not available
Rendering to images	Yes	Not available
OCR for scanned PDFs	Built-in PaddleOCR	Not available
Text search	`doc.search("query")`	Not available
Character-level bounding boxes	`doc.extract_chars(0)`	Partial
PDF/A validation	Yes	Not available

If your workflow is purely merge/split/rotate, pypdf’s lightweight pure-Python approach is a reasonable choice. For anything involving text extraction quality, creation, or conversion, PDF Oxide is the more complete option.

When to Stay with pypdf

You need a pure-Python dependency with zero compiled extensions
Your use case is strictly merge/split/rotate/encrypt with no text extraction
You need pypdf’s specific PDF manipulation methods for legacy integration

Related Pages

Performance Benchmarks: full corpus results
vs Python PDF Libraries: all Python libraries compared
Getting Started with Python: installation and first extraction
