What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library in this benchmark, with a 0.8ms mean text extraction time: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on a 3,830-PDF corpus, with a 100% pass rate on its 3,823 valid files.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower than PDF Oxide in the benchmark on this page).

vs Python PDF Libraries

PDF Oxide compared with PyMuPDF (fitz), pypdfium2, pypdf, pdfplumber, pdfminer, and more. This page covers performance, feature coverage, licensing, and API differences to help you choose the right Python PDF library for text extraction.

Summary

	PDF Oxide	PyMuPDF	pypdfium2	pypdf	pdfplumber	pdfminer
Mean extraction time	0.8ms	4.6ms	4.1ms	12.1ms	23.2ms	16.8ms
Pass rate (3,823 valid PDFs)	100%	99.3%	99.2%	98.4%	98.8%	98.8%
License	MIT	AGPL-3.0	Apache-2.0	BSD-3	MIT	MIT
Language	Rust + PyO3	C (MuPDF)	C (PDFium)	Pure Python	Pure Python	Pure Python
Text extraction	Yes	Yes	Yes	Yes	Yes	Yes
Character positions	Yes	Yes	Yes	Partial	Yes	Yes
Image extraction	Yes	Yes	Yes	Yes	No	No
Form fields	Read + Write	Read + Write	Read only	Read + Write	Read only	No
PDF creation	Yes	Yes	No	Limited	No	No
PDF editing	Yes	Yes	No	Yes	No	No
Markdown output	Yes	No	No	No	No	No
HTML output	Yes	No	No	No	No	No
Encryption	Read + Write	Read + Write	Read only	Read + Write	No	No
PDF/A validation	Yes	No	No	No	No	No
Rendering	Yes	Yes	Yes	No	No	No
Search	Regex + spatial	Yes	Yes	No	No	No
Python versions	3.8–3.14	3.8–3.12	3.8+	3.6+	3.8+	3.6+
Install size	~5 MB wheel	~20 MB wheel	~3 MB wheel	~1 MB	~1 MB	~1 MB

Performance Comparison

Mean text extraction time per PDF, benchmarked on the full 3,830-PDF corpus — three independent, publicly available test suites that together cover every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, complex layouts, and security edge cases. See full corpus details for what each suite tests and why these results are reproducible.

Library	Mean	Relative	p99	Pass Rate
PDF Oxide	0.8ms	1×	9ms	100%
PyMuPDF	4.6ms	5.8×	28ms	99.3%
pypdfium2	4.1ms	5.1×	42ms	99.2%
pymupdf4llm	55.5ms	69×	280ms	99.1%
pdftext	7.3ms	9.1×	82ms	99.0%
pdfminer	16.8ms	21×	124ms	98.8%
pdfplumber	23.2ms	29×	189ms	98.8%
markitdown	108.8ms	136×	378ms	98.6%
pypdf	12.1ms	15.1×	97ms	98.4%

PDF Oxide achieves its speed through a native Rust core compiled to a Python extension module via PyO3. There is no subprocess overhead or C library bridging — the Rust code runs directly in the Python process.
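
The mean, p99, and pass-rate columns above can be approximated with a small timing harness. The sketch below is not the benchmark code behind these numbers: `benchmark` and the `extract` callable are hypothetical names, and it assumes a file "passes" when extraction returns without raising an exception.

```python
import math
import time
from statistics import mean
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Time `extract` on each path; report mean and p99 (ms) plus pass rate (%)."""
    timings_ms = []
    attempted = 0
    for path in paths:
        attempted += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            continue  # assumption: a raised exception counts as a failed file
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: the smallest sample with at least 99% of
    # all samples at or below it.
    p99 = ordered[math.ceil(0.99 * len(ordered)) - 1]
    return {
        "mean_ms": mean(ordered),
        "p99_ms": p99,
        "pass_rate": 100.0 * len(ordered) / attempted,
    }
```

Passing `lambda p: PdfDocument(p).extract_text(0)` as `extract` would time the PDF Oxide call shown under API Comparison. A serious run would also add warm-up iterations and repeat each file several times to reduce noise.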

Reliability

PDF Oxide processes 3,823 of 3,823 valid PDFs without failure, a 100% pass rate. The other 7 files in the 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams) and are excluded from the pass-rate denominator for every library.

Library	Valid PDFs Passed	Pass Rate
PDF Oxide	3,823 / 3,823	100%
PyMuPDF	3,796 / 3,823	99.3%
pypdfium2	3,792 / 3,823	99.2%
pymupdf4llm	3,787 / 3,823	99.1%
pdftext	3,784 / 3,823	99.0%
pdfminer	3,777 / 3,823	98.8%
pdfplumber	3,777 / 3,823	98.8%
markitdown	3,771 / 3,823	98.6%
pypdf	3,762 / 3,823	98.4%
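
The percentages in this table follow directly from the counts. As a quick check, the snippet below recomputes them from the figures above; the variable and function names are illustrative only.

```python
CORPUS_SIZE = 3830      # every file in the corpus
BROKEN_FIXTURES = 7     # intentionally invalid files, not counted as valid PDFs
VALID_PDFS = CORPUS_SIZE - BROKEN_FIXTURES

# Valid PDFs passed, copied from the table above.
PASSED = {
    "PDF Oxide": 3823,
    "PyMuPDF": 3796,
    "pypdfium2": 3792,
    "pymupdf4llm": 3787,
    "pdftext": 3784,
    "pdfminer": 3777,
    "pdfplumber": 3777,
    "markitdown": 3771,
    "pypdf": 3762,
}


def pass_rate(passed_count: int, valid: int = VALID_PDFS) -> float:
    """Pass rate as a percentage of valid PDFs, rounded to one decimal place."""
    return round(100.0 * passed_count / valid, 1)


for name, count in PASSED.items():
    print(f"{name:12s} {count:,} / {VALID_PDFS:,}  {pass_rate(count)}%")
```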

Text Quality

PDF Oxide achieves 99.5% text parity compared to PyMuPDF and pypdfium2 across the full corpus. Quality was measured by comparing extracted text output character-by-character. The remaining 0.5% difference is in whitespace normalization and ligature handling where PDF Oxide produces cleaner output.
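
One way to approximate this kind of comparison is sketched below. It is an assumption, not the metric used for the 99.5% figure: it substitutes the standard library's `difflib.SequenceMatcher` similarity ratio for a strict position-by-position comparison, and the helper names `normalize` and `text_parity` are hypothetical. Normalization folds ligatures and collapses whitespace, the two areas where the outputs are said to differ.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Fold ligatures such as U+FB01 via NFKC and collapse whitespace runs."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def text_parity(a: str, b: str, *, normalized: bool = True) -> float:
    """Return a similarity ratio in [0, 1] between two extraction outputs."""
    if normalized:
        a, b = normalize(a), normalize(b)
    # autojunk=False disables the popularity heuristic so long texts are
    # compared without skipping frequent characters.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Comparing with and without normalization shows how much of a gap comes from whitespace and ligatures rather than from missing or reordered text.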

License Comparison

Library	License	Commercial Use	Copyleft
PDF Oxide	MIT	Unrestricted	No
pypdfium2	Apache-2.0	Unrestricted	No
PyMuPDF	AGPL-3.0	Requires commercial license ($)	Yes
pypdf	BSD-3	Unrestricted	No
pdfplumber	MIT	Unrestricted	No
pdfminer	MIT	Unrestricted	No
pdftext	GPL-3.0	Requires open source	Yes

PyMuPDF uses MuPDF under the AGPL-3.0 license. If you distribute software that uses PyMuPDF, your software must also be released under AGPL-3.0 — or you must purchase a commercial license from Artifex. This applies to SaaS products, web applications, and any distributed binaries.

PDF Oxide is MIT-licensed with no copyleft restrictions. Use it in proprietary products, SaaS platforms, or closed-source applications; the only obligation is to keep the copyright and license notice with copies of the library.

Use Case	PDF Oxide (MIT)	PyMuPDF (AGPL)	pypdfium2 (Apache)	pypdf (BSD)	pdfplumber (MIT)	pdfminer (MIT)
Commercial product	Yes	Requires license	Yes	Yes	Yes	Yes
Closed source	Yes	No (unless licensed)	Yes	Yes	Yes	Yes
SaaS/cloud	Yes	Requires license	Yes	Yes	Yes	Yes
Internal tools	Yes	Yes	Yes	Yes	Yes	Yes

API Comparison

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text("report.pdf", page_numbers=[0])
print(text)
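
Because each library exposes first-page extraction as a short call, they can sit behind a common interface and be tried in order. The sketch below is one possible arrangement, not a recommended pattern from any of these projects: `first_page_text` and the two adapter functions are hypothetical names, and the adapters use only the calls shown above.

```python
from typing import Callable, Sequence


def extract_with_pdf_oxide(path: str) -> str:
    # Imported lazily so this module loads even if pdf_oxide is not installed.
    from pdf_oxide import PdfDocument
    return PdfDocument(path).extract_text(0)


def extract_with_pypdf(path: str) -> str:
    from pypdf import PdfReader
    return PdfReader(path).pages[0].extract_text()


def first_page_text(path: str, backends: Sequence[Callable[[str], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend(path)
        except Exception as exc:  # ImportError if missing, or a parse failure
            errors.append(f"{backend.__name__}: {exc!r}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))
```

Calling `first_page_text("report.pdf", [extract_with_pdf_oxide, extract_with_pypdf])` would fall back to pypdf only if the first backend is unavailable or raises.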

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"size={ch.font_size:.1f}")

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
blocks = page.get_text("dict")["blocks"]
for block in blocks:
    if "lines" in block:
        for line in block["lines"]:
            for span in line["spans"]:
                print(f"'{span['text']}' size={span['size']:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

pdfminer:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

# Walk the layout tree of the first page: text boxes -> lines -> characters
for page_layout in extract_pages("report.pdf", page_numbers=[0]):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for text_line in element:
            if not isinstance(text_line, LTTextLine):
                continue
            for char in text_line:
                if isinstance(char, LTChar):
                    print(f"'{char.get_text()}' at ({char.x0:.1f}, {char.y0:.1f}) "
                          f"size={char.size:.1f}")

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")

# Also supports HTML
pdf = Pdf.from_html("<h1>Hello</h1><p>World</p>")
pdf.save("output.pdf")

PyMuPDF:

import fitz

doc = fitz.open()
page = doc.new_page()
text_point = fitz.Point(72, 72)
page.insert_text(text_point, "Hello World", fontsize=24)
doc.save("output.pdf")

pypdf:

# pypdf can merge/modify PDFs but cannot create from scratch with text content.
# Use reportlab or fpdf2 for creation, then merge with pypdf.

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

PyMuPDF:

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()

Markdown and HTML Output

PDF Oxide (unique feature):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")

# Convert to Markdown with heading detection
md = doc.to_markdown(0, detect_headings=True)
print(md)

# Convert to HTML
html = doc.to_html(0)
print(html)

No other Python PDF library provides built-in Markdown or HTML conversion.
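
For LLM and RAG pipelines, Markdown with detected headings is convenient because it can be split into section-level chunks. The sketch below is illustrative only: `split_by_heading` is a hypothetical helper, and it assumes the Markdown uses ATX-style `#` headings, which is an assumption about the output of `to_markdown()` rather than something documented here.

```python
import re
from typing import List, Tuple

# ATX heading: one to six '#' characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks.

    Text before the first heading is returned under an empty heading.
    """
    chunks: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            chunks.append((match.group(2), []))
        else:
            chunks[-1][1].append(line)
    joined = [(title, "\n".join(body).strip()) for title, body in chunks]
    # Drop chunks that have neither a heading nor any body text.
    return [(title, body) for title, body in joined if title or body]


sample = "# Intro\n\nHello.\n\n## Methods\n\nStep one.\nStep two.\n"
for title, body in split_by_heading(sample):
    print(repr(title), "->", repr(body))
```

This simple splitter does not track fenced code blocks, so a `#` comment line inside a code sample would be treated as a heading; a production chunker would need to handle that case.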

Library Profiles

PDF Oxide

Strengths:

Fastest text extraction in benchmarks due to Rust core — 5.8× faster than PyMuPDF
100% pass rate on the 3,823 valid PDFs in the 3,830-file corpus, the highest of any tested library
Unified API for extraction, creation, and editing in a single library
Built-in Markdown and HTML export with heading detection
MIT licensed with no copyleft restrictions
Native compliance validation (PDF/A, PDF/UA, PDF/X)
Pre-built wheels for all major platforms and Python 3.8–3.14
No system dependencies — the wheel includes everything

Limitations:

Newer library with a smaller community
Table extraction is basic compared to pdfplumber’s algorithms
Rendering engine is less mature than MuPDF

PyMuPDF (fitz)

Strengths:

Mature and battle-tested (backed by MuPDF, in development since 2005)
Excellent rendering quality for complex PDFs
Built-in OCR integration (Tesseract)
Rich feature set: SVG export, page manipulation, table detection

Limitations:

AGPL-3.0 license requires open-sourcing your application or purchasing a commercial license
Large wheel size (~20 MB) due to bundled MuPDF
No built-in Markdown export
No compliance validation

pypdfium2

Strengths:

Fast (backed by Google’s PDFium engine)
Apache-2.0 license — permissive for commercial use
Good rendering quality

Limitations:

Limited text extraction API compared to PDF Oxide or PyMuPDF
No PDF creation or editing
No form field support beyond read-only

pypdf

Strengths:

Pure Python — installs anywhere, no compiled dependencies
Lightweight and well-maintained
Good for PDF manipulation (merge, split, rotate, encrypt)
Large community and extensive documentation

Limitations:

15× slower than PDF Oxide for text extraction
Text extraction quality struggles with complex layouts
No rendering, no Markdown/HTML export, no table extraction

pdfplumber

Strengths:

Best table extraction of any Python PDF library
Excellent character-level positioning data
Visual debugging tools (annotated page images)
MIT licensed

Limitations:

Pure Python — 29× slower than PDF Oxide
Read-only — no PDF creation or editing
No encryption or rendering

pdfminer

Strengths:

Detailed character and layout analysis
Good CJK text support
Foundation for pdfplumber and other tools
MIT licensed

Limitations:

21× slower than PDF Oxide (pure Python, unoptimized)
Read-only, no creation or editing
Verbose API for common tasks
Less actively maintained

When to Use Each

Use Case	Recommended Library
Fast text extraction	PDF Oxide
Commercial / proprietary product	PDF Oxide, pypdfium2, pypdf, pdfplumber, or pdfminer
PyMuPDF alternative (MIT licensed)	PDF Oxide
PDF creation from Markdown/HTML	PDF Oxide
Compliance validation (PDF/A, PDF/X)	PDF Oxide
Table extraction from invoices	pdfplumber
Visual debugging of extraction	pdfplumber
Existing MuPDF investment	PyMuPDF (if AGPL-compatible)
Minimal dependencies	pypdf (pure Python)
Detailed layout analysis	pdfminer
OCR for scanned documents	PyMuPDF

Installation

# PDF Oxide
pip install pdf_oxide

# PyMuPDF
pip install pymupdf

# pypdfium2
pip install pypdfium2

# pypdf
pip install pypdf

# pdfplumber
pip install pdfplumber

# pdfminer
pip install pdfminer.six

PDF Oxide ships pre-built wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). No compiler or system libraries required.
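
After installing, a quick way to confirm which of these libraries are importable in the current environment is to look up each top-level module. This is a convenience sketch using only the standard library; the import names are the ones used in the examples on this page, and `installed_libraries` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Import names can differ from PyPI names: `pip install pymupdf` provides
# `fitz`, and `pip install pdfminer.six` provides `pdfminer`.
IMPORT_NAMES = {
    "PDF Oxide": "pdf_oxide",
    "PyMuPDF": "fitz",
    "pypdfium2": "pypdfium2",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pdfminer": "pdfminer",
}


def installed_libraries() -> dict:
    """Map each library to True if its top-level module can be located."""
    return {
        name: find_spec(module) is not None
        for name, module in IMPORT_NAMES.items()
    }


if __name__ == "__main__":
    for name, found in installed_libraries().items():
        print(f"{name:12s} {'installed' if found else 'missing'}")
```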

Related Pages

Performance Benchmarks – full corpus benchmark results
Getting Started with Python – installation and first extraction
Python API Reference – complete Python API
vs Rust PDF Libraries – Rust ecosystem comparison
