What is the fastest Python PDF library?

PDF Oxide is the fastest of the Python PDF libraries benchmarked here, with a 0.8ms mean text extraction time: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). The benchmark corpus contains 3,830 PDFs, and PDF Oxide passes all 3,823 valid files in it (100% pass rate).

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PyMuPDF vs pypdf — Which Python PDF Library?

PyMuPDF and pypdf are two of the most popular Python PDF libraries, but both have significant trade-offs. PyMuPDF is fast but locked behind AGPL-3.0 licensing. pypdf is permissively licensed but 15× slower. This page compares them head-to-head — and shows why PDF Oxide is a better choice than either.

The short answer: PDF Oxide is 5.8× faster than PyMuPDF, 15× faster than pypdf, MIT-licensed, and has more features than both — including built-in Markdown/HTML output, XFA form support, and OCR with no system dependencies.

Quick Comparison

Feature	PyMuPDF	pypdf	PDF Oxide
License	AGPL-3.0	BSD-3	MIT
Language	C (MuPDF)	Pure Python	Rust + PyO3
Mean extraction time	4.6ms	12.1ms	0.8ms
p99 extraction time	28ms	97ms	9ms
Pass rate (3,823 valid PDFs)	99.3%	98.4%	100%
Text extraction	Yes	Yes	Yes
Character positions	Yes	Partial	Yes
Image extraction	Yes	Yes	Yes
Form fields	Read + Write	Read + Write	Read + Write
PDF creation	Yes (manual layout)	No (merge/split only)	Yes (Markdown/HTML)
Markdown output	No	No	Yes
HTML output	No	No	Yes
Rendering	Yes	No	Yes
OCR	Tesseract	No	Built-in (PaddleOCR)
Install size	~20 MB	~1 MB	~5 MB
Encryption	Read + Write	Read + Write	Read + Write
Search	Yes	No	Regex + spatial
Python versions	3.8–3.12	3.6+	3.8–3.14

PyMuPDF is faster and more feature-rich than pypdf, but its AGPL license is a dealbreaker for many commercial projects. pypdf is lighter and BSD-licensed, but significantly slower and more limited in extraction capabilities. PDF Oxide combines the speed advantage of a native engine with the licensing freedom of a permissive license.

Licensing: AGPL vs BSD vs MIT

The licensing difference between PyMuPDF and pypdf is often the deciding factor for teams choosing between them.

PyMuPDF — AGPL-3.0

PyMuPDF wraps MuPDF, which is licensed under AGPL-3.0, a strong copyleft license. If you distribute software that uses PyMuPDF (desktop apps, CLI tools, Docker containers), or make it available to users over a network (SaaS applications, web services), your entire application must be released under AGPL-3.0. That means publishing your full source code under the same license.

The alternative is purchasing a commercial license from Artifex, the company behind MuPDF. Artifex does not publish pricing publicly; you must contact their sales team for a quote. Commercial licenses are typically annual and priced per application.

AGPL affects you if:

You ship a product that includes PyMuPDF (desktop app, mobile app, Electron)
You run a SaaS or web service that processes PDFs with PyMuPDF
You distribute Docker images that contain PyMuPDF
You provide an API that uses PyMuPDF internally

AGPL does not affect you if:

Your project is already open-sourced under an AGPL-compatible license
You use PyMuPDF only for internal tooling that is never distributed or exposed to outside users over a network

pypdf — BSD-3

pypdf uses the BSD 3-Clause license, which is permissive. You can use pypdf in commercial products, closed-source software, and SaaS applications without any obligation to open-source your code. The only requirement is retaining the copyright notice in redistributions.

PDF Oxide — MIT

PDF Oxide is MIT licensed, one of the most permissive common open-source licenses. Use it in any context (commercial, proprietary, SaaS, open source) with no restrictions beyond including the license text.

Licensing Summary

Use Case	PyMuPDF (AGPL)	pypdf (BSD)	PDF Oxide (MIT)
Commercial product	Requires license	Yes	Yes
Closed-source SaaS	Requires license	Yes	Yes
Docker distribution	Requires license	Yes	Yes
Internal tools	Yes	Yes	Yes
Open-source (AGPL-compatible)	Yes	Yes	Yes
Open-source (MIT/BSD/Apache)	No	Yes	Yes

For commercial projects where licensing compliance matters, pypdf and PDF Oxide are both safe choices. PyMuPDF requires either open-sourcing your application or purchasing a commercial license.

Speed Benchmarks

All benchmarks were run on the same 3,830-PDF corpus — three independent, publicly available test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs) covering every PDF specification version (1.0–2.0), encrypted files, CJK encodings, complex layouts, and malformed documents.

Text Extraction Speed

Library	Mean	p99	Relative to PDF Oxide
PDF Oxide	0.8ms	9ms	1x
PyMuPDF	4.6ms	28ms	5.8x slower
pypdf	12.1ms	97ms	15.1x slower

PyMuPDF is 2.6x faster than pypdf because it delegates parsing to MuPDF’s C engine. pypdf does everything in pure Python — parsing, font decoding, text assembly — which means every operation pays the interpreter overhead.

PDF Oxide is faster than both because its Rust core handles all PDF parsing, font decoding, and text layout natively via PyO3, with only the final result crossing the Python boundary. There is no subprocess overhead, no C library bridging through ctypes, and no interpreter bottleneck.
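To reproduce mean and p99 figures like the ones in the table on your own files, a small timing harness is enough. The sketch below is an assumption about how such a measurement could be set up, not the harness used for the published numbers; `extract` is a hypothetical callable that takes a file path and runs whichever library you are measuring.

```python
import statistics
import time
from typing import Callable, Iterable


def time_extractions(extract: Callable[[str], object], paths: Iterable[str]) -> list[float]:
    """Return one wall-clock timing in milliseconds per input path."""
    timings_ms = []
    for path in paths:
        start = time.perf_counter()
        extract(path)  # hypothetical: any function that extracts text from one file
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: list[float]) -> dict[str, float]:
    """Mean and nearest-rank 99th percentile of a non-empty list of timings."""
    ordered = sorted(timings_ms)
    # Nearest-rank p99: ceil(0.99 * n), computed with integers to avoid float rounding.
    rank = max(1, -(-99 * len(ordered) // 100))
    return {"mean_ms": statistics.fmean(ordered), "p99_ms": ordered[rank - 1]}
```

Run each library over the same list of paths and compare the two summaries. The p99 value matters as much as the mean, because a few slow documents can dominate the latency of a batch job.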

Reliability

Library	Valid PDFs Passed	Pass Rate
PDF Oxide	3,823 / 3,823	100%
PyMuPDF	3,796 / 3,823	99.3%
pypdf	3,762 / 3,823	98.4%

PyMuPDF fails on 27 valid PDFs in the corpus. pypdf fails on 61. A failure here means the library either crashes on a valid PDF file or returns empty or incorrect text from it. PDF Oxide handles all 3,823 valid PDFs without failure.

The 7 non-passing files in the full 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams) and are excluded from pass-rate calculations for all libraries.
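The pass rates follow directly from the counts in the table once the 7 broken fixtures are removed from the denominator. The short calculation below uses only the figures quoted above.

```python
CORPUS_SIZE = 3830
BROKEN_FIXTURES = 7
VALID_PDFS = CORPUS_SIZE - BROKEN_FIXTURES  # 3,823 files count toward the pass rate

# Valid PDFs passed, as listed in the reliability table.
passed = {"PDF Oxide": 3823, "PyMuPDF": 3796, "pypdf": 3762}


def pass_rate(passed_count: int, valid_total: int = VALID_PDFS) -> float:
    """Pass rate as a percentage of valid PDFs."""
    return 100.0 * passed_count / valid_total


for name, count in passed.items():
    failures = VALID_PDFS - count
    print(f"{name}: {pass_rate(count):.1f}% ({failures} failures)")
```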

What This Means in Practice

For a pipeline processing thousands of PDFs daily, PyMuPDF’s 99.3% pass rate means roughly 7 failures per 1,000 documents. pypdf’s 98.4% means 16 failures per 1,000. These are documents you need to handle with fallback logic, manual review, or simply accept as lost data.

PDF Oxide’s 100% pass rate on the test corpus means fewer edge cases to handle in production.
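Whichever library you pick, the fallback logic mentioned above can be as simple as trying extractors in order. This is a minimal sketch under stated assumptions: each extractor is a hypothetical callable that takes a path and returns text, and both an exception and empty output are treated as a failure.

```python
from typing import Callable, Optional, Sequence, Tuple

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each (name, extractor) in order.

    Returns (name, text) from the first extractor that yields non-empty text,
    or (None, None) if every extractor fails.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            continue  # any library error counts as a failed attempt
        if text and text.strip():
            return name, text
    return None, None
```

A result of `(None, None)` marks the document for manual review. Logging the winning extractor's name shows how often the primary library actually needs the fallback.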

Feature Comparison

Text Extraction

All three libraries support basic text extraction. The API styles differ:

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF uses a page-object model (doc[0] returns a page). pypdf uses a reader/pages pattern. PDF Oxide uses page indices directly.
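Because the three call styles differ, a thin adapter makes it easy to swap libraries in one place. The sketch below wraps exactly the calls shown in the snippets above; the function names and the `FIRST_PAGE_EXTRACTORS` registry are illustrative and not part of any of the three libraries. Imports are deferred so the module loads even when only one library is installed.

```python
def first_page_pymupdf(path: str) -> str:
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return doc[0].get_text()


def first_page_pypdf(path: str) -> str:
    from pypdf import PdfReader

    return PdfReader(path).pages[0].extract_text()


def first_page_pdf_oxide(path: str) -> str:
    from pdf_oxide import PdfDocument

    return PdfDocument(path).extract_text(0)


# Illustrative registry: backend name -> function returning first-page text.
FIRST_PAGE_EXTRACTORS = {
    "pymupdf": first_page_pymupdf,
    "pypdf": first_page_pypdf,
    "pdf_oxide": first_page_pdf_oxide,
}


def first_page_text(path: str, backend: str) -> str:
    """Extract first-page text with the named backend."""
    try:
        extract = FIRST_PAGE_EXTRACTORS[backend]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
    return extract(path)
```

Calling `first_page_text("report.pdf", "pypdf")` then behaves like the pypdf snippet above, and changing the backend string is the only edit needed to compare output across libraries.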

For character-level extraction (positions, font sizes, bounding boxes), PyMuPDF provides get_text("dict"), which returns a nested structure of blocks, lines, and spans, and get_text("rawdict"), which adds per-character bounding boxes. pypdf offers partial character position data. PDF Oxide provides extract_chars() with per-character bounding boxes and font metadata.

Markdown Conversion

This is a significant differentiator. Many LLM and RAG pipelines need Markdown output from PDFs.

PyMuPDF:

# PyMuPDF has no built-in Markdown conversion.
# You need pymupdf4llm, a separate package:
import pymupdf4llm

md = pymupdf4llm.to_markdown("paper.pdf")

pymupdf4llm works but is 69x slower than PDF Oxide’s built-in Markdown conversion (55.5ms mean vs 0.8ms). It is also a separate dependency with its own maintenance cycle.

pypdf:

# pypdf has no Markdown conversion.
# You would need an external tool chain (e.g., extract text,
# then use a separate library to structure it as Markdown).

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

PDF Oxide’s Markdown conversion is built-in, handles heading detection, preserves table structure, and runs at the same speed as plain text extraction.
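Detected headings pay off in a RAG pipeline because they give natural chunk boundaries. The sketch below splits a Markdown string into one chunk per heading. It works on any Markdown text, makes no assumptions about the converter that produced it, and handles only `#`-style headings; it does not skip fenced code blocks.

```python
import re

# ATX headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks: list[dict] = []
    heading, level, lines = None, 0, []

    def flush() -> None:
        # Emit the chunk collected so far, skipping an empty preamble.
        text = "\n".join(lines).strip()
        if heading is not None or text:
            chunks.append({"heading": heading, "level": level, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, level, lines = match.group(2), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading and level, so an indexer can store the heading as metadata next to the embedded text.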

HTML Conversion

PyMuPDF: No built-in HTML output.

pypdf: No HTML output.

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
html = doc.to_html(0)
print(html)

Form Fields

All three libraries support reading and writing form fields (AcroForm).

PyMuPDF:

import fitz

doc = fitz.open("form.pdf")
page = doc[0]
for widget in page.widgets():
    print(f"{widget.field_name}: {widget.field_value}")

pypdf:

from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields() or {}  # get_fields() returns None when the PDF has no form
for name, field in fields.items():
    print(f"{name}: {field.get('/V', '')}")

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name}: {field.value}")

One notable difference: PDF Oxide supports XFA forms (XML Forms Architecture), which are used in many government and enterprise PDF forms. Neither PyMuPDF nor pypdf handles XFA form data extraction.
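For the pypdf snippet above, downstream code usually wants a flat name-to-value mapping rather than raw field dictionaries. The helper below is a sketch that accepts the kind of mapping `get_fields()` returns, including `None` for a PDF without a form, and reads each field's `/V` entry. The function name is illustrative.

```python
from typing import Any, Mapping, Optional


def field_values(fields: Optional[Mapping[str, Mapping[str, Any]]]) -> dict[str, str]:
    """Flatten a pypdf-style field mapping to {field name: value as text}."""
    if not fields:
        return {}  # no AcroForm, or an empty one
    flat = {}
    for name, field in fields.items():
        value = field.get("/V")  # /V holds the field's current value, if any
        flat[name] = "" if value is None else str(value)
    return flat
```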

Image Extraction

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

All three handle embedded image extraction. PyMuPDF’s approach requires a two-step xref lookup. pypdf and PDF Oxide offer more streamlined APIs.

Rendering

PyMuPDF can render PDF pages to images (PNG, JPEG) using MuPDF’s rendering engine. pypdf cannot render pages at all. PDF Oxide includes a built-in rendering engine.

OCR

PyMuPDF integrates with Tesseract for OCR on scanned PDFs. pypdf has no OCR support. PDF Oxide has built-in OCR via PaddleOCR, requiring no external system dependencies.

PDF Creation

PyMuPDF can create PDFs but requires manual placement of text, images, and shapes on pages — there is no high-level API for creating PDFs from structured content.

pypdf cannot create PDFs from scratch. It can merge, split, and modify existing PDFs, but for creation you need a separate library like reportlab or fpdf2.

PDF Oxide can create PDFs from Markdown or HTML:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Invoice\n\n| Item | Price |\n|------|-------|\n| Widget | $9.99 |")
pdf.save("invoice.pdf")
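In practice the Markdown passed to from_markdown is generated from data rather than typed by hand. The sketch below builds the same invoice string from a list of items using only the standard library; `markdown_table` and `invoice_markdown` are illustrative helpers, not part of PDF Oxide.

```python
def markdown_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a pipe table; every row must have one cell per header."""

    def fmt(cells: list[str]) -> str:
        # Escape literal pipes so they do not split a cell.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    for row in rows:
        if len(row) != len(headers):
            raise ValueError("row length does not match header length")
    lines = [fmt(headers), "|" + "|".join(" --- " for _ in headers) + "|"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)


def invoice_markdown(title: str, items: list[tuple[str, float]]) -> str:
    """Build a one-table invoice document as a Markdown string."""
    rows = [[name, f"${price:.2f}"] for name, price in items]
    return f"# {title}\n\n" + markdown_table(["Item", "Price"], rows)
```

The returned string can be passed to `Pdf.from_markdown(...)` exactly as in the snippet above.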

Encryption

All three libraries support reading encrypted PDFs and writing encrypted output.

PyMuPDF:

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
text = doc[0].get_text()

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

Feature Summary

Feature	PyMuPDF	pypdf	PDF Oxide
Text extraction	Yes	Yes	Yes
Character positions	Yes	Partial	Yes
Image extraction	Yes	Yes	Yes
Form fields (AcroForm)	Read + Write	Read + Write	Read + Write
XFA forms	No	No	Yes
PDF creation	Manual	No	Markdown/HTML
Markdown output	No (pymupdf4llm)	No	Built-in
HTML output	No	No	Built-in
Rendering	Yes	No	Yes
OCR	Tesseract	No	Built-in (PaddleOCR)
Search	Yes	No	Regex + spatial
Encryption	Read + Write	Read + Write	Read + Write
PDF/A validation	No	No	Yes
SVG export	Yes	No	No
Merge/split	Yes	Yes	Yes

When to Choose Each Library

Choose pypdf if:

You need a pure-Python solution with no compiled C or Rust extensions
You are doing simple PDF manipulation (merge, split, rotate, encrypt/decrypt)
Speed is not critical for your use case
You want the smallest possible install footprint (~1 MB)
You need broad Python version support (3.6+)

Choose PyMuPDF if:

You already have a commercial MuPDF license from Artifex
You need SVG export from PDF pages
Your project is already licensed under AGPL-3.0
You depend on MuPDF-specific rendering behavior

Choose PDF Oxide if:

You need maximum text extraction speed (5.8x faster than PyMuPDF, 15x faster than pypdf)
You want MIT licensing for commercial or closed-source use
You need built-in Markdown or HTML output for LLM/RAG pipelines
You need XFA form support
You want built-in OCR without external system dependencies
You want 100% reliability on valid PDFs

Installation

# PyMuPDF
pip install pymupdf

# pypdf
pip install pypdf

# PDF Oxide
pip install pdf_oxide

All three are available via pip. PyMuPDF ships a ~20 MB wheel with bundled MuPDF. pypdf is pure Python at ~1 MB. PDF Oxide ships pre-built wheels (~5 MB) for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64).
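After installing, you can check which of the three libraries is importable without actually loading it. The sketch below uses the standard library's `importlib.util.find_spec` and the import names used in the snippets on this page (`fitz` for PyMuPDF).

```python
from importlib.util import find_spec

# Import names (not pip package names) for the libraries compared here.
IMPORT_NAMES = {"PyMuPDF": "fitz", "pypdf": "pypdf", "PDF Oxide": "pdf_oxide"}


def available_backends(import_names: dict[str, str] = IMPORT_NAMES) -> dict[str, bool]:
    """Map each library label to True if its top-level module can be found."""
    return {label: find_spec(module) is not None for label, module in import_names.items()}


if __name__ == "__main__":
    for label, installed in available_backends().items():
        print(f"{label}: {'installed' if installed else 'missing'}")
```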

The Verdict

If you’re choosing between PyMuPDF and pypdf, you’re choosing between speed and licensing freedom. PDF Oxide gives you both — faster than PyMuPDF, more permissive than pypdf, with features neither library offers.

What matters to you	Best choice
Maximum speed	PDF Oxide (0.8ms)
Permissive license	PDF Oxide (MIT) or pypdf (BSD)
Speed + permissive license	PDF Oxide — the only option
Markdown/HTML output	PDF Oxide — built-in
XFA forms	PDF Oxide — only library that supports them
100% reliability	PDF Oxide — 100% pass rate
OCR without Tesseract	PDF Oxide — built-in PaddleOCR
SVG export	PyMuPDF
Pure Python, no binaries	pypdf

Get started in 10 seconds:

pip install pdf_oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

Related Pages

PDF Oxide vs PyMuPDF — detailed comparison
PDF Oxide vs pypdf — detailed comparison
vs All Python PDF Libraries — full ecosystem comparison
Performance Benchmarks — methodology and results
