PyMuPDF vs pypdf — Which Python PDF Library?
PyMuPDF and pypdf are two of the most popular Python PDF libraries, but both have significant trade-offs. PyMuPDF is fast but locked behind AGPL-3.0 licensing. pypdf is permissively licensed but about 2.6× slower than PyMuPDF at text extraction. This page compares them head-to-head and shows why PDF Oxide is a better choice than either.
The short answer: in the benchmarks below, PDF Oxide is 5.8× faster than PyMuPDF and 15× faster than pypdf, is MIT-licensed, and ships features that neither library offers, including built-in Markdown/HTML output, XFA form support, and OCR with no system dependencies.
Quick Comparison
| Feature | PyMuPDF | pypdf | PDF Oxide |
|---|---|---|---|
| License | AGPL-3.0 | BSD-3 | MIT |
| Language | C (MuPDF) | Pure Python | Rust + PyO3 |
| Mean extraction time | 4.6ms | 12.1ms | 0.8ms |
| p99 extraction time | 28ms | 97ms | 9ms |
| Pass rate (3,830 PDFs) | 99.3% | 98.4% | 100% |
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Partial | Yes |
| Image extraction | Yes | Yes | Yes |
| Form fields | Read + Write | Read + Write | Read + Write |
| PDF creation | Yes | No (merge/split/modify only) | Yes (Markdown/HTML) |
| Markdown output | No | No | Yes |
| HTML output | No | No | Yes |
| Rendering | Yes | No | Yes |
| OCR | Tesseract | No | Built-in (PaddleOCR) |
| Install size | ~20 MB | ~1 MB | ~5 MB |
| Encryption | Read + Write | Read + Write | Read + Write |
| Search | Yes | No | Regex + spatial |
| Python versions | 3.8–3.12 | 3.6+ | 3.8–3.14 |
PyMuPDF is faster and more feature-rich than pypdf, but its AGPL license is a dealbreaker for many commercial projects. pypdf is lighter and BSD-licensed, but significantly slower and more limited in extraction capabilities. PDF Oxide combines the speed advantage of a native engine with the licensing freedom of a permissive license.
Licensing: AGPL vs BSD vs MIT
The licensing difference between PyMuPDF and pypdf is often the deciding factor for teams choosing between them.
PyMuPDF — AGPL-3.0
PyMuPDF wraps MuPDF, which is licensed under AGPL-3.0, a strong copyleft license. If you distribute software that uses PyMuPDF (desktop apps, CLI tools, Docker images), or let users interact with it over a network (SaaS applications, web services, APIs), AGPL-3.0 generally requires you to release your entire application under the same license. That means publishing your full source code.
The alternative is purchasing a commercial license from Artifex, the company behind MuPDF. Artifex does not publish pricing publicly; you must contact their sales team for a quote. Commercial licenses are typically annual and priced per application.
AGPL affects you if:
- You ship a product that includes PyMuPDF (desktop app, mobile app, Electron)
- You run a SaaS or web service that processes PDFs with PyMuPDF
- You distribute Docker images that contain PyMuPDF
- You provide an API that uses PyMuPDF internally
AGPL does not affect you if:
- Your project is already open-sourced under an AGPL-compatible license
- You use PyMuPDF only for internal tooling that is never distributed
pypdf — BSD-3
pypdf uses the BSD 3-Clause license, which is permissive. You can use pypdf in commercial products, closed-source software, and SaaS applications without any obligation to open-source your code. The only requirement is retaining the copyright notice in redistributions.
PDF Oxide — MIT
PDF Oxide is MIT-licensed, one of the most permissive common open-source licenses. Use it in any context (commercial, proprietary, SaaS, open source) with no conditions beyond including the copyright and license notice.
Licensing Summary
| Use Case | PyMuPDF (AGPL) | pypdf (BSD) | PDF Oxide (MIT) |
|---|---|---|---|
| Commercial product | Requires license | Yes | Yes |
| Closed-source SaaS | Requires license | Yes | Yes |
| Docker distribution | Requires license | Yes | Yes |
| Internal tools | Yes | Yes | Yes |
| Open-source (AGPL-compatible) | Yes | Yes | Yes |
| Open-source (MIT/BSD/Apache) | No | Yes | Yes |
For commercial projects where licensing compliance matters, pypdf and PDF Oxide are both safe choices. PyMuPDF requires either open-sourcing your application or purchasing a commercial license.
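A quick way to see which of these licenses is already in an environment is to read the metadata of the installed distributions. The sketch below is a heuristic only: it assumes the distribution names `PyMuPDF`, `pypdf`, and `pdf_oxide`, and it simply searches the standard `License`, `License-Expression`, and `Classifier` metadata fields for AGPL markers. The helper names are hypothetical.

```python
from importlib import metadata

# Substrings that suggest an AGPL-style license in package metadata.
AGPL_MARKERS = ("agpl", "affero")


def license_strings(dist_name: str) -> list[str]:
    """Collect the license-related metadata fields of an installed distribution."""
    meta = metadata.metadata(dist_name)  # raises PackageNotFoundError if missing
    fields = [meta.get("License-Expression"), meta.get("License")]
    classifiers = meta.get_all("Classifier") or []
    fields += [c for c in classifiers if c.startswith("License ::")]
    return [f for f in fields if f]


def looks_agpl(strings: list[str]) -> bool:
    """Heuristic: True if any metadata string mentions AGPL or Affero."""
    return any(marker in s.lower() for s in strings for marker in AGPL_MARKERS)


def audit(dist_names: list[str]) -> dict[str, object]:
    """Map each distribution to True/False, or None when it is not installed."""
    report: dict[str, object] = {}
    for name in dist_names:
        try:
            report[name] = looks_agpl(license_strings(name))
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


# Distribution names are assumptions; adjust them to match your environment.
print(audit(["PyMuPDF", "pypdf", "pdf_oxide"]))
```

A `True` entry is a prompt to review the dependency, not a legal determination; metadata fields are free text and can be missing or out of date.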
Speed Benchmarks
All benchmarks were run on the same 3,830-PDF corpus — three independent, publicly available test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs) covering every PDF specification version (1.0–2.0), encrypted files, CJK encodings, complex layouts, and malformed documents.
Text Extraction Speed
| Library | Mean | p99 | Relative to PDF Oxide |
|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 1x |
| PyMuPDF | 4.6ms | 28ms | 5.8x slower |
| pypdf | 12.1ms | 97ms | 15.1x slower |
PyMuPDF is 2.6x faster than pypdf because it delegates parsing to MuPDF’s C engine. pypdf does everything in pure Python — parsing, font decoding, text assembly — which means every operation pays the interpreter overhead.
PDF Oxide is faster than both because its Rust core handles all PDF parsing, font decoding, and text layout natively via PyO3, with only the final result crossing the Python boundary. There is no subprocess overhead, no C library bridging through ctypes, and no interpreter bottleneck.
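To check figures like these on your own documents, a small timing harness is enough. The sketch below is not the harness used for the published numbers; it assumes one extraction call per file, wall-clock timing with `time.perf_counter`, and a nearest-rank p99. The function names are hypothetical, and `extract` can be any callable that takes a path.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped to a valid 1-based rank.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict[str, float]:
    """Time one extraction call per path and summarize the timings in milliseconds."""
    timings_ms = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    if not timings_ms:
        raise ValueError("no paths to benchmark")
    return {"mean_ms": mean(timings_ms), "p99_ms": percentile(timings_ms, 99)}


def slowdown(candidate_ms: float, baseline_ms: float) -> float:
    """How many times slower the candidate is than the baseline."""
    return candidate_ms / baseline_ms
```

With the means from the table, `slowdown(12.1, 4.6)` is about 2.6 and `slowdown(12.1, 0.8)` is about 15.1, which is where the relative figures come from. Run each library over the same file list, in the same process conditions, before comparing results.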
Reliability
| Library | Valid PDFs Passed | Pass Rate |
|---|---|---|
| PDF Oxide | 3,823 / 3,823 | 100% |
| PyMuPDF | 3,796 / 3,823 | 99.3% |
| pypdf | 3,762 / 3,823 | 98.4% |
PyMuPDF fails on 27 valid PDFs in the corpus. pypdf fails on 61. In both cases, these are valid PDF files that the library either crashes on or returns empty/incorrect text from. PDF Oxide handles all 3,823 valid PDFs without failure.
The 7 non-passing files in the full 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams) and are excluded from pass-rate calculations for all libraries.
What This Means in Practice
For a pipeline processing thousands of PDFs daily, PyMuPDF’s 99.3% pass rate means roughly 7 failures per 1,000 documents. pypdf’s 98.4% means roughly 16 failures per 1,000. These are documents you need to handle with fallback logic, send to manual review, or simply accept as lost data.
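The per-1,000 figures follow directly from the pass counts in the reliability table. A short worked calculation, with hypothetical helper names, makes the arithmetic explicit and lets you scale it to your own daily volume:

```python
def failures_per_thousand(passed: int, total: int) -> float:
    """Failure rate expressed as failures per 1,000 documents."""
    return (total - passed) / total * 1000


def expected_daily_failures(passed: int, total: int, docs_per_day: int) -> float:
    """Expected failures per day if your documents fail at the corpus rate."""
    return (total - passed) / total * docs_per_day


# (passed, total) counts from the reliability table above.
corpus_results = {
    "PyMuPDF": (3796, 3823),
    "pypdf": (3762, 3823),
    "PDF Oxide": (3823, 3823),
}

for name, (passed, total) in corpus_results.items():
    rate = failures_per_thousand(passed, total)
    print(f"{name}: {rate:.1f} failures per 1,000 documents")
```

This assumes your documents fail at the same rate as the test corpus, which may not hold for a different mix of PDF producers.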
PDF Oxide’s 100% pass rate on the test corpus means fewer edge cases to handle in production.
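Whichever library you choose, fallback logic for the occasional failing document can be kept small. The following is a minimal sketch, assuming each backend is wrapped in a callable that takes a path and returns text; `extract_with_fallback` is a hypothetical helper, not part of any of the three libraries. It treats both exceptions and empty output as failures, matching the failure modes described above.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Try each extractor in order and return (text, backend_name, errors).

    text and backend_name are None when every extractor failed.
    """
    errors: list[str] = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # one backend crashing should not stop the pipeline
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
            continue
        if text and text.strip():
            return text, name, errors
        # No exception, but nothing usable came back: count it as a failure.
        errors.append(f"{name}: empty text")
    return None, None, errors
```

Documents for which the returned text is `None` can be routed to manual review, and the collected `errors` list records why each backend was skipped.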
Feature Comparison
Text Extraction
All three libraries support basic text extraction. The API styles differ:
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
PyMuPDF uses a page-object model (doc[0] returns a page). pypdf uses a reader/pages pattern. PDF Oxide uses page indices directly.
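Because the three call shapes differ only slightly, they can be hidden behind one function if you want to switch libraries without touching calling code. This is a sketch under assumptions: the PyMuPDF and pypdf calls are the ones shown above, the PDF Oxide call is copied from the snippet above rather than verified independently, and `BACKENDS` and `first_page_text` are hypothetical names. Imports are deferred so that only the selected library needs to be installed.

```python
from typing import Callable


def _pymupdf_first_page(path: str) -> str:
    import fitz  # PyMuPDF

    doc = fitz.open(path)
    try:
        return doc[0].get_text()  # page-object model
    finally:
        doc.close()


def _pypdf_first_page(path: str) -> str:
    from pypdf import PdfReader

    return PdfReader(path).pages[0].extract_text()  # reader/pages pattern


def _pdf_oxide_first_page(path: str) -> str:
    from pdf_oxide import PdfDocument  # call shape taken from the example above

    return PdfDocument(path).extract_text(0)  # page index passed directly


BACKENDS: dict[str, Callable[[str], str]] = {
    "pymupdf": _pymupdf_first_page,
    "pypdf": _pypdf_first_page,
    "pdf_oxide": _pdf_oxide_first_page,
}


def first_page_text(path: str, backend: str = "pypdf") -> str:
    """Extract text from the first page using the named backend."""
    if backend not in BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; choose from {sorted(BACKENDS)}")
    return BACKENDS[backend](path)
```

Calling `first_page_text("report.pdf", backend="pymupdf")` then behaves like the PyMuPDF snippet, and swapping the `backend` argument is the only change needed to compare outputs.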
For character-level extraction (positions, font sizes, bounding boxes), PyMuPDF provides get_text("dict") which returns a nested dict structure. pypdf offers partial character position data. PDF Oxide provides extract_chars() with per-character bounding boxes and font metadata.
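The nested structure returned by PyMuPDF's `get_text("dict")` is blocks, then lines, then spans, where each span carries its text, bounding box, font name, and font size. The sketch below flattens that structure; `iter_spans` is a hypothetical helper, and it works on the dictionary alone so it can be tried without a PDF at hand. In real use the input would be `page.get_text("dict")`.

```python
from typing import Any, Iterator


def iter_spans(page_dict: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield one flat record per text span in a PyMuPDF 'dict' page structure."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # 0 = text block, 1 = image block
            continue
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                yield {
                    "text": span["text"],
                    "bbox": tuple(span["bbox"]),  # (x0, y0, x1, y1) in points
                    "font": span["font"],
                    "size": span["size"],
                }


# Hand-built stand-in for page.get_text("dict"), for illustration only.
sample_page = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"text": "Quarterly Report", "bbox": (72.0, 72.0, 210.0, 90.0),
                         "font": "Helvetica-Bold", "size": 18.0},
                    ]
                }
            ],
        },
        {"type": 1},  # image block, skipped
    ]
}

for record in iter_spans(sample_page):
    print(record["text"], record["size"], record["bbox"])
```

Spans are runs of same-styled text, not single characters. For true per-character boxes in PyMuPDF, `get_text("rawdict")` adds a `chars` list to each span.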
Markdown Conversion
This is a significant differentiator. Many LLM and RAG pipelines need Markdown output from PDFs.
PyMuPDF:
# PyMuPDF has no built-in Markdown conversion.
# You need pymupdf4llm, a separate package:
import pymupdf4llm
md = pymupdf4llm.to_markdown("paper.pdf")
pymupdf4llm works but is 69x slower than PDF Oxide’s built-in Markdown conversion (55.5ms mean vs 0.8ms). It is also a separate dependency with its own maintenance cycle.
pypdf:
# pypdf has no Markdown conversion.
# You would need an external tool chain (e.g., extract text,
# then use a separate library to structure it as Markdown).
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
PDF Oxide’s Markdown conversion is built-in, handles heading detection, preserves table structure, and runs at the same speed as plain text extraction.
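Markdown output matters for RAG pipelines mostly because headings give natural chunk boundaries. As an illustration of that downstream step, here is a minimal heading-based splitter. It is independent of any PDF library, works on whatever Markdown string the converter returns, and assumes ATX-style `#` headings with no fenced code blocks containing `#` lines. `split_by_headings` is a hypothetical helper.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading, keeping any leading preamble."""
    sections = []
    current = {"level": 0, "title": "", "lines": []}

    def flush() -> None:
        body = "\n".join(current["lines"]).strip()
        # Skip an empty untitled preamble; keep everything else.
        if current["title"] or body:
            sections.append(
                {"level": current["level"], "title": current["title"], "body": body}
            )

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            current = {"level": len(match.group(1)), "title": match.group(2), "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return sections


example = "Preamble text\n# Results\nRevenue grew.\n## Q1\nDetails here.\n"
for section in split_by_headings(example):
    print(section["level"], repr(section["title"]), repr(section["body"]))
```

Each section can then be embedded on its own, with the heading title stored as metadata for retrieval.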
HTML Conversion
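PyMuPDF: No dedicated HTML conversion API. page.get_text("html") does emit HTML, but as positioned, inline-styled markup aimed at visual fidelity rather than semantic structure.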
pypdf: No HTML output.
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
html = doc.to_html(0)
print(html)
Form Fields
All three libraries support reading and writing form fields (AcroForm).
PyMuPDF:
import fitz
doc = fitz.open("form.pdf")
page = doc[0]
for widget in page.widgets():
    print(f"{widget.field_name}: {widget.field_value}")
pypdf:
from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_fields() or {}  # get_fields() returns None when the PDF has no form fields
for name, field in fields.items():
    print(f"{name}: {field.get('/V', '')}")
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name}: {field.value}")
One notable difference: PDF Oxide supports XFA forms (XML Forms Architecture), which are used in many government and enterprise PDF forms. Neither PyMuPDF nor pypdf handles XFA form data extraction.
Image Extraction
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
All three handle embedded image extraction. PyMuPDF’s approach requires a two-step xref lookup. pypdf and PDF Oxide offer more streamlined APIs.
Rendering
PyMuPDF can render PDF pages to images (PNG, JPEG) using MuPDF’s rendering engine. pypdf cannot render pages at all. PDF Oxide includes a built-in rendering engine.
OCR
PyMuPDF integrates with Tesseract for OCR on scanned PDFs. pypdf has no OCR support. PDF Oxide has built-in OCR via PaddleOCR, requiring no external system dependencies.
PDF Creation
PyMuPDF can create PDFs but requires manual placement of text, images, and shapes on pages — there is no high-level API for creating PDFs from structured content.
pypdf cannot create PDFs from scratch. It can merge, split, and modify existing PDFs, but for creation you need a separate library like reportlab or fpdf2.
PDF Oxide can create PDFs from Markdown or HTML:
from pdf_oxide import Pdf
pdf = Pdf.from_markdown("# Invoice\n\n| Item | Price |\n|------|-------|\n| Widget | $9.99 |")
pdf.save("invoice.pdf")
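When the content comes from data rather than a hand-written string, it helps to build the Markdown programmatically before handing it to the converter. The sketch below only produces the Markdown text; `markdown_table` and `invoice_markdown` are hypothetical helpers, and passing the result to `Pdf.from_markdown` assumes the call shape shown in the snippet above.

```python
from typing import Sequence


def markdown_table(headers: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Render headers and rows as a Markdown pipe table."""

    def cell(value: object) -> str:
        # Escape pipes so cell content cannot break the column layout.
        return str(value).replace("|", "\\|")

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(f"row {row!r} does not have {len(headers)} cells")
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)


def invoice_markdown(title: str, items: Sequence[tuple[str, str]]) -> str:
    """Build a heading followed by an Item/Price table."""
    return f"# {title}\n\n" + markdown_table(["Item", "Price"], items)


print(invoice_markdown("Invoice", [("Widget", "$9.99"), ("Gadget", "$24.50")]))
```

The returned string can be passed wherever the literal Markdown string appears in the example above.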
Encryption
All three libraries support reading encrypted PDFs and writing encrypted output.
PyMuPDF:
import fitz
doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
text = doc[0].get_text()
pypdf:
from pypdf import PdfReader
reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
Feature Summary
| Feature | PyMuPDF | pypdf | PDF Oxide |
|---|---|---|---|
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Partial | Yes |
| Image extraction | Yes | Yes | Yes |
| Form fields (AcroForm) | Read + Write | Read + Write | Read + Write |
| XFA forms | No | No | Yes |
| PDF creation | Manual | No | Markdown/HTML |
| Markdown output | No (pymupdf4llm) | No | Built-in |
| HTML output | No | No | Built-in |
| Rendering | Yes | No | Yes |
| OCR | Tesseract | No | Built-in (PaddleOCR) |
| Search | Yes | No | Regex + spatial |
| Encryption | Read + Write | Read + Write | Read + Write |
| PDF/A validation | No | No | Yes |
| SVG export | Yes | No | No |
| Merge/split | Yes | Yes | Yes |
When to Choose Each Library
Choose pypdf if:
- You need a pure-Python solution with no compiled C or Rust extensions
- You are doing simple PDF manipulation (merge, split, rotate, encrypt/decrypt)
- Speed is not critical for your use case
- You want the smallest possible install footprint (~1 MB)
- You need broad Python version support (3.6+)
Choose PyMuPDF if:
- You already have a commercial MuPDF license from Artifex
- You need SVG export from PDF pages
- Your project is already licensed under AGPL-3.0
- You depend on MuPDF-specific rendering behavior
Choose PDF Oxide if:
- You need maximum text extraction speed (5.8x faster than PyMuPDF, 15x faster than pypdf)
- You want MIT licensing for commercial or closed-source use
- You need built-in Markdown or HTML output for LLM/RAG pipelines
- You need XFA form support
- You want built-in OCR without external system dependencies
- You want 100% reliability on valid PDFs
Installation
# PyMuPDF
pip install pymupdf
# pypdf
pip install pypdf
# PDF Oxide
pip install pdf_oxide
All three are available via pip. PyMuPDF ships a ~20 MB wheel with bundled MuPDF. pypdf is pure Python at ~1 MB. PDF Oxide ships pre-built wheels (~5 MB) for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64).
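After installing, you can confirm which libraries are importable in the current environment without actually loading them. This sketch uses the standard library's `importlib.util.find_spec`; the import names `fitz`, `pypdf`, and `pdf_oxide` are the ones used in the examples on this page, and `installed` is a hypothetical helper.

```python
from importlib.util import find_spec


def installed(module_names: list[str]) -> list[str]:
    """Return the top-level module names that can be imported in this environment."""
    return [name for name in module_names if find_spec(name) is not None]


# Import names as used in the snippets on this page.
print(installed(["fitz", "pypdf", "pdf_oxide"]))
```

The helper only accepts top-level module names; dotted names would need their parent package to be importable first.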
The Verdict
If you’re choosing between PyMuPDF and pypdf, you’re choosing between speed and licensing freedom. PDF Oxide gives you both — faster than PyMuPDF, more permissive than pypdf, with features neither library offers.
| What matters to you | Best choice |
|---|---|
| Maximum speed | PDF Oxide (0.8ms) |
| Permissive license | PDF Oxide (MIT) or pypdf (BSD) |
| Speed + permissive license | PDF Oxide — the only option |
| Markdown/HTML output | PDF Oxide — built-in |
| XFA forms | PDF Oxide — only library that supports them |
| 100% reliability | PDF Oxide — 100% pass rate |
| OCR without Tesseract | PDF Oxide — built-in PaddleOCR |
| SVG export | PyMuPDF |
| Pure Python, no binaries | pypdf |
Get started in 10 seconds:
pip install pdf_oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
Related Pages
- PDF Oxide vs PyMuPDF — detailed comparison
- PDF Oxide vs pypdf — detailed comparison
- vs All Python PDF Libraries — full ecosystem comparison
- Performance Benchmarks — methodology and results