PyMuPDF vs pypdf — Which Python PDF Library?

PyMuPDF and pypdf are two of the most popular Python PDF libraries, but both have significant trade-offs. PyMuPDF is fast, but it is licensed under AGPL-3.0. pypdf is permissively licensed, but it is about 2.6× slower than PyMuPDF at text extraction (and 15× slower than PDF Oxide). This page compares them head-to-head and shows why PDF Oxide is a better choice than either.

The short answer: PDF Oxide is 5.8× faster than PyMuPDF, 15× faster than pypdf, MIT-licensed, and offers features neither of them has, including built-in Markdown/HTML output, XFA form support, and OCR with no system dependencies.

Quick Comparison

| | PyMuPDF | pypdf | PDF Oxide |
|---|---|---|---|
| License | AGPL-3.0 | BSD-3 | MIT |
| Language | C (MuPDF) | Pure Python | Rust + PyO3 |
| Mean extraction time | 4.6ms | 12.1ms | 0.8ms |
| p99 extraction time | 28ms | 97ms | 9ms |
| Pass rate (3,830 PDFs) | 99.3% | 98.4% | 100% |
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Partial | Yes |
| Image extraction | Yes | Yes | Yes |
| Form fields | Read + Write | Read + Write | Read + Write |
| PDF creation | Yes | Limited (merge only) | Yes (Markdown/HTML) |
| Markdown output | No | No | Yes |
| HTML output | No | No | Yes |
| Rendering | Yes | No | Yes |
| OCR | Tesseract | No | Built-in (PaddleOCR) |
| Install size | ~20 MB | ~1 MB | ~5 MB |
| Encryption | Read + Write | Read + Write | Read + Write |
| Search | Yes | No | Regex + spatial |
| Python versions | 3.8–3.12 | 3.6+ | 3.8–3.14 |

PyMuPDF is faster and more feature-rich than pypdf, but its AGPL license is a dealbreaker for many commercial projects. pypdf is lighter and BSD-licensed, but significantly slower and more limited in extraction capabilities. PDF Oxide combines the speed advantage of a native engine with the licensing freedom of a permissive license.

Licensing: AGPL vs BSD vs MIT

The licensing difference between PyMuPDF and pypdf is often the deciding factor for teams choosing between them.

PyMuPDF — AGPL-3.0

PyMuPDF wraps MuPDF, which is licensed under AGPL-3.0. This is a strong copyleft license. If you distribute software that uses PyMuPDF, or offer it to users over a network (AGPL extends copyleft to network use), your entire application must be released under AGPL-3.0. This covers SaaS applications, Docker containers, web services, desktop apps, and CLI tools, and it means publishing your full source code under the same license.

The alternative is purchasing a commercial license from Artifex, the company behind MuPDF. Artifex does not publish pricing publicly; you must contact their sales team for a quote. Commercial licenses are typically annual and priced per application.

AGPL affects you if:

  • You ship a product that includes PyMuPDF (desktop app, mobile app, Electron)
  • You run a SaaS or web service that processes PDFs with PyMuPDF
  • You distribute Docker images that contain PyMuPDF
  • You provide an API that uses PyMuPDF internally

AGPL does not affect you if:

  • Your project is already open-sourced under an AGPL-compatible license
  • You use PyMuPDF only for internal tooling that is never distributed

pypdf — BSD-3

pypdf uses the BSD 3-Clause license, which is permissive. You can use pypdf in commercial products, closed-source software, and SaaS applications without any obligation to open-source your code. The only requirement is retaining the copyright notice in redistributions.

PDF Oxide — MIT

PDF Oxide is MIT licensed — the most permissive common open-source license. Use it in any context (commercial, proprietary, SaaS, open source) with no restrictions beyond including the license text.

Licensing Summary

| Use Case | PyMuPDF (AGPL) | pypdf (BSD) | PDF Oxide (MIT) |
|---|---|---|---|
| Commercial product | Requires license | Yes | Yes |
| Closed-source SaaS | Requires license | Yes | Yes |
| Docker distribution | Requires license | Yes | Yes |
| Internal tools | Yes | Yes | Yes |
| Open-source (AGPL-compatible) | Yes | Yes | Yes |
| Open-source (MIT/BSD/Apache) | No | Yes | Yes |

For commercial projects where licensing compliance matters, pypdf and PDF Oxide are both safe choices. PyMuPDF requires either open-sourcing your application or purchasing a commercial license.

Speed Benchmarks

All benchmarks were run on the same 3,830-PDF corpus — three independent, publicly available test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs) covering every PDF specification version (1.0–2.0), encrypted files, CJK encodings, complex layouts, and malformed documents.
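The page does not describe its timing harness, so the following is only a minimal sketch of how per-file mean and p99 figures of this kind could be collected. The helper names `time_extractions` and `summarize` are hypothetical, the nearest-rank percentile is one of several common definitions, and `extract` stands for any callable that takes a path and returns text.

```python
import statistics
import time
from typing import Callable, Iterable


def time_extractions(extract: Callable[[str], str], paths: Iterable[str]) -> list[float]:
    """Return wall-clock milliseconds spent in `extract` for each path."""
    timings_ms = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return timings_ms


def summarize(timings_ms: list[float]) -> dict[str, float]:
    """Mean and nearest-rank 99th percentile of a non-empty list of timings."""
    ordered = sorted(timings_ms)
    # ceil(0.99 * n) computed with integers to avoid floating-point surprises.
    rank = -(-99 * len(ordered) // 100)
    return {"mean_ms": statistics.fmean(ordered), "p99_ms": ordered[rank - 1]}
```

A fair comparison would also need warm-up runs and identical file ordering for every library, which this sketch leaves out.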

Text Extraction Speed

| Library | Mean | p99 | Relative to PDF Oxide |
|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 1x |
| PyMuPDF | 4.6ms | 28ms | 5.8x slower |
| pypdf | 12.1ms | 97ms | 15.1x slower |

PyMuPDF is 2.6x faster than pypdf because it delegates parsing to MuPDF’s C engine. pypdf does everything in pure Python — parsing, font decoding, text assembly — which means every operation pays the interpreter overhead.

PDF Oxide is faster than both because its Rust core handles all PDF parsing, font decoding, and text layout natively via PyO3, with only the final result crossing the Python boundary. There is no subprocess overhead, no C library bridging through ctypes, and no interpreter bottleneck.
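The relative figures quoted in this section follow directly from the mean times. As a quick check, here is a small worked calculation; it uses `Decimal` with explicit half-up rounding so that 5.75 rounds to 5.8 as in the table. The function name `slowdown` is just an illustrative helper.

```python
from decimal import Decimal, ROUND_HALF_UP

# Mean extraction times in milliseconds, copied from the table above.
MEAN_MS = {
    "PDF Oxide": Decimal("0.8"),
    "PyMuPDF": Decimal("4.6"),
    "pypdf": Decimal("12.1"),
}


def slowdown(library: str, baseline: str) -> Decimal:
    """How many times slower `library` is than `baseline`, to one decimal place."""
    ratio = MEAN_MS[library] / MEAN_MS[baseline]
    return ratio.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


for name in ("PyMuPDF", "pypdf"):
    print(f"{name}: {slowdown(name, 'PDF Oxide')}x slower than PDF Oxide")
print(f"pypdf: {slowdown('pypdf', 'PyMuPDF')}x slower than PyMuPDF")
```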

Reliability

| Library | Valid PDFs Passed | Pass Rate |
|---|---|---|
| PDF Oxide | 3,823 / 3,823 | 100% |
| PyMuPDF | 3,796 / 3,823 | 99.3% |
| pypdf | 3,762 / 3,823 | 98.4% |

PyMuPDF fails on 27 valid PDFs in the corpus. pypdf fails on 61. In both cases, these are valid PDF files that the library either crashes on or returns empty/incorrect text from. PDF Oxide handles all 3,823 valid PDFs without failure.

The 7 non-passing files in the full 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams) and are excluded from pass-rate calculations for all libraries.
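The pass rates can be reproduced from the raw counts, and the same counts give an expected number of failures per 1,000 documents. A short worked calculation, with hypothetical helper names:

```python
VALID_PDFS = 3823  # 3,830 files minus the 7 intentionally broken fixtures

# Valid PDFs each library handled, from the reliability table.
PASSED = {"PDF Oxide": 3823, "PyMuPDF": 3796, "pypdf": 3762}


def pass_rate_percent(passed: int, total: int = VALID_PDFS) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


def failures_per_thousand(passed: int, total: int = VALID_PDFS) -> int:
    """Expected failures per 1,000 documents, rounded to the nearest integer."""
    return round(1000 * (total - passed) / total)


for library, passed in PASSED.items():
    print(library, pass_rate_percent(passed), failures_per_thousand(passed))
```

These rates describe this corpus only; a production workload with a different mix of documents may behave differently.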

What This Means in Practice

For a pipeline processing thousands of PDFs daily, PyMuPDF’s 99.3% pass rate means roughly 7 failures per 1,000 documents. pypdf’s 98.4% means 16 failures per 1,000. These are documents you need to handle with fallback logic, manual review, or simply accept as lost data.

PDF Oxide’s 100% pass rate on the test corpus means fewer edge cases to handle in production.
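The fallback logic mentioned above can be sketched independently of any particular library. This is a minimal sketch, not a recommended production design: `extract_with_fallback` is a hypothetical helper, and each extractor is assumed to be a plain callable that takes a file path and returns text.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str, extractors: Sequence[tuple[str, Extractor]]
) -> tuple[Optional[str], Optional[str], list[str]]:
    """Try each (name, extractor) pair in order.

    Returns (text, name_of_extractor_that_succeeded, error_log). An attempt
    counts as a failure if the extractor raises or returns only whitespace.
    """
    errors: list[str] = []
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception as exc:  # parsers raise many different exception types
            errors.append(f"{name}: {type(exc).__name__}: {exc}")
            continue
        if text and text.strip():
            return text, name, errors
        errors.append(f"{name}: empty text")
    return None, None, errors
```

Treating empty text as a failure is a deliberate simplification. A scanned page with no text layer legitimately yields nothing, so a real pipeline would probably route those documents to OCR or manual review instead.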

Feature Comparison

Text Extraction

All three libraries support basic text extraction. The API styles differ:

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF uses a page-object model (doc[0] returns a page). pypdf uses a reader/pages pattern. PDF Oxide uses page indices directly.
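Because the three call styles differ, code that needs to switch libraries may benefit from a thin adapter. The sketch below wraps the calls from the examples above behind one signature. `extract_page_text` and `BACKENDS` are hypothetical names, and the `pdf_oxide` calls are taken from this page rather than verified independently. Imports are done lazily so that only the selected library has to be installed.

```python
from typing import Callable


def _pymupdf_text(path: str, page: int) -> str:
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(path) as doc:
        return doc[page].get_text()


def _pypdf_text(path: str, page: int) -> str:
    from pypdf import PdfReader

    return PdfReader(path).pages[page].extract_text()


def _pdf_oxide_text(path: str, page: int) -> str:
    from pdf_oxide import PdfDocument  # API as shown in the example above

    return PdfDocument(path).extract_text(page)


BACKENDS: dict[str, Callable[[str, int], str]] = {
    "pymupdf": _pymupdf_text,
    "pypdf": _pypdf_text,
    "pdf_oxide": _pdf_oxide_text,
}


def extract_page_text(path: str, page: int = 0, backend: str = "pypdf") -> str:
    """Extract one page's text through a common signature."""
    try:
        extract = BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        ) from None
    return extract(path, page)
```

With this in place, `extract_page_text("report.pdf", 0, backend="pymupdf")` and the other two backends can be swapped by changing a single string.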

For character-level extraction (positions, font sizes, bounding boxes), PyMuPDF provides get_text("dict") which returns a nested dict structure. pypdf offers partial character position data. PDF Oxide provides extract_chars() with per-character bounding boxes and font metadata.

Markdown Conversion

This is a significant differentiator. Many LLM and RAG pipelines need Markdown output from PDFs.

PyMuPDF:

# PyMuPDF has no built-in Markdown conversion.
# You need pymupdf4llm, a separate package:
import pymupdf4llm

md = pymupdf4llm.to_markdown("paper.pdf")

pymupdf4llm works but is 69x slower than PDF Oxide’s built-in Markdown conversion (55.5ms mean vs 0.8ms). It is also a separate dependency with its own maintenance cycle.

pypdf:

# pypdf has no Markdown conversion.
# You would need an external tool chain (e.g., extract text,
# then use a separate library to structure it as Markdown).

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

PDF Oxide’s Markdown conversion is built-in, handles heading detection, preserves table structure, and runs at the same speed as plain text extraction.
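For RAG pipelines, Markdown with detected headings is typically split into sections before embedding. The following is a minimal, library-independent sketch of that step. It operates on any Markdown string, whichever tool produced it, and `split_by_headings` is a hypothetical helper rather than part of any library discussed here.

```python
import re

# ATX headings: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[dict[str, str]]:
    """Split Markdown into sections, one per heading line."""
    sections: list[dict[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Flush the previous section unless it is completely empty.
            if title or any(part.strip() for part in body):
                sections.append({"title": title, "text": "\n".join(body).strip()})
            title, body = match.group(2), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append({"title": title, "text": "\n".join(body).strip()})
    return sections
```

This sketch does not track fenced code blocks, so a line starting with `#` inside a code sample would be treated as a heading. A real chunker would also need to cap section length.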

HTML Conversion

PyMuPDF: No dedicated HTML converter, though page.get_text("html") returns a page-level HTML rendering.

pypdf: No HTML output.

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
html = doc.to_html(0)
print(html)

Form Fields

All three libraries support reading and writing form fields (AcroForm).

PyMuPDF:

import fitz

doc = fitz.open("form.pdf")
page = doc[0]
for widget in page.widgets():
    print(f"{widget.field_name}: {widget.field_value}")

pypdf:

from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields()
for name, field in fields.items():
    print(f"{name}: {field.get('/V', '')}")

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name}: {field.value}")

One notable difference: PDF Oxide supports XFA forms (XML Forms Architecture), which are used in many government and enterprise PDF forms. Neither PyMuPDF nor pypdf handles XFA form data extraction.

Image Extraction

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

All three handle embedded image extraction. PyMuPDF’s approach requires a two-step xref lookup. pypdf and PDF Oxide offer more streamlined APIs.

Rendering

PyMuPDF can render PDF pages to images (PNG, JPEG) using MuPDF’s rendering engine. pypdf cannot render pages at all. PDF Oxide includes a built-in rendering engine.

OCR

PyMuPDF integrates with Tesseract for OCR on scanned PDFs. pypdf has no OCR support. PDF Oxide has built-in OCR via PaddleOCR, requiring no external system dependencies.

PDF Creation

PyMuPDF can create PDFs but requires manual placement of text, images, and shapes on pages — there is no high-level API for creating PDFs from structured content.

pypdf cannot create PDFs from scratch. It can merge, split, and modify existing PDFs, but for creation you need a separate library like reportlab or fpdf2.

PDF Oxide can create PDFs from Markdown or HTML:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Invoice\n\n| Item | Price |\n|------|-------|\n| Widget | $9.99 |")
pdf.save("invoice.pdf")
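When the content comes from data rather than a hand-written string, the Markdown can be assembled first and then passed to the converter. A minimal sketch of building such a table follows; `markdown_table` is a hypothetical helper, and only the string construction is shown, so it runs without PDF Oxide installed.

```python
from typing import Sequence


def markdown_table(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render headers and rows as a pipe-delimited Markdown table."""
    for row in rows:
        if len(row) != len(headers):
            raise ValueError(f"row {row!r} does not have {len(headers)} cells")
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    # Escape literal pipes so cell text cannot break the column layout.
    lines += [
        "| " + " | ".join(cell.replace("|", r"\|") for cell in row) + " |"
        for row in rows
    ]
    return "\n".join(lines)


invoice_md = "# Invoice\n\n" + markdown_table(["Item", "Price"], [["Widget", "$9.99"]])
print(invoice_md)
```

The resulting `invoice_md` string has the same shape as the literal in the example above and could be handed to `Pdf.from_markdown` in the same way.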

Encryption

All three libraries support reading encrypted PDFs and writing encrypted output.

PyMuPDF:

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
text = doc[0].get_text()

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

Feature Summary

| Feature | PyMuPDF | pypdf | PDF Oxide |
|---|---|---|---|
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Partial | Yes |
| Image extraction | Yes | Yes | Yes |
| Form fields (AcroForm) | Read + Write | Read + Write | Read + Write |
| XFA forms | No | No | Yes |
| PDF creation | Manual | No | Markdown/HTML |
| Markdown output | No (pymupdf4llm) | No | Built-in |
| HTML output | No | No | Built-in |
| Rendering | Yes | No | Yes |
| OCR | Tesseract | No | Built-in (PaddleOCR) |
| Search | Yes | No | Regex + spatial |
| Encryption | Read + Write | Read + Write | Read + Write |
| PDF/A validation | No | No | Yes |
| SVG export | Yes | No | No |
| Merge/split | Yes | Yes | Yes |

When to Choose Each Library

Choose pypdf if:

  • You need a pure-Python solution with no compiled C or Rust extensions
  • You are doing simple PDF manipulation (merge, split, rotate, encrypt/decrypt)
  • Speed is not critical for your use case
  • You want the smallest possible install footprint (~1 MB)
  • You need broad Python version support (3.6+)

Choose PyMuPDF if:

  • You already have a commercial MuPDF license from Artifex
  • You need SVG export from PDF pages
  • Your project is already licensed under AGPL-3.0
  • You depend on MuPDF-specific rendering behavior

Choose PDF Oxide if:

  • You need maximum text extraction speed (5.8x faster than PyMuPDF, 15x faster than pypdf)
  • You want MIT licensing for commercial or closed-source use
  • You need built-in Markdown or HTML output for LLM/RAG pipelines
  • You need XFA form support
  • You want built-in OCR without external system dependencies
  • You want the highest pass rate on the benchmark corpus (100% of valid PDFs)

Installation

# PyMuPDF
pip install pymupdf

# pypdf
pip install pypdf

# PDF Oxide
pip install pdf_oxide

All three are available via pip. PyMuPDF ships a ~20 MB wheel with bundled MuPDF. pypdf is pure Python at ~1 MB. PDF Oxide ships pre-built wheels (~5 MB) for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64).
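After installing, it can be handy to confirm which of the three libraries are importable in the current environment. Here is a small sketch using only the standard library. The module names match the import statements used on this page (`fitz` for PyMuPDF), and `installed_backends` is a hypothetical helper.

```python
from importlib.util import find_spec

# Name used with pip -> module name used in the import statements above.
MODULES = {"pymupdf": "fitz", "pypdf": "pypdf", "pdf_oxide": "pdf_oxide"}


def installed_backends() -> dict[str, bool]:
    """Report which libraries can be imported, without actually importing them."""
    return {dist: find_spec(module) is not None for dist, module in MODULES.items()}


if __name__ == "__main__":
    for dist, present in installed_backends().items():
        print(f"{dist}: {'installed' if present else 'missing'}")
```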

The Verdict

If you’re choosing between PyMuPDF and pypdf, you’re choosing between speed and licensing freedom. PDF Oxide gives you both — faster than PyMuPDF, more permissive than pypdf, with features neither library offers.

| What matters to you | Best choice |
|---|---|
| Maximum speed | PDF Oxide (0.8ms) |
| Permissive license | PDF Oxide (MIT) or pypdf (BSD) |
| Speed + permissive license | PDF Oxide (the only option) |
| Markdown/HTML output | PDF Oxide (built-in) |
| XFA forms | PDF Oxide (only library that supports them) |
| 100% reliability | PDF Oxide (100% pass rate) |
| OCR without Tesseract | PDF Oxide (built-in PaddleOCR) |
| SVG export | PyMuPDF |
| Pure Python, no binaries | pypdf |

Get started in 10 seconds:

# Install (shell)
pip install pdf_oxide

# Extract text (Python)
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)