vs Python PDF Libraries
PDF Oxide compared with PyMuPDF (fitz), pypdfium2, pypdf, pdfplumber, pdfminer, and more. This page covers performance, feature coverage, licensing, and API differences to help you choose the right Python PDF library for text extraction.
Summary
| | PDF Oxide | PyMuPDF | pypdfium2 | pypdf | pdfplumber | pdfminer |
|---|---|---|---|---|---|---|
| Mean extraction time | 0.8ms | 4.6ms | 4.1ms | 12.1ms | 23.2ms | 16.8ms |
| Pass rate (3,823 valid PDFs) | 100% | 99.3% | 99.2% | 98.4% | 98.8% | 98.8% |
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 | MIT | MIT |
| Language | Rust + PyO3 | C (MuPDF) | C (PDFium) | Pure Python | Pure Python | Pure Python |
| Text extraction | Yes | Yes | Yes | Yes | Yes | Yes |
| Character positions | Yes | Yes | Yes | Partial | Yes | Yes |
| Image extraction | Yes | Yes | Yes | Yes | No | No |
| Form fields | Read + Write | Read + Write | Read only | Read + Write | Read only | No |
| PDF creation | Yes | Yes | No | Limited | No | No |
| PDF editing | Yes | Yes | No | Yes | No | No |
| Markdown output | Yes | No | No | No | No | No |
| HTML output | Yes | No | No | No | No | No |
| Encryption | Read + Write | Read + Write | Read only | Read + Write | No | No |
| PDF/A validation | Yes | No | No | No | No | No |
| Rendering | Yes | Yes | Yes | No | No | No |
| Search | Regex + spatial | Yes | Yes | No | No | No |
| Python versions | 3.8–3.14 | 3.8–3.12 | 3.8+ | 3.6+ | 3.8+ | 3.6+ |
| Install size | ~5 MB wheel | ~20 MB wheel | ~3 MB wheel | ~1 MB | ~1 MB | ~1 MB |
Performance Comparison
Mean text extraction time per PDF, benchmarked on the full 3,830-PDF corpus — three independent, publicly available test suites that together cover every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, complex layouts, and security edge cases. See full corpus details for what each suite tests and why these results are reproducible.
| Library | Mean | Relative | p99 | Pass Rate |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 1× | 9ms | 100% |
| PyMuPDF | 4.6ms | 5.8× | 28ms | 99.3% |
| pypdfium2 | 4.1ms | 5.1× | 42ms | 99.2% |
| pymupdf4llm | 55.5ms | 69× | 280ms | 99.1% |
| pdftext | 7.3ms | 9.1× | 82ms | 99.0% |
| pdfminer | 16.8ms | 21× | 124ms | 98.8% |
| pdfplumber | 23.2ms | 29× | 189ms | 98.8% |
| markitdown | 108.8ms | 136× | 378ms | 98.6% |
| pypdf | 12.1ms | 15.1× | 97ms | 98.4% |
PDF Oxide achieves its speed through a native Rust core compiled to a Python extension module via PyO3. There is no subprocess overhead or C library bridging — the Rust code runs directly in the Python process.
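Mean and p99 figures like those in the table could be gathered with a harness along the following lines. This is a minimal sketch, not the benchmark's actual code: `time_extractor` and the `extract` callable are hypothetical names, the p99 uses a simple nearest-rank rule (an assumption, since the percentile method is not stated here), and failed files are counted separately rather than timed.

```python
import math
import time
from statistics import mean
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Time one extraction call per path; return summary stats in milliseconds."""
    timings_ms = []
    failures = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1  # a failed file affects the pass rate, not the timings
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "passed": 0, "failed": failures}

    ordered = sorted(timings_ms)
    # Nearest-rank percentile (assumed method): the smallest sample with
    # at least 99% of all samples at or below it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean_ms": mean(ordered),
        "p99_ms": ordered[rank - 1],
        "passed": len(ordered),
        "failed": failures,
    }
```

Each library is wrapped in a one-argument callable that opens the file and extracts its text, so the same loop times every candidate under identical conditions.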
Reliability
PDF Oxide processes 3,823 of 3,823 valid PDFs without failure — a 100% pass rate. The 7 non-passing files in the 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
| Library | Valid PDFs Passed | Pass Rate |
|---|---|---|
| PDF Oxide | 3,823 / 3,823 | 100% |
| PyMuPDF | 3,796 / 3,823 | 99.3% |
| pypdfium2 | 3,792 / 3,823 | 99.2% |
| pymupdf4llm | 3,787 / 3,823 | 99.1% |
| pdftext | 3,784 / 3,823 | 99.0% |
| pdfminer | 3,777 / 3,823 | 98.8% |
| pdfplumber | 3,777 / 3,823 | 98.8% |
| markitdown | 3,771 / 3,823 | 98.6% |
| pypdf | 3,762 / 3,823 | 98.4% |
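The pass rates in this table follow directly from the counts. The short sketch below reproduces the arithmetic for three rows; `pass_rate` is a hypothetical helper written for illustration, and the counts are copied from the table.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the percentage of files processed without error, to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * passed / total, 1)


# Counts copied from the table above (valid PDFs only).
VALID_PDFS = 3823
results = {
    "PDF Oxide": 3823,
    "PyMuPDF": 3796,
    "pypdf": 3762,
}

for name, passed in results.items():
    failures = VALID_PDFS - passed
    print(f"{name}: {pass_rate(passed, VALID_PDFS)}% ({failures} failures)")
```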
Text Quality
PDF Oxide's extracted text reaches 99.5% parity with the output of PyMuPDF and pypdfium2 across the full corpus, measured by comparing the extracted text character by character. The remaining 0.5% comes from whitespace normalization and ligature handling, where PDF Oxide produces cleaner output.
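The exact parity metric is not spelled out here, so the following is only one plausible way to compute a character-level similarity score. It uses `difflib.SequenceMatcher` from the standard library as a stand-in; `text_parity` is a hypothetical helper, and the optional whitespace normalization is an assumption added to show how spacing differences can be separated from real content differences.

```python
from difflib import SequenceMatcher


def text_parity(candidate: str, reference: str, normalize_ws: bool = False) -> float:
    """Similarity of two extracted texts as a percentage (100.0 means identical)."""
    if normalize_ws:
        # Collapse whitespace runs so spacing differences do not count.
        candidate = " ".join(candidate.split())
        reference = " ".join(reference.split())
    if not candidate and not reference:
        return 100.0
    # autojunk=False stops the matcher from ignoring very frequent characters
    # in long texts, which would distort a character-level score.
    matcher = SequenceMatcher(None, candidate, reference, autojunk=False)
    return 100.0 * matcher.ratio()
```

Comparing page by page rather than whole documents keeps the matcher's running time manageable, since its worst case grows quadratically with text length.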
License Comparison
| Library | License | Commercial Use | Copyleft |
|---|---|---|---|
| PDF Oxide | MIT | Unrestricted | No |
| pypdfium2 | Apache-2.0 | Unrestricted | No |
| PyMuPDF | AGPL-3.0 | Requires commercial license ($) | Yes |
| pypdf | BSD-3 | Unrestricted | No |
| pdfplumber | MIT | Unrestricted | No |
| pdfminer | MIT | Unrestricted | No |
| pdftext | GPL-3.0 | Requires open source | Yes |
PyMuPDF uses MuPDF under the AGPL-3.0 license. If you distribute software that uses PyMuPDF, your software must also be released under AGPL-3.0 — or you must purchase a commercial license from Artifex. This applies to SaaS products, web applications, and any distributed binaries.
PDF Oxide is MIT-licensed. You can use it in proprietary products, SaaS platforms, or closed-source applications; the only obligation is to keep the copyright and license notice with copies of the software.
| Use Case | PDF Oxide (MIT) | PyMuPDF (AGPL) | pypdfium2 (Apache) | pypdf (BSD) | pdfplumber (MIT) | pdfminer (MIT) |
|---|---|---|---|---|---|---|
| Commercial product | Yes | Requires license | Yes | Yes | Yes | Yes |
| Closed source | Yes | No (unless licensed) | Yes | Yes | Yes | Yes |
| SaaS/cloud | Yes | Requires license | Yes | Yes | Yes | Yes |
| Internal tools | Yes | Yes | Yes | Yes | Yes | Yes |
API Comparison
Text Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)
pdfplumber:
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
print(text)
pdfminer:
from pdfminer.high_level import extract_text
text = extract_text("report.pdf", page_numbers=[0])
print(text)
Character-Level Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"size={ch.font_size:.1f}")
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
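# Note: "dict" reports text per span; use "rawdict" for per-character data
blocks = page.get_text("dict")["blocks"]
for block in blocks:
    if "lines" in block:  # image blocks have no "lines" key
        for line in block["lines"]:
            for span in line["spans"]:
                print(f"'{span['text']}' size={span['size']:.1f}")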
pdfplumber:
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")
pdfminer:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar
for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if hasattr(element, '__iter__'):
            for text_line in element:
                if hasattr(text_line, '__iter__'):
                    for char in text_line:
                        if isinstance(char, LTChar):
                            print(f"'{char.get_text()}' at ({char.x0:.1f}, {char.y0:.1f}) "
                                  f"size={char.size:.1f}")
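The nested `hasattr` checks in the pdfminer example can also be expressed as a single recursive generator. The sketch below is library-agnostic: `iter_leaves` is a hypothetical helper, not part of pdfminer, and it simply walks any nested iterable looking for instances of a given type. With pdfminer you would pass a page layout as `node` and `LTChar` as `leaf_type`.

```python
from typing import Any, Iterator, Type


def iter_leaves(node: Any, leaf_type: Type) -> Iterator[Any]:
    """Yield every instance of leaf_type found anywhere under node, depth first."""
    if isinstance(node, leaf_type):
        yield node
        return
    if isinstance(node, (str, bytes)):
        return  # strings are iterable but never contain layout objects
    try:
        children = iter(node)
    except TypeError:
        return  # not a container, nothing to descend into
    for child in children:
        yield from iter_leaves(child, leaf_type)
```

This handles layouts of any nesting depth, whereas the explicit loops only reach characters exactly two levels below the page.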
Image Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
PyMuPDF:
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)
PDF Creation
PDF Oxide:
from pdf_oxide import Pdf
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
# Also supports HTML (saved under a different name so the first file is not overwritten)
pdf = Pdf.from_html("<h1>Hello</h1><p>World</p>")
pdf.save("output_html.pdf")
PyMuPDF:
import fitz
doc = fitz.open()
page = doc.new_page()
text_point = fitz.Point(72, 72)
page.insert_text(text_point, "Hello World", fontsize=24)
doc.save("output.pdf")
pypdf:
# pypdf can merge/modify PDFs but cannot create from scratch with text content.
# Use reportlab or fpdf2 for creation, then merge with pypdf.
Encrypted PDFs
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
PyMuPDF:
import fitz
doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()
pypdf:
from pypdf import PdfReader
reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()
Markdown and HTML Output
PDF Oxide (unique feature):
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
# Convert to Markdown with heading detection
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Convert to HTML
html = doc.to_html(0)
print(html)
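Among the libraries in the summary table, PDF Oxide is the only one listed with built-in Markdown and HTML conversion. pymupdf4llm and markitdown, which appear in the benchmarks above, are separate packages that produce Markdown.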
Library Profiles
PDF Oxide
Strengths:
- Fastest text extraction in benchmarks due to Rust core — 5.8× faster than PyMuPDF
- 100% pass rate on the 3,823 valid PDFs in the 3,830-file corpus, the highest of any tested library
- Unified API for extraction, creation, and editing in a single library
- Built-in Markdown and HTML export with heading detection
- MIT licensed with no copyleft restrictions
- Native compliance validation (PDF/A, PDF/UA, PDF/X)
- Pre-built wheels for all major platforms and Python 3.8–3.14
- No system dependencies — the wheel includes everything
Limitations:
- Newer library with a smaller community
- Table extraction is basic compared to pdfplumber’s algorithms
- Rendering engine is less mature than MuPDF
PyMuPDF (fitz)
Strengths:
- Mature and battle-tested (backed by MuPDF, in development since 2005)
- Excellent rendering quality for complex PDFs
- Built-in OCR integration (Tesseract)
- Rich feature set: SVG export, page manipulation, table detection
Limitations:
- AGPL-3.0 license requires open-sourcing your application or purchasing a commercial license
- Large wheel size (~20 MB) due to bundled MuPDF
- No built-in Markdown export
- No compliance validation
pypdfium2
Strengths:
- Fast (backed by Google’s PDFium engine)
- Apache-2.0 license — permissive for commercial use
- Good rendering quality
Limitations:
- Limited text extraction API compared to PDF Oxide or PyMuPDF
- No PDF creation or editing
- No form field support beyond read-only
pypdf
Strengths:
- Pure Python — installs anywhere, no compiled dependencies
- Lightweight and well-maintained
- Good for PDF manipulation (merge, split, rotate, encrypt)
- Large community and extensive documentation
Limitations:
- 15× slower than PDF Oxide for text extraction
- Text extraction quality struggles with complex layouts
- No rendering, no Markdown/HTML export, no table extraction
pdfplumber
Strengths:
- Best table extraction of any Python PDF library
- Excellent character-level positioning data
- Visual debugging tools (annotated page images)
- MIT licensed
Limitations:
- Pure Python — 29× slower than PDF Oxide
- Read-only — no PDF creation or editing
- No encryption or rendering
pdfminer
Strengths:
- Detailed character and layout analysis
- Good CJK text support
- Foundation for pdfplumber and other tools
- MIT licensed
Limitations:
- 21× slower than PDF Oxide (pure Python, unoptimized)
- Read-only, no creation or editing
- Verbose API for common tasks
- Less actively maintained
When to Use Each
| Use Case | Recommended Library |
|---|---|
| Fast text extraction | PDF Oxide |
| Commercial / proprietary product | PDF Oxide, pypdfium2, pypdf, pdfplumber, or pdfminer |
| PyMuPDF alternative (MIT licensed) | PDF Oxide |
| PDF creation from Markdown/HTML | PDF Oxide |
| Compliance validation (PDF/A, PDF/X) | PDF Oxide |
| Table extraction from invoices | pdfplumber |
| Visual debugging of extraction | pdfplumber |
| Existing MuPDF investment | PyMuPDF (if AGPL-compatible) |
| Minimal dependencies | pypdf (pure Python) |
| Detailed layout analysis | pdfminer |
| OCR for scanned documents | PyMuPDF |
Installation
# PDF Oxide
pip install pdf_oxide
# PyMuPDF
pip install pymupdf
# pypdfium2
pip install pypdfium2
# pypdf
pip install pypdf
# pdfplumber
pip install pdfplumber
# pdfminer
pip install pdfminer.six
PDF Oxide ships pre-built wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). No compiler or system libraries required.
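After installing, a quick way to confirm which of these libraries are available in the current environment is to look up their import names without importing them. This is a small sketch using only the standard library; `installed_libraries` is a hypothetical helper, and the import names are the ones used in the examples on this page (`fitz` for PyMuPDF, `pdfminer` for the pdfminer.six package).

```python
from importlib.util import find_spec

# Top-level import names used in the examples on this page.
IMPORT_NAMES = {
    "PDF Oxide": "pdf_oxide",
    "PyMuPDF": "fitz",
    "pypdfium2": "pypdfium2",
    "pypdf": "pypdf",
    "pdfplumber": "pdfplumber",
    "pdfminer": "pdfminer",
}


def installed_libraries() -> dict:
    """Map each library name to True if its module can be located, without importing it."""
    return {name: find_spec(module) is not None for name, module in IMPORT_NAMES.items()}


if __name__ == "__main__":
    for name, present in installed_libraries().items():
        print(f"{name}: {'installed' if present else 'missing'}")
```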
Related Pages
- Performance Benchmarks – full corpus benchmark results
- Getting Started with Python – installation and first extraction
- Python API Reference – complete Python API
- vs Rust PDF Libraries – Rust ecosystem comparison