pdfplumber vs PyMuPDF – Speed, Tables, and Licensing
pdfplumber and PyMuPDF are popular Python PDF libraries, but both force you into trade-offs. pdfplumber is great for tables but 29× slower than PDF Oxide in the benchmarks below. PyMuPDF is fast, but its AGPL-3.0 license means closed-source commercial products must either open their source or buy a commercial license. This page compares both and shows why PDF Oxide is a better choice for most use cases.
The short answer: PDF Oxide is 29× faster than pdfplumber, 5.8× faster than PyMuPDF, MIT-licensed, and handles text, images, forms, encryption, Markdown output, and OCR — all in one library. The only area where pdfplumber still leads is complex table extraction with visual debugging.
Quick Comparison
| | pdfplumber | PyMuPDF | PDF Oxide |
|---|---|---|---|
| License | MIT | AGPL-3.0 | MIT |
| Language | Pure Python | C (MuPDF) | Rust + PyO3 |
| Mean extraction time | 23.2ms | 4.6ms | 0.8ms |
| p99 extraction time | 189ms | 28ms | 9ms |
| Pass rate (3,823 valid PDFs) | 98.8% | 99.3% | 100% |
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Yes | Yes |
| Table extraction | Advanced | Basic | Basic |
| Image extraction | No | Yes | Yes |
| Visual debugging | Yes | No | No |
| PDF creation | No | Yes | Yes |
| PDF editing | No | Yes | Yes |
| Markdown output | No | No | Yes |
| HTML output | No | Yes (get_text("html")) | Yes |
| Form fields | Read only | Read + Write | Read + Write |
| Encryption | Read only (password) | Read + Write | Read + Write |
| Rendering | No | Yes | Yes |
| OCR | No | Tesseract | Built-in (PaddleOCR) |
| Install size | ~1 MB | ~20 MB | ~5 MB |
| Python versions | 3.8+ | 3.8–3.12 | 3.8–3.14 |
Speed Benchmarks
All three libraries were benchmarked on the same corpus of 3,830 PDFs drawn from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). The corpus covers every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, and complex layouts.
| Metric | pdfplumber | PyMuPDF | PDF Oxide |
|---|---|---|---|
| Mean extraction time | 23.2ms | 4.6ms | 0.8ms |
| p99 extraction time | 189ms | 28ms | 9ms |
| Relative to PDF Oxide | 29x slower | 5.8x slower | 1x |
| Pass rate (valid PDFs) | 98.8% (3,777/3,823) | 99.3% (3,796/3,823) | 100% (3,823/3,823) |
PyMuPDF is roughly 5x faster than pdfplumber because it delegates all parsing to the MuPDF C library. pdfplumber builds on pdfminer for parsing, then adds its own spatial analysis layer – both written in pure Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust running directly in the Python process via PyO3, which accounts for its 5.8x advantage over PyMuPDF and 29x advantage over pdfplumber.
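The mean, p99, and relative-speed figures above can be reproduced for any extractor with a small timing harness. The following is a minimal sketch, not the methodology behind the published numbers: `time_extractor`, `summarize`, and `relative_slowdown` are hypothetical helper names, the p99 uses the nearest-rank method, and an extractor is assumed to be any callable that takes a file path.

```python
import math
import time
from typing import Callable, Iterable


def time_extractor(extract: Callable[[str], object], paths: Iterable[str]) -> list[float]:
    """Return per-file wall-clock latency in milliseconds for one extractor."""
    samples = []
    for path in paths:
        start = time.perf_counter()
        extract(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Mean and nearest-rank p99 of a non-empty list of latencies."""
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest-rank index
    return {"mean_ms": sum(ordered) / len(ordered), "p99_ms": ordered[rank - 1]}


def relative_slowdown(mean_ms: float, baseline_mean_ms: float) -> float:
    """How many times slower a library is than the baseline."""
    return mean_ms / baseline_mean_ms


# Ratios derived from the mean times quoted in the table above.
print(round(relative_slowdown(23.2, 0.8), 1), round(relative_slowdown(4.6, 0.8), 1))
```

Passing the same list of paths to each library's extraction function keeps the comparison like-for-like; a real benchmark would also want warm-up runs and repeated trials.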
What the Numbers Mean in Practice
| Workload | pdfplumber | PyMuPDF | PDF Oxide |
|---|---|---|---|
| 100 PDFs | 2.3 seconds | 0.46 seconds | 0.08 seconds |
| 1,000 PDFs | 23 seconds | 4.6 seconds | 0.8 seconds |
| 10,000 PDFs | 3.9 minutes | 46 seconds | 8 seconds |
| 100,000 PDFs | 39 minutes | 7.7 minutes | 80 seconds |
For one-off scripts processing a handful of files, the speed difference is irrelevant. For production pipelines processing thousands of documents daily, the gap between 39 minutes and 80 seconds changes architecture decisions.
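The workload table is plain arithmetic on the mean latencies, so it is easy to redo for your own volumes. A short sketch, assuming per-file latency is independent and processing is sequential (`batch_seconds` is a hypothetical helper, not part of any of the three libraries):

```python
def batch_seconds(mean_ms: float, n_files: int) -> float:
    """Estimated wall-clock seconds to process n_files one after another."""
    if n_files < 0:
        raise ValueError("n_files must be non-negative")
    return mean_ms * n_files / 1000.0


# Mean latencies quoted in the benchmark table above.
MEAN_MS = {"pdfplumber": 23.2, "PyMuPDF": 4.6, "PDF Oxide": 0.8}

for name, mean_ms in MEAN_MS.items():
    secs = batch_seconds(mean_ms, 100_000)
    print(f"{name}: {secs:.0f} s ({secs / 60:.1f} min)")
```

The estimate ignores file I/O, process start-up, and parallelism, so treat it as a lower bound on a single core rather than a prediction.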
Table Extraction
Table extraction is the primary reason developers choose pdfplumber over PyMuPDF. This is where pdfplumber genuinely excels.
pdfplumber: Structured Table Parsing
pdfplumber provides dedicated table extraction with configurable line detection, cell merging, and visual debugging:
import pdfplumber
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    # Extract all tables as structured data
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
    # Fine-tune detection with custom settings
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "snap_tolerance": 5,
    })
    # Visual debugging: render page with detected table boundaries
    im = page.to_image()
    im.debug_tablefinder()
    im.save("debug.png")
pdfplumber returns structured row/column data and handles merged cells, spanning headers, and borderless tables. The visual debugging overlay is invaluable for tuning extraction parameters on tricky layouts.
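The nested lists returned by `extract_tables()` usually need one more step before they are useful downstream. Below is a minimal sketch that turns one table into a list of dictionaries. It assumes the first row holds the column names (pdfplumber does not label header rows itself), and `rows_to_records` is a hypothetical helper name.

```python
from typing import Optional

Row = list[Optional[str]]


def rows_to_records(table: list[Row]) -> list[dict[str, str]]:
    """Turn a header row plus data rows into a list of dicts.

    Assumes the first row holds column names; None cells become "".
    """
    if not table:
        return []
    # Blank or missing header cells get a positional fallback name.
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        # Pad short rows so every record has every column.
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records


# Sample data shaped like one table from page.extract_tables().
sample = [["Item", "Qty", None], ["Widget", "2", "9.50"], ["Gadget", None, "4.00"]]
print(rows_to_records(sample))
```

From here the records can go straight into `csv.DictWriter` or `json.dumps`. Tables with multi-row headers would need a different header rule.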
PyMuPDF: Basic Table Detection
PyMuPDF added table detection in recent versions, but it is less mature than pdfplumber’s algorithms:
import fitz
doc = fitz.open("invoice.pdf")
page = doc[0]
# PyMuPDF's built-in table finder (added in v1.23)
tabs = page.find_tables()
for table in tabs.tables:
    df = table.to_pandas()  # requires pandas
    print(df)
PyMuPDF’s table extraction works for simple grid-based tables with visible borders. It struggles with borderless layouts, multi-level headers, and cells spanning multiple rows or columns – exactly the cases where pdfplumber is strongest.
PDF Oxide: Markdown Table Output
PDF Oxide converts tables to Markdown syntax as part of its structured output pipeline:
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
# Tables are detected and converted to Markdown table format
md = doc.to_markdown(0, detect_headings=True)
print(md)
# Also available as HTML with table tags
html = doc.to_html(0)
print(html)
PDF Oxide’s table detection is functional for standard grid layouts and produces clean Markdown or HTML output. For complex tables with merged cells, borderless designs, or spanning headers, pdfplumber’s dedicated algorithms remain more robust.
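To compare the Markdown route with pdfplumber's list-of-rows output on the same document, it helps to render both in the same format. The sketch below converts pdfplumber-style rows to a pipe-delimited Markdown table. `rows_to_markdown` is a hypothetical helper, and it assumes the first row is the header; it is not how PDF Oxide builds its output internally.

```python
def rows_to_markdown(table: list[list]) -> str:
    """Render a list of rows as a pipe-delimited Markdown table.

    Assumes the first row is the header. None cells become empty, and
    literal pipes are escaped so they do not split a cell.
    """
    if not table:
        return ""
    width = max(len(row) for row in table)

    def fmt(row: list) -> str:
        cells = [
            str(c if c is not None else "").replace("|", r"\|").replace("\n", " ")
            for c in row
        ]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(table[0]), "|" + "---|" * width]
    lines.extend(fmt(row) for row in table[1:])
    return "\n".join(lines)


print(rows_to_markdown([["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]))
```

Diffing this string against the table portion of the Markdown output is a quick way to spot cells that one library merged or dropped.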
Table Extraction Summary
| Capability | pdfplumber | PyMuPDF | PDF Oxide |
|---|---|---|---|
| Simple bordered tables | Yes | Yes | Yes |
| Borderless tables | Yes | Limited | Limited |
| Merged cells | Yes | Limited | Limited |
| Multi-level headers | Yes | No | No |
| Configurable detection | Yes | Limited | No |
| Visual debugging | Yes | No | No |
| Output format | Python lists | pandas DataFrames | Markdown / HTML |
| Speed | Slow (pure Python) | Fast | Fastest |
If complex table extraction is your only use case, pdfplumber is the best tool. If you need tables alongside fast text extraction, image extraction, or PDF creation, PDF Oxide covers more ground.
Text Extraction
For plain text extraction, all three libraries get the job done but differ in speed and API design.
pdfplumber
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)
PyMuPDF
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)
PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
All three produce comparable text output for well-formed PDFs. PDF Oxide achieves 99.5% text parity with PyMuPDF across the full corpus, with the remaining 0.5% difference in whitespace normalization and ligature handling.
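The page does not say how text parity was measured, so the following is only one plausible way to compare two extractions of the same page: fold ligatures and whitespace first (the two differences named above), then compute a similarity ratio. `normalize` and `text_parity` are hypothetical names, and `difflib` is a stand-in for whatever metric the benchmark actually uses.

```python
import difflib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold ligatures via NFKC and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def text_parity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two extractions after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()


# "\ufb01" is the single-character "fi" ligature that some extractors emit.
print(text_parity("of\ufb01ce   hours\n", "office hours"))
```

`SequenceMatcher` is quadratic in the worst case, so for long documents it is more practical to compare page by page.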
Character-Level Positioning
All three libraries provide character-level position data, which is important for spatial analysis, bounding box detection, and custom layout reconstruction.
pdfplumber
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")
PyMuPDF
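import fitz
doc = fitz.open("report.pdf")
page = doc[0]
# "rawdict" adds a per-character "chars" list to every span
blocks = page.get_text("rawdict")["blocks"]
for block in blocks:
    if "lines" in block:  # image blocks have no "lines" key
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    x, y = ch["origin"]
                    print(f"'{ch['c']}' at ({x:.1f}, {y:.1f}) size={span['size']:.1f}")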
PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")
pdfplumber returns per-character dictionaries with rich metadata. PyMuPDF returns nested block/line/span structures. PDF Oxide returns flat character objects with position and font data.
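Custom layout reconstruction usually starts by grouping characters back into lines. Here is a minimal sketch using pdfplumber-style dictionaries with `text`, `x0`, and `top` keys. `chars_to_lines` and the 2-point tolerance are assumptions for illustration, and records from the other two libraries would first need mapping to these keys.

```python
def chars_to_lines(chars: list[dict], y_tolerance: float = 2.0) -> list[str]:
    """Group character dicts into text lines by vertical position.

    Expects keys "text", "x0", and "top". A character joins the current
    line when its "top" is within y_tolerance of that line's first character.
    """
    lines: list[list[dict]] = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters left to right.
    return ["".join(c["text"] for c in sorted(line, key=lambda c: c["x0"])) for line in lines]


sample = [
    {"text": "b", "x0": 20.0, "top": 100.4},
    {"text": "a", "x0": 10.0, "top": 100.0},
    {"text": "c", "x0": 10.0, "top": 120.0},
]
print(chars_to_lines(sample))
```

The sketch does not insert spaces between words. A fuller version would compare the gap between neighbouring `x0` values against the font size.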
Licensing
This is the most consequential difference between pdfplumber and PyMuPDF for commercial projects.
| | pdfplumber | PyMuPDF | PDF Oxide |
|---|---|---|---|
| License | MIT | AGPL-3.0 | MIT |
| Commercial product | Yes | Requires commercial license | Yes |
| Closed-source SaaS | Yes | Requires commercial license | Yes |
| Docker distribution | Yes | Requires commercial license | Yes |
| Internal tools | Yes | Yes | Yes |
| Open-source project | Yes | Yes (if AGPL-compatible) | Yes |
PyMuPDF’s AGPL Problem
PyMuPDF wraps MuPDF, which is AGPL-3.0 licensed. If you distribute software that includes PyMuPDF – including SaaS, web apps, and Docker containers – your code must be open-sourced under AGPL or you must buy a commercial license from Artifex.
Artifex does not publish commercial license pricing publicly. You must contact their sales team for a quote. Licenses are typically per-application, renewed annually, with no free tier or startup exception.
pdfplumber and PDF Oxide Are Both MIT
Both pdfplumber and PDF Oxide are MIT licensed. Use either in any project – commercial, proprietary, SaaS, or open source – with no obligations. If licensing is your primary concern and you are choosing between pdfplumber and PyMuPDF, pdfplumber (or PDF Oxide) is the safer choice.
Encrypted PDFs
Encryption handling is a common pain point for developers working with password-protected documents, and the three libraries cover different amounts of it.
pdfplumber: Read-Only Password Support
pdfplumber can open a password-protected PDF when you pass the password to pdfplumber.open(), which hands it to pdfminer.six for decryption. It cannot create or re-encrypt PDFs, and opening an encrypted file without the correct password raises an error:
import pdfplumber
# Pass the password when opening an encrypted PDF:
with pdfplumber.open("encrypted.pdf", password="password") as pdf:
    text = pdf.pages[0].extract_text()
If you need to write encrypted output, you still need a second tool such as PyMuPDF or pypdf, which adds another dependency to your pipeline.
PyMuPDF: Full Encryption Support
import fitz
doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()
PyMuPDF supports both user and owner passwords, AES-128 and AES-256 encryption, and can create encrypted PDFs.
PDF Oxide: Full Encryption Support
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)
PDF Oxide handles all standard PDF encryption methods (RC4, AES-128, AES-256) for both reading and writing. No additional dependencies or preprocessing required.
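When a pipeline mixes libraries, it can help to know whether a file is encrypted before deciding which one should open it. The sketch below is a library-independent heuristic that scans for the `/Encrypt` key that encrypted PDFs carry in their trailer. `looks_encrypted` is a hypothetical helper, and a raw byte scan can produce false positives, so treat the result as a hint rather than a guarantee.

```python
import tempfile
from pathlib import Path


def looks_encrypted(path: str) -> bool:
    """Heuristic check for an /Encrypt entry in a PDF file.

    Encrypted PDFs reference an encryption dictionary via /Encrypt in the
    trailer or cross-reference stream dictionary. A raw byte scan may give
    false positives if the same bytes appear elsewhere in the file.
    """
    data = Path(path).read_bytes()
    if b"%PDF-" not in data[:1024]:
        raise ValueError(f"{path} does not look like a PDF")
    return b"/Encrypt" in data


# Demo on two tiny hand-written stand-ins (not complete, valid PDFs).
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp, "plain.pdf")
    locked = Path(tmp, "locked.pdf")
    plain.write_bytes(b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n")
    locked.write_bytes(b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF\n")
    print(looks_encrypted(str(plain)), looks_encrypted(str(locked)))
```

Reading the whole file is fine for a sketch. For large corpora, scanning only the first and last few kilobytes would catch most trailers at lower cost.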
Image Extraction
Another gap in pdfplumber’s feature set. pdfplumber does not extract embedded images from PDFs.
PyMuPDF
import fitz
doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])
PDF Oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
If your pipeline requires extracting both text and images from PDFs, pdfplumber cannot handle the image side. You need PyMuPDF, PDF Oxide, or pypdfium2 for that.
Markdown and HTML Output
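Neither pdfplumber nor PyMuPDF provides built-in Markdown conversion. PyMuPDF can emit page-level HTML through page.get_text("html"), but Markdown requires a separate package. PDF Oxide offers both conversions in the core library.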
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
# Markdown with heading detection and table formatting
md = doc.to_markdown(0, detect_headings=True)
print(md)
# HTML with semantic tags
html = doc.to_html(0)
print(html)
For LLM pipelines, RAG systems, and document conversion workflows, structured Markdown output eliminates the need for a separate conversion step. PyMuPDF users typically rely on the separate pymupdf4llm package, which is 69x slower than PDF Oxide’s built-in conversion.
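In a RAG pipeline, the Markdown string is typically split into sections before embedding. Here is a minimal sketch of heading-based chunking. It assumes the Markdown uses ATX headings (`#`, `##`), which this page does not confirm for `to_markdown`, and `split_markdown_by_heading` is a hypothetical helper.

```python
import re


def split_markdown_by_heading(md: str, max_level: int = 2) -> list[dict[str, str]]:
    """Split Markdown into sections at ATX headings up to max_level.

    Returns dicts with "heading" and "body". Text before the first heading
    gets an empty heading. Fenced code blocks are not special-cased.
    """
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$")
    sections = [{"heading": "", "body": []}]
    for line in md.splitlines():
        match = pattern.match(line)
        if match:
            sections.append({"heading": match.group(2).strip(), "body": []})
        else:
            sections[-1]["body"].append(line)
    return [
        {"heading": s["heading"], "body": "\n".join(s["body"]).strip()}
        for s in sections
        if s["heading"] or any(part.strip() for part in s["body"])
    ]


sample = "intro\n# Title\npara one\n## Methods\nstep\n### Detail\nmore\n"
for section in split_markdown_by_heading(sample):
    print(repr(section["heading"]), len(section["body"]))
```

Keeping deeper headings inside their parent section (as `### Detail` is here) tends to produce chunks with enough context to stand alone. Lower `max_level` for larger chunks.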
When to Choose Each Library
Choose pdfplumber if:
- Complex table extraction is your primary use case. pdfplumber’s table algorithms handle merged cells, borderless tables, and spanning headers better than any other Python library.
- You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries – invaluable for tuning extraction on tricky documents.
- You want a pure-Python solution. No compiled dependencies, installs anywhere Python runs.
- Speed is not a concern. If you process fewer than a hundred files at a time, the 23ms mean is perfectly acceptable.
Choose PyMuPDF if:
- You already have a commercial MuPDF license and depend on MuPDF-specific rendering or SVG export.
- You need high-fidelity rendering. MuPDF’s rendering engine is mature and handles complex PDFs well.
- Your project is AGPL-compatible. If you are building open-source software under AGPL or a compatible license, PyMuPDF’s licensing is not a concern.
- You need OCR via Tesseract. PyMuPDF has built-in Tesseract integration for scanned documents.
Choose PDF Oxide if:
- You need speed and broad feature coverage. 0.8ms mean extraction – 5.8x faster than PyMuPDF, 29x faster than pdfplumber – with text, images, forms, creation, and encryption in one library.
- You want MIT licensing without sacrificing speed. pdfplumber is MIT but slow. PyMuPDF is fast but AGPL. PDF Oxide is both MIT and fast.
- You need Markdown or HTML output. Built-in structured conversion for LLM pipelines and RAG systems.
- You need encrypted PDF support with a permissive license. pdfplumber can only read password-protected files. PyMuPDF reads and writes them but requires AGPL compliance. PDF Oxide reads and writes encrypted PDFs under MIT.
- You want a single library for extraction, creation, and editing. Both pdfplumber and PyMuPDF require additional tools for parts of the PDF workflow. PDF Oxide covers extraction, creation, editing, rendering, and validation.
Use PDF Oxide + pdfplumber together:
For pipelines that need fast text extraction, image extraction, and complex table parsing, use PDF Oxide for the general pipeline and pdfplumber for tables:
from pdf_oxide import PdfDocument
import pdfplumber
# Fast text and image extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
images = doc.extract_images(0)
# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
Installation
# pdfplumber
pip install pdfplumber
# PyMuPDF
pip install pymupdf
# PDF Oxide
pip install pdf_oxide
All three install via pip. pdfplumber and PDF Oxide are MIT licensed. PyMuPDF is AGPL-3.0 – review the licensing implications before adding it to a commercial project.
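After installing, a quick check confirms which of the three libraries the current interpreter can actually import. The sketch uses only the standard library. The import names are the ones used in this page's examples (`fitz` being PyMuPDF's legacy import name), and `installed_libraries` is a hypothetical helper.

```python
from importlib.util import find_spec

# Display name -> import name, as used in the examples on this page.
CANDIDATES = {"pdfplumber": "pdfplumber", "PyMuPDF": "fitz", "PDF Oxide": "pdf_oxide"}


def installed_libraries() -> dict[str, bool]:
    """Map each library's display name to whether its module can be found."""
    return {name: find_spec(module) is not None for name, module in CANDIDATES.items()}


print(installed_libraries())
```

`find_spec` locates a module without importing it, so the check is cheap and has no side effects from the libraries themselves.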
The Verdict
pdfplumber and PyMuPDF both solve parts of the problem. PDF Oxide solves the whole thing.
| What matters to you | Best choice |
|---|---|
| Maximum speed | PDF Oxide (0.8ms – 29× faster than pdfplumber) |
| Complex table extraction | pdfplumber (visual debugging, merged cells) |
| Permissive license + speed | PDF Oxide – pdfplumber is MIT but slow, PyMuPDF is fast but AGPL |
| Encrypted PDFs | PDF Oxide or PyMuPDF – pdfplumber can read with a password but cannot write encrypted PDFs |
| Image extraction | PDF Oxide or PyMuPDF – pdfplumber has no image support |
| Markdown/HTML output | PDF Oxide – only one of the three with built-in Markdown conversion |
| OCR without Tesseract | PDF Oxide – built-in PaddleOCR |
| One library for everything | PDF Oxide – extraction, creation, editing, encryption, OCR |
Unless your entire workflow is complex table extraction (borderless tables, merged cells, visual debugging), PDF Oxide replaces both pdfplumber and PyMuPDF — faster, more features, MIT-licensed.
Get started in 10 seconds:
pip install pdf_oxide
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0) # 29× faster than pdfplumber
md = doc.to_markdown(0) # built-in, no separate package
images = doc.extract_images(0) # pdfplumber can't do this
Related Pages
- PDF Oxide vs PyMuPDF – detailed comparison with migration guide
- PDF Oxide vs pdfplumber – detailed comparison with code examples
- vs Python PDF Libraries – all Python libraries compared
- Performance Benchmarks – full corpus benchmark methodology
- Extract Tables from PDF – table extraction guide
- Getting Started with Python – installation and first extraction