What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

pdfplumber vs PyMuPDF – Speed, Tables, and Licensing

pdfplumber and PyMuPDF are popular Python PDF libraries, but both force you into trade-offs. pdfplumber is great for tables but 29× slower than PDF Oxide at text extraction. PyMuPDF is fast, but its AGPL-3.0 license requires you to either open-source your code or buy a commercial license before shipping it in a closed-source product. This page compares both and shows why PDF Oxide is a better choice for most use cases.

The short answer: PDF Oxide is 29× faster than pdfplumber, 5.8× faster than PyMuPDF, MIT-licensed, and handles text, images, forms, encryption, Markdown output, and OCR — all in one library. The only area where pdfplumber still leads is complex table extraction with visual debugging.

Quick Comparison

Feature	pdfplumber	PyMuPDF	PDF Oxide
License	MIT	AGPL-3.0	MIT
Language	Pure Python	C (MuPDF)	Rust + PyO3
Mean extraction time	23.2ms	4.6ms	0.8ms
p99 extraction time	189ms	28ms	9ms
Pass rate (3,823 valid PDFs)	98.8%	99.3%	100%
Text extraction	Yes	Yes	Yes
Character positions	Yes	Yes	Yes
Table extraction	Advanced	Basic	Basic
Image extraction	No	Yes	Yes
Visual debugging	Yes	No	No
PDF creation	No	Yes	Yes
PDF editing	No	Yes	Yes
Markdown output	No	No	Yes
HTML output	No	Yes	Yes
Form fields	Read only	Read + Write	Read + Write
Encryption	Read only	Read + Write	Read + Write
Rendering	No	Yes	Yes
OCR	No	Tesseract	Built-in (PaddleOCR)
Install size	~1 MB	~20 MB	~5 MB
Python versions	3.8+	3.8–3.12	3.8–3.14

Speed Benchmarks

All three libraries benchmarked on the same corpus of 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). The corpus covers every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, and complex layouts.

Metric	pdfplumber	PyMuPDF	PDF Oxide
Mean extraction time	23.2ms	4.6ms	0.8ms
p99 extraction time	189ms	28ms	9ms
Relative to PDF Oxide	29x slower	5.8x slower	1x
Pass rate (valid PDFs)	98.8% (3,777/3,823)	99.3% (3,796/3,823)	100% (3,823/3,823)

PyMuPDF is roughly 5x faster than pdfplumber because it delegates all parsing to the MuPDF C library. pdfplumber builds on pdfminer for parsing, then adds its own spatial analysis layer – both written in pure Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust running directly in the Python process via PyO3, which accounts for its 5.8x advantage over PyMuPDF and 29x advantage over pdfplumber.
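
The mean, p99, and pass-rate columns can be approximated with a small timing harness. The sketch below is not the benchmark's actual code: `benchmark`, its `extract` callable, and the nearest-rank p99 are assumptions made for illustration, and any of the three libraries can be plugged in as `extract`.

```python
import math
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time `extract` on each path; report mean/p99 in ms and the pass rate."""
    timings_ms = []
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            continue  # a raised exception counts as a failed file
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}

    ordered = sorted(timings_ms)
    # Nearest-rank percentile: smallest sample with >= 99% of samples at or below it
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p99_ms": ordered[rank - 1],
        "pass_rate": len(ordered) / total,
    }
```

For example, `benchmark(lambda p: PdfDocument(p).extract_text(0), paths)` would time first-page extraction with PDF Oxide. Only successful files contribute to the timing statistics here; whether the published numbers treat failures the same way is not stated.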

What the Numbers Mean in Practice

Workload	pdfplumber	PyMuPDF	PDF Oxide
100 PDFs	2.3 seconds	0.46 seconds	0.08 seconds
1,000 PDFs	23 seconds	4.6 seconds	0.8 seconds
10,000 PDFs	3.9 minutes	46 seconds	8 seconds
100,000 PDFs	39 minutes	7.7 minutes	80 seconds

For one-off scripts processing a handful of files, the speed difference is irrelevant. For production pipelines processing thousands of documents daily, the gap between 39 minutes and 80 seconds changes architecture decisions.
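
The workload table is just the mean extraction time multiplied by the number of files. A minimal sketch of that arithmetic, assuming sequential single-process extraction and that every file costs the corpus mean (`projected_seconds` is an illustrative name):

```python
# Mean per-PDF extraction times from the benchmark table, in milliseconds
MEAN_MS = {"pdfplumber": 23.2, "PyMuPDF": 4.6, "PDF Oxide": 0.8}


def projected_seconds(mean_ms: float, n_pdfs: int) -> float:
    """Total sequential extraction time in seconds for n_pdfs files."""
    return mean_ms * n_pdfs / 1000.0


for n in (100, 1_000, 10_000, 100_000):
    row = ", ".join(
        f"{name}: {projected_seconds(ms, n):.2f}s" for name, ms in MEAN_MS.items()
    )
    print(f"{n:>7} PDFs -> {row}")
```

Real pipelines will deviate from these figures because file sizes vary and work can be spread across processes.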

Table Extraction

Table extraction is the primary reason developers choose pdfplumber over PyMuPDF. This is where pdfplumber genuinely excels.

pdfplumber: Structured Table Parsing

pdfplumber provides dedicated table extraction with configurable line detection, cell merging, and visual debugging:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all tables as structured data
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

    # Fine-tune detection with custom settings
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "snap_tolerance": 5,
    })

    # Visual debugging: render page with detected table boundaries
    im = page.to_image()
    im.debug_tablefinder()
    im.save("debug.png")

pdfplumber returns structured row/column data and handles merged cells, spanning headers, and borderless tables. The visual debugging overlay is invaluable for tuning extraction parameters on tricky layouts.
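
Because each table comes back as a list of rows, a common next step is to key every row by the header. A minimal sketch, assuming the first row is the header and that empty cells may come back as `None`; `table_to_records` is a hypothetical helper, not part of pdfplumber:

```python
from typing import Optional


def table_to_records(table: list[list[Optional[str]]]) -> list[dict[str, str]]:
    """Turn a list-of-rows table into dicts keyed by the first (header) row."""
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        # Pad short rows so every record carries all header keys
        cells += [""] * (len(header) - len(cells))
        records.append(dict(zip(header, cells)))
    return records
```

Each table from `page.extract_tables()` can be passed through this helper before it is loaded into a database or serialized as JSON.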

PyMuPDF: Basic Table Detection

PyMuPDF added table detection in recent versions, but it is less mature than pdfplumber’s algorithms:

import fitz

doc = fitz.open("invoice.pdf")
page = doc[0]

# PyMuPDF's built-in table finder (added in v1.23)
tabs = page.find_tables()
for table in tabs:
    df = table.to_pandas()  # requires pandas
    print(df)

PyMuPDF’s table extraction works for simple grid-based tables with visible borders. It struggles with borderless layouts, multi-level headers, and cells spanning multiple rows or columns – exactly the cases where pdfplumber is strongest.

PDF Oxide: Markdown Table Output

PDF Oxide converts tables to Markdown syntax as part of its structured output pipeline:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Tables are detected and converted to Markdown table format
md = doc.to_markdown(0, detect_headings=True)
print(md)

# Also available as HTML with table tags
html = doc.to_html(0)
print(html)

PDF Oxide’s table detection is functional for standard grid layouts and produces clean Markdown or HTML output. For complex tables with merged cells, borderless designs, or spanning headers, pdfplumber’s dedicated algorithms remain more robust.

Table Extraction Summary

Capability	pdfplumber	PyMuPDF	PDF Oxide
Simple bordered tables	Yes	Yes	Yes
Borderless tables	Yes	Limited	Limited
Merged cells	Yes	Limited	Limited
Multi-level headers	Yes	No	No
Configurable detection	Yes	Limited	No
Visual debugging	Yes	No	No
Output format	Python lists	pandas DataFrames	Markdown / HTML
Speed	Slow (pure Python)	Fast	Fastest

If complex table extraction is your only use case, pdfplumber is the best tool. If you need tables alongside fast text extraction, image extraction, or PDF creation, PDF Oxide covers more ground.

Text Extraction

For plain text extraction, all three libraries get the job done but differ in speed and API design.

pdfplumber

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

PyMuPDF

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

All three produce comparable text output for well-formed PDFs. PDF Oxide achieves 99.5% text parity with PyMuPDF across the full corpus, with the remaining 0.5% difference in whitespace normalization and ligature handling.
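
The source does not say how the parity figure was computed. One way to compare two extractions while ignoring the whitespace and ligature differences mentioned above is sketched here; `normalize` and `similarity` are illustrative names, and the NFKC fold plus `difflib` ratio are assumptions rather than the benchmark's method:

```python
import difflib
import unicodedata


def normalize(text: str) -> str:
    """Fold compatibility ligatures (e.g. U+FB01) and collapse whitespace runs."""
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized extractions."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()
```

Comparing `similarity(text_a, text_b)` per page shows where two libraries genuinely disagree rather than differing only in spacing.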

Character-Level Positioning

All three libraries provide character-level position data, which is important for spatial analysis, bounding box detection, and custom layout reconstruction.

pdfplumber

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

PyMuPDF

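import fitz

doc = fitz.open("report.pdf")
page = doc[0]
# "rawdict" adds a per-character "chars" list to every span
blocks = page.get_text("rawdict")["blocks"]
for block in blocks:
    if "lines" in block:  # image blocks have no "lines" key
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    x0, y0 = ch["bbox"][:2]
                    print(f"'{ch['c']}' at ({x0:.1f}, {y0:.1f}) "
                          f"size={span['size']:.1f}")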

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber returns per-character dictionaries with rich metadata. PyMuPDF returns nested block/line/span structures, with per-character entries available through its "rawdict" output. PDF Oxide returns flat character objects with position and font data.
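
Whatever shape the library returns, custom layout reconstruction usually starts by grouping characters into lines. A minimal sketch, assuming characters have been reduced to `(text, x, y)` tuples with `y` growing down the page (as pdfplumber's `top` does); `chars_to_lines` and its fixed-bucket tolerance are illustrative simplifications:

```python
from collections import defaultdict


def chars_to_lines(chars, y_tolerance=2.0):
    """Group (text, x, y) character tuples into lines of text.

    Characters whose y values round into the same tolerance bucket are
    treated as one line; within a line they are ordered left to right.
    """
    buckets = defaultdict(list)
    for text, x, y in chars:
        buckets[round(y / y_tolerance)].append((x, text))

    lines = []
    for key in sorted(buckets):  # top of page first
        lines.append("".join(text for _, text in sorted(buckets[key])))
    return lines
```

With pdfplumber the input could be built as `[(c["text"], c["x0"], c["top"]) for c in page.chars]`. Fixed buckets can split a line that straddles a bucket boundary, so production code would more likely cluster on the gaps between neighboring y values.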

Licensing

This is the most consequential difference between pdfplumber and PyMuPDF for commercial projects.

Use case	pdfplumber	PyMuPDF	PDF Oxide
License	MIT	AGPL-3.0	MIT
Commercial product	Yes	Requires commercial license	Yes
Closed-source SaaS	Yes	Requires commercial license	Yes
Docker distribution	Yes	Requires commercial license	Yes
Internal tools	Yes	Yes	Yes
Open-source project	Yes	Yes (if AGPL-compatible)	Yes

PyMuPDF’s AGPL Problem

PyMuPDF wraps MuPDF, which is AGPL-3.0 licensed. If you distribute software that includes PyMuPDF – including SaaS, web apps, and Docker containers – your code must be open-sourced under AGPL or you must buy a commercial license from Artifex.

Artifex does not publish commercial license pricing. You must contact their sales team for a quote. Licenses are typically per-application, renewed annually, with no free tier or startup exception.

pdfplumber and PDF Oxide Are Both MIT

Both pdfplumber and PDF Oxide are MIT licensed. Use either in any project – commercial, proprietary, SaaS, or open source – with no obligations. If licensing is your primary concern and you are choosing between pdfplumber and PyMuPDF, pdfplumber (or PDF Oxide) is the safer choice.

Encrypted PDFs

Encryption support is uneven across the three libraries and a common pain point for developers working with password-protected documents.

pdfplumber: Read-Only Encryption Support

pdfplumber can open a password-protected PDF if you pass the password to pdfplumber.open(), but it cannot create or re-encrypt PDFs:

import pdfplumber

# Supply the password when opening an encrypted PDF
with pdfplumber.open("encrypted.pdf", password="password") as pdf:
    text = pdf.pages[0].extract_text()

If your pipeline also needs to write encrypted PDFs, you need a second library such as PyMuPDF or pypdf alongside pdfplumber, which adds another dependency.

PyMuPDF: Full Encryption Support

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()

PyMuPDF supports both user and owner passwords, AES-128 and AES-256 encryption, and can create encrypted PDFs.

PDF Oxide: Full Encryption Support

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

PDF Oxide handles all standard PDF encryption methods (RC4, AES-128, AES-256) for both reading and writing. No additional dependencies or preprocessing required.

Image Extraction

Image extraction is another gap in pdfplumber’s feature set. pdfplumber lists image metadata through page.images but has no built-in way to save embedded images from PDFs.

PyMuPDF

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

If your pipeline requires extracting both text and images from PDFs, pdfplumber cannot handle the image side. You need PyMuPDF, PDF Oxide, or pypdfium2 for that.

Markdown and HTML Output

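Neither pdfplumber nor PyMuPDF provides built-in Markdown conversion, and of the two only PyMuPDF offers HTML output, through page.get_text("html"). PDF Oxide provides both Markdown and HTML from the same document object.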

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")

# Markdown with heading detection and table formatting
md = doc.to_markdown(0, detect_headings=True)
print(md)

# HTML with semantic tags
html = doc.to_html(0)
print(html)

For LLM pipelines, RAG systems, and document conversion workflows, structured Markdown output eliminates the need for a separate conversion step. PyMuPDF users typically rely on the separate pymupdf4llm package, which is 69x slower than PDF Oxide’s built-in conversion.
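
In a RAG pipeline the Markdown is usually split into chunks before embedding. A minimal sketch of heading-based chunking, assuming the converter emits ATX headings (`# Title`); `chunk_by_heading` is a hypothetical helper, not part of PDF Oxide, and it does not special-case `#` lines inside code fences:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, each starting at an ATX heading."""

    def has_content(chunk):
        return bool(chunk["heading"]) or any(line.strip() for line in chunk["body"])

    chunks = []
    current = {"heading": "", "body": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                chunks.append(current)
            current = {"heading": match.group(2).strip(), "body": []}
        else:
            current["body"].append(line)
    if has_content(current):
        chunks.append(current)

    return [
        {"heading": c["heading"], "text": "\n".join(c["body"]).strip()}
        for c in chunks
    ]
```

Passing the string returned by `doc.to_markdown(0, detect_headings=True)` to this function yields one chunk per section, with any text before the first heading kept under an empty heading.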

When to Choose Each Library

Choose pdfplumber if:

Complex table extraction is your primary use case. pdfplumber’s table algorithms handle merged cells, borderless tables, and spanning headers better than any other Python library.
You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries – invaluable for tuning extraction on tricky documents.
You want a pure-Python solution. No compiled dependencies, installs anywhere Python runs.
Speed is not a concern. If you process fewer than a hundred files at a time, the 23ms mean is perfectly acceptable.

Choose PyMuPDF if:

You already have a commercial MuPDF license and depend on MuPDF-specific rendering or SVG export.
You need high-fidelity rendering. MuPDF’s rendering engine is mature and handles complex PDFs well.
Your project is AGPL-compatible. If you are building open-source software under AGPL or a compatible license, PyMuPDF’s licensing is not a concern.
You need OCR via Tesseract. PyMuPDF has built-in Tesseract integration for scanned documents.

Choose PDF Oxide if:

You need speed and broad feature coverage. 0.8ms mean extraction – 5.8x faster than PyMuPDF, 29x faster than pdfplumber – with text, images, forms, creation, and encryption in one library.
You want MIT licensing without sacrificing speed. pdfplumber is MIT but slow. PyMuPDF is fast but AGPL. PDF Oxide is both MIT and fast.
You need Markdown or HTML output. Built-in structured conversion for LLM pipelines and RAG systems.
You need to read and write encrypted PDFs under a permissive license. pdfplumber can only read them. PyMuPDF can do both but requires AGPL compliance. PDF Oxide handles both under MIT.
You want a single library for extraction, creation, and editing. Both pdfplumber and PyMuPDF require additional tools for parts of the PDF workflow. PDF Oxide covers extraction, creation, editing, rendering, and validation.

Use PDF Oxide + pdfplumber together:

For pipelines that need fast text extraction, image extraction, and complex table parsing, use PDF Oxide for the general pipeline and pdfplumber for tables:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text and image extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
images = doc.extract_images(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
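
When the combined output feeds an LLM, the pdfplumber rows can be rendered as Markdown so they sit alongside the extracted text. A minimal sketch, assuming the first row is the header and that cells may be `None`; `rows_to_markdown` is a hypothetical helper, not part of either library:

```python
def rows_to_markdown(table):
    """Render a list-of-rows table as a pipe-delimited Markdown table."""
    if not table:
        return ""
    width = max(len(row) for row in table)

    def fmt(row):
        cells = ["" if cell is None else str(cell) for cell in row]
        cells += [""] * (width - len(cells))  # pad short rows
        # Escape pipes and flatten newlines so each row stays on one line
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cells) + " |"

    lines = [fmt(table[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in table[1:]]
    return "\n".join(lines)
```

Calling `rows_to_markdown(t)` for each `t` in `tables` produces blocks that can be appended to the page text.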

Installation

# pdfplumber
pip install pdfplumber

# PyMuPDF
pip install pymupdf

# PDF Oxide
pip install pdf_oxide

All three install via pip. pdfplumber and PDF Oxide are MIT licensed. PyMuPDF is AGPL-3.0 – review the licensing implications before adding it to a commercial project.

The Verdict

pdfplumber and PyMuPDF both solve parts of the problem. PDF Oxide covers nearly all of it, with complex table extraction as the main exception.

What matters to you	Best choice
Maximum speed	PDF Oxide (0.8ms – 29× faster than pdfplumber)
Complex table extraction	pdfplumber (visual debugging, merged cells)
Permissive license + speed	PDF Oxide – pdfplumber is MIT but slow, PyMuPDF is fast but AGPL
Encrypted PDFs	PDF Oxide or PyMuPDF – pdfplumber can read them but not write them
Image extraction	PDF Oxide or PyMuPDF – pdfplumber has no image support
Markdown/HTML output	PDF Oxide – only library of the three with built-in Markdown conversion
OCR without Tesseract	PDF Oxide – built-in PaddleOCR
One library for everything	PDF Oxide – extraction, creation, editing, encryption, OCR

Unless your entire workflow is complex table extraction (borderless tables, merged cells, visual debugging), PDF Oxide replaces both pdfplumber and PyMuPDF — faster, more features, MIT-licensed.

Get started in 10 seconds:

pip install pdf_oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)       # 29× faster than pdfplumber
md = doc.to_markdown(0)          # built-in, no separate package
images = doc.extract_images(0)   # pdfplumber can't do this

Related Pages

PDF Oxide vs PyMuPDF – detailed comparison with migration guide
PDF Oxide vs pdfplumber – detailed comparison with code examples
vs Python PDF Libraries – all Python libraries compared
Performance Benchmarks – full corpus benchmark methodology
Extract Tables from PDF – table extraction guide
Getting Started with Python – installation and first extraction
