pdfplumber vs PyMuPDF – Speed, Tables, and Licensing

pdfplumber and PyMuPDF are popular Python PDF libraries, but both force you into trade-offs. pdfplumber is great for tables but, in the benchmark below, 29× slower than PDF Oxide. PyMuPDF is fast but licensed under AGPL-3.0, so closed-source or commercial distribution requires either releasing your own source under the AGPL or buying a commercial license. This page compares both and shows why PDF Oxide is a better choice for most use cases.

The short answer: in the benchmark below, PDF Oxide is 29× faster than pdfplumber and 5.8× faster than PyMuPDF. It is MIT-licensed and handles text, images, forms, encryption, Markdown output, and OCR in one library. The only area where pdfplumber still leads is complex table extraction with visual debugging.

Quick Comparison

| Feature | pdfplumber | PyMuPDF | PDF Oxide |
| --- | --- | --- | --- |
| License | MIT | AGPL-3.0 | MIT |
| Language | Pure Python | C (MuPDF) | Rust + PyO3 |
| Mean extraction time | 23.2ms | 4.6ms | 0.8ms |
| p99 extraction time | 189ms | 28ms | 9ms |
| Pass rate (3,823 valid PDFs) | 98.8% | 99.3% | 100% |
| Text extraction | Yes | Yes | Yes |
| Character positions | Yes | Yes | Yes |
| Table extraction | Advanced | Basic | Basic |
| Image extraction | No | Yes | Yes |
| Visual debugging | Yes | No | No |
| PDF creation | No | Yes | Yes |
| PDF editing | No | Yes | Yes |
| Markdown output | No | No | Yes |
| HTML output | No | Yes (get_text("html")) | Yes |
| Form fields | Read only | Read + Write | Read + Write |
| Encryption | Read only (password) | Read + Write | Read + Write |
| Rendering | No | Yes | Yes |
| OCR | No | Tesseract | Built-in (PaddleOCR) |
| Install size | ~1 MB | ~20 MB | ~5 MB |
| Python versions | 3.8+ | 3.8–3.12 | 3.8–3.14 |

Speed Benchmarks

All three libraries benchmarked on the same corpus of 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). The corpus covers every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, and complex layouts.

| Metric | pdfplumber | PyMuPDF | PDF Oxide |
| --- | --- | --- | --- |
| Mean extraction time | 23.2ms | 4.6ms | 0.8ms |
| p99 extraction time | 189ms | 28ms | 9ms |
| Relative to PDF Oxide | 29x slower | 5.8x slower | 1x |
| Pass rate (valid PDFs) | 98.8% (3,777/3,823) | 99.3% (3,796/3,823) | 100% (3,823/3,823) |

PyMuPDF is roughly 5x faster than pdfplumber because it delegates all parsing to the MuPDF C library. pdfplumber builds on pdfminer for parsing, then adds its own spatial analysis layer – both written in pure Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust running directly in the Python process via PyO3, which accounts for its 5.8x advantage over PyMuPDF and 29x advantage over pdfplumber.
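The three metrics in the table (mean, p99, pass rate) can be measured with a small harness. The sketch below is an assumption about how such a benchmark could be structured, not the harness that produced the numbers above. `benchmark` and `extract` are hypothetical names; `extract` is any callable that takes a file path and returns text.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], str], paths: Iterable[str]) -> dict:
    """Time `extract` on each path and report mean, p99, and pass rate."""
    timings_ms = []
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            # A failed file lowers the pass rate and is excluded from timings.
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    if not timings_ms:
        return {"mean_ms": None, "p99_ms": None, "pass_rate": 0.0}

    ordered = sorted(timings_ms)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it.
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return {
        "mean_ms": statistics.fmean(ordered),
        "p99_ms": ordered[rank - 1],
        "pass_rate": len(ordered) / total,
    }
```

Wrapping each library's extraction call in its own function and passing it to the same harness keeps the inputs and the timing method identical across libraries.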

What the Numbers Mean in Practice

| Workload | pdfplumber | PyMuPDF | PDF Oxide |
| --- | --- | --- | --- |
| 100 PDFs | 2.3 seconds | 0.46 seconds | 0.08 seconds |
| 1,000 PDFs | 23 seconds | 4.6 seconds | 0.8 seconds |
| 10,000 PDFs | 3.9 minutes | 46 seconds | 8 seconds |
| 100,000 PDFs | 39 minutes | 7.7 minutes | 80 seconds |

For one-off scripts processing a handful of files, the speed difference is irrelevant. For production pipelines processing thousands of documents daily, the gap between 39 minutes and 80 seconds changes architecture decisions.
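The workload figures follow directly from the mean extraction times: batch time is the mean per file multiplied by the number of files. A minimal sketch of that arithmetic, assuming sequential processing in a single process (the helper names are illustrative, and the table rounds its values slightly):

```python
MEAN_MS = {"pdfplumber": 23.2, "PyMuPDF": 4.6, "PDF Oxide": 0.8}  # from the benchmark table


def batch_seconds(mean_ms: float, n_files: int) -> float:
    """Estimated sequential wall time in seconds for n_files at mean_ms each."""
    return mean_ms * n_files / 1000.0


def humanize(seconds: float) -> str:
    """Format as seconds below two minutes, otherwise as minutes."""
    if seconds < 120:
        return f"{seconds:g} seconds"
    return f"{seconds / 60:.1f} minutes"


for n in (100, 1_000, 10_000, 100_000):
    row = ", ".join(
        f"{name}: {humanize(batch_seconds(ms, n))}" for name, ms in MEAN_MS.items()
    )
    print(f"{n:>7,} PDFs -> {row}")
```

Real pipelines add I/O and scheduling overhead on top of these estimates, so treat them as lower bounds.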

Table Extraction

Table extraction is the primary reason developers choose pdfplumber over PyMuPDF. This is where pdfplumber genuinely excels.

pdfplumber: Structured Table Parsing

pdfplumber provides dedicated table extraction with configurable line detection, cell merging, and visual debugging:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]

    # Extract all tables as structured data
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

    # Fine-tune detection with custom settings
    tables = page.extract_tables({
        "vertical_strategy": "text",
        "horizontal_strategy": "lines",
        "snap_tolerance": 5,
    })

    # Visual debugging: render page with detected table boundaries
    im = page.to_image()
    im.debug_tablefinder()
    im.save("debug.png")

pdfplumber returns structured row/column data and handles merged cells, spanning headers, and borderless tables. The visual debugging overlay is invaluable for tuning extraction parameters on tricky layouts.

PyMuPDF: Basic Table Detection

PyMuPDF added table detection in v1.23, but it is less mature than pdfplumber’s algorithms:

import fitz

doc = fitz.open("invoice.pdf")
page = doc[0]

# PyMuPDF's built-in table finder (added in v1.23)
tabs = page.find_tables()
for table in tabs.tables:
    df = table.to_pandas()  # requires pandas
    print(df)

PyMuPDF’s table extraction works for simple grid-based tables with visible borders. It struggles with borderless layouts, multi-level headers, and cells spanning multiple rows or columns – exactly the cases where pdfplumber is strongest.

PDF Oxide: Markdown Table Output

PDF Oxide converts tables to Markdown syntax as part of its structured output pipeline:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Tables are detected and converted to Markdown table format
md = doc.to_markdown(0, detect_headings=True)
print(md)

# Also available as HTML with table tags
html = doc.to_html(0)
print(html)

PDF Oxide’s table detection is functional for standard grid layouts and produces clean Markdown or HTML output. For complex tables with merged cells, borderless designs, or spanning headers, pdfplumber’s dedicated algorithms remain more robust.

Table Extraction Summary

| Capability | pdfplumber | PyMuPDF | PDF Oxide |
| --- | --- | --- | --- |
| Simple bordered tables | Yes | Yes | Yes |
| Borderless tables | Yes | Limited | Limited |
| Merged cells | Yes | Limited | Limited |
| Multi-level headers | Yes | No | No |
| Configurable detection | Yes | Limited | No |
| Visual debugging | Yes | No | No |
| Output format | Python lists | pandas DataFrames | Markdown / HTML |
| Speed | Slow (pure Python) | Fast | Fastest |

If complex table extraction is your only use case, pdfplumber is the best tool. If you need tables alongside fast text extraction, image extraction, or PDF creation, PDF Oxide covers more ground.
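The three output formats in the summary are easy to bridge. As an illustration, the sketch below turns a list-of-rows table, the shape pdfplumber's extract_tables() returns, into a Markdown pipe table. `rows_to_markdown` is a hypothetical helper, not part of any of the three libraries, and it assumes the first row is the header and that empty cells may arrive as None.

```python
from typing import Optional, Sequence


def rows_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render a list-of-rows table as a Markdown pipe table; first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def clean(cell: Optional[str]) -> str:
        # Empty cells may be None; newlines and pipes would break the row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    def line(row: Sequence[Optional[str]]) -> str:
        # Pad short rows so every line has the same number of columns.
        cells = [clean(c) for c in row] + [""] * (width - len(row))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([line(header), separator, *(line(r) for r in body)])
```

This keeps pdfplumber's stronger detection for difficult tables while still producing Markdown for downstream tools.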

Text Extraction

For plain text extraction, all three libraries get the job done but differ in speed and API design.

pdfplumber

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

PyMuPDF

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

All three produce comparable text output for well-formed PDFs. PDF Oxide achieves 99.5% text parity with PyMuPDF across the full corpus, with the remaining 0.5% difference in whitespace normalization and ligature handling.
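A parity figure like this depends on how two extractions are compared. The sketch below shows one plausible way to score similarity, with whitespace and ligature differences either folded away or left in. It is an assumption for illustration, not the method behind the 99.5% figure; `normalize` and `text_parity` are hypothetical names.

```python
import difflib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold ligatures (NFKC) and collapse whitespace runs to single spaces."""
    folded = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", folded).strip()


def text_parity(a: str, b: str, *, normalized: bool = True) -> float:
    """Similarity ratio in [0, 1] between two extractions of the same page."""
    if normalized:
        a, b = normalize(a), normalize(b)
    # autojunk=False disables the popular-element heuristic on long inputs.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Comparing the score with and without normalization shows how much of a mismatch comes from whitespace and ligatures rather than missing text. SequenceMatcher is slow on very long strings, so compare page by page rather than whole documents.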

Character-Level Positioning

All three libraries provide character-level position data, which is important for spatial analysis, bounding box detection, and custom layout reconstruction.

pdfplumber

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

PyMuPDF

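import fitz

doc = fitz.open("report.pdf")
page = doc[0]
# "rawdict" adds a per-character "chars" list to each span ("dict" stops at spans)
blocks = page.get_text("rawdict")["blocks"]
for block in blocks:
    for line in block.get("lines", []):  # image blocks have no "lines"
        for span in line["spans"]:
            for ch in span["chars"]:
                x, y = ch["origin"]
                print(f"'{ch['c']}' at ({x:.1f}, {y:.1f}) size={span['size']:.1f}")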

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber returns per-character dictionaries with rich metadata. PyMuPDF returns nested block/line/span structures, with per-character entries available in "rawdict" mode. PDF Oxide returns flat character objects with position and font data.
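Character positions are the raw material for custom layout reconstruction. As a minimal sketch, the function below groups character dictionaries into text lines by vertical position. It assumes dictionaries with the text, x0, and top keys shown in the pdfplumber example; `chars_to_lines` and its `y_tolerance` parameter are hypothetical, and the same idea applies to the other libraries after mapping their fields.

```python
from typing import Dict, List


def chars_to_lines(chars: List[Dict], y_tolerance: float = 2.0) -> List[str]:
    """Group character dicts (keys: text, x0, top) into text lines, top to bottom."""
    lines: List[List[Dict]] = []
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        # Stay on the current line while the character is within the tolerance
        # of that line's first character; otherwise start a new line.
        if lines and abs(ch["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    # Within each line, order characters left to right before joining.
    return [
        "".join(c["text"] for c in sorted(line, key=lambda c: c["x0"]))
        for line in lines
    ]
```

A fixed tolerance is the simplest choice; scaling it by font size tends to work better on pages that mix body text with small footnotes.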

Licensing

This is the most consequential difference between pdfplumber and PyMuPDF for commercial projects.

| Use case | pdfplumber | PyMuPDF | PDF Oxide |
| --- | --- | --- | --- |
| License | MIT | AGPL-3.0 | MIT |
| Commercial product | Yes | Requires commercial license | Yes |
| Closed-source SaaS | Yes | Requires commercial license | Yes |
| Docker distribution | Yes | Requires commercial license | Yes |
| Internal tools | Yes | Yes | Yes |
| Open-source project | Yes | Yes (if AGPL-compatible) | Yes |

PyMuPDF’s AGPL Problem

PyMuPDF wraps MuPDF, which is AGPL-3.0 licensed. If you distribute software that includes PyMuPDF – including SaaS, web apps, and Docker containers – your code must be open-sourced under AGPL or you must buy a commercial license from Artifex.

Artifex does not publish commercial license pricing publicly. You must contact their sales team for a quote. Licenses are typically per-application, renewed annually, with no free tier or startup exception.

pdfplumber and PDF Oxide Are Both MIT

Both pdfplumber and PDF Oxide are MIT licensed. Use either in any project – commercial, proprietary, SaaS, or open source – with no obligations. If licensing is your primary concern and you are choosing between pdfplumber and PyMuPDF, pdfplumber (or PDF Oxide) is the safer choice.
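One way to keep licensing visible is to check the declared license of installed packages before shipping. The sketch below reads package metadata with the standard library and flags anything that mentions a copyleft license for manual review. The `audit` and `is_copyleft` helpers and the marker list are assumptions for illustration; a string match on metadata is not a legal classification.

```python
from importlib import metadata
from typing import Iterable, List, Tuple

COPYLEFT_MARKERS = ("AGPL", "GPL")  # illustrative markers, not legal advice


def is_copyleft(license_text: str) -> bool:
    """True if a license string mentions one of the copyleft markers."""
    upper = license_text.upper()
    return any(marker in upper for marker in COPYLEFT_MARKERS)


def audit(packages: Iterable[str]) -> List[Tuple[str, str, bool]]:
    """Return (package, declared license, needs_review) for each package name."""
    report = []
    for name in packages:
        try:
            meta = metadata.metadata(name)
        except metadata.PackageNotFoundError:
            report.append((name, "not installed", False))
            continue
        declared = meta.get("License-Expression") or meta.get("License") or ""
        classifiers = meta.get_all("Classifier") or []
        # Check both the license field and the trove classifiers.
        combined = " ".join([declared, *classifiers])
        report.append((name, declared or "unknown", is_copyleft(combined)))
    return report
```

Calling audit(["pdfplumber", "pymupdf", "pdf_oxide"]) in the target environment gives a quick list of packages to look at before release.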

Encrypted PDFs

Encryption handling is limited in pdfplumber and a common pain point for developers working with password-protected documents.

pdfplumber: Read-Only Password Support

pdfplumber can open a password-protected PDF if you pass the password to pdfplumber.open(), which forwards it to pdfminer.six. It cannot create encrypted PDFs, change passwords, or save a decrypted copy:

import pdfplumber

# Supply the password when opening; a wrong or missing user password
# makes pdfminer.six raise an error
with pdfplumber.open("encrypted.pdf", password="password") as pdf:
    text = pdf.pages[0].extract_text()

To write a decrypted or re-encrypted file you still need another tool such as PyMuPDF or pypdf, which adds another dependency to your pipeline.

PyMuPDF: Full Encryption Support

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()

PyMuPDF supports both user and owner passwords, AES-128 and AES-256 encryption, and can create encrypted PDFs.

PDF Oxide: Full Encryption Support

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

PDF Oxide handles all standard PDF encryption methods (RC4, AES-128, AES-256) for both reading and writing. No additional dependencies or preprocessing required.

Image Extraction

Image extraction is another gap in pdfplumber’s feature set: it does not export embedded images from PDFs.

PyMuPDF

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

PDF Oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

If your pipeline requires extracting both text and images from PDFs, pdfplumber cannot handle the image side. You need PyMuPDF, PDF Oxide, or pypdfium2 for that.

Markdown and HTML Output

pdfplumber has no built-in Markdown or HTML conversion. PyMuPDF can emit HTML through page.get_text("html") but has no built-in Markdown output. Among the three, built-in Markdown conversion is unique to PDF Oxide.

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")

# Markdown with heading detection and table formatting
md = doc.to_markdown(0, detect_headings=True)
print(md)

# HTML with semantic tags
html = doc.to_html(0)
print(html)

For LLM pipelines, RAG systems, and document conversion workflows, structured Markdown output eliminates the need for a separate conversion step. PyMuPDF users typically rely on the separate pymupdf4llm package, which is 69x slower than PDF Oxide’s built-in conversion.
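In a RAG pipeline the Markdown is usually split into chunks before embedding, and headings make natural boundaries. The sketch below splits a Markdown string on ATX headings (lines starting with #). It assumes the converter emits headings in that style and does not special-case fenced code blocks; `split_by_headings` is a hypothetical helper, not part of PDF Oxide.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks; text before the first heading gets ''."""
    chunks: List[Tuple[str, List[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # A heading closes the current chunk and opens a new one.
            chunks.append((match.group(2).strip(), []))
        else:
            chunks[-1][1].append(line)
    result = [(title, "\n".join(body).strip()) for title, body in chunks]
    # Drop chunks that carry neither a heading nor any text.
    return [(title, body) for title, body in result if title or body]
```

Each chunk keeps its heading as context, which can be prepended to the body before embedding.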

When to Choose Each Library

Choose pdfplumber if:

  • Complex table extraction is your primary use case. pdfplumber’s table algorithms handle merged cells, borderless tables, and spanning headers better than any other Python library.
  • You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries – invaluable for tuning extraction on tricky documents.
  • You want a pure-Python solution. No compiled dependencies, installs anywhere Python runs.
  • Speed is not a concern. If you process fewer than a hundred files at a time, the 23ms mean is perfectly acceptable.

Choose PyMuPDF if:

  • You already have a commercial MuPDF license and depend on MuPDF-specific rendering or SVG export.
  • You need high-fidelity rendering. MuPDF’s rendering engine is mature and handles complex PDFs well.
  • Your project is AGPL-compatible. If you are building open-source software under AGPL or a compatible license, PyMuPDF’s licensing is not a concern.
  • You need OCR via Tesseract. PyMuPDF has built-in Tesseract integration for scanned documents.

Choose PDF Oxide if:

  • You need speed and broad feature coverage. 0.8ms mean extraction – 5.8x faster than PyMuPDF, 29x faster than pdfplumber – with text, images, forms, creation, and encryption in one library.
  • You want MIT licensing without sacrificing speed. pdfplumber is MIT but slow. PyMuPDF is fast but AGPL. PDF Oxide is both MIT and fast.
  • You need Markdown or HTML output. Built-in structured conversion for LLM pipelines and RAG systems.
  • You need encrypted PDF support with a permissive license. pdfplumber can read password-protected files but cannot write encrypted PDFs. PyMuPDF does both but requires AGPL compliance. PDF Oxide reads and writes encrypted PDFs under MIT.
  • You want a single library for extraction, creation, and editing. Both pdfplumber and PyMuPDF require additional tools for parts of the PDF workflow. PDF Oxide covers extraction, creation, editing, rendering, and validation.

Use PDF Oxide + pdfplumber together:

For pipelines that need fast text extraction, image extraction, and complex table parsing, use PDF Oxide for the general pipeline and pdfplumber for tables:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text and image extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
images = doc.extract_images(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

Installation

# pdfplumber
pip install pdfplumber

# PyMuPDF
pip install pymupdf

# PDF Oxide
pip install pdf_oxide

All three install via pip. pdfplumber and PDF Oxide are MIT licensed. PyMuPDF is AGPL-3.0 – review the licensing implications before adding it to a commercial project.

The Verdict

pdfplumber and PyMuPDF both solve parts of the problem. PDF Oxide solves the whole thing.

| What matters to you | Best choice |
| --- | --- |
| Maximum speed | PDF Oxide (0.8ms – 29× faster than pdfplumber) |
| Complex table extraction | pdfplumber (visual debugging, merged cells) |
| Permissive license + speed | PDF Oxide – pdfplumber is MIT but slow, PyMuPDF is fast but AGPL |
| Encrypted PDFs | PDF Oxide or PyMuPDF – pdfplumber only reads password-protected files and cannot write encrypted ones |
| Image extraction | PDF Oxide or PyMuPDF – pdfplumber has no image support |
| Markdown/HTML output | PDF Oxide – the only one of the three with built-in Markdown conversion |
| OCR without Tesseract | PDF Oxide – built-in PaddleOCR |
| One library for everything | PDF Oxide – extraction, creation, editing, encryption, OCR |

Unless your entire workflow is complex table extraction (borderless tables, merged cells, visual debugging), PDF Oxide replaces both pdfplumber and PyMuPDF: it is faster, covers more features, and is MIT-licensed.

Get started in 10 seconds:

pip install pdf_oxide

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)       # 29× faster than pdfplumber
md = doc.to_markdown(0)          # built-in, no separate package
images = doc.extract_images(0)   # pdfplumber can't do this