PDF Oxide vs Python PDF Libraries

PDF Oxide compared with PyMuPDF (fitz), pypdfium2, pypdf, pdfplumber, pdfminer, and more. This page covers performance, feature coverage, licensing, and API differences to help you choose the right Python PDF library for text extraction.

Summary

| Feature | PDF Oxide | PyMuPDF | pypdfium2 | pypdf | pdfplumber | pdfminer |
| --- | --- | --- | --- | --- | --- | --- |
| Mean extraction time | 0.8ms | 4.6ms | 4.1ms | 12.1ms | 23.2ms | 16.8ms |
| Pass rate (3,823 valid PDFs) | 100% | 99.3% | 99.2% | 98.4% | 98.8% | 98.8% |
| License | MIT | AGPL-3.0 | Apache-2.0 | BSD-3 | MIT | MIT |
| Language | Rust + PyO3 | C (MuPDF) | C++ (PDFium) | Pure Python | Pure Python | Pure Python |
| Text extraction | Yes | Yes | Yes | Yes | Yes | Yes |
| Character positions | Yes | Yes | Yes | Partial | Yes | Yes |
| Image extraction | Yes | Yes | Yes | Yes | No | No |
| Form fields | Read + Write | Read + Write | Read only | Read + Write | Read only | No |
| PDF creation | Yes | Yes | No | Limited | No | No |
| PDF editing | Yes | Yes | No | Yes | No | No |
| Markdown output | Yes | No | No | No | No | No |
| HTML output | Yes | No | No | No | No | No |
| Encryption | Read + Write | Read + Write | Read only | Read + Write | No | No |
| PDF/A validation | Yes | No | No | No | No | No |
| Rendering | Yes | Yes | Yes | No | No | No |
| Search | Regex + spatial | Yes | Yes | No | No | No |
| Python versions | 3.8–3.14 | 3.8–3.12 | 3.8+ | 3.6+ | 3.8+ | 3.6+ |
| Install size | ~5 MB wheel | ~20 MB wheel | ~3 MB wheel | ~1 MB | ~1 MB | ~1 MB |

Performance Comparison

Mean text extraction time per PDF, benchmarked on the full 3,830-PDF corpus — three independent, publicly available test suites that together cover every PDF specification version (1.0–2.0), encrypted files, malformed documents, CJK encodings, complex layouts, and security edge cases. See full corpus details for what each suite tests and why these results are reproducible.

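| Library | Mean | Relative | p99 | Pass Rate |
| --- | --- | --- | --- | --- |
| PDF Oxide | 0.8ms | 1.0× (baseline) | 9ms | 100% |
| PyMuPDF | 4.6ms | 5.8× | 28ms | 99.3% |
| pypdfium2 | 4.1ms | 5.1× | 42ms | 99.2% |
| pymupdf4llm | 55.5ms | 69× | 280ms | 99.1% |
| pdftext | 7.3ms | 9.1× | 82ms | 99.0% |
| pdfminer | 16.8ms | 21× | 124ms | 98.8% |
| pdfplumber | 23.2ms | 29× | 189ms | 98.8% |
| markitdown | 108.8ms | 136× | 378ms | 98.6% |
| pypdf | 12.1ms | 15.1× | 97ms | 98.4% |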

PDF Oxide achieves its speed through a native Rust core compiled to a Python extension module via PyO3. There is no subprocess overhead or C library bridging — the Rust code runs directly in the Python process.
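The mean, p99, and pass-rate columns can be measured for any library with a small timing harness. The sketch below is not the harness behind the published numbers: the `benchmark` function, the nearest-rank p99, and the choice to leave failed files out of the timings are all assumptions made for illustration.

```python
import math
import statistics
import time


def benchmark(extract, paths):
    """Time extract(path) for every path and summarise the results.

    `extract` is any callable that takes a file path and raises on failure.
    Assumption: failed files count against the pass rate but are left out
    of the timing statistics.
    """
    timings_ms = []
    failures = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    total = len(timings_ms) + failures
    if not timings_ms:
        return {"files": total, "pass_rate": 0.0, "mean_ms": None, "p99_ms": None}

    timings_ms.sort()
    # Nearest-rank percentile: the smallest timing that is >= 99% of all timings.
    p99_index = math.ceil(0.99 * len(timings_ms)) - 1
    return {
        "files": total,
        "pass_rate": len(timings_ms) / total,
        "mean_ms": statistics.mean(timings_ms),
        "p99_ms": timings_ms[p99_index],
    }
```

To time pypdf, for example, pass `lambda p: PdfReader(p).pages[0].extract_text()` as `extract`. Wrapping each library in a one-line callable like this keeps the timing loop identical across libraries.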

Reliability

PDF Oxide processes 3,823 of 3,823 valid PDFs without failure — a 100% pass rate. The 7 non-passing files in the 3,830-file corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).

| Library | Valid PDFs Passed | Pass Rate |
| --- | --- | --- |
| PDF Oxide | 3,823 / 3,823 | 100% |
| PyMuPDF | 3,796 / 3,823 | 99.3% |
| pypdfium2 | 3,792 / 3,823 | 99.2% |
| pymupdf4llm | 3,787 / 3,823 | 99.1% |
| pdftext | 3,784 / 3,823 | 99.0% |
| pdfminer | 3,777 / 3,823 | 98.8% |
| pdfplumber | 3,777 / 3,823 | 98.8% |
| markitdown | 3,771 / 3,823 | 98.6% |
| pypdf | 3,762 / 3,823 | 98.4% |
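The pass-rate column follows directly from the counts: files passed divided by the 3,823 valid PDFs, rounded to one decimal place. A short check, with the counts copied from the table above (the helper name `pass_rate_percent` is only illustrative):

```python
VALID_PDFS = 3823

# Counts of valid PDFs each library processed without failure (from the table).
PASSED = {
    "PDF Oxide": 3823,
    "PyMuPDF": 3796,
    "pypdfium2": 3792,
    "pymupdf4llm": 3787,
    "pdftext": 3784,
    "pdfminer": 3777,
    "pdfplumber": 3777,
    "markitdown": 3771,
    "pypdf": 3762,
}


def pass_rate_percent(passed, total=VALID_PDFS):
    """Return the pass rate as a percentage rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


RATES = {name: pass_rate_percent(count) for name, count in PASSED.items()}
```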

Text Quality

PDF Oxide achieves 99.5% text parity compared to PyMuPDF and pypdfium2 across the full corpus. Quality was measured by comparing extracted text output character-by-character. The remaining 0.5% difference is in whitespace normalization and ligature handling where PDF Oxide produces cleaner output.
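One way to put a number on character-level parity between two extractions is a similarity ratio from the standard library's `difflib`. The exact metric behind the 99.5% figure is not stated here, so treat the following as an assumed stand-in: `text_parity` and `normalize` are hypothetical helpers, and the normalization step (NFKC folding of ligatures plus whitespace collapsing) is one possible way to discount the two difference classes mentioned above.

```python
import difflib
import re
import unicodedata


def normalize(text):
    """Fold compatibility ligatures (e.g. U+FB01 -> 'fi') and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def text_parity(a, b, normalized=False):
    """Return a 0.0-1.0 character similarity ratio between two extractions."""
    if normalized:
        a, b = normalize(a), normalize(b)
    # autojunk=False keeps long texts from having frequent characters ignored.
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
```

Comparing the raw and normalized ratios for the same pair of outputs shows how much of a difference comes from whitespace and ligatures alone. `SequenceMatcher` is quadratic in the worst case, so per-page comparison is more practical than whole-document comparison on large files.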

License Comparison

| Library | License | Commercial Use | Copyleft |
| --- | --- | --- | --- |
| PDF Oxide | MIT | Unrestricted | No |
| pypdfium2 | Apache-2.0 | Unrestricted | No |
| PyMuPDF | AGPL-3.0 | Requires commercial license ($) | Yes |
| pypdf | BSD-3 | Unrestricted | No |
| pdfplumber | MIT | Unrestricted | No |
| pdfminer | MIT | Unrestricted | No |
| pdftext | GPL-3.0 | Requires open source | Yes |

PyMuPDF uses MuPDF under the AGPL-3.0 license. If you distribute software that uses PyMuPDF, your software must also be released under AGPL-3.0 — or you must purchase a commercial license from Artifex. This applies to SaaS products, web applications, and any distributed binaries.

PDF Oxide is MIT-licensed with no restrictions. Use it in proprietary products, SaaS platforms, or closed-source applications without any licensing obligations.

| Use Case | PDF Oxide (MIT) | PyMuPDF (AGPL) | pypdfium2 (Apache) | pypdf (BSD) | pdfplumber (MIT) | pdfminer (MIT) |
| --- | --- | --- | --- | --- | --- | --- |
| Commercial product | Yes | Requires license | Yes | Yes | Yes | Yes |
| Closed source | Yes | No (unless licensed) | Yes | Yes | Yes | Yes |
| SaaS/cloud | Yes | Requires license | Yes | Yes | Yes | Yes |
| Internal tools | Yes | Yes | Yes | Yes | Yes | Yes |
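Licenses can change between releases, so it is worth checking the metadata of the version actually installed. The sketch below reads license fields through the standard library's `importlib.metadata`. `license_summary` is a hypothetical helper, and which of the three fields a given package fills in varies from package to package.

```python
from importlib import metadata

# Distribution (pip) names, which can differ from import names.
DISTRIBUTIONS = ["pdf_oxide", "PyMuPDF", "pypdfium2", "pypdf", "pdfplumber", "pdfminer.six"]


def license_summary(dist_name):
    """Return the license metadata of an installed distribution, or None if absent."""
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return None
    classifiers = [
        c for c in (meta.get_all("Classifier") or []) if c.startswith("License ::")
    ]
    return {
        "license_expression": meta.get("License-Expression"),  # newer packages
        "license": meta.get("License"),                        # older free-text field
        "classifiers": classifiers,                            # trove classifiers
    }
```

Calling `license_summary(name)` for each entry in `DISTRIBUTIONS` gives a quick overview of the current environment. It reports only what each package declares, so it does not replace reading the license text.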

API Comparison

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
text = page.get_text()
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
text = page.extract_text()
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

pdfminer:

from pdfminer.high_level import extract_text

text = extract_text("report.pdf", page_numbers=[0])
print(text)

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.bbox[0]:.1f}, {ch.bbox[1]:.1f}) "
          f"size={ch.font_size:.1f}")

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
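# "rawdict" exposes per-character data; "dict" stops at span level
blocks = page.get_text("rawdict")["blocks"]
for block in blocks:
    if "lines" in block:  # image blocks have no "lines" key
        for line in block["lines"]:
            for span in line["spans"]:
                for ch in span["chars"]:
                    print(f"'{ch['c']}' at ({ch['bbox'][0]:.1f}, {ch['bbox'][1]:.1f}) "
                          f"size={span['size']:.1f}")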

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

pdfminer:

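from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

# page_numbers is zero-based; limit to the first page like the other examples
for page_layout in extract_pages("report.pdf", page_numbers=[0]):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue  # skip figures, images, and drawing elements
        for text_line in element:
            if not isinstance(text_line, LTTextLine):
                continue
            for char in text_line:
                if isinstance(char, LTChar):  # LTAnno items carry no position
                    print(f"'{char.get_text()}' at ({char.x0:.1f}, {char.y0:.1f}) "
                          f"size={char.size:.1f}")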
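Per-character output is usually an intermediate step, and a common next step is to rebuild lines of text from the positions. The sketch below groups characters by vertical position. It assumes pdfplumber-style dictionaries with `text`, `x0`, and `top` keys; the `group_chars_into_lines` helper and the 2-point tolerance are illustrative choices, not part of any library shown here.

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Group character dicts into text lines by their vertical position.

    Each char needs "text", "x0" and "top" keys (pdfplumber's char format).
    Characters whose "top" lies within y_tolerance of the first character
    of the current line are treated as part of that line.
    """
    lines = []  # list of (line_top, [chars]) in top-to-bottom order
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append((ch["top"], [ch]))
    # Within each line, order characters left to right before joining.
    return [
        "".join(c["text"] for c in sorted(line_chars, key=lambda c: c["x0"]))
        for _, line_chars in lines
    ]
```

With pdfplumber this can be called as `group_chars_into_lines(page.chars)`. Output from the other libraries would first need mapping to the same three keys, and pdfminer's `y0` is measured from the bottom of the page, so it has to be flipped.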

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

PyMuPDF:

import fitz

doc = fitz.open("report.pdf")
page = doc[0]
for i, img in enumerate(page.get_images()):
    xref = img[0]
    base_image = doc.extract_image(xref)
    with open(f"image_{i}.{base_image['ext']}", "wb") as f:
        f.write(base_image["image"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")

# Also supports HTML
pdf = Pdf.from_html("<h1>Hello</h1><p>World</p>")
pdf.save("output.pdf")

PyMuPDF:

import fitz

doc = fitz.open()
page = doc.new_page()
text_point = fitz.Point(72, 72)
page.insert_text(text_point, "Hello World", fontsize=24)
doc.save("output.pdf")

pypdf:

# pypdf can merge/modify PDFs but cannot create from scratch with text content.
# Use reportlab or fpdf2 for creation, then merge with pypdf.

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)

PyMuPDF:

import fitz

doc = fitz.open("encrypted.pdf")
doc.authenticate("password")
page = doc[0]
text = page.get_text()

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("password")
text = reader.pages[0].extract_text()
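The snippets above assume the password is correct. With pypdf, `decrypt()` returns a falsy value when the password does not match, so the failure can be checked explicitly instead of surfacing later as an extraction error. A minimal sketch, where `extract_first_page_text` is a hypothetical wrapper that accepts a `pypdf.PdfReader` or any object with the same three members:

```python
def extract_first_page_text(reader, password=None):
    """Return the text of page 0, decrypting first when required.

    `reader` is expected to behave like pypdf.PdfReader: it must provide
    `is_encrypted`, `decrypt(password)` and `pages`.
    """
    if reader.is_encrypted:
        if password is None:
            raise PermissionError("PDF is encrypted and no password was given")
        # pypdf's decrypt() returns a falsy value when the password is wrong.
        if not reader.decrypt(password):
            raise PermissionError("password was rejected")
    return reader.pages[0].extract_text()


# Usage with pypdf (requires `pip install pypdf`):
#   from pypdf import PdfReader
#   text = extract_first_page_text(PdfReader("encrypted.pdf"), "password")
```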

Markdown and HTML Output

PDF Oxide (unique feature):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")

# Convert to Markdown with heading detection
md = doc.to_markdown(0, detect_headings=True)
print(md)

# Convert to HTML
html = doc.to_html(0)
print(html)

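Of the six libraries in the summary table, only PDF Oxide has Markdown conversion built in. pymupdf4llm and markitdown, benchmarked above, also produce Markdown but are separate packages.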

Library Profiles

PDF Oxide

Strengths:

  • Fastest text extraction in benchmarks due to Rust core — 5.8× faster than PyMuPDF
  • 100% pass rate on the 3,823 valid PDFs in the 3,830-file corpus, the highest of any tested library
  • Unified API for extraction, creation, and editing in a single library
  • Built-in Markdown and HTML export with heading detection
  • MIT licensed with no copyleft restrictions
  • Native compliance validation (PDF/A, PDF/UA, PDF/X)
  • Pre-built wheels for all major platforms and Python 3.8–3.14
  • No system dependencies — the wheel includes everything

Limitations:

  • Newer library with a smaller community
  • Table extraction is basic compared to pdfplumber’s algorithms
  • Rendering engine is less mature than MuPDF

PyMuPDF (fitz)

Strengths:

  • Mature and battle-tested (backed by MuPDF, in development since 2005)
  • Excellent rendering quality for complex PDFs
  • Built-in OCR integration (Tesseract)
  • Rich feature set: SVG export, page manipulation, table detection

Limitations:

  • AGPL-3.0 license requires open-sourcing your application or purchasing a commercial license
  • Large wheel size (~20 MB) due to bundled MuPDF
  • No built-in Markdown export
  • No compliance validation

pypdfium2

Strengths:

  • Fast (backed by Google’s PDFium engine)
  • Apache-2.0 license — permissive for commercial use
  • Good rendering quality

Limitations:

  • Limited text extraction API compared to PDF Oxide or PyMuPDF
  • No PDF creation or editing
  • No form field support beyond read-only

pypdf

Strengths:

  • Pure Python — installs anywhere, no compiled dependencies
  • Lightweight and well-maintained
  • Good for PDF manipulation (merge, split, rotate, encrypt)
  • Large community and extensive documentation

Limitations:

  • 15× slower than PDF Oxide for text extraction
  • Text extraction quality struggles with complex layouts
  • No rendering, no Markdown/HTML export, no table extraction

pdfplumber

Strengths:

  • Best table extraction of any Python PDF library
  • Excellent character-level positioning data
  • Visual debugging tools (annotated page images)
  • MIT licensed

Limitations:

  • Pure Python — 29× slower than PDF Oxide
  • Read-only — no PDF creation or editing
  • No encryption or rendering

pdfminer

Strengths:

  • Detailed character and layout analysis
  • Good CJK text support
  • Foundation for pdfplumber and other tools
  • MIT licensed

Limitations:

  • 21× slower than PDF Oxide (pure Python, unoptimized)
  • Read-only, no creation or editing
  • Verbose API for common tasks
  • Less actively maintained

When to Use Each

| Use Case | Recommended Library |
| --- | --- |
| Fast text extraction | PDF Oxide |
| Commercial / proprietary product | PDF Oxide, pypdfium2, pypdf, pdfplumber, or pdfminer |
| PyMuPDF alternative (MIT licensed) | PDF Oxide |
| PDF creation from Markdown/HTML | PDF Oxide |
| Compliance validation (PDF/A, PDF/X) | PDF Oxide |
| Table extraction from invoices | pdfplumber |
| Visual debugging of extraction | pdfplumber |
| Existing MuPDF investment | PyMuPDF (if AGPL-compatible) |
| Minimal dependencies | pypdf (pure Python) |
| Detailed layout analysis | pdfminer |
| OCR for scanned documents | PyMuPDF |

Installation

# PDF Oxide
pip install pdf_oxide

# PyMuPDF
pip install pymupdf

# pypdfium2
pip install pypdfium2

# pypdf
pip install pypdf

# pdfplumber
pip install pdfplumber

# pdfminer
pip install pdfminer.six

PDF Oxide ships pre-built wheels for Linux (x86_64, aarch64), macOS (x86_64, arm64), and Windows (x86_64). No compiler or system libraries required.
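Code that has to run where not every library is installed can check availability at runtime before importing. The sketch below uses the standard library's `importlib.util.find_spec`. The preference order and the `first_available` helper are illustrative assumptions, and the list holds import names (for example `pdfminer` for the `pdfminer.six` package), not pip names.

```python
import importlib.util

# Import names in an example order of preference; adjust to your own needs.
PREFERRED = ["pdf_oxide", "pypdfium2", "pypdf", "pdfplumber", "pdfminer"]


def first_available(modules=PREFERRED):
    """Return the first importable module name from `modules`, or None."""
    for name in modules:
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None
```

The result can then pick the matching extraction snippet from the API comparison above, for instance by mapping each module name to a small wrapper function.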