What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

PDF Oxide vs pdfplumber

PDF Oxide is 29× faster than pdfplumber for text extraction while offering a broader feature set. pdfplumber has more mature table extraction algorithms. This page helps you choose the right tool for your use case.

Key Differences

Speed. pdfplumber is pure Python (built on pdfminer). PDF Oxide’s Rust core extracts text at 0.8ms mean vs 23.2ms — 29× faster.

Reliability. On the 3,830-PDF test corpus, PDF Oxide extracts every valid file (3,823 of 3,823). pdfplumber passes 98.8% (3,777 of 3,823), failing on 46 valid PDFs.

Tables. pdfplumber has the best table extraction of any Python PDF library. PDF Oxide’s table detection is functional but less mature for complex multi-row, multi-column layouts with merged cells.

Scope. pdfplumber is read-only. PDF Oxide adds creation, editing, encryption, rendering, and Markdown/HTML output.

Quick Comparison

Feature	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
Pass rate (3,830 PDFs)	100%	98.8%
License	MIT	MIT
Language	Rust + PyO3	Pure Python
Text extraction	Yes	Yes
Character positions	Yes	Yes
Table extraction	Basic	Advanced
Image extraction	Yes	No
Visual debugging	No	Yes
Markdown output	Yes	No
HTML output	Yes	No
PDF creation	Yes	No
PDF editing	Yes	No
Encryption	Read + Write	No
Rendering	Yes	No
Form fields	Read + Write	Read only

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")
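
Character positions are usually a means to an end, such as rebuilding lines of text. The sketch below groups characters into lines by vertical position. It assumes dictionaries shaped like pdfplumber's page.chars entries (only the text, x0, and top keys are used), and the sample data and the y_tolerance default are made up for illustration. The same idea should carry over to PDF Oxide's character objects via attribute access, but check which way its y axis runs before reusing the sort order.

```python
# Hypothetical sample shaped like pdfplumber's page.chars entries.
# Real entries carry many more keys; only these three are used here.
sample_chars = [
    {"text": "H", "x0": 72.0, "top": 100.0},
    {"text": "i", "x0": 80.0, "top": 100.2},
    {"text": "O", "x0": 72.0, "top": 120.0},
    {"text": "k", "x0": 81.0, "top": 119.9},
]


def group_into_lines(chars, y_tolerance=2.0):
    """Cluster characters with similar 'top' values, then order each cluster left to right."""
    lines = []  # each entry: (reference_top, [chars on that line])
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][0]) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append((ch["top"], [ch]))
    return [
        "".join(c["text"] for c in sorted(members, key=lambda c: c["x0"]))
        for _, members in lines
    ]


lines = group_into_lines(sample_chars)
print(lines)
```

A fixed tolerance is the simplest choice. Documents with mixed font sizes may need a tolerance scaled to each character's size.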

Table Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
# Tables are converted to Markdown table syntax
print(md)

pdfplumber:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

pdfplumber’s extract_tables() returns structured row/column data with configurable line detection. For complex tables with merged cells, spanning headers, or borderless layouts, pdfplumber’s algorithms are more robust.
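
Because extract_tables() returns plain nested lists, the result can be passed to standard-library tools without further parsing. The sketch below writes one table to CSV. The invoice rows are hypothetical sample data in the shape pdfplumber returns (a list of tables, each a list of rows, with None for empty cells), and table_to_csv is an illustrative helper, not part of either library.

```python
import csv
import io

# Hypothetical data in the shape extract_tables() returns:
# list of tables -> list of rows -> list of cell strings (None for empty cells).
tables = [
    [
        ["Item", "Qty", "Price"],
        ["Widget", "2", "9.99"],
        ["Gadget", None, "4.50"],
    ]
]


def table_to_csv(table):
    """Serialize one extracted table to CSV text, writing empty cells as ''."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        writer.writerow("" if cell is None else cell for cell in row)
    return buf.getvalue()


csv_text = table_to_csv(tables[0])
print(csv_text)
```

Using the csv module rather than joining on commas means cells that contain commas, quotes, or line breaks are quoted correctly.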

Benchmark Details

Metric	PDF Oxide	pdfplumber
Mean extraction time	0.8ms	23.2ms
p99 extraction time	9ms	189ms
Pass rate (valid PDFs)	100% (3,823/3,823)	98.8% (3,777/3,823)

The 29× speed difference comes from pdfplumber’s pure-Python architecture. pdfplumber builds on pdfminer for parsing, then adds its own spatial analysis layer — both written in Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust.

See the full benchmark methodology for corpus details.
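
To reproduce this kind of comparison on your own documents, a small timing harness is enough. The sketch below is not the project's benchmark code. It is a minimal illustration that treats any raised exception as a failure and uses a stand-in extractor (fake_extract, hypothetical) so that it runs without a PDF library installed. In practice you would pass a function that wraps PdfDocument(path).extract_text(0) or the pdfplumber equivalent.

```python
import statistics
import time


def benchmark(extract, paths):
    """Time extract(path) per file; report mean and p99 latency in ms plus the pass rate."""
    timings_ms = []
    failures = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures += 1  # assumption: any exception counts as a failed document
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(timings_ms, n=100)[-1] if len(timings_ms) >= 2 else None
    return {
        "mean_ms": statistics.fmean(timings_ms) if timings_ms else None,
        "p99_ms": p99,
        "pass_rate": (len(paths) - failures) / len(paths) if paths else None,
    }


def fake_extract(path):
    """Stand-in extractor: fails on one simulated bad file."""
    if path.endswith("bad.pdf"):
        raise ValueError("simulated parse failure")
    return "text"


paths = [f"doc{i}.pdf" for i in range(9)] + ["bad.pdf"]
result = benchmark(fake_extract, paths)
print(result)
```

Only successful extractions contribute to the latency figures here. A stricter harness might also count timeouts or empty output as failures.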

When to Use Each

Choose PDF Oxide if:

Speed matters. At the benchmark means, extracting text from one million PDFs takes about 13 minutes instead of about 6.4 hours.
You need more than extraction. Creation, editing, encryption, rendering, or Markdown output.
You want maximum reliability. 100% pass rate vs 98.8%.
You need image extraction. pdfplumber doesn’t extract images.
You run batch processing pipelines. At 0.8ms per PDF, the 3,830-PDF corpus takes about 3.1 seconds.
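
The batch figures above follow directly from the mean extraction times. Here is a short sketch of that arithmetic. It assumes single-threaded processing and that the benchmark means carry over to your own corpus, and both are assumptions rather than benchmark results. The one-million-document batch size is an illustrative choice.

```python
# Mean text extraction times quoted in the benchmark table, in milliseconds.
MEAN_MS = {"PDF Oxide": 0.8, "pdfplumber": 23.2}


def batch_seconds(mean_ms, n_docs):
    """Estimated wall-clock seconds to process n_docs at a given mean latency."""
    return mean_ms * n_docs / 1000.0


for n_docs in (3_830, 1_000_000):
    for name, mean_ms in MEAN_MS.items():
        secs = batch_seconds(mean_ms, n_docs)
        print(f"{n_docs:>9,} PDFs with {name}: {secs:,.1f} s ({secs / 60:,.1f} min)")

speedup = MEAN_MS["pdfplumber"] / MEAN_MS["PDF Oxide"]
print(f"speedup: {speedup:.1f}x")
```

For the 3,830-PDF corpus the estimate is about 3.1 seconds against about 89 seconds. The gap only becomes minutes against hours once a batch reaches hundreds of thousands of documents.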

Choose pdfplumber if:

Complex table extraction is your primary use case. pdfplumber’s table algorithms handle merged cells, borderless tables, and spanning headers better.
You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries.
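You prefer a pure-Python codebase. pdfplumber and pdfminer are written in Python, which makes them easy to read and debug, although some of their dependencies (such as Pillow) ship compiled code.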

Use both:

For pipelines that need fast text extraction and complex table parsing, use PDF Oxide for text and pdfplumber for tables:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

Related Pages

Performance Benchmarks: full corpus results
vs Python PDF Libraries: all Python libraries compared
Extract Tables from PDF: table extraction guide
Getting Started with Python: installation and first extraction
