PDF Oxide vs pdfplumber

PDF Oxide is 29× faster than pdfplumber for text extraction while offering a broader feature set. pdfplumber has more mature table extraction algorithms. This page helps you choose the right tool for your use case.

Key Differences

Speed. pdfplumber is pure Python (built on pdfminer.six). PDF Oxide’s Rust core extracts text in 0.8ms on average versus 23.2ms for pdfplumber, which makes it 29× faster.

Reliability. PDF Oxide passes 100% of the valid PDFs in the 3,830-file test corpus (3,823 of 3,823). pdfplumber passes 98.8% (3,777 of 3,823), failing on 46 valid PDFs.

Tables. pdfplumber has the best table extraction of any Python PDF library. PDF Oxide’s table detection is functional but less mature for complex multi-row, multi-column layouts with merged cells.

Scope. pdfplumber is read-only. PDF Oxide adds creation, editing, encryption, rendering, and Markdown/HTML output.

Quick Comparison

| Feature | PDF Oxide | pdfplumber |
| --- | --- | --- |
| Mean extraction time | 0.8ms | 23.2ms |
| Pass rate (3,830 PDFs) | 100% | 98.8% |
| License | MIT | MIT |
| Language | Rust + PyO3 | Pure Python |
| Text extraction | Yes | Yes |
| Character positions | Yes | Yes |
| Table extraction | Basic | Advanced |
| Image extraction | Yes | No |
| Visual debugging | No | Yes |
| Markdown output | Yes | No |
| HTML output | Yes | No |
| PDF creation | Yes | No |
| PDF editing | Yes | No |
| Encryption | Read + Write | No |
| Rendering | Yes | No |
| Form fields | Read + Write | Read only |

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)

Character-Level Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) size={ch.font_size:.1f}")

pdfplumber:

import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    for char in page.chars[:10]:
        print(f"'{char['text']}' at ({char['x0']:.1f}, {char['top']:.1f}) "
              f"size={char['size']:.1f}")

Table Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
md = doc.to_markdown(0, detect_headings=True)
# Tables are converted to Markdown table syntax
print(md)

pdfplumber:

import pdfplumber

with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)

pdfplumber’s extract_tables() returns structured row/column data with configurable line detection. For complex tables with merged cells, spanning headers, or borderless layouts, pdfplumber’s algorithms are more robust.
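The row/column data returned by extract_tables() is a list of rows, each row a list of cell strings, and cells may come back as None. A minimal sketch of turning one such table into header-keyed records follows. The helper name rows_to_records and the sample data are hypothetical and not part of either library, and the sketch assumes the first row holds the column headers.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def rows_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Map each data row to a dict keyed by the header row (hypothetical helper)."""
    if not table:
        return []
    # Assumption: the first row contains the column headers.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Normalise None cells to empty strings before pairing with headers.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Sample data shaped like one table from extract_tables(); values are made up.
sample = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Gadget", None, "4.50"],
]
records = rows_to_records(sample)
print(records)
```

For borderless tables, pdfplumber also accepts a table_settings dictionary, for example page.extract_tables(table_settings={"vertical_strategy": "text", "horizontal_strategy": "text"}), which infers cell boundaries from text alignment instead of ruled lines.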

Benchmark Details

| Metric | PDF Oxide | pdfplumber |
| --- | --- | --- |
| Mean extraction time | 0.8ms | 23.2ms |
| p99 extraction time | 9ms | 189ms |
| Pass rate (valid PDFs) | 100% (3,823/3,823) | 98.8% (3,777/3,823) |

The 29× speed difference comes from pdfplumber’s pure-Python architecture. pdfplumber builds on pdfminer for parsing, then adds its own spatial analysis layer — both written in Python. PDF Oxide handles all parsing, font decoding, and text assembly in compiled Rust.

See full benchmark methodology for corpus details.
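To reproduce mean and p99 figures like these on your own corpus, a small timing harness is enough. The sketch below is not the benchmark's actual methodology: the function names are hypothetical, the p99 uses the nearest-rank definition (an assumption, since the page does not say how its percentile was computed), and a stand-in workload replaces real PDF extraction so the code runs without any files.

```python
import time
from statistics import fmean
from typing import Callable, Dict, Iterable, List


def time_calls(fn: Callable[[str], object], paths: Iterable[str]) -> List[float]:
    """Call fn once per path and return per-call wall-clock times in milliseconds."""
    timings = []
    for path in paths:
        start = time.perf_counter()
        fn(path)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings_ms: List[float]) -> Dict[str, float]:
    """Return the mean and the nearest-rank 99th percentile of the timings."""
    if not timings_ms:
        raise ValueError("no timings to summarize")
    ordered = sorted(timings_ms)
    # Nearest-rank p99: rank = ceil(0.99 * n), computed with integer arithmetic.
    rank = (99 * len(ordered) + 99) // 100
    return {"mean_ms": fmean(ordered), "p99_ms": ordered[rank - 1]}


# Stand-in workload so the sketch runs without PDFs; swap in a real extractor.
timings = time_calls(lambda path: sum(range(1000)), [f"doc{i}.pdf" for i in range(200)])
stats = summarize(timings)
print(f"mean={stats['mean_ms']:.3f}ms p99={stats['p99_ms']:.3f}ms")
```

To compare the two libraries, pass a function that opens a file and extracts its text with each one, and run both over the same list of paths.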

When to Use Each

Choose PDF Oxide if:

  • Speed matters. When you process thousands of PDFs, a 29× speedup can mean minutes instead of hours.
  • You need more than extraction. Creation, editing, encryption, rendering, or Markdown output.
  • You want maximum reliability. 100% pass rate vs 98.8%.
  • You need image extraction. pdfplumber doesn’t extract images.
  • You run batch pipelines. At a mean of 0.8ms per PDF, 3,830 PDFs take about 3.1 seconds.
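The batch figures above are simple arithmetic on the mean extraction times. The sketch below works them out, assuming single-threaded processing and that the benchmark means carry over to your documents. The function name batch_seconds and the one-million-PDF scenario are illustrative only.

```python
def batch_seconds(num_pdfs: int, mean_ms: float) -> float:
    """Estimated wall-clock seconds to process num_pdfs at mean_ms each."""
    return num_pdfs * mean_ms / 1000.0


# The benchmark corpus: 3,830 PDFs.
oxide = batch_seconds(3830, 0.8)
plumber = batch_seconds(3830, 23.2)
print(f"{oxide:.1f}s vs {plumber:.1f}s")  # prints "3.1s vs 88.9s"

# A hypothetical one-million-PDF batch, where the gap becomes minutes vs hours.
oxide_minutes = batch_seconds(1_000_000, 0.8) / 60
plumber_hours = batch_seconds(1_000_000, 23.2) / 3600
print(f"{oxide_minutes:.1f} min vs {plumber_hours:.1f} h")
```

At corpus scale the difference is seconds against a minute and a half. At a million documents it is roughly a quarter of an hour against most of a working day.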

Choose pdfplumber if:

  • Complex table extraction is your primary use case. pdfplumber’s table algorithms handle merged cells, borderless tables, and spanning headers better.
  • You need visual debugging. pdfplumber can render annotated page images showing detected lines, characters, and table boundaries.
  • You prefer pure Python. No compiled dependencies, installs anywhere.

Use both:

For pipelines that need fast text extraction and complex table parsing, use PDF Oxide for text and pdfplumber for tables:

from pdf_oxide import PdfDocument
import pdfplumber

# Fast text extraction with PDF Oxide
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)

# Complex table extraction with pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
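Because pdfplumber fails on a small share of valid PDFs in the benchmark, a combined pipeline may want to keep the text even when table extraction raises. The sketch below is one way to structure that, not a recommended pattern from either project. The names extract_page, oxide_text, and plumber_tables are hypothetical wrappers around the same calls shown above, and the libraries are imported lazily so the wrappers can be swapped for stubs in tests.

```python
from typing import Callable, Dict, List, Optional

Table = List[List[Optional[str]]]


def oxide_text(path: str, page_index: int) -> str:
    """Text via PDF Oxide, using the same calls as the example above."""
    from pdf_oxide import PdfDocument  # imported lazily
    return PdfDocument(path).extract_text(page_index)


def plumber_tables(path: str, page_index: int) -> List[Table]:
    """Tables via pdfplumber, using the same calls as the example above."""
    import pdfplumber  # imported lazily
    with pdfplumber.open(path) as pdf:
        return pdf.pages[page_index].extract_tables()


def extract_page(
    path: str,
    page_index: int = 0,
    text_fn: Callable[[str, int], str] = oxide_text,
    tables_fn: Callable[[str, int], List[Table]] = plumber_tables,
) -> Dict[str, object]:
    """Extract text and tables; a table failure does not discard the text."""
    text = text_fn(path, page_index)
    try:
        tables = tables_fn(path, page_index)
        error = None
    except Exception as exc:  # broad on purpose: any table failure is recorded
        tables, error = [], repr(exc)
    return {"text": text, "tables": tables, "tables_error": error}
```

Calling extract_page("report.pdf") uses both libraries with their defaults. A non-empty tables_error marks pages where only the text could be recovered.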