PDF Oxide vs pypdf

PDF Oxide extracts text about 15× faster than pypdf, with a higher pass rate, built-in rendering, and Markdown/HTML conversion. If you need more than basic PDF manipulation, PDF Oxide does in one library what would otherwise take pypdf plus several additional packages.

Why Consider PDF Oxide Over pypdf

Speed. pypdf is pure Python. PDF Oxide uses a Rust core compiled via PyO3, running directly in the Python process. Mean text extraction: 0.8ms vs 12.1ms — a 15× difference.

Reliability. PDF Oxide passes 100% of 3,830 test PDFs. pypdf passes 98.4% — 61 failures on valid PDFs.

Features. pypdf is primarily a PDF manipulation library (merge, split, rotate, encrypt) with basic text extraction. For rendering, Markdown or HTML output, OCR, or form creation, you need additional packages. PDF Oxide covers all of these in a single install.
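The headline figures can be cross-checked with a few lines of arithmetic. This is only a sanity check on the published numbers, using the valid-PDF counts from the Benchmark Details table on this page; the variable names are illustrative.

```python
# Figures quoted in this comparison.
OXIDE_MEAN_MS = 0.8    # mean text extraction time, PDF Oxide
PYPDF_MEAN_MS = 12.1   # mean text extraction time, pypdf
VALID_PDFS = 3823      # valid PDFs in the benchmark corpus
PYPDF_PASSED = 3762    # valid PDFs pypdf processed successfully

speedup = PYPDF_MEAN_MS / OXIDE_MEAN_MS
pypdf_failures = VALID_PDFS - PYPDF_PASSED
pypdf_pass_rate = PYPDF_PASSED / VALID_PDFS

print(f"speedup: {speedup:.1f}x")
print(f"pypdf failures: {pypdf_failures}")
print(f"pypdf pass rate: {pypdf_pass_rate:.1%}")
```

The ratio of the two means comes out slightly above 15, which is where the rounded "15×" figure comes from.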

Quick Comparison

| | PDF Oxide | pypdf |
|---|---|---|
| Mean extraction time | 0.8ms | 12.1ms |
| Pass rate (3,830 PDFs) | 100% | 98.4% |
| License | MIT | BSD-3 |
| Language | Rust + PyO3 | Pure Python |
| Text extraction | Yes | Yes |
| Character positions | Yes | Partial |
| Image extraction | Yes | Yes |
| Markdown output | Yes | No |
| HTML output | Yes | No |
| PDF creation | Yes (Markdown/HTML/images) | Limited (merge only) |
| Form fields | Read + Write | Read + Write |
| Encryption | Read + Write | Read + Write |
| Rendering | Yes | No |
| OCR | Built-in | No |
| Search | Regex + spatial | No |
| Install size | ~5 MB | ~1 MB |

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)

Extract All Pages

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)

pypdf:

from pypdf import PdfReader

reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
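The two loops above differ only in how a page is fetched, so the iteration can be factored into a helper that works with either library. The following is a minimal sketch: `iter_page_text` and `format_pages` are hypothetical names, not part of PDF Oxide or pypdf, and the extractor is passed in as a callable so the helper itself imports neither library.

```python
from typing import Callable, Iterator, Tuple


def iter_page_text(
    page_count: int, extract_page: Callable[[int], str]
) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page number, text) for every page.

    `extract_page` takes a 0-based page index, so either library can be
    plugged in without this helper importing it.
    """
    for index in range(page_count):
        # Treat a None result (no extractable text) as an empty string.
        yield index + 1, extract_page(index) or ""


def format_pages(pages: Iterator[Tuple[int, str]]) -> str:
    """Join pages under the same '--- Page N ---' header used above."""
    return "\n".join(f"--- Page {number} ---\n{text}" for number, text in pages)


# Stand-in extractor so the sketch runs without a PDF on disk.
fake_pages = ["first page", "", "third page"]
print(format_pages(iter_page_text(len(fake_pages), lambda i: fake_pages[i])))
```

With PDF Oxide, the call would look like `iter_page_text(doc.page_count(), doc.extract_text)`. With pypdf, it would be `iter_page_text(len(reader.pages), lambda i: reader.pages[i].extract_text())`.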

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdf:

from pypdf import PdfReader

reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)

Encrypted PDFs

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)

pypdf:

from pypdf import PdfReader

reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()

Markdown Conversion

PDF Oxide (built-in):

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

pypdf:

# pypdf has no Markdown conversion.
# You would need a separate tool chain.

Benchmark Details

| Metric | PDF Oxide | pypdf |
|---|---|---|
| Mean extraction time | 0.8ms | 12.1ms |
| p99 extraction time | 9ms | 97ms |
| Pass rate (valid PDFs) | 100% (3,823/3,823) | 98.4% (3,762/3,823) |

pypdf’s pure-Python implementation means every operation runs in the interpreter. PDF Oxide’s Rust core handles parsing, font decoding, and text assembly natively, with only the final result crossing the Python boundary.
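To reproduce mean and p99 figures of this kind on your own files, a small timing harness is enough. The sketch below is not the methodology behind the published numbers: `time_calls_ms` and `summarize` are hypothetical helper names, and the nearest-rank percentile method is an assumption. It takes any callable, so the same harness can time either library.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, List


def time_calls_ms(fn: Callable[[str], object], inputs: Iterable[str]) -> List[float]:
    """Run fn once per input and return wall-clock durations in milliseconds."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms: List[float]) -> Dict[str, float]:
    """Return the mean and the nearest-rank 99th percentile of the timings."""
    if not durations_ms:
        raise ValueError("no timings to summarize")
    ordered = sorted(durations_ms)
    # Nearest rank: the smallest sample with at least 99% of samples at or below it.
    rank = math.ceil(99 * len(ordered) / 100)
    return {"mean_ms": statistics.mean(ordered), "p99_ms": ordered[rank - 1]}


# Stand-in workload; replace with a function that opens a PDF and extracts text.
timings = time_calls_ms(lambda path: len(path), ["a.pdf", "b.pdf", "c.pdf"])
print(summarize(timings))
```

For a fair comparison, time the same operation in both libraries (for example, opening the file plus extracting the first page) and run each file more than once to reduce the effect of disk caching.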

See full benchmark methodology for corpus details.

Feature Gap

pypdf excels at PDF manipulation: merge, split, rotate, and encrypt. It lacks, or only partially supports, the following:

| Feature | PDF Oxide | pypdf |
|---|---|---|
| Markdown conversion | `doc.to_markdown(0)` | Not available |
| HTML conversion | `doc.to_html(0)` | Not available |
| PDF creation from content | `Pdf.from_markdown()`, `Pdf.from_html()` | Not available |
| Rendering to images | Yes | Not available |
| OCR for scanned PDFs | Built-in PaddleOCR | Not available |
| Text search | `doc.search("query")` | Not available |
| Character-level bounding boxes | `doc.extract_chars(0)` | Partial |
| PDF/A validation | Yes | Not available |

If your workflow is purely merge/split/rotate, pypdf’s lightweight pure-Python approach is a reasonable choice. For anything involving text extraction quality, creation, or conversion, PDF Oxide is the more complete option.

When to Stay with pypdf

  • You need a pure-Python dependency with zero compiled extensions
  • Your use case is strictly merge/split/rotate/encrypt with no text extraction
  • You need pypdf’s specific PDF manipulation methods for legacy integration
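If some of your environments cannot install compiled extensions, one option is to choose the library at runtime. The sketch below assumes the import names `pdf_oxide` and `pypdf` used in the examples on this page; `pick_backend` is a hypothetical helper that only checks which module is importable and leaves the actual extraction code to you.

```python
from importlib.util import find_spec
from typing import Optional, Sequence


def pick_backend(candidates: Sequence[str] = ("pdf_oxide", "pypdf")) -> Optional[str]:
    """Return the first importable top-level module name, or None."""
    for name in candidates:
        # find_spec returns None for a missing top-level module without importing it.
        if find_spec(name) is not None:
            return name
    return None


backend = pick_backend()
print(f"PDF backend: {backend or 'none installed'}")
```

The order of `candidates` sets the preference, so listing `pdf_oxide` first uses the faster library where it is installed and falls back to pypdf elsewhere.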