PDF Oxide vs pypdfium2

Both PDF Oxide and pypdfium2 are fast, natively-compiled Python PDF libraries. pypdfium2 wraps Google’s PDFium engine; PDF Oxide is built on a Rust core. The key difference is scope: pypdfium2 is primarily a reader and renderer, while PDF Oxide covers the full PDF lifecycle.

Key Differences

Speed. Both are fast, but PDF Oxide is roughly 5× faster on mean text extraction time: 0.8ms vs 4.1ms (5.1×). Both are dramatically faster than pure-Python libraries.

Features. pypdfium2 is read-only with rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.

Reliability. PDF Oxide passes 100% of the valid PDFs in the benchmark corpus (3,823 of 3,823). pypdfium2 passes 99.2% (3,792 of 3,823), which is 31 failures.

License. Both are permissive. PDF Oxide is MIT; pypdfium2 is Apache-2.0. No AGPL concerns with either.

Quick Comparison

| Feature | PDF Oxide | pypdfium2 |
| --- | --- | --- |
| Mean extraction time | 0.8ms | 4.1ms |
| Pass rate (3,830 PDFs) | 100% | 99.2% |
| License | MIT | Apache-2.0 |
| Language | Rust + PyO3 | C (PDFium) |
| Text extraction | Yes | Yes |
| Character positions | Yes | Yes |
| Image extraction | Yes | Yes |
| Markdown output | Yes | No |
| HTML output | Yes | No |
| PDF creation | Yes | No |
| PDF editing | Yes | No |
| Form fields | Read + Write | Read only |
| Encryption | Read + Write | Read only |
| Rendering | Yes | Yes |
| OCR | Built-in | No |
| Search | Regex + spatial | Yes |

Side-by-Side Code

Text Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)

Image Extraction

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

pypdfium2:

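import pypdfium2 as pdfium
import pypdfium2.raw as pdfium_c  # raw PDFium constants (pypdfium2 v4 and later)

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
for i, obj in enumerate(page.get_objects()):
    # Skip text, path, and other non-image page objects
    if obj.type == pdfium_c.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")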

PDF Creation

PDF Oxide:

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")

pypdfium2:

# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capabilities.

Rendering

PDF Oxide:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")

pypdfium2:

import pypdfium2 as pdfium

pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
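PDF page dimensions are measured in points (1/72 inch), which is why the pypdfium2 call above expresses a 150 DPI target as `scale=150/72`. The helper below is a small sketch of that arithmetic only. `dpi_to_scale` and `rendered_size` are hypothetical names introduced for illustration and are not part of either library, and rounding up to whole pixels is an assumption.

```python
import math

POINTS_PER_INCH = 72  # PDF user-space unit: 1 point = 1/72 inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a scale factor (pixels per PDF point)."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI.

    Assumption: fractional pixels are rounded up.
    """
    width_px = math.ceil(width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points (8.5 x 11 inches).
print(rendered_size(612, 792, dpi=150))  # (1275, 1650)
```

At 150 DPI a US Letter page comes out at about 1275 × 1650 pixels, so doubling the DPI quadruples the pixel count and the memory needed for the bitmap.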

Benchmark Details

| Metric | PDF Oxide | pypdfium2 |
| --- | --- | --- |
| Mean extraction time | 0.8ms | 4.1ms |
| p99 extraction time | 9ms | 42ms |
| Pass rate (valid PDFs) | 100% (3,823/3,823) | 99.2% (3,792/3,823) |

Both libraries use native code (Rust and C respectively), but PDF Oxide’s text extraction pipeline is optimized specifically for this task: it extracts in a single pass, uses pre-allocated buffers, and caches page trees.

See full benchmark methodology for corpus details.
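The three metrics in the table (mean, p99, and pass rate) can be reproduced in outline with a small harness. What follows is a minimal sketch under stated assumptions, not the benchmark's actual methodology: `run_benchmark`, the nearest-rank p99, and the `fake_extract` stand-in are all hypothetical, and `extract` is any callable you supply that takes a file path and raises on failure.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def run_benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Time `extract` on every path and summarize mean, p99, and pass rate."""
    timings_ms = []  # one entry per successful extraction
    failures = []    # paths where `extract` raised
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            failures.append(path)
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    total = len(timings_ms) + len(failures)
    timings_ms.sort()
    if timings_ms:
        mean_ms = statistics.fmean(timings_ms)
        # Nearest-rank percentile (an assumption; other definitions interpolate).
        p99_ms = timings_ms[math.ceil(0.99 * len(timings_ms)) - 1]
    else:
        mean_ms = p99_ms = None
    return {
        "mean_ms": mean_ms,
        "p99_ms": p99_ms,
        "pass_rate": (len(timings_ms) / total) if total else None,
        "failures": failures,
    }


# Stand-in extractor so the sketch runs without any PDF library installed.
def fake_extract(path: str) -> str:
    if path.endswith("bad.pdf"):
        raise ValueError("simulated parse failure")
    return "text"


result = run_benchmark(fake_extract, ["a.pdf", "b.pdf", "bad.pdf", "c.pdf"])
print(result["pass_rate"], result["failures"])  # 0.75 ['bad.pdf']
```

To compare the two libraries, wrap each Text Extraction snippet from the section above in a function and pass it as `extract`. The same pass-rate arithmetic gives the table's figure for pypdfium2: 3,792 / 3,823 ≈ 0.9919, which rounds to 99.2%.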

Feature Completeness

The biggest difference between these libraries is scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle:

| Capability | PDF Oxide | pypdfium2 |
| --- | --- | --- |
| Read and extract | Yes | Yes |
| Render pages | Yes | Yes |
| Create PDFs | Yes (Markdown, HTML, images) | No |
| Edit existing PDFs | Yes (text, images, annotations) | No |
| Fill form fields | Yes | No |
| Write encryption | Yes (AES-256) | No |
| Markdown/HTML output | Yes | No |
| OCR scanned pages | Yes (PaddleOCR via ONNX) | No |
| PDF/A validation | Yes | No |

If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability (creation, editing, form filling, or encryption), PDF Oxide is the single-library solution.

pypdfium2 License (Apache-2.0)

pypdfium2 is licensed under Apache-2.0, which allows commercial use. It wraps Google’s PDFium (the Chromium PDF engine), which carries its own BSD-style license. Both licenses are permissive.

Key considerations:

  • Apache-2.0: permissive, allows commercial use, requires attribution
  • PDFium dependency: the binary includes Chromium’s PDFium engine (~15 MB)
  • Google’s release cycle: pypdfium2 depends on PDFium releases from the Chromium project
  • No Python API stability guarantee: the API follows PDFium’s C API closely
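One common way to limit exposure to API changes in either library is to keep all library-specific calls behind a small adapter. The sketch below assumes the call shapes shown in the Text Extraction section above. `extract_page_text` and the `BACKENDS` registry are hypothetical names introduced here, and the imports are deferred so that only the backend you actually select has to be installed.

```python
from typing import Callable, Dict


def _with_pdf_oxide(path: str, page_index: int) -> str:
    # Lazy import; call shape taken from the Text Extraction example.
    from pdf_oxide import PdfDocument
    return PdfDocument(path).extract_text(page_index)


def _with_pypdfium2(path: str, page_index: int) -> str:
    # Lazy import; call shape taken from the Text Extraction example.
    import pypdfium2 as pdfium
    pdf = pdfium.PdfDocument(path)
    return pdf[page_index].get_textpage().get_text_range()


# Hypothetical registry: the only place that knows about concrete libraries.
BACKENDS: Dict[str, Callable[[str, int], str]] = {
    "pdf_oxide": _with_pdf_oxide,
    "pypdfium2": _with_pypdfium2,
}


def extract_page_text(path: str, page_index: int = 0, backend: str = "pdf_oxide") -> str:
    """Extract text from one page using the named backend."""
    try:
        impl = BACKENDS[backend]
    except KeyError:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {sorted(BACKENDS)}"
        ) from None
    return impl(path, page_index)
```

Application code then calls `extract_page_text("report.pdf", 0, backend="pypdfium2")`, so a breaking change in either library means updating one private function, not every call site.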

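PDF Oxide is MIT licensed. MIT is shorter and simpler than Apache-2.0, but it is not attribution-free: the copyright and permission notice must be included with copies of the software, and that applies to binary distributions too.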

When to Use Each

Choose PDF Oxide if:

  • You need more than read/render (creation, editing, forms, encryption)
  • You want Markdown or HTML conversion
  • You want built-in OCR for scanned documents
  • You need the highest reliability (100% vs 99.2%)
  • Speed is critical and the 5× difference matters at scale

Choose pypdfium2 if:

  • You only need to read and render PDFs
  • You prefer PDFium’s specific rendering output
  • You want a smaller dependency footprint