PDF Oxide vs pypdfium2
Both PDF Oxide and pypdfium2 are fast, natively-compiled Python PDF libraries. pypdfium2 wraps Google’s PDFium engine; PDF Oxide is built on a Rust core. The key difference is scope: pypdfium2 is primarily a reader and renderer, while PDF Oxide covers the full PDF lifecycle.
Key Differences
Speed. Both are fast, but PDF Oxide is about 5.1× faster at text extraction: 0.8 ms mean vs 4.1 ms. Both are dramatically faster than pure-Python libraries.
Features. pypdfium2 is read-only with rendering. PDF Oxide adds creation, editing, form writing, encryption, Markdown/HTML output, and OCR.
Reliability. PDF Oxide passes 100% of the 3,823 valid PDFs in the benchmark corpus. pypdfium2 passes 99.2% (31 failures).
License. Both are permissive. PDF Oxide is MIT; pypdfium2 is Apache-2.0. No AGPL concerns with either.
Quick Comparison
| | PDF Oxide | pypdfium2 |
|---|---|---|
| Mean extraction time | 0.8ms | 4.1ms |
| Pass rate (3,830 PDFs) | 100% | 99.2% |
| License | MIT | Apache-2.0 |
| Language | Rust + PyO3 | C++ (PDFium) |
| Text extraction | Yes | Yes |
| Character positions | Yes | Yes |
| Image extraction | Yes | Yes |
| Markdown output | Yes | No |
| HTML output | Yes | No |
| PDF creation | Yes | No |
| PDF editing | Yes | No |
| Form fields | Read + Write | Read only |
| Encryption | Read + Write | Read only |
| Rendering | Yes | Yes |
| OCR | Built-in | No |
| Search | Regex + spatial | Yes |
Side-by-Side Code
Text Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
pypdfium2:
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
textpage = page.get_textpage()
text = textpage.get_text_range()
print(text)
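To extract text from every page rather than only the first, the pypdfium2 snippet above can be wrapped in a loop. The following is a minimal sketch, assuming the pypdfium2 v4 helper API (`len(pdf)`, `pdf[i]`, `get_textpage()`, `get_text_range()`, `close()`). The names `extract_all_text` and `open_pdf` are illustrative and not part of either library.

```python
def extract_all_text(pdf):
    """Return a list holding the text of each page of an open pypdfium2 document."""
    pages_text = []
    for index in range(len(pdf)):
        page = pdf[index]
        textpage = page.get_textpage()
        try:
            pages_text.append(textpage.get_text_range())
        finally:
            # Release native handles child-first: text page, then page.
            textpage.close()
            page.close()
    return pages_text


def open_pdf(path):
    """Open a PDF with pypdfium2 (imported lazily so the helper above has no hard dependency)."""
    import pypdfium2 as pdfium
    return pdfium.PdfDocument(path)
```

A typical call would be `extract_all_text(open_pdf("report.pdf"))`, followed by closing the document once the text has been collected.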
Image Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
pypdfium2:
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
for i, obj in enumerate(page.get_objects()):
    # In pypdfium2 v4+, object type constants live in the raw bindings module
    if obj.type == pdfium.raw.FPDF_PAGEOBJ_IMAGE:
        bitmap = obj.get_bitmap()
        bitmap.to_pil().save(f"image_{i}.png")
PDF Creation
PDF Oxide:
from pdf_oxide import Pdf
pdf = Pdf.from_markdown("# Report\n\nQuarterly results are in.")
pdf.save("report.pdf")
pypdfium2:
# pypdfium2 cannot create PDFs.
# It is a read-only library with rendering capabilities.
Rendering
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
image = doc.render_page(0, dpi=150)
image.save("page.png")
pypdfium2:
import pypdfium2 as pdfium
pdf = pdfium.PdfDocument("report.pdf")
page = pdf[0]
bitmap = page.render(scale=150/72)
bitmap.to_pil().save("page.png")
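The `scale=150/72` argument in the pypdfium2 example comes from the PDF convention that one user-space unit is 1/72 inch, so a scale of 1.0 corresponds to 72 DPI. The sketch below makes that conversion explicit and estimates the resulting bitmap size. The helper names are hypothetical, and the exact pixel rounding a given renderer applies is an assumption.

```python
import math

POINTS_PER_INCH = 72  # PDF user-space units per inch at scale 1.0


def dpi_to_scale(dpi):
    """Convert a target DPI into a scale factor for page.render(scale=...)."""
    return dpi / POINTS_PER_INCH


def rendered_size(width_pt, height_pt, dpi):
    """Estimate the bitmap size in pixels for a page measured in points."""
    # Multiply before dividing so whole-number results stay exact in floating point.
    width_px = math.ceil(width_pt * dpi / POINTS_PER_INCH)
    height_px = math.ceil(height_pt * dpi / POINTS_PER_INCH)
    return width_px, height_px


# US Letter is 612 x 792 points.
print(rendered_size(612, 792, 150))  # (1275, 1650)
```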
Benchmark Details
| Metric | PDF Oxide | pypdfium2 |
|---|---|---|
| Mean extraction time | 0.8ms | 4.1ms |
| p99 extraction time | 9ms | 42ms |
| Pass rate (valid PDFs) | 100% (3,823/3,823) | 99.2% (3,792/3,823) |
Both libraries use native code (Rust and C++ respectively), but PDF Oxide’s text extraction pipeline is optimized specifically for this task: single-pass extraction with pre-allocated buffers and cached page trees.
See full benchmark methodology for corpus details.
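To reproduce this kind of measurement on your own files, a small timing harness is enough. The sketch below is not the benchmark's actual methodology. It assumes each extraction is wrapped in a callable that takes one input (for example a file path) and computes the mean and a nearest-rank 99th percentile over a non-empty list of samples. `time_calls` and `summarize` are illustrative names.

```python
import math
import statistics
import time


def time_calls(func, inputs):
    """Run func on each input and return per-call durations in milliseconds."""
    durations = []
    for item in inputs:
        start = time.perf_counter()
        func(item)
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def summarize(durations_ms):
    """Return the mean and nearest-rank p99 of a non-empty list of samples."""
    ordered = sorted(durations_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest-rank index
    return {"mean_ms": statistics.mean(ordered), "p99_ms": ordered[rank - 1]}
```

Passing the same list of paths to `time_calls` once per library, with a function that opens the file and extracts its text, gives two sample sets that `summarize` can compare directly.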
Feature Completeness
The biggest difference between these libraries is scope. pypdfium2 is a reader with rendering; PDF Oxide covers the full PDF lifecycle:
| Capability | PDF Oxide | pypdfium2 |
|---|---|---|
| Read and extract | Yes | Yes |
| Render pages | Yes | Yes |
| Create PDFs | Yes (Markdown, HTML, images) | No |
| Edit existing PDFs | Yes (text, images, annotations) | No |
| Fill form fields | Yes | No |
| Write encryption | Yes (AES-256) | No |
| Markdown/HTML output | Yes | No |
| OCR scanned pages | Yes (PaddleOCR via ONNX) | No |
| PDF/A validation | Yes | No |
If you only need to read and render PDFs, pypdfium2 is a solid choice. If you need any write capability — creation, editing, form filling, or encryption — PDF Oxide is the single-library solution.
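The decision rule above can be written down as a lookup over the capability table. This is only a sketch: the capability keys are shorthand invented here, and the matrix simply transcribes the table in this section.

```python
# Capability matrix transcribed from the table above (keys are illustrative shorthand).
CAPABILITIES = {
    "read": {"PDF Oxide", "pypdfium2"},
    "render": {"PDF Oxide", "pypdfium2"},
    "create": {"PDF Oxide"},
    "edit": {"PDF Oxide"},
    "fill_forms": {"PDF Oxide"},
    "write_encryption": {"PDF Oxide"},
    "markdown_html_output": {"PDF Oxide"},
    "ocr": {"PDF Oxide"},
    "pdfa_validation": {"PDF Oxide"},
}


def libraries_supporting(needs):
    """Return the libraries that cover every capability in needs."""
    candidates = {"PDF Oxide", "pypdfium2"}
    for need in needs:
        candidates &= CAPABILITIES[need]
    return sorted(candidates)


print(libraries_supporting(["read", "render"]))      # ['PDF Oxide', 'pypdfium2']
print(libraries_supporting(["read", "fill_forms"]))  # ['PDF Oxide']
```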
pypdfium2 License (Apache-2.0)
pypdfium2 is licensed under Apache-2.0, which allows commercial use. However, it wraps Google’s PDFium (the Chromium PDF engine), which has its own BSD-style license. Both are permissive.
Key considerations:
- Apache-2.0 — permissive, allows commercial use, requires attribution
- PDFium dependency — binary includes Chromium’s PDFium engine (~15 MB)
- Google’s release cycle — pypdfium2 depends on PDFium releases from the Chromium project
- No Python API stability guarantee — the API follows PDFium’s C API closely
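PDF Oxide is MIT licensed. MIT is a shorter permissive license whose main condition is that the copyright and license notice be kept with copies of the software. Unlike Apache-2.0, it has no NOTICE-file or change-statement requirements, but it also carries no explicit patent grant.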
When to Use Each
Choose PDF Oxide if:
- You need more than read/render (creation, editing, forms, encryption)
- You want Markdown or HTML conversion
- You want built-in OCR for scanned documents
- You need the highest reliability (100% vs 99.2%)
- Speed is critical and the 5× difference matters at scale
Choose pypdfium2 if:
- You only need to read and render PDFs
- You prefer PDFium’s specific rendering output
- You want a smaller dependency footprint
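Whether the speed gap matters depends on volume. As a rough worked example, the sketch below projects the mean extraction times from the benchmark table onto a hypothetical workload of one million extractions. It assumes the corpus means carry over to your documents, which they may not.

```python
# Mean extraction times (ms) taken from the benchmark table.
MEAN_MS = {"PDF Oxide": 0.8, "pypdfium2": 4.1}


def projected_seconds(mean_ms, extractions):
    """Total time in seconds if every extraction takes the mean time."""
    return mean_ms * extractions / 1000.0


for name, mean_ms in MEAN_MS.items():
    total = projected_seconds(mean_ms, 1_000_000)
    print(f"{name}: {total:.0f} s per million extractions")
# PDF Oxide: 800 s per million extractions
# pypdfium2: 4100 s per million extractions
```

At that volume the difference is roughly 13 minutes versus 68 minutes of single-threaded extraction time. For a few hundred documents it is well under a second either way.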
Related Pages
- Performance Benchmarks — full corpus results
- vs Python PDF Libraries — all Python libraries compared
- Getting Started with Python — installation and first extraction