PDF Oxide vs pypdf
PDF Oxide extracts text about 15× faster than pypdf (0.8 ms vs 12.1 ms mean), with a higher pass rate, built-in rendering, and Markdown/HTML conversion. If you need more than basic PDF manipulation, PDF Oxide does in one library what would otherwise take pypdf plus additional packages.
Why Consider PDF Oxide Over pypdf
Speed. pypdf is pure Python. PDF Oxide uses a Rust core compiled via PyO3, running directly in the Python process. Mean text extraction: 0.8ms vs 12.1ms — a 15× difference.
Reliability. PDF Oxide passes 100% of 3,830 test PDFs. pypdf passes 98.4% — 61 failures on valid PDFs.
Features. pypdf is a PDF manipulation library (merge, split, rotate, encrypt) with basic text extraction. For rendering, Markdown or HTML output, OCR, search, or creating PDFs from content, you need additional packages. PDF Oxide covers all of these in a single install.
Quick Comparison
| | PDF Oxide | pypdf |
|---|---|---|
| Mean extraction time | 0.8ms | 12.1ms |
| Pass rate (3,830 PDFs) | 100% | 98.4% |
| License | MIT | BSD-3 |
| Language | Rust + PyO3 | Pure Python |
| Text extraction | Yes | Yes |
| Character positions | Yes | Partial |
| Image extraction | Yes | Yes |
| Markdown output | Yes | No |
| HTML output | Yes | No |
| PDF creation | Yes (Markdown/HTML/images) | Limited (merge only) |
| Form fields | Read + Write | Read + Write |
| Encryption | Read + Write | Read + Write |
| Rendering | Yes | No |
| OCR | Built-in | No |
| Search | Regex + spatial | No |
| Install size | ~5 MB | ~1 MB |
Side-by-Side Code
Text Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
text = reader.pages[0].extract_text()
print(text)
Extract All Pages
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)
pypdf:
from pypdf import PdfReader
reader = PdfReader("book.pdf")
for page in reader.pages:
    text = page.extract_text()
    print(text)
Image Extraction
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
pypdf:
from pypdf import PdfReader
reader = PdfReader("report.pdf")
page = reader.pages[0]
for i, image in enumerate(page.images):
    with open(f"image_{i}.{image.name.split('.')[-1]}", "wb") as f:
        f.write(image.data)
Encrypted PDFs
PDF Oxide:
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="secret")
text = doc.extract_text(0)
pypdf:
from pypdf import PdfReader
reader = PdfReader("encrypted.pdf")
reader.decrypt("secret")
text = reader.pages[0].extract_text()
Markdown Conversion
PDF Oxide (built-in):
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
pypdf:
# pypdf has no Markdown conversion.
# You would need a separate tool chain.
Benchmark Details
| Metric | PDF Oxide | pypdf |
|---|---|---|
| Mean extraction time | 0.8ms | 12.1ms |
| p99 extraction time | 9ms | 97ms |
| Pass rate (valid PDFs) | 100% (3,823/3,823) | 98.4% (3,762/3,823) |
pypdf’s pure-Python implementation means every operation runs in the interpreter. PDF Oxide’s Rust core handles parsing, font decoding, and text assembly natively, with only the final result crossing the Python boundary.
See full benchmark methodology for corpus details.
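To reproduce this kind of comparison on your own files, a small harness along the following lines may help. It is a minimal sketch, not the methodology behind the numbers above: the `benchmark` function and its result keys are hypothetical names, any exception is assumed to count as a failure, and p99 is computed with the nearest-rank method over successful calls only.

```python
import time
from typing import Callable, Iterable


def benchmark(extract: Callable[[str], object], paths: Iterable[str]) -> dict:
    """Run `extract` on every path and summarize timing and pass rate.

    A call counts as a pass if it returns without raising.
    """
    timings_ms = []
    failures = []
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception as exc:  # assumption: any exception is a failure
            failures.append((path, repr(exc)))
            continue
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    total = len(timings_ms) + len(failures)
    ordered = sorted(timings_ms)
    n = len(ordered)
    # Nearest-rank p99: rank = ceil(0.99 * n), done in integer arithmetic.
    p99 = ordered[(99 * n + 99) // 100 - 1] if n else None
    return {
        "total": total,
        "pass_rate": len(timings_ms) / total if total else None,
        "mean_ms": sum(ordered) / n if n else None,
        "p99_ms": p99,
        "failures": failures,
    }
```

Pass each library as a callable over the same list of paths, for example `lambda p: PdfDocument(p).extract_text(0)` for PDF Oxide and `lambda p: PdfReader(p).pages[0].extract_text()` for pypdf. Timing the callable this way includes the cost of opening the file, which may differ from how the figures in the table were measured.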
Feature Gap
pypdf excels at PDF manipulation: merge, split, rotate, and encrypt. It lacks, or only partially covers, the following:
| Feature | PDF Oxide | pypdf |
|---|---|---|
| Markdown conversion | doc.to_markdown(0) | Not available |
| HTML conversion | doc.to_html(0) | Not available |
| PDF creation from content | Pdf.from_markdown(), Pdf.from_html() | Not available |
| Rendering to images | Yes | Not available |
| OCR for scanned PDFs | Built-in PaddleOCR | Not available |
| Text search | doc.search("query") | Not available |
| Character-level bounding boxes | doc.extract_chars(0) | Partial |
| PDF/A validation | Yes | Not available |
If your workflow is purely merge/split/rotate, pypdf’s lightweight pure-Python approach is a reasonable choice. For anything involving text extraction quality, creation, or conversion, PDF Oxide is the more complete option.
When to Stay with pypdf
- You need a pure-Python dependency with zero compiled extensions
- Your use case is strictly merge/split/rotate/encrypt with no text extraction
- You need pypdf’s specific PDF manipulation methods for legacy integration
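If the pure-Python constraint applies only in some environments, one possible pattern is to prefer PDF Oxide where it is installed and fall back to pypdf elsewhere. The sketch below assumes the `PdfDocument(...).extract_text(i)` call shown in the examples on this page; `pick_backend` and `extract_page_text` are hypothetical helper names, not part of either library.

```python
from importlib.util import find_spec
from typing import Optional, Sequence


def pick_backend(candidates: Sequence[str] = ("pdf_oxide", "pypdf")) -> Optional[str]:
    """Return the first importable module name in `candidates`, or None."""
    for name in candidates:
        if find_spec(name) is not None:
            return name
    return None


def extract_page_text(path: str, page_index: int = 0) -> str:
    """Extract one page's text with whichever backend is installed."""
    backend = pick_backend()
    if backend == "pdf_oxide":
        # Assumption: API as shown in the PDF Oxide examples on this page.
        from pdf_oxide import PdfDocument
        return PdfDocument(path).extract_text(page_index)
    if backend == "pypdf":
        from pypdf import PdfReader
        return PdfReader(path).pages[page_index].extract_text()
    raise ImportError("install pdf_oxide or pypdf to extract text")
```

The two libraries may not produce identical text for the same page, so code that compares or caches extracted text should record which backend produced it.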
Related Pages
- Performance Benchmarks — full corpus results
- vs Python PDF Libraries — all Python libraries compared
- Getting Started with Python — installation and first extraction