vs Rust PDF Libraries
This page compares PDF Oxide with the most-used Rust PDF crates: lopdf, printpdf, pdf-rs, and pdf_extract. Each targets a different level of abstraction and a different set of use cases.
Summary
| Feature | PDF Oxide | lopdf | printpdf | pdf-rs | pdf_extract |
|---|---|---|---|---|---|
| API level | High-level | Low-level | Mid-level (creation) | Low-level (read) | Mid-level (read) |
| Read PDFs | Yes | Yes | No | Yes | Yes |
| Write PDFs | Yes | Yes | Yes | No | No |
| Text extraction | Yes (high-level) | Manual | No | Manual | Yes (basic) |
| Image extraction | Yes (high-level) | Manual | No | Manual | No |
| Form fields | Read + Write | Manual | No | Read only | No |
| PDF creation | Yes | Yes | Yes | No | No |
| Markdown/HTML input | Yes | No | No | No | No |
| Editing existing PDFs | Yes | Yes (low-level) | No | No | No |
| Annotations | Read + Write | Manual | No | Read only | No |
| Encryption | Read + Write | No | No | Partial (read, RC4 only) | No |
| PDF/A validation | Yes | No | No | No | No |
| Rendering | Yes (tiny-skia) | No | No | Partial | No |
| Python bindings | Yes | No | No | No | No |
| License | MIT | MIT | MIT | MIT | Apache-2.0 |
All libraries are permissively licensed. The differences are in scope and abstraction level.
Performance Comparison
Full Corpus Benchmark (3,830 PDFs)
The benchmark runs on a 3,830-PDF corpus assembled from three independent, publicly available test suites: PDF specification compliance (veraPDF, 2,907 files), real-world browser rendering edge cases (Mozilla pdf.js, 897 files), and security/robustness stress tests, including malformed structures and fuzz-generated corruption (DARPA SafeDocs, 26 files). See full corpus details.
| Library | Mean | p99 | Pass Rate | Text Extraction | Notes |
|---|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in, production-grade | Unicode, CJK, reading order |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic | 48s max outlier |
| unpdf | 2.8ms | 10ms | 95.1% | Basic | 185 failures on full corpus |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic | Missing complex layouts |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction | Fails on 20% of PDFs |
lopdf is faster on the PDFs it can parse, but it fails on roughly 20% of the corpus (80.2% pass rate) and provides no text extraction. To get text out, you would need to build font decoding, CMap resolution, and spacing analysis yourself.
pdf_extract provides basic text extraction but has a 91.5% pass rate and struggles with complex layouts, CJK text, and tagged PDFs. oxidize_pdf is more reliable (99.1%) but about 17× slower than PDF Oxide on mean extraction time. Its 48-second worst-case outlier alone adds roughly 12.5 ms to the mean when averaged over the corpus, which is why its mean (13.5ms) exceeds its p99 (11ms). unpdf runs against the full corpus but fails on 185 PDFs.
PDF Oxide is the only crate in this benchmark that combines a 100% pass rate with production-grade text extraction.
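The mean and p99 columns can look contradictory when a library's mean is larger than its p99, as in the oxidize_pdf row. The sketch below shows how a single slow file produces that pattern. The timings are synthetic, `summarize` and `Summary` are hypothetical names, and the nearest-rank percentile is an assumption, since the benchmark's exact method is not stated on this page.

```rust
/// Summary statistics for one library's benchmark run (hypothetical helper type).
struct Summary {
    mean_ms: f64,
    p99_ms: f64,
    pass_rate: f64,
}

/// `timings_ms` holds one entry per successfully processed PDF;
/// `failures` counts PDFs that errored or panicked.
/// Returns `None` when there are no successful timings to summarize.
fn summarize(timings_ms: &[f64], failures: usize) -> Option<Summary> {
    if timings_ms.is_empty() {
        return None;
    }
    let mut sorted = timings_ms.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("timings must not be NaN"));
    let n = sorted.len();
    let mean_ms = sorted.iter().sum::<f64>() / n as f64;
    // Nearest-rank p99: the smallest sample with at least 99% of samples at or below it.
    let rank = ((0.99 * n as f64).ceil() as usize).clamp(1, n);
    let p99_ms = sorted[rank - 1];
    let pass_rate = 100.0 * n as f64 / (n + failures) as f64;
    Some(Summary { mean_ms, p99_ms, pass_rate })
}

fn main() {
    // Synthetic data: 999 files at 1 ms each plus one 48-second outlier.
    let mut timings = vec![1.0; 999];
    timings.push(48_000.0);
    let s = summarize(&timings, 0).expect("non-empty input");
    // prints: mean=49.0ms p99=1.0ms pass=100.0%
    println!("mean={:.1}ms p99={:.1}ms pass={:.1}%", s.mean_ms, s.p99_ms, s.pass_rate);
}
```

The outlier sits above the 99th percentile, so it leaves the p99 untouched while dominating the mean. When comparing rows in the table, the p99 is the better guide to typical worst-case latency and the mean is the better guide to total batch time.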
API Design Comparison
PDF Oxide: High-Level, Task-Oriented
PDF Oxide provides purpose-built methods for common tasks. You work with text, images, and form fields — not PDF objects and dictionaries.
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("report.pdf")?;

    // Text extraction -- one call
    let text = doc.extract_text(0)?;
    println!("{}", text);

    // Styled spans with font metadata
    let spans = doc.extract_spans(0)?;
    for span in &spans {
        println!("'{}' font={} size={:.1}pt", span.text, span.font_name, span.font_size);
    }

    // Image extraction
    let images = doc.extract_images(0)?;
    for img in &images {
        println!("{}x{} {:?}", img.width, img.height, img.format);
    }

    // Form fields
    let fields = doc.extract_form_fields()?;
    for field in &fields {
        println!("{}: {:?}", field.name, field.value);
    }
    Ok(())
}
PDF creation is equally straightforward:
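use pdf_oxide::api::Pdf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // From Markdown
    let pdf = Pdf::from_markdown("# Report\n\n| A | B |\n|---|---|\n| 1 | 2 |")?;
    pdf.save("report.pdf")?;

    // From HTML (saved under a different name so it does not overwrite the first file)
    let pdf = Pdf::from_html("<h1>Report</h1><p>Content here.</p>")?;
    pdf.save("report_html.pdf")?;
    Ok(())
}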
lopdf: Low-Level Object Manipulation
lopdf gives you direct access to PDF objects, streams, and the cross-reference table. You must understand the PDF specification to use it effectively. There is no built-in text extraction — you navigate dictionaries and decode streams yourself.
use lopdf::Document;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let doc = Document::load("report.pdf")?;

    // Get the first page's dictionary
    let page_id = doc.page_iter().next().ok_or("document has no pages")?;
    let page = doc.get_dictionary(page_id)?;

    // Get content stream -- manual work.
    // Dictionary keys are byte strings. /Contents may also be an array of
    // references, in which case as_reference() returns an error.
    let contents = page.get(b"Contents")?;
    let _stream = doc.get_object(contents.as_reference()?)?;

    // To extract text you must:
    // 1. Parse the content stream operators
    // 2. Resolve font references from /Resources
    // 3. Decode CMap/ToUnicode mappings
    // 4. Apply text matrix transformations
    // 5. Handle encoding differences
    //
    // lopdf does not provide any of this -- it is raw object access
    println!("Document has {} objects", doc.objects.len());
    Ok(())
}
lopdf is the right tool when you need to manipulate PDF structure directly: merging documents, rewriting object streams, or building specialized PDF processors.
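To give a sense of the work involved in the first step alone (parsing content stream operators), here is a minimal sketch that pulls the literal-string operands of `Tj` operators out of an already-decompressed content stream. It uses only the standard library, not lopdf, and the function names are hypothetical. It deliberately ignores `TJ` arrays, hex strings, octal escapes, font encodings, and text positioning, so it is an illustration of the problem rather than a usable extractor.

```rust
/// Reads a PDF literal string starting just after its opening '('.
/// Returns the string bytes and the index just past the closing ')'.
fn read_literal(stream: &[u8], mut i: usize) -> (Vec<u8>, usize) {
    let mut bytes = Vec::new();
    let mut depth = 1; // literal strings may contain balanced, unescaped parentheses
    while i < stream.len() {
        match stream[i] {
            b'\\' if i + 1 < stream.len() => {
                // Keep the escaped byte verbatim; octal and \n-style escapes are not decoded.
                bytes.push(stream[i + 1]);
                i += 2;
                continue;
            }
            b'(' => depth += 1,
            b')' => {
                depth -= 1;
                if depth == 0 {
                    return (bytes, i + 1);
                }
            }
            _ => {}
        }
        bytes.push(stream[i]);
        i += 1;
    }
    (bytes, i) // unterminated string: return what was read
}

/// Collects the operand of every `Tj` (show text) operator in a decoded content stream.
fn show_text_operands(stream: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut last_string: Option<Vec<u8>> = None;
    let mut i = 0;
    while i < stream.len() {
        match stream[i] {
            b'(' => {
                let (bytes, next) = read_literal(stream, i + 1);
                last_string = Some(bytes);
                i = next;
            }
            b if b.is_ascii_whitespace() => i += 1,
            _ => {
                // Read one token up to the next whitespace or '('.
                let start = i;
                while i < stream.len() && !stream[i].is_ascii_whitespace() && stream[i] != b'(' {
                    i += 1;
                }
                if &stream[start..i] == b"Tj" {
                    if let Some(bytes) = last_string.take() {
                        // Assumption: bytes are Latin-1. Real fonts need their own encoding or CMap.
                        out.push(bytes.iter().map(|&b| b as char).collect());
                    }
                } else {
                    last_string = None; // any other token means the string was not a Tj operand
                }
            }
        }
    }
    out
}

fn main() {
    let stream = b"BT /F1 12 Tf 72 720 Td (Hello \\(PDF\\)) Tj ET";
    // prints: ["Hello (PDF)"]
    println!("{:?}", show_text_operands(stream));
}
```

Even this fragment only yields raw bytes. Turning them into correct Unicode text in reading order still requires steps 2 through 5 from the list above.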
printpdf: PDF Creation Only
printpdf is a creation-only library. It cannot read or parse existing PDFs. It provides a typed API for building PDF documents from scratch with text, images, and vector graphics.
use printpdf::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (doc, page1, layer1) = PdfDocument::new(
        "Report", Mm(210.0), Mm(297.0), "Layer 1"
    );
    let current_layer = doc.get_page(page1).get_layer(layer1);

    // Add text -- requires manual font loading
    let font = doc.add_builtin_font(BuiltinFont::Helvetica)?;
    current_layer.use_text("Hello World", 24.0, Mm(10.0), Mm(280.0), &font);

    // Save
    doc.save(&mut std::io::BufWriter::new(
        std::fs::File::create("output.pdf")?,
    ))?;

    // Cannot read existing PDFs
    // Cannot extract text, images, or form fields
    Ok(())
}
printpdf is the right tool when you only need to generate new PDFs and want a clean, focused creation API.
pdf-rs: Low-Level PDF Reading
pdf-rs parses PDF structure into Rust types but provides minimal high-level functionality. You get typed access to PDF objects but must still handle text decoding, font resolution, and content stream parsing.
use pdf::file::FileOptions;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = FileOptions::cached().open("report.pdf")?;

    // Access page objects
    let page = file.get_page(0)?;
    let media_box = page.media_box()?;
    println!("Page size: {:?}", media_box);

    // Content stream access -- low-level
    if let Some(ref _contents) = page.contents {
        // Returns raw operations -- you must interpret them
        // No built-in text assembly, font decoding, or layout analysis
    }

    // Cannot write or modify PDFs
    Ok(())
}
pdf-rs is the right tool when you need a type-safe PDF parser for analysis, validation, or building a custom renderer.
Feature Comparison by Task
Text Extraction
| Library | Built-in | Quality | Effort Required |
|---|---|---|---|
| PDF Oxide | Yes | Production-grade (Unicode, CJK, reading order) | One method call |
| pdf_extract | Yes | Basic (misses complex layouts) | One method call |
| lopdf | No | N/A | Hundreds of lines of custom code |
| printpdf | No | N/A | Not possible (write-only) |
| pdf-rs | No | N/A | Significant custom code required |
PDF Oxide handles CMap/ToUnicode decoding, font metric-based spacing, structure tree reading order, and ligature reconstruction. Implementing equivalent functionality on top of lopdf or pdf-rs takes hundreds to thousands of lines of code and deep PDF specification knowledge.
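Two of those steps can be illustrated in miniature. The sketch below maps raw character codes to text through a ToUnicode-style lookup table and then expands Latin ligature code points into their component letters. It is not PDF Oxide's implementation: the `decode` and `expand_ligatures` functions, the table contents, and the choice to treat ligature reconstruction as simple code point expansion are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Expands the Latin ligature presentation forms U+FB00..U+FB04 into plain letters.
fn expand_ligatures(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            other => out.push(other),
        }
    }
    out
}

/// Maps character codes to Unicode through a ToUnicode-style table.
/// Codes missing from the table become U+FFFD (replacement character).
fn decode(codes: &[u16], to_unicode: &HashMap<u16, &str>) -> String {
    let mut text = String::new();
    for code in codes {
        match to_unicode.get(code) {
            Some(s) => text.push_str(s),
            None => text.push('\u{FFFD}'),
        }
    }
    expand_ligatures(&text)
}

fn main() {
    // Hypothetical table for a subset font: code 2 is the "ffi" ligature glyph.
    let mut table: HashMap<u16, &str> = HashMap::new();
    table.insert(1, "o");
    table.insert(2, "\u{FB03}");
    table.insert(3, "c");
    table.insert(4, "e");

    // prints: office
    println!("{}", decode(&[1, 2, 3, 4], &table));
}
```

A real decoder also has to parse the CMap stream itself, handle multi-byte codes and ranges, and fall back to font encodings when no ToUnicode entry exists, which is where most of the code volume comes from.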
PDF Creation
| Library | Approach | Markdown/HTML Input | Tables | Barcodes |
|---|---|---|---|---|
| PDF Oxide | High-level + low-level | Yes | Yes | Yes |
| lopdf | Raw object construction | No | No | No |
| printpdf | Typed layer API | No | No | No |
| pdf-rs | N/A (read-only) | N/A | N/A | N/A |
Encryption
| Library | Read Encrypted | Write Encrypted | Algorithms |
|---|---|---|---|
| PDF Oxide | Yes | Yes | RC4-40, RC4-128, AES-128, AES-256 |
| lopdf | No | No | – |
| printpdf | No | No | – |
| pdf-rs | Partial | No | RC4 only |
Compliance
| Library | PDF/A | PDF/X | PDF/UA |
|---|---|---|---|
| PDF Oxide | Validate + Convert | Validate | Validate |
| lopdf | No | No | No |
| printpdf | Partial (PDF/A-1b output) | No | No |
| pdf-rs | No | No | No |
Dependency Footprint
| Library | Dependencies | Compile Time | Binary Size |
|---|---|---|---|
| PDF Oxide | ~40 (core) | ~30s | ~4 MB |
| lopdf | ~15 | ~10s | ~1 MB |
| printpdf | ~20 | ~15s | ~2 MB |
| pdf-rs | ~25 | ~20s | ~2 MB |
PDF Oxide has more dependencies because it includes font parsing, image decoding, content stream interpretation, and encryption — features that the other libraries leave to the user or omit entirely. With all optional features (rendering, barcodes, office), the count rises to ~100.
Combining Libraries
Since all are permissively licensed, you can combine them in a single project:
[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32" # Optional: raw object access for edge cases
Common patterns:
- PDF Oxide + lopdf: Use PDF Oxide for extraction and creation, fall back to lopdf for edge cases requiring raw object manipulation.
- PDF Oxide + printpdf: Use PDF Oxide for reading and printpdf for specialized creation workflows.
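The fallback pattern in the first bullet can be sketched independently of either crate. In the example below, `with_fallback`, `extract_high_level`, and `extract_raw` are hypothetical names. The two extractor functions are stand-ins that return canned results, and in a real project they would wrap calls into PDF Oxide and lopdf respectively.

```rust
use std::fmt::Debug;

/// Runs `primary`; if it fails, logs the error and runs `fallback` instead.
fn with_fallback<T, E1: Debug, E2>(
    primary: impl FnOnce() -> Result<T, E1>,
    fallback: impl FnOnce() -> Result<T, E2>,
) -> Result<T, E2> {
    match primary() {
        Ok(value) => Ok(value),
        Err(err) => {
            eprintln!("primary path failed ({err:?}); trying fallback");
            fallback()
        }
    }
}

/// Stand-in for a high-level extraction call (always fails here, for demonstration).
fn extract_high_level(path: &str) -> Result<String, String> {
    Err(format!("{path}: unsupported structure"))
}

/// Stand-in for a hand-written extractor built on raw object access.
fn extract_raw(path: &str) -> Result<String, String> {
    Ok(format!("text recovered from {path} via raw objects"))
}

fn main() {
    let path = "report.pdf";
    let text = with_fallback(|| extract_high_level(path), || extract_raw(path));
    println!("{text:?}");
}
```

Keeping the fallback behind a closure means the low-level path only runs, and only pays its cost, for the documents that actually need it.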
Use Case Matrix
“I need to extract text from PDFs”
| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | Best extraction quality, 100% pass rate, reading order, font metadata |
| pdf_extract | Partial | Basic extraction, 91.5% pass rate |
| lopdf | No | No text extraction |
| printpdf | No | Cannot read PDFs |
| pdf-rs | Partial | Basic parsing, no high-level text extraction |
“I need to create PDFs”
| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | High-level (Markdown/HTML) and low-level APIs |
| lopdf | Partial | Low-level object construction |
| printpdf | Yes | Clean creation API, no reading |
| pdf-rs | No | Read-only |
“I need to edit existing PDFs”
| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | DOM-like editing, annotations, forms |
| lopdf | Partial | Low-level object manipulation |
| printpdf | No | Cannot read PDFs |
| pdf-rs | No | Read-only |
“I need the full lifecycle (extract + create + edit)”
| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | Only crate covering all three |
| lopdf + printpdf | Partial | Two crates, no text extraction |
| pdf-rs + printpdf | Partial | Two crates, no editing |
When to Use Each
Choose PDF Oxide if you need more than one PDF capability (extraction + creation, or extraction + editing) and want a single dependency that passed 100% of the benchmark corpus.
Choose lopdf if you need low-level PDF structure manipulation and are comfortable working with the PDF spec directly. Good for merging, splitting, and batch PDF processing.
Choose printpdf if you only create PDFs and never need to read them. The cleanest API for report and document generation.
Choose pdf-rs if you need a spec-compliant parser for PDF analysis or are building your own rendering pipeline.
Choose pdf_extract if you need basic text extraction and don’t require high reliability or complex layout support.
Related Pages
- Performance Benchmarks – full corpus benchmark results
- Getting Started with Rust – installation and first extraction
- Rust API Reference – complete Rust API
- vs Python PDF Libraries – Python ecosystem comparison