PDF Oxide vs lopdf
lopdf is a low-level Rust crate for direct PDF object manipulation. PDF Oxide is a high-level library with built-in text extraction, creation, and editing. They target fundamentally different use cases.
Key Differences
Abstraction level. lopdf gives you raw PDF objects — dictionaries, streams, and cross-reference tables. There is no text extraction, no font decoding, no image export. PDF Oxide provides purpose-built methods: extract_text(), extract_images(), to_markdown().
Reliability. lopdf fails to parse about 20% of the 3,830-PDF test corpus (an 80.2% pass rate). Of the PDFs it does parse, 57% produce empty output because lopdf has no text extraction: you get the objects but no text. PDF Oxide passes 100%.
Speed on parseable PDFs. lopdf is faster at raw object parsing: 0.3ms mean vs PDF Oxide’s 0.8ms. But lopdf does no text extraction work — you’d need to build font decoding, CMap resolution, spacing analysis, and reading order yourself.
Quick Comparison
| | PDF Oxide | lopdf |
|---|---|---|
| API level | High-level | Low-level |
| Text extraction | Built-in (production-grade) | None |
| Pass rate (3,830 PDFs) | 100% | 80.2% |
| Mean parse time | 0.8ms | 0.3ms |
| Image extraction | Built-in | Manual (raw streams) |
| Form fields | Read + Write | Manual (raw dictionaries) |
| PDF creation | Yes (Markdown/HTML) | Yes (raw objects) |
| Markdown/HTML output | Yes | No |
| Encryption | Read + Write | No |
| Rendering | Yes | No |
| PDF/A validation | Yes | No |
| License | MIT | MIT |
What lopdf Can’t Do
lopdf provides access to PDF objects, but text extraction requires interpreting those objects according to the PDF specification. Here’s what you’d need to build yourself:
- Content stream parsing: parse PostScript-like operators (Tj, TJ, Tm, Tf, etc.)
- Font resolution: look up /Font resources, resolve indirect references
- CMap/ToUnicode decoding: convert character codes to Unicode characters
- Font metric spacing: calculate character widths from font width arrays and descriptors
- Text matrix transforms: apply Tm, Td, T* operators to position text
- Reading order: determine the correct order for multi-column layouts
- Ligature reconstruction: handle fi, fl, ffi ligatures
- CJK encoding: decode Chinese, Japanese, Korean text encodings
This is thousands of lines of code and deep knowledge of ISO 32000. PDF Oxide handles all of it internally.
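To give a sense of scale, here is a minimal sketch of only the first item in the list, written against the standard library rather than lopdf. It assumes the content stream has already been decompressed, and the function name `show_text_operands` is made up for this illustration. It collects the literal strings passed to `Tj` and deliberately ignores TJ arrays, hex strings, fonts, encodings, and positioning, which is where most of the real work lies.

```rust
/// Collect the literal-string operands of `Tj` operators in a decoded content stream.
/// Toy sketch: no TJ arrays, no hex strings, no font or encoding handling.
fn show_text_operands(stream: &[u8]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        if stream[i] != b'(' {
            i += 1;
            continue;
        }
        // Read one literal string, honoring backslash escapes and nested parentheses.
        let mut depth = 1;
        let mut buf: Vec<u8> = Vec::new();
        i += 1;
        while i < stream.len() && depth > 0 {
            match stream[i] {
                b'\\' if i + 1 < stream.len() => {
                    // Keep the escaped byte as-is; octal and \n-style escapes are not decoded.
                    buf.push(stream[i + 1]);
                    i += 1;
                }
                b'(' => {
                    depth += 1;
                    buf.push(b'(');
                }
                b')' => {
                    depth -= 1;
                    if depth > 0 {
                        buf.push(b')');
                    }
                }
                other => buf.push(other),
            }
            i += 1;
        }
        // Skip whitespace and keep the string only if the next token starts with Tj.
        let mut j = i;
        while j < stream.len() && stream[j].is_ascii_whitespace() {
            j += 1;
        }
        if stream[j..].starts_with(b"Tj") {
            out.push(String::from_utf8_lossy(&buf).into_owned());
        }
    }
    out
}
```

Calling `show_text_operands(b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET")` yields a one-element vector containing `Hello World`. The bytes it returns are still raw character codes, so for most real fonts they would need the decoding steps listed above before they are readable text.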
Side-by-Side Code
Text Extraction
PDF Oxide:
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);
lopdf:
use lopdf::Document;
let doc = Document::load("report.pdf")?;
// lopdf does not provide text extraction.
// You get access to PDF objects only:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
let contents = page.get(b"Contents")?; // lopdf dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;
// To get actual text, you must:
// 1. Parse content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)
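Step 3 in the comments above can be sketched in isolation. The example below is a simplified illustration using only the standard library, not lopdf or PDF Oxide code. The names `parse_bfchar` and `decode` are hypothetical, and it assumes a decoded ToUnicode CMap whose `bfchar` entries map two-byte codes to single Unicode code points. It skips `bfrange` sections and multi-character destinations such as ligatures.

```rust
use std::collections::HashMap;

/// Parse the `<src> <dst>` pairs between `beginbfchar` and `endbfchar`.
/// Toy sketch: entries whose destination is not a single code point are skipped.
fn parse_bfchar(cmap: &str) -> HashMap<u16, char> {
    let mut map = HashMap::new();
    let mut in_block = false;
    for line in cmap.lines() {
        let line = line.trim();
        if line.ends_with("beginbfchar") {
            in_block = true;
        } else if line == "endbfchar" {
            in_block = false;
        } else if in_block {
            // Split "<0003> <0048>" into the hex tokens ["0003", "0048"].
            let hex: Vec<&str> = line
                .split(|c: char| c == '<' || c == '>' || c.is_whitespace())
                .filter(|s| !s.is_empty())
                .collect();
            if let [src, dst] = hex.as_slice() {
                let src = u16::from_str_radix(src, 16);
                let dst = u32::from_str_radix(dst, 16).ok().and_then(char::from_u32);
                if let (Ok(code), Some(ch)) = (src, dst) {
                    map.insert(code, ch);
                }
            }
        }
    }
    map
}

/// Decode big-endian two-byte character codes, using U+FFFD for unmapped codes.
fn decode(codes: &[u8], map: &HashMap<u16, char>) -> String {
    codes
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .map(|code| map.get(&code).copied().unwrap_or('\u{FFFD}'))
        .collect()
}
```

With a CMap containing `<0003> <0048>` and `<0004> <0069>`, decoding the bytes `00 03 00 04` gives `Hi`. A production decoder would also need code-space ranges, `bfrange` entries, UTF-16 surrogate pairs, and fallbacks for fonts that ship no ToUnicode map at all.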
PDF Creation
PDF Oxide:
use pdf_oxide::api::Pdf;
let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;
lopdf:
use lopdf::{Document, Object, Stream, dictionary};
let mut doc = Document::with_version("1.5");
// Create font dictionary
let font_id = doc.add_object(dictionary! {
"Type" => "Font",
"Subtype" => "Type1",
"BaseFont" => "Helvetica",
});
// Create resources
let resources_id = doc.add_object(dictionary! {
"Font" => dictionary! { "F1" => font_id },
});
// Create content stream (raw PDF content-stream operators)
let content = Stream::new(
dictionary! {},
b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);
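// Reserve the page tree's ID first so the page can point back to its parent
let pages_id = doc.new_object_id();
// Create page
let page_id = doc.add_object(dictionary! {
"Type" => "Page",
"Parent" => pages_id,
"MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
"Contents" => content_id,
"Resources" => resources_id,
});
// Wire up page tree under the reserved ID
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
"Type" => "Pages",
"Kids" => vec![page_id.into()],
"Count" => 1,
}));
// Create the catalog and register it as the document root in the trailer
let catalog_id = doc.add_object(dictionary! {
"Type" => "Catalog",
"Pages" => pages_id,
});
doc.trailer.set("Root", catalog_id);
doc.save("report.pdf")?;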
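Working at this level also means building the content stream bytes yourself. The helper below is a small standard-library sketch of that step. The names `escape_pdf_literal` and `text_content_stream` are invented for this example, and it assumes ASCII text and a font registered as `/F1`, as in the lopdf snippet above. It escapes the three characters that would otherwise break a PDF literal string.

```rust
/// Escape text for a PDF literal string by backslash-escaping `\`, `(` and `)`.
fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if matches!(ch, '\\' | '(' | ')') {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}

/// Build a one-line content stream that shows `text` at (x, y) using font resource /F1.
/// Assumes ASCII input; other text needs an encoding that matches the font.
fn text_content_stream(text: &str, size: u32, x: u32, y: u32) -> Vec<u8> {
    format!(
        "BT /F1 {} Tf {} {} Td ({}) Tj ET",
        size,
        x,
        y,
        escape_pdf_literal(text)
    )
    .into_bytes()
}
```

`text_content_stream("Hello World", 12, 72, 720)` reproduces the byte string used in the example above, and the result could be passed to `Stream::new` in its place. Line wrapping, multiple fonts, and non-ASCII text would all need further code.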
Encrypted PDFs
PDF Oxide:
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);
lopdf:
// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF will fail or produce undecrypted streams.
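If a pipeline needs to decide up front which files to keep away from an object-level path, a rough pre-flight check is possible with the standard library alone. The sketch below is a heuristic, not part of either crate, and the function name `looks_encrypted` is hypothetical. It scans the raw bytes for the `/Encrypt` key that encrypted PDFs carry in their trailer or cross-reference stream dictionary. It can report false positives if those bytes happen to appear elsewhere in the file.

```rust
/// Heuristic: report whether the raw file bytes contain an `/Encrypt` key.
/// A match suggests encryption but does not prove it.
fn looks_encrypted(pdf_bytes: &[u8]) -> bool {
    const KEY: &[u8] = b"/Encrypt";
    pdf_bytes.windows(KEY.len()).any(|window| window == KEY)
}
```

A caller might read the file with `std::fs::read("encrypted.pdf")?` and route it according to the result.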
Reliability Comparison
| Metric | PDF Oxide | lopdf |
|---|---|---|
| PDFs parsed successfully | 3,830 / 3,830 (100%) | 3,071 / 3,830 (80.2%) |
| PDFs with text output | 3,830 / 3,830 | ~1,320 / 3,830 (estimated) |
| Encrypted PDF support | Yes | No |
| Malformed PDF recovery | Yes | No |
lopdf’s 80.2% pass rate means it fails on roughly 1 in 5 PDFs. The failures occur on encrypted documents, PDFs with non-standard xref tables, and documents using cross-reference streams. PDF Oxide handles all of these with lenient parsing and fallback strategies.
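The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses the 3,830-PDF corpus size quoted in the Quick Comparison, the 3,071 successfully parsed files, and the 57% empty-output share stated earlier. The helper names are made up for this illustration.

```rust
/// Percentage of `part` within `whole`.
fn percent(part: u32, whole: u32) -> f64 {
    part as f64 / whole as f64 * 100.0
}

/// Estimated number of parsed PDFs that yield text, given the share with empty output.
fn estimated_with_text(parsed: u32, empty_share: f64) -> f64 {
    parsed as f64 * (1.0 - empty_share)
}
```

`percent(3071, 3830)` comes to about 80.2, which leaves roughly 19.8% of the corpus unparsed, or close to 1 in 5. `estimated_with_text(3071, 0.57)` comes to about 1,320, matching the estimate in the table.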
When to Use Each
Choose PDF Oxide if:
- You need text extraction, image extraction, or any content-level operation
- You want a single crate for read + write + create
- You need to handle all PDFs reliably (encrypted, malformed, complex)
- You need Markdown/HTML output, rendering, or OCR
- You want compliance validation (PDF/A, PDF/X, PDF/UA)
Choose lopdf if:
- You need direct access to PDF objects for custom processing
- You’re building a specialized PDF tool that works at the object level
- You need to merge documents by manipulating object trees directly
- Your PDFs are simple and well-formed (not encrypted, standard xref tables)
Combine both:
Use PDF Oxide for high-level operations and lopdf for edge cases requiring raw object access:
[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"
Related Pages
- Performance Benchmarks — full corpus results
- vs Rust PDF Libraries — all Rust crates compared
- Getting Started with Rust — installation and first extraction