
PDF Oxide vs Rust PDF Libraries

This page compares PDF Oxide with the most-used Rust PDF crates: lopdf, printpdf, pdf-rs, and pdf_extract. Each targets a different level of abstraction and a different set of use cases. The benchmark section also includes oxidize_pdf and unpdf.

Summary

| Feature | PDF Oxide | lopdf | printpdf | pdf-rs | pdf_extract |
|---|---|---|---|---|---|
| API level | High-level | Low-level | Mid-level (creation) | Low-level (read) | Mid-level (read) |
| Read PDFs | Yes | Yes | No | Yes | Yes |
| Write PDFs | Yes | Yes | Yes | No | No |
| Text extraction | Yes (high-level) | Manual | No | Manual | Yes (basic) |
| Image extraction | Yes (high-level) | Manual | No | Manual | No |
| Form fields | Read + Write | Manual | No | Read only | No |
| PDF creation | Yes | Yes | Yes | No | No |
| Markdown/HTML input | Yes | No | No | No | No |
| Editing existing PDFs | Yes | Yes (low-level) | No | No | No |
| Annotations | Read + Write | Manual | No | Read only | No |
| Encryption | Read + Write | No | No | No | No |
| PDF/A validation | Yes | No | No | No | No |
| Rendering | Yes (tiny-skia) | No | No | Partial | No |
| Python bindings | Yes | No | No | No | No |
| License | MIT | MIT | MIT | MIT | Apache-2.0 |

All libraries are permissively licensed. The differences are in scope and abstraction level.

Performance Comparison

Full Corpus Benchmark (3,830 PDFs)

The benchmark uses a 3,830-PDF corpus drawn from three independent, publicly available test suites: PDF specification compliance (veraPDF, 2,907 files), real-world browser rendering edge cases (Mozilla pdf.js, 897 files), and security/robustness stress tests including malformed structures and fuzz-generated corruption (DARPA SafeDocs, 26 files). See full corpus details.

| Library | Mean | p99 | Pass Rate | Text Extraction | Notes |
|---|---|---|---|---|---|
| PDF Oxide | 0.8 ms | 9 ms | 100% | Built-in, production-grade | Unicode, CJK, reading order |
| oxidize_pdf | 13.5 ms | 11 ms | 99.1% | Basic | 48 s max outlier |
| unpdf | 2.8 ms | 10 ms | 95.1% | Basic | 185 failures on full corpus |
| pdf_extract | 4.08 ms | 37 ms | 91.5% | Basic | Missing complex layouts |
| lopdf | 0.3 ms | 2 ms | 80.2% | No built-in extraction | Fails on 20% of PDFs |

lopdf is faster on the PDFs it can parse (0.3 ms mean), but it fails on roughly 20% of the corpus and provides no text extraction. You would need to build font decoding, CMap resolution, and spacing analysis yourself.

pdf_extract provides basic text extraction but has a 91.5% pass rate and struggles with complex layouts, CJK text, and tagged PDFs. oxidize_pdf is more reliable (99.1%) but about 17× slower than PDF Oxide on mean extraction time (13.5 ms vs 0.8 ms), with a 48-second worst-case outlier. unpdf runs against the full corpus but fails on 185 PDFs (95.1% pass rate).
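The table reports a mean that is larger than the p99 for oxidize_pdf, which can look like a typo. It is what a single extreme outlier produces: the mean moves, the p99 does not. The sketch below shows the arithmetic on synthetic data. The page does not say how its statistics were computed, so the nearest-rank percentile and the choice to average successful runs only are assumptions, and `summarize` is a hypothetical helper, not part of any benchmarked crate.

```rust
/// Summary statistics for one library's run over a corpus.
#[derive(Debug, PartialEq)]
struct Summary {
    mean_ms: f64,
    p99_ms: f64,
    pass_rate: f64, // fraction in 0.0..=1.0
}

/// `results` holds one entry per PDF: Some(elapsed_ms) on success, None on failure.
/// Assumption: mean and p99 are computed over successful runs only.
fn summarize(results: &[Option<f64>]) -> Option<Summary> {
    let mut times: Vec<f64> = results.iter().flatten().copied().collect();
    if times.is_empty() {
        return None;
    }
    times.sort_by(|a, b| a.total_cmp(b));
    let mean_ms = times.iter().sum::<f64>() / times.len() as f64;
    // Nearest-rank percentile: the value at rank ceil(0.99 * n), 1-indexed.
    let rank = ((0.99 * times.len() as f64).ceil() as usize).max(1);
    let p99_ms = times[rank - 1];
    let pass_rate = times.len() as f64 / results.len() as f64;
    Some(Summary { mean_ms, p99_ms, pass_rate })
}

fn main() {
    // Synthetic data: 999 files at 1 ms plus a single 48-second outlier.
    let mut results: Vec<Option<f64>> = vec![Some(1.0); 999];
    results.push(Some(48_000.0));
    let s = summarize(&results).unwrap();
    // The outlier pulls the mean to about 49 ms while p99 stays at 1 ms.
    println!("mean={:.3} ms p99={:.1} ms pass={:.1}%", s.mean_ms, s.p99_ms, s.pass_rate * 100.0);
}
```

When comparing libraries on a corpus like this one, reading the mean together with p99 and the worst case gives a fuller picture than any single number.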

Of the crates benchmarked here, PDF Oxide is the only one that combines a 100% pass rate on this corpus with production-grade text extraction.

API Design Comparison

PDF Oxide: High-Level, Task-Oriented

PDF Oxide provides purpose-built methods for common tasks. You work with text, images, and form fields — not PDF objects and dictionaries.

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;

// Text extraction -- one call
let text = doc.extract_text(0)?;
println!("{}", text);

// Styled spans with font metadata
let spans = doc.extract_spans(0)?;
for span in &spans {
    println!("'{}' font={} size={:.1}pt", span.text, span.font_name, span.font_size);
}

// Image extraction
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width, img.height, img.format);
}

// Form fields
let fields = doc.extract_form_fields()?;
for field in &fields {
    println!("{}: {:?}", field.name, field.value);
}

PDF creation is equally straightforward:

use pdf_oxide::api::Pdf;

// From Markdown
let pdf = Pdf::from_markdown("# Report\n\n| A | B |\n|---|---|\n| 1 | 2 |")?;
pdf.save("report.pdf")?;

// From HTML
let pdf = Pdf::from_html("<h1>Report</h1><p>Content here.</p>")?;
pdf.save("report.pdf")?;

lopdf: Low-Level Object Manipulation

lopdf gives you direct access to PDF objects, streams, and the cross-reference table. You must understand the PDF specification to use it effectively. There is no built-in text extraction — you navigate dictionaries and decode streams yourself.

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// Get page dictionary
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;

// Get content stream -- manual work
let contents = page.get(b"Contents")?; // lopdf dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;

// To extract text you must:
// 1. Parse the content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
//
// lopdf does not provide any of this -- it is raw object access
println!("Document has {} objects", doc.objects.len());

lopdf is the right tool when you need to manipulate PDF structure directly: merging documents, rewriting object streams, or building specialized PDF processors.
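To give a sense of what step 1 in the list above involves, here is a deliberately tiny, standard-library-only sketch that pulls the literal-string operands of `Tj` operators out of an already-decompressed content stream. It does not use lopdf, and `show_text_operands` is a hypothetical name. It ignores `TJ` arrays, hex strings, octal escapes, and fonts entirely, and what it returns are character codes that only happen to read as text for simple encodings.

```rust
/// Collects the operands of `Tj` operators that are written as literal strings.
/// Toy scanner: no `TJ` arrays, no hex strings, no `\n` or `\ddd` escapes.
fn show_text_operands(content: &str) -> Vec<String> {
    let chars: Vec<char> = content.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '(' {
            i += 1;
            continue;
        }
        // Read a literal string, honouring nested parentheses and backslash escapes.
        let mut depth = 1;
        let mut text = String::new();
        i += 1;
        while i < chars.len() && depth > 0 {
            match chars[i] {
                '\\' if i + 1 < chars.len() => {
                    i += 1;
                    text.push(chars[i]); // keep the escaped character verbatim
                }
                '(' => {
                    depth += 1;
                    text.push('(');
                }
                ')' => {
                    depth -= 1;
                    if depth > 0 {
                        text.push(')');
                    }
                }
                c => text.push(c),
            }
            i += 1;
        }
        // Skip whitespace, then keep the string only if the next token is exactly `Tj`.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() {
            j += 1;
        }
        let op: String = chars[j..].iter().take_while(|c| !c.is_whitespace()).collect();
        if op == "Tj" {
            out.push(text);
        }
    }
    out
}

fn main() {
    let stream = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World \\(2\\)) Tj ET";
    for s in show_text_operands(stream) {
        println!("{s}");
    }
}
```

Steps 2 through 5 (font lookup, CMap decoding, matrices, encodings) are where most of the real effort goes, and none of them appear in this sketch.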

printpdf: PDF Creation Only

printpdf is a creation-only library. It cannot read or parse existing PDFs. It provides a typed API for building PDF documents from scratch with text, images, and vector graphics.

use printpdf::*;

let (doc, page1, layer1) = PdfDocument::new(
    "Report", Mm(210.0), Mm(297.0), "Layer 1"
);

let current_layer = doc.get_page(page1).get_layer(layer1);

// Add text -- requires manual font loading
let font = doc.add_builtin_font(BuiltinFont::Helvetica)?;
current_layer.use_text("Hello World", 24.0, Mm(10.0), Mm(280.0), &font);

// Save
doc.save(&mut std::io::BufWriter::new(
    std::fs::File::create("output.pdf")?,
))?;

// Cannot read existing PDFs
// Cannot extract text, images, or form fields

printpdf is the right tool when you only need to generate new PDFs and want a clean, focused creation API.

pdf-rs: Low-Level PDF Reading

pdf-rs parses PDF structure into Rust types but provides minimal high-level functionality. You get typed access to PDF objects but must still handle text decoding, font resolution, and content stream parsing.

use pdf::file::FileOptions;

let file = FileOptions::cached().open("report.pdf")?;

// Access page objects
let page = file.get_page(0)?;
let media_box = page.media_box()?;
println!("Page size: {:?}", media_box);

// Content stream access -- low-level
if let Some(ref contents) = page.contents {
    // Returns raw operations -- you must interpret them
    // No built-in text assembly, font decoding, or layout analysis
}

// Cannot write or modify PDFs

pdf-rs is the right tool when you need a type-safe PDF parser for analysis, validation, or building a custom renderer.
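As an illustration of what "interpreting raw operations" means, the following sketch assembles text from a simplified operation list. The `Op` enum here is hypothetical and is not the pdf-rs type. The spacing and line-break rules are a naive assumption (any vertical move starts a new line, any positive horizontal move inserts a space), whereas a real interpreter tracks the full text matrix and font metrics.

```rust
/// Hypothetical, simplified operations; not the types exposed by pdf-rs.
enum Op {
    MoveText { dx: f32, dy: f32 }, // roughly the `Td` operator
    ShowText(String),              // roughly the `Tj` operator, already decoded
}

/// Naive text assembly: vertical moves become newlines, positive horizontal
/// moves become single spaces, and shown text is appended as-is.
fn assemble(ops: &[Op]) -> String {
    let mut out = String::new();
    for op in ops {
        match op {
            Op::MoveText { dy, .. } if *dy != 0.0 => {
                if !out.is_empty() && !out.ends_with('\n') {
                    out.push('\n');
                }
            }
            Op::MoveText { dx, .. } => {
                if *dx > 0.0 && !out.is_empty() && !out.ends_with(char::is_whitespace) {
                    out.push(' ');
                }
            }
            Op::ShowText(s) => out.push_str(s),
        }
    }
    out
}

fn main() {
    let ops = vec![
        Op::MoveText { dx: 72.0, dy: 720.0 },
        Op::ShowText("Hello".to_string()),
        Op::MoveText { dx: 40.0, dy: 0.0 },
        Op::ShowText("World".to_string()),
        Op::MoveText { dx: -40.0, dy: -14.0 },
        Op::ShowText("Next".to_string()),
    ];
    println!("{}", assemble(&ops));
}
```

Even this toy shows why layout analysis is the hard part: the operations carry positions, not words or lines, so every space and line break in the output is an inference.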

Feature Comparison by Task

Text Extraction

| Library | Built-in | Quality | Effort Required |
|---|---|---|---|
| PDF Oxide | Yes | Production-grade (Unicode, CJK, reading order) | One method call |
| pdf_extract | Yes | Basic (misses complex layouts) | One method call |
| lopdf | No | N/A | Hundreds of lines of custom code |
| printpdf | No | N/A | Not possible (write-only) |
| pdf-rs | No | N/A | Significant custom code required |

PDF Oxide handles CMap/ToUnicode decoding, font metric-based spacing, structure tree reading order, and ligature reconstruction. Implementing equivalent functionality on top of lopdf or pdf-rs requires thousands of lines of code and deep PDF specification knowledge.
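To make the first and last items concrete, here is a minimal sketch of code-to-Unicode decoding with a ligature entry, using only the standard library. It is an illustration of the idea, not PDF Oxide's implementation. The table contents are invented, `decode_codes` is a hypothetical helper, and real ToUnicode CMaps also cover multi-byte codes and ranges.

```rust
use std::collections::HashMap;

/// Decodes single-byte character codes through a code-to-Unicode table,
/// the role a font's ToUnicode CMap plays in a real PDF.
/// Codes missing from the table become U+FFFD (replacement character).
fn decode_codes(codes: &[u8], to_unicode: &HashMap<u8, &str>) -> String {
    let mut out = String::new();
    for code in codes {
        match to_unicode.get(code) {
            Some(s) => out.push_str(s),
            None => out.push('\u{FFFD}'),
        }
    }
    out
}

fn main() {
    // Hypothetical subset font: codes are arbitrary glyph indices,
    // and code 3 is a single "fi" ligature glyph that maps to two characters.
    let table: HashMap<u8, &str> =
        HashMap::from([(1, "o"), (2, "f"), (3, "fi"), (4, "c"), (5, "e")]);
    let text = decode_codes(&[1, 2, 3, 4, 5], &table);
    println!("{text}");
}
```

Because the codes in a subset font are arbitrary, the raw bytes in the content stream are meaningless without this table, which is why extraction built on raw object access has to resolve fonts before it can produce readable text.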

PDF Creation

| Library | Approach | Markdown/HTML Input | Tables | Barcodes |
|---|---|---|---|---|
| PDF Oxide | High-level + low-level | Yes | Yes | Yes |
| lopdf | Raw object construction | No | No | No |
| printpdf | Typed layer API | No | No | No |
| pdf-rs | N/A (read-only) | N/A | N/A | N/A |

Encryption

| Library | Read Encrypted | Write Encrypted | Algorithms |
|---|---|---|---|
| PDF Oxide | Yes | Yes | RC4-40, RC4-128, AES-128, AES-256 |
| lopdf | No | No | |
| printpdf | No | No | |
| pdf-rs | Partial | No | RC4 only |

Compliance

| Library | PDF/A | PDF/X | PDF/UA |
|---|---|---|---|
| PDF Oxide | Validate + Convert | Validate | Validate |
| lopdf | No | No | No |
| printpdf | Partial (PDF/A-1b output) | No | No |
| pdf-rs | No | No | No |

Dependency Footprint

| Library | Dependencies | Compile Time | Binary Size |
|---|---|---|---|
| PDF Oxide | ~40 (core) | ~30 s | ~4 MB |
| lopdf | ~15 | ~10 s | ~1 MB |
| printpdf | ~20 | ~15 s | ~2 MB |
| pdf-rs | ~25 | ~20 s | ~2 MB |

PDF Oxide has more dependencies because it includes font parsing, image decoding, content stream interpretation, and encryption — features that the other libraries leave to the user or omit entirely. With all optional features (rendering, barcodes, office), the count rises to ~100.

Combining Libraries

Since all are permissively licensed, you can combine them in a single project:

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"        # Optional: raw object access for edge cases

Common patterns:

  • PDF Oxide + lopdf: Use PDF Oxide for extraction and creation, fall back to lopdf for edge cases requiring raw object manipulation.
  • PDF Oxide + printpdf: Use PDF Oxide for reading and printpdf for specialized creation workflows.
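The first pattern is essentially "try the high-level path, fall back to the low-level one". A minimal sketch of that control flow follows. It uses stand-in functions rather than the real crates so that it compiles on its own: `high_level_extract` and `raw_object_extract` are hypothetical placeholders for calls into PDF Oxide and lopdf respectively, and the failure condition is invented for the demonstration.

```rust
/// Signature shared by both extraction strategies in this sketch.
type Extractor = fn(&str) -> Result<String, String>;

/// Runs `primary` first; if it fails, runs `fallback` and reports both errors
/// when neither succeeds.
fn extract_with_fallback(
    path: &str,
    primary: Extractor,
    fallback: Extractor,
) -> Result<String, String> {
    primary(path).or_else(|primary_err| {
        fallback(path).map_err(|fallback_err| {
            format!("primary failed: {primary_err}; fallback failed: {fallback_err}")
        })
    })
}

/// Stand-in for a high-level extraction call (hypothetical behaviour).
fn high_level_extract(path: &str) -> Result<String, String> {
    if path.ends_with("odd.pdf") {
        Err("unsupported structure".to_string())
    } else {
        Ok(format!("text of {path}"))
    }
}

/// Stand-in for custom extraction built on raw object access (hypothetical).
fn raw_object_extract(path: &str) -> Result<String, String> {
    Ok(format!("raw text of {path}"))
}

fn main() {
    for path in ["report.pdf", "odd.pdf"] {
        match extract_with_fallback(path, high_level_extract, raw_object_extract) {
            Ok(text) => println!("{path}: {text}"),
            Err(err) => eprintln!("{path}: {err}"),
        }
    }
}
```

Keeping both strategies behind one function signature means the fallback stays an implementation detail, so callers do not need to know which crate produced the text.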

Use Case Matrix

“I need to extract text from PDFs”

| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | Best extraction quality, 100% pass rate, reading order, font metadata |
| pdf_extract | Partial | Basic extraction, 91.5% pass rate |
| lopdf | No | No text extraction |
| printpdf | No | Cannot read PDFs |
| pdf-rs | Partial | Basic parsing, no high-level text extraction |

“I need to create PDFs”

| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | High-level (Markdown/HTML) and low-level APIs |
| lopdf | Partial | Low-level object construction |
| printpdf | Yes | Clean creation API, no reading |
| pdf-rs | No | Read-only |

“I need to edit existing PDFs”

| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | DOM-like editing, annotations, forms |
| lopdf | Partial | Low-level object manipulation |
| printpdf | No | Cannot read PDFs |
| pdf-rs | No | Read-only |

“I need the full lifecycle (extract + create + edit)”

| Crate | Suitable? | Notes |
|---|---|---|
| PDF Oxide | Yes | Only crate covering all three |
| lopdf + printpdf | Partial | Two crates, no text extraction |
| pdf-rs + printpdf | Partial | Two crates, no editing |

When to Use Each

Choose PDF Oxide if you need more than one PDF capability (extraction + creation, or extraction + editing) and want a single dependency that passed 100% of the benchmark corpus.

Choose lopdf if you need low-level PDF structure manipulation and are comfortable working with the PDF spec directly. Good for merging, splitting, and batch PDF processing.

Choose printpdf if you only create PDFs and never need to read them. The cleanest API for report and document generation.

Choose pdf-rs if you need a spec-compliant parser for PDF analysis or are building your own rendering pipeline.

Choose pdf_extract if you need basic text extraction and don’t require high reliability or complex layout support.