PDF Oxide vs lopdf

lopdf is a low-level Rust crate for direct PDF object manipulation. PDF Oxide is a high-level library with built-in text extraction, creation, and editing. They target fundamentally different use cases.

Key Differences

Abstraction level. lopdf gives you raw PDF objects — dictionaries, streams, and cross-reference tables. There is no text extraction, no font decoding, no image export. PDF Oxide provides purpose-built methods: extract_text(), extract_images(), to_markdown().

Reliability. lopdf fails to parse about 20% (19.8%) of the 3,830-PDF test corpus. Of the PDFs it does parse, 57% produce empty output because lopdf has no text extraction: you get the objects but no text. PDF Oxide parses all 3,830.

Speed on parseable PDFs. lopdf is faster at raw object parsing: 0.3ms mean vs PDF Oxide’s 0.8ms. But lopdf does no text extraction work — you’d need to build font decoding, CMap resolution, spacing analysis, and reading order yourself.

Quick Comparison

| Feature | PDF Oxide | lopdf |
|---|---|---|
| API level | High-level | Low-level |
| Text extraction | Built-in (production-grade) | None |
| Pass rate (3,830 PDFs) | 100% | 80.2% |
| Mean parse time | 0.8ms | 0.3ms |
| Image extraction | Built-in | Manual (raw streams) |
| Form fields | Read + Write | Manual (raw dictionaries) |
| PDF creation | Yes (Markdown/HTML) | Yes (raw objects) |
| Markdown/HTML output | Yes | No |
| Encryption | Read + Write | No |
| Rendering | Yes | No |
| PDF/A validation | Yes | No |
| License | MIT | MIT |

What lopdf Can’t Do

lopdf provides access to PDF objects, but text extraction requires interpreting those objects according to the PDF specification. Here’s what you’d need to build yourself:

  1. Content stream parsing — parse PostScript-like operators (Tj, TJ, Tm, Tf, etc.)
  2. Font resolution — look up /Font resources, resolve indirect references
  3. CMap/ToUnicode decoding — convert glyph IDs to Unicode characters
  4. Font metric spacing — calculate character widths from font descriptors
  5. Text matrix transforms — apply Tm, Td, T* operators to position text
  6. Reading order — determine the correct order for multi-column layouts
  7. Ligature reconstruction — handle fi, fl, ffi ligatures
  8. CJK encoding — decode Chinese, Japanese, Korean text encodings

Building this takes thousands of lines of code and deep knowledge of ISO 32000, the PDF specification. PDF Oxide handles all of it internally.
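To give a sense of scale, here is a minimal sketch of toy versions of steps 1 and 3 only, written against the standard library rather than lopdf. The function names `shown_strings` and `decode` are hypothetical, and the simplifications are deliberate: it handles only literal strings shown with `Tj`, ignores `TJ` arrays, hex strings, and octal escapes, and assumes a single-byte font whose ToUnicode table has already been parsed into a map.

```rust
use std::collections::HashMap;

/// Collect the raw bytes of each literal string consumed by a `Tj` operator.
/// Toy scanner: supports `( ... ) Tj` with nested parentheses and treats a
/// backslash as "take the next byte literally" (correct for \( \) \\ only).
fn shown_strings(content: &[u8]) -> Vec<Vec<u8>> {
    let mut out = Vec::new();
    let mut pending: Option<Vec<u8>> = None; // last string operand seen
    let mut i = 0;
    while i < content.len() {
        let c = content[i];
        if c == b'(' {
            let (mut depth, mut s) = (1, Vec::new());
            i += 1;
            while i < content.len() && depth > 0 {
                let b = content[i];
                match b {
                    b'\\' if i + 1 < content.len() => {
                        i += 1;
                        s.push(content[i]);
                    }
                    b'(' => {
                        depth += 1;
                        s.push(b'(');
                    }
                    b')' => {
                        depth -= 1;
                        if depth > 0 {
                            s.push(b')');
                        }
                    }
                    other => s.push(other),
                }
                i += 1;
            }
            pending = Some(s);
        } else if c.is_ascii_whitespace() {
            i += 1;
        } else {
            // Read one token (an operator or a non-string operand).
            let start = i;
            while i < content.len() && !content[i].is_ascii_whitespace() && content[i] != b'(' {
                i += 1;
            }
            // A string operand only counts if the very next token is `Tj`.
            let operand = pending.take();
            if &content[start..i] == b"Tj" {
                out.extend(operand);
            }
        }
    }
    out
}

/// Map single-byte character codes to Unicode through a ToUnicode-style table.
/// Codes missing from the table become U+FFFD (the replacement character).
fn decode(codes: &[u8], to_unicode: &HashMap<u8, char>) -> String {
    codes
        .iter()
        .map(|c| to_unicode.get(c).copied().unwrap_or('\u{FFFD}'))
        .collect()
}
```

Given the stream `BT /F1 12 Tf 72 720 Td (\x01\x02) Tj ET` and a table mapping code 1 to `H` and code 2 to `i`, `shown_strings` yields one two-byte string and `decode` turns it into `Hi`. Everything else in the list above (font lookup, widths, matrices, reading order, multi-byte CJK encodings) is still missing from this sketch.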

Side-by-Side Code

Text Extraction

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

use lopdf::Document;

let doc = Document::load("report.pdf")?;

// lopdf does not provide text extraction.
// You get access to PDF objects only:
let page_id = doc.page_iter().next().unwrap();
let page = doc.get_dictionary(page_id)?;
let contents = page.get(b"Contents")?; // lopdf dictionary keys are byte strings
let stream = doc.get_object(contents.as_reference()?)?;

// To get actual text, you must:
// 1. Parse content stream operators
// 2. Resolve font references from /Resources
// 3. Decode CMap/ToUnicode mappings
// 4. Apply text matrix transformations
// 5. Handle encoding differences
// ... (hundreds to thousands of lines of code)

PDF Creation

PDF Oxide:

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_markdown("# Report\n\n| Q1 | Q2 |\n|---|---|\n| $1M | $2M |")?;
pdf.save("report.pdf")?;

lopdf:

use lopdf::{Document, Object, Stream, dictionary};

let mut doc = Document::with_version("1.5");

// Create font dictionary
let font_id = doc.add_object(dictionary! {
    "Type" => "Font",
    "Subtype" => "Type1",
    "BaseFont" => "Helvetica",
});

// Create resources
let resources_id = doc.add_object(dictionary! {
    "Font" => dictionary! { "F1" => font_id },
});

// Create content stream (raw PostScript operators)
let content = Stream::new(
    dictionary! {},
    b"BT /F1 12 Tf 72 720 Td (Hello World) Tj ET".to_vec(),
);
let content_id = doc.add_object(content);

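// Reserve the page-tree ID first so the page can point back to its parent
let pages_id = doc.new_object_id();

// Create page
let page_id = doc.add_object(dictionary! {
    "Type" => "Page",
    "Parent" => pages_id,
    "MediaBox" => vec![0.into(), 0.into(), 612.into(), 792.into()],
    "Contents" => content_id,
    "Resources" => resources_id,
});

// Wire up page tree
doc.objects.insert(pages_id, Object::Dictionary(dictionary! {
    "Type" => "Pages",
    "Kids" => vec![page_id.into()],
    "Count" => 1,
}));
let catalog_id = doc.add_object(dictionary! {
    "Type" => "Catalog",
    "Pages" => pages_id,
});

// The trailer must reference the catalog, or readers cannot find the page tree
doc.trailer.set("Root", catalog_id);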

doc.save("report.pdf")?;

Encrypted PDFs

PDF Oxide:

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open_with_password("encrypted.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);

lopdf:

// lopdf does not support encrypted PDFs.
// Loading an encrypted PDF will fail or produce undecrypted streams.

Reliability Comparison

| Metric | PDF Oxide | lopdf |
|---|---|---|
| PDFs parsed successfully | 3,830 / 3,830 (100%) | 3,071 / 3,830 (80.2%) |
| PDFs with text output | 3,830 / 3,830 | ~1,320 / 3,830 (estimated) |
| Encrypted PDF support | Yes | No |
| Malformed PDF recovery | Yes | No |
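The percentages and the estimated text-output count follow from figures quoted earlier on this page. The short sketch below reproduces the arithmetic, assuming the 3,830-PDF corpus size from the Quick Comparison and the 57% empty-output rate from Key Differences; the helper names are illustrative only.

```rust
/// Percentage of `part` in `whole`, rounded to one decimal place.
fn percent(part: u32, whole: u32) -> f64 {
    (part as f64 / whole as f64 * 1000.0).round() / 10.0
}

/// Estimated number of parsed PDFs that yield non-empty text, given the
/// fraction of parsed PDFs that produce empty output.
fn estimated_with_text(parsed: u32, empty_fraction: f64) -> u32 {
    (parsed as f64 * (1.0 - empty_fraction)).round() as u32
}
```

With 3,071 parsed out of 3,830, `percent(3071, 3830)` gives 80.2, and `estimated_with_text(3071, 0.57)` gives 1,321, in line with the table's figure of roughly 1,320.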

lopdf’s 80.2% pass rate means it fails on roughly 1 in 5 PDFs. The failures occur on encrypted documents, PDFs with non-standard xref tables, and documents using cross-reference streams. PDF Oxide handles all of these with lenient parsing and fallback strategies.
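This page does not describe what PDF Oxide's fallback strategies are. One recovery technique commonly used by lenient PDF parsers when the xref table is missing or damaged is to scan the file for `N G obj` headers and rebuild the offset table from them. The sketch below illustrates that general idea with the standard library only. It is not PDF Oxide's implementation; the function names are hypothetical, and it only recognizes headers that start a line and are followed by whitespace or end of line.

```rust
use std::collections::BTreeMap;

/// Parse a line of the form `<num> <generation> obj`, returning the object ID.
fn parse_obj_header(line: &[u8]) -> Option<(u32, u16)> {
    let text = std::str::from_utf8(line).ok()?; // binary lines are skipped
    let mut parts = text.split_ascii_whitespace();
    let num = parts.next()?.parse().ok()?;
    let generation = parts.next()?.parse().ok()?;
    (parts.next()? == "obj").then_some((num, generation))
}

/// Rebuild an object-offset table by scanning the raw file line by line.
/// Returns (object number, generation) -> byte offset of the header.
/// A later definition of the same ID overwrites an earlier one, which
/// mirrors how incremental updates supersede older objects.
fn scan_object_offsets(data: &[u8]) -> BTreeMap<(u32, u16), usize> {
    let mut table = BTreeMap::new();
    let mut offset = 0;
    for line in data.split(|&b| b == b'\n' || b == b'\r') {
        if let Some(id) = parse_obj_header(line) {
            // Point at the object number itself, past any leading spaces.
            let indent = line.iter().take_while(|b| b.is_ascii_whitespace()).count();
            table.insert(id, offset + indent);
        }
        offset += line.len() + 1; // +1 for the terminator that split() consumed
    }
    table
}
```

A real recovery pass would also need to handle headers that share a line with other data, objects packed inside object streams, and locating the trailer or catalog afterwards.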

When to Use Each

Choose PDF Oxide if:

  • You need text extraction, image extraction, or any content-level operation
  • You want a single crate for read + write + create
  • You need to handle all PDFs reliably (encrypted, malformed, complex)
  • You need Markdown/HTML output, rendering, or OCR
  • You want compliance validation (PDF/A, PDF/X, PDF/UA)

Choose lopdf if:

  • You need direct access to PDF objects for custom processing
  • You’re building a specialized PDF tool that works at the object level
  • You need to merge documents by manipulating object trees directly
  • Your PDFs are simple and well-formed (not encrypted, standard xref tables)

Combine both:

Use PDF Oxide for high-level operations and lopdf for edge cases requiring raw object access:

[dependencies]
pdf_oxide = "0.3"
lopdf = "0.32"