Getting Started with PDF Oxide (Zig)
PDF Oxide is the fastest PDF library with built-in text extraction: 0.8 ms mean, 100% pass rate on 3,830 PDFs. The Zig binding is idiomatic Zig over the pdf_oxide C ABI via @cImport, with no shim and first-class C interop. Handles are structs with a deinit method, and returned C strings and buffers are copied into memory obtained from a caller-provided allocator.
The binding is pinned to Zig 0.15.1. Zig's pre-1.0 build system and C-import APIs drift between releases, and CI runs on that same version.
Installation
The binding links the default-feature cdylib (not the Python wheel). Build the native library, then point Zig at the header and the cdylib:
# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
    --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. test + run the example
cd zig
LD_LIBRARY_PATH="$PWD/../target/release" \
    zig build test \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"
LD_LIBRARY_PATH="$PWD/../target/release" \
    zig build example \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"
In your own code, import the module and you are ready to go:
const pdf_oxide = @import("pdf_oxide");
Opening a PDF
Open a file with Document.open (or Document.openFromBytes for in-memory data) and inspect its metadata. Every handle owns C resources, so pair it with defer doc.deinit().
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
pub fn main() !void {
    // No allocator is needed here: none of these calls return owned memory,
    // and Zig rejects unused local constants at compile time.
    var doc = try pdf_oxide.Document.open("research-paper.pdf");
    defer doc.deinit();
    std.debug.print("pages: {d}\n", .{try doc.pageCount()});
    const v = doc.version();
    std.debug.print("version: {d}.{d}\n", .{ v.major, v.minor });
    std.debug.print("encrypted: {}\n", .{doc.isEncrypted()});
}
Text Extraction
extractText returns the text of a single page (0-based). The result is allocated with the allocator you pass in, so free it with that same allocator when done.
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("report.pdf");
defer doc.deinit();
const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});
Whole-document variants extract every page at once:
const all_text = try doc.toPlainTextAll(a);
defer a.free(all_text);
std.debug.print("{s}\n", .{all_text});
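For large documents it can be preferable to walk the pages one at a time rather than hold the whole text in memory. The following is a minimal sketch that combines pageCount and extractText from the examples above with std.heap.ArenaAllocator. It assumes that the integer type returned by pageCount is the same type extractText accepts as a page index; adjust the loop variable if the binding differs.

```zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

pub fn main() !void {
    // One arena backs every per-page string; resetting it frees them in bulk.
    var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
    defer arena.deinit();
    const a = arena.allocator();

    var doc = try pdf_oxide.Document.open("report.pdf");
    defer doc.deinit();

    const n = try doc.pageCount();
    // Assumption: the page index uses the integer type that pageCount returns.
    var i: @TypeOf(n) = 0;
    while (i < n) : (i += 1) {
        const text = try doc.extractText(a, i);
        std.debug.print("--- page {d} ---\n{s}\n", .{ i, text });
        // Drop this page's text before the next iteration; no per-slice free.
        _ = arena.reset(.retain_capacity);
    }
}
```

With an arena, the individual `defer a.free(...)` calls become unnecessary because `arena.reset` and `arena.deinit` release everything at once.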
Markdown & HTML Conversion
Convert a single page or the whole document to Markdown or HTML. Each returns an allocator-owned slice.
const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});
const md_all = try doc.toMarkdownAll(a);
defer a.free(md_all);
const html = try doc.toHtml(a, 0);
defer a.free(html);
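A common next step is to write the converted output to disk. The sketch below uses toMarkdownAll as shown above together with the standard library's std.fs.Dir.writeFile. The file names are placeholders.

```zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

pub fn main() !void {
    const a = std.heap.page_allocator;

    var doc = try pdf_oxide.Document.open("report.pdf");
    defer doc.deinit();

    const md_all = try doc.toMarkdownAll(a);
    defer a.free(md_all);

    // writeFile creates (or truncates) the file and writes the whole slice.
    try std.fs.cwd().writeFile(.{ .sub_path = "report.md", .data = md_all });
}
```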
Word-Level Extraction
extractWords returns a slice of Word structs with text, bounding box, font, and bold flag. Free the whole slice with the matching freeWords helper — it releases the per-word strings and the backing slice.
const words = try doc.extractWords(a, 0);
defer pdf_oxide.Document.freeWords(a, words);
for (words) |w| {
    std.debug.print("'{s}' at ({d:.1}, {d:.1}) font={s} size={d:.1} bold={}\n", .{
        w.text, w.bbox.x, w.bbox.y, w.fontName, w.fontSize, w.bold,
    });
}
Word fields:
| Field | Type | Description |
|---|---|---|
| text | []u8 | Word text (allocator-owned) |
| bbox | Bbox | { x, y, width, height } in points |
| fontName | []u8 | PostScript font name |
| fontSize | f32 | Font size in points |
| bold | bool | Whether the run is bold |
The same pattern gives you characters and lines:
const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);
const lines = try doc.extractTextLines(a, 0);
defer pdf_oxide.Document.freeTextLines(a, lines);
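Word bounding boxes are often combined, for example to highlight a phrase that spans several words. The sketch below computes the smallest box enclosing a slice of words. To stay compilable without the binding, it declares local Word and Bbox mirrors of the documented fields; the f32 type for the bbox components and the helper name unionBbox are assumptions for illustration, not part of the pdf_oxide API.

```zig
const std = @import("std");

// Local mirrors of the documented fields so the sketch compiles on its own.
// Assumption: bbox components are f32; check the binding's actual Bbox type.
const Bbox = struct { x: f32, y: f32, width: f32, height: f32 };
const Word = struct { text: []const u8, bbox: Bbox, fontSize: f32, bold: bool };

/// Smallest box that encloses every word, or null for an empty slice.
fn unionBbox(words: []const Word) ?Bbox {
    if (words.len == 0) return null;
    var min_x = words[0].bbox.x;
    var min_y = words[0].bbox.y;
    var max_x = min_x + words[0].bbox.width;
    var max_y = min_y + words[0].bbox.height;
    for (words[1..]) |w| {
        min_x = @min(min_x, w.bbox.x);
        min_y = @min(min_y, w.bbox.y);
        max_x = @max(max_x, w.bbox.x + w.bbox.width);
        max_y = @max(max_y, w.bbox.y + w.bbox.height);
    }
    return Bbox{
        .x = min_x,
        .y = min_y,
        .width = max_x - min_x,
        .height = max_y - min_y,
    };
}
```

With the real binding, the same loop would run over the slice returned by extractWords, before the deferred freeWords call releases it.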
Search
search looks within one page; searchAll scans every page. Both take a NUL-terminated term and a case_sensitive flag, and return a slice of SearchResult.
const hits = try doc.searchAll(a, "configuration", false);
defer pdf_oxide.Document.freeSearchResults(a, hits);
for (hits) |hit| {
    std.debug.print("page {d}: '{s}' at ({d:.0}, {d:.0})\n", .{
        hit.page, hit.text, hit.bbox.x, hit.bbox.y,
    });
}
To restrict the search to a single page, use search with the page index:
const page_hits = try doc.search(a, 0, "Alpha", false);
defer pdf_oxide.Document.freeSearchResults(a, page_hits);
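Zig string literals are already NUL-terminated, which is why the examples above can pass them directly. A term that arrives at runtime as a plain slice (from command-line arguments or a file, say) carries no sentinel and needs a terminated copy first. The sketch below uses std.mem.Allocator.dupeZ for that; the helper name toTermZ is hypothetical, and it assumes the search functions accept a [:0]const u8.

```zig
const std = @import("std");

/// Copy a runtime slice into a NUL-terminated buffer.
/// The caller frees the result with the same allocator.
fn toTermZ(a: std.mem.Allocator, term: []const u8) ![:0]u8 {
    return a.dupeZ(u8, term);
}

pub fn main() !void {
    const a = std.heap.page_allocator;

    // Stand-in for user input: the explicit type drops the literal's sentinel.
    const line: []const u8 = "find: configuration";
    const term = line[6..];

    const term_z = try toTermZ(a, term);
    defer a.free(term_z);

    // term_z can now be passed where a NUL-terminated term is expected,
    // e.g. doc.searchAll(a, term_z, false).
    std.debug.print("{s}\n", .{term_z});
}
```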
Creating a PDF
The Pdf type builds documents from Markdown, HTML, or plain text. toBytes serializes to memory; save writes to disk.
const a = std.heap.page_allocator;
var pdf = try pdf_oxide.Pdf.fromMarkdown("# Hello\n\nThis is a **Zig** PDF.\n");
defer pdf.deinit();
// Serialize to memory...
const bytes = try pdf.toBytes(a);
defer a.free(bytes);
// ...or write straight to disk.
try pdf.save("output.pdf");
You can round-trip a freshly built PDF straight back through the extractor:
var pdf = try pdf_oxide.Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");
defer pdf.deinit();
const bytes = try pdf.toBytes(a);
defer a.free(bytes);
var doc = try pdf_oxide.Document.openFromBytes(bytes);
defer doc.deinit();
const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});
Error Handling
Fallible calls return Error!T, where Error is error{ PdfOxide, OutOfMemory }. Because Zig error values cannot carry a payload, the underlying C-ABI code is exposed via lastErrorCode() — read it right after catching error.PdfOxide.
const text = doc.extractText(a, 99) catch |err| switch (err) {
    error.PdfOxide => {
        std.debug.print("pdf_oxide error code: {d}\n", .{pdf_oxide.lastErrorCode()});
        return;
    },
    error.OutOfMemory => return err,
};
defer a.free(text);
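The advice to read the code right after catching the error follows from how a side-channel error code behaves: a later call may overwrite it. The sketch below illustrates this with hypothetical stand-ins (mockCall, mockLastErrorCode) rather than the binding itself. It assumes that every call overwrites the stored code, which may not match how pdf_oxide actually manages it.

```zig
const std = @import("std");

const Error = error{ PdfOxide, OutOfMemory };

// Hypothetical stand-ins, NOT the binding's implementation: a side-channel
// code that every call overwrites, and a call that fails when code != 0.
var last_code: i32 = 0;

fn mockLastErrorCode() i32 {
    return last_code;
}

fn mockCall(code: i32) Error!void {
    last_code = code;
    if (code != 0) return error.PdfOxide;
}

/// Captures the side-channel code immediately after a failure, before any
/// other call can overwrite it. Returns null when the call succeeded.
fn codeOf(result: Error!void) Error!?i32 {
    result catch |err| switch (err) {
        error.PdfOxide => return mockLastErrorCode(),
        error.OutOfMemory => return err,
    };
    return null;
}
```

Capturing the code into a local value at the catch site means later calls cannot change what gets logged or returned to the caller.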
Next Steps
- Rust Getting Started — the native core PDF Oxide is built on
- Python Getting Started — using PDF Oxide from Python
- Text Extraction — detailed extraction options and recipes
- PDF Creation — advanced creation with metadata and encryption