Getting Started with PDF Oxide (Julia)

PDF Oxide is the fastest PDF library for Julia, with 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. The PdfOxide.jl package wraps the Rust core directly over the C ABI, so you get native speed with an idiomatic Julia API. Page indices are 0-based, unlike Julia's 1-based arrays, so the first page is index 0.

Installation

Add the package with Julia's package manager:

using Pkg
Pkg.add("PdfOxide")

The native library (libpdf_oxide) is loaded at runtime. If it is not on the system loader path, PdfOxide.jl looks for it in the following places, in order: the PDF_OXIDE_LIB_PATH environment variable (full path to the library file), the PDF_OXIDE_LIB_DIR environment variable (the directory that contains it), and finally the local target/release build directory.

export PDF_OXIDE_LIB_DIR=/path/to/pdf_oxide/target/release
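
To check which file that lookup order would pick on your machine, the order can be sketched in plain Julia. This is a hypothetical helper, not PdfOxide.jl's actual loader: the function names and the exact library file name are assumptions made for illustration.

```julia
using Libdl  # stdlib; Libdl.dlext is the platform's shared-library extension

# Hypothetical helper, NOT part of PdfOxide.jl. It lists the candidate paths
# in the documented order: PDF_OXIDE_LIB_PATH, PDF_OXIDE_LIB_DIR, build dir.
function candidate_lib_paths(env = ENV; build_dir = joinpath("target", "release"))
    # Assumed file name; Windows builds usually drop the "lib" prefix.
    fname = (Sys.iswindows() ? "pdf_oxide." : "libpdf_oxide.") * Libdl.dlext
    paths = String[]
    haskey(env, "PDF_OXIDE_LIB_PATH") && push!(paths, env["PDF_OXIDE_LIB_PATH"])
    haskey(env, "PDF_OXIDE_LIB_DIR") && push!(paths, joinpath(env["PDF_OXIDE_LIB_DIR"], fname))
    push!(paths, joinpath(build_dir, fname))
    return paths
end

# First candidate that exists on disk, or `nothing` if none does.
function resolve_lib_path(env = ENV; kwargs...)
    for p in candidate_lib_paths(env; kwargs...)
        isfile(p) && return p
    end
    return nothing
end

println(something(resolve_lib_path(), "no library file found in the checked locations"))
```

Passing a plain `Dict` instead of `ENV` makes it easy to try different settings without touching the real environment.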

Quick start

Open a PDF and extract text from the first page. extract_text takes a 0-based page index.

using PdfOxide

doc = open_document("report.pdf")

println("pages:   ", page_count(doc))
v = version(doc)
println("version: ", v.major, ".", v.minor)

# Plain text from the first page (0-based index)
println(extract_text(doc, 0))
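
Because Julia code normally counts from 1, off-by-one mistakes are easy to make with 0-based page indices. A minimal sketch of two conversion helpers follows; `page_indices` and `page_index` are hypothetical names, not part of PdfOxide.jl, and a stand-in page count is used so the snippet runs without a PDF.

```julia
# Hypothetical helpers, not part of PdfOxide.jl.

# All valid 0-based page indices for a document with `n` pages.
page_indices(n::Integer) = 0:(n - 1)

# Convert a human-facing page number (1, 2, 3, ...) to a 0-based index,
# failing early with a clear message when it is out of range.
function page_index(n::Integer, page_number::Integer)
    1 <= page_number <= n ||
        throw(ArgumentError("page $page_number is outside 1:$n"))
    return page_number - 1
end

# Stand-in page count; with a real document use `page_count(doc)`, e.g.
#   for i in page_indices(page_count(doc))
#       println(extract_text(doc, i))
#   end
n = 3
println(collect(page_indices(n)))   # [0, 1, 2]
println(page_index(n, 1))           # 0
```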

You can also build a document in memory and open it from bytes, which is handy for tests and pipelines that never touch disk:

using PdfOxide

pdf = from_markdown("# Hello pdf_oxide\n\nThis is the **Julia** binding.\n")
doc = open_from_bytes(to_bytes(pdf))

println("pages: ", page_count(doc))
println(extract_text(doc, 0))

Document inspection

A few cheap calls tell you what you are working with before you extract:

using PdfOxide

doc = open_document("report.pdf")

@show page_count(doc)        # number of pages
@show version(doc).major     # major number of the PDF spec version
@show is_encrypted(doc)      # true if the file is password-protected

Markdown and HTML conversion

Convert a single page, or the whole document at once. Markdown preserves headings, lists, and emphasis; the _all variants concatenate every page.

using PdfOxide

doc = open_document("paper.pdf")

# One page (0-based)
md = to_markdown(doc, 0)
println(md)

# Whole document
println(to_markdown_all(doc))

# HTML for a single page
html = to_html(doc, 0)
println(html)

# Plain text without any markup
println(to_plain_text(doc, 0))
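
This guide does not say which separator the _all variants place between pages. If you need control over it, one option is to convert page by page and join the results yourself. The sketch below assumes a hypothetical helper named `join_pages` (not part of PdfOxide.jl) and uses a stand-in renderer so it runs without a PDF.

```julia
# Hypothetical helper, not part of PdfOxide.jl: render pages 0..n-1 with
# `render(i)` and join the results with `sep`.
function join_pages(render, n::Integer; sep = "\n\n---\n\n")
    return join((render(i) for i in 0:(n - 1)), sep)
end

# Stand-in renderer so the sketch runs without a PDF. With a real document:
#   join_pages(i -> to_markdown(doc, i), page_count(doc))
fake_render(i) = "# Page $(i)"
print(join_pages(fake_render, 2))
```

The same helper works with to_html or to_plain_text as the renderer, since each takes the document and a 0-based page index.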

Word-level extraction

extract_words returns a vector of Word values, each carrying its text, bounding box, font size, and a bold flag. The bounding box is a Bbox with width, height, and position fields.

using PdfOxide

doc = open_document("paper.pdf")
words = extract_words(doc, 0)

for w in first(words, 10)
    println(rpad(w.text, 20),
            " size=", w.font_size,
            " bold=", w.bold,
            " width=", round(w.bbox.width; digits = 1))
end
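
The per-word font size and bold flag are enough for simple layout heuristics. As a rough sketch, the snippet below picks out words that look like heading text. `heading_candidates` and its `min_size` threshold are hypothetical, and the sample data are NamedTuple stand-ins with the same field names as Word, so the code runs without a PDF.

```julia
# Hypothetical helper, not part of PdfOxide.jl. Works on any collection whose
# items have `text`, `font_size`, and `bold` fields, as described for Word.
function heading_candidates(words; min_size = 14.0)
    return [w.text for w in words if w.bold && w.font_size >= min_size]
end

# Stand-in data; with a real document pass `extract_words(doc, 0)` instead.
sample = [
    (text = "Introduction", font_size = 18.0, bold = true),
    (text = "This",         font_size = 10.0, bold = false),
    (text = "Note",         font_size = 10.0, bold = true),
]
println(heading_candidates(sample))   # ["Introduction"]
```

A fixed size threshold is only a starting point; documents differ, so you would tune `min_size` against the body text size you actually see.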

For line-oriented layout, extract_text_lines returns TextLine values, each with its text, a word_count, and a bbox:

using PdfOxide

doc = open_document("paper.pdf")
lines = extract_text_lines(doc, 0)

for line in lines
    println(line.word_count, " words: ", line.text)
end

Search

Search a single page, or the entire document. The third argument is the case-sensitivity flag (false for case-insensitive). Each hit reports its text, the page it was found on, and a bbox.

using PdfOxide

doc = open_document("manual.pdf")

# Search one page (case-insensitive)
hits = search(doc, 0, "configuration", false)
for h in hits
    println("page ", h.page, ": ", h.text)
end

# Search every page
all_hits = search_all(doc, "configuration", false)
println(length(all_hits), " total matches")
for h in all_hits
    println("page ", h.page, " at (",
            round(h.bbox.x; digits = 0), ", ",
            round(h.bbox.y; digits = 0), ")")
end
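
For a quick overview of where a term occurs, the hits can be tallied by page. The sketch below assumes only that each hit has a `page` field, as described above; `hits_per_page` is a hypothetical helper, and the sample hits are NamedTuple stand-ins so the code runs without a PDF.

```julia
# Hypothetical helper, not part of PdfOxide.jl: count hits per page and
# return `page => count` pairs sorted by page.
function hits_per_page(hits)
    counts = Dict{Int,Int}()
    for h in hits
        counts[h.page] = get(counts, h.page, 0) + 1
    end
    return sort!(collect(counts); by = first)
end

# Stand-in data; with a real document pass
# `search_all(doc, "configuration", false)` instead.
sample = [
    (page = 2, text = "configuration"),
    (page = 0, text = "Configuration"),
    (page = 2, text = "configuration"),
]
for (page, n) in hits_per_page(sample)
    println("page ", page, ": ", n, " match(es)")
end
```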

Creating a PDF

The from_* factory functions build a Pdf from Markdown, HTML, or plain text. Call to_bytes to get the raw bytes, or save to write directly to a file.

using PdfOxide

# From Markdown
pdf = from_markdown("# Invoice\n\nAmount due: **\$42**\n")
save(pdf, "invoice.pdf")

# From HTML
html_pdf = from_html("<h1>Report</h1><p>Quarterly results.</p>")
save(html_pdf, "report.pdf")

# From plain text — grab the bytes instead of writing a file
text_pdf = from_text("Plain text body.")
bytes = to_bytes(text_pdf)
println("generated ", length(bytes), " bytes")

Error handling

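Failed operations throw a PdfOxideError. Wrap calls that touch untrusted input in a try/catch block, and rethrow anything that is not a PdfOxideError so unrelated bugs are not silently swallowed: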

using PdfOxide

try
    doc = open_document("missing.pdf")
    println(extract_text(doc, 0))
catch e
    e isa PdfOxideError || rethrow()
    println("PDF error: ", e)
end
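
When the same catch-and-rethrow pattern repeats across many calls, it can be factored into a small wrapper. The sketch below is a hypothetical helper, not part of PdfOxide.jl. It is demonstrated with a Base error type so it runs without a PDF, and the commented PdfOxide usage assumes PdfOxideError can be used as the error type in the same way.

```julia
# Hypothetical helper, not part of PdfOxide.jl: run `f()`, return `fallback`
# if it throws an error of type `E`, and let every other error propagate.
function with_fallback(f, ::Type{E}, fallback) where {E}
    try
        return f()
    catch e
        e isa E || rethrow()
        return fallback
    end
end

# Stand-in demo with a Base error type. With PdfOxide loaded, the same
# pattern would read:
#   with_fallback(() -> extract_text(open_document(path), 0), PdfOxideError, "")
value = with_fallback(KeyError, "n/a") do
    Dict("a" => 1)["b"]   # throws KeyError
end
println(value)   # n/a
```

Returning a fallback suits batch jobs that should skip bad files; for interactive tools, logging the error before returning is usually worth the extra line.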

Next Steps