Getting Started with PDF Oxide (Julia)
PDF Oxide is the fastest PDF library for Julia: 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. The PdfOxide.jl package wraps the Rust core directly over the C ABI, so you get native speed with an idiomatic Julia API. Page indices are 0-based, unlike Julia's 1-based arrays, so the first page is index 0.
Installation
Add the package from the Julia REPL package manager:
using Pkg
Pkg.add("PdfOxide")
The native library (libpdf_oxide) is loaded at runtime. If it is not on the system loader path, PdfOxide.jl looks for it in the following order: the PDF_OXIDE_LIB_PATH environment variable (full path to the library file), the PDF_OXIDE_LIB_DIR environment variable (directory containing the library), then the local target/release build directory. For example, from a shell:
export PDF_OXIDE_LIB_DIR=/path/to/pdf_oxide/target/release
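The lookup order described above can be sketched in plain Julia. This only illustrates the documented order and is not PdfOxide.jl's actual loader code: `resolve_lib_path` is a hypothetical helper, and the file name (`libpdf_oxide` plus the platform extension from `Libdl.dlext`) is an assumption that may not hold on every platform.

```julia
using Libdl  # standard library; provides the platform's shared-library extension

# Hypothetical helper (not part of PdfOxide.jl): mirror the documented lookup order.
# `env` is any dictionary-like object, so the logic can be tried without touching ENV.
function resolve_lib_path(env = ENV; build_dir = joinpath("target", "release"))
    libname = "libpdf_oxide." * Libdl.dlext  # assumed file name
    # 1. Full path to the library file
    haskey(env, "PDF_OXIDE_LIB_PATH") && return env["PDF_OXIDE_LIB_PATH"]
    # 2. Directory that contains the library
    haskey(env, "PDF_OXIDE_LIB_DIR") && return joinpath(env["PDF_OXIDE_LIB_DIR"], libname)
    # 3. Local build directory
    return joinpath(build_dir, libname)
end
```

Setting `ENV["PDF_OXIDE_LIB_DIR"]` inside Julia before `using PdfOxide` should have the same effect as the shell export, assuming the package reads the variable when it loads the library.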
Quick start
Open a PDF and extract text from the first page. extract_text takes a 0-based page index.
using PdfOxide
doc = open_document("report.pdf")
println("pages: ", page_count(doc))
v = version(doc)
println("version: ", v.major, ".", v.minor)
# Plain text from the first page (0-based index)
println(extract_text(doc, 0))
You can also build a document in memory and open it from bytes — handy for tests and pipelines that never touch disk:
using PdfOxide
pdf = from_markdown("# Hello pdf_oxide\n\nThis is the **Julia** binding.\n")
doc = open_from_bytes(to_bytes(pdf))
println("pages: ", page_count(doc))
println(extract_text(doc, 0))
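Because Julia ranges and arrays normally start at 1, looping over every page is an easy place to slip. The following is a minimal sketch: `collect_pages` is a hypothetical helper (not part of PdfOxide.jl) that takes the per-page function as an argument, so it runs without the package installed.

```julia
# Hypothetical helper: call `extract_page(i)` for every 0-based page index
# i = 0, 1, ..., n_pages - 1 and collect the results in a vector.
function collect_pages(extract_page, n_pages::Integer)
    # Convert to Int first so an unsigned page count of 0 cannot wrap around.
    return [extract_page(i) for i in 0:(Int(n_pages) - 1)]
end

# Intended use with the calls shown above (requires PdfOxide and a real file):
# texts = collect_pages(i -> extract_text(doc, i), page_count(doc))
# texts[1] then holds the text of page index 0.
```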
Document inspection
A few cheap calls tell you what you are working with before you extract:
using PdfOxide
doc = open_document("report.pdf")
@show page_count(doc)      # number of pages
@show version(doc).major   # major component of the PDF spec version
@show is_encrypted(doc)    # true if the file is password-protected
Markdown and HTML conversion
Convert a single page, or the whole document at once. Markdown preserves headings, lists, and emphasis; the _all variants concatenate every page.
using PdfOxide
doc = open_document("paper.pdf")
# One page (0-based)
md = to_markdown(doc, 0)
println(md)
# Whole document
println(to_markdown_all(doc))
# HTML for a single page
html = to_html(doc, 0)
println(html)
# Plain text without any markup
println(to_plain_text(doc, 0))
Word-level extraction
extract_words returns a vector of Word values, each carrying its text, bounding box, font size, and a bold flag. The bounding box is a Bbox with width, height, and position fields.
using PdfOxide
doc = open_document("paper.pdf")
words = extract_words(doc, 0)
for w in first(words, 10)
    println(rpad(w.text, 20),
            " size=", w.font_size,
            " bold=", w.bold,
            " width=", round(w.bbox.width; digits = 1))
end
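Since each word carries a font size and a bold flag, simple layout heuristics need only plain Julia. The sketch below picks out likely heading words. `heading_words` and its `min_size` threshold are hypothetical, and the NamedTuples merely stand in for Word values, reusing the `text`, `font_size`, and `bold` field names from the loop above.

```julia
# Hypothetical helper: text of words that are bold or at least `min_size` in font size.
# Works on any collection whose elements have `text`, `font_size`, and `bold` fields.
function heading_words(words; min_size = 14.0)
    return [w.text for w in words if w.bold || w.font_size >= min_size]
end

# Stand-in data with the same field names as the words printed above.
sample = [
    (text = "Introduction", font_size = 18.0, bold = true),
    (text = "This",         font_size = 10.0, bold = false),
    (text = "Note",         font_size = 10.0, bold = true),
]

println(heading_words(sample))
```

With real data, the same call would be `heading_words(extract_words(doc, 0))`, assuming the threshold suits the document's typography.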
For line-oriented layout, extract_text_lines returns TextLine values, each with its text, a word_count, and a bbox:
using PdfOxide
doc = open_document("paper.pdf")
lines = extract_text_lines(doc, 0)
for line in lines
    println(line.word_count, " words: ", line.text)
end
Search
Search a single page, or the entire document. The final argument of both search and search_all is the case-sensitivity flag (false for case-insensitive). Each hit reports its text, the page it was found on, and a bbox.
using PdfOxide
doc = open_document("manual.pdf")
# Search one page (case-insensitive)
hits = search(doc, 0, "configuration", false)
for h in hits
    println("page ", h.page, ": ", h.text)
end
# Search every page
all_hits = search_all(doc, "configuration", false)
println(length(all_hits), " total matches")
for h in all_hits
    println("page ", h.page, " at (",
            round(h.bbox.x; digits = 0), ", ",
            round(h.bbox.y; digits = 0), ")")
end
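A common follow-up to a document-wide search is a per-page tally. The sketch below assumes only that each hit exposes the `page` field used above. `hits_per_page` is a hypothetical helper, and the NamedTuples stand in for real search results.

```julia
# Hypothetical helper: count search hits per page.
# Works on any collection whose elements have a `page` field.
function hits_per_page(hits)
    counts = Dict{Int,Int}()
    for h in hits
        counts[h.page] = get(counts, h.page, 0) + 1
    end
    return counts
end

# Stand-in data shaped like the hits printed above.
sample_hits = [
    (page = 0, text = "configuration"),
    (page = 2, text = "Configuration"),
    (page = 2, text = "configuration"),
]

# Print the tally in page order.
for (page, n) in sort(collect(hits_per_page(sample_hits)))
    println("page ", page, ": ", n, " match(es)")
end
```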
Creating a PDF
The from_* factory functions build a Pdf from Markdown, HTML, or plain text. Call to_bytes to get the raw bytes, or save to write directly to a file.
using PdfOxide
# From Markdown
pdf = from_markdown("# Invoice\n\nAmount due: **\$42**\n")
save(pdf, "invoice.pdf")
# From HTML
html_pdf = from_html("<h1>Report</h1><p>Quarterly results.</p>")
save(html_pdf, "report.pdf")
# From plain text — grab the bytes instead of writing a file
text_pdf = from_text("Plain text body.")
bytes = to_bytes(text_pdf)
println("generated ", length(bytes), " bytes")
Error handling
Failed operations raise a PdfOxideError. Wrap calls that touch untrusted input in a try/catch:
using PdfOxide
try
    doc = open_document("missing.pdf")
    println(extract_text(doc, 0))
catch e
    e isa PdfOxideError || rethrow()
    println("PDF error: ", e)
end
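When many call sites need the same pattern, the try/catch can be factored into a small wrapper. This is a sketch rather than anything PdfOxide.jl provides: `try_or` is a hypothetical helper, written generically over the error type so it can be tried without the package.

```julia
# Hypothetical helper: run `f()` and return `default` if it throws an error of type `E`.
# Any other exception propagates unchanged.
function try_or(f, default, ::Type{E}) where {E}
    try
        return f()
    catch e
        e isa E || rethrow()
        return default
    end
end

# Intended use with PdfOxide (requires the package):
# text = try_or(nothing, PdfOxideError) do
#     extract_text(open_document("missing.pdf"), 0)
# end
```

Returning `nothing` as the default keeps the failure visible to the caller, who can check it with `isnothing`.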
Next steps
- Rust Getting Started — the native core PDF Oxide is built on
- Python Getting Started — using PDF Oxide from Python
- Text Extraction — detailed extraction options and recipes
- PDF Creation — advanced creation with metadata and styling