Getting Started with PDF Oxide (Clojure)
PDF Oxide is the fastest PDF toolkit with built-in text extraction: 0.8 ms mean, 100% pass rate on 3,830 PDFs. The Clojure binding is a thin, idiomatic wrapper over the mature fyi.oxide:pdf-oxide Java binding, which owns the single JNI native bridge. The wrapper adds no native code of its own: it calls the Java classes via interop and returns Clojure-friendly values (java.util.List → vector, java.util.Optional → value-or-nil).
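The two conversions mentioned above can be sketched with plain interop. This is a minimal illustration, not the wrapper's actual source; the helper names list->vec and optional->value are hypothetical.

```clojure
(import '[java.util ArrayList Optional])

;; Hypothetical helpers, for illustration only; the real wrapper may differ.
(defn list->vec
  "Copies any java.util.List into an immutable Clojure vector."
  [^java.util.List xs]
  (vec xs))

(defn optional->value
  "Unwraps a java.util.Optional, returning nil when it is empty."
  [^Optional opt]
  (.orElse opt nil))

(list->vec (ArrayList. ["a" "b"]))   ;; => ["a" "b"]
(optional->value (Optional/of "x"))  ;; => "x"
(optional->value (Optional/empty))   ;; => nil
```

Returning nil for an empty Optional is what lets callers write (or (pdf/producer d) "(none)") without any interop of their own.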
Installation
Add the Java binding to your deps.edn. The Clojure namespace pdf-oxide.core lives in your source tree (under pdf_oxide/ on disk, because hyphens in namespace names become underscores in file paths) and wraps it:
{:deps {fyi.oxide/pdf-oxide {:mvn/version "0.3.69"}}}
The handle types (Pdf, PdfDocument, DocumentEditor) are AutoCloseable, so use with-open for deterministic cleanup.
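To see what with-open guarantees, here is a small sketch that uses a stand-in AutoCloseable rather than a real PDF handle (tracked-handle is a hypothetical helper, not part of pdf-oxide.core). It shows that every bound resource is closed, in reverse binding order, even when the body throws.

```clojure
;; Stand-in for a native handle: records its label when it is closed.
;; `tracked-handle` is hypothetical and exists only for this illustration.
(defn tracked-handle [label log]
  (reify java.lang.AutoCloseable
    (close [_] (swap! log conj label))))

(def closed (atom []))

(try
  (with-open [a (tracked-handle :first closed)
              b (tracked-handle :second closed)]
    (throw (ex-info "simulated failure" {})))
  (catch clojure.lang.ExceptionInfo _
    nil))

;; Both handles were closed despite the exception, innermost first.
@closed ;; => [:second :first]
```

The same ordering applies when a Pdf and a PdfDocument are bound in one with-open form: the document opened last is released first.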
Quick Start
Build a PDF from Markdown, open it back, and extract its text. Each step returns plain Clojure values.
(require '[pdf-oxide.core :as pdf])
(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (println "pages: " (pdf/page-count d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println (pdf/extract-text d 0)))
Opening a PDF
pdf/open accepts either a byte array or a filesystem path string, with an optional password for encrypted documents.
(require '[pdf-oxide.core :as pdf])
;; From a path
(with-open [d (pdf/open "research-paper.pdf")]
  (println "pages:" (pdf/page-count d)))
;; From bytes (e.g. downloaded from S3 or HTTP); pdf-bytes is a byte array you already hold
(with-open [d (pdf/open pdf-bytes)]
  (println (pdf/extract-text d 0)))
;; Encrypted document
(with-open [d (pdf/open "confidential.pdf" "secret")]
  (println (pdf/extract-text d 0)))
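The byte-array example above assumes pdf-bytes already exists. One way to obtain such an array, sketched here with only clojure.java.io, is a small helper; slurp-bytes is a hypothetical name, not part of pdf-oxide.core.

```clojure
(require '[clojure.java.io :as io])
(import '[java.io ByteArrayOutputStream])

;; Hypothetical helper: reads everything from `source` into a byte array.
;; `source` can be anything io/input-stream accepts, such as a File,
;; a URL, an InputStream, or a path/URL string.
(defn slurp-bytes [source]
  (with-open [in  (io/input-stream source)
              out (ByteArrayOutputStream.)]
    (io/copy in out)
    (.toByteArray out)))
```

The result can be handed straight to pdf/open, for example (pdf/open (slurp-bytes some-url)), where some-url stands for whatever location holds your document.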
You can also authenticate after opening:
(with-open [d (pdf/open "confidential.pdf")]
  (when (pdf/authenticate d "secret")
    (println (pdf/extract-text d 0))))
Text Extraction
Extract plain text from any page by its zero-based index.
(require '[pdf-oxide.core :as pdf])
(with-open [d (pdf/open "report.pdf")]
  ;; A single page
  (println (pdf/extract-text d 0))
  ;; All pages
  (doseq [i (range (pdf/page-count d))]
    (println "--- Page" (inc i) "---")
    (println (pdf/extract-text d i))))
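When you want the text as a value rather than printed output, the per-page loop can be turned into a function that returns one string. The sketch below is independent of the PDF library: join-pages is a hypothetical helper, and extract-page is any function from a zero-based page index to that page's text.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: builds one string with a header before each page.
;; `extract-page` maps a zero-based page index to that page's text.
(defn join-pages
  [page-count extract-page]
  (->> (range page-count)
       (map (fn [i] (str "--- Page " (inc i) " ---\n" (extract-page i))))
       (str/join "\n\n")))

;; Stub usage: a vector acts as a function of its index.
(join-pages 2 ["first" "second"])
;; => "--- Page 1 ---\nfirst\n\n--- Page 2 ---\nsecond"
```

With an open document d, the call would be (join-pages (pdf/page-count d) #(pdf/extract-text d %)). Because str/join realizes the lazy sequence, all extraction finishes before with-open closes the handle; returning an unrealized lazy sequence from inside with-open would instead touch a closed document.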
Page Elements
pdf/page returns a PdfPage. From it you can pull words, lines, characters, tables, images, and annotations — each as a Clojure vector. Word/line/char objects expose .text and .bbox via interop.
(require '[pdf-oxide.core :as pdf])
(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println "page width:" (.width pg))
    ;; Words with their bounding boxes
    (doseq [w (take 8 (pdf/words pg))]
      (println " " (.text w) "@" (.bbox w)))
    ;; Other element vectors
    (println "lines: " (count (pdf/lines pg)))
    (println "chars: " (count (pdf/chars pg)))
    (println "tables: " (count (pdf/tables pg)))
    (println "images: " (count (pdf/images pg)))
    (println "annotations:" (count (pdf/annotations pg)))
    ;; Plain text for the whole page
    (println (pdf/page-text pg))))
To clip extraction to a region, pass a fyi.oxide.pdf.geometry.BBox:
(import '[fyi.oxide.pdf.geometry BBox])
(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0)))))
Markdown & HTML Conversion
Convert the whole document or a single page to Markdown or HTML.
(require '[pdf-oxide.core :as pdf])
(with-open [d (pdf/open "paper.pdf")]
  ;; Whole document
  (println (pdf/to-markdown d))
  (println (pdf/to-html d))
  ;; A single page (zero-based)
  (println (pdf/to-markdown d 0))
  (println (pdf/to-html d 0)))
For richer structure, pdf/extract-structured returns the structured element tree for a page:
(with-open [d (pdf/open "paper.pdf")]
  (println (pdf/extract-structured d 0)))
Search
pdf/search scans the whole document and returns a vector of match objects. Each match exposes .text via interop.
(require '[pdf-oxide.core :as pdf])
(with-open [d (pdf/open "manual.pdf")]
  (doseq [m (pdf/search d "configuration")]
    (println (.text m))))
Rendering
Render a page to a PNG byte array, optionally at a chosen DPI.
(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(with-open [d (pdf/open "paper.pdf")]
  ;; Default DPI
  (io/copy (pdf/render d 0) (io/file "page-0.png"))
  ;; Explicit DPI
  (io/copy (pdf/render d 0 150) (io/file "page-0@150.png")))
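Since rendering hands back a raw byte array, a quick sanity check before writing it to disk can be useful. The sketch below tests for the standard eight-byte PNG file signature; png? is a hypothetical helper and is not part of pdf-oxide.core.

```clojure
;; The eight-byte signature that starts every PNG file.
(def png-signature [0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A])

;; Hypothetical helper: true when the byte array `bs` starts with the
;; PNG signature. JVM bytes are signed, so each one is masked to 0-255.
(defn png? [^bytes bs]
  (and (>= (alength bs) (count png-signature))
       (= png-signature
          (mapv #(bit-and 0xFF (aget bs %))
                (range (count png-signature))))))
```

Inside with-open this could be used as (png? (pdf/render d 0)), for example in a test that guards against an empty or truncated render.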
Creation
The Pdf type provides factory functions. pdf/save serializes a built Pdf to a byte array.
(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
;; From Markdown
(with-open [p (pdf/from-markdown "# Hello World\n\nThis is a PDF.")]
  (io/copy (pdf/save p) (io/file "output.pdf")))
;; From HTML
(with-open [p (pdf/from-html "<h1>Invoice</h1><p>Amount: $42</p>")]
  (io/copy (pdf/save p) (io/file "invoice.pdf")))
Editing & Redaction
pdf/editor opens a DocumentEditor (from a byte array or path) for structural edits. You can scrub metadata, mark regions for redaction, and apply the redactions destructively, then serialize the result with pdf/editor-save.
(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(import '[fyi.oxide.pdf.geometry BBox])
(with-open [ed (pdf/editor "form.pdf")]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))
Metadata & Lifecycle
pdf/producer and pdf/creator return the document metadata as a value, or nil when absent (java.util.Optional is unwrapped for you). Prefer with-open; pdf/close and pdf/open? are escape hatches for manual lifecycle management.
(require '[pdf-oxide.core :as pdf])
(let [d (pdf/open "paper.pdf")]
  (println "open? " (pdf/open? d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println "creator: " (or (pdf/creator d) "(none)"))
  (pdf/close d)
  (println "open? " (pdf/open? d)))
Next Steps
- Java Getting Started – the Java binding this wrapper builds on
- Text Extraction – detailed extraction options and recipes
- PDF Creation – advanced creation with metadata and styling
- Editing – modifying existing PDFs, annotations, and redaction