Getting Started with PDF Oxide (Clojure)

PDF Oxide is the fastest PDF toolkit with built-in text extraction — 0.8ms mean, 100% pass rate on 3,830 PDFs. The Clojure binding is an idiomatic, thin wrapper over the mature fyi.oxide:pdf-oxide Java binding that owns the single JNI native bridge. It adds zero native code: it calls the Java classes via interop and returns Clojure-friendly values (java.util.List → vector, java.util.Optional → value-or-nil).
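
The two conversions mentioned above can be pictured with plain Java interop. This is a minimal sketch, not the wrapper's actual source; the helper names list->vec and optional->nil are hypothetical and exist only for illustration.

```clojure
(import '[java.util ArrayList List Optional])

(defn list->vec
  "Copy a java.util.List into a persistent Clojure vector.
  Hypothetical helper, not part of pdf-oxide.core."
  [^List xs]
  (vec xs))

(defn optional->nil
  "Unwrap a java.util.Optional: the contained value, or nil when empty.
  Hypothetical helper, not part of pdf-oxide.core."
  [^Optional o]
  (.orElse o nil))

(list->vec (doto (ArrayList.) (.add "a") (.add "b"))) ;; => ["a" "b"]
(optional->nil (Optional/of "pdf-oxide"))             ;; => "pdf-oxide"
(optional->nil (Optional/empty))                      ;; => nil
```

Returning nil for an empty Optional lets callers use ordinary idioms such as or, when-let, and some-> instead of calling Optional methods directly.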

Installation

The Clojure namespace lives in your source tree and wraps the Java binding. It is required as pdf-oxide.core; the underscore spelling pdf_oxide is used only in the file path. Add the Java binding to your deps.edn:

{:deps {fyi.oxide/pdf-oxide {:mvn/version "0.3.69"}}}

The handle types (Pdf, PdfDocument, DocumentEditor) are AutoCloseable, so use with-open for deterministic cleanup.
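
To see what with-open guarantees, the sketch below uses a stand-in AutoCloseable rather than a real PDF Oxide handle. The tracked helper and the closed atom are hypothetical names introduced for illustration. Bindings are closed in reverse order, and they are closed even when the body throws.

```clojure
;; Records the order in which stand-in handles are closed.
(def closed (atom []))

(defn tracked
  "Hypothetical stand-in for a native handle: records its label when closed."
  [label]
  (reify java.lang.AutoCloseable
    (close [_] (swap! closed conj label))))

(try
  (with-open [a (tracked :first)
              b (tracked :second)]
    ;; Simulate a failure in the middle of processing.
    (throw (ex-info "boom" {})))
  (catch clojure.lang.ExceptionInfo _ nil))

@closed ;; => [:second :first]
```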

Quick Start

Build a PDF from Markdown, open it back, and extract its text. Each step returns plain Clojure values.

(require '[pdf-oxide.core :as pdf])

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (println "pages:   " (pdf/page-count d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println (pdf/extract-text d 0)))

Opening a PDF

pdf/open accepts either a byte array or a filesystem path string, with an optional password for encrypted documents.

(require '[pdf-oxide.core :as pdf])

;; From a path
(with-open [d (pdf/open "research-paper.pdf")]
  (println "pages:" (pdf/page-count d)))

;; From bytes (e.g. downloaded from S3 or HTTP)
(with-open [d (pdf/open pdf-bytes)]
  (println (pdf/extract-text d 0)))

;; Encrypted document
(with-open [d (pdf/open "confidential.pdf" "secret")]
  (println (pdf/extract-text d 0)))

You can also authenticate after opening:

(with-open [d (pdf/open "confidential.pdf")]
  (when (pdf/authenticate d "secret")
    (println (pdf/extract-text d 0))))

Text Extraction

Extract plain text from any page by its zero-based index.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "report.pdf")]
  ;; A single page
  (println (pdf/extract-text d 0))

  ;; All pages
  (doseq [i (range (pdf/page-count d))]
    (println "--- Page" (inc i) "---")
    (println (pdf/extract-text d i))))
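
When the text should outlive the with-open block, it has to be collected eagerly, because a lazy sequence realized after the handle is closed would read from a closed document. The following sketch shows one way to do that. The helper name pages-text is hypothetical, and it takes the per-page function as an argument so it can be tried without a document.

```clojure
(defn pages-text
  "Eagerly collect (page-text i) for every zero-based index i in [0, n).
  page-text is any function from a page index to a string.
  Hypothetical helper, not part of pdf-oxide.core."
  [n page-text]
  ;; mapv is eager, so all pages are read before this function returns.
  (mapv page-text (range n)))

;; Stand-in data instead of a real document:
(pages-text 3 #(str "page " %)) ;; => ["page 0" "page 1" "page 2"]
```

With a real document, the call inside with-open would presumably look like (pages-text (pdf/page-count d) #(pdf/extract-text d %)), using only the functions shown above.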

Page Elements

pdf/page returns a PdfPage. From it you can pull words, lines, characters, tables, images, and annotations — each as a Clojure vector. Word/line/char objects expose .text and .bbox via interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println "page width:" (.width pg))

    ;; Words with their bounding boxes
    (doseq [w (take 8 (pdf/words pg))]
      (println "  " (.text w) "@" (.bbox w)))

    ;; Other element vectors
    (println "lines:      " (count (pdf/lines pg)))
    (println "chars:      " (count (pdf/chars pg)))
    (println "tables:     " (count (pdf/tables pg)))
    (println "images:     " (count (pdf/images pg)))
    (println "annotations:" (count (pdf/annotations pg)))

    ;; Plain text for the whole page
    (println (pdf/page-text pg))))

To clip extraction to a region, pass a fyi.oxide.pdf.geometry.BBox:

(import '[fyi.oxide.pdf.geometry BBox])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0)))))

Markdown & HTML Conversion

Convert the whole document or a single page to Markdown or HTML.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  ;; Whole document
  (println (pdf/to-markdown d))
  (println (pdf/to-html d))

  ;; A single page (zero-based)
  (println (pdf/to-markdown d 0))
  (println (pdf/to-html d 0)))

For richer structure, pdf/extract-structured returns the structured element tree for a page:

(with-open [d (pdf/open "paper.pdf")]
  (println (pdf/extract-structured d 0)))

Search

pdf/search scans the whole document and returns a vector of match objects. Each match exposes .text via interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "manual.pdf")]
  (doseq [m (pdf/search d "configuration")]
    (println (.text m))))

Rendering

Render a page to a PNG byte array, optionally at a chosen DPI.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

(with-open [d (pdf/open "paper.pdf")]
  ;; Default DPI
  (io/copy (pdf/render d 0) (io/file "page-0.png"))

  ;; Explicit DPI
  (io/copy (pdf/render d 0 150) (io/file "page-0@150.png")))
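
Rendering every page follows the same pattern in a loop. The sketch below is one possible shape, with the hypothetical helper name write-pages! and an assumed page-<i>.png file naming scheme. The render function is passed in as an argument, so the helper itself depends only on clojure.java.io.

```clojure
(require '[clojure.java.io :as io])

(defn write-pages!
  "Write (render-page i) to <dir>/page-<i>.png for each i in [0, n).
  render-page is any function from a zero-based page index to a byte array.
  Returns a vector of the files written.
  Hypothetical helper, not part of pdf-oxide.core."
  [dir n render-page]
  ;; Create the output directory if it does not exist yet.
  (.mkdirs (io/file dir))
  (mapv (fn [i]
          (let [f (io/file dir (format "page-%d.png" i))]
            ;; io/copy accepts a byte array as input and a File as output.
            (io/copy (render-page i) f)
            f))
        (range n)))
```

Inside with-open, a call such as (write-pages! "out" (pdf/page-count d) #(pdf/render d % 150)) would write one PNG per page at 150 DPI, assuming pdf/render behaves as described above.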

Creation

The Pdf type provides factory functions. pdf/save serializes a built Pdf to a byte array.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

;; From Markdown
(with-open [p (pdf/from-markdown "# Hello World\n\nThis is a PDF.")]
  (io/copy (pdf/save p) (io/file "output.pdf")))

;; From HTML
(with-open [p (pdf/from-html "<h1>Invoice</h1><p>Amount: $42</p>")]
  (io/copy (pdf/save p) (io/file "invoice.pdf")))

Editing & Redaction

pdf/editor opens a DocumentEditor (from a byte array or path) for structural edits. You can scrub metadata, mark regions for redaction, apply the redactions destructively, and then serialize the result with pdf/editor-save.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [ed (pdf/editor "form.pdf")]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))

Metadata & Lifecycle

pdf/producer and pdf/creator return the document metadata as a value, or nil when absent (java.util.Optional is unwrapped for you). Prefer with-open; pdf/close and pdf/open? are escape hatches for manual lifecycle management.

(require '[pdf-oxide.core :as pdf])

(let [d (pdf/open "paper.pdf")]
  (println "open?    " (pdf/open? d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println "creator: " (or (pdf/creator d) "(none)"))
  (pdf/close d)
  (println "open?    " (pdf/open? d)))
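
With manual lifecycle management, an exception between open and close would leak the handle unless the close call sits in a finally clause. The sketch below shows that pattern in a generic form. The helper name call-with-handle is hypothetical, and the open and close functions are passed in as arguments so the pattern can be tried without a document.

```clojure
(defn call-with-handle
  "Open a handle with (open-fn), pass it to f, and always call
  (close-fn handle) afterwards, even when f throws.
  Hypothetical helper, not part of pdf-oxide.core."
  [open-fn close-fn f]
  (let [h (open-fn)]
    (try
      (f h)
      (finally
        (close-fn h)))))

;; Stand-in handle: a keyword instead of a real document.
(call-with-handle (fn [] :handle)
                  (fn [h] (println "closing" h))
                  (fn [h] (str "used " h)))
;; prints "closing :handle" and returns "used :handle"
```

With the functions described above, a call might look like (call-with-handle #(pdf/open "paper.pdf") pdf/close pdf/page-count). For handles that are AutoCloseable, with-open already provides the same guarantee.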

Next Steps