What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Clojure API Reference

PDF Oxide ships idiomatic Clojure bindings as a thin wrapper over the fyi.oxide:pdf-oxide Java binding, which owns the single JNI native bridge (the pdf_oxide_jni crate). The wrapper adds zero native code: it calls the Java classes directly via interop and returns Clojure-friendly values (java.util.List becomes a vector, java.util.Optional becomes a value or nil). The handle types (Pdf, PdfDocument, DocumentEditor, AutoExtractor) are AutoCloseable, so use with-open for deterministic cleanup.

;; deps.edn
{:deps {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}}

;; Leiningen
[fyi.oxide/pdf-oxide-clojure "0.3.69"]

The JNI native library (libpdf_oxide_jni) is not bundled. Either place it on your java.library.path so that System.loadLibrary("pdf_oxide_jni") can find it, or point the Java NativeLoader at it with -Dfyi.oxide.pdf.lib.path=<path>.
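One way to avoid typing the system property on every launch is a deps.edn alias. This is only a sketch: the alias name :pdf-native is made up for illustration, and <path> stays a placeholder for wherever the native library lives on your machine.

```clojure
;; deps.edn
;; The :pdf-native alias name is hypothetical; <path> is a placeholder.
{:deps {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}
 :aliases
 {:pdf-native
  {:jvm-opts ["-Dfyi.oxide.pdf.lib.path=<path>"]}}}
```

With this in place, clj -A:pdf-native starts a REPL whose JVM carries the property.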

Every function lives in the pdf-oxide.core namespace:

(require '[pdf-oxide.core :as pdf])

For other languages, see the Java API Reference, the Python API Reference, the Rust API Reference, and Types & Enums.

Pdf — Creation

Functions that build a new in-memory Pdf from source content, plus serialization to a byte array. The returned Pdf is AutoCloseable.

Creation

(from-markdown ^Pdf [^String markdown])

Create a Pdf from a Markdown string.

(from-html ^Pdf [^String html])

Create a Pdf from an HTML string.

Saving

(save ^bytes [^Pdf pdf])

Serialize a built Pdf to a byte array (the raw PDF bytes).

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
  (pdf/save p))                 ; => byte[]
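Because save returns raw bytes, writing the result to disk needs a byte-oriented writer rather than spit, which calls str on its argument. A minimal sketch using clojure.java.io/copy from the standard library; the file name out.pdf is illustrative only.

```clojure
(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

;; Build a PDF from HTML and write its bytes to disk.
;; "out.pdf" is an illustrative file name, not something the library requires.
(with-open [p (pdf/from-html "<h1>Hello</h1><p>body</p>")]
  (io/copy (pdf/save p) (io/file "out.pdf")))
```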

PdfDocument — Opening, Extraction & Rendering

The primary read handle for an existing PDF. Open from a byte array or a filesystem path, then extract text, convert to Markdown/HTML, render pages, search, and read metadata and form fields. AutoCloseable.

Opening

(open ^PdfDocument [source])
(open ^PdfDocument [source ^String password])

Open a document from a byte array or a filesystem path string. The two-arity form supplies a password for encrypted PDFs.

(authenticate [^PdfDocument doc ^String password])

Authenticate an encrypted document after opening; returns a boolean.
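The open-then-authenticate flow might look like the sketch below. It assumes that open succeeds on an encrypted document before a password is given, which the description of authenticate implies but does not state outright; the helper name is hypothetical. When the password is known up front, the two-arity (pdf/open source password) is the shorter route.

```clojure
(require '[pdf-oxide.core :as pdf])

;; Hypothetical helper: open `source` (byte[] or path string), unlock it
;; with `password`, and return the page count.
;; Assumption: (pdf/open source) works on an encrypted file before
;; authentication, and authenticate returns false for a wrong password.
(defn page-count-with-password
  [source password]
  (with-open [doc (pdf/open source)]
    (if (pdf/authenticate doc password)
      (pdf/page-count doc)
      (throw (ex-info "Authentication failed" {})))))
```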

Document Queries

(page-count [^PdfDocument doc])

Return the number of pages in the document.

(producer [^PdfDocument doc])

Return the /Producer metadata string, or nil if absent.

(creator [^PdfDocument doc])

Return the /Creator metadata string, or nil if absent.

Text Extraction

(extract-text [^PdfDocument doc page])

Extract plain text from a single zero-indexed page.

(extract-structured [^PdfDocument doc page])

Extract structured text (spans/blocks with positioning) for a single page.
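Since extract-text works one page at a time, whole-document extraction is a loop over page-count. A minimal sketch; all-page-text is a hypothetical helper name, not part of the library.

```clojure
(require '[pdf-oxide.core :as pdf])

;; Hypothetical helper: plain text of every page, in page order.
;; mapv is eager, so extraction finishes while the document is still open.
(defn all-page-text
  [doc]
  (mapv #(pdf/extract-text doc %) (range (pdf/page-count doc))))

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (all-page-text d))   ; vector with one entry per page
```

Using mapv rather than map matters here: a lazy sequence realized after with-open exits would touch a closed handle.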

Conversion

(to-markdown [^PdfDocument doc])
(to-markdown [^PdfDocument doc page])

Convert the whole document, or a single page, to Markdown.

(to-html [^PdfDocument doc])
(to-html [^PdfDocument doc page])

Convert the whole document, or a single page, to HTML.
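Both arities can be combined to export a document in one pass. The sketch below assumes the conversions return strings (the Complete Example prints them directly), so spit is appropriate; the output file names are illustrative.

```clojure
(require '[pdf-oxide.core :as pdf])

;; Write the whole document as Markdown and the first page as HTML.
;; File names are illustrative. Assumption: both calls return strings.
(with-open [p (pdf/from-markdown "# Report\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (spit "report.md" (pdf/to-markdown d))
  (spit "page-0.html" (pdf/to-html d 0)))
```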

Rendering

(render ^bytes [^PdfDocument doc page])
(render ^bytes [^PdfDocument doc page dpi])

Render a page to PNG image bytes, optionally at a given DPI.
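Rendering a whole document is a loop over page-count that writes each PNG byte array to its own file. A sketch under stated assumptions: render-pages! and the page-<n>.png naming scheme are hypothetical, and clojure.java.io/copy is used because spit would stringify the byte array.

```clojure
(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

;; Hypothetical helper: render every page to <dir>/page-<n>.png at `dpi`
;; and return the files that were written.
(defn render-pages!
  [doc dir dpi]
  (.mkdirs (io/file dir))
  (mapv (fn [n]
          (let [out (io/file dir (str "page-" n ".png"))]
            (io/copy (pdf/render doc n dpi) out)
            out))
        (range (pdf/page-count doc))))
```

Called as (render-pages! d "out" 150) inside a with-open, it returns one java.io.File per page.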

Search

(search [^PdfDocument doc ^String query])

Search the document for text; returns a vector of SearchMatch results.
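Because search returns a plain vector, the usual sequence functions apply to it directly. A small sketch that tallies matches for several queries; match-counts is a hypothetical helper name.

```clojure
(require '[pdf-oxide.core :as pdf])

;; Hypothetical helper: map of query string -> number of matches.
(defn match-counts
  [doc queries]
  (into {} (map (fn [q] [q (count (pdf/search doc q))])) queries))

(with-open [p (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")
            d (pdf/open (pdf/save p))]
  (match-counts d ["Report" "Oxide"]))
```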

Forms

(form-fields [^PdfDocument doc])

Return a vector of the document’s AcroForm form fields.

Page Access

(page ^PdfPage [^PdfDocument doc idx])

Get a PdfPage handle for the zero-indexed page.

(pages [^PdfDocument doc])

Return a vector of all PdfPage handles in the document.

PdfPage — Page Element Extraction

A page handle obtained from (pdf/page doc idx) or (pdf/pages doc). Each extraction function converts the Java List result into a Clojure vector.

Elements

(words [^PdfPage page])

Return a vector of word elements on the page.

(lines [^PdfPage page])

Return a vector of line elements on the page.

(chars [^PdfPage page])

Return a vector of per-character glyphs on the page. (This pdf/chars intentionally shadows clojure.core/chars.)

(tables [^PdfPage page])

Return a vector of detected tables on the page.

(images [^PdfPage page])

Return a vector of image elements on the page.

(annotations [^PdfPage page])

Return a vector of annotations on the page.
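Since every element function returns a vector, a per-page inventory is just a matter of counting. A minimal sketch; page-summary is a hypothetical helper, and it uses only the functions listed above.

```clojure
(require '[pdf-oxide.core :as pdf])

;; Hypothetical helper: how many elements of each kind a page contains.
(defn page-summary
  [page]
  {:words       (count (pdf/words page))
   :lines       (count (pdf/lines page))
   :chars       (count (pdf/chars page))
   :tables      (count (pdf/tables page))
   :images      (count (pdf/images page))
   :annotations (count (pdf/annotations page))})

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (mapv page-summary (pdf/pages d)))   ; one summary map per page
```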

Page Text

(page-text [^PdfPage page])
(page-text [^PdfPage page region])

Return the page’s plain text, optionally restricted to a BBox region.

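(import '[fyi.oxide.pdf.geometry BBox])

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (let [pg (pdf/page d 0)]
    {:words  (mapv #(.text %) (pdf/words pg))                      ; word strings, realized eagerly
     :region (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0))}))   ; region text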

DocumentEditor — Editing & Redaction

A mutable editing handle opened independently of PdfDocument. Supports metadata scrubbing and destructive redaction, then serializes the result to bytes. AutoCloseable.

(editor ^DocumentEditor [source])

Open a DocumentEditor from a byte array or a filesystem path string.

(scrub-metadata [^DocumentEditor ed])

Remove document metadata (info dictionary / XMP) in place.

(add-redaction [^DocumentEditor ed page region])

Mark a rectangular BBox region on a zero-indexed page for redaction.

(apply-redactions [^DocumentEditor ed])

Apply all pending redactions destructively, removing the underlying content.

(editor-save ^bytes [^DocumentEditor ed])

Serialize the edited document to a byte array.

;; pdf-bytes is a byte[] holding an existing PDF; BBox is fyi.oxide.pdf.geometry.BBox
(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (pdf/editor-save ed))
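To redact the same region on every page, the page count has to come from a PdfDocument, since no page-count function is listed for DocumentEditor. A sketch under that reading of the API; redact-on-all-pages is a hypothetical helper name.

```clojure
(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

;; Hypothetical helper: mark `region` for redaction on every page of the
;; PDF in `pdf-bytes`, apply the redactions, and return the new bytes.
(defn redact-on-all-pages
  [pdf-bytes region]
  (let [n (with-open [d (pdf/open pdf-bytes)]
            (pdf/page-count d))]
    (with-open [ed (pdf/editor pdf-bytes)]
      (doseq [i (range n)]
        (pdf/add-redaction ed i region))
      (pdf/apply-redactions ed)
      (pdf/editor-save ed))))
```

A call such as (redact-on-all-pages pdf-bytes (BBox. 10.0 10.0 50.0 20.0)) returns the redacted document as a byte array.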

AutoExtractor — Auto Extraction

A convenience extractor that picks an extraction strategy automatically for a PdfDocument.

(auto-extractor ^AutoExtractor [^PdfDocument doc])

Create an AutoExtractor for the given document.

(auto-text [^AutoExtractor ax])

Extract text from the whole document using the auto-selected strategy.

(with-open [d  (pdf/open pdf-bytes)
            ax (pdf/auto-extractor d)]   ; AutoExtractor is AutoCloseable too
  (pdf/auto-text ax))

Lifecycle

The handle types are AutoCloseable; prefer with-open for deterministic cleanup. These functions are escape hatches for non-with-open usage.

(close [resource])

Close any handle (Pdf, PdfDocument, PdfPage, DocumentEditor, AutoExtractor).

(open? [resource])

Return whether the handle is still open.

(let [d (pdf/open pdf-bytes)]
  (pdf/open? d)        ; => true
  (pdf/close d)
  (pdf/open? d))       ; => false
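When with-open does not fit, for example when a handle outlives a single lexical scope, pairing close with try/finally gives the same guarantee by hand. A minimal sketch that builds its own input bytes so it can run standalone:

```clojure
(require '[pdf-oxide.core :as pdf])

;; Input bytes built from Markdown so the snippet is self-contained.
(def pdf-bytes
  (with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
    (pdf/save p)))

;; Manual lifecycle: close in `finally` so the handle is released even if
;; the body throws. This is roughly what with-open expands to.
(let [d (pdf/open pdf-bytes)]
  (try
    (pdf/page-count d)
    (finally
      (pdf/close d))))
```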

Complete Example

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(import '[fyi.oxide.pdf.geometry BBox])

;; --- Creation + Extraction ---
(with-open [p (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")
            d (pdf/open (pdf/save p))]
  (println "Pages:" (pdf/page-count d))
  (println (pdf/extract-text d 0))
  (println (pdf/to-markdown d))
  (println (pdf/to-html d 0))

  ;; Page elements (List -> vector)
  (let [pg (pdf/page d 0)]
    (println "Words:" (count (pdf/words pg)))
    (doseq [w (pdf/words pg)] (print (.text w) ""))
    (println))

  ;; Search
  (doseq [m (pdf/search d "Report")]
    (println "Match:" (.text m)))

  ;; Metadata (Optional -> nil)
  (println "Producer:" (or (pdf/producer d) "(none)"))

  ;; Render: write the PNG bytes with io/copy (spit would stringify the byte[])
  (io/copy (pdf/render d 0 150) (io/file "page0.png")))

;; --- Editing + Redaction ---
;; The input bytes are rebuilt from Markdown so the example is self-contained.
(with-open [p  (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")
            ed (pdf/editor (pdf/save p))]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))

Other Language Bindings

PDF Oxide ships native bindings for every major ecosystem: Rust, Python, Node.js, WASM, C#, Go, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Objective-C, and Elixir.

Next Steps

Types & Enums — all shared types and enums
Page API Reference — consistent per-page iteration across bindings
Getting Started with Clojure — tutorial