What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Clojure)

PDF Oxide is the fastest PDF toolkit with built-in text extraction — 0.8ms mean, 100% pass rate on 3,830 PDFs. The Clojure binding is an idiomatic, thin wrapper over the mature fyi.oxide:pdf-oxide Java binding that owns the single JNI native bridge. It adds zero native code: it calls the Java classes via interop and returns Clojure-friendly values (java.util.List → vector, java.util.Optional → value-or-nil).
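To make the "Clojure-friendly values" point concrete, here is a minimal sketch of the kind of conversion described above. The helper names list->vector and optional->value are hypothetical and are not part of pdf-oxide.core. This guide does not show the wrapper's real internals, so treat this as an illustration of the idea using only standard JDK classes.

```clojure
(import '[java.util ArrayList List Optional])

(defn list->vector
  "Copies a java.util.List into an immutable Clojure vector."
  [^List xs]
  (vec xs))

(defn optional->value
  "Returns the value held by a java.util.Optional, or nil when it is empty."
  [^Optional o]
  (.orElse o nil))

;; Stand-in Java values, as a Java binding might return them
(def java-list (doto (ArrayList.) (.add "a") (.add "b")))

(list->vector java-list)               ;; => ["a" "b"]
(optional->value (Optional/of "x"))    ;; => "x"
(optional->value (Optional/empty))     ;; => nil
```

Returning nil for an empty Optional lets callers use ordinary Clojure idioms such as or and when-let instead of calling Optional methods through interop.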

Installation

Add the Java binding to your deps.edn. The Clojure namespace pdf-oxide.core (file path pdf_oxide/core.clj) lives in your source tree and wraps it:

{:deps {fyi.oxide/pdf-oxide {:mvn/version "0.3.69"}}}

The handle types (Pdf, PdfDocument, DocumentEditor) are AutoCloseable, so use with-open for deterministic cleanup.
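As an illustration of what "deterministic cleanup" means here, the following sketch uses a stand-in AutoCloseable instead of a real PDF handle. The tracked-resource function and the events atom are hypothetical names invented for this example. It shows that with-open closes its bindings in reverse order, even when the body throws.

```clojure
(def events (atom []))

(defn tracked-resource
  "Returns a stand-in AutoCloseable that records when it is closed."
  [label]
  (reify java.lang.AutoCloseable
    (close [_] (swap! events conj [:closed label]))))

(try
  (with-open [a (tracked-resource :a)
              b (tracked-resource :b)]
    (swap! events conj [:using :a :b])
    ;; Simulate a failure in the middle of the body
    (throw (ex-info "simulated failure" {})))
  (catch clojure.lang.ExceptionInfo _
    (swap! events conj [:caught])))

@events
;; => [[:using :a :b] [:closed :b] [:closed :a] [:caught]]
```

Both resources are closed before the exception reaches the catch clause, which is the behavior you want for handles that hold native memory.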

Quick Start

Build a PDF from Markdown, open it back, and extract its text. Each step returns plain Clojure values.

(require '[pdf-oxide.core :as pdf])

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (println "pages:   " (pdf/page-count d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println (pdf/extract-text d 0)))

Opening a PDF

pdf/open accepts either a byte array or a filesystem path string, with an optional password for encrypted documents.

(require '[pdf-oxide.core :as pdf])

;; From a path
(with-open [d (pdf/open "research-paper.pdf")]
  (println "pages:" (pdf/page-count d)))

;; From bytes (e.g. downloaded from S3 or HTTP)
(with-open [d (pdf/open pdf-bytes)]
  (println (pdf/extract-text d 0)))

;; Encrypted document
(with-open [d (pdf/open "confidential.pdf" "secret")]
  (println (pdf/extract-text d 0)))

You can also authenticate after opening:

(with-open [d (pdf/open "confidential.pdf")]
  (when (pdf/authenticate d "secret")
    (println (pdf/extract-text d 0))))

Text Extraction

Extract plain text from any page by its zero-based index.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "report.pdf")]
  ;; A single page
  (println (pdf/extract-text d 0))

  ;; All pages
  (doseq [i (range (pdf/page-count d))]
    (println "--- Page" (inc i) "---")
    (println (pdf/extract-text d i))))

Page Elements

pdf/page returns a PdfPage. From it you can pull words, lines, characters, tables, images, and annotations — each as a Clojure vector. Word/line/char objects expose .text and .bbox via interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println "page width:" (.width pg))

    ;; Words with their bounding boxes
    (doseq [w (take 8 (pdf/words pg))]
      (println "  " (.text w) "@" (.bbox w)))

    ;; Other element vectors
    (println "lines:      " (count (pdf/lines pg)))
    (println "chars:      " (count (pdf/chars pg)))
    (println "tables:     " (count (pdf/tables pg)))
    (println "images:     " (count (pdf/images pg)))
    (println "annotations:" (count (pdf/annotations pg)))

    ;; Plain text for the whole page
    (println (pdf/page-text pg))))

To clip extraction to a region, pass a fyi.oxide.pdf.geometry.BBox:

(import '[fyi.oxide.pdf.geometry BBox])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0)))))

Markdown & HTML Conversion

Convert the whole document or a single page to Markdown or HTML.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  ;; Whole document
  (println (pdf/to-markdown d))
  (println (pdf/to-html d))

  ;; A single page (zero-based)
  (println (pdf/to-markdown d 0))
  (println (pdf/to-html d 0)))

For richer structure, pdf/extract-structured returns the structured element tree for a page:

(with-open [d (pdf/open "paper.pdf")]
  (println (pdf/extract-structured d 0)))

Search

pdf/search scans the whole document and returns a vector of match objects. Each match exposes .text via interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "manual.pdf")]
  (doseq [m (pdf/search d "configuration")]
    (println (.text m))))

Rendering

Render a page to a PNG byte array, optionally at a chosen DPI.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

(with-open [d (pdf/open "paper.pdf")]
  ;; Default DPI
  (io/copy (pdf/render d 0) (io/file "page-0.png"))

  ;; Explicit DPI
  (io/copy (pdf/render d 0 150) (io/file "page-0@150.png")))
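When choosing a DPI it helps to estimate the output size first. PDF page dimensions are measured in points, with 72 points per inch by default, so a rough pixel size is points × DPI / 72. The sketch below is plain arithmetic and does not call PDF Oxide. The helper name points->pixels is hypothetical, and the renderer's exact rounding is an assumption.

```clojure
(defn points->pixels
  "Estimates a pixel length for a length in PDF points at the given DPI,
   assuming the default of 72 points per inch."
  [points dpi]
  (Math/round (double (/ (* (double points) dpi) 72.0))))

;; A US Letter page is 612 x 792 points
(points->pixels 612 150) ;; => 1275
(points->pixels 792 150) ;; => 1650
```

Pixel count grows with the square of the DPI, so doubling the DPI roughly quadruples the amount of image data to encode and store.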

Creation

pdf/from-markdown and pdf/from-html are factory functions that build a Pdf from source text. pdf/save serializes a built Pdf to a byte array.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

;; From Markdown
(with-open [p (pdf/from-markdown "# Hello World\n\nThis is a PDF.")]
  (io/copy (pdf/save p) (io/file "output.pdf")))

;; From HTML
(with-open [p (pdf/from-html "<h1>Invoice</h1><p>Amount: $42</p>")]
  (io/copy (pdf/save p) (io/file "invoice.pdf")))

Editing & Redaction

pdf/editor opens a DocumentEditor (from a byte array or path) for structural edits. Scrub metadata, mark regions for redaction, and apply the redactions. Applying is destructive: the marked content is removed rather than just hidden. Then serialize the result with pdf/editor-save.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [ed (pdf/editor "form.pdf")]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))

Metadata & Lifecycle

pdf/producer and pdf/creator return the document metadata as a value, or nil when absent (java.util.Optional is unwrapped for you). Prefer with-open; pdf/close and pdf/open? are escape hatches for manual lifecycle management.

(require '[pdf-oxide.core :as pdf])

(let [d (pdf/open "paper.pdf")]
  (try
    (println "open?    " (pdf/open? d))
    (println "producer:" (or (pdf/producer d) "(none)"))
    (println "creator: " (or (pdf/creator d) "(none)"))
    ;; Close in finally so the handle is released even if a call above throws
    (finally
      (pdf/close d)))
  (println "open?    " (pdf/open? d)))

Next Steps

Java Getting Started – the Java binding this wrapper builds on
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with metadata and styling
Editing – modifying existing PDFs, annotations, and redaction