What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Clojure)

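PDF Oxide is the fastest PDF toolkit, with text extraction built in: 0.8ms mean, with a 100% pass rate across 3,830 PDFs. The Clojure binding is a thin, idiomatic wrapper around the mature fyi.oxide:pdf-oxide Java binding, which owns the only JNI native bridge. The wrapper adds no native code of its own; it calls the Java classes through interop and returns Clojure-friendly values (java.util.List → vector, java.util.Optional → value or nil).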
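The value conversions mentioned above can be illustrated with plain Java interop. This is only a sketch of the idea, not the wrapper's actual source: the helper names list->vec and optional->value are hypothetical, and the inputs are ordinary JDK objects rather than PDF Oxide types.

```clojure
(import '[java.util ArrayList List Optional])

;; Hypothetical helpers, for illustration only; the real wrapper may
;; implement these conversions differently.

(defn list->vec
  "Copies a java.util.List into a persistent Clojure vector."
  [^List xs]
  (vec xs))

(defn optional->value
  "Unwraps a java.util.Optional: the contained value, or nil when empty."
  [^Optional o]
  (.orElse o nil))

;; A Java list becomes a vector, so count, nth, and destructuring all work.
(list->vec (doto (ArrayList.) (.add "a") (.add "b")))

;; A present Optional yields its value; an empty one yields nil, which
;; composes with `or`, `when`, and `some->`.
(optional->value (Optional/of "pdf-oxide"))
(or (optional->value (Optional/empty)) "(none)")
```

Returning nil instead of an Optional is what lets callers write forms such as (or (pdf/producer d) "(none)") without any extra unwrapping.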

Installation

Add the Java binding to deps.edn. The Clojure namespace (pdf_oxide.core on disk, required as pdf-oxide.core) lives in your own source tree and wraps the Java binding.

{:deps {fyi.oxide/pdf-oxide {:mvn/version "0.3.69"}}}

The handle types (Pdf, PdfDocument, DocumentEditor) are AutoCloseable, so use with-open for reliable cleanup.
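To make the cleanup behavior concrete, here is a small sketch of how with-open treats any AutoCloseable. It uses a stand-in handle built with reify rather than a real PDF Oxide object; tracked-handle and close-log are hypothetical names introduced only for this illustration.

```clojure
;; Stand-in for a native handle: records its label when closed.
(defn tracked-handle [label log]
  (reify java.lang.AutoCloseable
    (close [_] (swap! log conj label))))

(def close-log (atom []))

;; Two handles in one with-open, as when a Pdf and a PdfDocument
;; are bound together.
(with-open [a (tracked-handle :pdf close-log)
            b (tracked-handle :doc close-log)]
  (swap! close-log conj :body))

;; The body runs first, then the handles close in reverse binding order.
@close-log ;; => [:body :doc :pdf]
```

Because with-open closes each binding in a finally block, the handles are released even when the body throws.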

Quick Start

Build a PDF from Markdown, reopen it, and extract its text. Each step returns a ready-to-use Clojure value.

(require '[pdf-oxide.core :as pdf])

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")
            d (pdf/open (pdf/save p))]
  (println "pages:   " (pdf/page-count d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println (pdf/extract-text d 0)))

Opening a PDF

pdf/open accepts either a byte array or a filesystem path string. For encrypted documents, you can optionally supply a password.

(require '[pdf-oxide.core :as pdf])

;; From a path
(with-open [d (pdf/open "research-paper.pdf")]
  (println "pages:" (pdf/page-count d)))

;; From bytes (e.g. downloaded from S3 or over HTTP)
(with-open [d (pdf/open pdf-bytes)]
  (println (pdf/extract-text d 0)))

;; Encrypted document
(with-open [d (pdf/open "confidential.pdf" "secret")]
  (println (pdf/extract-text d 0)))

You can also authenticate after opening.

(with-open [d (pdf/open "confidential.pdf")]
  (when (pdf/authenticate d "secret")
    (println (pdf/extract-text d 0))))

Text Extraction

Extract plain text from any page by its zero-based index.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "report.pdf")]
  ;; Single page
  (println (pdf/extract-text d 0))

  ;; All pages
  (doseq [i (range (pdf/page-count d))]
    (println "--- Page" (inc i) "---")
    (println (pdf/extract-text d i))))
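When the per-page text is needed as a single string rather than printed, the same loop can be written as a small pure function. The sketch below is deliberately independent of the library: join-pages is a hypothetical helper that takes the page count and a function from zero-based index to text, so it can be tried without a PDF at hand.

```clojure
(require '[clojure.string :as str])

(defn join-pages
  "Concatenates the text of pages 0..page-count-1, separated by a blank line.
   `page-text-fn` maps a zero-based page index to that page's text."
  [page-count page-text-fn]
  ;; str/join consumes the whole sequence, so every page is read before
  ;; this function returns (no lazy sequence escapes to the caller).
  (str/join "\n\n" (map page-text-fn (range page-count))))

;; Tried out with a stand-in text function:
(join-pages 3 #(str "page " %))

;; With the wrapper functions shown on this page, it might be called as:
;; (with-open [d (pdf/open "report.pdf")]
;;   (join-pages (pdf/page-count d) #(pdf/extract-text d %)))
```

Forcing the result inside with-open matters: returning an unrealized lazy sequence of page texts would defer extraction until after the document has been closed.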

Page Elements

pdf/page returns a PdfPage, from which you can pull words, lines, characters, tables, images, and annotations, each returned as a Clojure vector. Word, line, and character objects expose .text and .bbox through interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println "page width:" (.width pg))

    ;; Words with their bounding boxes
    (doseq [w (take 8 (pdf/words pg))]
      (println "  " (.text w) "@" (.bbox w)))

    ;; The other element vectors
    (println "lines:      " (count (pdf/lines pg)))
    (println "chars:      " (count (pdf/chars pg)))
    (println "tables:     " (count (pdf/tables pg)))
    (println "images:     " (count (pdf/images pg)))
    (println "annotations:" (count (pdf/annotations pg)))

    ;; Plain text for the whole page, or for a cropped region (BBox)
    (println (pdf/page-text pg))))

To restrict extraction to a region, pass a fyi.oxide.pdf.geometry.BBox.

(import '[fyi.oxide.pdf.geometry BBox])

(with-open [d (pdf/open "paper.pdf")]
  (let [pg (pdf/page d 0)]
    (println (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0)))))

Converting to Markdown and HTML

Convert an entire document, or a single page, to Markdown or HTML.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "paper.pdf")]
  ;; Entire document
  (println (pdf/to-markdown d))
  (println (pdf/to-html d))

  ;; Single page (zero-based)
  (println (pdf/to-markdown d 0))
  (println (pdf/to-html d 0)))

When you need richer structure, pdf/extract-structured returns the page's structured element tree.

(with-open [d (pdf/open "paper.pdf")]
  (println (pdf/extract-structured d 0)))

Search

pdf/search scans the entire document and returns a vector of match objects. Each match exposes .text through interop.

(require '[pdf-oxide.core :as pdf])

(with-open [d (pdf/open "manual.pdf")]
  (doseq [m (pdf/search d "configuration")]
    (println (.text m))))

Rendering

Render a page to a PNG byte array. The DPI argument is optional.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

(with-open [d (pdf/open "paper.pdf")]
  ;; Default DPI
  (io/copy (pdf/render d 0) (io/file "page-0.png"))

  ;; Explicit DPI
  (io/copy (pdf/render d 0 150) (io/file "page-0@150.png")))

Creating PDFs

The Pdf type provides factory functions. pdf/save serializes the built Pdf to a byte array.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])

;; From Markdown
(with-open [p (pdf/from-markdown "# Hello World\n\nThis is a PDF.")]
  (io/copy (pdf/save p) (io/file "output.pdf")))

;; From HTML
(with-open [p (pdf/from-html "<h1>Invoice</h1><p>Amount: $42</p>")]
  (io/copy (pdf/save p) (io/file "invoice.pdf")))

Editing and Redaction

pdf/editor opens a DocumentEditor (from a byte array or a path) for structural edits. Strip the metadata, mark the regions to redact, apply the redactions destructively, and then serialize with pdf/editor-save.

(require '[pdf-oxide.core :as pdf]
         '[clojure.java.io :as io])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [ed (pdf/editor "form.pdf")]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (io/copy (pdf/editor-save ed) (io/file "redacted.pdf")))

Metadata and Lifecycle

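pdf/producer and pdf/creator return document metadata as plain values, or nil when the field is absent (java.util.Optional is unwrapped automatically). Prefer with-open by default; pdf/close and pdf/open? are provided as escape hatches for managing the lifecycle by hand.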

(require '[pdf-oxide.core :as pdf])

(let [d (pdf/open "paper.pdf")]
  (println "open?    " (pdf/open? d))
  (println "producer:" (or (pdf/producer d) "(none)"))
  (println "creator: " (or (pdf/creator d) "(none)"))
  (pdf/close d)
  (println "open?    " (pdf/open? d)))

Next Steps

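Getting Started with Java – the Java binding this wrapper is built on
Text Extraction – extraction options and recipes in depth
Creating PDFs – advanced creation with metadata and styling
Editing – modifying existing PDFs, annotations, and redaction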