What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Clojure API Reference

PDF Oxide provides idiomatic Clojure bindings as a thin wrapper over the fyi.oxide:pdf-oxide Java binding (which owns the pdf_oxide_jni crate, the only JNI native bridge). The wrapper adds no native code of its own: it calls the Java classes directly through interop and returns Clojure-friendly values (a java.util.List becomes a vector, a java.util.Optional becomes a value or nil). The handle types (Pdf, PdfDocument, DocumentEditor, AutoExtractor) are AutoCloseable, so use with-open for reliable cleanup.

;; deps.edn
{:deps {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}}

;; Leiningen
[fyi.oxide/pdf-oxide-clojure "0.3.69"]

The JNI native library (libpdf_oxide_jni) is not bundled. Either make it loadable through System.loadLibrary("pdf_oxide_jni") on java.library.path, or point the Java NativeLoader at its location with -Dfyi.oxide.pdf.lib.path=<path>.
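
With the Clojure CLI, one way to pass that system property is a deps.edn alias. This is a sketch under assumptions: the alias name :pdf-native is hypothetical, and <path> is the same placeholder as above, to be replaced with the location of libpdf_oxide_jni on your machine.

```clojure
;; deps.edn (sketch; :pdf-native is a hypothetical alias name)
{:deps    {fyi.oxide/pdf-oxide-clojure {:mvn/version "0.3.69"}}
 :aliases {:pdf-native
           ;; <path> is a placeholder for where the native library lives
           {:jvm-opts ["-Dfyi.oxide.pdf.lib.path=<path>"]}}}
```

Starting a REPL with clj -A:pdf-native would then launch the JVM with the property set.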

All functions live in the pdf-oxide.core namespace.

(require '[pdf-oxide.core :as pdf])

For other languages, see the Java API Reference, Python API Reference, Rust API Reference, and Types and Enums.

Pdf: Creation

Functions that build a new in-memory Pdf from source content, plus serialization to a byte array. The returned Pdf is AutoCloseable.

Create

(from-markdown ^Pdf [^String markdown])

Creates a Pdf from a Markdown string.

(from-html ^Pdf [^String html])

Creates a Pdf from an HTML string.

Save

(save ^bytes [^Pdf pdf])

Serializes a built Pdf to a byte array (the raw PDF bytes).

(with-open [p (pdf/from-markdown "# Hello\n\nbody\n")]
  (pdf/save p))                 ; => byte[]

PdfDocument: Open, Extract, Render

The primary read handle for an existing PDF. Open it from a byte array or a filesystem path, then extract text, convert to Markdown/HTML, render pages, search, and read metadata and form fields. AutoCloseable.

Open

(open ^PdfDocument [source])
(open ^PdfDocument [source ^String password])

Opens a document from a byte array or a filesystem path string. The two-argument form supplies a password for an encrypted PDF.

(authenticate [^PdfDocument doc ^String password])

Authenticates an encrypted document after it has been opened. Returns a boolean.

Document queries

(page-count [^PdfDocument doc])

Returns the number of pages in the document.

(producer [^PdfDocument doc])

Returns the /Producer metadata string, or nil if it is absent.

(creator [^PdfDocument doc])

Returns the /Creator metadata string, or nil if it is absent.

Text extraction

(extract-text [^PdfDocument doc page])

Extracts plain text from a single zero-based page.

(extract-structured [^PdfDocument doc page])

Extracts structured text (spans/blocks with position information) from a single page.
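
Because pages are zero-based, extracting a whole document page by page amounts to mapping over (range (pdf/page-count doc)). The sketch below is illustrative rather than part of the library: map-pages is a hypothetical helper, and a stand-in function replaces pdf/extract-text so the block runs without the native library. It uses mapv so that every page is read before with-open closes the document.

```clojure
(defn map-pages
  "Eagerly applies f to every zero-based page index below n and returns a vector.
  Hypothetical helper, not part of pdf-oxide.core."
  [n f]
  (mapv f (range n)))

;; Intended use with the documented API (requires the native library):
;; (with-open [d (pdf/open "report.pdf")]        ; "report.pdf" is a placeholder
;;   (map-pages (pdf/page-count d) #(pdf/extract-text d %)))

;; Stand-in for pdf/extract-text so the sketch is runnable on its own.
(defn fake-extract-text [page]
  (str "text of page " page))

(map-pages 3 fake-extract-text)
;; => ["text of page 0" "text of page 1" "text of page 2"]
```

A lazy map would also work at the REPL, but its elements might only be realized after the document has been closed, which is why the eager mapv is used here.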

Conversion

(to-markdown [^PdfDocument doc])
(to-markdown [^PdfDocument doc page])

Converts the whole document, or a single page, to Markdown.

(to-html [^PdfDocument doc])
(to-html [^PdfDocument doc page])

Converts the whole document, or a single page, to HTML.

Rendering

(render ^bytes [^PdfDocument doc page])
(render ^bytes [^PdfDocument doc page dpi])

Renders a page to PNG image bytes. The DPI argument is optional.
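
Since render returns raw bytes, the result needs a byte-oriented writer; clojure.core/spit would write the array's string form instead of its contents. A minimal sketch using clojure.java.io/copy follows. write-bytes! is a hypothetical helper, and the eight-byte PNG signature stands in for real render output so the block runs without the native library.

```clojure
(require '[clojure.java.io :as io])

(defn write-bytes!
  "Writes a byte array to the file at path. Hypothetical helper."
  [path ^bytes data]
  (io/copy data (io/file path)))

;; Intended use with the documented API (requires the native library):
;; (write-bytes! "page0.png" (pdf/render d 0 150))

;; Stand-in data: the 8-byte PNG file signature.
(def png-signature
  (byte-array (map unchecked-byte [0x89 0x50 0x4E 0x47 0x0D 0x0A 0x1A 0x0A])))

(def tmp (java.io.File/createTempFile "page" ".png"))
(write-bytes! (.getPath tmp) png-signature)

(.length tmp)
;; => 8
```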

Search

(search [^PdfDocument doc ^String query])

Searches the document for text. Returns a vector of SearchMatch results.

Forms

(form-fields [^PdfDocument doc])

Returns a vector of the document's AcroForm form fields.

Page access

(page ^PdfPage [^PdfDocument doc idx])

Gets a PdfPage handle for a zero-based page index.

(pages [^PdfDocument doc])

Returns a vector of all PdfPage handles in the document.

PdfPage: Page Element Extraction

A page handle obtained from (pdf/page doc idx) or (pdf/pages doc). Each extraction function converts the Java List result into a Clojure vector.
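
The conversions the wrapper performs can be pictured with plain interop. The following is a sketch of the idea only, not the wrapper's actual source; list->vec and optional->nilable are hypothetical names.

```clojure
(import '[java.util ArrayList Optional])

(defn list->vec
  "Copies a java.util.List into a persistent Clojure vector."
  [^java.util.List xs]
  (vec xs))

(defn optional->nilable
  "Unwraps a java.util.Optional into its value, or nil when empty."
  [^Optional o]
  (.orElse o nil))

(list->vec (doto (ArrayList.) (.add "alpha") (.add "beta")))
;; => ["alpha" "beta"]

(optional->nilable (Optional/of "PDF Oxide"))
;; => "PDF Oxide"

(optional->nilable (Optional/empty))
;; => nil
```

Returning vectors and nil means the results work directly with the usual sequence functions and with or, as in (or (pdf/producer d) "(none)").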

Elements

(words [^PdfPage page])

Returns a vector of the word elements on the page.

(lines [^PdfPage page])

Returns a vector of the line elements on the page.

(chars [^PdfPage page])

Returns a vector of the per-character (glyph) elements on the page. (pdf/chars intentionally shadows clojure.core/chars.)

(tables [^PdfPage page])

Returns a vector of the tables detected on the page.

(images [^PdfPage page])

Returns a vector of the image elements on the page.

(annotations [^PdfPage page])

Returns a vector of the annotations on the page.

Page text

(page-text [^PdfPage page])
(page-text [^PdfPage page region])

Returns the page's plain text, optionally restricted to a BBox region.

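;; Assumes pdf-bytes is a byte array holding a PDF, and that BBox is imported:
;; (import '[fyi.oxide.pdf.geometry BBox])
(with-open [d (pdf/open pdf-bytes)]
  (let [pg (pdf/page d 0)]
    {:words  (mapv #(.text %) (pdf/words pg))                       ; word strings (eager)
     :region (pdf/page-text pg (BBox. 0.0 0.0 1000.0 1000.0))}))    ; region text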

DocumentEditor: Editing and Redaction

A mutable editing handle, opened independently of PdfDocument. It supports metadata scrubbing and destructive redaction, and serializes the result to bytes. AutoCloseable.

(editor ^DocumentEditor [source])

Opens a DocumentEditor from a byte array or a filesystem path string.

(scrub-metadata [^DocumentEditor ed])

Strips the document's metadata (info dictionary / XMP) in place.

(add-redaction [^DocumentEditor ed page region])

Marks a rectangular BBox region on a zero-based page for redaction.

(apply-redactions [^DocumentEditor ed])

Destructively applies all pending redactions, removing the targeted content.

(editor-save ^bytes [^DocumentEditor ed])

Serializes the edited document to a byte array.

(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  (pdf/editor-save ed))

AutoExtractor: Automatic Extraction

A convenience extractor that automatically chooses an extraction strategy for a PdfDocument.

(auto-extractor ^AutoExtractor [^PdfDocument doc])

Creates an AutoExtractor for the given document.

(auto-text [^AutoExtractor ax])

Extracts text from the whole document using the automatically selected strategy.

;; AutoExtractor is also AutoCloseable, so bind it in with-open as well.
(with-open [d  (pdf/open pdf-bytes)
            ax (pdf/auto-extractor d)]
  (pdf/auto-text ax))

Lifecycle

The handle types are AutoCloseable. Prefer with-open for reliable cleanup. The functions below are an escape hatch for cases where with-open is not used.

(close [resource])

Closes any handle (Pdf, PdfDocument, PdfPage, DocumentEditor, AutoExtractor).

(open? [resource])

Returns whether the handle is still open.

(let [d (pdf/open pdf-bytes)]
  (pdf/open? d)        ; => true
  (pdf/close d)
  (pdf/open? d))       ; => false
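
The reason to prefer with-open is that it closes the handle even when the body throws. The sketch below shows this with a stand-in AutoCloseable built via reify (fake-handle is hypothetical), so it runs without the native library; real handles from pdf/open are expected to behave the same way, since they are AutoCloseable.

```clojure
(defn fake-handle
  "Stand-in for a pdf-oxide handle; records when it is closed."
  [log]
  (reify java.lang.AutoCloseable
    (close [_] (swap! log conj :closed))))

(def log (atom []))

(try
  (with-open [h (fake-handle log)]
    ;; Simulate a failure in the middle of processing.
    (throw (ex-info "extraction failed" {})))
  (catch clojure.lang.ExceptionInfo _
    (swap! log conj :caught)))

@log
;; => [:closed :caught]
```

With manual pdf/close, the same guarantee requires wrapping the work in try and calling close in a finally clause.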

Complete example

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

;; --- Creation + Extraction ---
(with-open [p (pdf/from-markdown "# Report\n\nGenerated by PDF Oxide.\n")
            d (pdf/open (pdf/save p))]
  (println "Pages:" (pdf/page-count d))
  (println (pdf/extract-text d 0))
  (println (pdf/to-markdown d))
  (println (pdf/to-html d 0))

  ;; Page elements (List -> vector)
  (let [pg (pdf/page d 0)]
    (println "Words:" (count (pdf/words pg)))
    (doseq [w (pdf/words pg)] (print (.text w) "")))

  ;; Search
  (doseq [m (pdf/search d "Report")]
    (println "Match:" (.text m)))

  ;; Metadata (Optional -> nil)
  (println "Producer:" (or (pdf/producer d) "(none)"))

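  ;; Render (render returns a byte array; spit would write its string form,
  ;; so copy the bytes to the file instead)
  (clojure.java.io/copy (pdf/render d 0 150) (clojure.java.io/file "page0.png")))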

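;; --- Editing + Redaction ---
;; pdf-bytes is assumed to be a byte array holding an existing PDF.
(with-open [ed (pdf/editor pdf-bytes)]
  (pdf/scrub-metadata ed)
  (pdf/add-redaction ed 0 (BBox. 10.0 10.0 50.0 20.0))
  (pdf/apply-redactions ed)
  ;; editor-save returns a byte array; spit would write its string form.
  (clojure.java.io/copy (pdf/editor-save ed) (clojure.java.io/file "redacted.pdf")))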

Other language bindings

PDF Oxide ships native bindings for every major ecosystem: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Objective-C, Elixir.

Next steps

Types and Enums: all shared types and enums
Page API Reference: consistent page-by-page iteration across bindings
Getting Started with Clojure: tutorial