What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Elixir)

PDF Oxide is the fastest way to read and write PDFs from Elixir: text extraction averages 0.8ms, with a 100% pass rate across 3,830 PDFs. It is a NIF built on the same Rust core, and it runs CPU-bound work on the dirty CPU scheduler (ERL_NIF_DIRTY_JOB_CPU_BOUND), so it does not block the BEAM schedulers.

Document and Pdf handles are NIF resources that are released by the GC. Fallible functions return {:ok, value} or {:error, code}, and page indices are 0-based.

Installation

Add pdf_oxide to the dependencies in mix.exs:

def deps do
  [
    {:pdf_oxide, "~> 0.3"}
  ]
end

Then fetch the dependencies and compile. The NIF is built against the native cdylib via elixir_make:

mix deps.get
mix compile

Quick Start

Generate a PDF from Markdown, serialize it to bytes, then open those bytes and extract the text.

{:ok, pdf}   = PdfOxide.from_markdown("# Hello pdf_oxide\n\nThis is an **Elixir** binding.\n")
{:ok, bytes} = PdfOxide.to_bytes(pdf)
{:ok, doc}   = PdfOxide.open_from_bytes(bytes)

{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("pages: #{pages}")

%{major: maj, minor: min} = PdfOxide.version(doc)
IO.puts("version: #{maj}.#{min}")

{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

Opening a PDF

You can open a document from a file path or directly from bytes in memory (handy when streaming from S3, HTTP, or a database):

# From a path
{:ok, doc} = PdfOxide.open("report.pdf")

# From bytes already in memory
{:ok, doc} = PdfOxide.open_from_bytes(pdf_bytes)

# Encrypted document
{:ok, doc} = PdfOxide.open_with_password("confidential.pdf", "secret")

# Inspect the document
{:ok, count} = PdfOxide.page_count(doc)
encrypted? = PdfOxide.encrypted?(doc)

When you are done, you can close the document explicitly (close/1 is idempotent) or leave it to be reclaimed by the GC:

:ok = PdfOxide.close(doc)
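
If you would rather not wait for the GC, one option is to wrap the open/close pair in a small helper so the handle is always closed, even when the work in between raises. This is a minimal sketch, not part of pdf_oxide: PdfHelpers and with_document/2 are hypothetical names, and the body relies only on the open/1 and close/1 calls shown above.

```elixir
defmodule PdfHelpers do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Opens `path`, passes the document handle to `fun`, and always closes it.
  # Returns whatever `fun` returns, or the {:error, code} tuple from open/1.
  def with_document(path, fun) when is_function(fun, 1) do
    with {:ok, doc} <- PdfOxide.open(path) do
      try do
        fun.(doc)
      after
        # close/1 is described above as idempotent, so this is safe even if
        # `fun` has already closed the document itself.
        PdfOxide.close(doc)
      end
    end
  end
end
```

A call such as PdfHelpers.with_document("report.pdf", &PdfOxide.page_count/1) would then return the page count tuple and release the handle before returning.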

Text Extraction

Extract plain text from a single page by its 0-based index, or pull the whole document at once:

{:ok, doc} = PdfOxide.open("book.pdf")

# Single page
{:ok, text} = PdfOxide.extract_text(doc, 0)

# Plain text, one page
{:ok, pt} = PdfOxide.to_plain_text(doc, 0)

# All pages, concatenated
{:ok, all} = PdfOxide.to_plain_text_all(doc)
IO.puts(all)
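
Because the CPU-bound work runs on dirty schedulers, extracting text from several files concurrently should not starve other BEAM processes. The sketch below shows one way to do that with Task.async_stream/3 from the standard library. BatchExtract and run/2 are hypothetical names, the 60-second timeout is an arbitrary choice, and each task opens its own document, because this page does not say whether a handle may be shared between processes.

```elixir
defmodule BatchExtract do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Extracts the full text of every file in `paths` concurrently.
  # Returns a map of path => {:ok, text} | {:error, code}.
  def run(paths, max_concurrency \\ System.schedulers_online()) do
    paths
    |> Task.async_stream(
      fn path -> {path, extract_file(path)} end,
      max_concurrency: max_concurrency,
      timeout: 60_000
    )
    # Each stream element is {:ok, value} when the task finishes in time.
    |> Map.new(fn {:ok, pair} -> pair end)
  end

  # Opens one file, extracts all pages, and closes the handle again.
  defp extract_file(path) do
    with {:ok, doc} <- PdfOxide.open(path) do
      result = PdfOxide.to_plain_text_all(doc)
      PdfOxide.close(doc)
      result
    end
  end
end
```

With the default options, a task that exceeds the timeout or crashes takes the calling process down with it, so a production version would probably want on_timeout: :kill_task and a supervised task.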

Converting to Markdown and HTML

Convert a single page or the whole document to Markdown or HTML:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, md}    = PdfOxide.to_markdown(doc, 0)
{:ok, mdall} = PdfOxide.to_markdown_all(doc)

{:ok, html}    = PdfOxide.to_html(doc, 0)
{:ok, htmlall} = PdfOxide.to_html_all(doc)
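
When each page needs to stay a separate chunk (for example, to keep page numbers alongside the text in a retrieval pipeline), the per-page call can be driven from page_count/1. The following is a rough sketch under that assumption. MarkdownPages and to_list/1 are hypothetical names, and the stepped range syntax requires Elixir 1.12 or later.

```elixir
defmodule MarkdownPages do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Converts every page to Markdown, one page at a time.
  # Returns {:ok, [{index, markdown}, ...]} in page order, or
  # {:error, {index, code}} for the first page that fails.
  def to_list(doc) do
    with {:ok, count} <- PdfOxide.page_count(doc) do
      # Page indices are 0-based; the //1 step keeps the range empty
      # when the document has no pages.
      0..(count - 1)//1
      |> Enum.reduce_while({:ok, []}, fn index, {:ok, acc} ->
        case PdfOxide.to_markdown(doc, index) do
          {:ok, md} -> {:cont, {:ok, [{index, md} | acc]}}
          {:error, code} -> {:halt, {:error, {index, code}}}
        end
      end)
      |> case do
        {:ok, pages} -> {:ok, Enum.reverse(pages)}
        error -> error
      end
    end
  end
end
```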

Words and Lines

extract_words/2 returns structured PdfOxide.Word structs, each with a bounding box and a bold flag. extract_text_lines/2 groups them into lines.

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, words} = PdfOxide.extract_words(doc, 0)

for w <- Enum.take(words, 10) do
  %PdfOxide.Bbox{x: x, y: y, width: width} = w.bbox
  IO.puts("#{w.text} at (#{x}, #{y}) w=#{width} bold=#{w.bold}")
end

{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

for line <- lines do
  IO.puts("#{line.word_count} words: #{line.text}")
end
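
The bold flag can be used for simple post-processing, such as collecting emphasized words as rough heading candidates. This is an illustrative sketch only: BoldWords and texts/1 are hypothetical names, and the function relies solely on the text and bold fields used in the example above.

```elixir
defmodule BoldWords do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Takes the list returned by PdfOxide.extract_words/2 and returns the text
  # of every word whose bold flag is true, in the order they were returned.
  def texts(words) do
    # Elements that do not match the pattern (bold: false) are skipped.
    for %{bold: true, text: text} <- words, do: text
  end
end
```

Given the words list from the example above, BoldWords.texts(words) returns a plain list of strings.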

Search

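Search within a single page or across the whole document. The last argument is case_sensitive (the fourth argument of search/4, the third of search_all/3). Each result contains text, page, and a PdfOxide.Bbox.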

{:ok, doc} = PdfOxide.open("manual.pdf")

# One page (page index 0), case-insensitive
{:ok, results} = PdfOxide.search(doc, 0, "configuration", false)

for r <- results do
  %PdfOxide.Bbox{x: x, y: y} = r.bbox
  IO.puts("page #{r.page}: '#{r.text}' at (#{x}, #{y})")
end

# All pages
{:ok, all} = PdfOxide.search_all(doc, "configuration", false)
IO.puts("#{length(all)} matches")
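
Since every result carries its page index, a document-wide search can be summarized per page with the standard library alone. The sketch below assumes only the page field described above. SearchSummary and hits_per_page/1 are hypothetical names, and Enum.frequencies_by/2 requires Elixir 1.10 or later.

```elixir
defmodule SearchSummary do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Takes the list returned by PdfOxide.search_all/3 and returns
  # [{page_index, match_count}, ...] sorted by page index.
  def hits_per_page(results) do
    results
    |> Enum.frequencies_by(& &1.page)
    |> Enum.sort()
  end
end
```

Calling SearchSummary.hits_per_page(all) on the results above gives one tuple per page that contains at least one match.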

Creating PDFs

The builder factory functions return a Pdf handle. Serialize it with to_bytes/1, or write it straight to disk with save/2:

{:ok, pdf} = PdfOxide.from_markdown("# Hello World\n\nThis is a PDF.")
:ok = PdfOxide.save(pdf, "output.pdf")

{:ok, pdf} = PdfOxide.from_html("<h1>Invoice</h1><p>Amount: $42</p>")
{:ok, bytes} = PdfOxide.to_bytes(pdf)

{:ok, pdf} = PdfOxide.from_text("Plain text content.")
:ok = PdfOxide.save(pdf, "notes.pdf")

Rendering Pages to Images

Using the rendering feature, you can rasterize a page into a PdfOxide.RenderedImage and save it as a PNG:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, img} = PdfOxide.render_page(doc, 0)
IO.puts("#{img.width}x#{img.height}, #{byte_size(img.data)} bytes")
:ok = PdfOxide.save(img, "page0.png")

# Zoom factor, or a fixed-size thumbnail
{:ok, zoomed} = PdfOxide.render_page_zoom(doc, 0, 2.0)
{:ok, thumb}  = PdfOxide.render_page_thumbnail(doc, 0, 128)

Error Handling

Fallible functions return tagged tuples. Pattern matching keeps the control flow clean:

case PdfOxide.open("/nonexistent/nope.pdf") do
  {:ok, doc} ->
    {:ok, text} = PdfOxide.extract_text(doc, 0)
    IO.puts(text)

  {:error, code} ->
    IO.puts("could not open PDF: #{inspect(code)}")
end
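
When several fallible calls are chained, Elixir's with expression avoids nested case blocks: the first {:error, code} tuple falls through unchanged. The following is a minimal sketch using only calls shown on this page. FirstPage, summary/1, and the 80-character preview length are hypothetical choices for illustration.

```elixir
defmodule FirstPage do
  # Hypothetical helper module; pdf_oxide does not ship this.

  # Returns {:ok, %{pages: count, preview: text}} for the file at `path`,
  # or the first {:error, code} produced by any step.
  def summary(path) do
    with {:ok, doc} <- PdfOxide.open(path),
         {:ok, count} <- PdfOxide.page_count(doc),
         {:ok, text} <- PdfOxide.extract_text(doc, 0) do
      # Keep only the first 80 characters of page 0 as a preview.
      {:ok, %{pages: count, preview: String.slice(text, 0, 80)}}
    end
  end
end
```

The document handle is not closed explicitly here and is left to the GC, as described in the section on opening PDFs.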

Next Steps

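Rust Quick Start: using PDF Oxide from Rust
Python Quick Start: using PDF Oxide from Python
Text Extraction: extraction options and recipes in detail
Creating PDFs: advanced generation, including metadata and encryption
Editing: editing existing PDFs, annotations, and form fields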