What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

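Getting Started with PDF Oxide (R)

PDF Oxide provides idiomatic R bindings for fast PDF text, Markdown, and HTML extraction. Text extraction averages 0.8ms with a 100% pass rate on 3,830 PDFs, and the binding is backed by the same Rust core as every other binding. The R package wraps the pdf_oxide C ABI through R's .Call interface. Document handles are R external pointers that the garbage collector frees, and page indices are 0-based to match the underlying engine.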

Installation

The R package links against the default-feature cdylib. Build the native library first, then install the package, pointing it at the headers and the cdylib.

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. install the R package
PDF_OXIDE_INCLUDE_DIR="$PWD/include" PDF_OXIDE_LIB_DIR="$PWD/target/release" \
  R CMD INSTALL r/

At run time, make sure the dynamic linker can find the cdylib.

LD_LIBRARY_PATH="$PWD/target/release" Rscript your_script.R

Opening a PDF

Open a file with pdf_open() and inspect its metadata. pdf_version() returns a named list with major and minor.

library(pdfoxide)

doc <- pdf_open("research-paper.pdf")

pdf_page_count(doc)               # number of pages
v <- pdf_version(doc)
cat("PDF version:", paste(v$major, v$minor, sep = "."), "\n")
pdf_is_encrypted(doc)             # logical

Text extraction

Use pdf_extract_text() to extract reading-order text from a single page, identified by its 0-based index.

library(pdfoxide)

doc <- pdf_open("report.pdf")
text <- pdf_extract_text(doc, 0)  # 0-based page index
cat(text)

Loop over every page using pdf_page_count().

doc <- pdf_open("book.pdf")
for (page in seq_len(pdf_page_count(doc)) - 1L) {   # 0-based indices
  cat("--- Page", page + 1L, "---\n")
  cat(pdf_extract_text(doc, page), "\n")
}
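
The loop above builds its indices with seq_len() rather than 0:(n - 1). A minimal base-R sketch of why that matters, using a hypothetical helper page_indices() that is not part of pdfoxide: should a page count ever be 0, 0:(n - 1) counts down to -1, while seq_len(n) - 1L yields an empty vector and the loop body never runs.

```r
# Hypothetical helper (not part of pdfoxide): 0-based page indices for a
# document with n pages.
page_indices <- function(n) {
  seq_len(n) - 1L
}

page_indices(3L)   # 0 1 2
page_indices(0L)   # integer(0), so a for loop over it runs zero times

# The tempting alternative counts backwards when n is 0:
n <- 0L
0:(n - 1L)         # 0 -1
```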

Markdown and HTML

You can convert a single page to Markdown or HTML, or convert the whole document at once.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

md  <- pdf_to_markdown(doc, 0)    # one page as Markdown
html <- pdf_to_html(doc, 0)       # one page as HTML

all_md   <- pdf_to_markdown_all(doc)    # whole document
all_text <- pdf_to_plain_text_all(doc)  # whole document, plain text

cat(all_md)

Words, characters, and lines

Element extraction returns a list of records, each with a positional bounding box. Each bbox is a named list with x, y, width, and height.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

# Positioned words — each has $text, $bbox, $font_name, $font_size, $bold
words <- pdf_extract_words(doc, 0)
for (w in head(words, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) font=%s bold=%s\n",
              w$text, w$bbox$x, w$bbox$y, w$font_name, w$bold))
}

# Reading-order lines — each has $text, $bbox, $word_count
lines <- pdf_extract_text_lines(doc, 0)
for (ln in head(lines, 5)) {
  cat(sprintf("[%d words] %s\n", ln$word_count, ln$text))
}

# Positioned characters — $character is the Unicode codepoint (integer)
chars <- pdf_extract_chars(doc, 0)
for (ch in head(chars, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) size=%.1f\n",
              intToUtf8(ch$character), ch$bbox$x, ch$bbox$y, ch$font_size))
}
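
Because the records are plain named lists, they can be flattened into a data frame for filtering or sorting. The following is a sketch only: words_to_df() is a hypothetical helper, not part of pdfoxide, and the mock records merely imitate the fields listed in the comments above ($text, $bbox, $font_name, $font_size, $bold) so that the example runs without a PDF.

```r
# Mock word records shaped like the ones described above; in real use this
# list would come from pdf_extract_words(doc, 0).
words <- list(
  list(text = "Hello", bbox = list(x = 72, y = 700, width = 30, height = 12),
       font_name = "Helvetica", font_size = 12, bold = FALSE),
  list(text = "World", bbox = list(x = 106, y = 700, width = 32, height = 12),
       font_name = "Helvetica-Bold", font_size = 12, bold = TRUE)
)

# Hypothetical helper (not part of pdfoxide): one data frame row per word.
words_to_df <- function(words) {
  data.frame(
    text      = vapply(words, function(w) w$text, character(1)),
    x         = vapply(words, function(w) as.numeric(w$bbox$x), numeric(1)),
    y         = vapply(words, function(w) as.numeric(w$bbox$y), numeric(1)),
    font_size = vapply(words, function(w) as.numeric(w$font_size), numeric(1)),
    bold      = vapply(words, function(w) isTRUE(w$bold), logical(1)),
    stringsAsFactors = FALSE
  )
}

df <- words_to_df(words)
df[df$bold, "text"]   # text of the bold words only
```

vapply() is used instead of sapply() so that a record with a missing or malformed field raises an error rather than silently changing the column type.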

Tables

pdf_extract_tables() returns the detected tables. Each table record holds row_count, col_count, has_header, and cells, a character matrix indexed from 1 as in tbl$cells[row, col].

library(pdfoxide)

doc <- pdf_open("statement.pdf")
tables <- pdf_extract_tables(doc, 0)

for (tbl in tables) {
  cat(sprintf("Table: %d rows x %d cols (header=%s)\n",
              tbl$row_count, tbl$col_count, tbl$has_header))
  for (r in seq_len(tbl$row_count)) {
    cat(paste(tbl$cells[r, ], collapse = " | "), "\n")
  }
}
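
Since cells is an ordinary character matrix, a table record can be turned into a data frame with base R. This is a minimal sketch under stated assumptions: table_to_df() is a hypothetical helper, not part of pdfoxide, it assumes that has_header = TRUE means the first row of cells holds the column names, and the mock record only imitates the fields described above.

```r
# Mock table record shaped like the ones described above; in real use it
# would be one element of pdf_extract_tables(doc, 0).
tbl <- list(
  row_count = 3L, col_count = 2L, has_header = TRUE,
  cells = matrix(c("Date", "Amount",
                   "2024-01-05", "42.00",
                   "2024-01-09", "17.50"),
                 nrow = 3, byrow = TRUE)
)

# Hypothetical helper (not part of pdfoxide): use the first row as column
# names when the table reports a header, otherwise generate V1, V2, ...
table_to_df <- function(tbl) {
  cells <- tbl$cells
  if (isTRUE(tbl$has_header) && nrow(cells) >= 1L) {
    header <- cells[1, ]
    body   <- cells[-1, , drop = FALSE]   # drop = FALSE keeps a matrix
  } else {
    header <- paste0("V", seq_len(ncol(cells)))
    body   <- cells
  }
  df <- as.data.frame(body, stringsAsFactors = FALSE)
  names(df) <- make.unique(header)        # guard against repeated headers
  df
}

df <- table_to_df(tbl)
df$Amount   # "42.00" "17.50", still character; use as.numeric() if needed
```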

Search

Search a single page with pdf_search(), or the whole document with pdf_search_all(). Both accept an optional case_sensitive flag (default FALSE) and return records with text, page, and bbox.

library(pdfoxide)

doc <- pdf_open("manual.pdf")

# Whole document
hits <- pdf_search_all(doc, "configuration")
for (h in hits) {
  cat(sprintf("Page %d: '%s' at (%.0f, %.0f)\n",
              h$page, h$text, h$bbox$x, h$bbox$y))
}

# Single page, case-sensitive
page_hits <- pdf_search(doc, 0, "Configuration", case_sensitive = TRUE)

Opening from bytes

Use pdf_open_from_bytes() to open a PDF held in memory. This is handy when loading from S3, HTTP, or a database. The function takes a raw vector.

library(pdfoxide)

bytes <- readBin("report.pdf", "raw", file.info("report.pdf")$size)
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

Password-protected PDFs

Open an encrypted document with pdf_open_with_password(), or call pdf_authenticate() after opening it (it returns TRUE on success and FALSE if the password is wrong).

library(pdfoxide)

doc <- pdf_open_with_password("confidential.pdf", "secret")
cat(pdf_extract_text(doc, 0))
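
The second route, opening first and authenticating afterwards, might look like the sketch below. It is illustrative only: open_pdf() is a hypothetical wrapper, not part of pdfoxide, and the argument order pdf_authenticate(doc, password) is an assumption that should be checked against the package documentation.

```r
# Attach pdfoxide when it is installed; the helper only needs it at call time.
if (requireNamespace("pdfoxide", quietly = TRUE)) library(pdfoxide)

# Hypothetical wrapper (not part of pdfoxide). The argument order
# pdf_authenticate(doc, password) is an assumption.
open_pdf <- function(path, password = NULL) {
  doc <- pdf_open(path)
  if (isTRUE(pdf_is_encrypted(doc))) {
    if (is.null(password)) {
      stop("PDF is encrypted and no password was supplied: ", path)
    }
    # pdf_authenticate() returns TRUE on success, FALSE on a wrong password.
    if (!isTRUE(pdf_authenticate(doc, password))) {
      stop("Wrong password for: ", path)
    }
  }
  doc
}

# Usage (needs a real file, so it is left commented out):
# doc <- open_pdf("confidential.pdf", password = "secret")
# cat(pdf_extract_text(doc, 0))
```

Turning the FALSE return value into an error keeps a wrong password from surfacing later as a confusing extraction failure.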

Creating PDFs

The builder functions create a pdfoxide_pdf from Markdown, HTML, or plain text. Save it to a path with pdf_save(), or serialize it to a raw vector with pdf_to_bytes() (which pdf_open_from_bytes() can reopen).

library(pdfoxide)

pdf <- pdf_from_markdown("# Hello World\n\nThis is a PDF.\n")
pdf_save(pdf, "output.pdf")

pdf_from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>") |>
  pdf_save("invoice.pdf")

pdf_from_text("Plain text document.\n\nSecond paragraph.") |>
  pdf_save("notes.pdf")

# Round-trip through bytes
bytes <- pdf_to_bytes(pdf_from_markdown("# In memory\n\nbody\n"))
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

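Next steps

Getting Started with Python – using PDF Oxide from Python
Getting Started with Rust – the underlying Rust crate
Text extraction – detailed extraction options and recipes
PDF creation – advanced creation with builders, encryption, and metadata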