What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Swift)

PDF Oxide is the fastest PDF library with built-in text extraction: 0.8ms mean, with a 100% pass rate across 3,830 PDFs. The Swift binding, new in v0.3.69, wraps the Rust core through its C ABI. Classes own their handles (released in deinit), C buffers are copied into Swift String / [UInt8] values, and error codes are thrown as PdfOxideError.
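The ownership rules above follow a common pattern for wrapping a C ABI in Swift. The sketch below illustrates that pattern only; it is not the binding's source. Every name starting with `demo_` or `Demo` is a hypothetical stand-in, and the "C" functions are faked in Swift so the snippet runs without the native library.

```swift
import Foundation

// Hypothetical error type: operation name plus the raw error code.
struct DemoError: Error, Equatable {
    let operation: String
    let code: Int32
}

// --- Fake "C" side: opaque handle, out-parameter error code, malloc'd string ---
func demo_open(_ path: String, _ errorCode: inout Int32) -> UnsafeMutablePointer<Int32>? {
    guard !path.isEmpty else { errorCode = 1; return nil }
    errorCode = 0
    let handle = UnsafeMutablePointer<Int32>.allocate(capacity: 1)
    handle.initialize(to: 3)              // pretend the document has 3 pages
    return handle
}

func demo_free(_ handle: UnsafeMutablePointer<Int32>) {
    handle.deinitialize(count: 1)
    handle.deallocate()
}

func demo_title(_ handle: UnsafeMutablePointer<Int32>) -> UnsafeMutablePointer<CChar>? {
    return strdup("demo title")           // caller must free() this buffer
}

// --- Swift side: the class owns the handle for its whole lifetime ---
final class DemoDocument {
    private let handle: UnsafeMutablePointer<Int32>

    init(path: String) throws {
        var code: Int32 = 0
        guard let opened = demo_open(path, &code) else {
            // A failed call becomes a thrown error instead of a returned code.
            throw DemoError(operation: "open", code: code)
        }
        handle = opened
    }

    // Released exactly once, when the last reference goes away.
    deinit { demo_free(handle) }

    func pageCount() -> Int { Int(handle.pointee) }

    func title() throws -> String {
        guard let buffer = demo_title(handle) else {
            throw DemoError(operation: "title", code: 2)
        }
        defer { free(buffer) }            // runs after the copy below is made
        return String(cString: buffer)    // copies the bytes into a Swift String
    }
}
```

Because the returned String is a copy, it stays valid after the C buffer is freed, and callers never handle a raw pointer.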

Installation

The binding links against the default-feature cdylib. Build the native library first, then tell SwiftPM where to find the header and the library.

# 1. Build the native library (the feature set the binding ships with)
cargo build --release --lib --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. Run the tests and the example (Package.swift reads PDF_OXIDE_INCLUDE_DIR and PDF_OXIDE_LIB_DIR)
cd swift
export PDF_OXIDE_INCLUDE_DIR="$PWD/../include"
export PDF_OXIDE_LIB_DIR="$PWD/../target/release"
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift test
DYLD_LIBRARY_PATH="$PDF_OXIDE_LIB_DIR" swift run basic_extraction

Quick Start

Generate a PDF from Markdown, open it from the resulting bytes, and extract its text. The whole round trip runs without any external fixtures.

import PdfOxide

let pdf = try Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Swift** binding.\n")
let doc = try Document.openFromBytes(try pdf.toBytes())

print("pages:   \(try doc.pageCount())")
print("version: \(try doc.version())")
print(try doc.extractText(0))

To open a file on disk, use Document.open(_:).

import PdfOxide

let doc = try Document.open("research-paper.pdf")
print("Pages:   \(try doc.pageCount())")
print("Version: \(try doc.version())")        // e.g. 1.7

Text Extraction

extractText(_:) returns the text of a single page, identified by its 0-based page index. Loop over pageCount() to read the whole document.

import PdfOxide

let doc = try Document.open("book.pdf")
for i in 0..<(try doc.pageCount()) {
    print("--- Page \(i + 1) ---")
    print(try doc.extractText(i))
}

toPlainText(_:) returns flat text with no layout information, and the *All() methods extract every page in one call.

let doc = try Document.open("report.pdf")
let plain = try doc.toPlainText(0)            // one page, no layout
let everything = try doc.toPlainTextAll()     // all pages, concatenated

Words and Characters

extractWords(_:) returns [Word], with a bounding box and font information for each word. extractChars(_:) returns [Char], with position information for each character.

import PdfOxide

let doc = try Document.open("paper.pdf")

let words = try doc.extractWords(0)
for word in words.prefix(10) {
    print("'\(word.text)' at (\(word.bbox.x), \(word.bbox.y)) "
        + "font=\(word.fontName) size=\(word.fontSize) bold=\(word.bold)")
}

let chars = try doc.extractChars(0)
for ch in chars.prefix(10) {
    let scalar = Unicode.Scalar(ch.character).map(String.init) ?? "?"
    print("'\(scalar)' size=\(ch.fontSize) font=\(ch.fontName)")
}

Word fields: text (String), bbox (Bbox), fontName (String), fontSize (Double), bold (Bool). Char fields: character (a UInt32 code point), bbox, fontName, fontSize. Bbox exposes x, y, width, and height as Double.
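As a rough illustration of what these fields make possible, the sketch below picks out likely heading words and computes the rectangle covering a set of words. The `Bbox` and `Word` structs here are local stand-ins that mirror only the fields listed above, so the snippet compiles without the library. The helper names `headingCandidates` and `unionBox` are hypothetical, not part of PDF Oxide.

```swift
// Stand-ins mirroring the documented fields. With PdfOxide imported, remove
// these and pass the result of extractWords(_:) to the helpers instead.
struct Bbox { var x, y, width, height: Double }
struct Word {
    var text: String
    var bbox: Bbox
    var fontName: String
    var fontSize: Double
    var bold: Bool
}

/// Text of every word that is bold or at least `minSize` large, in input order.
func headingCandidates(_ words: [Word], minSize: Double) -> [String] {
    words.filter { $0.bold || $0.fontSize >= minSize }.map(\.text)
}

/// Smallest rectangle covering every word box, or nil for an empty list.
/// Assumes width and height are non-negative.
func unionBox(_ words: [Word]) -> Bbox? {
    guard let first = words.first else { return nil }
    var minX = first.bbox.x
    var minY = first.bbox.y
    var maxX = first.bbox.x + first.bbox.width
    var maxY = first.bbox.y + first.bbox.height
    for word in words.dropFirst() {
        minX = min(minX, word.bbox.x)
        minY = min(minY, word.bbox.y)
        maxX = max(maxX, word.bbox.x + word.bbox.width)
        maxY = max(maxY, word.bbox.y + word.bbox.height)
    }
    return Bbox(x: minX, y: minY, width: maxX - minX, height: maxY - minY)
}
```

A size threshold alone is a crude heading test; combining it with the bold flag, as here, is one simple heuristic among many.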

To get text line by line, use extractTextLines(_:), which returns [TextLine] (text, bbox, wordCount).

let lines = try doc.extractTextLines(0)
for line in lines {
    print("\(line.wordCount) words: \(line.text)")
}

Converting to Markdown and HTML

Convert a single page or the whole document to Markdown or HTML.

import PdfOxide

let doc = try Document.open("paper.pdf")

let md = try doc.toMarkdown(0)        // one page to Markdown
let mdAll = try doc.toMarkdownAll()   // whole document to Markdown
let html = try doc.toHtml(0)          // one page to HTML
let htmlAll = try doc.toHtmlAll()     // whole document to HTML

print(mdAll)
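The conversion methods return plain Swift strings, so saving the result is ordinary Foundation file I/O. The sketch below writes the converted text next to the source file with a new extension. `saveConverted` is a hypothetical helper, not a PDF Oxide API.

```swift
import Foundation

/// Writes `text` beside `sourcePath`, replacing the file extension
/// (for example paper.pdf becomes paper.md). Returns the path written.
@discardableResult
func saveConverted(_ text: String, sourcePath: String, newExtension: String) throws -> String {
    let target = URL(fileURLWithPath: sourcePath)
        .deletingPathExtension()
        .appendingPathExtension(newExtension)
    // Atomic write: the file appears only once it is fully written.
    try text.write(to: target, atomically: true, encoding: .utf8)
    return target.path
}

// Usage with the values from the example above:
//   try saveConverted(mdAll, sourcePath: "paper.pdf", newExtension: "md")
//   try saveConverted(htmlAll, sourcePath: "paper.pdf", newExtension: "html")
```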

Search

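search(_:_:_:) searches a single page and takes the page index as its first argument; searchAll(_:_:) searches the whole document. Both take a search term and a caseSensitive flag and return [SearchResult] (text, page, bbox).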

import PdfOxide

let doc = try Document.open("manual.pdf")

// Search one page (page 0, case-insensitive)
let hits = try doc.search(0, "configuration", false)
for hit in hits {
    print("page \(hit.page): '\(hit.text)' at (\(hit.bbox.x), \(hit.bbox.y))")
}

// Search the whole document
let allHits = try doc.searchAll("configuration", false)
print("\(allHits.count) total matches")
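Since every result carries its page, document-wide hits can be summarized per page. The sketch below is a generic counter that works on any array. `countByPage` is a hypothetical helper, and the usage comment assumes `page` is an integer type, which this guide does not state.

```swift
/// Counts items per page. `page` extracts the page index from each item.
func countByPage<Hit>(_ hits: [Hit], page: (Hit) -> Int) -> [Int: Int] {
    var counts: [Int: Int] = [:]
    for hit in hits {
        counts[page(hit), default: 0] += 1
    }
    return counts
}

// Usage with the search results from above (assumes `page` is an integer):
//   let perPage = countByPage(allHits) { Int($0.page) }
//   for (page, count) in perPage.sorted(by: { $0.key < $1.key }) {
//       print("page \(page): \(count) matches")
//   }
```

Taking the page through a closure keeps the helper independent of the concrete result type, so it can be tested without opening a PDF.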

Creating PDFs

The Pdf type provides factory methods that build a document from a source format. Save the result to disk with save(_:), or get the raw bytes with toBytes().

import PdfOxide

try Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.").save("output.pdf")
try Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>").save("invoice.pdf")
try Pdf.fromText("Plain text content.").save("notes.pdf")

let bytes = try Pdf.fromMarkdown("# In-memory\n\nbody\n").toBytes()
print("produced \(bytes.count) bytes")

Error Handling

Every call that can fail throws PdfOxideError, which carries the name of the failed operation and the underlying C-ABI error code.

import PdfOxide

do {
    let doc = try Document.open("document.pdf")
    print(try doc.extractText(0))
} catch let error as PdfOxideError {
    print("PDF error: \(error)")   // e.g. "PdfOxideError: open failed (error code 1)"
}
}
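When a single bad page should not abort a whole run, errors can be recorded per page instead of propagated. The sketch below takes the extraction step as a closure so it stays self-contained. `extractAllPages` is a hypothetical helper, and the usage comment assumes pageCount() returns Int.

```swift
/// Runs `extract` for pages 0..<count, keeping successful results and
/// recording failures by page index instead of stopping at the first error.
func extractAllPages(
    count: Int,
    extract: (Int) throws -> String
) -> (texts: [Int: String], failures: [Int: Error]) {
    var texts: [Int: String] = [:]
    var failures: [Int: Error] = [:]
    for page in 0..<max(count, 0) {       // a negative count yields an empty range
        do {
            texts[page] = try extract(page)
        } catch {
            failures[page] = error
        }
    }
    return (texts, failures)
}

// Usage with a Document (assumes pageCount() returns Int):
//   let result = extractAllPages(count: try doc.pageCount()) { try doc.extractText($0) }
//   for (page, error) in result.failures { print("page \(page) failed: \(error)") }
```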

Next Steps

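Getting Started with Rust – use PDF Oxide from Rust
Getting Started with Python – use PDF Oxide from Python
Text Extraction – extraction options and recipes in detail
Creating PDFs – advanced creation with metadata and encryption
Editing – edit existing PDFs, annotations, and form fields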