What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

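Region-Scoped Extraction: Pulling Content from a Specific Area of a Page

When you process templated layouts such as invoices, bank statements, and tax forms, you usually know in advance where each field sits. Instead of extracting the whole page and then searching for the value, you can pass PDF Oxide a rectangle directly and get back only what is written inside it.

The fluent within(page, rect) API returns a scoped region on which you can chain the extraction methods extract_text(), extract_words(), extract_chars(), and extract_tables().

Binding support. within(page, rect) is available in Python, Rust, and WASM. Go provides equivalent low-level helpers (ExtractTextInRect, ExtractWordsInRect, ExtractImagesInRect); C# currently wraps only ExtractTextInRect. See the helper table below.

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left corner of the page. A Letter-size page is 612 × 792 points.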

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0: a typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, same effect)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Chained extraction from a region

With the within() form in Python, Rust, and WASM, you can call any extraction method on the same region without specifying the rectangle again.

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # 200×200 box at the bottom right

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

Common use cases

Invoice field extraction

On invoices, the vendor address, invoice number, and line-item table usually sit in fixed zones. Define the rectangles once per template.

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
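A mistyped template rectangle is easy to miss, so it can help to check each one against the page bounds before running a batch. The following is a minimal sketch in plain Python that does not call PDF Oxide at all; `rect_fits_page` and `check_template` are hypothetical helper names, and the Letter page size is an assumed default that you would replace with the real page dimensions.

```python
def rect_fits_page(rect, page_width=612.0, page_height=792.0):
    """Return True if an (x, y, width, height) rect lies fully on the page."""
    x, y, w, h = rect
    return (
        w > 0 and h > 0
        and x >= 0 and y >= 0
        and x + w <= page_width
        and y + h <= page_height
    )


def check_template(template, page_width=612.0, page_height=792.0):
    """Return the names of fields whose rect falls outside the page."""
    return [
        field
        for field, rect in template.items()
        if not rect_fits_page(rect, page_width, page_height)
    ]


# Hypothetical template with one deliberate mistake: "total" is too wide.
template = {
    "invoice_no":  (450, 720, 120, 20),
    "vendor_name": (50, 740, 300, 40),
    "total":       (450, 100, 200, 24),
}
print(check_template(template))  # ['total']
```

An empty list means every rectangle fits; anything else names the fields to revisit.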

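Bank statement transaction rows

Many statements have a narrow "transactions" band. Crop to that band and call extract_words() to get the words of each row in reading order, each with its bounding box.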

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # exclude header and footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")
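Word records usually need to be grouped back into rows before a transaction can be parsed. The sketch below shows one simple way to do that by bucketing words whose y0 values are close together. It relies only on the text, x0, and y0 attributes used above; the `Word` dataclass is a stand-in for the real records, `group_into_rows` is a hypothetical helper, the 3-point tolerance is an arbitrary starting value, and y0 is assumed to grow upward (bottom-left origin).

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for an extracted word; only text, x0, y0 are assumed."""
    text: str
    x0: float
    y0: float


def group_into_rows(words, y_tolerance=3.0):
    """Group words into text rows, top of page first, left to right."""
    rows = []
    # Larger y0 is higher on the page when the origin is bottom-left.
    for w in sorted(words, key=lambda w: -w.y0):
        if rows and abs(rows[-1]["y"] - w.y0) <= y_tolerance:
            rows[-1]["words"].append(w)
        else:
            rows.append({"y": w.y0, "words": [w]})
    return [
        " ".join(w.text for w in sorted(r["words"], key=lambda w: w.x0))
        for r in rows
    ]


# Made-up sample data standing in for txn_region.extract_words().
words = [
    Word("Coffee", 120, 500.2), Word("2025-04-01", 40, 500.0),
    Word("-4.50", 500, 499.8), Word("2025-04-02", 40, 486.0),
    Word("Salary", 120, 486.1), Word("+2000.00", 500, 485.9),
]
for line in group_into_rows(words):
    print(line)
```

Each row is anchored to the first word placed in it, so tightly spaced lines may need a smaller tolerance.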

Header and footer removal

If you want to index only the body text, trim the top and bottom of every page.

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index body …
}

Scoping to a table region

If a page contains a table whose position you know, scope to the table's rectangle so that extract_tables() works on that area only.

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])
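Once the rows are in hand, a common next step is to write them out as CSV. The sketch below assumes only the shape shown above (a table with "rows", each row with "cells", each cell with "text") and uses the standard library csv module; `table_to_csv` is a hypothetical helper and the sample table is made-up data, not library output.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table dict (rows -> cells -> text) to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table["rows"]:
        writer.writerow([cell["text"].strip() for cell in row["cells"]])
    return buf.getvalue()


# Made-up sample in the same shape as the tables iterated above.
sample = {
    "rows": [
        {"cells": [{"text": "Item"}, {"text": "Qty"}, {"text": "Price"}]},
        {"cells": [{"text": "Widget, large"}, {"text": "2"}, {"text": "19.98"}]},
    ]
}
print(table_to_csv(sample))
```

The csv writer quotes cells that contain commas, so values such as "Widget, large" stay in a single column.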

Coordinate reference

PDF uses a bottom-left origin, and the unit is the point (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To select a 1-inch band at the top, write:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you are coming from an image coordinate system (top-left origin), flip y.

To get the page's actual MediaBox up front:
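The y flip can be sketched as a small pure-Python helper. `top_left_to_pdf_rect` is a hypothetical name, not part of PDF Oxide, and the sketch assumes that the input box is already in points and that the page's lower-left corner is at (0, 0); pixel coordinates would need scaling first.

```python
def top_left_to_pdf_rect(left, top, width, height, page_height=792.0):
    """Convert a top-left-origin box to a PDF (x, y, width, height) rect.

    `top` is the distance from the top edge of the page down to the top
    edge of the box. The returned y is the bottom edge of the box,
    measured upward from the bottom of the page.
    """
    y = page_height - (top + height)
    return (left, y, width, height)


# The 1-inch band at the very top of a Letter page.
print(top_left_to_pdf_rect(0, 0, 612, 72))  # (0, 720.0, 612, 72)
```

Width and height are unchanged by the conversion; only the vertical anchor moves from the top edge to the bottom edge.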

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)

Rust

// `editor` is assumed to be an already-open editor handle; its setup is not shown here.
let mb = editor.get_page_media_box(0)?;   // [f32; 4]
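A media box is given as corner coordinates, while within() takes (x, y, width, height), so a small conversion is needed before using one to build the other. The sketch below assumes the (llx, lly, urx, ury) order noted in the Python snippet above; `media_box_to_rect` and `body_rect` are hypothetical helpers, and the margin values are illustrative only.

```python
def media_box_to_rect(media_box):
    """Convert (llx, lly, urx, ury) to an (x, y, width, height) rect."""
    llx, lly, urx, ury = media_box
    return (llx, lly, urx - llx, ury - lly)


def body_rect(media_box, top_margin, bottom_margin):
    """Full-width rect with the top and bottom margins trimmed off."""
    llx, lly, urx, ury = media_box
    height = (ury - lly) - top_margin - bottom_margin
    if height <= 0:
        raise ValueError("margins leave no body area")
    return (llx, lly + bottom_margin, urx - llx, height)


letter = (0, 0, 612, 792)
print(media_box_to_rect(letter))                            # (0, 0, 612, 792)
print(body_rect(letter, top_margin=92, bottom_margin=100))  # (0, 100, 612, 600)
```

Computing the body rectangle from the media box of each page avoids hard-coding Letter dimensions when a document mixes page sizes.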

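Go / C#: rectangle helpers

Go and C# do not yet expose the fluent within() chain, but the underlying low-level methods are shared.

Method	Go	C#
Text in rect	`doc.ExtractTextInRect(page, x, y, w, h)`	`doc.ExtractTextInRect(page, x, y, w, h)`
Words in rect	`doc.ExtractWordsInRect(page, x, y, w, h)`	(not wrapped)
Images in rect	`doc.ExtractImagesInRect(page, x, y, w, h)`	(not wrapped)

To run several extractions against the same rectangle in Go or C#, keep the rectangle in a variable and call the helpers one after another. The fluent API will be added once the editor API stabilizes.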
