What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

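Scoped extraction: pull content from a specific region

When you process invoices, bank statements, tax forms, or any other templated layout, you usually know where each field sits. Instead of extracting the whole page and searching for the value, give PDF Oxide the exact rectangle and get back only that part.

The fluent within(page, rect) API returns a scoped region on which you can chain the extraction methods extract_text(), extract_words(), extract_chars(), and extract_tables().

Binding availability. within(page, rect) is available in Python, Rust, and WASM. Go and C# provide equivalent low-level helpers (ExtractTextInRect, ExtractWordsInRect, ExtractImagesInRect); see the Go / C# section below for which of these each binding wraps. The full in-rect family (text, words, lines, tables, images) is provided end to end in Rust, the C ABI, and the Swift wrapper. See In-rect extraction variants for which binding has what.

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left of the page. A Letter-size page is 612 × 792 points.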
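Some bindings shown on this page (Java, Kotlin, Scala, Clojure) take corner-form boxes (x0, y0, x1, y1) rather than (x, y, width, height). The following is a minimal sketch of converting between the two forms; rect_to_corners and corners_to_rect are illustrative helper names, not part of PDF Oxide.

```python
def rect_to_corners(rect):
    """Convert (x, y, width, height) to corner form (x0, y0, x1, y1)."""
    x, y, w, h = rect
    return (x, y, x + w, y + h)


def corners_to_rect(box):
    """Convert corner form (x0, y0, x1, y1) back to (x, y, width, height)."""
    x0, y0, x1, y1 = box
    return (x0, y0, x1 - x0, y1 - y0)


header_rect = (0, 700, 612, 92)        # top 92 pt of a Letter page
print(rect_to_corners(header_rect))    # (0, 700, 612, 792)
```

The corner result matches the BBox(0, 700, 612, 792) used in the JVM examples for the same header band.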

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0 — typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, equivalent effect)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Java (page.text(region); BBox uses corner form (x0, y0, x1, y1))

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.geometry.BBox;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    // Top 92 points of page 0 → corners (0, 700) … (612, 792)
    String header = doc.page(0).text(new BBox(0, 700, 612, 792));
    System.out.println(header);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val header = doc.page(0).text(BBox(0.0, 700.0, 612.0, 792.0))
    println(header)
}

Scala

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val header = doc.page(0).text(BBox(0, 700, 612, 792))
  println(header)
}

Clojure

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [doc (pdf/open "invoice.pdf")]
  ;; Top 92 points of page 0 → corners (0 700) … (612 792)
  (println (pdf/page-text (pdf/page doc 0) (BBox. 0 700 612 792))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("invoice.pdf");
// extract_text_in_rect(page, x, y, w, h)
auto header = doc.extract_text_in_rect(0, 0, 700, 612, 92);
std::cout << header << "\n";

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
let header = try doc.extractTextInRect(0, x: 0, y: 700, w: 612, h: 92)
print(header)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
final header = doc.extractTextInRect(0, 0, 700, 612, 92);
print(header);
doc.close();

R

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
# pdf_extract_text_in_rect(doc, page, x, y, width, height)
header <- pdf_extract_text_in_rect(doc, 0, 0, 700, 612, 92)
cat(header)

Julia

using PdfOxide

doc = open_document("invoice.pdf")
header = extract_text_in_rect(doc, 0, 0, 700, 612, 92)
println(header)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const header = try doc.extractTextInRect(a, 0, 0, 700, 612, 92);  // free header
std.debug.print("{s}\n", .{header});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *header = [doc extractTextInRect:0 x:0 y:700 w:612 h:92 error:&err];
NSLog(@"%@", header);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
# extract_text_in_rect(doc, page, x, y, w, h)
{:ok, header} = PdfOxide.extract_text_in_rect(doc, 0, 0, 700, 612, 92)
IO.puts(header)

Chaining extraction on a region

With the fluent within() form in Python, Rust, and WASM, you can call any extraction method on the same scoped region without specifying the rectangle again:

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # bottom-right 200×200 box

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

C++ (no fluent chain; call each in-rect helper with the same rectangle)

// bottom-right 200×200 box: x=400, y=100, w=200, h=200
auto text  = doc.extract_text_in_rect(0, 400, 100, 200, 200);
auto words = doc.extract_words_in_rect(0, 400, 100, 200, 200);
auto lines = doc.extract_lines_in_rect(0, 400, 100, 200, 200);

Swift

let text  = try doc.extractTextInRect(0, x: 400, y: 100, w: 200, h: 200)
let words = try doc.extractWordsInRect(0, x: 400, y: 100, w: 200, h: 200)

Dart

final text  = doc.extractTextInRect(0, 400, 100, 200, 200);
final words = doc.extractWordsInRect(0, 400, 100, 200, 200);

R

text  <- pdf_extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words <- pdf_extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Julia

text  = extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words = extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Zig

const text  = try doc.extractTextInRect(a, 0, 400, 100, 200, 200);
const words = try doc.extractWordsInRect(a, 0, 400, 100, 200, 200);  // freeWords

Objective-C

NSString *text = [doc extractTextInRect:0 x:400 y:100 w:200 h:200 error:&err];
NSArray<POXWord*> *words = [doc extractWordsInRect:0 x:400 y:100 w:200 h:200 error:&err];

Elixir

{:ok, text}  = PdfOxide.extract_text_in_rect(doc, 0, 400, 100, 200, 200)
{:ok, words} = PdfOxide.extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Common use cases

Extracting invoice fields

Invoices usually place the vendor address, invoice number, and line-item table in fixed zones. Define the rectangles once per template:

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
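A template rectangle that is empty or runs off the page is easy to miss. As a sketch, the check below flags such fields before any extraction runs; rects_outside_page is a hypothetical helper, and the Letter-size defaults are an assumption (read the real page size from the MediaBox where possible).

```python
def rects_outside_page(template, page_width=612.0, page_height=792.0):
    """Return the field names whose (x, y, w, h) rect is empty or leaves the page."""
    bad = []
    for field, (x, y, w, h) in template.items():
        inside = x >= 0 and y >= 0 and x + w <= page_width and y + h <= page_height
        if w <= 0 or h <= 0 or not inside:
            bad.append(field)
    return bad


acme_v1 = {
    "invoice_no":  (450, 720, 120, 20),
    "issue_date":  (450, 700, 120, 20),
    "vendor_name": (50, 740, 300, 40),
    "total":       (450, 100, 120, 24),
}
print(rects_outside_page(acme_v1))   # []
```

Running this once when a template is defined keeps coordinate typos from surfacing later as empty fields.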

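Bank statement transaction rows

On most statements the "transactions" band occupies a narrow vertical range. Crop to that band and call extract_words() to read the rows word by word, in reading order, with bounding boxes: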

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header + footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")

Stripping headers and footers

To index body content only, crop away the top and bottom of each page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index `body` …
}

Detecting tables in a region

If a page contains a table and you know where it is, scope to the table rectangle so that extract_tables() concentrates on that area only:

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

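What rect-scoped extraction variants exist? {#what-rect-scoped-extraction-variants-exist}

Beyond extract_text(), extract_words(), and extract_chars(), two more rect-scoped variants return geometry-aware results from a single rectangle: lines in a rect and tables in a rect. Both filter a full-page extraction down to the regions whose bounding box intersects the given rectangle. The returned coordinates and reading order are the same as in the full-page call; the result is simply cropped.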
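The intersection filter described above can be sketched in a few lines. This illustrates the idea and is not PDF Oxide's implementation: the record shape (a dict with a "bbox" in (x, y, width, height) form) is an assumption, and so is the choice to treat boxes that only touch at an edge as not intersecting.

```python
def intersects(bbox, rect):
    """True when two (x, y, width, height) boxes overlap with non-zero area."""
    ax, ay, aw, ah = bbox
    bx, by, bw, bh = rect
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def filter_by_rect(records, rect):
    """Keep records whose bbox intersects rect, preserving the input order."""
    return [r for r in records if intersects(r["bbox"], rect)]


lines = [
    {"text": "Page header", "bbox": (36, 740, 300, 12)},
    {"text": "Txn 1",       "bbox": (36, 600, 500, 12)},
    {"text": "Footer",      "bbox": (36, 30, 200, 10)},
]
band = (36, 72, 540, 628)
print([r["text"] for r in filter_by_rect(lines, band)])   # ['Txn 1']
```

Because the filter only drops records, the surviving ones keep their original coordinates and order, which is the behavior the paragraph describes.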

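Extract text lines in a region (`extract_lines_in_rect`)

Returns line-level records (text, bounding box, and word count) that fall inside the rectangle. Use it when you need whole lines in reading order rather than individual words: address blocks, multi-line totals, a single line item, and so on.

The C ABI signature is the canonical specification: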

FfiTextLineList *pdf_document_extract_lines_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_lines_in_rect(page_index, region) -> Result<Vec<PathContent>> on PdfDocument.

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("statement.pdf")?;

// Transactions band: skip the header (top 92pt) and footer (bottom 72pt)
let region = Rect::new(36.0, 72.0, 540.0, 628.0);
let lines = doc.extract_lines_in_rect(0, region)?;
for line in &lines {
    println!("{:?}", line.bbox);
}

Python: the fluent region provides lines through extract_text_lines().

from pdf_oxide import PdfDocument

doc = PdfDocument("statement.pdf")

# Same band as the Rust example above
region = doc.within(0, (36, 72, 540, 628))
for line in region.extract_text_lines():
    print(line.text, line.bbox)

Swift: extractLinesInRect(_:x:y:w:h:) returns [TextLine].

import PdfOxide

let doc = try PdfDocument(path: "statement.pdf")
let lines = try doc.extractLinesInRect(0, x: 36, y: 72, w: 540, h: 628)
for line in lines {
    print(line.text, line.bbox, line.wordCount)
}

C++: extract_lines_in_rect(page, x, y, w, h) returns std::vector<TextLine>.

auto lines = doc.extract_lines_in_rect(0, 36, 72, 540, 628);
for (const auto& line : lines) {
    std::cout << line.text << "\n";
}

Dart: extractLinesInRect(page, x, y, w, h) returns List<TextLine>.

final lines = doc.extractLinesInRect(0, 36, 72, 540, 628);
for (final line in lines) {
    print('${line.text} ${line.bbox}');
}

R: pdf_extract_lines_in_rect(doc, page, x, y, width, height).

lines <- pdf_extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Julia: extract_lines_in_rect(doc, page, x, y, w, h).

lines = extract_lines_in_rect(doc, 0, 36, 72, 540, 628)
for line in lines
    println(line.text, " ", line.bbox)
end

Zig: extractLinesInRect(allocator, page, x, y, w, h).

const lines = try doc.extractLinesInRect(a, 0, 36, 72, 540, 628);  // freeTextLines

Objective-C: extractLinesInRect:x:y:w:h: returns NSArray<POXTextLine*>.

NSArray<POXTextLine*> *lines = [doc extractLinesInRect:0 x:36 y:72 w:540 h:628 error:&err];

Elixir: extract_lines_in_rect(doc, page, x, y, w, h).

{:ok, lines} = PdfOxide.extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

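Go / C#. The C entry point for extract_lines_in_rect exists, but the Go and C# wrappers do not expose it yet. In these languages, extract the lines for the whole page and filter by the returned bounding boxes, or use ExtractWordsInRect (Go) and group the words into lines.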
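The word-grouping workaround is short enough to port to any binding. Here is a sketch of the idea in Python, under stated assumptions: words arrive as (text, x0, y0) tuples (loosely modeled on the w.text, w.x0, w.y0 fields used in the bank statement example), and the 2-point baseline tolerance is an arbitrary value you would tune per document.

```python
def group_words_into_lines(words, y_tolerance=2.0):
    """Group (text, x0, y0) word tuples into lines, top of page first."""
    lines = []  # each entry: [baseline_y, [(x0, text), ...]]
    # PDF y grows upward, so sort by descending y to read top to bottom.
    for text, x0, y0 in sorted(words, key=lambda w: -w[2]):
        if lines and abs(lines[-1][0] - y0) <= y_tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append([y0, [(x0, text)]])
    # Within a line, order words left to right.
    return [" ".join(t for _, t in sorted(ws)) for _, ws in lines]


words = [("Coffee", 36, 600.0), ("4.50", 500, 600.4),
         ("Rent", 36, 586.0), ("1200.00", 480, 585.8)]
print(group_words_into_lines(words))   # ['Coffee 4.50', 'Rent 1200.00']
```

This simple baseline clustering ignores rotated text and multi-column layouts; it is meant for single-column bands such as a transactions table.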

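Extract tables in a region (`extract_tables_in_rect`)

Scopes table detection to a single rectangle: only tables whose bounding box intersects that rectangle are returned. This is the geometry-aware variant corresponding to the fluent within(...).extract_tables() shown above.

C ABI signature: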

FfiTableList *pdf_document_extract_tables_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_tables_in_rect(page_index, region) -> Result<Vec<Table>> (the ..._with_config variant accepts a custom TableDetectionConfig).

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("invoice.pdf")?;
let region = Rect::new(50.0, 200.0, 500.0, 400.0);
let tables = doc.extract_tables_in_rect(0, region)?;
for table in &tables {
    println!("{} rows × {} cols", table.rows.len(), table.col_count);
}

Python: through the fluent region.

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

Swift: extractTablesInRect(_:x:y:w:h:) returns [Table].

let tables = try doc.extractTablesInRect(0, x: 50, y: 200, w: 500, h: 400)
for table in tables {
    print("\(table.rowCount) rows, header: \(table.hasHeader)")
}

C++: extract_tables_in_rect(page, x, y, w, h) returns std::vector<Table>.

auto tables = doc.extract_tables_in_rect(0, 50, 200, 500, 400);
for (const auto& table : tables) {
    std::cout << table.rows.size() << " rows\n";
}

Dart: extractTablesInRect(page, x, y, w, h) returns List<Table>.

final tables = doc.extractTablesInRect(0, 50, 200, 500, 400);
for (final table in tables) {
    print('${table.rows.length} rows');
}

R: pdf_extract_tables_in_rect(doc, page, x, y, width, height).

tables <- pdf_extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Julia: extract_tables_in_rect(doc, page, x, y, w, h).

tables = extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Zig: extractTablesInRect(allocator, page, x, y, w, h).

const tables = try doc.extractTablesInRect(a, 0, 50, 200, 500, 400);

Objective-C: extractTablesInRect:x:y:w:h: returns NSArray<POXTable*>.

NSArray<POXTable*> *tables = [doc extractTablesInRect:0 x:50 y:200 w:500 h:400 error:&err];

Elixir: extract_tables_in_rect(doc, page, x, y, w, h).

{:ok, tables} = PdfOxide.extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Go / C#. As with lines, the C entry point for extract_tables_in_rect exists but is not yet wrapped in Go or C#. Call the full-page ExtractTables(page) and keep only the tables whose bounding box falls inside the target rectangle.

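How do I auto-extract a page without choosing between text and OCR?

When you do not know whether a page is digital text, a scan, or a mix of the two, extract_page_auto does the routing for you. It runs AutoExtractor, which routes between text and OCR per region and falls back gracefully to the native text layer (no opaque OCR errors), and returns a PageExtraction as JSON: the page kind, the cleaned-up text in reading order, confidence, a typed reason, an ocr_used flag, and a regions[] array in which each region carries bbox, kind, text, confidence, source, and reason. bbox and reason are present even when a region's text is empty, so the reading order never breaks silently.

The options argument tolerates {}: pass empty or null options JSON to get the defaults, or supply an AutoExtractOptions object. The recognized fields (serialized in snake_case) are: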

Field	Type	Default	Meaning
`mode`	`"text_only"` | `"auto"` | `"force_ocr"`	`"auto"`	Text vs OCR routing strategy
`reconstruct_image_tables`	bool	`true`	Reconstruct image-only tables with the spatial detector over OCR spans
`emit_placeholders`	bool	`true`	Insert positioned Figure/Table placeholders into the text flow
`ocr_languages`	string[]	`[]`	OCR language hints (for example `["english","chinese"]`)
`min_text_confidence`	float | null	`null`	Confidence threshold for the auto decision
`table_confidence`	float | null	`null`	Threshold for image-table reconstruction
`force_ocr_pages`	int[]	`[]`	0-based page indices on which to force OCR
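Since the options travel as a JSON string, a typo in a field value may only show up at runtime. One possible approach is a small builder that validates before serializing. build_auto_options is a hypothetical helper, not a PDF Oxide function, and it covers only a subset of the fields in the table.

```python
import json

VALID_MODES = {"text_only", "auto", "force_ocr"}


def build_auto_options(mode="auto", ocr_languages=(), force_ocr_pages=(),
                       min_text_confidence=None):
    """Serialize a subset of the documented options to snake_case JSON."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if any(p < 0 for p in force_ocr_pages):
        raise ValueError("force_ocr_pages are 0-based and cannot be negative")
    return json.dumps({
        "mode": mode,
        "ocr_languages": list(ocr_languages),
        "force_ocr_pages": list(force_ocr_pages),
        "min_text_confidence": min_text_confidence,
    })


print(build_auto_options(mode="force_ocr", force_ocr_pages=[0, 2]))
```

The resulting string can be passed wherever a binding accepts options JSON.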

OCR feature gate. OCR actually runs only when the library is built with the ocr feature; otherwise extract_page_auto falls back to the native text layer and does not raise an error. The auto entry point is provided in Python, Go, C#, Swift, WASM, and the C ABI. In Rust it is the library-level AutoExtractor API rather than a one-line PdfDocument method; see below.

Python: extract_page_auto(page, options_json=None) -> str (JSON).

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("mixed-scan.pdf")

# Defaults (balanced preset)
page = json.loads(doc.extract_page_auto(0))
print(page["kind"], page["confidence"], page["ocr_used"])
for region in page["regions"]:
    print(region["kind"], region["bbox"], region["reason"])

# With options
opts = json.dumps({"mode": "auto", "reconstruct_image_tables": True,
                   "ocr_languages": ["english"]})
page = json.loads(doc.extract_page_auto(0, opts))
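Once the envelope is parsed, the per-region confidence and reason fields make it straightforward to flag regions for manual review. The sketch below works on a hand-written sample in the documented shape; the kind, source, and reason strings and the 0.8 threshold are made-up placeholders, not values PDF Oxide is known to emit.

```python
def low_confidence_regions(page, threshold=0.8):
    """Return (kind, bbox, reason) for regions below the confidence threshold."""
    return [
        (r["kind"], r["bbox"], r["reason"])
        for r in page.get("regions", [])
        if r["confidence"] < threshold
    ]


# Hand-written sample in the documented shape; every value here is invented.
sample_page = {
    "kind": "mixed", "confidence": 0.9, "ocr_used": True,
    "regions": [
        {"kind": "text", "bbox": [36, 600, 500, 12], "text": "Total due",
         "confidence": 0.99, "source": "native", "reason": "text_layer_ok"},
        {"kind": "text", "bbox": [36, 300, 500, 200], "text": "",
         "confidence": 0.41, "source": "ocr", "reason": "low_ocr_confidence"},
    ],
}
for kind, bbox, reason in low_confidence_regions(sample_page):
    print(kind, bbox, reason)
```

Because bbox and reason are present even for empty regions, the flagged entries can be located on the page without re-running extraction.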

Go: ExtractPageAuto(pageIndex, opts ...AutoOption) (string, error) (returns JSON; configured with functional options).

package main

import (
    "encoding/json"
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("mixed-scan.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    raw, err := doc.ExtractPageAuto(0)
    if err != nil { log.Fatal(err) }

    var page map[string]any
    if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
    fmt.Println(page["kind"], page["confidence"], page["ocr_used"])
}

C#: ExtractPageAuto(int pageIndex, string? optionsJson = null) -> string (JSON).

using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed-scan.pdf");

// Defaults
string json = doc.ExtractPageAuto(0);
using var page = JsonDocument.Parse(json);
Console.WriteLine(page.RootElement.GetProperty("kind"));

// With options
string opts = """{"mode":"auto","ocr_languages":["english"]}""";
string json2 = doc.ExtractPageAuto(0, opts);

Swift: extractPageAuto(_:optionsJson:) -> String (the default is "{}").

let json = try doc.extractPageAuto(0, optionsJson: "{}")

JavaScript (WASM): extractPageAuto(pageIndex, optionsJson?).

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const page = JSON.parse(doc.extractPageAuto(0));
console.log(page.kind, page.confidence, page.ocr_used);
doc.free();

Rust: the auto path is the AutoExtractor library API. Build an AutoExtractOptions (the presets fast(), balanced(), and high_fidelity(), or the fluent builder) and call extract_page to get a typed PageExtraction with no JSON round trip.

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::{AutoExtractor, AutoExtractOptions, ExtractMode};

let doc = PdfDocument::open("mixed-scan.pdf")?;

// Default (balanced) preset
let page = AutoExtractor::new().extract_page(&doc, 0)?;
println!("{:?} conf={} ocr={}", page.kind, page.confidence, page.ocr_used);

// Custom options via the builder
let opts = AutoExtractOptions::builder()
    .mode(ExtractMode::Auto)
    .reconstruct_image_tables(true)
    .ocr_languages(["english"])
    .build();
let page = AutoExtractor::with(opts).extract_page(&doc, 0)?;
for region in &page.regions {
    println!("{:?} {:?} {:?}", region.kind, region.bbox, region.reason);
}

C++: extract_page_auto(page, options_json = "") returns the JSON envelope.

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed-scan.pdf");
auto json = doc.extract_page_auto(0);                                    // defaults
auto json2 = doc.extract_page_auto(0, R"({"mode":"auto","ocr_languages":["english"]})");

Dart: extractPageAuto(page, [optionsJson]) returns the JSON envelope.

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed-scan.pdf');
final page = jsonDecode(doc.extractPageAuto(0));
print('${page["kind"]} ${page["confidence"]} ${page["ocr_used"]}');
doc.close();

R: pdf_extract_page_auto(doc, page, options_json = NULL) returns JSON.

library(jsonlite)

doc  <- pdf_open("mixed-scan.pdf")
page <- fromJSON(pdf_extract_page_auto(doc, 0))
cat(page$kind, page$confidence, page$ocr_used, "\n")

Julia: extract_page_auto(doc, page, options = "{}") returns JSON.

using PdfOxide, JSON

doc  = open_document("mixed-scan.pdf")
page = JSON.parse(extract_page_auto(doc, 0))
println(page["kind"], " ", page["confidence"], " ", page["ocr_used"])

Zig: extractPageAuto(allocator, page, options_json) returns the JSON bytes.

const json = try doc.extractPageAuto(a, 0, null);  // free json

Objective-C: extractPageAuto:optionsJson:error: returns the JSON envelope.

NSString *json = [doc extractPageAuto:0 optionsJson:@"{}" error:&err];

Elixir: extract_page_auto(doc, page, options_json \\ "") returns JSON.

{:ok, json} = PdfOxide.extract_page_auto(doc, 0)
page = Jason.decode!(json)
IO.inspect({page["kind"], page["confidence"], page["ocr_used"]})

Java: the auto path is the AutoExtractor API (extractPage returns a typed result; use extractTextForPage for plain text).

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf"))) {
    AutoExtractor ax = AutoExtractor.of(doc);             // or .fast/.balanced/.highFidelity
    String text = ax.extractTextForPage(0);               // graceful native/OCR routing
    System.out.println(text);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor

PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf")).use { doc ->
    val ax = AutoExtractor.of(doc)
    println(ax.extractTextForPage(0))
}

Scala

import fyi.oxide.pdf.{PdfDocument, AutoExtractor}
import scala.util.Using

Using.resource(PdfDocument.open("mixed-scan.pdf")) { doc =>
  val ax = AutoExtractor.of(doc)
  println(ax.extractTextForPage(0))
}

PHP: the rich JSON envelope is available from AutoExtractor::extractPageJson.

use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed-scan.pdf');
$ax  = AutoExtractor::balanced($doc);
$page = json_decode($ax->extractPageJson(0), true);
echo $page['kind'], ' ', $page['confidence'], ' ', $page['ocr_used'];

Ruby: auto_extractor.extract_page(page) returns a Hash with the parsed envelope merged in.

require 'pdf_oxide'

PdfOxide::PdfDocument.open('mixed-scan.pdf') do |doc|
  result = doc.auto_extractor.extract_page(0)
  cls = result[:classification]            # full PageExtraction JSON as a Hash
  puts [cls['kind'], cls['confidence'], cls['ocr_used']].join(' ')
end

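How do I get structured, typed regions as JSON?

For a structured view of the whole page (headings, body blocks, headers/footers, page numbers, column order), use the structured extraction entry point. It returns a StructuredPage: page_index, page_width, page_height, and a regions[] array in which each region carries kind (its semantic role), text, bbox, spans, and column_index (reading order across multiple columns). Region kinds include body blocks, structural headings (H1–H6), marginal labels, running headers/footers, page numbers, and artifacts.

Most bindings return this as a JSON string (the C ABI serializes once, and the binding deserializes into native types); Rust returns the typed StructuredPage directly.

C ABI signature: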

char *pdf_document_extract_structured_to_json(
    PdfDocument *handle,
    int32_t page_index,
    int32_t *error_code);

Python: extract_structured(page) -> str (JSON; deserialize with json.loads).

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
page = json.loads(doc.extract_structured(0))

print(page["page_width"], page["page_height"])
for region in page["regions"]:
    print(region["kind"], region["column_index"], region["text"][:60])
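For multi-column pages, column_index lets you reassemble each column separately. A minimal sketch on a hand-written sample follows; text_by_column is an illustrative helper, and the kind strings in the sample are placeholders rather than documented values.

```python
from collections import defaultdict


def text_by_column(page):
    """Map column_index -> list of region texts, keeping the regions[] order."""
    columns = defaultdict(list)
    for region in page.get("regions", []):
        columns[region.get("column_index")].append(region["text"])
    return dict(columns)


# Hand-written sample; the kind strings are placeholders, not documented values.
sample = {
    "page_index": 0, "page_width": 612, "page_height": 792,
    "regions": [
        {"kind": "heading", "text": "Results", "column_index": 0},
        {"kind": "body", "text": "Left column text", "column_index": 0},
        {"kind": "body", "text": "Right column text", "column_index": 1},
    ],
}
print(text_by_column(sample))
# {0: ['Results', 'Left column text'], 1: ['Right column text']}
```

Regions without a column_index are grouped under None, so nothing is dropped.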

Go: ExtractStructured(page) (string, error).

raw, err := doc.ExtractStructured(0)
if err != nil { log.Fatal(err) }

var page map[string]any
if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
for _, r := range page["regions"].([]any) {
    region := r.(map[string]any)
    fmt.Println(region["kind"], region["text"])
}

C#: ExtractStructured(int page) -> string.

using System.Text.Json;

string json = doc.ExtractStructured(0);
using var page = JsonDocument.Parse(json);
foreach (var region in page.RootElement.GetProperty("regions").EnumerateArray())
{
    Console.WriteLine(region.GetProperty("kind"));
}

Swift: extractStructuredJson(_:) -> String.

let json = try doc.extractStructuredJson(0)

JavaScript (WASM): extractStructured(pageIndex) (returns a JSON string with camelCase keys).

const page = JSON.parse(doc.extractStructured(0));
for (const region of page.regions) {
    console.log(region.kind, region.columnIndex);
}

Rust: extract_structured(page_index) -> Result<StructuredPage> returns typed regions directly (no JSON round trip). The extract_structured_with_column_mode variant lets you force ColumnMode::Two/Single on stubborn layouts.

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let page = doc.extract_structured(0)?;
for region in &page.regions {
    println!("{:?} col={:?}: {}", region.kind, region.column_index, region.text);
}

C++: extract_structured_json(page) returns a JSON string.

auto json = doc.extract_structured_json(0);

Dart: extractStructuredJson(page) returns a JSON string.

import 'dart:convert';

final page = jsonDecode(doc.extractStructuredJson(0));
for (final region in page['regions']) {
    print('${region["kind"]} ${region["column_index"]}');
}

R: pdf_extract_structured_json(doc, page) returns JSON.

library(jsonlite)

page <- fromJSON(pdf_extract_structured_json(doc, 0))
print(page$page_width)

Julia: extract_structured_json(doc, page) returns JSON.

using JSON
page = JSON.parse(extract_structured_json(doc, 0))
for region in page["regions"]
    println(region["kind"], " ", region["column_index"])
end

Zig: extractStructuredJson(allocator, page) returns the JSON bytes.

const json = try doc.extractStructuredJson(a, 0);  // free json

Objective-C: extractStructuredJson:error: returns a JSON string.

NSString *json = [doc extractStructuredJson:0 error:&err];

Elixir: extract_structured_json(doc, page) returns JSON.

{:ok, json} = PdfOxide.extract_structured_json(doc, 0)
page = Jason.decode!(json)

Java: extractStructured(page) returns a JSON string.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = doc.extractStructured(0);
JsonNode page = new ObjectMapper().readTree(json);
for (JsonNode region : page.get("regions")) {
    System.out.println(region.get("kind").asText());
}

Kotlin

val json = doc.extractStructured(0)   // JSON string; parse with your library of choice

Scala

val json = doc.extractStructured(0)   // JSON string

Clojure: (pdf/extract-structured doc page) returns a JSON string.

(require '[clojure.data.json :as json])

(with-open [doc (pdf/open "report.pdf")]
  (let [page (json/read-str (pdf/extract-structured doc 0))]
    (doseq [region (get page "regions")]
      (println (get region "kind") (get region "column_index")))))

Ruby: extract_structured(page) returns the parsed StructuredPage as a Hash.

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  page = doc.extract_structured(0)
  page['regions'].each { |r| puts "#{r['kind']} #{r['column_index']}" }
end

PHP: extractStructured($page) returns a deserialized associative array.

$doc = PdfOxide\PdfDocument::open('report.pdf');
$page = $doc->extractStructured(0);
foreach ($page['regions'] as $region) {
    echo $region['kind'], ' ', $region['column_index'], "\n";
}

Coordinate reference

PDF uses a bottom-left origin and measures in points (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To specify a 1-inch band at the top edge:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you are coming from an image coordinate system (top-left origin), flip y.
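The flip amounts to one subtraction. In the sketch below, top_left_to_pdf is an illustrative helper and assumes the input rectangle is already in points; pixel coordinates from a rendered image would first need scaling by 72 / dpi.

```python
def top_left_to_pdf(rect, page_height):
    """Convert a top-left-origin (x, y, w, h) rect to PDF bottom-left origin."""
    x, y_top, w, h = rect
    return (x, page_height - y_top - h, w, h)


# A 1-inch band at the top of a Letter page, measured from the top edge.
print(top_left_to_pdf((0, 0, 612, 72), 792))   # (0, 720, 612, 72)
```

Only y changes; x, width, and height carry over unchanged.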

To fetch the page's actual MediaBox before doing the arithmetic:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)
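A MediaBox need not start at (0, 0) or be Letter size, so deriving rectangles from it is safer than hard-coding 612 × 792. As a sketch, top_band (a hypothetical helper) takes the (llx, lly, urx, ury) tuple noted in the comment above and returns a band at the top of the page:

```python
def top_band(media_box, band_height):
    """Return (x, y, w, h) for a band of band_height points at the top of the page."""
    llx, lly, urx, ury = media_box
    return (llx, ury - band_height, urx - llx, band_height)


print(top_band((0, 0, 612, 792), 72))   # (0, 720, 612, 72)
```

For a Letter page this reproduces the 1-inch band computed by hand earlier in this section.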

Rust

let mb = editor.get_page_media_box(0)?;   // [f32; 4]

Java: page.mediaBox() returns a BBox (x0, y0, x1, y1).

import fyi.oxide.pdf.geometry.BBox;

BBox mb = doc.page(0).mediaBox();         // (x0, y0, x1, y1) in PDF user space
double w = mb.width(), h = mb.height();   // 612 × 792 for US Letter

Kotlin

val mb = doc.page(0).mediaBox()           // BBox(x0, y0, x1, y1)

Scala

val mb = doc.page(0).mediaBox             // BBox(x0, y0, x1, y1)

C++: through the editor, get_page_media_box(page).

auto editor = pdf_oxide::DocumentEditor::open("doc.pdf");
auto mb = editor.get_page_media_box(0);   // Bbox{x, y, width, height}

Swift

let editor = try DocumentEditor.open("doc.pdf")
let mb = try editor.getPageMediaBox(0)    // Bbox(x, y, width, height)

Dart

final editor = DocumentEditor.open('doc.pdf');
final mb = editor.getPageMediaBox(0);     // Bbox(x, y, width, height)

R

editor <- pdf_editor_open("doc.pdf")
mb <- pdf_editor_get_page_media_box(editor, 0)   # list(x=, y=, width=, height=)

Julia

editor = open_editor("doc.pdf")
mb = get_page_media_box(editor, 0)        # Bbox

Zig

var editor = try pdf_oxide.DocumentEditor.openEditor("doc.pdf");
const mb = try editor.getPageMediaBox(0);  // Bbox{ x, y, width, height }

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"doc.pdf" error:&err];
POXBbox mb = [editor pageMediaBox:0 error:&err];   // {x, y, width, height}

Elixir

{:ok, editor} = PdfOxide.open_editor("doc.pdf")
{:ok, mb} = PdfOxide.get_page_media_box(editor, 0)   # %Bbox{}

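Go / C#: in-rect helpers

Go and C# do not offer the fluent within() chain yet, but the underlying low-level methods are the same:

Method	Go	C#
Text in rect	`doc.ExtractTextInRect(page, x, y, w, h)`	`doc.ExtractTextInRect(page, x, y, w, h)`
Words in rect	`doc.ExtractWordsInRect(page, x, y, w, h)`	(not wrapped)
Images in rect	`doc.ExtractImagesInRect(page, x, y, w, h)`	(not wrapped)

For patterns in Go or C# that need several extraction types over the same rectangle, keep the rectangle in a variable and call the helpers one after another. A fluent surface is planned once the editor API stabilizes.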

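Frequently asked questions

What is the difference between extract_words() and extract_lines_in_rect()? extract_words() returns one record per word; extract_lines_in_rect() returns one record (text, bounding box, word count) per line whose bounding box intersects the rectangle. Use lines when you need whole lines in reading order without regrouping words: address blocks, line items, multi-line totals.

Does extract_page_auto always run OCR? No. It routes per region. In the default "auto" mode it escalates to OCR only when the native text layer is missing or suspect, and OCR actually runs only when the library is built with the ocr feature. Without that feature it falls back to the native text layer and does not raise an opaque OCR error.

Which bindings offer the lines-in-rect and tables-in-rect variants? Rust, the C ABI, and Swift provide extract_lines_in_rect / extract_tables_in_rect directly. Python reaches the same result through the fluent region (within(...).extract_text_lines() and within(...).extract_tables()). Go and C# do not wrap the in-rect line and table entry points yet; extract the whole page and filter by the returned bounding boxes.

How fast is scoped extraction? Scoping adds no measurable overhead over full-page extraction: PDF Oxide extracts in 0.8 ms on average on its benchmark corpus (100% pass rate), and an in-rect call only filters that result by bounding box.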
