What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Scoped extraction: pulling content from specific regions

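When you process invoices, bank statements, tax forms, or any other templated layout, the positions of the fields are usually known in advance. Rather than extracting the whole page and then searching for the target values, point PDF Oxide at the exact rectangle and get back only what you need.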

The fluent within(page, rect) API returns a scoped region on which you can chain the extraction methods: extract_text(), extract_words(), extract_chars(), and extract_tables().

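Binding coverage. within(page, rect) is available in Python, Rust, and WASM. Go and C# provide equivalent low-level helpers (ExtractTextInRect, ExtractWordsInRect, ExtractImagesInRect; C# currently wraps only the text helper); see below for details. The full in-rect family (text, words, lines, tables, images) is available in Rust, the C ABI, and the Swift wrapper; for per-binding support, see In-rect extraction variants.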

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left corner of the page. A Letter-size page is 612 × 792 points.
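Some of the bindings on this page take a rectangle as (x, y, width, height), while the JVM examples use a corner-form BBox (x0, y0, x1, y1). The following is a minimal sketch of converting between the two forms. The helper names rect_to_corners and corners_to_rect are hypothetical and are not part of PDF Oxide.

```python
from typing import Tuple

# (x, y, width, height) with a bottom-left origin, in PDF points
Rect = Tuple[float, float, float, float]
# (x0, y0, x1, y1): lower-left and upper-right corners
Corners = Tuple[float, float, float, float]


def rect_to_corners(rect: Rect) -> Corners:
    """Convert (x, y, width, height) to corner form (x0, y0, x1, y1)."""
    x, y, w, h = rect
    return (x, y, x + w, y + h)


def corners_to_rect(corners: Corners) -> Rect:
    """Convert corner form (x0, y0, x1, y1) back to (x, y, width, height)."""
    x0, y0, x1, y1 = corners
    return (x0, y0, x1 - x0, y1 - y0)


# The header band used in the examples on this page
header = (0, 700, 612, 92)
print(rect_to_corners(header))  # (0, 700, 612, 792)
```

Keeping one canonical form in your own code and converting at the call site avoids passing a width where a binding expects an x1 coordinate.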

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0 — typical header band
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, same result)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

Java (page.text(region); BBox is in corner form (x0, y0, x1, y1))

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.geometry.BBox;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("invoice.pdf"))) {
    // Top 92 points of page 0 → corners (0, 700) … (612, 792)
    String header = doc.page(0).text(new BBox(0, 700, 612, 792));
    System.out.println(header);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

PdfDocument.open(java.nio.file.Path.of("invoice.pdf")).use { doc ->
    val header = doc.page(0).text(BBox(0.0, 700.0, 612.0, 792.0))
    println(header)
}

Scala

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import scala.util.Using

Using.resource(PdfDocument.open("invoice.pdf")) { doc =>
  val header = doc.page(0).text(BBox(0, 700, 612, 792))
  println(header)
}

Clojure

(require '[pdf-oxide.core :as pdf])
(import '[fyi.oxide.pdf.geometry BBox])

(with-open [doc (pdf/open "invoice.pdf")]
  ;; Top 92 points of page 0 → corners (0 700) … (612 792)
  (println (pdf/page-text (pdf/page doc 0) (BBox. 0 700 612 792))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("invoice.pdf");
// extract_text_in_rect(page, x, y, w, h)
auto header = doc.extract_text_in_rect(0, 0, 700, 612, 92);
std::cout << header << "\n";

Swift

import PdfOxide

let doc = try Document.open("invoice.pdf")
let header = try doc.extractTextInRect(0, x: 0, y: 700, w: 612, h: 92)
print(header)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('invoice.pdf');
final header = doc.extractTextInRect(0, 0, 700, 612, 92);
print(header);
doc.close();

R

library(pdfoxide)

doc <- pdf_open("invoice.pdf")
# pdf_extract_text_in_rect(doc, page, x, y, width, height)
header <- pdf_extract_text_in_rect(doc, 0, 0, 700, 612, 92)
cat(header)

Julia

using PdfOxide

doc = open_document("invoice.pdf")
header = extract_text_in_rect(doc, 0, 0, 700, 612, 92)
println(header)

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("invoice.pdf");
const header = try doc.extractTextInRect(a, 0, 0, 700, 612, 92);  // free header
std.debug.print("{s}\n", .{header});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"invoice.pdf" error:&err];
NSString *header = [doc extractTextInRect:0 x:0 y:700 w:612 h:92 error:&err];
NSLog(@"%@", header);

Elixir

{:ok, doc} = PdfOxide.open("invoice.pdf")
# extract_text_in_rect(doc, page, x, y, w, h)
{:ok, header} = PdfOxide.extract_text_in_rect(doc, 0, 0, 700, 612, 92)
IO.puts(header)

Chaining extractions on a region

The fluent within() form in Python, Rust, and WASM lets you call any extraction method on the same scoped region without repeating the rectangle:

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # bottom-right 200×200 box

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

C++ (no fluent chain; call the in-rect helpers one by one on the same rectangle)

// bottom-right 200×200 box: x=400, y=100, w=200, h=200
auto text  = doc.extract_text_in_rect(0, 400, 100, 200, 200);
auto words = doc.extract_words_in_rect(0, 400, 100, 200, 200);
auto lines = doc.extract_lines_in_rect(0, 400, 100, 200, 200);

Swift

let text  = try doc.extractTextInRect(0, x: 400, y: 100, w: 200, h: 200)
let words = try doc.extractWordsInRect(0, x: 400, y: 100, w: 200, h: 200)

Dart

final text  = doc.extractTextInRect(0, 400, 100, 200, 200);
final words = doc.extractWordsInRect(0, 400, 100, 200, 200);

R

text  <- pdf_extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words <- pdf_extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Julia

text  = extract_text_in_rect(doc, 0, 400, 100, 200, 200)
words = extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Zig

const text  = try doc.extractTextInRect(a, 0, 400, 100, 200, 200);
const words = try doc.extractWordsInRect(a, 0, 400, 100, 200, 200);  // freeWords

Objective-C

NSString *text = [doc extractTextInRect:0 x:400 y:100 w:200 h:200 error:&err];
NSArray<POXWord*> *words = [doc extractWordsInRect:0 x:400 y:100 w:200 h:200 error:&err];

Elixir

{:ok, text}  = PdfOxide.extract_text_in_rect(doc, 0, 400, 100, 200, 200)
{:ok, words} = PdfOxide.extract_words_in_rect(doc, 0, 400, 100, 200, 200)

Common use cases

Invoice field extraction

Invoices usually place the vendor address, invoice number, and line-item table in fixed regions. Define the rectangles once per template:

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))

Bank statement line items

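Most statements have a fairly narrow "transactions" band. Crop to that band and call extract_words() to get the content of each row in reading order, along with its bounding boxes: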

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header + footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")

Removing headers and footers

If you only need to index the body content, crop off the top and bottom of each page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index `body` …
}

Table region detection

When you know that a page contains a table and where it is, scope to the table's rectangle so that extract_tables() processes only that region:

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

What rect-scoped extraction variants exist? {#what-rect-scoped-extraction-variants-exist}

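Besides extract_text(), extract_words(), and extract_chars(), two more rect-scoped variants return geometry-aware results from a single rectangle: lines in a rect and tables in a rect. Both filter the full-page extraction result down to the items whose bounding boxes intersect the given rectangle, so the returned coordinates and reading order match those of a full-page call, just cropped.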
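The bounding-box filter described above can be sketched in a few lines. This is only an illustration of the intersection test, not the library's implementation: the intersects and filter_by_rect helpers and the sample items are hypothetical, and whether PDF Oxide counts boxes that merely touch an edge is an assumption not settled here (this sketch excludes them).

```python
from typing import List, Tuple

# (x, y, width, height), bottom-left origin, PDF points
Rect = Tuple[float, float, float, float]


def intersects(a: Rect, b: Rect) -> bool:
    """True when the two rectangles overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def filter_by_rect(items: List[Tuple[str, Rect]], rect: Rect) -> List[Tuple[str, Rect]]:
    """Keep the (label, bbox) pairs whose bbox intersects rect, preserving order."""
    return [(label, bbox) for label, bbox in items if intersects(bbox, rect)]


# Hypothetical full-page results: one line in the body, one in the page header
full_page = [
    ("body line", (40, 80, 100, 12)),
    ("header line", (40, 750, 100, 12)),
]
body_band = (36, 72, 540, 628)
print(filter_by_rect(full_page, body_band))
```

Because the filter preserves input order, items that were in reading order on the full page stay in reading order after cropping.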

Extracting text lines in a region (`extract_lines_in_rect`)

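Returns the line-level records that fall inside the rectangle (each record carries the text, bounding box, and word count). Use it when you need whole lines in reading order rather than individual words, for example an address block, a multi-line total, or a single statement row.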

The C ABI signature is the authoritative definition:

FfiTextLineList *pdf_document_extract_lines_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_lines_in_rect(page_index, region) -> Result<Vec<PathContent>> on PdfDocument:

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("statement.pdf")?;

// Transactions band: skip the header (top 92pt) and footer (bottom 72pt)
let region = Rect::new(36.0, 72.0, 540.0, 628.0);
let lines = doc.extract_lines_in_rect(0, region)?;
for line in &lines {
    println!("{:?}", line.bbox);
}

Python: the fluent region exposes lines through extract_text_lines():

from pdf_oxide import PdfDocument

doc = PdfDocument("statement.pdf")

# Same band as the Rust example above
region = doc.within(0, (36, 72, 540, 628))
for line in region.extract_text_lines():
    print(line.text, line.bbox)

Swift: extractLinesInRect(_:x:y:w:h:) returns [TextLine]:

import PdfOxide

let doc = try PdfDocument(path: "statement.pdf")
let lines = try doc.extractLinesInRect(0, x: 36, y: 72, w: 540, h: 628)
for line in lines {
    print(line.text, line.bbox, line.wordCount)
}

C++: extract_lines_in_rect(page, x, y, w, h) returns std::vector<TextLine>:

auto lines = doc.extract_lines_in_rect(0, 36, 72, 540, 628);
for (const auto& line : lines) {
    std::cout << line.text << "\n";
}

Dart: extractLinesInRect(page, x, y, w, h) returns List<TextLine>:

final lines = doc.extractLinesInRect(0, 36, 72, 540, 628);
for (final line in lines) {
    print('${line.text} ${line.bbox}');
}

R: pdf_extract_lines_in_rect(doc, page, x, y, width, height):

lines <- pdf_extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

Julia: extract_lines_in_rect(doc, page, x, y, w, h):

lines = extract_lines_in_rect(doc, 0, 36, 72, 540, 628)
for line in lines
    println(line.text, " ", line.bbox)
end

Zig: extractLinesInRect(allocator, page, x, y, w, h):

const lines = try doc.extractLinesInRect(a, 0, 36, 72, 540, 628);  // freeTextLines

Objective-C: extractLinesInRect:x:y:w:h: returns NSArray<POXTextLine*>:

NSArray<POXTextLine*> *lines = [doc extractLinesInRect:0 x:36 y:72 w:540 h:628 error:&err];

Elixir: extract_lines_in_rect(doc, page, x, y, w, h):

{:ok, lines} = PdfOxide.extract_lines_in_rect(doc, 0, 36, 72, 540, 628)

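Go / C#. The C entry point for extract_lines_in_rect exists, but the Go and C# wrappers do not expose it yet. In these two languages, either extract the lines for the whole page and filter them by the returned bounding boxes, or use ExtractWordsInRect (Go) and group the words into lines yourself.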
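Grouping words into lines yourself could look roughly like the sketch below. It is written in Python as a language-neutral illustration and assumes each word record carries text, x0, and y0 (the fields used in the Python bank statement example); the Word class, the group_into_lines helper, and the y_tolerance value are hypothetical, and the field names in the Go binding may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x0: float  # left edge, PDF points
    y0: float  # bottom edge, PDF points (origin at the bottom-left)


def group_into_lines(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group words whose y0 values are within y_tolerance into one line.

    Lines are returned top to bottom; words within a line run left to right.
    """
    lines: List[List[Word]] = []
    # Higher y0 is higher on the page, so sort in descending y0 order
    for word in sorted(words, key=lambda w: -w.y0):
        if lines and abs(lines[-1][0].y0 - word.y0) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x0)) for line in lines]


# Hypothetical words from a transactions band, in arbitrary order
words = [
    Word("Total", 50, 100),
    Word("2024-01-03", 50, 300.5),
    Word("42.00", 400, 100.4),
    Word("Coffee", 150, 300),
]
print(group_into_lines(words))
```

A fixed tolerance is the simplest choice; documents with mixed font sizes may need a tolerance derived from word height instead.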

Extracting tables in a region (`extract_tables_in_rect`)

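Scopes table detection to a single rectangle: only tables whose bounding box intersects that rectangle are returned. This is the geometry-aware counterpart to the fluent within(...).extract_tables() shown above.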

C ABI signature:

FfiTableList *pdf_document_extract_tables_in_rect(
    PdfDocument *handle,
    int32_t page_index,
    float x, float y, float w, float h,
    int32_t *error_code);

Rust: extract_tables_in_rect(page_index, region) -> Result<Vec<Table>> (the ..._with_config variant accepts a custom TableDetectionConfig):

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let doc = PdfDocument::open("invoice.pdf")?;
let region = Rect::new(50.0, 200.0, 500.0, 400.0);
let tables = doc.extract_tables_in_rect(0, region)?;
for table in &tables {
    println!("{} rows × {} cols", table.rows.len(), table.col_count);
}

Python: through the fluent region:

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])

Swift: extractTablesInRect(_:x:y:w:h:) returns [Table]:

let tables = try doc.extractTablesInRect(0, x: 50, y: 200, w: 500, h: 400)
for table in tables {
    print("\(table.rowCount) rows, header: \(table.hasHeader)")
}

C++: extract_tables_in_rect(page, x, y, w, h) returns std::vector<Table>:

auto tables = doc.extract_tables_in_rect(0, 50, 200, 500, 400);
for (const auto& table : tables) {
    std::cout << table.rows.size() << " rows\n";
}

Dart: extractTablesInRect(page, x, y, w, h) returns List<Table>:

final tables = doc.extractTablesInRect(0, 50, 200, 500, 400);
for (final table in tables) {
    print('${table.rows.length} rows');
}

R: pdf_extract_tables_in_rect(doc, page, x, y, width, height):

tables <- pdf_extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Julia: extract_tables_in_rect(doc, page, x, y, w, h):

tables = extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

Zig: extractTablesInRect(allocator, page, x, y, w, h):

const tables = try doc.extractTablesInRect(a, 0, 50, 200, 500, 400);

Objective-C: extractTablesInRect:x:y:w:h: returns NSArray<POXTable*>:

NSArray<POXTable*> *tables = [doc extractTablesInRect:0 x:50 y:200 w:500 h:400 error:&err];

Elixir: extract_tables_in_rect(doc, page, x, y, w, h):

{:ok, tables} = PdfOxide.extract_tables_in_rect(doc, 0, 50, 200, 500, 400)

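Go / C#. As with lines, the C entry point for extract_tables_in_rect exists but is not yet wrapped in Go or C#. Call ExtractTables(page) to extract the whole page, then keep the tables whose bounding box falls inside the target rectangle.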

How do I extract a page automatically without choosing between text and OCR?

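When you are not sure whether a page is digital text, a scan, or a mix of both, extract_page_auto does the routing for you. It runs AutoExtractor, which routes between text and OCR per region and falls back gracefully to the native text layer (it never throws an obscure OCR error), and returns a PageExtraction as JSON: the page kind, the text assembled in reading order, a confidence, a typed reason, an ocr_used flag, and a regions[] array. Each region carries bbox, kind, text, confidence, source, and reason; even when a region's text is empty, its bbox and reason are still present, so reading order is never silently broken.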

The function accepts {}: pass empty or null options JSON to use the defaults, or supply an AutoExtractOptions object. The recognized fields (serialized in snake_case) are:

Field	Type	Default	Meaning
`mode`	`"text_only"` | `"auto"` | `"force_ocr"`	`"auto"`	Text vs. OCR routing strategy
`reconstruct_image_tables`	bool	`true`	Rebuild image-only tables with a spatial detector over the OCR spans
`emit_placeholders`	bool	`true`	Insert positioned Figure/Table placeholders into the text stream
`ocr_languages`	string[]	`[]`	OCR language hints (e.g. `["english","chinese"]`)
`min_text_confidence`	float | null	`null`	Confidence threshold for the auto decision
`table_confidence`	float | null	`null`	Threshold for image-table reconstruction
`force_ocr_pages`	int[]	`[]`	Page indices (0-based) that are forced through OCR
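Since the options travel as a JSON string, a small builder can catch typos before they reach the library. The sketch below is one possible approach and uses only the field names and mode values from the table above. The build_auto_options helper is hypothetical, and rejecting unknown keys is a choice made here; how PDF Oxide itself treats unrecognized keys is not stated on this page.

```python
import json
from typing import Any

# Field names and mode values taken from the options table
KNOWN_FIELDS = {
    "mode",
    "reconstruct_image_tables",
    "emit_placeholders",
    "ocr_languages",
    "min_text_confidence",
    "table_confidence",
    "force_ocr_pages",
}
VALID_MODES = {"text_only", "auto", "force_ocr"}


def build_auto_options(**fields: Any) -> str:
    """Serialize options to JSON, rejecting unknown keys and invalid modes."""
    unknown = set(fields) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    if "mode" in fields and fields["mode"] not in VALID_MODES:
        raise ValueError(f"invalid mode: {fields['mode']!r}")
    # Omitted fields are left out so the documented defaults apply
    return json.dumps(fields)


opts = build_auto_options(mode="force_ocr", force_ocr_pages=[0, 2])
print(opts)  # {"mode": "force_ocr", "force_ocr_pages": [0, 2]}
```

The resulting string can be passed as the options_json argument of extract_page_auto; calling the builder with no arguments yields "{}", which selects the defaults.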

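OCR feature flag. OCR actually runs only when the library was built with the ocr feature enabled; otherwise extract_page_auto falls back to the native text layer without raising an error. The auto entry point is available in Python, Go, C#, Swift, WASM, and the C ABI. In Rust it is the library-level AutoExtractor API rather than a one-line method on PdfDocument; see below for details.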

Python: extract_page_auto(page, options_json=None) -> str (JSON):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("mixed-scan.pdf")

# Defaults (balanced preset)
page = json.loads(doc.extract_page_auto(0))
print(page["kind"], page["confidence"], page["ocr_used"])
for region in page["regions"]:
    print(region["kind"], region["bbox"], region["reason"])

# With options
opts = json.dumps({"mode": "auto", "reconstruct_image_tables": True,
                   "ocr_languages": ["english"]})
page = json.loads(doc.extract_page_auto(0, opts))

Go: ExtractPageAuto(pageIndex, opts ...AutoOption) (string, error) (returns JSON; configured through functional options):

package main

import (
    "encoding/json"
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("mixed-scan.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    raw, err := doc.ExtractPageAuto(0)
    if err != nil { log.Fatal(err) }

    var page map[string]any
    if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
    fmt.Println(page["kind"], page["confidence"], page["ocr_used"])
}

C#: ExtractPageAuto(int pageIndex, string? optionsJson = null) -> string (JSON):

using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed-scan.pdf");

// Defaults
string json = doc.ExtractPageAuto(0);
using var page = JsonDocument.Parse(json);
Console.WriteLine(page.RootElement.GetProperty("kind"));

// With options
string opts = """{"mode":"auto","ocr_languages":["english"]}""";
string json2 = doc.ExtractPageAuto(0, opts);

Swift: extractPageAuto(_:optionsJson:) -> String (defaults to "{}"):

let json = try doc.extractPageAuto(0, optionsJson: "{}")

JavaScript (WASM): extractPageAuto(pageIndex, optionsJson?):

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const page = JSON.parse(doc.extractPageAuto(0));
console.log(page.kind, page.confidence, page.ocr_used);
doc.free();

Rust: the auto path is the AutoExtractor library API. Build an AutoExtractOptions (the fast(), balanced(), or high_fidelity() presets, or the fluent builder) and call extract_page, which returns a typed PageExtraction (no JSON round trip):

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::{AutoExtractor, AutoExtractOptions, ExtractMode};

let doc = PdfDocument::open("mixed-scan.pdf")?;

// Default (balanced) preset
let page = AutoExtractor::new().extract_page(&doc, 0)?;
println!("{:?} conf={} ocr={}", page.kind, page.confidence, page.ocr_used);

// Custom options via the builder
let opts = AutoExtractOptions::builder()
    .mode(ExtractMode::Auto)
    .reconstruct_image_tables(true)
    .ocr_languages(["english"])
    .build();
let page = AutoExtractor::with(opts).extract_page(&doc, 0)?;
for region in &page.regions {
    println!("{:?} {:?} {:?}", region.kind, region.bbox, region.reason);
}

C++: extract_page_auto(page, options_json = "") returns the JSON envelope:

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed-scan.pdf");
auto json = doc.extract_page_auto(0);                                    // defaults
auto json2 = doc.extract_page_auto(0, R"({"mode":"auto","ocr_languages":["english"]})");

Dart: extractPageAuto(page, [optionsJson]) returns the JSON envelope:

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed-scan.pdf');
final page = jsonDecode(doc.extractPageAuto(0));
print('${page["kind"]} ${page["confidence"]} ${page["ocr_used"]}');
doc.close();

R: pdf_extract_page_auto(doc, page, options_json = NULL) returns JSON:

library(jsonlite)

doc  <- pdf_open("mixed-scan.pdf")
page <- fromJSON(pdf_extract_page_auto(doc, 0))
cat(page$kind, page$confidence, page$ocr_used, "\n")

Julia: extract_page_auto(doc, page, options = "{}") returns JSON:

using PdfOxide, JSON

doc  = open_document("mixed-scan.pdf")
page = JSON.parse(extract_page_auto(doc, 0))
println(page["kind"], " ", page["confidence"], " ", page["ocr_used"])

Zig: extractPageAuto(allocator, page, options_json) returns JSON bytes:

const json = try doc.extractPageAuto(a, 0, null);  // free json

Objective-C: extractPageAuto:optionsJson:error: returns the JSON envelope:

NSString *json = [doc extractPageAuto:0 optionsJson:@"{}" error:&err];

Elixir: extract_page_auto(doc, page, options_json \\ "") returns JSON:

{:ok, json} = PdfOxide.extract_page_auto(doc, 0)
page = Jason.decode!(json)
IO.inspect({page["kind"], page["confidence"], page["ocr_used"]})

Java: the auto path is the AutoExtractor API (extractPage returns a typed result; extractTextForPage returns plain text):

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf"))) {
    AutoExtractor ax = AutoExtractor.of(doc);             // or .fast/.balanced/.highFidelity
    String text = ax.extractTextForPage(0);               // graceful native/OCR routing
    System.out.println(text);
}

Kotlin

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor

PdfDocument.open(java.nio.file.Path.of("mixed-scan.pdf")).use { doc ->
    val ax = AutoExtractor.of(doc)
    println(ax.extractTextForPage(0))
}

Scala

import fyi.oxide.pdf.{PdfDocument, AutoExtractor}
import scala.util.Using

Using.resource(PdfDocument.open("mixed-scan.pdf")) { doc =>
  val ax = AutoExtractor.of(doc)
  println(ax.extractTextForPage(0))
}

PHP: the rich JSON envelope is available through AutoExtractor::extractPageJson:

use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed-scan.pdf');
$ax  = AutoExtractor::balanced($doc);
$page = json_decode($ax->extractPageJson(0), true);
echo $page['kind'], ' ', $page['confidence'], ' ', $page['ocr_used'];

Ruby: auto_extractor.extract_page(page) returns the parsed envelope, merged into a Hash:

require 'pdf_oxide'

PdfOxide::PdfDocument.open('mixed-scan.pdf') do |doc|
  result = doc.auto_extractor.extract_page(0)
  cls = result[:classification]            # full PageExtraction JSON as a Hash
  puts [cls['kind'], cls['confidence'], cls['ocr_used']].join(' ')
end

How do I get structured, typed regions as JSON?

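For a structured view of the whole page (headings, body blocks, headers and footers, page numbers, and column order), use the structured extraction entry point. It returns a StructuredPage: page_index, page_width, page_height, and a regions[] array in which each region carries kind (its semantic role), text, bbox, spans, and column_index (for multi-column reading order). Region kinds include body blocks, structural headings (H1–H6), margin labels, headers and footers, page numbers, and decorative elements.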

Most bindings return a JSON string (the C ABI serializes once, and each binding deserializes into native types); Rust returns a typed StructuredPage directly.

C ABI signature:

char *pdf_document_extract_structured_to_json(
    PdfDocument *handle,
    int32_t page_index,
    int32_t *error_code);

Python: extract_structured(page) -> str (JSON; deserialize with json.loads):

import json
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
page = json.loads(doc.extract_structured(0))

print(page["page_width"], page["page_height"])
for region in page["regions"]:
    print(region["kind"], region["column_index"], region["text"][:60])
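Once the page is parsed, the column_index field makes it straightforward to collect text per column. The sketch below is a minimal example under stated assumptions: the text_by_column helper is hypothetical, the sample dictionary is made up to mirror the documented field names (the kind values in particular are placeholders, not confirmed serialized names), and regions are assumed to arrive in reading order.

```python
from collections import defaultdict
from typing import Any, Dict, List


def text_by_column(page: Dict[str, Any]) -> Dict[Any, List[str]]:
    """Collect non-empty region text per column_index, keeping region order."""
    columns: Dict[Any, List[str]] = defaultdict(list)
    for region in page["regions"]:
        text = region.get("text", "").strip()
        if text:
            columns[region.get("column_index")].append(text)
    return dict(columns)


# Hypothetical parsed StructuredPage; kind values are illustrative only
sample_page = {
    "page_index": 0,
    "page_width": 612,
    "page_height": 792,
    "regions": [
        {"kind": "heading", "text": "Results", "column_index": 0},
        {"kind": "body", "text": "First column body.", "column_index": 0},
        {"kind": "body", "text": "Second column body.", "column_index": 1},
        {"kind": "decoration", "text": "", "column_index": 1},
    ],
}
print(text_by_column(sample_page))
```

With real data, sample_page would be replaced by json.loads(doc.extract_structured(0)).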

Go: ExtractStructured(page) (string, error):

raw, err := doc.ExtractStructured(0)
if err != nil { log.Fatal(err) }

var page map[string]any
if err := json.Unmarshal([]byte(raw), &page); err != nil { log.Fatal(err) }
for _, r := range page["regions"].([]any) {
    region := r.(map[string]any)
    fmt.Println(region["kind"], region["text"])
}

C#: ExtractStructured(int page) -> string:

using System.Text.Json;

string json = doc.ExtractStructured(0);
using var page = JsonDocument.Parse(json);
foreach (var region in page.RootElement.GetProperty("regions").EnumerateArray())
{
    Console.WriteLine(region.GetProperty("kind"));
}

Swift: extractStructuredJson(_:) -> String:

let json = try doc.extractStructuredJson(0)

JavaScript (WASM): extractStructured(pageIndex) (returns a JSON string with camelCase keys):

const page = JSON.parse(doc.extractStructured(0));
for (const region of page.regions) {
    console.log(region.kind, region.columnIndex);
}

Rust: extract_structured(page_index) -> Result<StructuredPage> returns typed regions directly (no JSON round trip). The extract_structured_with_column_mode variant can force ColumnMode::Two/Single on difficult layouts:

use pdf_oxide::PdfDocument;

let doc = PdfDocument::open("report.pdf")?;
let page = doc.extract_structured(0)?;
for region in &page.regions {
    println!("{:?} col={:?}: {}", region.kind, region.column_index, region.text);
}

C++: extract_structured_json(page) returns a JSON string:

auto json = doc.extract_structured_json(0);

Dart: extractStructuredJson(page) returns a JSON string:

import 'dart:convert';

final page = jsonDecode(doc.extractStructuredJson(0));
for (final region in page['regions']) {
    print('${region["kind"]} ${region["column_index"]}');
}

R: pdf_extract_structured_json(doc, page) returns JSON:

library(jsonlite)

page <- fromJSON(pdf_extract_structured_json(doc, 0))
print(page$page_width)

Julia: extract_structured_json(doc, page) returns JSON:

using JSON
page = JSON.parse(extract_structured_json(doc, 0))
for region in page["regions"]
    println(region["kind"], " ", region["column_index"])
end

Zig: extractStructuredJson(allocator, page) returns JSON bytes:

const json = try doc.extractStructuredJson(a, 0);  // free json

Objective-C: extractStructuredJson:error: returns a JSON string:

NSString *json = [doc extractStructuredJson:0 error:&err];

Elixir: extract_structured_json(doc, page) returns JSON:

{:ok, json} = PdfOxide.extract_structured_json(doc, 0)
page = Jason.decode!(json)

Java: extractStructured(page) returns a JSON string:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

String json = doc.extractStructured(0);
JsonNode page = new ObjectMapper().readTree(json);
for (JsonNode region : page.get("regions")) {
    System.out.println(region.get("kind").asText());
}

Kotlin

val json = doc.extractStructured(0)   // JSON string; parse with your library of choice

Scala

val json = doc.extractStructured(0)   // JSON string

Clojure: (pdf/extract-structured doc page) returns a JSON string:

(require '[clojure.data.json :as json])

(with-open [doc (pdf/open "report.pdf")]
  (let [page (json/read-str (pdf/extract-structured doc 0))]
    (doseq [region (get page "regions")]
      (println (get region "kind") (get region "column_index")))))

Ruby: extract_structured(page) returns the parsed StructuredPage as a Hash:

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  page = doc.extract_structured(0)
  page['regions'].each { |r| puts "#{r['kind']} #{r['column_index']}" }
end

PHP: extractStructured($page) returns a deserialized associative array:

$doc = PdfOxide\PdfDocument::open('report.pdf');
$page = $doc->extractStructured(0);
foreach ($page['regions'] as $region) {
    echo $region['kind'], ' ', $region['column_index'], "\n";
}

Coordinate reference

PDF uses a bottom-left origin and measures in points (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To target the top 1-inch band, write:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you are coming from image coordinates (top-left origin), flip y accordingly.
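The flip can be written as a one-line formula: the PDF y of a box is the page height minus the distance from the top to the bottom edge of the box. The sketch below is a minimal illustration; the image_box_to_pdf_rect helper is hypothetical, and it assumes the page's MediaBox starts at (0, 0) and that the image was rendered at a known DPI.

```python
from typing import Tuple


def image_box_to_pdf_rect(
    left: float,
    top: float,
    width: float,
    height: float,
    page_height: float,
    dpi: float = 72.0,
) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin pixel box to a PDF (x, y, width, height) rect.

    Pixels are scaled to points (72 per inch), then y is measured up from
    the bottom of the page instead of down from the top.
    """
    scale = 72.0 / dpi
    x = left * scale
    w = width * scale
    h = height * scale
    y = page_height - top * scale - h
    return (x, y, w, h)


# Top 1-inch band of a Letter page, selected on a 144 DPI rendering
print(image_box_to_pdf_rect(0, 0, 1224, 144, page_height=792, dpi=144))
```

At 72 DPI the scale factor is 1, so the helper reduces to the plain y flip.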

Fetch the page's actual MediaBox before doing the arithmetic:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)
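With the MediaBox in hand, header and footer bands can be computed instead of hard-coded. The following sketch assumes the (llx, lly, urx, ury) tuple shape noted in the comment above; the top_band and bottom_band helpers are hypothetical and not part of PDF Oxide.

```python
from typing import Tuple

MediaBox = Tuple[float, float, float, float]  # (llx, lly, urx, ury)
Rect = Tuple[float, float, float, float]      # (x, y, width, height)


def top_band(media_box: MediaBox, band_height: float) -> Rect:
    """Rect covering the top band_height points of the page."""
    llx, lly, urx, ury = media_box
    return (llx, ury - band_height, urx - llx, band_height)


def bottom_band(media_box: MediaBox, band_height: float) -> Rect:
    """Rect covering the bottom band_height points of the page."""
    llx, lly, urx, ury = media_box
    return (llx, lly, urx - llx, band_height)


letter = (0, 0, 612, 792)
print(top_band(letter, 72))     # (0, 720, 612, 72)
print(bottom_band(letter, 72))  # (0, 0, 612, 72)
```

The resulting tuple can be passed straight to within(page, rect), which keeps the same code working for A4 or any other page size.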

Rust

let mb = editor.get_page_media_box(0)?;   // [f32; 4]

Java: page.mediaBox() returns a BBox (x0, y0, x1, y1):

import fyi.oxide.pdf.geometry.BBox;

BBox mb = doc.page(0).mediaBox();         // (x0, y0, x1, y1) in PDF user space
double w = mb.width(), h = mb.height();   // 612 × 792 for US Letter

Kotlin

val mb = doc.page(0).mediaBox()           // BBox(x0, y0, x1, y1)

Scala

val mb = doc.page(0).mediaBox             // BBox(x0, y0, x1, y1)

C++: through the editor, get_page_media_box(page):

auto editor = pdf_oxide::DocumentEditor::open("doc.pdf");
auto mb = editor.get_page_media_box(0);   // Bbox{x, y, width, height}

Swift

let editor = try DocumentEditor.open("doc.pdf")
let mb = try editor.getPageMediaBox(0)    // Bbox(x, y, width, height)

Dart

final editor = DocumentEditor.open('doc.pdf');
final mb = editor.getPageMediaBox(0);     // Bbox(x, y, width, height)

R

editor <- pdf_editor_open("doc.pdf")
mb <- pdf_editor_get_page_media_box(editor, 0)   # list(x=, y=, width=, height=)

Julia

editor = open_editor("doc.pdf")
mb = get_page_media_box(editor, 0)        # Bbox

Zig

var editor = try pdf_oxide.DocumentEditor.openEditor("doc.pdf");
const mb = try editor.getPageMediaBox(0);  // Bbox{ x, y, width, height }

Objective-C

POXDocumentEditor *editor = [POXDocumentEditor openEditor:@"doc.pdf" error:&err];
POXBbox mb = [editor pageMediaBox:0 error:&err];   // {x, y, width, height}

Elixir

{:ok, editor} = PdfOxide.open_editor("doc.pdf")
{:ok, mb} = PdfOxide.get_page_media_box(editor, 0)   # %Bbox{}

Go / C#: in-rect helpers

Go and C# do not yet offer the fluent within() chain, but the underlying low-level methods are the same:

Method	Go	C#
Text in rect	`doc.ExtractTextInRect(page, x, y, w, h)`	`doc.ExtractTextInRect(page, x, y, w, h)`
Words in rect	`doc.ExtractWordsInRect(page, x, y, w, h)`	(not yet wrapped)
Images in rect	`doc.ExtractImagesInRect(page, x, y, w, h)`	(not yet wrapped)

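When you need several extraction types on the same rectangle in Go or C#, store the rectangle in variables and call the helpers in turn. The fluent interface will follow once the editor API stabilizes.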

FAQ

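What is the difference between extract_words() and extract_lines_in_rect() within a region? extract_words() returns one record per word; extract_lines_in_rect() returns one record (text, bounding box, and word count) for each line whose bounding box intersects the rectangle. Use line extraction when you need whole lines in reading order (address blocks, statement rows, multi-line totals) without regrouping words yourself.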

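Does extract_page_auto always run OCR? No. It routes per region. In the default "auto" mode it escalates to OCR only when the native text layer is missing or suspect, and OCR actually runs only if the library was built with the ocr feature. Without that feature it falls back to the native text layer and does not throw an obscure OCR error.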

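Which bindings support the lines-in-rect and tables-in-rect variants? Rust, the C ABI, and Swift expose extract_lines_in_rect / extract_tables_in_rect directly. Python gets the same results through the fluent region (within(...).extract_text_lines() and within(...).extract_tables()). Go and C# have not yet wrapped the in-rect line and table entry points; extract the whole page and filter by the returned bounding boxes.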

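How fast is scoped extraction? Scoping adds no measurable overhead on top of a full-page extraction: PDF Oxide averages 0.8ms per extraction on the benchmark corpus (100% pass rate), and an in-rect call simply filters that result by bounding box.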

