What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

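Region extraction: get content from a specific area of a page

When you process invoices, bank statements, tax forms, or any other templated layout, you usually already know where each field sits. Instead of extracting the whole page and then searching for the target value, point PDF Oxide at the rectangle and get back only the content inside it.

The fluent API within(page, rect) returns a scoped region on which you can chain the extraction methods extract_text(), extract_words(), extract_chars(), and extract_tables().

Binding coverage. within(page, rect) is available in Python, Rust, and WASM. Go and C# expose equivalent low-level helpers (ExtractTextInRect, ExtractWordsInRect, ExtractImagesInRect); see below.

Quick example

rect is (x, y, width, height) in PDF points, with the origin at the bottom-left corner of the page. A Letter-size page is 612 × 792 points.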

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")

# Top 92 points of page 0: a typical header area
header = doc.within(0, (0, 700, 612, 92)).extract_text()
print(header)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::geometry::Rect;

let mut doc = PdfDocument::open("invoice.pdf")?;
let header = doc.within(0, Rect::new(0.0, 700.0, 612.0, 92.0)).extract_text()?;
println!("{}", header);

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const headerRegion = doc.within(0, [0, 700, 612, 92]);
console.log(headerRegion.extractText());
doc.free();

Go (low-level helper, same result)

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("invoice.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

    // ExtractTextInRect(pageIndex, x, y, width, height)
    header, err := doc.ExtractTextInRect(0, 0, 700, 612, 92)
    if err != nil { log.Fatal(err) }
    fmt.Println(header)
}

C# (low-level helper)

using PdfOxide;

using var doc = PdfDocument.Open("invoice.pdf");
string header = doc.ExtractTextInRect(0, 0, 700, 612, 92);
Console.WriteLine(header);

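Chaining extractions on the same region

In Python, Rust, and WASM, the fluent within() form lets you call any extraction method on the same region without specifying the rectangle again: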

Python

doc = PdfDocument("invoice.pdf")
region = doc.within(0, (400, 100, 200, 200))   # 200×200 box at the lower right

total_text = region.extract_text()              # plain text
words      = region.extract_words()             # word-level records
chars      = region.extract_chars()             # character-level records

Rust

let region = doc.within(0, Rect::new(400.0, 100.0, 200.0, 200.0));
let text  = region.extract_text()?;
let words = region.extract_words()?;

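Common use cases

Invoice field extraction

Invoices usually place the vendor address, invoice number, and line-item table in fixed regions. Define the rectangles once per template: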

from pdf_oxide import PdfDocument

TEMPLATES = {
    "acme_v1": {
        "invoice_no":  (450, 720,  120,  20),
        "issue_date":  (450, 700,  120,  20),
        "vendor_name": ( 50, 740,  300,  40),
        "total":       (450, 100,  120,  24),
    },
}

def parse_invoice(path, template):
    doc = PdfDocument(path)
    out = {}
    for field, rect in template.items():
        out[field] = doc.within(0, rect).extract_text().strip()
    return out

print(parse_invoice("invoice-2025-04.pdf", TEMPLATES["acme_v1"]))
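Hand-typed coordinates make it easy to define a rectangle that runs past the page edge, so it may be worth checking a template before using it. The sketch below is not part of PDF Oxide: rect_in_page and check_template are hypothetical helpers, and the Letter page box is an assumed default.

```python
# Hypothetical helpers, not part of the PDF Oxide API.
LETTER = (0, 0, 612, 792)  # assumed page box as (llx, lly, urx, ury), in points

def rect_in_page(rect, page_box=LETTER):
    """Return True if an (x, y, width, height) rect lies fully inside page_box."""
    x, y, w, h = rect
    llx, lly, urx, ury = page_box
    return (w > 0 and h > 0
            and x >= llx and y >= lly
            and x + w <= urx and y + h <= ury)

def check_template(template, page_box=LETTER):
    """Return the sorted names of fields whose rect falls outside the page."""
    return sorted(name for name, rect in template.items()
                  if not rect_in_page(rect, page_box))

# Same rectangles as the acme_v1 template above.
acme_v1 = {
    "invoice_no":  (450, 720, 120, 20),
    "issue_date":  (450, 700, 120, 20),
    "vendor_name": ( 50, 740, 300, 40),
    "total":       (450, 100, 120, 24),
}
print(check_template(acme_v1))  # an empty list means every field fits the page
```

For pages that are not Letter size, pass the page's real MediaBox as page_box instead of relying on the default.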

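Transaction rows in bank statements

Most statements have a narrow "transactions" band. Crop the page to that band and call extract_words() to get each row's words, with their bounding boxes, in reading order: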

doc = PdfDocument("statement.pdf")
for page in range(doc.page_count()):
    txn_region = doc.within(page, (36, 72, 540, 650))   # skip header and footer
    for w in txn_region.extract_words():
        print(f"page {page}: {w.text} at ({w.x0:.0f},{w.y0:.0f})")
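To turn the word stream into one record per transaction, the words can be grouped by their vertical position. The following is a minimal sketch under stated assumptions: Word is a stand-in record carrying only the text, x0, and y0 attributes used in the loop above, group_rows is a hypothetical helper, and the 2-point tolerance is an arbitrary starting value to tune per statement layout.

```python
from collections import namedtuple

# Stand-in for a word record; only text, x0 and y0 are assumed.
Word = namedtuple("Word", "text x0 y0")

def group_rows(words, tolerance=2.0):
    """Group words whose y0 values lie within `tolerance` points into rows.

    Rows come back top to bottom (descending y0, because the PDF origin is
    at the bottom-left); words inside a row are ordered left to right.
    """
    rows = []
    for w in sorted(words, key=lambda w: -w.y0):
        # Compare against the first (highest) word of the current row.
        if rows and abs(rows[-1][0].y0 - w.y0) <= tolerance:
            rows[-1].append(w)
        else:
            rows.append([w])
    return [[w.text for w in sorted(row, key=lambda w: w.x0)] for row in rows]

# Synthetic words for illustration only.
sample = [
    Word("42.00", 480, 600.2),
    Word("2025-04-01", 40, 600.0),
    Word("Coffee", 120, 600.5),
    Word("2025-04-02", 40, 585.0),
    Word("Rent", 120, 585.0),
]
for row in group_rows(sample):
    print(row)
```

Comparing against the first word of each row keeps the grouping simple. A layout with tightly packed lines may need a smaller tolerance.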

Removing headers and footers

To index only the body text, crop off the top and bottom of each page:

Rust

let mut doc = PdfDocument::open("book.pdf")?;
for i in 0..doc.page_count()? {
    let body = doc.within(i, Rect::new(0.0, 100.0, 612.0, 600.0))
                  .extract_text()?;
    // index body …
}
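The hard-coded rectangle above fits a Letter page with a 92-point header and a 100-point footer. For mixed page sizes, the same rectangle can be derived from each page's box. This is a sketch only: body_rect is a hypothetical helper, and it assumes the box is given as (llx, lly, urx, ury) in points.

```python
def body_rect(media_box, top_margin, bottom_margin):
    """Return (x, y, width, height) for the page minus top and bottom bands.

    media_box is (llx, lly, urx, ury) in points; margins are in points.
    """
    llx, lly, urx, ury = media_box
    height = (ury - lly) - top_margin - bottom_margin
    if height <= 0:
        raise ValueError("margins leave no body area")
    return (llx, lly + bottom_margin, urx - llx, height)

print(body_rect((0, 0, 612, 792), top_margin=92, bottom_margin=100))
# (0, 100, 612, 600)
```

The result for a Letter page matches the rectangle used in the loop above.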

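Scoping table extraction to a region

When you already know that a page contains a table and where it sits, restrict extract_tables() to that rectangle: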

Python

tables = doc.within(0, (50, 200, 500, 400)).extract_tables()
for t in tables:
    for row in t["rows"]:
        print([c["text"] for c in row["cells"]])
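Once a table is in hand, a common next step is writing it out as CSV. The sketch below assumes only the shape used in the loop above (a "rows" list whose entries hold a "cells" list of dicts with a "text" key). table_to_csv is a hypothetical helper, and the sample table is synthetic.

```python
import csv
import io

def table_to_csv(table):
    """Serialize one table (rows -> cells -> text) to a CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table["rows"]:
        writer.writerow([cell["text"].strip() for cell in row["cells"]])
    return buf.getvalue()

# Synthetic table in the assumed shape, for illustration only.
sample = {"rows": [
    {"cells": [{"text": "Item"}, {"text": "Qty"}]},
    {"cells": [{"text": "Widget, large"}, {"text": " 2 "}]},
]}
print(table_to_csv(sample))
```

Using the csv module rather than joining on commas means that cell text containing commas or quotes is escaped correctly.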

Coordinate reference

PDF uses a bottom-left origin, with units in points (1 pt = 1/72 inch). A Letter-size page is (0, 0, 612, 792). To specify the top 1-inch band, write:

(x, y, w, h) = (0, 792 - 72, 612, 72)
             = (0, 720,      612, 72)

If you come from the world of image coordinates (top-left origin), remember to flip y.
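The flip can be written as a one-line conversion. This is a minimal sketch: image_rect_to_pdf is a hypothetical helper, all values are assumed to be in points already, and the page's lower-left corner is assumed to sit at (0, 0).

```python
def image_rect_to_pdf(left, top, width, height, page_height):
    """Convert a top-left-origin rect to a bottom-left-origin (x, y, w, h).

    All values are in points; `top` is measured down from the page's top edge.
    """
    return (left, page_height - top - height, width, height)

print(image_rect_to_pdf(0, 0, 612, 72, page_height=792))
# (0, 720, 612, 72)
```

Coordinates measured in pixels on a rendered page image need to be scaled by 72 / dpi before the conversion.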

Get the page's actual MediaBox before computing:

Python

doc = PdfDocument("doc.pdf")
mb = doc.page_media_box(0)       # (llx, lly, urx, ury)

Rust

let mb = editor.get_page_media_box(0)?;   // [f32; 4]

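Go / C#: in-rect helpers

Go and C# do not yet offer the fluent within() chain, but the underlying methods are the same:

Method	Go	C#
Text in rect	`doc.ExtractTextInRect(page, x, y, w, h)`	`doc.ExtractTextInRect(page, x, y, w, h)`
Words in rect	`doc.ExtractWordsInRect(page, x, y, w, h)`	(not yet wrapped)
Images in rect	`doc.ExtractImagesInRect(page, x, y, w, h)`	(not yet wrapped)

In Go or C#, if you need several kinds of extraction on the same rectangle, store the rectangle in a variable and call the helpers one after another. The fluent interface will be added once the editor API stabilizes.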
