What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which makes it well suited to LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

PDF Oxide Quick Start (Dart / Flutter)

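PDF Oxide is the fastest way to read PDFs in Dart and Flutter: text extraction averages 0.8ms, with a 100% pass rate across 3,830 PDFs. The pdf_oxide package is an idiomatic dart:ffi wrapper around the Rust core. PDF handles are released automatically by a NativeFinalizer (you can also call close() explicitly), C strings and buffers are copied into Dart for you, and C-ABI error codes are thrown as PdfOxideError exceptions.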

Installation

Add pdf_oxide to your pubspec.yaml:

dependencies:
  pdf_oxide: ^0.3.69

Then fetch the dependencies:

dart pub get

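The binding loads the native library (libpdf_oxide.{so,dylib,dll}) at runtime. The resolution order is PDF_OXIDE_LIB_PATH (a full path) → PDF_OXIDE_LIB_DIR → ../target/release → target/release → the system loader. For Flutter, bundle the platform library with your app and point PDF_OXIDE_LIB_PATH at it.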
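
To make the lookup order concrete, here is a minimal sketch that lists the candidate locations in the sequence described above. It is not the package's actual loader: the function names nativeLibraryName and candidateLibraryPaths are hypothetical, and the per-platform file names simply follow the libpdf_oxide.{so,dylib,dll} pattern mentioned on this page.

```dart
import 'dart:io' show Platform;

/// Hypothetical helper: picks the library file name for a platform,
/// following the libpdf_oxide.{so,dylib,dll} pattern from this page.
String nativeLibraryName({String? os}) {
  switch (os ?? Platform.operatingSystem) {
    case 'macos':
      return 'libpdf_oxide.dylib';
    case 'windows':
      return 'libpdf_oxide.dll';
    default:
      return 'libpdf_oxide.so';
  }
}

/// Hypothetical helper: returns candidate locations in resolution order.
/// The last entry is a bare file name, which stands for "let the system
/// loader search its default locations".
List<String> candidateLibraryPaths(Map<String, String> env, {String? os}) {
  final name = nativeLibraryName(os: os);
  final explicitPath = env['PDF_OXIDE_LIB_PATH'];
  final libDir = env['PDF_OXIDE_LIB_DIR'];
  return [
    // 1. A full path to the library file wins if it is set.
    if (explicitPath != null && explicitPath.isNotEmpty) explicitPath,
    // 2. Otherwise a directory that should contain the library.
    if (libDir != null && libDir.isNotEmpty) '$libDir/$name',
    // 3. and 4. Cargo build output, relative to the working directory.
    '../target/release/$name',
    'target/release/$name',
    // 5. System loader fallback.
    name,
  ];
}
```

Calling candidateLibraryPaths(Platform.environment) and printing the result can help when diagnosing a "library not found" failure, since it shows which locations would be tried first under these assumptions.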

Getting Started

Open a PDF and extract the text of its first page. Always close() the document when you are done; a try/finally block keeps this tidy.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('research-paper.pdf');
  try {
    print('Pages: ${doc.pageCount}');
    print('PDF version: ${doc.version}'); // e.g. 1.7
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}

To open a PDF that is already in memory (for example, data downloaded over HTTP in a Flutter app), use openFromBytes:

import 'dart:typed_data';
import 'package:pdf_oxide/pdf_oxide.dart';

void render(Uint8List bytes) {
  final doc = PdfDocument.openFromBytes(bytes);
  try {
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}

Text, Markdown, and HTML

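Every page can be rendered as plain text, Markdown, or HTML. The per-page methods take a zero-based page index; the variants with an …All() suffix run over the whole document.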

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  try {
    // Single page (index 0)
    print(doc.extractText(0));   // raw extracted text
    print(doc.toPlainText(0));   // normalized plain text
    print(doc.toMarkdown(0));    // Markdown with headings, lists, tables
    print(doc.toHtml(0));        // HTML

    // Whole document
    print(doc.toMarkdownAll());
    print(doc.toHtmlAll());
    print(doc.toPlainTextAll());
  } finally {
    doc.close();
  }
}
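
When you need one chunk per page (for instance, to index pages separately) rather than one result for the whole document, you can loop over the zero-based indexes yourself. The sketch below is library-agnostic: collectPages is a hypothetical helper, and it assumes only that a per-page renderer takes an int index and returns a String, as the examples above suggest.

```dart
/// Hypothetical helper: renders every page with [renderPage] and returns
/// one string per page. Page indexes are zero-based, so the valid range
/// is 0 through pageCount - 1.
List<String> collectPages(
  int pageCount,
  String Function(int pageIndex) renderPage,
) {
  return [for (var i = 0; i < pageCount; i++) renderPage(i)];
}
```

With an open document from the examples above, a call such as collectPages(doc.pageCount, doc.toMarkdown) would yield a list with one Markdown string per page, assuming toMarkdown has the signature shown on this page.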

If you prefer to work with one page at a time, a lightweight Page view is also available:

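final doc = PdfDocument.open('report.pdf');
try {
  final page = doc.page(0); // lightweight view of page index 0
  print(page.text());
  print(page.markdown());
} finally {
  doc.close(); // release the handle even if extraction throws
}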

Words and Lines with Coordinates

extractWords returns each word with its bounding box, font, and weight; extractTextLines returns whole lines. Coordinates are in PDF user-space points.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('paper.pdf');
  try {
    for (final word in doc.extractWords(0)) {
      final b = word.bbox; // Bbox(x, y, width, height)
      print("'${word.text}' at (${b.x}, ${b.y}) "
          'font=${word.fontName} size=${word.fontSize} bold=${word.bold}');
    }

    for (final line in doc.extractTextLines(0)) {
      print('${line.wordCount} words: ${line.text}');
    }
  } finally {
    doc.close();
  }
}

For glyph-level detail, extractChars returns each Char with its Unicode code point, bounding box, font name, and font size:

final doc = PdfDocument.open('paper.pdf');
try {
  for (final ch in doc.extractChars(0)) {
    print('${String.fromCharCode(ch.character)} @ ${ch.bbox} ${ch.fontSize}pt');
  }
} finally {
  doc.close(); // release the handle even if extraction throws
}
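
Because the coordinates are in points (1 point = 1/72 inch), drawing a highlight over a rendered page image usually means scaling to pixels and, depending on the target, flipping the vertical axis. The sketch below uses a stand-in PointRect class rather than the library's Bbox, and both helper functions are hypothetical. The flip assumes that (x, y) is the lower-left corner measured from a bottom-left origin, which is the default PDF user-space convention; check how the binding reports y before relying on it.

```dart
/// Stand-in for a bounding box with the same fields as the
/// Bbox(x, y, width, height) shown above. Not the library's own type.
class PointRect {
  const PointRect(this.x, this.y, this.width, this.height);
  final double x, y, width, height;
}

/// Scales a rectangle from PDF points to pixels at [dpi].
/// 72 points make one inch, so the scale factor is dpi / 72.
PointRect toPixels(PointRect r, {double dpi = 144}) {
  final scale = dpi / 72.0;
  return PointRect(r.x * scale, r.y * scale, r.width * scale, r.height * scale);
}

/// Converts a rectangle from a bottom-left origin (y grows upward) to a
/// top-left origin (y grows downward), given the page height in the
/// same unit as the rectangle.
PointRect flipVertically(PointRect r, double pageHeight) {
  return PointRect(r.x, pageHeight - r.y - r.height, r.width, r.height);
}
```

Flip first in points using the page height in points, then scale to pixels, so that both steps work in a consistent unit.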

Search

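search looks within a single page; searchAll scans the whole document. Both take a search term and a caseSensitive flag and return SearchResult records that carry the matched text, the page index, and the bounding box.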

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('manual.pdf');
  try {
    // Single page (page 0), case-insensitive
    for (final hit in doc.search(0, 'configuration', false)) {
      print("page ${hit.page}: '${hit.text}' at ${hit.bbox}");
    }

    // Across the whole document
    final hits = doc.searchAll('configuration', false);
    print('${hits.length} matches');
  } finally {
    doc.close();
  }
}
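
A common follow-up is to summarize where the matches occur. The sketch below counts matches per page from a sequence of zero-based page indexes; matchesPerPage is a hypothetical helper that does not depend on the library's types.

```dart
/// Hypothetical helper: counts how many matches fall on each page.
/// [pageIndexes] holds one zero-based page index per match.
Map<int, int> matchesPerPage(Iterable<int> pageIndexes) {
  final counts = <int, int>{};
  for (final page in pageIndexes) {
    counts[page] = (counts[page] ?? 0) + 1;
  }
  return counts;
}
```

With the results from the example above, this could be called as matchesPerPage(hits.map((hit) => hit.page)), assuming hit.page is an int as the printed output suggests.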

Creating PDFs

The Pdf builder turns Markdown, HTML, or plain text into a PDF. Call toBytes() to get the bytes or save() to write a file, then call close() when you are done.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final pdf = Pdf.fromMarkdown('# Hello World\n\nThis is a **PDF**.\n');
  try {
    pdf.save('output.pdf');
    final bytes = pdf.toBytes();
    print('Wrote ${bytes.length} bytes');
  } finally {
    pdf.close();
  }
}

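Pdf.fromHtml('<h1>Invoice</h1><p>Amount: \$42</p>') and Pdf.fromText('Plain text content.') work exactly the same way. Because a built Pdf is essentially just bytes, you can pass it straight to PdfDocument.openFromBytes(pdf.toBytes()) and re-extract its content without writing to disk.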

Error Handling

Every call that can fail throws a PdfOxideError (which implements Exception) carrying the underlying C-ABI error code:

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  try {
    final doc = PdfDocument.open('/nonexistent/nope.pdf');
    doc.close();
  } on PdfOxideError catch (e) {
    print('Failed to open PDF: $e');
  }
}

Next Steps

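Rust Quick Start: the native core that powers this binding
Python Quick Start: using PDF Oxide from Python
Text Extraction: detailed extraction options and practical recipes
Creating PDFs: advanced creation features, including metadata and encryption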