What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

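Getting Started with PDF Oxide (Dart / Flutter)

PDF Oxide is the fastest way to read PDFs from Dart / Flutter, with a mean text extraction time of 0.8ms and a 100% pass rate on 3,830 PDFs. The pdf_oxide package is an idiomatic dart:ffi wrapper around the Rust core: PDF handles are released automatically by a NativeFinalizer (and by an explicit close()), C strings and buffers are copied over to the Dart side, and C-ABI error codes surface as PdfOxideError exceptions.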

Installation

Add pdf_oxide to your pubspec.yaml:

dependencies:
  pdf_oxide: ^0.3.69

Then fetch the dependencies:

dart pub get

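The bindings load the native library (libpdf_oxide.{so,dylib,dll}) at runtime. It is resolved in this order: PDF_OXIDE_LIB_PATH (a full path), then PDF_OXIDE_LIB_DIR, then ../target/release, then target/release, and finally the system loader. For Flutter, bundle the library for each target platform with your app and point PDF_OXIDE_LIB_PATH at its location.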
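When a load fails, it can help to see which locations would be probed. The sketch below mirrors the documented order in plain Dart. It is not the package's own loader: candidateLibraryPaths and nativeLibraryName are hypothetical helpers, and the assumption that PDF_OXIDE_LIB_DIR is joined with the platform file name is mine.

```dart
import 'dart:io';

/// File name of the native library for the current OS, following the
/// libpdf_oxide.{so,dylib,dll} naming given in this guide.
String nativeLibraryName() {
  if (Platform.isMacOS) return 'libpdf_oxide.dylib';
  if (Platform.isWindows) return 'libpdf_oxide.dll';
  return 'libpdf_oxide.so';
}

/// Hypothetical helper: lists the paths to probe, in the documented order.
/// [env] is passed in so the function can be exercised without touching
/// the real process environment.
List<String> candidateLibraryPaths(Map<String, String> env, String libName) {
  final candidates = <String>[];

  // 1. PDF_OXIDE_LIB_PATH is a full path to the library file.
  final fullPath = env['PDF_OXIDE_LIB_PATH'];
  if (fullPath != null && fullPath.isNotEmpty) candidates.add(fullPath);

  // 2. PDF_OXIDE_LIB_DIR is a directory (assumed to hold [libName]).
  final dir = env['PDF_OXIDE_LIB_DIR'];
  if (dir != null && dir.isNotEmpty) candidates.add('$dir/$libName');

  // 3 and 4. Relative build output directories.
  candidates.add('../target/release/$libName');
  candidates.add('target/release/$libName');

  // 5. Bare name: resolution is left to the system loader.
  candidates.add(libName);
  return candidates;
}
```

Calling candidateLibraryPaths(Platform.environment, nativeLibraryName()) and printing the result shows the list for the current process.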

Quick Start

Open a PDF and extract the text of its first page. Always close() the document when you are done; try/finally keeps the cleanup tidy.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('research-paper.pdf');
  try {
    print('Pages: ${doc.pageCount}');
    print('PDF version: ${doc.version}'); // e.g. 1.7
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}
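If the try/finally pattern shows up in many places, it can be factored into a small helper. The sketch below is plain Dart and not part of pdf_oxide; withResource is a hypothetical name, and the helper works with any object that has a cleanup step.

```dart
/// Runs [body] with [resource] and always calls [close] afterwards,
/// whether [body] returns normally or throws.
T withResource<R, T>(
  R resource,
  void Function(R) close,
  T Function(R) body,
) {
  try {
    return body(resource);
  } finally {
    close(resource);
  }
}

// Possible use with the API shown above (types written out explicitly):
//
//   final pages = withResource<PdfDocument, int>(
//     PdfDocument.open('research-paper.pdf'),
//     (PdfDocument d) => d.close(),
//     (PdfDocument d) => d.pageCount,
//   );
```

Passing the close step as a callback keeps the helper independent of any particular class, at the cost of a slightly longer call site.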

To open a PDF that is already in memory (for example, one downloaded over HTTP in a Flutter app), use openFromBytes.

import 'dart:typed_data';
import 'package:pdf_oxide/pdf_oxide.dart';

void render(Uint8List bytes) {
  final doc = PdfDocument.openFromBytes(bytes);
  try {
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}

Text, Markdown, and HTML

Each page can be rendered as plain text, Markdown, or HTML. The per-page methods take a 0-based page index, and the …All() variants run over the whole document.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  try {
    // Single page (index 0)
    print(doc.extractText(0));   // raw extracted text
    print(doc.toPlainText(0));   // normalized plain text
    print(doc.toMarkdown(0));    // Markdown with headings, lists, and tables
    print(doc.toHtml(0));        // HTML

    // Whole document
    print(doc.toMarkdownAll());
    print(doc.toHtmlAll());
    print(doc.toPlainTextAll());
  } finally {
    doc.close();
  }
}

If you prefer to work one page at a time, a lightweight Page view is also available.

final doc = PdfDocument.open('report.pdf');
try {
  final page = doc.page(0);
  print(page.text());
  print(page.markdown());
} finally {
  // Close the document even if extraction throws.
  doc.close();
}

Words and Lines with Coordinates

extractWords returns every word together with its bounding box, font, and weight; extractTextLines returns whole lines. Coordinates are in PDF user-space points.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('paper.pdf');
  try {
    for (final word in doc.extractWords(0)) {
      final b = word.bbox; // Bbox(x, y, width, height)
      print("'${word.text}' at (${b.x}, ${b.y}) "
          'font=${word.fontName} size=${word.fontSize} bold=${word.bold}');
    }

    for (final line in doc.extractTextLines(0)) {
      print('${line.wordCount} words: ${line.text}');
    }
  } finally {
    doc.close();
  }
}
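To draw these boxes over a rendered page image, the point values have to be scaled to pixels. PDF user space defaults to 72 units per inch, so the conversion is a single multiplication. The sketch below is plain Dart: Box is a stand-in with the same field names the example above reads, not the package's Bbox type, and it does not flip the y axis because this guide does not state which way the axis points.

```dart
/// Converts a length in PDF points (1/72 inch by default) to pixels.
double pointsToPixels(double points, double dpi) => points * dpi / 72.0;

/// Minimal stand-in for a bounding box with x, y, width, and height.
/// Hypothetical type for illustration only.
class Box {
  const Box(this.x, this.y, this.width, this.height);

  final double x, y, width, height;

  /// Scales every field from points to pixels at [dpi].
  Box toPixels(double dpi) => Box(
        pointsToPixels(x, dpi),
        pointsToPixels(y, dpi),
        pointsToPixels(width, dpi),
        pointsToPixels(height, dpi),
      );
}
```

For a word box b from extractWords, Box(b.x, b.y, b.width, b.height).toPixels(144) would give its size on a page rendered at 144 DPI, assuming those fields are doubles and the document does not override the default user unit.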

When you need glyph-level detail, extractChars returns each Char with its Unicode code point, bounding box, font name, and size.

final doc = PdfDocument.open('paper.pdf');
try {
  for (final ch in doc.extractChars(0)) {
    print('${String.fromCharCode(ch.character)} @ ${ch.bbox} ${ch.fontSize}pt');
  }
} finally {
  // Close the document even if extraction throws.
  doc.close();
}

Search

search scans a single page; searchAll scans the whole document. Both take a search term and a caseSensitive flag, and return SearchResult records that carry the matched text, its page index, and its bounding box.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('manual.pdf');
  try {
    // Single page (page 0), case-insensitive
    for (final hit in doc.search(0, 'configuration', false)) {
      print("page ${hit.page}: '${hit.text}' at ${hit.bbox}");
    }

    // Across the whole document
    final hits = doc.searchAll('configuration', false);
    print('${hits.length} matches');
  } finally {
    doc.close();
  }
}
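Document-wide results are often easier to present when grouped by page. The helper below is a plain-Dart sketch rather than part of pdf_oxide: groupByPage is a hypothetical name, and it takes a callback that extracts the page index so it does not depend on the exact shape of SearchResult.

```dart
/// Groups [hits] by page index, keeping the original order within each
/// page. [pageOf] extracts the page index from a single hit.
Map<int, List<T>> groupByPage<T>(Iterable<T> hits, int Function(T) pageOf) {
  final byPage = <int, List<T>>{};
  for (final hit in hits) {
    byPage.putIfAbsent(pageOf(hit), () => <T>[]).add(hit);
  }
  return byPage;
}

// Possible use with the search API shown above, assuming hit.page is an int:
//
//   final byPage = groupByPage<SearchResult>(
//     doc.searchAll('configuration', false),
//     (SearchResult hit) => hit.page,
//   );
//   byPage.forEach((page, hits) => print('page $page: ${hits.length}'));
```

The returned map iterates in the order pages were first seen, so the pages stay sorted if the hits arrive in page order.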

Creating PDFs

The Pdf builder converts Markdown, HTML, or plain text into a PDF. Call toBytes() to get the bytes or save() to write a file, and close() it when you are finished.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final pdf = Pdf.fromMarkdown('# Hello World\n\nThis is a **PDF**.\n');
  try {
    pdf.save('output.pdf');
    final bytes = pdf.toBytes();
    print('Wrote ${bytes.length} bytes');
  } finally {
    pdf.close();
  }
}

Pdf.fromHtml('<h1>Invoice</h1><p>Amount: \$42</p>') and Pdf.fromText('Plain text content.') work the same way. A generated Pdf is just bytes, so you can pass it straight to PdfDocument.openFromBytes(pdf.toBytes()) and re-extract its contents without going through the disk.

Error Handling

Every call that can fail throws a PdfOxideError (implements Exception) that carries the underlying C-ABI error code.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  try {
    final doc = PdfDocument.open('/nonexistent/nope.pdf');
    doc.close();
  } on PdfOxideError catch (e) {
    print('Failed to open PDF: $e');
  }
}

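Next Steps

Getting Started with Rust: the native core behind these bindings
Getting Started with Python: using PDF Oxide from Python
Text Extraction: extraction options and recipes in detail
Creating PDFs: advanced creation, including metadata and encryption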