What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting started with PDF Oxide (C++)

PDF Oxide provides idiomatic, header-only C++17 bindings built on top of its Rust core. Text extraction averages 0.8 ms, with a 100% pass rate across 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied automatically into std::string / std::vector<std::uint8_t>, and C ABI error codes are thrown as pdf_oxide::Error. New in v0.3.69.

Installation

The binding is a single header (cpp/include/pdf_oxide/pdf_oxide.hpp) that links against the native cdylib. Build the library once from the repository root, then point CMake at its location.

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

After that, just include the header in your own translation units.

#include <pdf_oxide/pdf_oxide.hpp>

Because the C header declares a global Pdf type, do not write using namespace pdf_oxide;. Either qualify names (pdf_oxide::Pdf, pdf_oxide::Document) or bring in only what you need with using-declarations.
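
The name collision can be reproduced without the library. The following is a minimal sketch with stand-in declarations: a global `Pdf` struct plays the role of the C header's type, and a `demo` namespace plays the role of `pdf_oxide`. None of these names or members are the real declarations, and `describe()` is purely illustrative.

```cpp
#include <string>

// Stand-in for the type a C header would declare at global scope.
struct Pdf { int raw_handle; };

// Stand-in for the C++ wrapper namespace (hypothetical, not pdf_oxide itself).
namespace demo {
struct Pdf {
    std::string source;
};
struct Document {
    int pages;
};
}  // namespace demo

// If `using namespace demo;` appeared here, an unqualified `Pdf` would be
// ambiguous between ::Pdf and demo::Pdf and would fail to compile.

// A using-declaration for a name that does not collide is safe.
using demo::Document;

inline std::string describe() {
    demo::Pdf builder{"markdown"};  // qualified: always the wrapper type
    ::Pdf c_handle{7};              // qualified: always the C type
    Document doc{3};                // unqualified via the using-declaration
    return builder.source + ":" + std::to_string(c_handle.raw_handle) + ":" +
           std::to_string(doc.pages);
}
```

Qualifying every use of `Pdf` keeps the code correct regardless of which headers are included later, which is why full qualification is the safer default.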

Quick start

Open a PDF and extract reading-order text from a page. Every call that can fail throws pdf_oxide::Error, so wrap the work in try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF that is already in memory, use Document::open_from_bytes.

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
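
The load_pdf_bytes() call in the snippet above is a placeholder that the guide does not define. For the simple case of a local file, one possible way to produce the byte vector with only the standard library is sketched below. The helper name `read_file_bytes` is hypothetical and not part of PDF Oxide.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into a byte vector.
inline std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // istreambuf_iterator reads raw characters and does not skip whitespace.
    std::vector<std::uint8_t> bytes;
    bytes.assign(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>());
    return bytes;
}
```

The returned vector has the type that Document::open_from_bytes takes in the snippet above.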

Converting to Markdown and HTML

You can convert a single page or the entire document to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);      // one page
std::string all_md  = doc.to_markdown_all();   // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";

Word-level extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> containing the text, bounding box, and font metadata for every word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

Fields of pdf_oxide::Word:

Field	Type	Description
`text`	`std::string`	The word's text
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size in points
`bold`	`bool`	Whether the run is bold

Character-level and line-level extraction follow the same shape. extract_chars(0) returns Char records (Unicode code point + bbox), and extract_text_lines(0) returns TextLine records (text, bbox, word_count).
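
Word records like these lend themselves to simple post-processing. The sketch below works on stand-in `Word` and `Bbox` structs that mirror the documented fields, so it compiles without the library. The `float` type of the bbox members is an assumption, and the helpers `bold_text` and `enclosing_box` are hypothetical names, not library functions.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-ins mirroring the documented fields; the numeric type of the bbox
// members is an assumption.
struct Bbox { float x, y, width, height; };
struct Word {
    std::string text;
    Bbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Join the text of all bold words, separated by single spaces.
inline std::string bold_text(const std::vector<Word>& words) {
    std::string out;
    for (const auto& w : words) {
        if (!w.bold) continue;
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}

// Smallest box enclosing every word; all zeros for an empty input.
inline Bbox enclosing_box(const std::vector<Word>& words) {
    if (words.empty()) return Bbox{0, 0, 0, 0};
    float min_x = words[0].bbox.x;
    float min_y = words[0].bbox.y;
    float max_x = min_x + words[0].bbox.width;
    float max_y = min_y + words[0].bbox.height;
    for (const auto& w : words) {
        min_x = std::min(min_x, w.bbox.x);
        min_y = std::min(min_y, w.bbox.y);
        max_x = std::max(max_x, w.bbox.x + w.bbox.width);
        max_y = std::max(max_y, w.bbox.y + w.bbox.height);
    }
    return Bbox{min_x, min_y, max_x - min_x, max_y - min_y};
}
```

With the real binding, the same loops would presumably run over the vector returned by extract_words, provided the actual field types match the assumptions made here.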

Search

Use search(page_index, term, case_sensitive) to search a single page and search_all(term, case_sensitive) to search the whole document. Both return a std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
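
A common follow-up is to summarize hits per page. The sketch below does this over a stand-in `SearchResult` struct that carries the three members used in the loop above (`page`, `text`, `bbox`). The member types, in particular `page` being an `int`, are assumptions, and `hits_per_page` is a hypothetical helper.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-ins with the members used in the loop above; types are assumptions.
struct Bbox { float x, y, width, height; };
struct SearchResult {
    int page;
    std::string text;
    Bbox bbox;
};

// Count how many hits fall on each page, keyed by page number.
inline std::map<int, std::size_t> hits_per_page(
        const std::vector<SearchResult>& hits) {
    std::map<int, std::size_t> counts;
    for (const auto& r : hits) {
        ++counts[r.page];  // operator[] value-initializes missing keys to 0
    }
    return counts;
}
```

Because std::map keeps its keys ordered, iterating the result lists pages in ascending order.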

Creating PDFs

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes(), or write directly to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

A freshly created PDF can also be round-tripped straight back into a Document.

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

Error handling

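Every operation that can fail throws pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly, and closing is idempotent: doc.close() releases the native handle early, and any use after close throws.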

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
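
To make the handle semantics concrete, here is a minimal sketch of a move-only RAII wrapper with an idempotent close() and an exception that carries an error code. It is not the library's implementation: `NativeDoc`, `native_open`, `native_free`, `live_handles`, `Handle`, the local `Error` class, and the error code `-1` are all invented for illustration.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for a C ABI: an opaque handle plus open/free calls.
struct NativeDoc { int page_count; };
inline int& live_handles() { static int n = 0; return n; }  // instrumentation
inline NativeDoc* native_open(int pages) {
    ++live_handles();
    return new NativeDoc{pages};
}
inline void native_free(NativeDoc* p) {
    --live_handles();
    delete p;
}

// Exception carrying both a message and a raw error code.
class Error : public std::runtime_error {
public:
    Error(int code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Move-only owner of the native handle.
class Handle {
public:
    explicit Handle(int pages) : ptr_(native_open(pages)) {}
    ~Handle() { close(); }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    Handle(Handle&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            close();
            ptr_ = std::exchange(other.ptr_, nullptr);
        }
        return *this;
    }
    // Idempotent: a second call finds a null pointer and does nothing.
    void close() noexcept {
        if (ptr_) {
            native_free(ptr_);
            ptr_ = nullptr;
        }
    }
    int page_count() const {
        if (!ptr_) throw Error(-1, "handle is closed");  // code is made up
        return ptr_->page_count;
    }
private:
    NativeDoc* ptr_;
};
```

Deleting the copy operations means a handle can never be freed twice through two owners, and routing the destructor through close() keeps early release and scope exit on the same code path.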

Next steps

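Getting started with Rust – use PDF Oxide from Rust
Getting started with Python – use PDF Oxide from Python
Text extraction – extraction options and recipes in detail
PDF creation – advanced creation with metadata and encryption
Editing – editing existing PDFs, annotations, and form fields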