What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

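Getting Started with PDF Oxide (C++)

PDF Oxide provides idiomatic, header-only C++17 bindings on top of its Rust core: text extraction averages 0.8 ms, with a 100% pass rate across 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied into std::string / std::vector<std::uint8_t> for you, and C ABI error codes are thrown as pdf_oxide::Error exceptions. The bindings are available since v0.3.69.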
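To make the move-only RAII idea concrete, here is a minimal, self-contained sketch of the pattern. It does not include the pdf_oxide header: `FakeNative`, `fake_open`, `fake_close`, and `Handle` are hypothetical stand-ins for an opaque C handle and its wrapper, so the real binding's internals may well differ.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a C ABI: an opaque handle plus open/close calls.
struct FakeNative { int id; };
static int g_live_handles = 0;  // number of handles opened but not yet closed

static FakeNative* fake_open(int id) { ++g_live_handles; return new FakeNative{id}; }
static void fake_close(FakeNative* p) { --g_live_handles; delete p; }

// Move-only RAII wrapper: copying is disabled, moving transfers ownership.
class Handle {
public:
    explicit Handle(int id) : ptr_(fake_open(id)) {}
    ~Handle() { reset(); }

    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;

    Handle(Handle&& other) noexcept : ptr_(std::exchange(other.ptr_, nullptr)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            reset();  // release what we currently own before taking over
            ptr_ = std::exchange(other.ptr_, nullptr);
        }
        return *this;
    }

    bool valid() const noexcept { return ptr_ != nullptr; }
    int id() const { assert(valid()); return ptr_->id; }

private:
    // Releases the native handle at most once.
    void reset() noexcept {
        if (ptr_) { fake_close(ptr_); ptr_ = nullptr; }
    }
    FakeNative* ptr_;
};

static_assert(!std::is_copy_constructible<Handle>::value,
              "Handle must not be copyable");
```

Because the moved-from object is left holding a null pointer, the native resource is released exactly once, by whichever object owns it last.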

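Installation

The binding is a single header (cpp/include/pdf_oxide/pdf_oxide.hpp) that links against the native cdylib. Build the library once from the repository root, then point CMake at it: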

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your own translation unit:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global Pdf type, so do not write using namespace pdf_oxide;. Use qualified names (pdf_oxide::Pdf, pdf_oxide::Document), or bring in individual names with targeted using declarations.
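The name collision can be illustrated without the library. In the sketch below, the global `Pdf` struct and the `demo_oxide` namespace are hypothetical stand-ins for the C header's global type and for `pdf_oxide`; only the naming technique carries over to real code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the C header's global type.
struct Pdf { int raw; };

// Hypothetical stand-in for namespace pdf_oxide.
namespace demo_oxide {
struct Pdf { std::string source; };
struct Document { int pages; };
}  // namespace demo_oxide

// Targeted using-declaration: imports one name that does not collide.
using demo_oxide::Document;

static int page_count(const Document& d) {
    assert(d.pages >= 0);
    return d.pages;
}

// The colliding name stays qualified, so both types remain unambiguous.
static std::string source_of(const demo_oxide::Pdf& p) { return p.source; }
static int raw_of(const ::Pdf& p) { return p.raw; }
```

With a blanket `using namespace demo_oxide;` in place, an unqualified `Pdf` would be ambiguous between the two types, which is the situation the advice above avoids.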

Quick Start

Open a PDF and extract one page's text in reading order. Every call that can fail throws pdf_oxide::Error, so wrap your code in try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF that is already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
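The snippet above leaves `load_pdf_bytes()` undefined. As one possible way to fill it in, the sketch below reads a file from disk into a byte vector using only the standard library. The helper name `read_file_bytes` is an assumption made for illustration and is not part of PDF Oxide.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read an entire file into memory as raw bytes.
static std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);  // binary mode: no newline translation
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Copy every byte from the stream buffer into the vector.
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

The resulting vector has the type that Document::open_from_bytes is shown accepting above, so it could be passed along directly.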

Markdown and HTML Conversion

Convert a single page, or the whole document, to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);      // one page
std::string all_md  = doc.to_markdown_all();   // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";

Word-Level Extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> containing the text, bounding box, and font metadata of every word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

pdf_oxide::Word fields:

Field	Type	Description
`text`	`std::string`	Text of the word
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size (in points)
`bold`	`bool`	Whether the text is bold

Character-level and line-level extraction work the same way: extract_chars(0) yields Char records (Unicode code point + bbox), and extract_text_lines(0) yields TextLine records (text, bbox, word_count).
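As an example of what the per-word metadata makes possible, the sketch below picks out heading-like text by filtering on `bold` and `font_size`. `DemoBbox` and `DemoWord` are local stand-ins that mirror the documented fields so the example compiles on its own; they are not the library's types, and the size threshold is an arbitrary assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Local stand-ins mirroring the documented Word fields (not the library's types).
struct DemoBbox { float x, y, width, height; };
struct DemoWord {
    std::string text;
    DemoBbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Join all bold words at or above min_size, in input order, separated by spaces.
static std::string collect_heading_text(const std::vector<DemoWord>& words,
                                        float min_size) {
    assert(min_size >= 0.0f);
    std::string out;
    for (const auto& w : words) {
        if (w.bold && w.font_size >= min_size) {
            if (!out.empty()) out += ' ';
            out += w.text;
        }
    }
    return out;
}
```

With the real binding, the same loop could presumably run over the vector returned by extract_words, since it exposes the same field names.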

Search

Use search(page_index, term, case_sensitive) to search a single page, or search_all(term, case_sensitive) to search the whole document. Both return a std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
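A common follow-up is to summarize document-wide hits per page. The sketch below does that over a local `DemoHit` record that mirrors the `page` and `text` fields used above. Treating `page` as an `int` is an assumption, and `DemoHit` is not the library's SearchResult type.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Local stand-in mirroring the page/text fields of a search result.
struct DemoHit {
    int page;          // assumed to be a 0-based page index
    std::string text;  // the matched text
};

// Count how many hits fall on each page; pages without hits are absent.
static std::map<int, int> hits_per_page(const std::vector<DemoHit>& hits) {
    std::map<int, int> counts;
    for (const auto& h : hits) {
        assert(h.page >= 0);
        ++counts[h.page];
    }
    return counts;
}
```

Using std::map keeps the pages sorted, so iterating the result lists them in document order.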

Creating PDFs

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes(), or write straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

Feed a freshly built PDF straight back into a Document:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

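Error Handling

Every operation that can fail throws pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly, and closing is idempotent: doc.close() releases the native handle early, and using a handle after it has been closed throws.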

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
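The contract described in this section (an exception carrying a code, an idempotent close(), and a throw on use after close) can be sketched without the library. `DemoError` and `DemoDoc` below are hypothetical stand-ins, and the error code value 1 is made up for the example. The real pdf_oxide::Error codes are not specified here.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical exception that carries a numeric error code next to the message.
class DemoError : public std::runtime_error {
public:
    DemoError(int code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {}
    int code() const noexcept { return code_; }

private:
    int code_;
};

// Hypothetical handle with an idempotent close() and use-after-close detection.
class DemoDoc {
public:
    explicit DemoDoc(int pages) : pages_(pages), open_(true) {
        assert(pages >= 0);
    }

    // Safe to call any number of times; only the first call changes state.
    void close() noexcept { open_ = false; }

    int page_count() const {
        if (!open_) throw DemoError(1, "document is closed");  // code 1 is made up
        return pages_;
    }

private:
    int pages_;
    bool open_;
};
```

Catching by const reference, as the example above does with pdf_oxide::Error, gives access to both what() and code() without copying the exception.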

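Next Steps

Rust Quick Start: using PDF Oxide from Rust
Python Quick Start: using PDF Oxide from Python
Text Extraction: detailed extraction options and practical examples
Creating PDFs: advanced creation with metadata and encryption
Editing: modifying existing PDFs, annotations, and form fields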