What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a 0.8ms mean text extraction time: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). It was benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (C++)

PDF Oxide ships idiomatic, header-only C++17 bindings over its Rust core — 0.8ms mean text extraction, 100% pass rate on 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied into std::string / std::vector<std::uint8_t> for you, and C ABI error codes are thrown as pdf_oxide::Error. New in v0.3.69.

Installation

The bindings are a single header (cpp/include/pdf_oxide/pdf_oxide.hpp) that links the native cdylib. Build the library once from the repo root, then point CMake at it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your own translation units:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global Pdf type, so do not write using namespace pdf_oxide; in your code. Qualify names instead (pdf_oxide::Pdf, pdf_oxide::Document), or bring them in with targeted using declarations.
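The name collision is easier to see in isolation. The sketch below uses stand-in types (a global Pdf and a namespace called pdf_oxide_sketch, both hypothetical and not the real header contents) to show two ways of keeping names unambiguous; the namespace alias is a standard C++ convenience added here for illustration, not something the bindings require.

```cpp
#include <string>

// Stand-in for the C header: an opaque type named Pdf at global scope.
struct Pdf;

// Stand-in for the C++ wrapper namespace (NOT the real pdf_oxide.hpp).
namespace pdf_oxide_sketch {
struct Pdf { std::string source; };
struct Document { int pages; };
}  // namespace pdf_oxide_sketch

// Option 1: a short namespace alias keeps every name qualified.
namespace po = pdf_oxide_sketch;

po::Pdf make_pdf() { return po::Pdf{"# Title"}; }

// Option 2: a targeted using-declaration for a name that does not collide.
using pdf_oxide_sketch::Document;

Document make_doc() { return Document{1}; }

// By contrast, `using namespace pdf_oxide_sketch;` followed by an unqualified
// `Pdf` would not compile: name lookup would find both ::Pdf and
// pdf_oxide_sketch::Pdf and report the reference as ambiguous.
```

With the real header, the same pattern would read namespace po = pdf_oxide; followed by po::Pdf and po::Document.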

Quick Start

Open a PDF and extract reading-order text from a page. Every fallible call reports failure by throwing pdf_oxide::Error, so wrap your work in a try/catch block.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
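The snippet above leaves load_pdf_bytes() undefined. As a minimal sketch of one way to fill such a buffer, assuming the PDF sits in a local file, the helper below reads a whole file into a std::vector<std::uint8_t>. The name read_file_bytes and its path parameter are hypothetical and not part of PDF Oxide.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of PDF Oxide): read a whole file into a byte
// buffer of the type passed to Document::open_from_bytes in the text above.
std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // istreambuf_iterator reads raw bytes and does not skip whitespace.
    std::vector<std::uint8_t> bytes((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
    return bytes;
}
```

The result can then be handed to pdf_oxide::Document::open_from_bytes in place of load_pdf_bytes(). Note that this helper throws std::runtime_error rather than pdf_oxide::Error, so a caller would need to catch both.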

Markdown and HTML Conversion

Convert a single page — or the whole document — to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);   // one page
std::string all_md   = doc.to_markdown_all(); // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";
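For LLM or RAG pipelines, the Markdown string is often split into sections before indexing. The sketch below shows one simple way to do that with the standard library only. The helper split_by_heading is hypothetical (not part of PDF Oxide), and it assumes that headings in the output are ATX-style lines beginning with #, which is worth checking against your own documents.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of PDF Oxide): split a Markdown string into
// sections, starting a new section at every line that begins with '#'.
// Limitation: a '#' line inside a fenced code block is also treated as a heading.
std::vector<std::string> split_by_heading(const std::string& markdown) {
    std::vector<std::string> sections;
    std::istringstream in(markdown);
    std::string line;
    std::string current;
    while (std::getline(in, line)) {
        const bool is_heading = !line.empty() && line[0] == '#';
        if (is_heading && !current.empty()) {
            sections.push_back(current);  // close the previous section
            current.clear();
        }
        current += line;
        current += '\n';
    }
    if (!current.empty()) {
        sections.push_back(current);      // flush the final section
    }
    return sections;
}
```

A call such as split_by_heading(doc.to_markdown_all()) would then yield one string per section, with any text before the first heading kept as its own leading section.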

Word-Level Extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> with the text, bounding box, and font metadata for every word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

pdf_oxide::Word fields:

Field	Type	Description
`text`	`std::string`	The word text
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size in points
`bold`	`bool`	Whether the word is set in a bold font

Character- and line-level extraction follow the same shape: extract_chars(0) yields Char records (Unicode codepoint + bbox), and extract_text_lines(0) yields TextLine records (text, bbox, word_count).
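Font metadata makes simple layout heuristics possible. As a sketch, the function below joins the words set in the largest font size on a page, which is often (though not always) the title. It is written as a template over any record with text and font_size members; the Word and Bbox structs here are stand-ins that mirror the field table above so the example compiles without the library, and largest_text is a hypothetical name.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Stand-in types mirroring the documented fields (NOT the real pdf_oxide types).
struct Bbox { float x, y, width, height; };
struct Word {
    std::string text;
    Bbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Join, in input order, the words whose font size matches the page maximum.
// Returns an empty string when there are no words.
template <typename WordT>
std::string largest_text(const std::vector<WordT>& words) {
    float max_size = 0.0f;
    for (const auto& w : words) {
        max_size = std::max(max_size, static_cast<float>(w.font_size));
    }
    std::string out;
    for (const auto& w : words) {
        // Small tolerance so nearly equal float sizes count as the same size.
        if (w.font_size >= max_size - 0.05f) {
            if (!out.empty()) out += ' ';
            out += w.text;
        }
    }
    return out;
}
```

Assuming pdf_oxide::Word exposes the fields listed in the table, largest_text(doc.extract_words(0)) should work unchanged.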

Search

Search a single page with search(page_index, term, case_sensitive), or the whole document with search_all(term, case_sensitive). Both return a std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
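Document-wide results are often easier to read once grouped. The sketch below counts hits per page with a std::map. The SearchResult struct is a stand-in that mirrors only the fields used in the loop above (page, text, bbox); the integer type of page is an assumption, and hits_per_page is a hypothetical helper.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in types (NOT the real pdf_oxide types); the type of `page` is assumed.
struct Bbox { float x, y, width, height; };
struct SearchResult {
    int page;
    std::string text;
    Bbox bbox;
};

// Count hits per page. std::map keeps the pages in ascending order.
template <typename Result>
std::map<int, std::size_t> hits_per_page(const std::vector<Result>& results) {
    std::map<int, std::size_t> counts;
    for (const auto& r : results) {
        ++counts[static_cast<int>(r.page)];
    }
    return counts;
}
```

With the library, the input would be the vector returned by doc.search_all(...), and iterating the map gives a per-page summary.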

Creating a PDF

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes() or write straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

Round-trip a freshly built PDF straight back into a Document:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

Error Handling

Every fallible operation reports failure by throwing pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly: doc.close() frees the native handle early, calling it again is a no-op (close is idempotent), and any other use of a closed handle throws.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
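To make the handle semantics concrete, here is a minimal sketch of the general pattern: a move-only RAII wrapper over a C-style API, with an idempotent close() and an exception that carries the raw error code. Everything in the sketch namespace is hypothetical (the native functions, the error codes, the class names); it illustrates the pattern described above and is not the library's actual implementation.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {

// --- Hypothetical stand-in for a C ABI (NOT the real PDF Oxide symbols) ---
int live_handles = 0;  // native objects that have not been freed yet
struct Native { int pages; };

// Returns nullptr and sets *err to a non-zero code on failure.
Native* native_open(int pages, int* err) {
    if (pages < 0) { *err = 2; return nullptr; }
    *err = 0;
    ++live_handles;
    return new Native{pages};
}
void native_free(Native* n) { --live_handles; delete n; }

// Exception carrying a message (what()) and the raw error code (code()).
class Error : public std::runtime_error {
public:
    Error(const std::string& msg, int code)
        : std::runtime_error(msg), code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Move-only owner of one native handle.
class Handle {
public:
    static Handle open(int pages) {
        int err = 0;
        Native* n = native_open(pages, &err);
        if (n == nullptr) throw Error("open failed", err);
        return Handle(n);
    }
    Handle(const Handle&) = delete;
    Handle& operator=(const Handle&) = delete;
    Handle(Handle&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            close();
            ptr_ = std::exchange(other.ptr_, nullptr);
        }
        return *this;
    }
    ~Handle() { close(); }

    // Idempotent: the second and later calls do nothing.
    void close() noexcept {
        if (ptr_ != nullptr) { native_free(ptr_); ptr_ = nullptr; }
    }
    // Any use after close() throws instead of touching freed memory.
    int page_count() const {
        if (ptr_ == nullptr) throw Error("handle is closed", -1);
        return ptr_->pages;
    }
private:
    explicit Handle(Native* n) : ptr_(n) {}
    Native* ptr_;
};

}  // namespace sketch
```

Deleting the copy operations is what makes ownership unambiguous: exactly one object frees each native handle, whether through close() or the destructor.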

Next Steps

Rust Getting Started – using PDF Oxide from Rust
Python Getting Started – using PDF Oxide from Python
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with metadata and encryption
Editing – modifying existing PDFs, annotations, and form fields