What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting started with PDF Oxide (C++)

PDF Oxide ships idiomatic, header-only C++17 bindings on top of its Rust core: 0.8 ms mean text extraction and a 100% pass rate on 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied automatically into std::string / std::vector<std::uint8_t>, and C ABI error codes are thrown as pdf_oxide::Error exceptions. Available since v0.3.69.

Installation

The bindings are a single header file (cpp/include/pdf_oxide/pdf_oxide.hpp) that links against the native cdylib. Build the library once from the repository root, then point CMake at it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your translation units:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global Pdf type, so do not write using namespace pdf_oxide;. Qualify names (pdf_oxide::Pdf, pdf_oxide::Document) or bring them in with targeted using declarations.
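
The name clash can be reproduced without the library. The sketch below uses stand-in declarations only (the global struct Pdf and the two small classes are placeholders, not the contents of the real header) to show one way of importing names selectively. PdfBuilder and describe are hypothetical names chosen for illustration.

```cpp
#include <string>

// Stand-in for the global type declared by the C header (assumption:
// the real header declares something comparable at global scope).
struct Pdf;

// Stand-in for the wrapper namespace; the real classes come from
// <pdf_oxide/pdf_oxide.hpp>.
namespace pdf_oxide {
class Document {
public:
    std::string kind() const { return "document"; }
};
class Pdf {
public:
    std::string kind() const { return "builder"; }
};
}  // namespace pdf_oxide

// Targeted using-declaration: Document has no global counterpart here.
using pdf_oxide::Document;

// An alias sidesteps the clash with the global ::Pdf.
using PdfBuilder = pdf_oxide::Pdf;

std::string describe() {
    Document doc;
    PdfBuilder builder;
    return doc.kind() + "+" + builder.kind();
}
```

With using namespace pdf_oxide; in place of the two declarations, an unqualified Pdf would be ambiguous between the global type and the wrapper class.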

Quick start

Open a PDF and extract the text of a page in reading order. Any call that can fail throws pdf_oxide::Error, so wrap the code in try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
#include <string>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}
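
To read every page rather than only the first, the same two calls can be looped. This is a minimal sketch, assuming only the page_count() and extract_text(index) members shown above; extract_all_text is a hypothetical helper, and the int index type and the form-feed separator are assumptions of the sketch, not library behavior.

```cpp
#include <string>

// Concatenates the text of every page, separating pages with '\f'.
// Doc is any type exposing page_count() and extract_text(index).
template <class Doc>
std::string extract_all_text(Doc& doc) {
    std::string out;
    const int pages = static_cast<int>(doc.page_count());
    for (int i = 0; i < pages; ++i) {   // page indices are 0-based
        if (i > 0) out += '\f';
        out += doc.extract_text(i);
    }
    return out;
}
```

Called inside the try block above as extract_all_text(doc), any exception thrown by a page would propagate to the existing catch.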

To open a PDF that is already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
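
The snippet above leaves load_pdf_bytes() undefined. As one possible stand-in, the sketch below reads a local file into a std::vector<std::uint8_t> using only the standard library. read_file_bytes is a hypothetical helper name and is not part of PDF Oxide.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Reads the whole file at `path` into a byte vector.
// Throws std::runtime_error if the file cannot be opened.
std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // Each char read from the stream is converted to an unsigned byte.
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

The returned vector has the type that the open_from_bytes call above receives.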

Converting to Markdown and HTML

Convert a single page, or the whole document, to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);     // one page
std::string all_md  = doc.to_markdown_all();  // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";
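
Since the conversion calls return plain std::string values, persisting them needs nothing beyond the standard library. A minimal sketch, with save_text as a hypothetical helper name:

```cpp
#include <fstream>
#include <stdexcept>
#include <string>

// Writes `content` to `path`, replacing any existing file.
// Binary mode keeps line endings exactly as produced.
void save_text(const std::string& path, const std::string& content) {
    std::ofstream out(path, std::ios::binary);
    if (!out) {
        throw std::runtime_error("cannot open " + path + " for writing");
    }
    out << content;
    out.flush();
    if (!out) {
        throw std::runtime_error("failed to write " + path);
    }
}
```

For example, save_text("paper.md", all_md) would store the whole-document Markdown from the snippet above.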

Word-level extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> with the text, bounding box, and font metadata for every word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

pdf_oxide::Word fields:

Field	Type	Description
`text`	`std::string`	Word text
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size in points
`bold`	`bool`	Whether the text is bold

Character- and line-level extraction follows the same structure: extract_chars(0) returns Char records (Unicode code point + bbox), and extract_text_lines(0) returns TextLine records (text, bbox, word_count).
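
The per-word metadata lends itself to simple filtering. The sketch below collects the bold words of a page into one string, which can serve as a rough hint for headings or emphasized terms. It relies only on the text and bold fields listed in the table; join_bold_words is a hypothetical helper, written as a template so it does not depend on the exact Word definition.

```cpp
#include <string>
#include <vector>

// Joins the text of all bold words with single spaces, in input order.
// WordT is any type with a `text` string and a `bold` flag.
template <class WordT>
std::string join_bold_words(const std::vector<WordT>& words) {
    std::string out;
    for (const auto& w : words) {
        if (!w.bold) continue;          // skip regular-weight words
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}
```

It would be called as join_bold_words(doc.extract_words(0)) on the vector returned above.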

Search

Search a single page with search(page_index, term, case_sensitive) or the whole document with search_all(term, case_sensitive). Both methods return a std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
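
Results from search_all can be summarized per page. A minimal sketch, assuming only the page field used in the loop above; hits_per_page is a hypothetical helper, and converting the page number to int is an assumption about its type.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Counts how many results fall on each page.
// Result is any type with an integral `page` member.
template <class Result>
std::map<int, std::size_t> hits_per_page(const std::vector<Result>& results) {
    std::map<int, std::size_t> counts;
    for (const auto& r : results) {
        ++counts[static_cast<int>(r.page)];
    }
    return counts;
}
```

Pages without matches do not appear in the map, so counts.size() gives the number of pages containing the term.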

Creating PDFs

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize the result with to_bytes() or write it straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

Feed a freshly created PDF back into a Document in one step:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";
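
In practice the Markdown passed to the builder is often assembled from data rather than written as a literal. The sketch below builds a title and a bullet list with plain string operations. make_markdown_report is a hypothetical helper, and whether from_markdown accepts a std::string directly is an assumption not confirmed above.

```cpp
#include <string>
#include <vector>

// Builds a Markdown document with a level-1 heading followed by
// one bullet per item.
std::string make_markdown_report(const std::string& title,
                                 const std::vector<std::string>& items) {
    std::string md = "# " + title + "\n\n";
    for (const auto& item : items) {
        md += "- " + item + "\n";
    }
    return md;
}
```

The resulting string could then be handed to pdf_oxide::Pdf::from_markdown in place of the literal used earlier.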

Error handling

Any operation that can fail throws pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly, and the operation is idempotent: doc.close() releases the native handle early, and any access after closing throws.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
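
The handle semantics described in this section (move-only ownership, automatic release at scope exit, idempotent close(), throwing on use after close) follow a common RAII pattern. The sketch below illustrates that pattern with a stand-in Handle class and a counter in place of a native resource. It is not the library's implementation, and every name in it is hypothetical.

```cpp
#include <stdexcept>
#include <utility>

// Stand-in for a native resource: counts handles that are still open.
inline int& live_handles() {
    static int n = 0;
    return n;
}

class Handle {
public:
    Handle() : open_(true) { ++live_handles(); }
    ~Handle() { close(); }                       // release at scope exit

    Handle(const Handle&) = delete;              // move-only
    Handle& operator=(const Handle&) = delete;

    Handle(Handle&& other) noexcept
        : open_(std::exchange(other.open_, false)) {}
    Handle& operator=(Handle&& other) noexcept {
        if (this != &other) {
            close();
            open_ = std::exchange(other.open_, false);
        }
        return *this;
    }

    // Idempotent: a second call is a no-op.
    void close() noexcept {
        if (open_) {
            open_ = false;
            --live_handles();
        }
    }

    // Any operation on the resource would start with this check.
    void require_open() const {
        if (!open_) throw std::runtime_error("handle is closed");
    }

private:
    bool open_;
};
```

Because the destructor calls the same close(), an explicit early close and the automatic one at scope exit cannot release the resource twice.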

Next steps

Getting started with Rust – using PDF Oxide from Rust
Getting started with Python – using PDF Oxide from Python
Text extraction – detailed extraction options and recipes
Creating PDFs – advanced creation with metadata and encryption
Editing – modifying existing PDFs, annotations, and form fields