What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting started with PDF Oxide (C++)

PDF Oxide ships idiomatic header-only C++17 bindings on top of its Rust core: text extraction with a 0.8 ms mean time and a 100% pass rate on 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied for you into std::string / std::vector<std::uint8_t>, and C ABI error codes are thrown as pdf_oxide::Error. New in v0.3.69.

Installation

The bindings are a single header file (cpp/include/pdf_oxide/pdf_oxide.hpp) that links against the native cdylib. Build the library once from the repository root, then point CMake at it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your own translation units:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global type Pdf, so do not write using namespace pdf_oxide;. Qualify names (pdf_oxide::Pdf, pdf_oxide::Document) or bring them in with targeted using-declarations.
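
The name clash can be reproduced without the library. The sketch below is only an illustration: the namespace sketch_oxide and both stand-in types are hypothetical and do not reflect the real header's contents. It shows the two safe options, qualifying a name and using a targeted using-declaration.

```cpp
#include <string>

// Stand-in for a global type declared by a C header (hypothetical).
struct Pdf { int raw_handle; };

// Stand-in for the C++ wrapper namespace (hypothetical, not the real pdf_oxide).
namespace sketch_oxide {
struct Pdf { std::string source; };
struct Document { int pages; };
}  // namespace sketch_oxide

// A targeted using-declaration is safe here: no global `Document` exists.
using sketch_oxide::Document;

// By contrast, `using namespace sketch_oxide;` followed by an unqualified
// `Pdf` would be ambiguous between ::Pdf and sketch_oxide::Pdf.

inline std::string describe() {
    sketch_oxide::Pdf builder{"markdown"};  // qualified: the wrapper type
    ::Pdf c_handle{};                       // explicitly the global C type, zero-initialized
    Document doc{3};                        // found through the using-declaration
    return builder.source + ":" + std::to_string(doc.pages + c_handle.raw_handle);
}
```

Calling describe() exercises all three lookups in one place; uncommenting a using-directive for the whole namespace and writing a bare Pdf is what would stop the translation unit from compiling.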

Quick start

Open a PDF and extract a page's text in reading order. Every call that can fail throws pdf_oxide::Error, so wrap your work in try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF that is already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
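
The snippet above leaves load_pdf_bytes() undefined. One possible shape for it, assuming the bytes come from a local file rather than S3 or HTTP, is sketched below. The path parameter and the use of std::runtime_error are assumptions made for illustration, and the helper is not part of PDF Oxide.

```cpp
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read a whole file into a byte buffer.
inline std::vector<std::uint8_t> load_pdf_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    // istreambuf_iterator reads raw bytes and does not skip whitespace.
    return std::vector<std::uint8_t>(std::istreambuf_iterator<char>(in),
                                     std::istreambuf_iterator<char>());
}
```

The resulting vector has the same type as the bytes variable passed to Document::open_from_bytes in the snippet above.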

Converting to Markdown and HTML

Convert a single page, or the whole document, to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);      // one page
std::string all_md  = doc.to_markdown_all();   // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";

Word-level extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> with the text, bounding box, and font metadata for each word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

pdf_oxide::Word fields:

Field	Type	Description
`text`	`std::string`	Text of the word
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size in points
`bold`	`bool`	Whether the word is bold

Character-level and line-level extraction have the same shape: extract_chars(0) returns Char records (Unicode code point + bbox), and extract_text_lines(0) returns TextLine records (text, bbox, word_count).
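
As an illustration of what the per-word font metadata makes possible, the sketch below picks out likely heading text by keeping bold words at or above a size threshold. It runs on local stand-in structs that mirror the fields listed above, so it compiles without the library. The stand-ins, the float coordinate type, and the heading_candidates helper are assumptions, not part of the PDF Oxide API.

```cpp
#include <string>
#include <vector>

// Local stand-ins mirroring the documented fields; the real types come
// from pdf_oxide.hpp, and the float coordinates here are an assumption.
struct Bbox { float x, y, width, height; };
struct Word {
    std::string text;
    Bbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Join every bold word at or above `min_size` points into one string,
// preserving the order in which the words were given.
inline std::string heading_candidates(const std::vector<Word>& words, float min_size) {
    std::string out;
    for (const auto& w : words) {
        if (w.bold && w.font_size >= min_size) {
            if (!out.empty()) out += ' ';
            out += w.text;
        }
    }
    return out;
}
```

With the real bindings, the same loop body could presumably be applied directly to the vector returned by extract_words(0); a size threshold is a rough heuristic and would need tuning per document.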

Search

Search a single page with search(page_index, term, case_sensitive), or the whole document with search_all(term, case_sensitive). Both methods return std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
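
A common follow-up to search_all is summarizing where the hits fall. The sketch below counts hits per page using a local stand-in for SearchResult with the three fields used in the loop above (page, text, bbox). The field types, in particular int for page, and the hits_per_page helper are assumptions for illustration.

```cpp
#include <map>
#include <string>
#include <vector>

// Local stand-ins; the real types come from pdf_oxide.hpp and their
// exact field types are assumed here.
struct Bbox { float x, y, width, height; };
struct SearchResult {
    int page;
    std::string text;
    Bbox bbox;
};

// Count how many hits landed on each page. std::map keeps the pages sorted.
inline std::map<int, int> hits_per_page(const std::vector<SearchResult>& hits) {
    std::map<int, int> counts;
    for (const auto& r : hits) {
        ++counts[r.page];  // operator[] value-initializes a missing entry to 0
    }
    return counts;
}
```

Pages with no hits simply do not appear in the map, so its size is the number of pages that contain the term.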

Creating PDFs

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes() or write straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

Feed a freshly created PDF back into a Document without intermediate steps:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

Error handling

Every operation that can fail throws pdf_oxide::Error, which carries the native error message (what()) and the raw C ABI error code (code()). Handles can also be closed explicitly, and doing so is idempotent: doc.close() releases the native handle early, and use after close throws an error.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
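
The handle semantics described in this section (move-only ownership, idempotent close(), a throw on use after close) can be sketched in isolation. The classes below are hypothetical stand-ins that manage a fake resource counter. They are not the library's implementation, and the error code 1 is an arbitrary placeholder.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical error type carrying a numeric code next to the message.
class SketchError : public std::runtime_error {
public:
    SketchError(int code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

// Move-only owner of a fake native resource, tracked via `*live_count`.
class SketchHandle {
public:
    explicit SketchHandle(int* live_count) : live_(live_count), open_(true) { ++*live_; }
    SketchHandle(const SketchHandle&) = delete;
    SketchHandle& operator=(const SketchHandle&) = delete;
    // Moving transfers ownership and leaves the source closed.
    SketchHandle(SketchHandle&& other) noexcept
        : live_(other.live_), open_(std::exchange(other.open_, false)) {}
    SketchHandle& operator=(SketchHandle&& other) noexcept {
        if (this != &other) {
            close();
            live_ = other.live_;
            open_ = std::exchange(other.open_, false);
        }
        return *this;
    }
    ~SketchHandle() { close(); }

    // Idempotent: a second call is a no-op.
    void close() noexcept {
        if (open_) { --*live_; open_ = false; }
    }

    // Any use after close reports an error instead of touching the resource.
    int page_count() const {
        if (!open_) throw SketchError(1, "handle is closed");
        return 1;
    }
private:
    int* live_;
    bool open_;
};
```

Keeping release logic in close() and calling it from the destructor is what makes an explicit early close optional: scope exit performs the same step, and the open_ flag guards against releasing twice.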

Next steps

Getting started with Rust – using PDF Oxide from Rust
Getting started with Python – using PDF Oxide from Python
Text extraction – detailed extraction options and recipes
Creating PDFs – advanced creation with metadata and encryption
Editing – modifying existing PDFs, annotations, and form fields