What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Zig)

PDF Oxide is the fastest PDF library with built-in text extraction: 0.8 ms mean and a 100% pass rate across 3,830 PDFs. The Zig binding is idiomatic Zig layered over the pdf_oxide C ABI via @cImport, with no intermediate shims and full C interop. Handles are structs with a deinit method, and the C strings and buffers the library returns are copied into the allocator supplied by the caller.
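
The copy-into-allocator rule can be pictured with a few lines of standard-library Zig. This is only a sketch of the general pattern, not the binding's actual source: dupeCString is a hypothetical helper name, and the step where the original C buffer is handed back to the C ABI is omitted because the guide does not name that function.

```zig
const std = @import("std");

/// Hypothetical helper: copies a NUL-terminated C string into memory
/// owned by `allocator`. The caller frees the result with `allocator.free`.
fn dupeCString(allocator: std.mem.Allocator, c_str: [*:0]const u8) ![]u8 {
    // std.mem.span scans for the terminator and yields a slice without it.
    return allocator.dupe(u8, std.mem.span(c_str));
}
```

Because the returned slice lives in the caller's allocator, a plain defer a.free(...) is all the cleanup it needs, which is the convention the rest of this guide follows.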

The binding is pinned to Zig 0.15.1. Until the 1.0 release, both the build system and the C import API change from version to version, so CI uses exactly this version.

Installation

The binding links against the cdylib built with the binding's standard feature set (not the Python wheel). Build the native library, then point Zig at the header and the cdylib:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. test + run the example
cd zig
LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build test \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

LD_LIBRARY_PATH="$PWD/../target/release" \
  zig build example \
    -DPDF_OXIDE_INCLUDE_DIR="$PWD/../include" \
    -DPDF_OXIDE_LIB_DIR="$PWD/../target/release"

In your own code, import the module and you are ready to go:

const pdf_oxide = @import("pdf_oxide");

Opening a PDF

Open a file with Document.open (or Document.openFromBytes for in-memory data) and inspect its metadata. Every handle owns C resources, so pair it with defer doc.deinit().

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");

pub fn main() !void {
    var doc = try pdf_oxide.Document.open("research-paper.pdf");
    defer doc.deinit();

    std.debug.print("pages:   {d}\n", .{try doc.pageCount()});
    const v = doc.version();
    std.debug.print("version: {d}.{d}\n", .{ v.major, v.minor });
    std.debug.print("encrypted: {}\n", .{doc.isEncrypted()});
}

Extracting text

extractText returns the text of a single page (zero-based index). The result is owned by the allocator you pass in, so free it when you are done.

const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

The whole-document variants extract every page at once:

const all_text = try doc.toPlainTextAll(a);
defer a.free(all_text);
std.debug.print("{s}\n", .{all_text});
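
When per-page control is needed instead of the whole-document call, extractText can be combined with pageCount in a loop. The sketch below is illustrative only: countBlankPages is a hypothetical helper, not part of the binding, and it assumes pageCount returns the same integer type that extractText accepts as a page index (the guide states neither type). It takes doc as anytype so the snippet compiles without the library.

```zig
const std = @import("std");

/// Hypothetical helper: counts pages whose extracted text is empty once
/// whitespace is trimmed. `doc` is any value or pointer that exposes
/// `pageCount()` and `extractText(allocator, index)` as described above.
fn countBlankPages(allocator: std.mem.Allocator, doc: anytype) !usize {
    const n = try doc.pageCount();
    var blank: usize = 0;
    // Assumption: the page index has the same type pageCount() returns.
    var i: @TypeOf(n) = 0;
    while (i < n) : (i += 1) {
        const text = try doc.extractText(allocator, i);
        defer allocator.free(text); // each page's text is freed per iteration
        if (std.mem.trim(u8, text, " \t\r\n").len == 0) blank += 1;
    }
    return blank;
}
```

With the binding this would be called as countBlankPages(a, &doc). Pages that come back blank may be scanned images without a text layer.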

Converting to Markdown and HTML

Convert a single page or the whole document to Markdown or HTML. Each method returns an allocator-owned slice.

const md = try doc.toMarkdown(a, 0);
defer a.free(md);
std.debug.print("{s}\n", .{md});

const md_all = try doc.toMarkdownAll(a);
defer a.free(md_all);

const html = try doc.toHtml(a, 0);
defer a.free(html);

Word-level extraction

extractWords returns a slice of Word structs carrying the text, bounding box, font, and bold flag. Free the whole slice with the matching freeWords helper, which releases both the per-word strings and the slice itself.

const words = try doc.extractWords(a, 0);
defer pdf_oxide.Document.freeWords(a, words);

for (words) |w| {
    std.debug.print("'{s}' at ({d:.1}, {d:.1}) font={s} size={d:.1} bold={}\n", .{
        w.text, w.bbox.x, w.bbox.y, w.fontName, w.fontSize, w.bold,
    });
}

Word struct fields:

Field	Type	Description
`text`	`[]u8`	Word text (allocator-owned)
`bbox`	`Bbox`	`{ x, y, width, height }` in points
`fontName`	`[]u8`	PostScript font name
`fontSize`	`f32`	Font size in points
`bold`	`bool`	Whether the word is bold
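
These fields are enough for simple layout heuristics, such as guessing a heading size from the largest bold word on a page. The sketch below is a hedged illustration: largestBoldSize is a hypothetical helper, and Word and Bbox are local stand-ins that mirror the table so the snippet compiles on its own. The f32 type of the Bbox fields and the const qualifier on the string slices are assumptions.

```zig
const std = @import("std");

// Local stand-ins mirroring the documented fields; with the binding you
// would use its own Word type instead. Bbox field types are assumed.
const Bbox = struct { x: f32, y: f32, width: f32, height: f32 };
const Word = struct {
    text: []const u8,
    bbox: Bbox,
    fontName: []const u8,
    fontSize: f32,
    bold: bool,
};

/// Hypothetical helper: returns the largest font size among bold words,
/// or null when the slice contains no bold word.
fn largestBoldSize(words: []const Word) ?f32 {
    var best: ?f32 = null;
    for (words) |w| {
        if (!w.bold) continue;
        if (best == null or w.fontSize > best.?) best = w.fontSize;
    }
    return best;
}
```

The helper only reads the slice, so it can run between extractWords and the deferred freeWords without changing ownership.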

Characters and lines follow the same pattern:

const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);

const lines = try doc.extractTextLines(a, 0);
defer pdf_oxide.Document.freeTextLines(a, lines);

Search

search looks within a single page; searchAll scans every page. Both take a NUL-terminated query and a case_sensitive flag, and return a slice of SearchResult.

const hits = try doc.searchAll(a, "configuration", false);
defer pdf_oxide.Document.freeSearchResults(a, hits);

for (hits) |hit| {
    std.debug.print("page {d}: '{s}' at ({d:.0}, {d:.0})\n", .{
        hit.page, hit.text, hit.bbox.x, hit.bbox.y,
    });
}

To restrict the search to one page, use search with a page index:

const page_hits = try doc.search(a, 0, "Alpha", false);
defer pdf_oxide.Document.freeSearchResults(a, page_hits);

Creating PDFs

The Pdf type builds documents from Markdown, HTML, or plain text. toBytes serializes to memory; save writes to disk.

const a = std.heap.page_allocator;

var pdf = try pdf_oxide.Pdf.fromMarkdown("# Hello\n\nThis is a **Zig** PDF.\n");
defer pdf.deinit();

// Serialize to memory...
const bytes = try pdf.toBytes(a);
defer a.free(bytes);

// ...or write straight to disk.
try pdf.save("output.pdf");

A freshly built PDF can be fed straight back through the extractor:

var pdf = try pdf_oxide.Pdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");
defer pdf.deinit();

const bytes = try pdf.toBytes(a);
defer a.free(bytes);

var doc = try pdf_oxide.Document.openFromBytes(bytes);
defer doc.deinit();

const text = try doc.extractText(a, 0);
defer a.free(text);
std.debug.print("{s}\n", .{text});

Error handling

Fallible methods return Error!T, where Error is error{ PdfOxide, OutOfMemory }. Because Zig error values cannot carry a payload, the original error code from the C ABI is available through lastErrorCode(); read it immediately after catching error.PdfOxide.

const text = doc.extractText(a, 99) catch |err| switch (err) {
    error.PdfOxide => {
        std.debug.print("pdf_oxide error code: {d}\n", .{pdf_oxide.lastErrorCode()});
        return;
    },
    error.OutOfMemory => return err,
};
defer a.free(text);
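
The advice to read the code immediately makes more sense with the side-channel pattern spelled out. The following is a self-contained model of that pattern, not the binding's implementation: last_code, check, and the thread-local storage are all assumptions made for illustration, and the lastErrorCode defined here is a local stand-in for the real function.

```zig
const std = @import("std");

const Error = error{ PdfOxide, OutOfMemory };

// Hypothetical stand-in for the slot that holds the most recent C status.
threadlocal var last_code: i32 = 0;

/// Local stand-in for the binding's lastErrorCode().
fn lastErrorCode() i32 {
    return last_code;
}

/// Models a wrapped C call: the raw status is stored in the side channel,
/// and any non-zero status is surfaced as the payload-free error.PdfOxide.
fn check(status: i32) Error!void {
    last_code = status;
    if (status != 0) return error.PdfOxide;
}
```

In this model every call overwrites the slot, so a code read after a later call no longer describes the earlier failure. Whether the real binding resets the code on success is not stated in this guide.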

Next steps

Getting Started with Rust: the native core that PDF Oxide is built on
Getting Started with Python: using PDF Oxide from Python
Text Extraction: detailed extraction options and recipes
Creating PDFs: advanced creation with metadata and encryption