What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

JavaScript API Reference

PDF Oxide provides WebAssembly bindings for JavaScript and TypeScript. The pdf-oxide-wasm npm package works in Node.js, browsers, bundlers, Deno, and Cloudflare Workers.

npm install pdf-oxide-wasm

Multi-target builds (v0.3.38)

pdf-oxide-wasm now ships three builds at once through conditional exports in package.json. Pick the subpath that matches your runtime; in most environments the automatically routed top-level import also resolves correctly through the exports field.

Subpath	Target
`pdf-oxide-wasm/nodejs`	Node.js (CommonJS + ESM)
`pdf-oxide-wasm/bundler`	Vite, webpack, Rollup, esbuild, Bun
`pdf-oxide-wasm/web`	Browsers, Deno, Cloudflare Workers

// Node.js
import { WasmPdfDocument } from "pdf-oxide-wasm/nodejs";

// Vite / webpack / Rollup
import init, { WasmPdfDocument } from "pdf-oxide-wasm/bundler";
await init();

// Browsers / Deno / Workers
import init, { WasmPdfDocument } from "pdf-oxide-wasm/web";
await init();

This fixes the ReferenceError: Can't find variable: __dirname error that occurred under browser bundlers before v0.3.38.

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For details on types, see Types and Enums.

Some methods are available only when the corresponding Rust build features are enabled (rendering, signatures, barcodes, ocr-tract). The pdf-oxide-wasm package includes the base set by default; OCR ships in a separate wasm-ocr build. See Feature Availability.

Module Functions

Free functions exported at the top level of the package.

import {
  setLogLevel, disableLogging,
  generateBarcodeSvg, generateQrSvg,
  planSplitByBookmarks, splitByBookmarks,
  setCryptoPolicy, cryptoPolicy, cryptoInventory, cryptoCbom,
  modelManifest, prefetchAvailable,
  signPdfBytes, signPdfBytesPades, hasDocumentTimestamp,
} from "pdf-oxide-wasm";

Logging

setLogLevel(level)   // Set log verbosity: "off" | "error" | "warn" | "info" | "debug" | "trace"
disableLogging()     // Silence all log output

Barcodes

generateBarcodeSvg(barcodeType, data) -> string  // 1D barcode as SVG; type 0–7 (Code128, Code39, Ean13, Ean8, UpcA, Itf, Code93, Codabar)
generateQrSvg(data, errorCorrection, size) -> string  // QR code as SVG; errorCorrection 0=Low 1=Medium 2=Quartile 3=High
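
Passing bare integers for barcodeType and errorCorrection is easy to get wrong, so a small lookup table can help. The sketch below is not part of pdf-oxide-wasm: BarcodeType, QrErrorCorrection, and barcodeCode are hypothetical names, and the numeric codes assume that the types are numbered 0 to 7 in the order they are listed above.

```typescript
// Hypothetical lookup tables, not exported by pdf-oxide-wasm.
// Assumption: codes 0-7 follow the order in which the reference lists the types.
const BarcodeType = {
  Code128: 0,
  Code39: 1,
  Ean13: 2,
  Ean8: 3,
  UpcA: 4,
  Itf: 5,
  Code93: 6,
  Codabar: 7,
} as const;

// Error-correction codes as documented for generateQrSvg.
const QrErrorCorrection = { Low: 0, Medium: 1, Quartile: 2, High: 3 } as const;

type BarcodeTypeName = keyof typeof BarcodeType;

// Resolve a readable name to the numeric code expected by generateBarcodeSvg.
function barcodeCode(name: BarcodeTypeName): number {
  return BarcodeType[name];
}
```

With these in place, a call would read generateBarcodeSvg(barcodeCode("Ean13"), data) rather than generateBarcodeSvg(2, data).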

Splitting by bookmarks

planSplitByBookmarks(srcBytes, titlePrefix, ignoreCase, level, includeFrontMatter) -> Array  // Plan a split without producing PDFs; returns segment descriptors
splitByBookmarks(srcBytes, titlePrefix, ignoreCase, level, includeFrontMatter) -> Array       // Split at bookmark boundaries; returns [segment, bytes] pairs (level 0=all depths, 1=top-level)

Crypto policy management

setCryptoPolicy(spec)   // Install the process-wide crypto policy ("compat" | "strict" | "fips-strict"[;…]); fail-closed
cryptoPolicy() -> string  // The active crypto policy as its canonical grammar string
cryptoInventory() -> string[]  // Algorithm tokens exercised so far this process
cryptoCbom() -> string  // CycloneDX 1.6 Cryptographic Bill of Materials (JSON string)

OCR model preparation

modelManifest() -> string   // JSON manifest of OCR detector/recognizer cache filenames and source URLs (host-side fetch)
prefetchAvailable() -> boolean  // Whether this build can download OCR models to a local cache (always false in WASM)

Signing (free functions)

signPdfBytes(pdfData, cert, reason?, location?) -> Uint8Array  // Sign raw PDF bytes with a WasmCertificate; returns the signed PDF
signPdfBytesPades(pdfData, cert, level, timestampToken?, revocation?, reason?, location?) -> Uint8Array  // Sign at a PAdES baseline level (BB/BT/BLt); pass a pre-fetched RFC 3161 token for BT/BLt
hasDocumentTimestamp(pdfData) -> boolean  // Whether the PDF carries a document-scoped /DocTimeStamp (PAdES-B-LTA)

WasmPdfDocument

The main class for opening, extracting from, editing, and saving PDFs.

import { WasmPdfDocument } from "pdf-oxide-wasm";

Constructor

`new WasmPdfDocument(data, password?)`

Loads a PDF document from raw bytes.

Parameter	Type	Description
`data`	`Uint8Array`	Contents of the PDF file
`password`	`string \| undefined`	Optional password for encrypted PDFs

Throws: Error if the PDF is invalid or cannot be parsed.

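import { readFileSync } from "node:fs";

// Read the file into memory and hand the raw bytes to the constructor (Node.js).
const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);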

Static constructors

WasmPdfDocument.openFromDocxBytes(data) -> WasmPdfDocument  // Convert DOCX bytes to a PDF document
WasmPdfDocument.openFromPptxBytes(data) -> WasmPdfDocument  // Convert PPTX bytes to a PDF document
WasmPdfDocument.openFromXlsxBytes(data) -> WasmPdfDocument  // Convert XLSX bytes to a PDF document

Basic reading (read-only)

`pageCount() -> number`

Returns the number of pages in the document.

`version() -> Uint8Array`

Returns the PDF version as [major, minor].

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);

`authenticate(password) -> boolean`

Decrypts an encrypted PDF. Returns true if authentication succeeded.

Parameter	Type	Description
`password`	`string`	Password string

`hasStructureTree() -> boolean`

Checks whether the document is a Tagged PDF with a structure tree.

Signature analysis

signatureCount() -> number          // Number of digital signatures in the document
signatures() -> WasmSignature[]     // Parsed signatures (signer, reason, time, verify())
dss() -> Dss | null                 // Document Security Store (certs/CRLs/OCSP), or null

Text extraction

`extractText(pageIndex, region?) -> string`

Extracts plain text from a single page. Pass an optional region [x, y, w, h] to limit extraction.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clipping region `[x, y, width, height]`

const text = doc.extractText(0);
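
Regions are given as [x, y, width, height], while the page box and erase methods in this reference take corner coordinates (llx, lly, urx, ury). A small converter keeps the two conventions from being mixed up. This is a minimal sketch: cornersToRegion is a hypothetical helper, and it assumes that (x, y) names the lower-left corner of the region, which this reference does not state explicitly.

```typescript
type Region = [number, number, number, number];    // [x, y, width, height]
type CornerBox = [number, number, number, number]; // [llx, lly, urx, ury]

// Hypothetical helper, not part of pdf-oxide-wasm.
// Assumption: a region's (x, y) is its lower-left corner in PDF points.
function cornersToRegion(box: CornerBox): Region {
  const [llx, lly, urx, ury] = box;
  // Normalize so width and height are never negative, even if corners are swapped.
  const x = Math.min(llx, urx);
  const y = Math.min(lly, ury);
  return [x, y, Math.abs(urx - llx), Math.abs(ury - lly)];
}
```

The result can then be passed as the region argument, for example doc.extractText(0, cornersToRegion([50, 600, 450, 750])).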

`extractAllText() -> string`

Extracts plain text from all pages, separated by form feed characters.
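
Because pages are joined with form feed characters, the combined string can be split back into per-page text. A minimal sketch, assuming the separator is a single "\f" between pages; splitPages is a hypothetical helper, and whether the library emits a trailing form feed after the last page is not documented here, so the sketch tolerates both cases.

```typescript
// Hypothetical helper, not part of pdf-oxide-wasm.
// Splits the output of extractAllText() into one string per page.
function splitPages(allText: string): string[] {
  const pages = allText.split("\f");
  // A trailing form feed would otherwise produce a spurious empty last page.
  if (pages.length > 1 && pages[pages.length - 1] === "") {
    pages.pop();
  }
  return pages;
}
```

Blank pages in the middle of the document are kept as empty strings, so array indices still line up with page indices.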

`extractStructured(pageIndex) -> string`

Extracts a structured JSON representation of the page (blocks, lines, styling).

`extractChars(pageIndex, region?) -> Array`

Extracts individual characters with exact positioning and font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clipping region `[x, y, width, height]`

Returns: an array of objects with the fields:

Field	Type	Description
`char`	`string`	Character
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y})`);
}
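
The per-character font metadata makes it straightforward to profile which fonts a page uses. The sketch below tallies characters by fontName; countCharsByFont is a hypothetical helper that relies only on the fontName field from the table above, so it accepts the array returned by extractChars() as well as any array of similarly shaped objects.

```typescript
// Hypothetical helper, not part of pdf-oxide-wasm.
// Counts how many extracted characters use each font name.
function countCharsByFont(chars: Array<{ fontName: string }>): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const c of chars) {
    counts[c.fontName] = (counts[c.fontName] ?? 0) + 1;
  }
  return counts;
}
```

Typical usage would be countCharsByFont(doc.extractChars(0)), which yields an object keyed by font name.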

`extractPageText(pageIndex, readingOrder?) -> object`

Returns spans, characters, and page dimensions in a single extraction pass. More efficient than separate extractSpans() + extractChars() calls. For multi-column PDFs, pass "column_aware".

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`readingOrder`	`string \| undefined`	`"column_aware"` or `"top_to_bottom"` (default)

Returns: an object with the fields:

Field	Type	Description
`spans`	`Array`	Array of span objects
`chars`	`Array`	Array of character objects
`pageWidth`	`number`	Page width in PDF points
`pageHeight`	`number`	Page height in PDF points
`text`	`string`	Full text content

const result = doc.extractPageText(0);
console.log(`Page: ${result.pageWidth}x${result.pageHeight} pt`);
for (const span of result.spans) {
  console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

`extractSpans(pageIndex, region?, readingOrder?) -> Array`

Extracts styled text spans with font metadata. For multi-column PDFs, pass "column_aware" as readingOrder.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`region`	`number[] \| undefined`	Optional clipping region `[x, y, width, height]`
`readingOrder`	`string \| undefined`	`"column_aware"` or `"top_to_bottom"` (default)

Returns: an array of objects with the fields:

Field	Type	Description
`text`	`string`	Text content
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`isMonospace`	`boolean`	Whether the font is monospaced
`charWidths`	`number[]`	Advance widths per glyph
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
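
Span font sizes can drive simple layout heuristics. The sketch below estimates the body text size as the font size that covers the most characters, then flags spans noticeably larger than that as heading candidates. It is an illustrative heuristic, not how the library's own heading detection works; dominantFontSize, headingCandidates, and the 1.2 ratio are all assumptions made for this example.

```typescript
// Minimal shape needed from the objects returned by extractSpans().
interface SpanLike {
  text: string;
  fontSize: number;
}

// Hypothetical helper: the font size that covers the most characters.
function dominantFontSize(spans: SpanLike[]): number | undefined {
  const weight: Record<string, number> = {};
  for (const s of spans) {
    const key = String(s.fontSize);
    weight[key] = (weight[key] ?? 0) + s.text.length;
  }
  let best: number | undefined;
  let bestWeight = -1;
  for (const key of Object.keys(weight)) {
    const w = weight[key] ?? 0;
    if (w > bestWeight) {
      bestWeight = w;
      best = Number(key);
    }
  }
  return best;
}

// Hypothetical helper: spans at least `ratio` times larger than the body size.
function headingCandidates(spans: SpanLike[], ratio = 1.2): SpanLike[] {
  const body = dominantFontSize(spans);
  if (body === undefined) return [];
  return spans.filter((s) => s.fontSize >= body * ratio);
}
```

Calling headingCandidates(doc.extractSpans(0)) would return the larger-than-body spans on the first page; the ratio may need tuning per document.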

Words, lines, tables

extractWords(pageIndex, region?) -> Array       // Word-level boxes with text + font metadata
extractTextLines(pageIndex, region?) -> Array   // Line-level boxes, each with its words
extractTables(pageIndex, region?) -> Array      // Detected tables with rows/cells (text + bboxes)

Headers, footers, and page artifacts

Detect and remove or erase running headers and footers and other page furniture.

removeHeaders(threshold) -> number     // Remove detected headers across the document; returns count removed
removeFooters(threshold) -> number     // Remove detected footers; returns count removed
removeArtifacts(threshold) -> number   // Remove detected page artifacts; returns count removed
eraseHeader(pageIndex)                 // Queue an erase of the header region on a page
editHeader(pageIndex)                  // Mark the header region for editing on a page
eraseFooter(pageIndex)                 // Queue an erase of the footer region on a page
editFooter(pageIndex)                  // Mark the footer region for editing on a page
eraseArtifacts(pageIndex)              // Queue an erase of detected artifacts on a page

Region-based extraction

`within(pageIndex, region) -> WasmPdfPageRegion`

Restricts subsequent extraction to a rectangular region of the page. region is given as [x, y, width, height]. See WasmPdfPageRegion.

const region = doc.within(0, [50, 600, 400, 150]);
const text = region.extractText();

Format conversion

`toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts a single page to Markdown.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`detectHeadings`	`boolean`	`true`	Detect headings by font size
`includeImages`	`boolean`	`true`	Include images
`includeFormFields`	`boolean`	`true`	Include form field values

`toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts all pages to Markdown.

`toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts a single page to HTML.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`preserveLayout`	`boolean`	`false`	Preserve the visual layout
`detectHeadings`	`boolean`	`true`	Detect headings
`includeFormFields`	`boolean`	`true`	Include form field values

`toHtmlAll(preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts all pages to HTML.

`toPlainText(pageIndex) -> string`

Converts a single page to plain text.

`toPlainTextAll() -> string`

Converts all pages to plain text.

Round-trip to office formats

toDocxBytes() -> Uint8Array   // Export the document as a DOCX file
toPptxBytes() -> Uint8Array   // Export the document as a PPTX file
toXlsxBytes() -> Uint8Array   // Export the document as an XLSX file

Search

`search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches text across all pages.

Parameter	Type	Default	Description
`pattern`	`string`	–	Search pattern (string or regular expression)
`caseInsensitive`	`boolean`	`false`	Case-insensitive search
`literal`	`boolean`	`false`	Treat the pattern as a literal string
`wholeWord`	`boolean`	`false`	Match whole words only
`maxResults`	`number`	`0`	Maximum number of results (0 = unlimited)

Returns: an array of objects with the fields:

Field	Type	Description
`page`	`number`	Page number
`text`	`string`	Matched text
`bbox`	`object`	Bounding box
`startIndex`	`number`	Start index in the page text
`endIndex`	`number`	End index in the page text
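
Search results arrive as a flat array, so a common follow-up step is to summarize them per page. A minimal sketch using only the page field from the table above; hitsPerPage is a hypothetical helper, and it makes no assumption about whether page values are zero-based.

```typescript
// Hypothetical helper, not part of pdf-oxide-wasm.
// Counts search hits per page from the array returned by search().
function hitsPerPage(hits: Array<{ page: number }>): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const h of hits) {
    counts[h.page] = (counts[h.page] ?? 0) + 1;
  }
  return counts;
}
```

For example, hitsPerPage(doc.search("invoice", true)) would give a page-to-count map for a case-insensitive search.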

`searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches text within a single page.

Image information

`extractImages(pageIndex) -> Array`

Returns metadata for the images on a page.

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`colorSpace`	`string`	Color space (for example, `DeviceRGB`)
`bitsPerComponent`	`number`	Bits per color channel
`bbox`	`object`	Position on the page

`extractImageBytes(pageIndex) -> Array`

Extracts raw image bytes from a page. Returns an array of objects:

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`data`	`Uint8Array`	Raw image bytes
`format`	`string`	Image format

`pageImages(pageIndex) -> Array`

Returns image names and bounds for positioning operations.

Field	Type	Description
`name`	`string`	XObject name
`bounds`	`number[]`	`[x, y, width, height]`
`matrix`	`number[]`	Transformation matrix `[a, b, c, d, e, f]`

Vector content

extractPaths(pageIndex, region?) -> Array   // Vector paths (lines, curves, shapes) on a page
extractRects(pageIndex, region?) -> Array   // Axis-aligned rectangles detected from path segments
extractLines(pageIndex, region?) -> Array   // Straight line segments detected from path data

Document structure

`getOutline() -> Array | null`

Returns the document's bookmarks (table of contents). Returns null if the document has no outline.

`getAnnotations(pageIndex) -> Array`

Returns annotation metadata (type, rectangle, contents, etc.) for a page.

`pageLabels() -> Array`

Returns page label ranges. Returns an array of objects:

Field	Type	Description
`startPage`	`number`	First page in this range
`style`	`string`	Numbering style
`prefix`	`string`	Label prefix
`startValue`	`number`	Starting number

`xmpMetadata() -> object | null`

Returns XMP metadata. Returns null if none is present. Object fields include:

Field	Type	Description
`dcTitle`	`string \| null`	Document title
`dcCreator`	`string[] \| null`	List of creators
`dcDescription`	`string \| null`	Description
`xmpCreatorTool`	`string \| null`	Creator tool
`xmpCreateDate`	`string \| null`	Creation date
`xmpModifyDate`	`string \| null`	Modification date
`pdfProducer`	`string \| null`	PDF producer

Form fields

`getFormFields() -> Array`

Returns all form fields with name, type, value, and flags.

Field	Type	Description
`name`	`string`	Field name
`fieldType`	`string`	Field type (text, checkbox, etc.)
`value`	`string`	Current value
`flags`	`number`	Field flags

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.fieldType}) = ${f.value}`);
}

`hasXfa() -> boolean`

Checks whether the document contains XFA forms.

`getFormFieldValue(name) -> any`

Returns a form field's value by name. Returns string, boolean, or null depending on the field type.

`setFormFieldValue(name, value) -> void`

Sets a form field's value by name.

Parameter	Type	Description
`name`	`string`	Field name
`value`	`string \| boolean`	New field value

`exportFormData(format?) -> Uint8Array`

Exports form data in FDF (default) or XFDF format.

Parameter	Type	Default	Description
`format`	`string`	`"fdf"`	Export format: `"fdf"` or `"xfdf"`

Form flattening

flattenForms()                    // Flatten all form fields into page content
flattenFormsOnPage(pageIndex)     // Flatten forms on a specific page
flattenWarnings() -> string[]     // Warnings produced by the last flatten operation

Editing

Metadata

Method	Parameters	Description
`setTitle(title)`	`string`	Set the document title
`setAuthor(author)`	`string`	Set the document author
`setSubject(subject)`	`string`	Set the document subject
`setKeywords(keywords)`	`string`	Set the document keywords

Page rotation

Method	Parameters	Description
`pageRotation(pageIndex)`	`number`	Get the current rotation (0, 90, 180, 270)
`setPageRotation(pageIndex, degrees)`	`number, number`	Set an absolute rotation
`rotatePage(pageIndex, degrees)`	`number, number`	Add to the current rotation
`rotateAllPages(degrees)`	`number`	Rotate all pages
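
The difference between setPageRotation (absolute) and rotatePage (relative) comes down to modular arithmetic on multiples of 90. The sketch below predicts the rotation that pageRotation() would be expected to report after a relative rotation. normalizeRotation and rotationAfter are hypothetical helpers, and the wrap-around modulo 360 is an assumption inferred from the documented 0, 90, 180, 270 return values.

```typescript
type Rotation = 0 | 90 | 180 | 270;

// Hypothetical helper: fold any multiple of 90 into the 0-270 range.
function normalizeRotation(degrees: number): Rotation {
  if (degrees % 90 !== 0) {
    throw new RangeError(`rotation must be a multiple of 90, got ${degrees}`);
  }
  // The double modulo keeps negative inputs in the positive range.
  const wrapped = ((degrees % 360) + 360) % 360;
  return wrapped as Rotation;
}

// Hypothetical helper: expected rotation after adding `delta` to `current`.
// Assumption: relative rotation wraps modulo 360.
function rotationAfter(current: number, delta: number): Rotation {
  return normalizeRotation(current + delta);
}
```

For example, a page at 270 degrees rotated by a further 90 would be expected to end up at 0 rather than 360.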

Page dimensions

Method	Parameters	Description
`pageMediaBox(pageIndex)`	`number`	Get the MediaBox `[llx, lly, urx, ury]`
`setPageMediaBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the MediaBox
`pageCropBox(pageIndex)`	`number`	Get the CropBox (may be null)
`setPageCropBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the CropBox
`cropMargins(left, right, top, bottom)`	`number, ...`	Crop the margins of all pages

Page operations

deletePage(index)                 // Delete a page by index
movePage(fromIndex, toIndex)      // Move a page to a new position
extractPages(pages) -> Uint8Array // Build a new PDF from the given page indices

Erasing / whiteout

Method	Parameters	Description
`eraseRegion(pageIndex, llx, lly, urx, ury)`	`number, ...`	Erase a region
`eraseRegions(pageIndex, rects)`	`number, Float32Array`	Erase multiple regions
`clearEraseRegions(pageIndex)`	`number`	Clear pending erase operations
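
eraseRegions takes its rectangles as a single Float32Array, so a list of rectangle objects has to be packed first. The sketch below assumes a flat layout of four floats per rectangle in the same order as the eraseRegion parameters (llx, lly, urx, ury); that layout is not spelled out in this reference, and packRects is a hypothetical helper.

```typescript
interface Rect {
  llx: number;
  lly: number;
  urx: number;
  ury: number;
}

// Hypothetical helper, not part of pdf-oxide-wasm.
// Assumption: eraseRegions expects [llx, lly, urx, ury, llx, lly, ...].
function packRects(rects: Rect[]): Float32Array {
  const out = new Float32Array(rects.length * 4);
  rects.forEach((r, i) => {
    out.set([r.llx, r.lly, r.urx, r.ury], i * 4);
  });
  return out;
}
```

Under that assumption, a call would look like doc.eraseRegions(0, packRects(rects)).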

Annotations and redaction

Method	Parameters	Description
`flattenPageAnnotations(pageIndex)`	`number`	Flatten annotations on a page
`flattenAllAnnotations()`	–	Flatten all annotations
`applyPageRedactions(pageIndex)`	`number`	Apply redactions on a page
`applyAllRedactions()`	–	Apply all redactions
`addRedaction(page, x0, y0, x1, y1, fill?)`	`number, ...`	Queue a redaction region (optional fill `[r,g,b]`)
`redactionCount(page)`	`number`	Number of queued redactions for a page
`applyRedactionsDestructive(scrubMetadata?)`	`boolean`	Destructively remove content; returns a redaction report
`sanitizeDocument(scrubMetadata?, removeJavascript?, removeEmbeddedFiles?)`	`boolean, ...`	Remove metadata, scripts, and embedded files; returns a report

Merging and embedding

`mergeFrom(data) -> number`

Merges pages from another PDF. Returns the number of pages merged.

Parameter	Type	Description
`data`	`Uint8Array`	Bytes of the source PDF file

`embedFile(name, data) -> void`

Attaches a file to the PDF.

Parameter	Type	Description
`name`	`string`	Attachment file name
`data`	`Uint8Array`	File contents

Image manipulation

Method	Parameters	Description
`repositionImage(pageIndex, name, x, y)`	`number, string, number, number`	Move an image
`resizeImage(pageIndex, name, w, h)`	`number, string, number, number`	Resize an image
`setImageBounds(pageIndex, name, x, y, w, h)`	`number, string, ...`	Set image bounds

Classification and auto-extraction

classifyDocument() -> string                 // Classify the whole document (e.g. born-digital vs scanned)
classifyPage(pageIndex) -> string            // Classify a single page
extractTextAuto(pageIndex) -> string         // Auto-pick native vs OCR extraction for a page
extractPageAuto(pageIndex, optionsJson?) -> string  // Auto-extraction returning a structured JSON page

Validation

validatePdfA(level) -> object        // Validate against a PDF/A conformance level (e.g. "2b")
convertToPdfA(level) -> object       // Convert toward a PDF/A level; returns a report
validatePdfUa(level?) -> object      // Validate against PDF/UA accessibility
validatePdfX(level?) -> object       // Validate against a PDF/X print level

Rendering

Requires the rendering feature.

Method	Parameters	Returns	Description
`renderPage(pageIndex, dpi?)`	`number, number`	`Uint8Array`	Render a page to PNG bytes (default 150 dpi)
`flattenToImages(dpi?)`	`number`	`Uint8Array`	Flatten all pages into an image-based PDF

OCR

Requires the wasm-ocr build. See WasmOcrEngine.

`extractTextOcr(pageIndex, engine) -> string`

Runs the OCR pipeline built into WASM on a page, using a WasmOcrEngine constructed on the host side. Returns the recognized text in reading order.

const text = doc.extractTextOcr(0, engine);

Saving

`save() -> Uint8Array`

Saves the edited PDF as bytes. The alias saveToBytes() is also available.

`saveWithOptions(compress?, garbageCollect?, linearize?) -> Uint8Array`

Saves with explicit serialization options.

Parameter	Type	Default	Description
`compress`	`boolean`	`true`	Compress object streams
`garbageCollect`	`boolean`	`true`	Remove unreferenced objects
`linearize`	`boolean`	`false`	Produce a linearized PDF ("fast web view")

`saveEncryptedToBytes(password, ownerPassword?, allowPrint?, allowCopy?, allowModify?, allowAnnotate?) -> Uint8Array`

Saves with AES-256 encryption.

Parameter	Type	Default	Description
`password`	`string`	–	User password
`ownerPassword`	`string`	user password	Owner password
`allowPrint`	`boolean`	`true`	Allow printing
`allowCopy`	`boolean`	`true`	Allow copying
`allowModify`	`boolean`	`true`	Allow modification
`allowAnnotate`	`boolean`	`true`	Allow annotation

`free()`

Frees the WASM memory. Always call this method when you are done with the document.
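
Since free() has to be called even when extraction throws, wrapping the work in try/finally is a reasonable pattern. The sketch below is one way to do that; withResource is a hypothetical helper and only relies on the object having a free() method, so it would apply to a WasmPdfDocument as well as to other handles that expose free(). It is written for synchronous callbacks; an async variant would need to await the callback before freeing.

```typescript
// Anything that owns WASM memory and exposes free().
interface Freeable {
  free(): void;
}

// Hypothetical helper, not part of pdf-oxide-wasm.
// Runs `fn` with the resource and always frees it afterwards, even on error.
function withResource<T extends Freeable, R>(resource: T, fn: (r: T) => R): R {
  try {
    return fn(resource);
  } finally {
    resource.free();
  }
}
```

Usage would look like withResource(new WasmPdfDocument(bytes), (doc) => doc.extractAllText()).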

WasmPdfPageRegion

A region handle returned by WasmPdfDocument.within(pageIndex, region). Its extraction methods are limited to the rectangle.

extractText() -> string       // Plain text within the region
extractChars() -> Array       // Characters within the region
extractWords() -> Array       // Words within the region
extractTextLines() -> Array   // Text lines within the region
extractTables() -> Array      // Tables within the region
extractImages() -> Array      // Images within the region
extractPaths() -> Array       // Vector paths within the region
extractRects() -> Array       // Rectangles within the region
extractLines() -> Array       // Line segments within the region
extractTextOcr(engine?) -> string  // OCR text within the region (wasm-ocr build)

WasmPdf

Factory class for creating new PDFs.

import { WasmPdf } from "pdf-oxide-wasm";

Static methods

WasmPdf.fromMarkdown(content, title?, author?) -> WasmPdf  // Create a PDF from Markdown text
WasmPdf.fromHtml(content, title?, author?) -> WasmPdf      // Create a PDF from HTML
WasmPdf.fromText(content, title?, author?) -> WasmPdf      // Create a PDF from plain text
WasmPdf.fromBytes(data) -> WasmPdf                         // Open an existing PDF from bytes for modification
WasmPdf.fromImageBytes(data) -> WasmPdf                    // Single-page PDF from one image (JPEG/PNG)
WasmPdf.fromMultipleImageBytes(imagesArray) -> WasmPdf     // Multi-page PDF, one page per image
WasmPdf.merge(pdfs) -> WasmPdf                             // Merge an array of PDF byte buffers into one
WasmPdf.fromHtmlCss(html, css, fontBytes) -> WasmPdf       // HTML + CSS with a single embedded font
WasmPdf.fromHtmlCssWithFonts(html, css, fonts) -> WasmPdf  // HTML + CSS with multiple [name, bytes] fonts

Parameter	Type	Description
`content`	`string`	Source content (Markdown / HTML / text)
`title`	`string \| undefined`	Document title
`author`	`string \| undefined`	Document author
`data`	`Uint8Array`	Bytes of a PDF or image file
`imagesArray`	`Uint8Array[]`	Array of image file bytes
`pdfs`	`Uint8Array[]`	Array of PDF file bytes to merge

Instance methods

`toBytes() -> Uint8Array`

Returns the PDF as bytes.

`size -> number`

Size of the PDF in bytes (read-only getter).

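import { writeFileSync } from "node:fs";

// Build a PDF from Markdown, report its size, and write it to disk (Node.js).
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
console.log(`PDF size: ${pdf.size} bytes`);
writeFileSync("output.pdf", pdf.toBytes());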

WasmDocumentBuilder

A flexible low-level layout builder for composing PDFs page by page. Use it together with WasmFluentPageBuilder.

import { WasmDocumentBuilder } from "pdf-oxide-wasm";
const builder = new WasmDocumentBuilder();

Document setup

new WasmDocumentBuilder()          // Create an empty builder
title(title)                       // Set document title
author(author)                     // Set document author
subject(subject)                   // Set document subject
keywords(keywords)                 // Set document keywords
creator(creator)                   // Set the creator tool name
onOpen(script)                     // Set a document-level open JavaScript action
taggedPdfUa1()                     // Enable Tagged PDF / PDF/UA-1 output
language(lang)                     // Set the document language (e.g. "en-US")
roleMap(custom, standard)          // Map a custom structure tag to a standard role
registerEmbeddedFont(name, font)   // Register a WasmEmbeddedFont under a name

Page creation and output

a4Page() -> WasmFluentPageBuilder         // Start a new A4 page
letterPage() -> WasmFluentPageBuilder     // Start a new US Letter page
page(width, height) -> WasmFluentPageBuilder  // Start a custom-size page (points)
commitPage(page)                          // Commit a completed page builder
build() -> Uint8Array                     // Finish and return the PDF bytes
toBytesEncrypted(userPassword, ownerPassword?) -> Uint8Array  // Finish with AES-256 encryption

WasmFluentPageBuilder

The per-page builder returned by a4Page() / letterPage() / page(). Queue operations, then commit them with done(builder) (or builder.commitPage(page)).

Text and flow

font(name, size)                 // Set the current font and size
at(x, y)                         // Move the cursor to an absolute position
text(text)                       // Draw text at the cursor
heading(level, text)             // Draw a heading (level 1–6)
paragraph(text)                  // Draw a wrapped paragraph
space(points)                    // Advance the cursor vertically
horizontalRule()                 // Draw a horizontal rule
newline()                        // Advance to the next line
columns(columnCount, gapPt, text)  // Lay text out in N balanced columns
footnote(refMark, noteText)      // Add a footnote marker + bottom-of-page note

Inline runs

inline(text)                     // Append an inline text run
inlineBold(text)                 // Append a bold inline run
inlineItalic(text)               // Append an italic inline run
inlineColor(r, g, b, text)       // Append a colored inline run (RGB 0.0–1.0)

Link and form actions

linkUrl(url)                     // Wrap the last element in a URL link
linkPage(page)                   // Link to another page index
linkNamed(destination)           // Link to a named destination
linkJavascript(script)           // Attach a JavaScript link action
onOpen(script)                   // Page open action
onClose(script)                  // Page close action
fieldKeystroke(script)           // Keystroke JavaScript for the last field
fieldFormat(script)              // Format JavaScript for the last field
fieldValidate(script)            // Validate JavaScript for the last field
fieldCalculate(script)           // Calculate JavaScript for the last field

Markup annotations

highlight(r, g, b)               // Highlight the last text run (RGB 0.0–1.0)
underline(r, g, b)               // Underline the last text run
strikeout(r, g, b)               // Strike out the last text run
squiggly(r, g, b)                // Squiggly-underline the last text run
stickyNote(text)                 // Add a sticky note at the cursor
stickyNoteAt(x, y, text)         // Add a sticky note at an absolute position
stamp(name)                      // Add a rubber-stamp annotation (e.g. "Approved")
freeText(x, y, w, h, text)       // Add a free-text annotation box
watermark(text)                  // Add a text watermark
watermarkConfidential()          // Add a "CONFIDENTIAL" watermark
watermarkDraft()                 // Add a "DRAFT" watermark

AcroForm widgets

textField(name, x, y, w, h, defaultValue?)            // Add a text field
checkbox(name, x, y, w, h, checked)                   // Add a checkbox
comboBox(name, x, y, w, h, options, selected?)        // Add a dropdown combo box
radioGroup(name, values, xs, ys, ws, hs, selected?)   // Add a radio-button group (parallel arrays)
pushButton(name, x, y, w, h, caption)                 // Add a clickable push button
signatureField(name, x, y, w, h)                      // Add an unsigned signature placeholder

Barcodes and images

barcode1d(barcodeType, data, x, y, w, h)   // Draw a 1D barcode (type 0–7)
barcodeQr(data, x, y, size)                // Draw a QR code
imageWithAlt(bytes, x, y, w, h, altText)   // Embed an image with accessibility alt text
imageArtifact(bytes, x, y, w, h)           // Embed a decorative image as an /Artifact

Graphics primitives

rect(x, y, w, h)                                  // Stroked 1pt rectangle outline
filledRect(x, y, w, h, r, g, b)                   // Filled rectangle (RGB 0.0–1.0)
line(x1, y1, x2, y2)                              // 1pt black line
strokeRect(x, y, w, h, width, r, g, b)            // Stroked rectangle, explicit width + color
strokeRectDashed(x, y, w, h, width, r, g, b, dash, phase)  // Dashed rectangle border
strokeLine(x1, y1, x2, y2, width, r, g, b)        // Line with explicit width + color
strokeLineDashed(x1, y1, x2, y2, width, r, g, b, dash, phase)  // Dashed line
textInRect(x, y, w, h, text, align)               // Lay text inside a rectangle (align 0/1/2)

Layout helpers and finishing

measure(text) -> number                  // Rendered width of text in the current font (points)
remainingSpace() -> number               // Vertical space left on the page (points)
newPageSameSize()                        // Start a new page with the same dimensions
table(spec)                              // Draw a buffered table from a spec object
streamingTable(spec) -> WasmStreamingTable  // Open a streaming table for large datasets
done(builder)                            // Commit this page's queued ops to the document builder

The spec object for table(spec) has the shape { columns: [{ header, width, align }], rows: [[...]], hasHeader }. The spec for streamingTable(spec) additionally accepts { repeatHeader, mode, sampleRows, minColWidthPt, maxColWidthPt, maxRowspan, batchSize }.
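
As a rough illustration of the spec shape described above, the sketch below builds a table(spec) object from an array of records. buildTableSpec, the key field, and the sample data are hypothetical and not part of pdf-oxide-wasm. Treating width as points and passing every cell as a string are assumptions; align follows the Align values 0/1/2.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm): turn an array of records
// into the { columns, rows, hasHeader } shape described above.
// Assumptions: widths are in points; align is 0 = left, 1 = center, 2 = right.
function buildTableSpec(records, columnDefs) {
  const columns = columnDefs.map(({ header, width, align = 0 }) => ({
    header,
    width,
    align,
  }));
  // Every cell is stringified; missing values become empty strings.
  const rows = records.map((record) =>
    columnDefs.map(({ key }) => String(record[key] ?? ""))
  );
  return { columns, rows, hasHeader: true };
}

const spec = buildTableSpec(
  [
    { sku: "A-1", qty: 3 },
    { sku: "B-2", qty: 12 },
  ],
  [
    { key: "sku", header: "SKU", width: 120 },
    { key: "qty", header: "Qty", width: 60, align: 2 },
  ]
);
```

The resulting spec object could then be handed to table(spec); for streamingTable(spec) the extra fields listed above would be merged into the same object.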

WasmStreamingTable

A row-by-row streaming table handle returned by WasmFluentPageBuilder.streamingTable(spec). Push rows incrementally, then call finish().

columnCount() -> number       // Number of columns
pendingRowCount() -> number   // Rows in the current un-flushed batch
batchCount() -> number        // Number of completed batches
pushRow(cells)                // Push one row (array of cell strings)
pushRowSpan(cells)            // Push a row whose cells may carry rowspans
flush()                       // Flush the current batch
finish()                      // Finalize the table and replay it into the page
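
The push-then-finish flow can be sketched as follows. streamRows and generateRows are hypothetical helpers, and fakeTable is a stand-in that only records calls; in real use the handle would come from streamingTable(spec). Only pushRow(cells) and finish() from the list above are used.

```javascript
// Hypothetical helper: push rows from any iterable into a streaming-table
// handle, then finalize it. `table` only needs pushRow(cells) and finish().
function streamRows(table, rows) {
  let pushed = 0;
  for (const row of rows) {
    table.pushRow(row.map(String)); // cells are passed as strings
    pushed += 1;
  }
  table.finish();
  return pushed;
}

// Stand-in that records calls instead of drawing anything.
const calls = [];
const fakeTable = {
  pushRow: (cells) => calls.push(["pushRow", cells]),
  finish: () => calls.push(["finish"]),
};

// Rows are produced lazily, so a large dataset never has to sit in one array.
function* generateRows(n) {
  for (let i = 1; i <= n; i++) yield [i, `item-${i}`];
}

const pushedCount = streamRows(fakeTable, generateRows(3));
```

Feeding the handle from a generator keeps memory use flat on the JavaScript side, which is the usual reason to prefer a streaming table over a buffered one.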

WasmEmbeddedFont

A font registered for embedding via WasmDocumentBuilder.registerEmbeddedFont.

WasmEmbeddedFont.fromBytes(data, name?) -> WasmEmbeddedFont  // Load a TTF/OTF font from bytes
font.name -> string                                          // The font's resolved name (getter)

Page Templates

Reusable header and footer decoration applied to every page.

WasmArtifactStyle

new WasmArtifactStyle()        // Default style
font(name, size) -> this       // Set font family and size
bold() -> this                 // Make the text bold
color(r, g, b) -> this         // Set the text color (RGB 0.0–1.0)

WasmArtifact

new WasmArtifact()                       // Empty artifact
WasmArtifact.left(text) -> WasmArtifact   // Left-aligned artifact text
WasmArtifact.center(text) -> WasmArtifact // Center-aligned artifact text
WasmArtifact.right(text) -> WasmArtifact  // Right-aligned artifact text
withStyle(style) -> this                  // Apply a WasmArtifactStyle
withOffset(offset) -> this                // Set the vertical offset from the edge

WasmHeader / WasmFooter

new WasmHeader()                  // Empty header (WasmFooter is identical)
WasmHeader.left(text) -> WasmHeader     // Left-aligned header text
WasmHeader.center(text) -> WasmHeader   // Center-aligned header text
WasmHeader.right(text) -> WasmHeader    // Right-aligned header text

WasmPageTemplate

new WasmPageTemplate()         // Empty template
header(header) -> this         // Set the page header artifact
footer(footer) -> this         // Set the page footer artifact
skipFirstPage() -> this        // Omit header/footer on the first page

Digital Signatures

Requires the signatures feature.

WasmCertificate

WasmCertificate.load(data) -> WasmCertificate                  // Load a DER certificate + key bundle
WasmCertificate.loadPem(certPem, keyPem) -> WasmCertificate    // Load from PEM cert + key strings
WasmCertificate.loadPkcs12(data, password) -> WasmCertificate  // Load from a PKCS#12 (.p12/.pfx) blob
cert.subject -> string         // Subject distinguished name (getter)
cert.issuer -> string          // Issuer distinguished name (getter)
cert.serial -> string          // Serial number (getter)
cert.validity -> bigint[]      // [notBefore, notAfter] as unix seconds (getter)
cert.isValid -> boolean        // Whether the certificate is currently valid (getter)
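
Because cert.validity is exposed as bigint unix seconds, a small conversion step is usually needed before display or comparison. The helpers below are a hypothetical sketch (validityWindow and daysUntilExpiry are not library functions) and operate on a plain [notBefore, notAfter] array, so they run without a real certificate.

```javascript
// Hypothetical helper: convert [notBefore, notAfter] (unix seconds as bigint)
// into JavaScript Date objects.
function validityWindow(validity) {
  const [notBefore, notAfter] = validity;
  return {
    notBefore: new Date(Number(notBefore) * 1000),
    notAfter: new Date(Number(notAfter) * 1000),
  };
}

// Whole days until expiry relative to `now` (negative once expired).
function daysUntilExpiry(validity, now = new Date()) {
  const { notAfter } = validityWindow(validity);
  return Math.floor((notAfter.getTime() - now.getTime()) / 86400000);
}

// Sample values standing in for cert.validity: 2024-01-01 to 2025-01-01 (UTC).
const sampleValidity = [1704067200n, 1735689600n];
const sampleWindow = validityWindow(sampleValidity);
const daysLeft = daysUntilExpiry(sampleValidity, new Date("2024-12-01T00:00:00Z"));
```

Converting through Number is safe here because unix-second timestamps are far below Number.MAX_SAFE_INTEGER.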

WasmSignature

Returned by WasmPdfDocument.signatures().

sig.signerName -> string | null          // Signer common name (getter)
sig.reason -> string | null              // Signing reason (getter)
sig.location -> string | null            // Signing location (getter)
sig.contactInfo -> string | null         // Signer contact info (getter)
sig.signingTime -> bigint | null         // Signing time as unix seconds (getter)
sig.coversWholeDocument -> boolean       // Whether the signature covers the entire file (getter)
sig.padesLevel -> PadesLevel             // PAdES baseline level of the signature (getter)
sig.verify() -> boolean                  // Verify the signature cryptographically
sig.verifyDetached(pdfData) -> boolean   // Verify including a messageDigest check against the bytes

WasmTimestamp

WasmTimestamp.parse(data) -> WasmTimestamp  // Parse a DER TimeStampToken / TSTInfo
ts.time -> bigint              // Timestamp time as unix seconds (getter)
ts.serial -> string            // Serial number (getter)
ts.policyOid -> string         // TSA policy OID (getter)
ts.tsaName -> string           // TSA name (getter)
ts.hashAlgorithm -> number     // Imprint hash algorithm id (getter)
ts.messageImprint -> Uint8Array  // The message imprint digest (getter)
ts.verify() -> boolean         // Verify the timestamp token
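
ts.messageImprint is a raw Uint8Array, so logging it or comparing it with a digest computed elsewhere takes a little glue code. A minimal sketch, with toHex and bytesEqual as hypothetical helpers and a made-up four-byte array standing in for a real imprint:

```javascript
// Hypothetical helper: render bytes as a lowercase hex string.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Hypothetical helper: byte-wise equality of two Uint8Array values.
function bytesEqual(a, b) {
  if (a.length !== b.length) return false;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) return false;
  }
  return true;
}

// Made-up bytes standing in for ts.messageImprint.
const imprint = new Uint8Array([0x00, 0x0f, 0xa5, 0xff]);
const imprintHex = toHex(imprint);
const sameImprint = bytesEqual(imprint, new Uint8Array([0x00, 0x0f, 0xa5, 0xff]));
```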

WasmRevocationMaterial

Revocation material for offline PAdES-B-LT validation, consumed by signPdfBytesPades.

new WasmRevocationMaterial()   // Empty material set
addCert(der)                   // Add a DER X.509 certificate
addCrl(der)                    // Add a DER CRL
addOcsp(der)                   // Add a DER OCSP response

Dss

The parsed Document Security Store (DSS), returned by WasmPdfDocument.dss().

dss.certCount -> number        // Number of DER certificates (getter)
getCert(i) -> Uint8Array | undefined   // i-th DER certificate
dss.crlCount -> number         // Number of DER CRLs (getter)
getCrl(i) -> Uint8Array | undefined    // i-th DER CRL
dss.ocspCount -> number        // Number of DER OCSP responses (getter)
getOcsp(i) -> Uint8Array | undefined   // i-th DER OCSP response
dss.vri -> string[]            // Per-signature VRI keys (uppercase-hex SHA-1 of /Contents) (getter)
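
The count-plus-indexed-getter layout above lends itself to a small collection loop. The sketch below is hypothetical (collectIndexed and readDss are not library functions) and runs against fakeDss, a plain object with the same member names as the listing, so it executes without a real document.

```javascript
// Hypothetical helper: gather items exposed as "count + getter(i)",
// skipping any index that yields undefined.
function collectIndexed(count, getItem) {
  const items = [];
  for (let i = 0; i < count; i++) {
    const item = getItem(i);
    if (item !== undefined) items.push(item);
  }
  return items;
}

// Hypothetical helper: snapshot a DSS-like object into plain arrays.
function readDss(dss) {
  return {
    certs: collectIndexed(dss.certCount, (i) => dss.getCert(i)),
    crls: collectIndexed(dss.crlCount, (i) => dss.getCrl(i)),
    ocsps: collectIndexed(dss.ocspCount, (i) => dss.getOcsp(i)),
    vri: [...dss.vri],
  };
}

// Stand-in with made-up bytes; a real object would come from the document.
const fakeDss = {
  certCount: 2,
  getCert: (i) => [new Uint8Array([1]), new Uint8Array([2, 3])][i],
  crlCount: 0,
  getCrl: () => undefined,
  ocspCount: 1,
  getOcsp: (i) => (i === 0 ? new Uint8Array([9]) : undefined),
  vri: ["ABCDEF"],
};

const dssContents = readDss(fakeDss);
```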

OCR

OCR runs entirely inside WASM, using the pure-Rust tract backend in the separate wasm-ocr build. Models are supplied by the host: load the detector and recognizer ONNX files and the dictionary (see modelManifest()), then pass their bytes to the constructor.

WasmOcrEngine

new WasmOcrEngine(detModel, recModel, dict, config?)  // Build from host-supplied model bytes
engine.ocrImage(imageBytes) -> string                 // OCR a raw image (PNG/JPEG/TIFF); returns JSON {text, confidence, spans}

Parameter	Type	Description
`detModel`	`Uint8Array`	Bytes of the DBNet ONNX detector
`recModel`	`Uint8Array`	Bytes of the SVTR ONNX recognizer
`dict`	`string`	Character dictionary for the recognizer, one character per line
`config`	`WasmOcrConfig \| undefined`	Reserved (tuned defaults are used)
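
Since ocrImage returns a JSON string rather than an object, callers typically parse it before use. The following sketch assumes only the three top-level keys named above (text, confidence, spans). parseOcrResult is a hypothetical helper, the 0 to 1 confidence scale is an assumption, and the internal layout of each span is deliberately left untouched.

```javascript
// Hypothetical helper: parse the JSON string returned by ocrImage and apply
// a confidence threshold. Assumes confidence is a number between 0 and 1.
function parseOcrResult(json, minConfidence = 0.5) {
  const { text = "", confidence = 0, spans = [] } = JSON.parse(json);
  return {
    text: text.trim(),
    confidence,
    spanCount: spans.length,
    accepted: confidence >= minConfidence,
  };
}

// Made-up payload standing in for the value returned by engine.ocrImage(...).
const sampleOcrJson = JSON.stringify({
  text: " Invoice 42 \n",
  confidence: 0.91,
  spans: [{}, {}],
});

const ocrResult = parseOcrResult(sampleOcrJson, 0.8);
```

Rejected results can be routed to manual review or retried with a higher-resolution image, depending on the pipeline.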

WasmOcrConfig

new WasmOcrConfig()   // OCR configuration object (reserved for future tuning)

Enums

Align

Text and cell alignment discriminant used by textInRect and by table column specs.

Align.Left   // 0
Align.Center // 1
Align.Right  // 2

PadesLevel

PAdES baseline level used by signPdfBytesPades and by the WasmSignature.padesLevel getter.

PadesLevel.BB    // 0 — signed attrs incl. ESS signing-certificate-v2
PadesLevel.BT    // 1 — B-B + RFC 3161 signature-time-stamp
PadesLevel.BLt   // 2 — B-T + Document Security Store (DSS/VRI)
PadesLevel.BLta  // 3 — B-LT + document-scoped /DocTimeStamp
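
Because the levels are numbered 0 to 3 and each one builds on the previous, a numeric comparison is enough to check a minimum level. The sketch below uses plain numbers in place of the enum so it runs standalone; padesLabel and meetsLevel are hypothetical helpers.

```javascript
// Display labels indexed by the numeric PadesLevel values listed above.
const PADES_LABELS = ["B-B", "B-T", "B-LT", "B-LTA"];

// Hypothetical helper: human-readable label for a numeric level.
function padesLabel(level) {
  return PADES_LABELS[level] ?? `unknown (${level})`;
}

// Hypothetical helper: true when `actual` is at least the `required` level.
function meetsLevel(actual, required) {
  return actual >= required;
}

const label = padesLabel(2);
const longTermOk = meetsLevel(3, 2);
```

In real code the first argument would typically be a value read from WasmSignature.padesLevel and the second a PadesLevel constant.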

Feature Availability

Some capabilities are available only when the corresponding Rust build features are enabled. The pdf-oxide-wasm package enables the base set by default; OCR ships in the separate wasm-ocr build.

Feature	WASM	Notes
Text extraction	Yes	Full support
Structured extraction	Yes	Characters, spans, words, lines, tables
PDF creation	Yes	Markdown, HTML, text, images, DocumentBuilder
PDF editing	Yes	Metadata, rotation, dimensions, erasure, pages
Form fields	Yes	Read, write, export, flatten, build
Search	Yes	Full regular-expression support
Encryption	Yes	AES-256 read and write
Annotations	Yes	Read, flatten, redaction, cleanup
PDF merge / split	Yes	Merge pages and split by bookmarks
Embedded files	Yes	Attach files to a PDF
Page labels / XMP	Yes	Read page labels and XMP metadata
Office format round-trip	Yes	DOCX/PPTX/XLSX import and export
Validation	Yes	PDF/A, PDF/UA, PDF/X
Barcodes	Yes (`barcodes`)	1D + QR as SVG or as on-page images
Rendering	Yes (`rendering`)	Page → PNG, flatten to images
Digital signatures	Yes (`signatures`)	Signing, PAdES B-LT, verification, timestamps
OCR	`wasm-ocr` build	tract-based OCR inside WASM; models are loaded by the host

Error Handling

All methods that can fail throw JavaScript Error objects:

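import { WasmPdfDocument } from "pdf-oxide-wasm";

try {
  // These three bytes are not a valid PDF, so opening is expected to throw
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2]));
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}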
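
Where repeated try/catch blocks become noisy, one option is to wrap the throwing call and return a result object instead. This is only a sketch: tryOpen is a hypothetical helper, and the throwing callback stands in for something like () => new WasmPdfDocument(bytes).

```javascript
// Hypothetical helper: run a function that may throw and return a result
// object instead of propagating the exception.
function tryOpen(open) {
  try {
    return { ok: true, doc: open(), error: null };
  } catch (e) {
    // Failures are Error objects; fall back to String(e) for anything else.
    const message = e instanceof Error ? e.message : String(e);
    return { ok: false, doc: null, error: message };
  }
}

// Stand-in for a failing open call.
const openResult = tryOpen(() => {
  throw new Error("not a PDF");
});

if (!openResult.ok) {
  console.error(`Failed to open: ${openResult.error}`);
}
```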

TypeScript

Full type definitions are included in the package:

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

// `bytes` is a Uint8Array holding the PDF file contents, loaded elsewhere
const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");

Other Language Bindings

PDF Oxide provides native bindings for all major ecosystems: Rust, Python, Node.js, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.

Next Steps

Types and Enums: all shared types and enums
Page API Reference: uniform page-by-page iteration across all bindings
Getting Started with WASM: tutorial