What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

JavaScript API Reference

PDF Oxide provides WebAssembly bindings for JavaScript and TypeScript. The npm package pdf-oxide-wasm works in Node.js, browsers, bundlers, Deno, and Cloudflare Workers.

npm install pdf-oxide-wasm

Multi-target packaging (v0.3.38)

pdf-oxide-wasm now ships three builds side by side, selected through package.json conditional exports. Pick the subpath that matches your runtime. In most environments the top-level import ("pdf-oxide-wasm") also resolves to the right build through the exports field.

Subpath	Target
`pdf-oxide-wasm/nodejs`	Node.js (CommonJS + ESM)
`pdf-oxide-wasm/bundler`	Vite, webpack, Rollup, esbuild, Bun
`pdf-oxide-wasm/web`	Browsers, Deno, Cloudflare Workers

// Node.js
import { WasmPdfDocument } from "pdf-oxide-wasm/nodejs";

// Vite / webpack / Rollup
import init, { WasmPdfDocument } from "pdf-oxide-wasm/bundler";
await init();

// Browsers / Deno / Workers
import init, { WasmPdfDocument } from "pdf-oxide-wasm/web";
await init();

This fixes the ReferenceError: Can't find variable: __dirname thrown under browser bundlers prior to v0.3.38.

For the Rust API, see the Rust API Reference. For the Python API, see the Python API Reference. For type details, see Types & Enums.

WasmPdfDocument

The primary class for opening, extracting, editing, and saving PDFs.

import { WasmPdfDocument } from "pdf-oxide-wasm";

Constructor

`new WasmPdfDocument(data)`

Load a PDF document from raw bytes.

Parameter	Type	Description
`data`	`Uint8Array`	The PDF file contents

Throws: Error if the PDF is invalid or cannot be parsed.

import { readFileSync } from "node:fs";

const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

Core Read-Only

`pageCount() -> number`

Get the number of pages in the document.

`version() -> Uint8Array`

Get the PDF version as [major, minor].

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);

`authenticate(password) -> boolean`

Decrypt an encrypted PDF. Returns true if authentication succeeded.

Parameter	Type	Description
`password`	`string`	The password string

`hasStructureTree() -> boolean`

Check if the document is a Tagged PDF with a structure tree.

Text Extraction

`extractText(pageIndex) -> string`

Extract plain text from a single page.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

const text = doc.extractText(0);

`extractAllText() -> string`

Extract plain text from all pages, separated by form feed characters.
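
Since pages are joined with form feed characters, the combined string can be split back into per-page text. A minimal sketch, assuming exactly one form feed ("\f") sits between consecutive pages; the helper name splitPages is hypothetical and not part of the package:

```javascript
// Split the output of extractAllText() into one string per page.
// ASSUMPTION: pages are separated by a single form feed ("\f") character.
function splitPages(allText) {
  return allText.split("\f");
}

// Hand-written string standing in for real extractAllText() output.
const pages = splitPages("First page\fSecond page");
console.log(pages.length); // 2
```

If page text itself can contain form feeds, calling extractText(pageIndex) per page is the safer route.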

`extractChars(pageIndex) -> Array`

Extract individual characters with precise positioning and font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

Returns: Array of objects with fields:

Field	Type	Description
`char`	`string`	The character
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y})`);
}
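
Character boxes can be regrouped into text lines when custom layout handling is needed. The sketch below is one possible approach, not the library's own line detection: it buckets characters whose bbox.y values lie within a tolerance and orders each bucket left to right. The helper name groupCharsIntoLines and the tolerance default are assumptions.

```javascript
// Group character objects (shaped like extractChars() results) into lines.
// Characters whose bbox.y differs by at most `tolerance` share a line.
// Lines are returned in first-seen order; no page origin is assumed.
function groupCharsIntoLines(chars, tolerance = 2) {
  const lines = [];
  for (const c of chars) {
    let line = lines.find((l) => Math.abs(l.y - c.bbox.y) <= tolerance);
    if (!line) {
      line = { y: c.bbox.y, chars: [] };
      lines.push(line);
    }
    line.chars.push(c);
  }
  // Within each line, order characters left to right and join them.
  return lines.map((l) =>
    l.chars
      .sort((a, b) => a.bbox.x - b.bbox.x)
      .map((c) => c.char)
      .join("")
  );
}

// Hand-written sample; real input would come from doc.extractChars(0).
const sampleChars = [
  { char: "i", bbox: { x: 18, y: 700, width: 4, height: 10 } },
  { char: "H", bbox: { x: 10, y: 700, width: 8, height: 10 } },
  { char: "!", bbox: { x: 10, y: 680, width: 4, height: 10 } },
];
const lines = groupCharsIntoLines(sampleChars);
console.log(lines); // [ 'Hi', '!' ]
```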

`extractPageText(pageIndex) -> object`

Get spans, characters, and page dimensions from a single extraction pass. This is more efficient than calling extractSpans() and extractChars() separately.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

Returns: An object with fields:

Field	Type	Description
`spans`	`Array`	Array of span objects
`chars`	`Array`	Array of character objects
`pageWidth`	`number`	Page width in PDF points
`pageHeight`	`number`	Page height in PDF points
`text`	`string`	Full text content

const result = doc.extractPageText(0);
console.log(`Page: ${result.pageWidth}x${result.pageHeight} pt`);
for (const span of result.spans) {
  console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

`extractSpans(pageIndex, config?, readingOrder?) -> Array`

Extract styled text spans with font metadata. Pass "column_aware" as readingOrder for multi-column PDFs.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number
`config`	`object \| undefined`	Optional span merging config
`readingOrder`	`string \| undefined`	Reading order: `"column_aware"` or `undefined` for default

Returns: Array of objects with fields:

Field	Type	Description
`text`	`string`	The text content
`bbox`	`{x, y, width, height}`	Bounding box
`fontName`	`string`	Font name
`fontSize`	`number`	Font size in points
`fontWeight`	`string`	Weight (Normal, Bold, etc.)
`isItalic`	`boolean`	Italic flag
`isMonospace`	`boolean`	Whether the font is fixed-width
`charWidths`	`number[]`	Per-glyph advance widths
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
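
Span font sizes make simple style analysis possible. As an illustration only, and not the heuristic that detectHeadings uses internally, the sketch below treats the font size that covers the most text as the body size and returns spans set larger than it. The helper name headingCandidates is hypothetical.

```javascript
// Return spans whose fontSize exceeds the dominant (body) font size.
// The body size is the size that covers the most characters of text.
function headingCandidates(spans) {
  const charsPerSize = new Map();
  for (const s of spans) {
    charsPerSize.set(s.fontSize, (charsPerSize.get(s.fontSize) || 0) + s.text.length);
  }
  let bodySize = 0;
  let most = -1;
  for (const [size, count] of charsPerSize) {
    if (count > most) {
      most = count;
      bodySize = size;
    }
  }
  return spans.filter((s) => s.fontSize > bodySize);
}

// Hand-written sample; real input would come from doc.extractSpans(0).
const sampleSpans = [
  { text: "Introduction", fontSize: 18 },
  { text: "This is the first paragraph of body text.", fontSize: 11 },
  { text: "More body text follows here.", fontSize: 11 },
];
const headings = headingCandidates(sampleSpans).map((s) => s.text);
console.log(headings); // [ 'Introduction' ]
```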

Format Conversion

`toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string`

Convert a single page to Markdown.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`detectHeadings`	`boolean`	`true`	Detect headings from font size
`includeImages`	`boolean`	`true`	Include images
`includeFormFields`	`boolean`	`true`	Include form field values

`toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string`

Convert all pages to Markdown.
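
For pipelines that index content page by page, the single-page method can be looped instead of using the all-pages variant. A minimal sketch using only pageCount() and toMarkdown(pageIndex); the helper name markdownChunks and the stub object are assumptions made so the example runs without a PDF:

```javascript
// Build one Markdown chunk per page from any object exposing
// pageCount() and toMarkdown(pageIndex), as WasmPdfDocument does.
function markdownChunks(doc) {
  const chunks = [];
  for (let i = 0; i < doc.pageCount(); i++) {
    chunks.push({ page: i, markdown: doc.toMarkdown(i) });
  }
  return chunks;
}

// Stub standing in for a real WasmPdfDocument.
const stubDoc = {
  pageCount: () => 2,
  toMarkdown: (i) => `# Page ${i + 1}`,
};
const chunks = markdownChunks(stubDoc);
console.log(chunks.length); // 2
```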

`toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Convert a single page to HTML.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`preserveLayout`	`boolean`	`false`	Preserve visual layout
`detectHeadings`	`boolean`	`true`	Detect headings
`includeFormFields`	`boolean`	`true`	Include form field values

`toHtmlAll(preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Convert all pages to HTML.

`toPlainText(pageIndex) -> string`

Convert a single page to plain text.

`toPlainTextAll() -> string`

Convert all pages to plain text.

Search

`search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Search for text across all pages.

Parameter	Type	Default	Description
`pattern`	`string`	–	Search pattern, treated as a regex unless `literal` is `true`
`caseInsensitive`	`boolean`	`false`	Case-insensitive search
`literal`	`boolean`	`false`	Treat pattern as literal string
`wholeWord`	`boolean`	`false`	Match whole words only
`maxResults`	`number`	–	Maximum results to return

Returns: Array of objects with fields:

Field	Type	Description
`page`	`number`	Page number
`text`	`string`	Matched text
`bbox`	`object`	Bounding box
`startIndex`	`number`	Start index in page text
`endIndex`	`number`	End index in page text
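
Search results arrive as a flat array, so grouping them by page is a common follow-up step. A small sketch over objects with the page and text fields listed above; the helper name matchesByPage is hypothetical, and the sample data is hand-written rather than real search output:

```javascript
// Group matched text by page number.
function matchesByPage(results) {
  const byPage = new Map();
  for (const r of results) {
    if (!byPage.has(r.page)) byPage.set(r.page, []);
    byPage.get(r.page).push(r.text);
  }
  return byPage;
}

// Hand-written sample; real input would come from doc.search("invoice", true).
const grouped = matchesByPage([
  { page: 0, text: "Invoice" },
  { page: 2, text: "invoice" },
  { page: 0, text: "INVOICE" },
]);
console.log(grouped.get(0).length); // 2
```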

`searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Search for text within a single page.

Image Info

`extractImages(pageIndex) -> Array`

Get image metadata for a page. Returns an array of objects:

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`colorSpace`	`string`	Color space (e.g. `DeviceRGB`)
`bitsPerComponent`	`number`	Bits per color channel
`bbox`	`object`	Position on page

`extractImageBytes(pageIndex) -> Array`

Extract raw image bytes from a page. Returns an array of objects:

Field	Type	Description
`width`	`number`	Image width in pixels
`height`	`number`	Image height in pixels
`data`	`Uint8Array`	Raw image bytes
`format`	`string`	Image format

`pageImages(pageIndex) -> Array`

Get image names and bounds for positioning operations. Returns an array of objects:

Field	Type	Description
`name`	`string`	XObject name
`bounds`	`number[]`	`[x, y, width, height]`
`matrix`	`number[]`	Transform matrix `[a, b, c, d, e, f]`
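
The six-number matrix can be applied to a point by hand. The sketch below assumes the values follow the standard PDF transformation-matrix convention, where an image occupies the unit square before the matrix is applied; the helper name applyMatrix is hypothetical.

```javascript
// Apply a PDF transform matrix [a, b, c, d, e, f] to a point (x, y):
//   x' = a*x + c*y + e
//   y' = b*x + d*y + f
// ASSUMPTION: `matrix` uses the standard PDF convention.
function applyMatrix([a, b, c, d, e, f], x, y) {
  return [a * x + c * y + e, b * x + d * y + f];
}

// This matrix scales the unit square to 200x100 points and moves its
// lower-left corner to (50, 600).
const matrix = [200, 0, 0, 100, 50, 600];
const lowerLeft = applyMatrix(matrix, 0, 0);
const upperRight = applyMatrix(matrix, 1, 1);
console.log(lowerLeft, upperRight); // [ 50, 600 ] [ 250, 700 ]
```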

Document Structure

`getOutline() -> Array | null`

Get document bookmarks / table of contents. Returns null if no outline exists.

`getAnnotations(pageIndex) -> Array`

Get annotation metadata (type, rect, contents, etc.) for a page.

`extractPaths(pageIndex) -> Array`

Get vector paths (lines, curves, shapes) from a page.

`pageLabels() -> Array`

Get page label ranges. Returns an array of objects:

Field	Type	Description
`startPage`	`number`	First page in this range
`style`	`string`	Numbering style
`prefix`	`string`	Label prefix
`startValue`	`number`	Starting number

`xmpMetadata() -> object | null`

Get XMP metadata. Returns null if not present. Object fields include:

Field	Type	Description
`dcTitle`	`string \| null`	Document title
`dcCreator`	`string[] \| null`	Creator list
`dcDescription`	`string \| null`	Description
`xmpCreatorTool`	`string \| null`	Creator tool
`xmpCreateDate`	`string \| null`	Creation date
`xmpModifyDate`	`string \| null`	Modification date
`pdfProducer`	`string \| null`	PDF producer

Form Fields

`getFormFields() -> Array`

Get all form fields with name, type, value, and flags. Returns an array of objects:

Field	Type	Description
`name`	`string`	Field name
`fieldType`	`string`	Field type (text, checkbox, etc.)
`value`	`string`	Current value
`flags`	`number`	Field flags

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.fieldType}) = ${f.value}`);
}

`hasXfa() -> boolean`

Check if the document contains XFA forms.

`getFormFieldValue(name) -> any`

Get a form field value by name. Returns a string, boolean, or null depending on the field type.

Parameter	Type	Description
`name`	`string`	Field name

`setFormFieldValue(name, value) -> void`

Set a form field value by name.

Parameter	Type	Description
`name`	`string`	Field name
`value`	`string \| boolean`	New field value
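
Several fields can be filled from a plain object by combining getFormFields() and setFormFieldValue(). The sketch below is illustrative: the helper name fillForm is hypothetical, and a stub replaces the real document so the example runs on its own.

```javascript
// Set form fields from a { name: value } object.
// Returns the names that do not exist in the document instead of setting them.
function fillForm(doc, values) {
  const known = new Set(doc.getFormFields().map((f) => f.name));
  const skipped = [];
  for (const [name, value] of Object.entries(values)) {
    if (known.has(name)) {
      doc.setFormFieldValue(name, value);
    } else {
      skipped.push(name);
    }
  }
  return skipped;
}

// Stub standing in for a real WasmPdfDocument.
const stored = {};
const stubForm = {
  getFormFields: () => [{ name: "email" }, { name: "subscribe" }],
  setFormFieldValue: (name, value) => {
    stored[name] = value;
  },
};
const skipped = fillForm(stubForm, { email: "a@example.com", subscribe: true, fax: "n/a" });
console.log(skipped); // [ 'fax' ]
```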

`exportFormData(format?) -> Uint8Array`

Export form data as FDF (default) or XFDF.

Parameter	Type	Default	Description
`format`	`string`	`"fdf"`	Export format: `"fdf"` or `"xfdf"`

Editing

Metadata

Method	Parameters	Description
`setTitle(title)`	`string`	Set document title
`setAuthor(author)`	`string`	Set document author
`setSubject(subject)`	`string`	Set document subject
`setKeywords(keywords)`	`string`	Set document keywords

Page Rotation

Method	Parameters	Description
`pageRotation(pageIndex)`	`number`	Get current rotation (0, 90, 180, 270)
`setPageRotation(pageIndex, degrees)`	`number, number`	Set absolute rotation
`rotatePage(pageIndex, degrees)`	`number, number`	Add to current rotation
`rotateAllPages(degrees)`	`number`	Rotate all pages
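
The difference between setting an absolute rotation and adding to the current one comes down to simple modular arithmetic. The helper below is a hypothetical illustration of that arithmetic, useful for computing the value to pass to setPageRotation(); it does not describe how the library normalizes angles internally.

```javascript
// Combine a current rotation with a delta and normalize to 0, 90, 180, or 270.
function normalizedRotation(current, delta) {
  if (delta % 90 !== 0) {
    throw new RangeError("rotation must be a multiple of 90 degrees");
  }
  // The double modulo keeps the result non-negative for negative deltas.
  return (((current + delta) % 360) + 360) % 360;
}

console.log(normalizedRotation(270, 90)); // 0
console.log(normalizedRotation(0, -90)); // 270
```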

Page Dimensions

Method	Parameters	Description
`pageMediaBox(pageIndex)`	`number`	Get MediaBox `[llx, lly, urx, ury]`
`setPageMediaBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set MediaBox
`pageCropBox(pageIndex)`	`number`	Get CropBox (may be null)
`setPageCropBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set CropBox
`cropMargins(left, right, top, bottom)`	`number, ...`	Crop all page margins

Erase / Whiteout

Method	Parameters	Description
`eraseRegion(pageIndex, llx, lly, urx, ury)`	`number, ...`	Erase a region
`eraseRegions(pageIndex, rects)`	`number, Float32Array`	Erase multiple regions
`clearEraseRegions(pageIndex)`	`number`	Clear pending erases
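
eraseRegions() takes its rectangles as a Float32Array. The reference does not spell out the layout, so the sketch below assumes four floats per rectangle in the same [llx, lly, urx, ury] order that eraseRegion() uses; verify this against the package's type definitions before relying on it. The helper name packRects is hypothetical.

```javascript
// Flatten rectangle objects into a Float32Array.
// ASSUMPTION: four floats per rectangle, ordered [llx, lly, urx, ury].
function packRects(rects) {
  const out = new Float32Array(rects.length * 4);
  rects.forEach((r, i) => out.set([r.llx, r.lly, r.urx, r.ury], i * 4));
  return out;
}

const packed = packRects([
  { llx: 0, lly: 0, urx: 100, ury: 20 },
  { llx: 50, lly: 700, urx: 300, ury: 720 },
]);
console.log(packed.length); // 8
// With a real document: doc.eraseRegions(0, packed);
```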

Annotations & Redaction

Method	Parameters	Description
`flattenPageAnnotations(pageIndex)`	`number`	Flatten annotations on page
`flattenAllAnnotations()`	–	Flatten all annotations
`applyPageRedactions(pageIndex)`	`number`	Apply redactions on page
`applyAllRedactions()`	–	Apply all redactions

Form Flattening

Method	Parameters	Description
`flattenForms()`	–	Flatten all form fields into page content
`flattenFormsOnPage(pageIndex)`	`number`	Flatten forms on a specific page

Merge & Embed

`mergeFrom(data) -> number`

Merge pages from another PDF. Returns the number of pages merged.

Parameter	Type	Description
`data`	`Uint8Array`	The source PDF file bytes

`embedFile(name, data) -> void`

Attach a file to the PDF.

Parameter	Type	Description
`name`	`string`	Filename for the attachment
`data`	`Uint8Array`	File contents

Image Manipulation

Method	Parameters	Description
`repositionImage(pageIndex, name, x, y)`	`number, string, number, number`	Move image
`resizeImage(pageIndex, name, w, h)`	`number, string, number, number`	Resize image
`setImageBounds(pageIndex, name, x, y, w, h)`	`number, string, ...`	Set image bounds
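
When resizing, the new height usually has to be derived from the current bounds to keep the aspect ratio. A small sketch that takes a bounds array in the [x, y, width, height] form returned by pageImages(); the helper name scaleToWidth is hypothetical.

```javascript
// Compute a size with the requested width that keeps the aspect ratio.
// `bounds` is [x, y, width, height], as in pageImages() results.
function scaleToWidth(bounds, targetWidth) {
  const [, , width, height] = bounds;
  if (width <= 0) throw new RangeError("image width must be positive");
  return { w: targetWidth, h: (height * targetWidth) / width };
}

const size = scaleToWidth([72, 500, 200, 100], 300);
console.log(size); // { w: 300, h: 150 }
// With a real document: doc.resizeImage(0, image.name, size.w, size.h);
```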

Rendering

Method	Parameters	Returns	Description
`renderPage(pageIndex, dpi?)`	`number, number`	`Uint8Array`	Render a page to PNG bytes
`flattenToImages(dpi?)`	`number`	`Uint8Array`	Flatten all pages to image-based PDF

Save

`save() -> Uint8Array`

Save the edited PDF as bytes. saveToBytes() is available as an alias.

`saveEncryptedToBytes(password, ownerPassword?, allowPrint?, allowCopy?, allowModify?, allowAnnotate?) -> Uint8Array`

Save with AES-256 encryption.

Parameter	Type	Default	Description
`password`	`string`	–	User password
`ownerPassword`	`string`	user password	Owner password
`allowPrint`	`boolean`	`true`	Allow printing
`allowCopy`	`boolean`	`true`	Allow copying
`allowModify`	`boolean`	`false`	Allow modification
`allowAnnotate`	`boolean`	`true`	Allow annotations
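
Six positional parameters are easy to mix up, so a thin options wrapper can help. The sketch below builds the argument list in the documented order with the documented defaults; the helper name encryptionArgs is hypothetical and not part of the package.

```javascript
// Build the positional arguments for saveEncryptedToBytes() from an options
// object, applying the defaults listed in the parameter table.
function encryptionArgs(options) {
  const {
    password,
    ownerPassword = password,
    allowPrint = true,
    allowCopy = true,
    allowModify = false,
    allowAnnotate = true,
  } = options;
  if (!password) throw new TypeError("password is required");
  return [password, ownerPassword, allowPrint, allowCopy, allowModify, allowAnnotate];
}

const args = encryptionArgs({ password: "s3cret", allowCopy: false });
console.log(args); // [ 's3cret', 's3cret', true, false, false, true ]
// With a real document: const encrypted = doc.saveEncryptedToBytes(...args);
```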

`free()`

Release WASM memory. Always call this when done with the document.
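
A try/finally wrapper is one way to make sure free() runs even when processing throws. A minimal sketch for synchronous callbacks; the helper name withDocument is hypothetical, and a stub replaces the real document so the example is self-contained:

```javascript
// Run `fn` with a document and always call free() afterwards, even on error.
// Handles synchronous callbacks only.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    doc.free();
  }
}

// Stub standing in for a real WasmPdfDocument.
let freed = false;
const stub = {
  pageCount: () => 3,
  free: () => {
    freed = true;
  },
};
const count = withDocument(stub, (d) => d.pageCount());
console.log(count, freed); // 3 true
```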

WasmPdf

Factory class for creating new PDFs.

import { WasmPdf } from "pdf-oxide-wasm";

Static Methods

`WasmPdf.fromMarkdown(content, title?, author?) -> WasmPdf`

Create a PDF from Markdown text.

Parameter	Type	Default	Description
`content`	`string`	–	Markdown content
`title`	`string`	–	Document title
`author`	`string`	–	Document author

`WasmPdf.fromHtml(content, title?, author?) -> WasmPdf`

Create a PDF from HTML.

`WasmPdf.fromText(content, title?, author?) -> WasmPdf`

Create a PDF from plain text.

`WasmPdf.fromImageBytes(data) -> WasmPdf`

Create a single-page PDF from image bytes.

Parameter	Type	Description
`data`	`Uint8Array`	Image file bytes (JPEG, PNG)

`WasmPdf.fromMultipleImageBytes(imagesArray) -> WasmPdf`

Create a multi-page PDF from multiple images, one page per image.

Parameter	Type	Description
`imagesArray`	`Uint8Array[]`	Array of image file bytes

Instance Methods

`toBytes() -> Uint8Array`

Get the PDF as bytes.

`size -> number`

PDF size in bytes (readonly property).

import { writeFileSync } from "node:fs";

const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
console.log(`PDF size: ${pdf.size} bytes`);
writeFileSync("output.pdf", pdf.toBytes());

Feature Availability

Some features require native dependencies and are not available in WebAssembly:

Feature	WASM	Notes
Text extraction	Yes	Full support
Structured extraction	Yes	Chars, spans
PDF creation	Yes	Markdown, HTML, text, images
PDF editing	Yes	Metadata, rotation, dimensions, erase
Form fields	Yes	Read, write, export, flatten
Search	Yes	Full regex support
Encryption	Yes	AES-256 read and write
Annotations	Yes	Read, flatten, redact
Merge PDFs	Yes	Merge pages from another PDF
Embedded files	Yes	Attach files to PDFs
Page labels	Yes	Read page label ranges
XMP metadata	Yes	Read XMP metadata
OCR	No	Requires native ONNX Runtime
Digital signatures	No	Requires native crypto libraries
Page rendering	No	Requires native tiny-skia
Barcodes	No	Requires native rendering
Office conversion	No	Requires native LibreOffice

Error Handling

All methods that can fail throw JavaScript Error objects:

try {
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2]));
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}

TypeScript

Full type definitions are included in the package:

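import { readFileSync } from "node:fs";
import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

const bytes: Uint8Array = new Uint8Array(readFileSync("document.pdf"));
const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");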