What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

JavaScript API Reference

PDF Oxide provides WebAssembly bindings for JavaScript and TypeScript. The npm package pdf-oxide-wasm runs in Node.js, browsers, bundlers, Deno, and Cloudflare Workers.

npm install pdf-oxide-wasm

Multi-target packaging (v0.3.38)

pdf-oxide-wasm ships three builds side by side and selects between them through conditional exports in package.json. Choose the subpath that matches your runtime. In most environments, the top-level auto-routing import also resolves correctly through the exports field.

Subpath	Target
`pdf-oxide-wasm/nodejs`	Node.js (CommonJS + ESM)
`pdf-oxide-wasm/bundler`	Vite, webpack, Rollup, esbuild, Bun
`pdf-oxide-wasm/web`	Browsers, Deno, Cloudflare Workers

// Node.js
import { WasmPdfDocument } from "pdf-oxide-wasm/nodejs";

// Vite / webpack / Rollup
import init, { WasmPdfDocument } from "pdf-oxide-wasm/bundler";
await init();

// Browser / Deno / Workers
import init, { WasmPdfDocument } from "pdf-oxide-wasm/web";
await init();

This resolves the ReferenceError: Can't find variable: __dirname that occurred with browser-targeted bundlers before v0.3.38.

For the Rust API, see the Rust API reference; for the Python API, see the Python API reference; for details on types, see Types & Enums.

WasmPdfDocument

The main class for opening PDFs, extracting their content, editing them, and saving them.

import { WasmPdfDocument } from "pdf-oxide-wasm";

Constructor

`new WasmPdfDocument(data)`

Loads a PDF document from raw bytes.

Parameter	Type	Description
`data`	`Uint8Array`	Contents of the PDF file

Throws: Error if the PDF is invalid or cannot be parsed.

import { readFileSync } from "node:fs";

const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

Basics (read-only)

`pageCount() -> number`

Returns the number of pages in the document.

`version() -> Uint8Array`

Returns the PDF version as [major, minor].

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);

`authenticate(password) -> boolean`

Decrypts an encrypted PDF. Returns true if authentication succeeds.

Parameter	Type	Description
`password`	`string`	Password string
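
As a rough sketch of how `authenticate` might be used, the helper below tries a list of candidate passwords and returns the first one that is accepted. `firstWorkingPassword` is a hypothetical name, not part of pdf-oxide-wasm; the only library behavior assumed is that `authenticate(password)` returns a boolean, as described above. The stub object stands in for a real encrypted `WasmPdfDocument`.

```javascript
// Hypothetical helper: try each candidate password until one is accepted.
// `doc` is anything exposing authenticate(password) -> boolean.
function firstWorkingPassword(doc, candidates) {
  for (const password of candidates) {
    if (doc.authenticate(password)) {
      return password;
    }
  }
  return null; // none of the candidates unlocked the document
}

// Stand-in for an encrypted WasmPdfDocument, used only for illustration.
const stubEncryptedDoc = { authenticate: (p) => p === "s3cret" };
console.log(firstWorkingPassword(stubEncryptedDoc, ["", "guest", "s3cret"]));
```

Returning `null` rather than throwing leaves it to the caller to decide whether a locked document is an error.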

`hasStructureTree() -> boolean`

Checks whether the document is a tagged PDF with a structure tree.

Text extraction

`extractText(pageIndex) -> string`

Extracts plain text from a single page.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page index

const text = doc.extractText(0);

`extractAllText() -> string`

Extracts plain text from all pages, separated by form feed characters.
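
Because the pages are joined with form feeds, the combined string can be split back into per-page text. The sketch below assumes the separator is the single character `"\f"`; `splitPages` is a hypothetical helper, and the sample string stands in for the result of `doc.extractAllText()`. Whether a trailing form feed follows the last page is not stated here, so the last element may be an empty string in practice.

```javascript
// Split the combined text back into per-page strings.
// Assumes pages are joined with "\f" (form feed), as described above.
function splitPages(allText) {
  return allText.split("\f");
}

// Illustrative input; in real use this would be doc.extractAllText().
const sampleAllText = "Page one\fPage two\fPage three";
const pageTexts = splitPages(sampleAllText);
pageTexts.forEach((text, i) => console.log(`Page ${i + 1}: ${text}`));
```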

`extractChars(pageIndex) -> Array`

Extracts individual characters with precise position information and font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page index

Returns: Array of objects with fields:

Field	Type	Description
`char`	`string`	Character
`bbox`	`{x, y, width, height}`	Bounding box
`font_name`	`string`	Font name
`font_size`	`number`	Font size (points)
`font_weight`	`string`	Weight (Normal, Bold, and so on)
`is_italic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y})`);
}
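
One use for per-character positions is rebuilding lines of text. The following is a minimal sketch, not the library's own layout logic: it buckets characters by their `bbox.y` value and sorts each bucket by `bbox.x`. `groupCharsIntoLines` and its `tolerance` parameter are hypothetical, and the sample records only mimic the fields listed in the table above.

```javascript
// Group character records into lines by bucketing their y coordinate.
// Assumes each record has { char, bbox: { x, y } } as in the table above.
function groupCharsIntoLines(chars, tolerance = 2) {
  const lines = new Map();
  for (const c of chars) {
    // Characters whose y values fall in the same bucket share a line.
    const key = Math.round(c.bbox.y / tolerance);
    if (!lines.has(key)) lines.set(key, []);
    lines.get(key).push(c);
  }
  // Lines come out in the order their first character was encountered.
  return [...lines.values()].map((line) =>
    line
      .sort((a, b) => a.bbox.x - b.bbox.x) // left to right
      .map((c) => c.char)
      .join("")
  );
}

// Illustrative records; real ones would come from doc.extractChars(0).
const sampleChars = [
  { char: "i", bbox: { x: 16, y: 700 } },
  { char: "H", bbox: { x: 10, y: 700 } },
  { char: "!", bbox: { x: 10, y: 680 } },
];
console.log(groupCharsIntoLines(sampleChars));
```

Fixed-size buckets are crude: two characters on the same visual line can land in adjacent buckets when their y values straddle a boundary, so a production version would likely cluster on baseline distance instead.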

`extractSpans(pageIndex) -> Array`

Extracts styled text spans with font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page index

Returns: Array of objects with fields:

Field	Type	Description
`text`	`string`	Text content
`bbox`	`{x, y, width, height}`	Bounding box
`font_name`	`string`	Font name
`font_size`	`number`	Font size (points)
`font_weight`	`string`	Weight (Normal, Bold, and so on)
`is_italic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const result = doc.extractPageText(0);
console.log(`Page: ${result.pageWidth}x${result.pageHeight} pt`);
for (const span of result.spans) {
  console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

const spans = doc.extractSpans(0);
for (const span of spans) {
  // Field name follows the table above (font_size).
  console.log(`"${span.text}" size=${span.font_size}`);
}
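
Span font sizes can be used to guess which spans are headings. The sketch below is one simple heuristic, assuming the snake_case field names from the table above (`text`, `font_size`); adjust them if your build reports different names. `bodyFontSize`, `headingCandidates`, and the `ratio` threshold are hypothetical and not part of pdf-oxide-wasm.

```javascript
// Find the most common font size, weighted by character count,
// and treat it as the body text size.
function bodyFontSize(spans) {
  const weight = new Map();
  for (const s of spans) {
    weight.set(s.font_size, (weight.get(s.font_size) || 0) + s.text.length);
  }
  let best = null;
  let bestCount = -1;
  for (const [size, count] of weight) {
    if (count > bestCount) {
      best = size;
      bestCount = count;
    }
  }
  return best;
}

// Spans noticeably larger than the body size are heading candidates.
function headingCandidates(spans, ratio = 1.2) {
  const body = bodyFontSize(spans);
  return spans.filter((s) => s.font_size >= body * ratio);
}

// Illustrative spans; real ones would come from doc.extractSpans(0).
const sampleSpans = [
  { text: "Title", font_size: 18 },
  { text: "Some body text here.", font_size: 11 },
  { text: "More body.", font_size: 11 },
];
console.log(headingCandidates(sampleSpans).map((s) => s.text));
```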

Format conversion

`toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts a single page to Markdown.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page index
`detectHeadings`	`boolean`	`true`	Detect headings from font size
`includeImages`	`boolean`	`true`	Include images
`includeFormFields`	`boolean`	`true`	Include form field values
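
Positional boolean flags are easy to mix up at the call site. One option, sketched here under the assumption that `toMarkdown` takes its arguments in the order shown above, is a thin wrapper that accepts a named options object. `pageToMarkdown` is a hypothetical helper, and the stub only records the arguments it receives.

```javascript
// Hypothetical wrapper: named options instead of positional booleans.
// Defaults mirror the table above.
function pageToMarkdown(doc, pageIndex, options = {}) {
  const {
    detectHeadings = true,
    includeImages = true,
    includeFormFields = true,
  } = options;
  return doc.toMarkdown(pageIndex, detectHeadings, includeImages, includeFormFields);
}

// Stand-in for a WasmPdfDocument that records how toMarkdown was called.
const recordedCalls = [];
const stubMarkdownDoc = {
  toMarkdown: (...args) => {
    recordedCalls.push(args);
    return "# stub";
  },
};

// Text-only Markdown for the third page: skip images, keep everything else.
console.log(pageToMarkdown(stubMarkdownDoc, 2, { includeImages: false }));
```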

`toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts all pages to Markdown.

`toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts a single page to HTML.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page index
`preserveLayout`	`boolean`	`false`	Preserve the visual layout
`detectHeadings`	`boolean`	`true`	Detect headings
`includeFormFields`	`boolean`	`true`	Include form field values

`toHtmlAll(preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts all pages to HTML.

`toPlainText(pageIndex) -> string`

Converts a single page to plain text.

`toPlainTextAll() -> string`

Converts all pages to plain text.

Search

`search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text across all pages.

Parameter	Type	Default	Description
`pattern`	`string`	–	Search pattern (string or regular expression)
`caseInsensitive`	`boolean`	`false`	Case-insensitive search
`literal`	`boolean`	`false`	Treat the pattern as a literal string
`wholeWord`	`boolean`	`false`	Match whole words only
`maxResults`	`number`	–	Maximum number of results to return

Returns: Array of objects with fields:

Field	Type	Description
`page`	`number`	Page number
`text`	`string`	Matched text
`bbox`	`object`	Bounding box
`start_index`	`number`	Start index within the page text
`end_index`	`number`	End index within the page text

`searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text within a single page.
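
Search results arrive as a flat array, so a common follow-up step is summarizing them per page. The sketch below assumes each result carries the `page` field described above; `countMatchesByPage` is a hypothetical helper, and the sample array stands in for the return value of `doc.search(...)`.

```javascript
// Count how many matches fall on each page.
// Assumes each result has a numeric `page` field, as in the table above.
function countMatchesByPage(results) {
  const counts = new Map();
  for (const r of results) {
    counts.set(r.page, (counts.get(r.page) || 0) + 1);
  }
  return counts;
}

// Illustrative results; real ones would come from doc.search("invoice").
const sampleResults = [
  { page: 0, text: "invoice" },
  { page: 0, text: "Invoice" },
  { page: 3, text: "invoice" },
];
for (const [page, count] of countMatchesByPage(sampleResults)) {
  console.log(`page ${page}: ${count} match(es)`);
}
```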

Image information

`extractImages(pageIndex) -> Array`

Returns image metadata for a page.

Field	Type	Description
`width`	`number`	Image width (pixels)
`height`	`number`	Image height (pixels)
`color_space`	`string`	Color space (for example, `DeviceRGB`)
`bits_per_component`	`number`	Bits per color channel
`bbox`	`object`	Position on the page

`extractImageBytes(pageIndex) -> Array`

Extracts raw image bytes from a page. Returns an array of objects:

Field	Type	Description
`width`	`number`	Image width (pixels)
`height`	`number`	Image height (pixels)
`data`	`Uint8Array`	Raw image bytes
`format`	`string`	Image format

`pageImages(pageIndex) -> Array`

Returns image names and bounds for use with the positioning operations.

Field	Type	Description
`name`	`string`	XObject name
`bounds`	`number[]`	`[x, y, width, height]`
`matrix`	`number[]`	Transformation matrix `[a, b, c, d, e, f]`

Document structure

`getOutline() -> Array | null`

Returns the document's bookmarks (table of contents). Returns null if the document has no outline.

`getAnnotations(pageIndex) -> Array`

Returns annotation metadata for a page (type, rectangle, contents, and so on).

`extractPaths(pageIndex) -> Array`

Returns the vector paths (lines, curves, shapes) on a page.

`pageLabels() -> Array`

Returns the page label ranges as an array of objects:

Field	Type	Description
`start_page`	`number`	First page of this range
`style`	`string`	Numbering style
`prefix`	`string`	Label prefix
`start_value`	`number`	Starting number

`xmpMetadata() -> object | null`

Returns the XMP metadata, or null if the document has none. The object has the following fields:

Field	Type	Description
`dc_title`	`string \| null`	Document title
`dc_creator`	`string[] \| null`	List of creators
`dc_description`	`string \| null`	Description
`xmp_creator_tool`	`string \| null`	Creator tool
`xmp_create_date`	`string \| null`	Creation date
`xmp_modify_date`	`string \| null`	Modification date
`pdf_producer`	`string \| null`	PDF producer

Form fields

`getFormFields() -> Array`

Returns all form fields with name, type, value, and flags.

Field	Type	Description
`name`	`string`	Field name
`field_type`	`string`	Field type (text, checkbox, etc.)
`value`	`string`	Current value
`flags`	`number`	Field flags

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.field_type}) = ${f.value}`);
}

`hasXfa() -> boolean`

Checks whether the document contains an XFA form.

`getFormFieldValue(name) -> any`

Returns the value of a form field by name. Returns a string, boolean, or null depending on the field type.

Parameter	Type	Description
`name`	`string`	Field name

`setFormFieldValue(name, value) -> void`

Sets the value of a form field by name.

Parameter	Type	Description
`name`	`string`	Field name
`value`	`string \| boolean`	New field value
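
Filling several fields at once can be sketched as a loop over `setFormFieldValue`, guarded by the names that `getFormFields` reports. `fillForm` is a hypothetical helper, not part of pdf-oxide-wasm, and the stub below only imitates the two methods it relies on. What the real `setFormFieldValue` does for an unknown field name is not stated here, which is why the sketch checks names first.

```javascript
// Hypothetical helper: set every known field from a plain object and
// return the names that did not match any field in the document.
function fillForm(doc, values) {
  const known = new Set(doc.getFormFields().map((f) => f.name));
  const skipped = [];
  for (const [name, value] of Object.entries(values)) {
    if (known.has(name)) {
      doc.setFormFieldValue(name, value);
    } else {
      skipped.push(name);
    }
  }
  return skipped;
}

// Stand-in for a WasmPdfDocument with two fields, for illustration only.
const storedValues = {};
const stubFormDoc = {
  getFormFields: () => [{ name: "full_name" }, { name: "subscribe" }],
  setFormFieldValue: (name, value) => {
    storedValues[name] = value;
  },
};

const skippedNames = fillForm(stubFormDoc, {
  full_name: "Ada",
  subscribe: true,
  unknown_field: "x",
});
console.log(storedValues, skippedNames);
```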

`exportFormData(format?) -> Uint8Array`

Exports the form data as FDF (the default) or XFDF.

Parameter	Type	Default	Description
`format`	`string`	`"fdf"`	Export format: `"fdf"` or `"xfdf"`

Editing

Metadata

Method	Parameters	Description
`setTitle(title)`	`string`	Set the document title
`setAuthor(author)`	`string`	Set the document author
`setSubject(subject)`	`string`	Set the document subject
`setKeywords(keywords)`	`string`	Set the document keywords

Page rotation

Method	Parameters	Description
`pageRotation(pageIndex)`	`number`	Get the current rotation angle (0, 90, 180, or 270)
`setPageRotation(pageIndex, degrees)`	`number, number`	Set an absolute rotation
`rotatePage(pageIndex, degrees)`	`number, number`	Add to the current rotation
`rotateAllPages(degrees)`	`number`	Rotate all pages

Page size

Method	Parameters	Description
`pageMediaBox(pageIndex)`	`number`	Get the MediaBox as `[llx, lly, urx, ury]`
`setPageMediaBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the MediaBox
`pageCropBox(pageIndex)`	`number`	Get the CropBox (may be null)
`setPageCropBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the CropBox
`cropMargins(left, right, top, bottom)`	`number, ...`	Crop the margins of all pages

Erase / whiteout

Method	Parameters	Description
`eraseRegion(pageIndex, llx, lly, urx, ury)`	`number, ...`	Erase a region
`eraseRegions(pageIndex, rects)`	`number, Float32Array`	Erase multiple regions
`clearEraseRegions(pageIndex)`	`number`	Clear pending erases

Annotations and redaction

Method	Parameters	Description
`flattenPageAnnotations(pageIndex)`	`number`	Flatten the annotations on a page
`flattenAllAnnotations()`	–	Flatten all annotations
`applyPageRedactions(pageIndex)`	`number`	Apply the redactions on a page
`applyAllRedactions()`	–	Apply all redactions

Form flattening

Method	Parameters	Description
`flattenForms()`	–	Flatten all form fields into page content
`flattenFormsOnPage(pageIndex)`	`number`	Flatten the forms on a specific page
Merging and embedding

`mergeFrom(data) -> number`

Merges pages from another PDF. Returns the number of pages merged.

Parameter	Type	Description
`data`	`Uint8Array`	Bytes of the source PDF file

`embedFile(name, data) -> void`

Attaches a file to the PDF.

Parameter	Type	Description
`name`	`string`	File name of the attachment
`data`	`Uint8Array`	File contents

Image manipulation

Method	Parameters	Description
`repositionImage(pageIndex, name, x, y)`	`number, string, number, number`	Move an image
`resizeImage(pageIndex, name, w, h)`	`number, string, number, number`	Resize an image
`setImageBounds(pageIndex, name, x, y, w, h)`	`number, string, ...`	Set the bounds of an image

Rendering

Method	Parameters	Returns	Description
`renderPage(pageIndex, dpi?)`	`number, number`	`Uint8Array`	Render a page to PNG bytes
`flattenToImages(dpi?)`	`number`	`Uint8Array`	Flatten all pages into an image-based PDF

Saving

`saveToBytes() -> Uint8Array`

Saves the edited PDF as bytes.

`saveEncryptedToBytes(password, ownerPassword?, allowPrint?, allowCopy?, allowModify?, allowAnnotate?) -> Uint8Array`

Saves the PDF with AES-256 encryption.

Parameter	Type	Default	Description
`password`	`string`	–	User password
`ownerPassword`	`string`	user password	Owner password
`allowPrint`	`boolean`	`true`	Allow printing
`allowCopy`	`boolean`	`true`	Allow copying
`allowModify`	`boolean`	`false`	Allow modification
`allowAnnotate`	`boolean`	`true`	Allow annotation

`free()`

Frees the WASM memory. Always call this when you are done with the document.
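
Since `free()` has to be called even when an operation throws, one way to make that hard to forget is a small `try`/`finally` wrapper. `withDocument` is a hypothetical helper, not something the package exports; the stub stands in for a real `WasmPdfDocument` and only tracks whether `free()` ran.

```javascript
// Hypothetical helper: run a callback against a document and always
// release its WASM memory afterwards, even if the callback throws.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    doc.free();
  }
}

// Stand-in for a WasmPdfDocument, used only for illustration.
const stubOpenDoc = {
  freed: false,
  pageCount: () => 3,
  free() {
    this.freed = true;
  },
};

const pages = withDocument(stubOpenDoc, (d) => d.pageCount());
console.log(pages, stubOpenDoc.freed);
```

The wrapper is synchronous; a callback that returns a promise would need an `async` variant that awaits the result before calling `free()`.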

WasmPdf

A factory class for creating new PDFs.

import { WasmPdf } from "pdf-oxide-wasm";

Static methods

`WasmPdf.fromMarkdown(content, title?, author?) -> WasmPdf`

Creates a PDF from Markdown text.

Parameter	Type	Default	Description
`content`	`string`	–	Markdown content
`title`	`string`	–	Document title
`author`	`string`	–	Document author

`WasmPdf.fromHtml(content, title?, author?) -> WasmPdf`

Creates a PDF from HTML.

`WasmPdf.fromText(content, title?, author?) -> WasmPdf`

Creates a PDF from plain text.

`WasmPdf.fromImageBytes(data) -> WasmPdf`

Creates a single-page PDF from image bytes.

Parameter	Type	Description
`data`	`Uint8Array`	Bytes of the image file (JPEG, PNG)

`WasmPdf.fromMultipleImageBytes(imagesArray) -> WasmPdf`

Creates a multi-page PDF from multiple images (one page per image).

Parameter	Type	Description
`imagesArray`	`Uint8Array[]`	Array of image file bytes

Instance methods

`toBytes() -> Uint8Array`

Returns the PDF as bytes.

`size -> number`

Size of the PDF in bytes (read-only property).

import { writeFileSync } from "node:fs";

const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
console.log(`PDF size: ${pdf.size} bytes`);
writeFileSync("output.pdf", pdf.toBytes());

Feature availability

Some features require native dependencies and are not available in WebAssembly:

Feature	WASM	Notes
Text extraction	Yes	Fully supported
Structured extraction	Yes	Characters, spans
PDF creation	Yes	Markdown, HTML, text, images
PDF editing	Yes	Metadata, rotation, dimensions, erase
Form fields	Yes	Read, write, export, flatten
Search	Yes	Full regular expression support
Encryption	Yes	AES-256 read and write
Annotations	Yes	Read, flatten, redact
Merge PDFs	Yes	Merge pages from another PDF
Embedded files	Yes	Attach files to a PDF
Page labels	Yes	Read page label ranges
XMP metadata	Yes	Read XMP metadata
OCR	No	Requires native ONNX Runtime
Digital signatures	No	Requires native cryptography libraries
Page rendering	No	Requires native tiny-skia
Barcodes	No	Requires native rendering
Office conversion	No	Requires native LibreOffice

Error handling

Every method that can fail throws a JavaScript Error object:

try {
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2]));
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}

TypeScript

Full type definitions are included in the package:

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");