What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Node.js API Reference

The pdf-oxide npm package provides a native N-API addon with full TypeScript type definitions. Prebuilt platform binaries ship via per-platform subpackages.

npm install pdf-oxide

For the WASM build targeting browsers / Deno / Bun / edge runtimes, see WASM API Reference. For other languages see Python, Go, C#, or Rust.

Package exports

import {
  PdfDocument,
  Pdf,
  DocumentEditor,
  OcrEngine,
  SearchManager,
  SearchStream,
} from "pdf-oxide";

All classes implement Symbol.dispose where appropriate, so a `using` declaration (Node.js 22+) cleans them up automatically.
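On runtimes or toolchains without `using` support, the same cleanup can be approximated with try/finally. The sketch below is not part of pdf-oxide: `Closable` and `withClosable` are hypothetical names, and the only assumption about the library is the `close(): void` method listed on this page.

```typescript
// Hypothetical helper, not part of pdf-oxide. It assumes only that the
// resource exposes close(): void, as the classes on this page do.
interface Closable {
  close(): void;
}

function withClosable<T extends Closable, R>(resource: T, fn: (r: T) => R): R {
  try {
    return fn(resource);
  } finally {
    // Runs whether fn returns normally or throws.
    resource.close();
  }
}
```

Assuming the constructor shown below, a call might look like `withClosable(new PdfDocument("report.pdf"), (doc) => doc.getPageCount())`. The helper only covers synchronous callbacks; a callback that returns a Promise would need an async variant that awaits before closing.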

PdfDocument

Read-only access to PDF documents.

Constructors

new PdfDocument(path: string)
PdfDocument.openFromBytes(data: Buffer | Uint8Array): PdfDocument
PdfDocument.openWithPassword(path: string, password: string): PdfDocument

Document info

getPageCount(): number
getVersion(): { major: number; minor: number }
hasStructureTree(): boolean
authenticate(password: string): boolean
close(): void

Pages (v0.3.34)

page(index: number): PdfPage           // negative indexing supported
[Symbol.iterator](): Iterator<PdfPage> // for (const p of doc) { ... }

PdfPage is a lightweight handle that caches width, height, and rotation, and dispatches extraction calls to the parent document:

class PdfPage {
  readonly index: number;
  readonly width: number;
  readonly height: number;
  readonly rotation: number;

  text(): string;
  markdown(): string;
  html(): string;
  plainText(): string;
  words(): Word[];
  lines(): TextLine[];
  tables(): Table[];
  images(): ImageInfo[];
  paths(): Path[];
  annotations(): AnnotationInfo[];
  fonts(): FontInfo[];
  search(query: string, caseSensitive?: boolean): SearchResult[];
}
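The `page(index)` listing above notes that negative indexing is supported. The sketch below shows one common convention (Python-style, where -1 is the last page). Whether pdf-oxide resolves or rejects out-of-range indices this way is an assumption, and `normalizePageIndex` is a hypothetical helper, not a library function.

```typescript
// Hypothetical helper: resolves a possibly negative page index against a
// page count, Python-style (-1 is the last page). How the library treats
// out-of-range values is not documented here; this sketch throws.
function normalizePageIndex(index: number, pageCount: number): number {
  if (!Number.isInteger(index)) {
    throw new TypeError(`page index must be an integer, got ${index}`);
  }
  const resolved = index < 0 ? pageCount + index : index;
  if (resolved < 0 || resolved >= pageCount) {
    throw new RangeError(`page index ${index} out of range for ${pageCount} pages`);
  }
  return resolved;
}
```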

Text extraction

extractText(pageIndex: number): string
extractTextAsync(pageIndex: number): Promise<string>
extractAllText(): string
toMarkdown(pageIndex: number, options?: { detectHeadings?: boolean; includeImages?: boolean }): string
toMarkdownAll(): string
toHtml(pageIndex: number): string
toHtmlAll(): string
toPlainText(pageIndex: number): string
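When per-page text is needed rather than the single string from extractAllText(), the page-level calls can be looped. This is a minimal sketch: `TextSource` is a local structural type that mirrors two signatures listed on this page, and `extractTextByPage` is a hypothetical helper.

```typescript
// Structural subset of PdfDocument, using only signatures listed on this page.
interface TextSource {
  getPageCount(): number;
  extractText(pageIndex: number): string;
}

// Hypothetical helper: returns one string per page, in page order.
function extractTextByPage(doc: TextSource): string[] {
  const pages: string[] = [];
  const count = doc.getPageCount();
  for (let i = 0; i < count; i++) {
    pages.push(doc.extractText(i));
  }
  return pages;
}
```

Because the helper depends only on the two methods, it should accept a real PdfDocument as well as a test double.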

Structured extraction

extractChars(pageIndex: number): Char[]
extractSpans(pageIndex: number): Span[]
extractWords(pageIndex: number, options?: WordOptions): Word[]
extractTextLines(pageIndex: number, options?: LineOptions): TextLine[]
extractTables(pageIndex: number, config?: TableDetectionConfig): Table[]
extractPaths(pageIndex: number): Path[]
pageLayoutParams(pageIndex: number): ExtractionProfile

WordOptions / LineOptions accept wordGapThreshold, lineGapThreshold, and a profile string (see text extraction).

Region-based

extractTextInRect(pageIndex: number, x: number, y: number, width: number, height: number): string
extractWordsInRect(pageIndex: number, x: number, y: number, width: number, height: number): Word[]
extractImagesInRect(pageIndex: number, x: number, y: number, width: number, height: number): ImageInfo[]
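To illustrate the x, y, width, height rectangle convention used by these methods, the sketch below filters an already-extracted Word[] on the client side. It assumes a word counts as "in" the rectangle only when its box lies fully inside, and that words and rectangles share the same origin corner. The library's own containment rule is not stated on this page, so treat `wordsInside` as a hypothetical approximation rather than a reimplementation.

```typescript
// Mirrors the Word shape from the Types section of this page.
interface Word {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical client-side filter: keeps words whose box lies fully inside rect.
function wordsInside(words: Word[], rect: Rect): Word[] {
  return words.filter(
    (w) =>
      w.x >= rect.x &&
      w.y >= rect.y &&
      w.x + w.width <= rect.x + rect.width &&
      w.y + w.height <= rect.y + rect.height,
  );
}
```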

Images & resources

extractImages(pageIndex: number): ImageInfo[]
getFonts(pageIndex: number): FontInfo[]
getAnnotations(pageIndex: number): AnnotationInfo[]
getFormFields(): FormField[]
getPageElements(pageIndex: number): Element[]
pageInfo(pageIndex: number): PageInfo

Search

searchPage(pageIndex: number, query: string, options?: { caseSensitive?: boolean }): SearchResult[]
searchAll(query: string, options?: { caseSensitive?: boolean }): SearchResult[]
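searchAll() returns a flat array, so results often need grouping by page before display. A minimal sketch, assuming the SearchResult shape from the Types section; `groupByPage` is a hypothetical helper, not a library function.

```typescript
// Mirrors the SearchResult shape from the Types section of this page.
interface SearchResult {
  text: string;
  page: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical helper: buckets hits by page, preserving their original order.
function groupByPage(results: SearchResult[]): Map<number, SearchResult[]> {
  const byPage = new Map<number, SearchResult[]>();
  for (const hit of results) {
    const bucket = byPage.get(hit.page);
    if (bucket) {
      bucket.push(hit);
    } else {
      byPage.set(hit.page, [hit]);
    }
  }
  return byPage;
}
```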

Rendering (optional feature)

renderPage(pageIndex: number, format: "png" | "jpeg"): Buffer
renderPageZoom(pageIndex: number, zoom: number, format: "png" | "jpeg"): Buffer
renderThumbnail(pageIndex: number, width: number, format: "png" | "jpeg"): Buffer

Pdf — creation

Pdf.fromMarkdown(markdown: string): Pdf
Pdf.fromHtml(html: string): Pdf
Pdf.fromText(text: string): Pdf
Pdf.fromImage(path: string): Pdf
Pdf.fromImageBytes(data: Buffer | Uint8Array): Pdf

save(path: string): void
saveAsync(path: string): Promise<void>
toBytes(): Buffer
close(): void

DocumentEditor

DocumentEditor.open(path: string): DocumentEditor
DocumentEditor.openFromBytes(data: Buffer | Uint8Array): DocumentEditor

Metadata

setTitle(title: string): void
setAuthor(author: string): void
setSubject(subject: string): void
setKeywords(keywords: string): void
applyMetadata(meta: Metadata): void

Page operations

rotatePage(pageIndex: number, degrees: 0 | 90 | 180 | 270): void
deletePage(pageIndex: number): void
movePage(from: number, to: number): void
cropMargins(left: number, bottom: number, right: number, top: number): void
eraseRegion(pageIndex: number, x: number, y: number, width: number, height: number): void

Annotations & forms

flattenAnnotations(pageIndex: number): void
flattenAllAnnotations(): void
flattenForms(): void
setFormFieldValue(name: string, value: string): void

Merging

mergeFrom(path: string): number

Save

save(path: string): void
saveAsync(path: string): Promise<void>
saveEncrypted(path: string, userPassword: string, ownerPassword: string): void
toBytes(): Buffer
close(): void

OcrEngine (feature `ocr`)

new OcrEngine()
pageNeedsOcr(doc: PdfDocument, pageIndex: number): boolean
extractText(doc: PdfDocument, pageIndex: number): string
close(): void

Build with --features ocr (see OCR guide).

Streams

new SearchManager(doc: PdfDocument)
new SearchStream(manager: SearchManager, query: string, options?: { caseSensitive?: boolean })
// SearchStream is a standard Node.js Readable in object mode.

See the Node.js streams guide for SearchStream, PageIteratorStream, and TableStream patterns.

Types

interface Char {
  char: string;
  x: number;
  y: number;
  fontSize: number;
  fontName: string;
  bbox: [number, number, number, number];
}

interface Span {
  text: string;
  fontName: string;
  fontSize: number;
  bbox: [number, number, number, number];
}

interface Word {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface TextLine {
  text: string;
  y: number;
  spans: Span[];
}

interface ImageInfo {
  width: number;
  height: number;
  format: "png" | "jpeg" | "tiff";
  colorspace: "rgb" | "gray" | "cmyk" | "indexed";
  bitsPerComponent: number;
  data: Buffer;
}

interface FontInfo {
  name: string;
  type: string;
  encoding: string;
  isEmbedded: boolean;
  isSubset: boolean;
  size: number;
}

interface AnnotationInfo {
  type: string;
  subtype: string;
  content: string;
  x: number; y: number;
  width: number; height: number;
  author: string;
  linkUri?: string;
}

interface FormField {
  name: string;
  fieldType: string;
  value: string;
  pageIndex: number;
}

interface SearchResult {
  text: string;
  page: number;
  x: number; y: number;
  width: number; height: number;
}

interface PageInfo {
  width: number;
  height: number;
  rotation: 0 | 90 | 180 | 270;
  mediaBox: Rect;
  cropBox: Rect;
}

interface Metadata {
  title?: string;
  author?: string;
  subject?: string;
  keywords?: string;
  producer?: string;
  creationDate?: string;
}

interface ExtractionProfile {
  wordGapThreshold: number;
  lineGapThreshold: number;
  columnCount: number;
}

Error handling

All methods throw on failure. Catch with try/catch and inspect err.message:

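try {
  const text = doc.extractText(0);
} catch (err) {
  // In TypeScript, err is typed unknown, so narrow it before reading .message.
  const message = err instanceof Error ? err.message : String(err);
  console.error(`Extraction failed: ${message}`);
}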
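Where many calls need the same handling, the try/catch can be wrapped once and the outcome returned as a value. This is a sketch of that pattern only: `Result` and `attempt` are hypothetical names, and nothing here is provided by pdf-oxide.

```typescript
// Hypothetical result type: either a value or the error message.
type Result<T> = { ok: true; value: T } | { ok: false; message: string };

// Runs fn and converts a thrown error into a Result instead of propagating it.
function attempt<T>(fn: () => T): Result<T> {
  try {
    return { ok: true, value: fn() };
  } catch (err) {
    // Thrown values are not guaranteed to be Error instances.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, message };
  }
}
```

A call such as `attempt(() => doc.extractText(0))` then yields a value that can be checked with `result.ok` instead of a nested try/catch.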

Async method suffix

Every sync method listed above has an *Async variant returning a Promise — extractText → extractTextAsync, save → saveAsync, etc. The async variants dispatch to the libuv thread pool and do not block the event loop. See the async guide.
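Because the async variants return Promises, per-page work can be started together and awaited as a group. A minimal sketch, assuming only the getPageCount() and extractTextAsync() signatures listed on this page; `AsyncTextSource` and `extractAllPagesAsync` are hypothetical names.

```typescript
// Structural subset of PdfDocument, using only signatures listed on this page.
interface AsyncTextSource {
  getPageCount(): number;
  extractTextAsync(pageIndex: number): Promise<string>;
}

// Hypothetical helper: starts extraction for every page, then waits for all.
// Promise.all preserves input order, so the result is in page order.
function extractAllPagesAsync(doc: AsyncTextSource): Promise<string[]> {
  const tasks = Array.from({ length: doc.getPageCount() }, (_, i) =>
    doc.extractTextAsync(i),
  );
  return Promise.all(tasks);
}
```

The libuv thread pool defaults to four threads, so on a large document most of these tasks will queue rather than run at once. Promise.all also rejects as soon as any single page fails; Promise.allSettled is the usual alternative when partial results are acceptable.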

Thread safety

PdfDocument is Send + Sync on the Rust side — safe to share across Node.js Worker threads. See the concurrency guide.

Generated types

TypeScript definitions ship with the package at node_modules/pdf-oxide/index.d.ts — the canonical source of truth for types, including any fields added after this page was last updated.

Added in v0.3.38

`DocumentBuilder` / `FluentPageBuilder` / `EmbeddedFont`

import { DocumentBuilder, EmbeddedFont, StampType } from "pdf-oxide";

const font = await EmbeddedFont.fromFile("DejaVuSans.ttf");
// Alt: EmbeddedFont.fromBytes(data: Uint8Array, name?: string)

const bytes = new DocumentBuilder()
  .registerEmbeddedFont("DejaVu", font)
  .letterPage()         // or .a4Page() / .page(width, height)
    .at(72, 720).font("DejaVu", 12).text("Hello")
    .heading(1, "Title")
    .paragraph("Body text")
    // Annotations
    .linkUrl("https://example.com")
    .linkPage(2)
    .linkNamed("glossary")
    .highlight([1.0, 1.0, 0.0])
    .underline([0.0, 0.0, 1.0])
    .strikeout([1.0, 0.0, 0.0])
    .squiggly([1.0, 0.5, 0.0])
    .stickyNote("Review this")
    .stamp(StampType.Approved)
    .freetext(100, 500, 200, 50, "Comment")
    .watermark("DRAFT")
    .watermarkConfidential()
    .watermarkDraft()
    // AcroForm widgets
    .textField("name", 150, 400, 200, 20, "Jane Doe")
    .checkbox("agree", 72, 380, 15, 15, true)
    .comboBox("country", 150, 360, 200, 20, ["US", "UK"], "US")
    .radioGroup("tier", [
      { value: "free", x: 72, y: 340, w: 15, h: 15 },
      { value: "pro",  x: 120, y: 340, w: 15, h: 15 },
    ], "pro")
    .pushButton("submit", 72, 300, 80, 25, "Submit")
    // Graphics primitives
    .rect(50, 270, 500, 2)
    .filledRect(50, 260, 500, 2, [0.9, 0.9, 0.9])
    .line(50, 250, 550, 250)
  .done()
  .build();
// Alt: await builder.save("out.pdf");
// Alt: await builder.saveEncrypted("out.pdf", "user-pw", "owner-pw");
// Alt: builder.toBytesEncrypted("user-pw", "owner-pw"): Uint8Array;

HTML + CSS pipeline

// Single font
const pdf = Pdf.fromHtmlCss(html, css, fontBytes);

// Multiple fonts, as [family name, font bytes] pairs
const pdfWithFonts = Pdf.fromHtmlCssWithFonts(html, css, [
  ["DejaVu Sans",   font1],
  ["Noto Sans CJK", font2],
]);

Signature verification

for (const sig of doc.signatures()) {
  sig.signerName; sig.reason; sig.location; sig.signingTime;
  sig.verify();                  // "Valid" | "Invalid" | "Unknown"
  sig.verifyDetached(pdfBytes);  // boolean

  const cert = sig.getCertificate();
  cert.subject; cert.issuer; cert.serial;
  cert.notBefore; cert.notAfter; cert.isValid;
}

Timestamp parsing and TsaClient are not yet wired into the Node bindings. To use them, go through the WASM or Rust API instead.

Rendering

const region: Uint8Array = doc.renderPageRegion(0, 72, 200, 468, 300, 0);   // pageIndex, x, y, w, h, format (0=PNG, 1=JPEG)
const fitted: Uint8Array = doc.renderPageFit(0, 1024, 768, 0);              // pageIndex, fit_width, fit_height, format

Multi-target WASM packaging

To use the WASM build (pdf-oxide-wasm) in a Node.js app, import from pdf-oxide-wasm/nodejs. See JavaScript (WASM) API Reference → Multi-target packaging.