What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Node.js)

PDF Oxide is the fastest Node.js PDF library: 0.8ms mean text extraction (5.8× faster than PyMuPDF and 15× faster than pypdf, both Python libraries), with a 100% pass rate on 3,830 PDFs. One package for extracting, creating, and editing PDFs, with TypeScript definitions included. MIT / Apache-2.0 licensed.

Running in a browser, Deno, Bun, or Cloudflare Workers? Use the WASM build instead — same API, no native binaries. The native addon on this page is for Node.js and Electron only.

Installation

npm install pdf-oxide

Requirements: Node.js 18 or newer. No system dependencies. No Rust toolchain required. Prebuilt .node addons for Linux (glibc + musl) x64/arm64, macOS x64/arm64, and Windows x64/arm64 are fetched automatically via platform-specific optionalDependencies — nothing compiles on install.

Opening a PDF

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("research-paper.pdf");
console.log(`Pages: ${doc.getPageCount()}`);

const { major, minor } = doc.getVersion();
console.log(`PDF version: ${major}.${minor}`);

doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("research-paper.pdf");
const pageCount: number = doc.getPageCount();
const { major, minor }: { major: number; minor: number } = doc.getVersion();
console.log(`${pageCount} pages, PDF ${major}.${minor}`);
doc.close();

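On Node.js 22+, declare the document with the using keyword for automatic cleanup (the syntax needs explicit resource management support in the runtime, or transpilation with TypeScript 5.2+):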

{
  using doc = new PdfDocument("report.pdf");
  const text = doc.extractText(0);
} // doc.close() called automatically
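Where using declarations are not available, a try/finally wrapper gives the same guarantee. The following is a minimal sketch, assuming only that the document object exposes close() as in the examples above; withDocument is a hypothetical helper name, not part of pdf-oxide.

```javascript
// Hypothetical helper (not part of pdf-oxide): run `fn` with a document
// and always close it afterwards, even if `fn` throws.
function withDocument(open, fn) {
  const doc = open();      // e.g. () => new PdfDocument("report.pdf")
  try {
    return fn(doc);
  } finally {
    doc.close();           // runs on success and on error
  }
}

// Usage (assumes pdf-oxide is installed):
//   const { PdfDocument } = require("pdf-oxide");
//   const text = withDocument(
//     () => new PdfDocument("report.pdf"),
//     (doc) => doc.extractText(0)
//   );
```

This version is synchronous. For the *Async methods, the helper would need to be an async function that awaits fn(doc) inside the try block, so that close() does not run before the work finishes.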

Page API

Since v0.3.34, PdfDocument is iterable, and doc.page(i) returns a PdfPage with cached width, height, and rotation plus per-page extraction methods.

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
for (const page of doc) {
  console.log(`Page ${page.index}: ${page.width}x${page.height} (rotation ${page.rotation})`);
  const md = page.markdown();
  const tables = page.tables();       // rows + cells with bboxes
}
doc.close();

Indexing: doc.page(0), doc.page(-1) (last page). Page methods: text(), markdown(), html(), plainText(), words(), lines(), tables(), images(), paths(), annotations(), fonts(), search(query, caseSensitive).
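Iteration and the per-page methods combine naturally. As a rough sketch, the helper below lists the pages that contain at least one table. It assumes that doc is iterable, that each page has index and tables() as described above, and that tables() returns an array (the return type is an assumption, not confirmed here). pagesWithTables is a hypothetical name.

```javascript
// Sketch: list the pages that contain at least one table.
// Assumes page.tables() returns an array; that shape is an assumption.
function pagesWithTables(doc) {
  const hits = [];
  for (const page of doc) {
    const tables = page.tables();
    if (tables.length > 0) {
      hits.push({ index: page.index, tableCount: tables.length });
    }
  }
  return hits;
}

// Usage (assumes pdf-oxide is installed):
//   const doc = new PdfDocument("paper.pdf");
//   console.log(pagesWithTables(doc));
//   doc.close();
```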

Text Extraction

Single Page

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const text = doc.extractText(0);
console.log(text);
doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("report.pdf");
const text: string = doc.extractText(0);
console.log(text);
doc.close();

All Pages

const doc = new PdfDocument("book.pdf");
const pageCount = doc.getPageCount();

for (let i = 0; i < pageCount; i++) {
  console.log(`--- Page ${i + 1} ---`);
  console.log(doc.extractText(i));
}

doc.close();

Async Extraction

Every sync method has an *Async counterpart that runs on the libuv thread pool. Use these in HTTP handlers and other concurrent server code so extraction doesn’t block the event loop.

const { PdfDocument } = require("pdf-oxide");

async function extract(path) {
  const doc = new PdfDocument(path);
  try {
    return await doc.extractTextAsync(0);
  } finally {
    doc.close();
  }
}

See the async guide for patterns including Promise.all fan-out over pages.
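A Promise.all fan-out over pages might look like the sketch below. It assumes only getPageCount() and extractTextAsync(i) as shown above; extractAllPagesAsync is a hypothetical helper name.

```javascript
// Sketch: start extraction for every page, then wait for all of them.
// Promise.all preserves input order, so results[i] is the text of page i.
async function extractAllPagesAsync(doc) {
  const pageCount = doc.getPageCount();
  const jobs = Array.from({ length: pageCount }, (_, i) =>
    doc.extractTextAsync(i)
  );
  return Promise.all(jobs);
}

// Usage (assumes pdf-oxide is installed):
//   const doc = new PdfDocument("book.pdf");
//   try {
//     const pages = await extractAllPagesAsync(doc);
//   } finally {
//     doc.close();
//   }
```

Close the document only after the returned promise has settled; closing earlier would pull the document out from under jobs that are still running.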

Structured Extraction

Character-level and span-level data with positions and font metadata:

const chars = doc.extractChars(0);
for (const ch of chars.slice(0, 10)) {
  console.log(`'${ch.char}' at (${ch.x.toFixed(1)}, ${ch.y.toFixed(1)}) ` +
              `size=${ch.fontSize.toFixed(1)} font=${ch.fontName}`);
}

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" font=${span.fontName} size=${span.fontSize}`);
}
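Span metadata can drive simple layout heuristics. The sketch below flags spans set noticeably larger than the most common font size as heading candidates. It uses only the text and fontSize fields from the loop above; headingCandidates and the 1.2 ratio are illustrative choices, not library behaviour.

```javascript
// Sketch: spans whose font size is at least `ratio` times the body size
// are treated as heading candidates. The body size is the font size that
// covers the most characters.
function headingCandidates(spans, ratio = 1.2) {
  if (spans.length === 0) return [];

  // Weight each (rounded) font size by the number of characters it covers.
  const weight = new Map();
  for (const s of spans) {
    const size = Math.round(s.fontSize * 10) / 10;
    weight.set(size, (weight.get(size) || 0) + s.text.length);
  }

  let bodySize = 0;
  let best = -1;
  for (const [size, count] of weight) {
    if (count > best) {
      best = count;
      bodySize = size;
    }
  }
  return spans.filter((s) => s.fontSize >= bodySize * ratio);
}

// Usage: headingCandidates(doc.extractSpans(0)).map((s) => s.text)
```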

Word- and line-level extraction with tunable segmentation:

const words = doc.extractWords(0);
const lines = doc.extractTextLines(0, { wordGapThreshold: 2.5, lineGapThreshold: 1.2 });

Markdown Conversion

JavaScript

const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);

// All pages
const allMd = doc.toMarkdownAll();

TypeScript

const md: string = doc.toMarkdown(0, { detectHeadings: true });
const allMd: string = doc.toMarkdownAll();
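For LLM and RAG pipelines, the Markdown output is usually split into chunks before indexing. One possible approach, sketched below, splits at ATX headings. splitByHeading is a hypothetical helper, not part of pdf-oxide, and it deliberately ignores edge cases such as # lines inside fenced code.

```javascript
// Sketch: split Markdown into { heading, body } chunks at "#"-style headings.
// Text before the first heading becomes a chunk with heading === null.
function splitByHeading(markdown) {
  const chunks = [];
  let current = { heading: null, lines: [] };

  const flush = () => {
    const body = current.lines.join("\n").trim();
    if (current.heading !== null || body !== "") {
      chunks.push({ heading: current.heading, body });
    }
  };

  for (const line of markdown.split(/\r?\n/)) {
    const m = /^(#{1,6})\s+(.*)$/.exec(line);
    if (m) {
      flush();
      current = { heading: m[2].trim(), lines: [] };
    } else {
      current.lines.push(line);
    }
  }
  flush();
  return chunks;
}

// Usage: const chunks = splitByHeading(doc.toMarkdownAll());
```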

HTML Conversion

const html = doc.toHtml(0);
const allHtml = doc.toHtmlAll();

Image Extraction

const { writeFileSync } = require("fs");

const doc = new PdfDocument("brochure.pdf");
const images = doc.extractImages(0);

for (const [i, img] of images.entries()) {
  console.log(`Image ${i}: ${img.width}x${img.height} ${img.format} (${img.data.length} bytes)`);
  writeFileSync(`image_${i}.${img.format}`, img.data);
}

doc.close();

Images extracted from Indexed-color PDFs are automatically expanded to RGB, including 1/2/4/8 bpc indexed palettes with RGB, Grayscale, or CMYK base colour spaces.

Opening from Bytes

Open a PDF from in-memory bytes — useful when downloading from S3, HTTP, or databases:

const { PdfDocument } = require("pdf-oxide");
const { readFileSync } = require("fs");

const bytes = readFileSync("document.pdf");
const doc = PdfDocument.openFromBytes(bytes);
const text = doc.extractText(0);
doc.close();
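For the HTTP case, a download step could be sketched as follows. It uses the global fetch that ships with Node.js 18+; fetchPdfBytes is a hypothetical helper, and the fetchImpl parameter exists only so the function can be exercised without a network.

```javascript
// Sketch: download a PDF and return its bytes as a Buffer.
async function fetchPdfBytes(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`Download failed: HTTP ${res.status}`);
  }
  const bytes = Buffer.from(await res.arrayBuffer());

  // A well-formed PDF begins with the "%PDF-" header; reject obvious
  // non-PDF responses (for example an HTML error page) early.
  if (bytes.subarray(0, 5).toString("latin1") !== "%PDF-") {
    throw new Error("Response does not look like a PDF");
  }
  return bytes;
}

// Usage (assumes pdf-oxide is installed; the URL is a placeholder):
//   const bytes = await fetchPdfBytes("https://example.com/document.pdf");
//   const doc = PdfDocument.openFromBytes(bytes);
```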

Password-Protected PDFs

const doc = PdfDocument.openWithPassword("confidential.pdf", "secret");
const text = doc.extractText(0);
doc.close();

You can also authenticate after opening:

const doc = new PdfDocument("confidential.pdf");
if (doc.authenticate("secret")) {
  console.log(doc.extractText(0));
}
doc.close();

AES-256 (V=5, R=6) PDFs are fully supported, including push-button widget captions. Object caches are correctly invalidated when you authenticate after opening.

PDF Creation

The Pdf class provides factory methods to create PDFs from various source formats.

From Markdown

const { Pdf } = require("pdf-oxide");
const { writeFileSync } = require("fs");

const pdf = Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
writeFileSync("output.pdf", pdf.toBytes());

From HTML

const pdf = Pdf.fromHtml("<h1>Invoice</h1><p>Amount due: $42.00</p>");
writeFileSync("invoice.pdf", pdf.toBytes());

From Plain Text

const pdf = Pdf.fromText("Plain text document.\n\nSecond paragraph.");
writeFileSync("notes.pdf", pdf.toBytes());

From Images

const pdf = Pdf.fromImage("scan.jpg");
writeFileSync("scan.pdf", pdf.toBytes());

Search

const doc = new PdfDocument("manual.pdf");

// Search all pages
const results = doc.searchAll("configuration", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: "${r.text}" at (${r.x.toFixed(0)}, ${r.y.toFixed(0)})`);
}

// Search a single page
const pageResults = doc.searchPage(0, "configuration");
doc.close();
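Search results are plain objects, so they are easy to post-process. As a small sketch, the helper below counts hits per page; it assumes each result carries a numeric page field, as in the loop above. hitsPerPage is a hypothetical name.

```javascript
// Sketch: count search hits per page, returned as [page, count] pairs
// sorted by page number.
function hitsPerPage(results) {
  const counts = new Map();
  for (const r of results) {
    counts.set(r.page, (counts.get(r.page) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => a[0] - b[0]);
}

// Usage: hitsPerPage(doc.searchAll("configuration", { caseSensitive: false }))
```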

For streaming search over large documents, use SearchStream:

const { PdfDocument, SearchStream, SearchManager } = require("pdf-oxide");

const doc = new PdfDocument("large.pdf");
const manager = new SearchManager(doc);
const stream = new SearchStream(manager, "invoice");

stream.on("data", (r) => console.log(`page ${r.pageIndex + 1}: ${r.text}`));
stream.on("end", () => doc.close());

See the Node.js streams guide for details.
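One benefit of streaming is being able to stop early. The sketch below gathers results into an array and stops once a limit is reached. It relies on the data and end events shown above; the error event and destroy() are assumptions based on SearchStream behaving like a standard Node.js readable stream. collectResults is a hypothetical helper.

```javascript
// Sketch: collect streamed results into an array, stopping at `limit`.
function collectResults(stream, limit = Infinity) {
  return new Promise((resolve, reject) => {
    const results = [];
    stream.on("data", (r) => {
      if (results.length < limit) results.push(r);
      if (results.length >= limit) {
        stream.destroy();      // assumed: stops further work, as on a Readable
        resolve(results);
      }
    });
    stream.on("end", () => resolve(results));
    stream.on("error", reject);
  });
}

// Usage (assumes pdf-oxide is installed):
//   const firstTen = await collectResults(new SearchStream(manager, "invoice"), 10);
//   doc.close();
```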

Editing

Use DocumentEditor for metadata, page operations, annotations, and form fields:

const { DocumentEditor } = require("pdf-oxide");

const editor = DocumentEditor.open("document.pdf");

// Metadata
editor.setTitle("Updated Title");
editor.setAuthor("Jane Doe");

// Page operations
editor.rotatePage(0, 90);
editor.deletePage(5);
editor.movePage(2, 0);

// Forms
editor.setFormFieldValue("employee.name", "Jane Doe");
editor.flattenForms();

editor.save("edited.pdf");
editor.close();

OCR

Opt into the ocr feature at install time to enable OCR on scanned pages:

npm install pdf-oxide --build-from-source -- --features ocr

const { PdfDocument, OcrEngine } = require("pdf-oxide");

const doc = new PdfDocument("scanned.pdf");
const ocr = new OcrEngine();

if (ocr.pageNeedsOcr(doc, 0)) {
  const text = ocr.extractText(doc, 0);
  console.log(text);
}

ocr.close();
doc.close();

See the OCR guide for end-to-end recipes.

Thread Safety

PdfDocument is thread-safe (Send + Sync in Rust terms), so you can share a single document across worker threads for parallel page reads. The *Async method family does this automatically using the libuv thread pool; see the concurrency guide for manual worker patterns.
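Because the libuv thread pool defaults to four threads (configurable through UV_THREADPOOL_SIZE), queuing hundreds of page jobs at once mostly adds memory pressure. A bounded fan-out can be sketched as below. mapPagesLimited is a hypothetical helper that works with any async per-page task.

```javascript
// Sketch: run `task(i)` for every page index with at most `limit` in flight.
async function mapPagesLimited(pageCount, limit, task) {
  const results = new Array(pageCount);
  let next = 0;

  async function worker() {
    while (next < pageCount) {
      const i = next++;        // no race: this line runs synchronously
      results[i] = await task(i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, pageCount));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Usage (assumes pdf-oxide is installed):
//   const pages = await mapPagesLimited(doc.getPageCount(), 4,
//     (i) => doc.extractTextAsync(i));
```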

Error Handling

All methods throw on failure:

const { PdfDocument } = require("pdf-oxide");

let doc;
try {
  doc = new PdfDocument("document.pdf");
  const text = doc.extractText(0);
} catch (err) {
  console.error(`Extraction failed: ${err.message}`);
} finally {
  // Release the document even when extraction throws.
  if (doc) doc.close();
}

Next Steps

Python Getting Started — using PDF Oxide from Python
WASM Getting Started — browser / Deno / Bun / edge runtimes
Node.js API Reference — full native API documentation
Async Guide — *Async methods and Promise.all patterns
Node.js Streams — SearchStream and friends
Text Extraction — detailed extraction options