What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which makes it well suited to LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (WASM)

PDF Oxide compiles to WebAssembly for browsers, Deno, Bun, and edge runtimes (Cloudflare Workers, Vercel Edge). The same Rust core that powers the Python, Rust, Node.js, Go, and C# bindings runs directly in any JavaScript environment with near-native performance.

Using Node.js? For server-side Node.js prefer the native pdf-oxide N-API addon — it’s faster and supports OCR, rendering, and signatures. The WASM build on this page is the right choice for browsers and edge runtimes where native addons can’t load.

Installation

npm install pdf-oxide-wasm

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

Quick Start

Node.js

import { readFileSync } from "fs";
import { WasmPdfDocument } from "pdf-oxide-wasm";

const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

console.log(`Pages: ${doc.pageCount()}`);
console.log(doc.extractText(0));

doc.free();

Browser

<script type="module">
import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();

const response = await fetch("document.pdf");
const bytes = new Uint8Array(await response.arrayBuffer());
const doc = new WasmPdfDocument(bytes);

console.log(`Pages: ${doc.pageCount()}`);
console.log(doc.extractText(0));
doc.free();
</script>

Browser with File Input

<input type="file" id="pdfInput" accept=".pdf" />
<pre id="output"></pre>

<script type="module">
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();

document.getElementById("pdfInput").addEventListener("change", async (e) => {
  const file = e.target.files[0];
  if (!file) return; // the file dialog was cancelled

  const bytes = new Uint8Array(await file.arrayBuffer());
  const doc = new WasmPdfDocument(bytes);
  try {
    const pageCount = doc.pageCount();
    let result = `Pages: ${pageCount}\n\n`;
    for (let i = 0; i < pageCount; i++) {
      result += `--- Page ${i + 1} ---\n`;
      result += doc.extractText(i) + "\n\n";
    }
    document.getElementById("output").textContent = result;
  } finally {
    doc.free(); // release the Rust-side memory even if extraction throws
  }
});
</script>

Text Extraction

Single Page

const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0);

All Pages

const allText = doc.extractAllText();
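
When page boundaries matter (for example, to report which page a passage came from), a per-page loop is an alternative to extractAllText(). The following is a minimal sketch: extractPages is a hypothetical helper name, not part of pdf-oxide-wasm, and it assumes only the pageCount() and extractText(index) calls used in the Quick Start.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Returns one entry per page, with a 1-based page number for display.
function extractPages(doc) {
  const pages = [];
  const count = doc.pageCount();
  for (let i = 0; i < count; i++) {
    // extractText takes a 0-based page index, as in the examples above
    pages.push({ page: i + 1, text: doc.extractText(i) });
  }
  return pages;
}

// Usage sketch, assuming `doc` is an open WasmPdfDocument:
// for (const { page, text } of extractPages(doc)) console.log(page, text.length);
```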

Structured Extraction

Get character-level and span-level data with positions and font metadata:

// Character-level data
const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y}) font=${c.fontName}`);
}

// Span-level data
const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
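
As an illustration of what span metadata can be used for, the sketch below picks out the spans set in the largest font on a page, which is often (though not always) the title. largestFontSpans is a hypothetical helper, and it assumes only the text and fontSize fields shown above.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Returns the spans whose fontSize equals the largest fontSize in the input.
function largestFontSpans(spans) {
  let maxSize = -Infinity;
  for (const span of spans) {
    if (span.fontSize > maxSize) maxSize = span.fontSize;
  }
  // An empty input leaves maxSize at -Infinity, so the filter returns []
  return spans.filter((span) => span.fontSize === maxSize);
}

// Usage sketch, assuming `doc` is an open WasmPdfDocument:
// const title = largestFontSpans(doc.extractSpans(0)).map((s) => s.text).join(" ");
```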

Markdown Conversion

const markdown = doc.toMarkdown(0);

// With options
const md = doc.toMarkdown(0, true, true); // detect_headings, include_images

// All pages
const allMarkdown = doc.toMarkdownAll();

HTML Conversion

const html = doc.toHtml(0);

// All pages
const allHtml = doc.toHtmlAll();

PDF Creation

Create new PDFs from Markdown, HTML, or plain text using WasmPdf:

import { writeFileSync } from "fs"; // Node.js only, used by the last step
import { WasmPdf } from "pdf-oxide-wasm";

// From Markdown
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
const pdfBytes = pdf.toBytes(); // Uint8Array

// From HTML
const invoice = WasmPdf.fromHtml("<h1>Invoice</h1><p>Amount: $42</p>");

// From plain text
const notes = WasmPdf.fromText("Plain text content.");

// Save to file (Node.js)
writeFileSync("output.pdf", pdfBytes);
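
In a browser there is no fs module, so the bytes have to reach the user some other way. One common approach, sketched here with standard web APIs rather than anything specific to pdf-oxide-wasm, is to wrap the Uint8Array in a Blob and offer it as a download. toPdfBlob is a hypothetical helper name.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Wraps PDF bytes, such as the Uint8Array from pdf.toBytes(), in a Blob.
function toPdfBlob(pdfBytes) {
  return new Blob([pdfBytes], { type: "application/pdf" });
}

// Browser-only usage sketch (requires a DOM):
// const url = URL.createObjectURL(toPdfBlob(pdf.toBytes()));
// const link = document.createElement("a");
// link.href = url;
// link.download = "output.pdf";
// link.click();
// URL.revokeObjectURL(url);
```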

Form Fields

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.fieldType}) = ${f.value}`);
}

// Export form data
const fdfBytes = doc.exportFormData();        // FDF format
const xfdfBytes = doc.exportFormData("xfdf"); // XFDF format
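
If the goal is to pass form values on as JSON rather than FDF or XFDF, the field list can be folded into a plain object. A minimal sketch, assuming only the name and value properties shown above; formFieldsToObject is a hypothetical helper, and if two fields share a name the later one wins.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Maps each field name to its value; later duplicates overwrite earlier ones.
function formFieldsToObject(fields) {
  return Object.fromEntries(fields.map((f) => [f.name, f.value]));
}

// Usage sketch, assuming `doc` is an open WasmPdfDocument:
// const json = JSON.stringify(formFieldsToObject(doc.getFormFields()));
```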

Search

// Search all pages
const results = doc.search("configuration", true); // case_insensitive
for (const r of results) {
  console.log(`Found "${r.text}" on page ${r.page}`);
}

// Search single page
const pageResults = doc.searchPage(0, "configuration", true);
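
Search results can be summarized per page with a small reducer. The sketch below assumes only the page property shown above and reports whatever page values the library returns (it makes no assumption about whether they are 0-based or 1-based). countMatchesByPage is a hypothetical helper.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Returns a Map from page value to the number of matches on that page.
function countMatchesByPage(results) {
  const counts = new Map();
  for (const r of results) {
    counts.set(r.page, (counts.get(r.page) ?? 0) + 1);
  }
  return counts;
}

// Usage sketch, assuming `doc` is an open WasmPdfDocument:
// for (const [page, n] of countMatchesByPage(doc.search("configuration", true))) {
//   console.log(`page ${page}: ${n} match(es)`);
// }
```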

Opening from Bytes

The WasmPdfDocument constructor already takes Uint8Array bytes directly — no separate from_bytes method is needed:

// Already works — WasmPdfDocument takes bytes
const doc = new WasmPdfDocument(uint8Array);

Encrypted PDFs

const doc = new WasmPdfDocument(encryptedBytes);
const success = doc.authenticate("password");
if (success) {
  console.log(doc.extractText(0));
}
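
The snippet above silently does nothing when the password is wrong. A sketch of a stricter variant follows: it throws on failure and frees the document first so nothing leaks. openEncrypted is a hypothetical helper, the document class is passed in as a parameter to keep the sketch independent of how the module is imported, and it assumes authenticate() returns a falsy value for a wrong password, as the success check above suggests.

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Opens a document and authenticates it, or throws after freeing it.
function openEncrypted(DocumentClass, bytes, password) {
  const doc = new DocumentClass(bytes);
  if (!doc.authenticate(password)) {
    doc.free(); // do not leak WASM memory when the password is rejected
    throw new Error("PDF authentication failed");
  }
  return doc; // the caller is responsible for calling doc.free()
}

// Usage sketch:
// const doc = openEncrypted(WasmPdfDocument, encryptedBytes, "password");
```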

Editing

const doc = new WasmPdfDocument(bytes);

// Metadata
doc.setTitle("Updated Title");
doc.setAuthor("Jane Doe");

// Page rotation
doc.rotatePage(0, 90);

// Save with changes
const edited = doc.save();

// Save with encryption
const encrypted = doc.saveEncryptedToBytes(
  "user-password",
  "owner-password",
  true,   // allow_print
  true,   // allow_copy
  false,  // allow_modify
  true    // allow_annotate
);
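
Six positional arguments, four of them booleans, are easy to put in the wrong order. One option is a thin wrapper that takes named options and forwards them in the order shown above. This is a sketch only: saveEncrypted and its option names are hypothetical, and the defaults simply mirror the values in the example rather than any documented library defaults.

```javascript
// Hypothetical wrapper (not part of pdf-oxide-wasm).
// Forwards named options to saveEncryptedToBytes in the positional order
// used in the example above.
function saveEncrypted(doc, {
  userPassword,
  ownerPassword,
  allowPrint = true,     // defaults mirror the example, not the library
  allowCopy = true,
  allowModify = false,
  allowAnnotate = true,
}) {
  return doc.saveEncryptedToBytes(
    userPassword,
    ownerPassword,
    allowPrint,
    allowCopy,
    allowModify,
    allowAnnotate
  );
}

// Usage sketch:
// const bytes = saveEncrypted(doc, { userPassword: "u", ownerPassword: "o", allowCopy: false });
```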

Memory Management

WASM objects hold Rust memory that must be freed explicitly:

const doc = new WasmPdfDocument(bytes);
try {
  const text = doc.extractText(0);
} finally {
  doc.free();
}
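
To avoid repeating the try/finally at every call site, the pattern can be wrapped once. The sketch below is an assumption layered on top of the API shown above: withDocument is a hypothetical name, and it relies only on the object exposing free().

```javascript
// Hypothetical helper (not part of pdf-oxide-wasm).
// Opens a document, runs `fn` with it, and always calls free(),
// even if `fn` throws.
function withDocument(open, fn) {
  const doc = open(); // e.g. () => new WasmPdfDocument(bytes)
  try {
    return fn(doc);
  } finally {
    doc.free();
  }
}

// Usage sketch, assuming `bytes` is a Uint8Array holding a PDF:
// const text = withDocument(() => new WasmPdfDocument(bytes), (doc) => doc.extractText(0));
```

The callback must be synchronous: if it returns a promise, the document is freed before the promise settles.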

Feature Availability

Some features require native dependencies and are not available in WebAssembly builds:

Feature	WASM	Notes
Text extraction	Yes	Full support
PDF creation	Yes	Markdown, HTML, text
PDF editing	Yes	Full support
Encryption	Yes	AES-256
OCR	No	Requires native ONNX Runtime
Digital signatures	No	Requires native crypto libraries
Page rendering	No	Requires native tiny-skia

For OCR, digital signatures, or page rendering, use a native binding: the Python or Rust bindings, or the native pdf-oxide N-API addon on Node.js.

Next Steps

Python Getting Started – using PDF Oxide from Python
Rust Getting Started – using PDF Oxide from Rust
JavaScript API Reference – full WASM API documentation
Text Extraction – detailed extraction options
PDF Creation – advanced creation