Node.js Streams API

The pdf-oxide native binding ships readable streams for search results, pages, and tables — idiomatic for Node.js pipelines and memory-efficient for large documents.

All streams implement the standard Node.js Readable interface in object mode: they honour backpressure, integrate with pipe(), and support for await...of async iteration.

Streams are native to the Node.js binding. For the WASM build, iterate synchronously.

SearchStream

Emits one SearchResult at a time as the underlying SearchManager produces matches.

const { PdfDocument, SearchManager, SearchStream } = require("pdf-oxide");

const doc = new PdfDocument("large.pdf");
const manager = new SearchManager(doc);
const stream = new SearchStream(manager, "invoice");

stream.on("data", (r) => {
  console.log(`page ${r.pageIndex + 1}: ${r.text}`);
});

stream.on("end", () => {
  console.log("search complete");
  doc.close();
});

stream.on("error", (err) => {
  console.error(err);
  doc.close();
});

Search options go in a third argument:

const exact = new SearchStream(manager, "Invoice", { caseSensitive: true });

Async iteration

for await (const result of stream) {
  if (result.pageIndex > 50) break; // breaking out of the loop destroys the stream
  console.log(result.text);
}

pipe() compatibility

const { Writable } = require("stream");

const sink = new Writable({
  objectMode: true,
  write(result, _enc, cb) {
    console.log(`${result.pageIndex}:${result.text}`);
    cb();
  },
});

stream.pipe(sink);

PageIteratorStream

Emits one page’s extracted text at a time. Useful for line-oriented output or when feeding an LLM with a rate-limited queue.

const { PageIteratorStream } = require("pdf-oxide");

const stream = new PageIteratorStream(doc, { format: "markdown" });

for await (const { pageIndex, content } of stream) {
  await indexPage(pageIndex, content);
}

format accepts "text" (default), "markdown", "html", "plain".

TableStream

Emits one table at a time as it’s detected.

const { TableStream } = require("pdf-oxide");

const stream = new TableStream(doc);

stream.on("data", (table) => {
  console.log(`${table.rows.length}x${table.rows[0].length} on page ${table.pageIndex}`);
});

Backpressure

All streams implement standard Node.js backpressure. If your consumer is slow, the stream pauses extraction until the consumer is ready for more data:

stream.on("data", async (result) => {
  stream.pause();
  await slowIndex(result);
  stream.resume();
});

Or use for await, which handles pausing automatically.

Error handling

Errors during extraction are emitted as standard error events:

stream.on("error", (err) => {
  if (err.code === "PDF_INVALID_PAGE") {
    console.warn("skipping invalid page", err.pageIndex);
  } else {
    throw err;
  }
});

Memory efficiency

Streams keep only one result in flight. On a 10,000-page PDF producing 50,000 matches, a SearchStream uses constant memory — the entire result set is never materialised.

Cleanup

Closing the parent PdfDocument ends all attached streams. Streams also clean up their manager reference on end / error.

const doc = new PdfDocument("big.pdf");
const stream = new SearchStream(new SearchManager(doc), "TODO");

stream.on("end", () => doc.close());
stream.on("error", () => doc.close());

On Node.js versions with explicit resource management, the using declaration releases the document when the scope exits:

{
  using doc = new PdfDocument("big.pdf");
  const stream = new SearchStream(new SearchManager(doc), "TODO");
  for await (const r of stream) console.log(r);
} // doc.close() called automatically
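A using declaration works with any object that exposes a Symbol.dispose method. A sketch of the protocol (FakeDocument is illustrative, not pdf-oxide's implementation); on runtimes without the using keyword, Symbol.dispose can be polyfilled and the method invoked explicitly:

```javascript
// Polyfill for runtimes that predate explicit resource management.
Symbol.dispose ??= Symbol("Symbol.dispose");

class FakeDocument {
  constructor(path) {
    this.path = path;
    this.closed = false;
  }
  close() {
    this.closed = true;
  }
  [Symbol.dispose]() {
    this.close(); // what `using` invokes when the scope exits
  }
}

const doc = new FakeDocument("big.pdf");
// Without `using`, the dispose method can be called directly:
doc[Symbol.dispose]();
console.log(doc.closed); // true
```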