What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF-to-Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Async PDF Processing

PDF Oxide ships first-class async APIs in Python, C#, and Node.js so extraction never blocks your event loop or HTTP request handler. The same PdfDocument methods are exposed as async wrappers that dispatch the work to a background thread or thread pool.

Python — AsyncPdfDocument

Every method on PdfDocument has an awaitable counterpart on AsyncPdfDocument. Each call wraps the sync method with asyncio.to_thread().

Python

import asyncio
from pdf_oxide import AsyncPdfDocument

async def extract(path):
    doc = await AsyncPdfDocument.open(path)
    text = await doc.extract_text(0)
    return text

asyncio.run(extract("report.pdf"))
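
The wrapping described above can be sketched in plain Python. This is a minimal illustration, not PDF Oxide's actual source: `SlowDocument` and `AsyncDocument` are hypothetical stand-ins, and only `asyncio.to_thread()` is the real mechanism named in the text.

```python
import asyncio
import time


class SlowDocument:
    """Hypothetical stand-in for a blocking, synchronous document class."""

    def __init__(self, pages):
        self._pages = pages

    def extract_text(self, index):
        time.sleep(0.01)  # simulate blocking parse work
        return self._pages[index]


class AsyncDocument:
    """Hypothetical async facade: each call runs the sync method in a worker thread."""

    def __init__(self, sync_doc):
        self._doc = sync_doc

    async def extract_text(self, index):
        # asyncio.to_thread runs the blocking call in the default executor,
        # so the event loop stays free while it executes.
        return await asyncio.to_thread(self._doc.extract_text, index)


async def demo():
    doc = AsyncDocument(SlowDocument(["page zero", "page one"]))
    return await doc.extract_text(1)


result = asyncio.run(demo())
```

One consequence of this design is that cancelling the awaiting task does not stop the worker thread; the blocking call still runs to completion in the background.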

Fan-out across pages

Python

import asyncio
from pdf_oxide import AsyncPdfDocument

async def extract_all(path):
    doc = await AsyncPdfDocument.open(path)
    page_count = await doc.page_count()
    pages = await asyncio.gather(*[doc.extract_text(i) for i in range(page_count)])
    return pages
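
`asyncio.gather` schedules every page at once. For very large documents it may help to cap how many extractions are in flight. The sketch below shows one way to do that with `asyncio.Semaphore`; `fake_extract` is a hypothetical stand-in for `doc.extract_text`, and the limit of 4 is arbitrary.

```python
import asyncio


async def fake_extract(index):
    """Hypothetical stand-in for an awaitable page extraction."""
    await asyncio.sleep(0.001)
    return f"text of page {index}"


async def extract_bounded(extract, page_count, limit=4):
    """Run extract(i) for every page with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def one(index):
        async with semaphore:
            return await extract(index)

    # gather preserves argument order, so results line up with page indices
    return await asyncio.gather(*(one(i) for i in range(page_count)))


pages = asyncio.run(extract_bounded(fake_extract, 10))
```

In real use, the bound method of an opened document would be passed in place of `fake_extract`.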

Creation

AsyncPdf mirrors the Pdf class for creation flows:

Python

import asyncio
from pdf_oxide import AsyncPdf

async def main():
    pdf = await AsyncPdf.from_markdown("# Hello")
    await pdf.save_async("out.pdf")

asyncio.run(main())  # await is only valid inside a coroutine

Office conversion

AsyncOfficeConverter converts DOCX, XLSX, and PPTX files to PDF asynchronously.

Python

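import asyncio
from pdf_oxide import AsyncOfficeConverter

async def main():
    converter = AsyncOfficeConverter()
    return await converter.docx_to_pdf_bytes("input.docx")

pdf_bytes = asyncio.run(main())  # await is only valid inside a coroutine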

Free-threaded Python (cp314t)

The pdf_oxide extension module declares gil_used = false, making it safe to use under cp314t (Python 3.14 free-threaded builds). Multiple threads can call PdfDocument methods in parallel without GIL serialisation.

Python

from concurrent.futures import ThreadPoolExecutor
from pdf_oxide import PdfDocument

doc = PdfDocument("large.pdf")

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(doc.extract_text, range(doc.page_count())))
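
Whether the thread pool above actually runs in parallel depends on the interpreter build. The helper below is a small sketch for checking this at runtime using only the standard library. `sys._is_gil_enabled()` exists on Python 3.13 and later, so the code assumes the GIL is on when that function is missing.

```python
import sys
import sysconfig


def gil_status():
    """Report whether this is a free-threaded build and whether the GIL is active."""
    # Py_GIL_DISABLED is 1 on free-threaded builds (for example cp313t, cp314t)
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # sys._is_gil_enabled() was added in Python 3.13; older versions always have a GIL
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = check() if check is not None else True
    return {"free_threaded_build": free_threaded_build, "gil_enabled": gil_enabled}


status = gil_status()
print(status)
```

On a standard build the pool still works, but calls that hold the GIL are serialised.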

C# — `async Task<T>` with `CancellationToken`

Every extraction and save method has an *Async variant returning Task<T> and accepting an optional CancellationToken.

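C#

using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

// Extract a single page without blocking the calling thread
string text = await doc.ExtractTextAsync(0);

// Fan out across all pages; the token cancels pending work after 30 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => doc.ExtractTextAsync(i, cts.Token));
string[] pages = await Task.WhenAll(tasks);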

Editor

C#

using var editor = DocumentEditor.Open("form.pdf");
editor.SetFormFieldValue("name", "Jane Doe");
await editor.SaveAsync("filled.pdf");

Node.js — `*Async` methods

Every sync method has an async sibling — extractText → extractTextAsync, save → saveAsync, etc. Async calls run on the libuv thread pool.

Node.js

const { PdfDocument } = require("pdf-oxide");

async function extractAll(path) {
  const doc = new PdfDocument(path);
  try {
    const pageCount = doc.getPageCount();
    const pages = await Promise.all(
      Array.from({ length: pageCount }, (_, i) => doc.extractTextAsync(i))
    );
    return pages;
  } finally {
    doc.close();
  }
}

HTTP handler example

Node.js

import express from "express";
import { PdfDocument } from "pdf-oxide";

const app = express();

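app.post("/extract", express.raw({ type: "application/pdf", limit: "50mb" }), async (req, res) => {
  let doc;
  try {
    doc = PdfDocument.openFromBytes(req.body);
    const text = await doc.extractTextAsync(0);
    res.json({ text });
  } catch (err) {
    // Express 4 does not catch rejections from async handlers, so respond here
    res.status(500).json({ error: String(err) });
  } finally {
    if (doc) doc.close();
  }
});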

Concurrency model

Language	Mechanism
Python	`asyncio.to_thread` dispatches sync call to the default executor
Python (cp314t)	True thread parallelism — GIL is optional
C#	Task-based Asynchronous Pattern, dispatched to ThreadPool
Node.js	libuv worker threads (`Napi::AsyncWorker`)
Go	Not needed — goroutines call sync methods directly
Rust	Not provided — use `tokio::task::spawn_blocking` or your executor of choice

See the concurrency guide for sharing a single PdfDocument across threads.
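
Because the Python row relies on the event loop's default executor, the size of that executor bounds how many blocking calls run at once. The sketch below uses only standard-library APIs to install a custom `ThreadPoolExecutor` as the default, so that `asyncio.to_thread` uses it. `blocking_work` is a hypothetical stand-in for a synchronous PDF call, and the pool size and thread name prefix are arbitrary.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


def blocking_work(index):
    """Hypothetical stand-in for a sync call; returns the worker thread's name."""
    return threading.current_thread().name


async def main():
    loop = asyncio.get_running_loop()
    # asyncio.to_thread submits to the loop's default executor, so replacing
    # it controls the worker count for every to_thread call on this loop.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=4, thread_name_prefix="pdf-worker")
    )
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_work, i) for i in range(8))
    )


thread_names = asyncio.run(main())
```

Every returned name starts with the chosen prefix, which confirms that the calls ran on the custom pool and not on the event loop thread.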

Related

Concurrency: thread safety and parallel reads
Python Getting Started
Node.js Getting Started
C# Getting Started
