Concurrency — Thread-Safe PDF Reads
PdfDocument has been Send + Sync on the Rust side since v0.3.22. A single document can be shared across OS threads, goroutines, worker threads, or asyncio tasks for parallel page extraction. Write operations still need serialisation — that’s what DocumentEditor is for.
What changed in v0.3.22
All 16 RefCell<T> wrappers inside PdfDocument were replaced with Mutex<T>, and Cell<usize> became AtomicUsize. The language bindings dropped the unsendable marker on Python classes (PdfDocument, PdfPage, FormField), which previously raised RuntimeError the moment they crossed a thread boundary.
Net effect: thread pools, async runtimes, and free-threaded Python all now just work.
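The guarantee behind this change can be checked at compile time: a value can be shared across threads only if its type is Send + Sync, which Mutex&lt;T&gt; and AtomicUsize are, while RefCell&lt;T&gt; and Cell&lt;usize&gt; are not. A minimal sketch of the idea — the struct and field names below are illustrative, not the crate's actual internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative stand-in for the post-v0.3.22 layout: every interior-mutable
// field uses a thread-safe wrapper, so the struct is automatically Send + Sync.
struct SharedDoc {
    page_cache: Mutex<Vec<String>>, // was RefCell<Vec<String>>
    hits: AtomicUsize,              // was Cell<usize>
}

// Compile-time proof: this function only accepts Send + Sync types.
// With RefCell/Cell fields, the call below would fail to compile.
fn assert_shareable<T: Send + Sync>(_: &T) {}

fn main() {
    let doc = Arc::new(SharedDoc {
        page_cache: Mutex::new(vec!["page 0 text".into()]),
        hits: AtomicUsize::new(0),
    });
    assert_shareable(&*doc);

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let doc = Arc::clone(&doc);
            thread::spawn(move || {
                doc.hits.fetch_add(1, Ordering::Relaxed);
                doc.page_cache.lock().unwrap().len()
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(doc.hits.load(Ordering::Relaxed), 4);
}
```

The same mechanism is why the Python `unsendable` marker could be dropped: once the Rust side is Send + Sync, the binding layer no longer has to forbid cross-thread access.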
Rust
```rust
use pdf_oxide::PdfDocument;
use std::sync::Arc;
use std::thread;

let doc = Arc::new(PdfDocument::open("report.pdf")?);
let page_count = doc.page_count();

let handles: Vec<_> = (0..page_count)
    .map(|i| {
        let doc = Arc::clone(&doc);
        thread::spawn(move || doc.extract_text(i))
    })
    .collect();

for h in handles {
    let text = h.join().unwrap()?;
    println!("{}", text);
}
```
With tokio:
```rust
use std::sync::Arc;
use tokio::task;

let doc = Arc::new(pdf_oxide::PdfDocument::open("report.pdf")?);

let tasks: Vec<_> = (0..doc.page_count())
    .map(|i| {
        let doc = Arc::clone(&doc);
        task::spawn_blocking(move || doc.extract_text(i))
    })
    .collect();

for t in tasks {
    let text = t.await??;
}
```
Python
```python
from concurrent.futures import ThreadPoolExecutor
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(doc.extract_text, range(doc.page_count())))
```
Under stock CPython the GIL still serialises Python-level work, but the extraction itself releases the GIL during Rust execution — so this is genuinely parallel on the Rust side. Under cp314t (free-threaded Python 3.14+), the GIL is optional and the bindings declare gil_used = false so there is no implicit serialisation at all.
With asyncio:
```python
import asyncio
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

async def main():
    pages = await asyncio.gather(
        *[asyncio.to_thread(doc.extract_text, i) for i in range(doc.page_count())]
    )
    return pages

asyncio.run(main())
```
Or use the ready-made AsyncPdfDocument from the async guide.
Go
Reads on *PdfDocument are protected by an internal sync.RWMutex — goroutine-safe by construction.
```go
package main

import (
    "sync"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, _ := pdfoxide.Open("report.pdf")
    defer doc.Close()

    count, _ := doc.PageCount()
    results := make([]string, count)

    var wg sync.WaitGroup
    for i := 0; i < count; i++ {
        wg.Add(1)
        go func(page int) {
            defer wg.Done()
            text, _ := doc.ExtractText(page)
            results[page] = text
        }(i)
    }
    wg.Wait()
}
```
*DocumentEditor serialises writes internally, but do not pipeline independent edits from multiple goroutines — collect mutations on one goroutine.
C#
```csharp
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => Task.Run(() => doc.ExtractText(i)));

string[] pages = await Task.WhenAll(tasks);
```
If you need fine-grained reader/writer semantics around a DocumentEditor, guard reads with a ReaderWriterLockSlim read lock (and take the write lock around edits):
```csharp
var locker = new ReaderWriterLockSlim();

locker.EnterReadLock();
try
{
    string text = doc.ExtractText(0);
}
finally
{
    locker.ExitReadLock();
}
```
Node.js
A PdfDocument can be passed to worker threads by transferring the backing handle. The simpler pattern is to let the *Async methods do the dispatching:
```javascript
const { PdfDocument } = require("pdf-oxide");

// Top-level await is not available in CommonJS, so wrap the work
// in an async function.
async function main() {
  const doc = new PdfDocument("report.pdf");
  try {
    const pageCount = doc.getPageCount();
    const pages = await Promise.all(
      Array.from({ length: pageCount }, (_, i) => doc.extractTextAsync(i))
    );
    return pages;
  } finally {
    doc.close();
  }
}

main();
```
Each *Async call runs on the libuv thread pool.
Writer serialisation
Writes (DocumentEditor, Pdf, PdfCreator) are not lock-free. If multiple threads need to modify the same document, funnel mutations through one writer goroutine / task and fan out the reads.
A common pattern:
- 1 reader: a PdfDocument shared across N reader threads.
- 1 writer: a DocumentEditor owned by a single coordinator task that collects edits from a channel or queue.
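In Rust this single-writer pattern falls out of std::sync::mpsc: producer threads send edit requests over a channel, and the one thread that owns the editor drains the queue. A sketch with a stand-in Editor type (pdf_oxide's actual DocumentEditor API is not reproduced here):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a write handle such as DocumentEditor; owned by one thread only.
struct Editor {
    applied: Vec<String>,
}

impl Editor {
    fn apply(&mut self, edit: String) {
        self.applied.push(edit);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // The coordinator thread exclusively owns the editor and drains the queue.
    let writer = thread::spawn(move || {
        let mut editor = Editor { applied: Vec::new() };
        for edit in rx {
            editor.apply(edit);
        }
        editor.applied.len()
    });

    // N producer threads fan edits into the channel instead of touching
    // the editor directly.
    let producers: Vec<_> = (0..4)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(format!("edit from thread {i}")).unwrap())
        })
        .collect();
    for p in producers {
        p.join().unwrap();
    }
    drop(tx); // close the channel so the writer loop ends

    assert_eq!(writer.join().unwrap(), 4);
}
```

Reads stay lock-free on the shared PdfDocument; only mutations pay the cost of the queue hop.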
Related
- Async Processing — awaitable wrappers and CancellationToken plumbing.
- Batch Processing — processing many files concurrently.
- Node.js Getting Started — worker-thread patterns.