What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Reading Order & XY-Cut — Extract Multi-Column PDFs in Natural Order

Multi-column PDFs — academic papers, textbooks, magazine articles, policy briefs — trip up most extraction tools. A naïve top-to-bottom read pulls a word from column 1, then a word from column 2, then back to column 1, producing garbled output like accompaally ("accompa" from column 1 joined to "ally" from column 2).

PDF Oxide uses an XY-cut algorithm to detect columns and produce natural reading order automatically. Since v0.3.34 it also guards against sparse-layout false positives (copyright pages, title pages) and correctly handles mixed layouts where a table sits inside body text.

Quick Example

Extraction is column-aware by default — no flag needed:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("academic-paper.pdf")
text = doc.extract_text(0)
# Columns are read top-to-bottom within each column, not interleaved.

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("academic-paper.pdf")?;
let text = doc.extract_text(0)?;

JavaScript / TypeScript (Node)

const { PdfDocument } = require("pdf-oxide");
const doc = new PdfDocument("academic-paper.pdf");
const text = doc.extractText(0);
doc.close();

JavaScript (WASM)

import { WasmPdfDocument } from "pdf-oxide-wasm";
const doc = new WasmPdfDocument(bytes);
console.log(doc.extractText(0));
doc.free();

Go

doc, _ := pdfoxide.Open("academic-paper.pdf")
defer doc.Close()

text, _ := doc.ExtractText(0)
fmt.Println(text)

C#

using PdfOxide;

using var doc = PdfDocument.Open("academic-paper.pdf");
Console.WriteLine(doc.ExtractText(0));

What XY-cut Does

The XY-cut algorithm recursively splits a page into rectangular regions by alternating vertical and horizontal cuts along whitespace gutters:

1. Project all characters onto the X axis. If a tall, wide vertical gap shows up (the column gutter), split the page into two regions at that X coordinate.
2. Within each region, project onto the Y axis and split on horizontal gutters (paragraph breaks, section boundaries).
3. Recurse until each leaf region has no strong gutter; these leaves are the atomic blocks.
4. Serialize the blocks by walking the cuts in order: left region before right at each vertical cut, top region before bottom at each horizontal cut.

This matches how a human reads: column 1 top to bottom, then column 2 top to bottom, then any full-width footer.
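The steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not PDF Oxide's implementation: the `Box` type, the `widest_gap` helper, the `min_gap` threshold, and the choice to try the vertical cut before the horizontal one are all assumptions made for the example. Coordinates follow the PDF convention, where larger Y is higher on the page.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    text: str
    x0: float
    y0: float  # bottom edge (PDF-style: larger y is higher on the page)
    x1: float
    y1: float  # top edge

def widest_gap(intervals, min_gap):
    """Return the midpoint of the widest empty gap of at least min_gap, or None."""
    best, best_width = None, min_gap
    reach = None  # right-most end seen so far
    for lo, hi in sorted(intervals):
        if reach is not None and lo - reach >= best_width:
            best, best_width = (reach + lo) / 2, lo - reach
        reach = hi if reach is None else max(reach, hi)
    return best

def xy_cut(boxes, min_gap=10.0):
    """Return boxes in reading order: left before right, top before bottom."""
    if len(boxes) <= 1:
        return list(boxes)
    # Vertical cut: look for a column gutter in the X projection.
    x = widest_gap([(b.x0, b.x1) for b in boxes], min_gap)
    if x is not None:
        left = [b for b in boxes if b.x1 <= x]
        right = [b for b in boxes if b.x0 >= x]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    # Horizontal cut: look for a gutter in the Y projection.
    y = widest_gap([(b.y0, b.y1) for b in boxes], min_gap)
    if y is not None:
        top = [b for b in boxes if b.y0 >= y]
        bottom = [b for b in boxes if b.y1 <= y]
        return xy_cut(top, min_gap) + xy_cut(bottom, min_gap)
    # No strong gutter: atomic block, read row by row.
    return sorted(boxes, key=lambda b: (-b.y1, b.x0))
```

Because every cut falls inside an empty gap, each box lands wholly on one side of it and both sides are non-empty, so the recursion always terminates. A full-width title blocks the first vertical cut, which is why the sketch falls back to a horizontal cut before trying columns again underneath it.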

When XY-cut Activates

XY-cut runs automatically when extract_text detects a multi-column layout. It is skipped for:

Single-column pages (no vertical gutter is found, so the default row-aware sort is used)
Sparse pages with fewer than ~10 text spans per apparent column — these are typically title or copyright pages where two X-center peaks are an artefact rather than real columns (fixed in v0.3.34)

No configuration is needed for the common case. If you need a different ordering, see “Opt-out or Custom Ordering” below.

What v0.3.34 Fixed

Interleaved multi-column output on untagged PDFs

On untagged multi-column PDFs (academic textbooks, genetics references), extract_text previously applied XY-cut inside extract_spans() and then re-sorted the result with a row-aware sort in extract_text_with_options, undoing the column structure. Result: garbled fragments like accompaally.

Fix: the row-aware re-sort is now skipped on pages that are genuinely multi-column. Verified clean on Hartwell Genetics, Murphy ML, and Kandel Neural Science textbooks.
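To see why the second sort was harmful, here is a toy reproduction in plain Python. It does not call PDF Oxide; `row_aware_sort` is a hypothetical stand-in for the re-sort described above, and apart from the "accompa" / "ally" pair the fragments and coordinates are invented for illustration.

```python
# (text, x, y) fragments already in column order; larger y is higher on the page.
column_order = [
    ("accompa", 50, 700), ("nied by", 50, 688),   # column 1, top to bottom
    ("ally", 320, 700), ("speaking", 320, 688),   # column 2, top to bottom
]

def row_aware_sort(spans):
    """Sort top to bottom, then left to right, ignoring columns."""
    return sorted(spans, key=lambda s: (-s[2], s[1]))

resorted = row_aware_sort(column_order)

column_text = " ".join(text for text, _, _ in column_order)
resorted_text = " ".join(text for text, _, _ in resorted)
```

After the re-sort, "accompa" is followed directly by "ally" from the other column, which is the interleaving the fix avoids by leaving column-ordered spans untouched.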

Table-within-text pages

Mixed-layout pages (a table embedded in running body text) could trick the column detector because tab-expanded table rows filled the column gutter. Fix:

Wide spans (>55 % of region width) are excluded from the projection density — tab-padded rows no longer mask the gutter.
Single-character spans (table cell values like G, T) are excluded from the projection so they don’t scatter across the gutter.
Coverage uses a character-count estimate rather than raw bbox width, so tab-padded rows no longer masquerade as dense body text.
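The first two filters can be pictured as a pre-pass that decides which spans get to vote in the X projection. The sketch below is an assumption-laden illustration rather than the library's code: the `Span` type, the `projection_spans` name, and the parameter names are hypothetical, and only the 55 % ratio comes from the text above. The character-count coverage estimate is left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    text: str
    x0: float
    x1: float

def projection_spans(spans, region_x0, region_x1, max_width_ratio=0.55):
    """Keep only the spans that should contribute to the X-axis projection."""
    region_width = region_x1 - region_x0
    kept = []
    for s in spans:
        if len(s.text.strip()) <= 1:
            continue  # single-character spans, e.g. table cell values
        if (s.x1 - s.x0) > max_width_ratio * region_width:
            continue  # wide spans, e.g. tab-padded table rows
        kept.append(s)
    return kept
```

With the wide and single-character spans removed, the remaining body-text spans leave the gutter empty again, so a gap search over their X extents can find it.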

Sparse-layout false positives

Copyright pages, title pages, and colophons can produce two X-center peaks with only 7–10 spans per “column”. These are no longer treated as multi-column, preventing XY-cut from splitting sentences whose halves sit at different X positions on the same line.
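A guard of this kind can be as small as a span count per apparent column. The function below is a hypothetical sketch: the name `looks_multi_column`, the `sparse_limit` parameter, and the exact cutoff of 10 are assumptions chosen to match the "7–10 spans" range mentioned above, not values read from the library.

```python
def looks_multi_column(spans_per_column, sparse_limit=10):
    """Return True when at least two apparent columns exceed the sparse limit."""
    dense = [n for n in spans_per_column if n > sparse_limit]
    return len(dense) >= 2

title_page = looks_multi_column([7, 10])    # two X-center peaks, few spans each
body_page = looks_multi_column([42, 39])    # two well-populated columns
```

In this sketch a page with one dense column and one sparse peak is also treated as single-column, which errs on the side of not splitting sentences.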

Structured Access per Column

Going lower-level than extract_text, you can pull words or character-level data with the same column ordering applied:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
for w in doc.extract_words(0):
    print(f"{w.text}  ({w.x0:.0f},{w.y0:.0f})")

Rust

let mut doc = PdfDocument::open("paper.pdf")?;
for w in doc.extract_words(0)? {
    println!("{}  ({:.0},{:.0})", w.text, w.x0, w.y0);
}

Go

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

words, _ := doc.ExtractWords(0)
for _, w := range words {
    fmt.Printf("%s  (%.0f,%.0f)\n", w.Text, w.X0, w.Y0)
}

C#

using var doc = PdfDocument.Open("paper.pdf");
// Node/C# return rows of (text, x, y, w, h):
var lines = doc.ExtractTextLines(0);
foreach (var (text, x, y, w, h) in lines)
    Console.WriteLine($"{text}  ({x:F0},{y:F0})");

Each word / line carries its bounding box so you can group by column and re-order yourself if you need a custom policy (e.g. read the right column first for Arabic layouts).
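As one possible custom policy, the sketch below reads the right column before the left. It works on a stand-in `Word` tuple so it runs without a PDF; the field names mirror the `x0` / `y0` / `x1` attributes used above, while `y1`, the `right_column_first` helper, and the caller-supplied `split_x` gutter position are assumptions for illustration.

```python
from collections import namedtuple

# Stand-in for extracted words; larger y is higher on the page (PDF-style).
Word = namedtuple("Word", "text x0 y0 x1 y1")

def right_column_first(words, split_x):
    """Order words right column first, each column top to bottom."""
    right = [w for w in words if (w.x0 + w.x1) / 2 >= split_x]
    left = [w for w in words if (w.x0 + w.x1) / 2 < split_x]

    def by_row(w):
        # Stable sort: words on the same row keep their incoming order.
        return -round(w.y0)

    return sorted(right, key=by_row) + sorted(left, key=by_row)
```

Sorting only by row keeps whatever within-line order the extractor produced, which avoids guessing at the visual direction of right-to-left text.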

Detecting Multi-Column Pages Manually

If you want to branch on whether a page is multi-column before extracting:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("mixed.pdf")
for i in range(doc.page_count()):
    words = doc.extract_words(i)
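    # Crude heuristic: bucket word X-centers into 50 pt bins.
    # Words spread along one wide column also fill several bins, so expect false positives.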
    x_centers = {round((w.x0 + w.x1) / 2 / 50) * 50 for w in words}
    if len(x_centers) >= 2:
        print(f"Page {i}: likely multi-column ({len(x_centers)} X-centers)")

For production use, prefer extract_text and let the library’s XY-cut + sparse-layout guard make the call.
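If you do need your own check, looking for an empty band in the X projection is usually more telling than counting X-centers. The function below is a rough sketch under stated assumptions: `find_gutter` and its `bin_width`, `min_bins`, and `max_fill` parameters are hypothetical, and it takes plain `(x0, x1)` pairs, which you could build from word bounding boxes.

```python
def find_gutter(x_ranges, bin_width=5.0, min_bins=2, max_fill=0.1):
    """Return the center X of the widest low-density interior band, or None.

    x_ranges is a list of (x0, x1) word extents from a single page.
    """
    if not x_ranges:
        return None
    lo = min(x0 for x0, _ in x_ranges)
    hi = max(x1 for _, x1 in x_ranges)
    n_bins = int((hi - lo) / bin_width) + 1
    counts = [0] * n_bins
    for x0, x1 in x_ranges:
        first = int((x0 - lo) / bin_width)
        last = min(int((x1 - lo) / bin_width), n_bins - 1)
        for b in range(first, last + 1):
            counts[b] += 1
    # A bin counts as "empty" if few spans cross it (tolerates full-width titles).
    limit = max_fill * max(counts)
    best = None       # (length in bins, start bin)
    run_start = None
    for b, c in enumerate(counts + [limit + 1]):  # sentinel closes a trailing run
        if c <= limit:
            if run_start is None:
                run_start = b
        elif run_start is not None:
            length = b - run_start
            interior = run_start > 0 and b < n_bins
            if interior and length >= min_bins and (best is None or length > best[0]):
                best = (length, run_start)
            run_start = None
    if best is None:
        return None
    length, start = best
    return lo + (start + length / 2) * bin_width
```

A return value of `None` suggests a single column. Because the threshold is relative to the busiest bin, a lone full-width heading crossing the gutter does not hide it.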

Opt-out or Custom Ordering

If you want raw, position-ordered spans (e.g. for a custom layout engine), use extract_chars or extract_words — these return records with bounding boxes, and you can apply your own sort:

Python

chars = doc.extract_chars(0)
# Top-to-bottom, then left-to-right — ignores columns
chars_sorted = sorted(chars, key=lambda c: (-c.y, c.x))

Rust

let mut chars = doc.extract_chars(0)?;
// total_cmp (assuming x and y are f32/f64) gives a total order, so a NaN coordinate cannot panic the sort.
chars.sort_by(|a, b| b.y.total_cmp(&a.y)
    .then(a.x.total_cmp(&b.x)));

Related Pages

Text Extraction: full extraction API
Extraction Profiles: tune space detection per document type
Extract Tables from PDF: structured table output
Changelog: v0.3.34 multi-column and mixed-layout fixes
