What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Page API Reference

Since v0.3.34 every binding exposes a Page object so you can iterate a document and call extraction methods on the page directly, instead of threading page_index through every extraction call. The type is named Page consistently in Python, Node.js, C#, and Go; Rust exposes the same shape through PdfPage.

Quick Example

Python

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    for page in doc:                       # len(doc), doc[i], doc[-1] also work
        print(page.text[:80])
        md = page.markdown(detect_headings=True)

Rust

use pdf_oxide::api::Pdf;

let mut doc = Pdf::open("paper.pdf")?;
for i in 0..doc.page_count()? {
    let page = doc.page(i)?;
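    // Take at most 80 characters; a fixed byte slice panics on short pages
    println!("{}", page.text()?.chars().take(80).collect::<String>());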
}

JavaScript / TypeScript (Node)

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
for (const page of doc) {
  console.log(page.extractText().slice(0, 80));
}
doc.close();

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("paper.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

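    pages, err := doc.Pages()
    if err != nil { log.Fatal(err) }
    for _, page := range pages {
        text, err := page.ExtractText()
        if err != nil { log.Fatal(err) }
        // Clamp the slice so short pages do not panic (min is built in since Go 1.21)
        fmt.Println(text[:min(80, len(text))])
    }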
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");
foreach (var page in doc.Pages)
{
    string text = page.ExtractText();   // extract once, then truncate
    Console.WriteLine(text[..Math.Min(80, text.Length)]);
}

Python — `Page`

Lazy property surface — content is parsed on first access and cached on the Page.

Member	Returns	Description
`page.text`	`str`	Extracted text (column-aware)
`page.chars`	`list[Char]`	Character-level records with bbox, font
`page.words`	`list[Word]`	Word-level records with bbox
`page.lines`	`list[TextLine]`	Text lines with bbox
`page.spans`	`list[Span]`	Styled spans (font, size, weight)
`page.tables`	`list[Table]`	Structured table rows + cell bboxes
`page.images`	`list[Image]`	Image metadata
`page.paths`	`list[Path]`	Vector path records
`page.annotations`	`list[Annotation]`	Annotations on this page
`page.markdown(detect_headings=True)`	`str`	Markdown conversion
`page.plain_text()`	`str`	Plain text (no layout hints)
`page.html()`	`str`	HTML conversion
`page.render(format="png")`	`bytes`	Render page as PNG / JPEG
`page.search(term, case_sensitive=False)`	`list[SearchResult]`	Find text on this page
`page.region(rect)`	`PageRegion`	Scoped extraction inside a rect
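The "parsed on first access and cached" behaviour can be illustrated in plain Python. The sketch below is not pdf_oxide's implementation; `LazyPage` and `parse_count` are hypothetical names, and `functools.cached_property` is just one common way to get this effect.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in for a page object; not pdf_oxide's real class."""

    def __init__(self, raw: str) -> None:
        self._raw = raw
        self.parse_count = 0  # counts how often the expensive step runs

    @cached_property
    def words(self) -> list[str]:
        # Runs only on first access; the result is then stored on the instance.
        self.parse_count += 1
        return self._raw.split()


page = LazyPage("lazy parsing happens once")
first = page.words
second = page.words
print(first is second, page.parse_count)  # True 1
```

The practical consequence is that reading the same property repeatedly in a loop should not repeat the parsing work, while properties you never touch cost nothing.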

with PdfDocument("paper.pdf") as doc:
    page = doc[0]                 # or doc.page(0)
    for word in page.words:       # first access parses; subsequent calls cached
        print(word.text, word.bbox)

    # Scoped extraction
    header = page.region((0, 700, 612, 92)).extract_text()
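The rect in the scoped extraction above appears to be `(x, y, width, height)` in points with a bottom-left origin: `(0, 700, 612, 92)` matches the top 92 points of a 612 × 792 US Letter page. That reading is an assumption rather than something the source states, and `top_band` below is a hypothetical helper, not part of pdf_oxide.

```python
def top_band(page_width: float, page_height: float,
             band_height: float) -> tuple[float, float, float, float]:
    """Return (x, y, width, height) for a band at the top of the page.

    Assumes a bottom-left origin, so the band starts at
    page_height - band_height.
    """
    if not 0 < band_height <= page_height:
        raise ValueError("band_height must be within the page height")
    return (0, page_height - band_height, page_width, band_height)


# US Letter is 612 x 792 points; a 92-point header band:
rect = top_band(612, 792, 92)
print(rect)  # (0, 700, 612, 92)
```

If the assumption holds, the result can be passed straight to `page.region(...)` instead of hard-coding the numbers.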

The pre-existing editor PdfPage class (for writing) is unchanged; the new Page is strictly read-only.

Rust — `PdfPage`

use pdf_oxide::api::Pdf;

let mut doc = Pdf::open("paper.pdf")?;
let page = doc.page(0)?;

let text = page.text()?;
let words = page.extract_words()?;
let tables = page.extract_tables()?;
let md = page.to_markdown(true)?;

Methods available on PdfPage:

text(), plain_text(), to_markdown(detect_headings), to_html()
extract_chars(), extract_words(), extract_lines(), extract_spans()
extract_tables(), extract_paths(), extract_images()
annotations(), render(format)
search(term) — scoped search
find_text_containing(substring) — DOM-level hit list with IDs

Node.js — `Page`

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const page = doc.page(0);

console.log(page.width, page.height, page.rotation);  // cached
console.log(page.extractText());
const words = page.extractWords();
const tables = page.extractTables();
const md = page.toMarkdown();

PdfDocument supports for...of iteration via Symbol.iterator, plus doc.page(i) and doc.pageCount().

Six previously native-only methods are now available on both Page and PdfDocument via the TS layer:

extractWords
extractTextLines
extractTables
extractPaths
getEmbeddedImages
ocrExtractText

Each method has an async sibling — extractTextAsync, toMarkdownAsync, etc.

Go — `Page`

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

page, _ := doc.Page(0)
text, _ := page.ExtractText()
md, _   := page.ToMarkdown()
tables, _ := page.ExtractTables()

// Iterate every page
all, _ := doc.Pages()
for i, p := range all {
    t, _ := p.ExtractText()
    fmt.Printf("page %d: %d chars\n", i, len(t))
}

Go’s Page struct has the full method surface: ExtractText, ToMarkdown, ToHtml, ToPlainText, ExtractWords, ExtractTextLines, ExtractTables, ExtractChars, ExtractPaths, Annotations, Images, Fonts, RenderPage, Search.

C# — `Page`

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");

Page page = doc[0];                            // or doc.Pages[0] or doc.Page(0)
string text = page.ExtractText();
string md   = page.ToMarkdown();
Table[] tables = page.ExtractTables();

// Async variants
string textAsync = await page.ExtractTextAsync();
string mdAsync   = await page.ToMarkdownAsync();

doc.Pages is IReadOnlyList<Page>. Every sync method has an async Task<T> counterpart with CancellationToken support.

Structured Table Shape

extract_tables() (available on both PdfDocument and Page) returns the same table structure in every language, although the concrete type and the cell accessor differ:

Language	Type	Cell access
Rust	`Table`	iterate `rows[i].cells[j]`
Python	`dict`	`row["cells"][i]["text"]`
Go	`Table`	`table.CellText(row, col)`
C#	`Table`	`table.CellText(row, col)`
Node.js	`Table` interface	`table.cells[row][col]`

Each cell carries text plus a bounding box so you can correlate the extraction back to coordinates on the page.
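For the Python shape, a small helper can flatten a table into rows of strings. This is a sketch under one assumption: rows are reachable as `table["rows"]`. The source only documents `row["cells"][i]["text"]`, so check the actual key before relying on it. `table_to_rows` and the sample data are hypothetical.

```python
from typing import Any


def table_to_rows(table: dict[str, Any]) -> list[list[str]]:
    """Flatten a table dict into a list of rows of cell text.

    Assumption: rows live under table["rows"]; the source only documents
    row["cells"][i]["text"].
    """
    return [[cell["text"] for cell in row["cells"]] for row in table["rows"]]


# Hand-built sample in the assumed shape (bbox values are made up).
sample = {
    "rows": [
        {"cells": [{"text": "Name", "bbox": (72, 700, 60, 12)},
                   {"text": "Qty", "bbox": (140, 700, 30, 12)}]},
        {"cells": [{"text": "Widget", "bbox": (72, 686, 60, 12)},
                   {"text": "3", "bbox": (140, 686, 30, 12)}]},
    ]
}
print(table_to_rows(sample))  # [['Name', 'Qty'], ['Widget', '3']]
```

The resulting list of lists can be handed directly to `csv.writer(...).writerows(...)` or a DataFrame constructor.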

Migration from `doc.extract_*(page_index)`

Old (still supported):

doc = PdfDocument("paper.pdf")
for i in range(doc.page_count()):
    print(doc.extract_text(i))
    print(doc.to_markdown(i, detect_headings=True))
    print(doc.extract_tables(i))

New (v0.3.34+):

with PdfDocument("paper.pdf") as doc:
    for page in doc:
        print(page.text)
        print(page.markdown(detect_headings=True))
        print(page.tables)

Both styles stay supported; the Page style reads better for per-page pipelines and avoids repeated index bookkeeping.
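A per-page pipeline in the Page style might look like the sketch below. It relies only on iterating pages and calling `page.markdown(detect_headings=True)`, both shown above; `document_to_markdown`, `PageLike`, and `FakePage` are hypothetical names, and `FakePage` is a test double so the example runs without a PDF on disk.

```python
from typing import Iterable, Protocol


class PageLike(Protocol):
    """Minimal shape this sketch relies on."""

    def markdown(self, detect_headings: bool = ...) -> str: ...


def document_to_markdown(pages: Iterable[PageLike],
                         separator: str = "\n\n---\n\n") -> str:
    """Join the Markdown of every page, skipping pages with no content."""
    parts = [page.markdown(detect_headings=True) for page in pages]
    return separator.join(part for part in parts if part.strip())


class FakePage:
    """Test double standing in for a real page object."""

    def __init__(self, body: str) -> None:
        self._body = body

    def markdown(self, detect_headings: bool = True) -> str:
        return self._body


result = document_to_markdown(
    [FakePage("# Title"), FakePage(""), FakePage("Body text")]
)
print(result)
```

Because the function accepts any iterable of page-like objects, an open `PdfDocument` could be passed in directly inside its `with` block; no page index appears anywhere in the pipeline.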

Python API Reference
Rust API Reference
Node.js API Reference
Go API Reference
C# API Reference
Text Extraction — underlying extraction methods
Changelog — v0.3.34 Page API introduction
