What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Page API Reference

Since v0.3.34, every binding exposes a Page object, so you can iterate over a document and call extraction methods on each page directly instead of threading page_index through every extraction call. The type is named Page consistently in Python, Node.js, C#, and Go; Rust exposes the same shape through PdfPage.

Quick Example

Python

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    for page in doc:                       # len(doc), doc[i], doc[-1] also work
        print(page.text[:80])
        md = page.markdown(detect_headings=True)

Rust

use pdf_oxide::api::Pdf;

let mut doc = Pdf::open("paper.pdf")?;
for i in 0..doc.page_count()? {
    let page = doc.page(i)?;
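    let text = page.text()?;
    // Take chars instead of slicing bytes: safe for short pages and multi-byte text.
    println!("{}", text.chars().take(80).collect::<String>());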
}

JavaScript / TypeScript (Node)

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
for (const page of doc) {
  console.log(page.extractText().slice(0, 80));
}
doc.close();

Go

package main

import (
    "fmt"
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("paper.pdf")
    if err != nil { log.Fatal(err) }
    defer doc.Close()

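    pages, err := doc.Pages()
    if err != nil { log.Fatal(err) }
    for _, page := range pages {
        text, err := page.ExtractText()
        if err != nil { log.Fatal(err) }
        if len(text) > 80 { text = text[:80] } // avoid slicing past the end
        fmt.Println(text)
    }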
}

C#

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");
foreach (var page in doc.Pages)
{
    var text = page.ExtractText();   // extract once, then truncate
    Console.WriteLine(text[..Math.Min(80, text.Length)]);
}

Java

import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
    for (int i = 0; i < doc.pageCount(); i++) {
        String text = doc.extractText(i);
        System.out.println(text.substring(0, Math.min(80, text.length())));
        String md = doc.toMarkdown(i);
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    for (i in 0 until doc.pageCount()) {
        val text = doc.extractText(i)
        println(text.substring(0, minOf(80, text.length)))
        val md = doc.toMarkdown(i)
    }
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  for (i <- 0 until doc.pageCount()) {
    val text = doc.extractText(i)
    println(text.substring(0, math.min(80, text.length)))
    val md = doc.toMarkdown(i)
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (doseq [i (range (pdf/page-count doc))]
    (let [text (pdf/extract-text doc i)]
      (println (subs text 0 (min 80 (count text))))
      (pdf/to-markdown doc i))))

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  (0...doc.page_count).each do |i|
    text = doc.extract_text(i)
    puts text[0, 80]
    md = doc.to_markdown(i)
  end
end

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
for ($i = 0; $i < $doc->pageCount(); $i++) {
    $text = $doc->extractText($i);
    echo substr($text, 0, 80), "\n";
    $md = $doc->toMarkdown($i);
}
$doc->close();

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
for (int i = 0; i < doc.page_count(); i++) {
    auto text = doc.extract_text(i);
    std::cout << text.substr(0, 80) << "\n";
    auto md = doc.to_markdown(i);
}

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
for i in 0..<(try doc.pageCount()) {
    let text = try doc.extractText(i)
    print(text.prefix(80))
    let md = try doc.toMarkdown(i)
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
for (var i = 0; i < doc.pageCount; i++) {
  final text = doc.extractText(i);
  print(text.substring(0, text.length < 80 ? text.length : 80));
  final md = doc.toMarkdown(i);
}
doc.close();

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
for (i in seq_len(pdf_page_count(doc)) - 1) {  # 0-based; empty for a zero-page document
  text <- pdf_extract_text(doc, i)
  cat(substr(text, 1, 80), "\n")
  md <- pdf_to_markdown(doc, i)
}

Julia

using PdfOxide

doc = open_document("paper.pdf")
for i in 0:(page_count(doc) - 1)
    text = extract_text(doc, i)
    println(first(text, 80))
    md = to_markdown(doc, i)
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("paper.pdf");
var i: usize = 0;
while (i < try doc.pageCount()) : (i += 1) {
    const text = try doc.extractText(a, i);
    std.debug.print("{s}\n", .{text[0..@min(80, text.len)]});
    const md = try doc.toMarkdown(a, i);
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
for (NSInteger i = 0; i < [doc pageCountError:&err]; i++) {
    NSString *text = [doc extractText:i error:&err];
    NSLog(@"%@", [text substringToIndex:MIN(80, text.length)]);
    NSString *md = [doc toMarkdown:i error:&err];
}

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, n} = PdfOxide.page_count(doc)
for i <- 0..(n - 1)//1 do  # explicit step keeps the range empty when n == 0
  {:ok, text} = PdfOxide.extract_text(doc, i)
  IO.puts(String.slice(text, 0, 80))
  {:ok, md} = PdfOxide.to_markdown(doc, i)
end

Python — `Page`

Lazy property surface — content is parsed on first access and cached on the Page.

Member	Returns	Description
`page.text`	`str`	Extracted text (column-aware)
`page.chars`	`list[Char]`	Character-level records with bbox, font
`page.words`	`list[Word]`	Word-level records with bbox
`page.lines`	`list[TextLine]`	Text lines with bbox
`page.spans`	`list[Span]`	Styled spans (font, size, weight)
`page.tables`	`list[Table]`	Structured table rows + cell bboxes
`page.images`	`list[Image]`	Image metadata
`page.paths`	`list[Path]`	Vector path records
`page.annotations`	`list[Annotation]`	Annotations on this page
`page.markdown(detect_headings=True)`	`str`	Markdown conversion
`page.plain_text()`	`str`	Plain text (no layout hints)
`page.html()`	`str`	HTML conversion
`page.render(format="png")`	`bytes`	Render page as PNG / JPEG
`page.search(term, case_sensitive=False)`	`list[SearchResult]`	Find text on this page
`page.region(rect)`	`PageRegion`	Scoped extraction inside a rect

with PdfDocument("paper.pdf") as doc:
    page = doc[0]                 # or doc.page(0)
    for word in page.words:       # first access parses; subsequent calls cached
        print(word.text, word.bbox)

    # Scoped extraction
    header = page.region((0, 700, 612, 92)).extract_text()

The pre-existing editor PdfPage class (for writing) is unchanged; the new Page is strictly read-only.
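
The lazy, cached behaviour described above, together with the len(doc), doc[i], and doc[-1] access from the quick example, can be illustrated with a small pure-Python sketch. This is not pdf_oxide's implementation: LazyPage, LazyDocument, and parse_count are hypothetical names invented for illustration, and the "parsing" is just a string split. It only shows one common way to get parse-once semantics (functools.cached_property) and sequence-style access.

```python
from functools import cached_property


class LazyPage:
    """Hypothetical stand-in for a read-only page; not pdf_oxide's real class."""

    def __init__(self, raw_text: str) -> None:
        self._raw_text = raw_text
        self.parse_count = 0  # counts how often the "expensive" parse runs

    @cached_property
    def words(self) -> list[str]:
        # Runs on first access only; the result is then stored on the instance.
        self.parse_count += 1
        return self._raw_text.split()


class LazyDocument:
    """Hypothetical container supporting len(doc), doc[i], doc[-1], and iteration."""

    def __init__(self, page_texts: list[str]) -> None:
        self._pages = [LazyPage(t) for t in page_texts]

    def __len__(self) -> int:
        return len(self._pages)

    def __getitem__(self, index: int) -> LazyPage:
        # Delegating to a list gives negative indexing and IndexError for free,
        # which also lets a plain for-loop iterate over the document.
        return self._pages[index]


doc = LazyDocument(["alpha beta", "gamma"])
first = doc[0]
first.words
first.words  # second access hits the cache
print(first.parse_count, doc[-1].words, len(doc))  # prints: 1 ['gamma'] 2
```

Because the cache lives on the page instance, holding on to a page object and reusing it is what makes repeated access cheap in this sketch.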

Rust — `PdfPage`

use pdf_oxide::api::Pdf;

let mut doc = Pdf::open("paper.pdf")?;
let page = doc.page(0)?;

let text = page.text()?;
let words = page.extract_words()?;
let tables = page.extract_tables()?;
let md = page.to_markdown(true)?;

Methods available on PdfPage:

text(), plain_text(), to_markdown(detect_headings), to_html()
extract_chars(), extract_words(), extract_lines(), extract_spans()
extract_tables(), extract_paths(), extract_images()
annotations(), render(format)
search(term) — scoped search
find_text_containing(substring) — DOM-level hit list with IDs

Node.js — `Page`

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const page = doc.page(0);

console.log(page.width, page.height, page.rotation);  // cached
console.log(page.extractText());
const words = page.extractWords();
const tables = page.extractTables();
const md = page.toMarkdown();

PdfDocument supports for...of via Symbol.iterator, plus doc.page(i) and doc.pageCount().

Six previously native-only methods are now available on both Page and PdfDocument via the TS layer:

extractWords
extractTextLines
extractTables
extractPaths
getEmbeddedImages
ocrExtractText

Each method has an async sibling — extractTextAsync, toMarkdownAsync, etc.

Go — `Page`

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()

page, _ := doc.Page(0)
text, _ := page.ExtractText()
md, _   := page.ToMarkdown()
tables, _ := page.ExtractTables()

// Iterate every page
all, _ := doc.Pages()
for i, p := range all {
    t, _ := p.ExtractText()
    fmt.Printf("page %d: %d chars\n", i, len(t))
}

Go’s Page struct has the full method surface: ExtractText, ToMarkdown, ToHtml, ToPlainText, ExtractWords, ExtractTextLines, ExtractTables, ExtractChars, ExtractPaths, Annotations, Images, Fonts, RenderPage, Search.

C# — `Page`

using PdfOxide;

using var doc = PdfDocument.Open("paper.pdf");

Page page = doc[0];                            // or doc.Pages[0] or doc.Page(0)
string text = page.ExtractText();
string md   = page.ToMarkdown();
Table[] tables = page.ExtractTables();

// Async variants
string textAsync = await page.ExtractTextAsync();
string mdAsync   = await page.ToMarkdownAsync();

doc.Pages is IReadOnlyList<Page>. Every sync method has an async Task<T> counterpart with CancellationToken support.

Structured Table Shape

extract_tables() (available on both PdfDocument and Page) returns the same table structure in every language, exposed through each binding's native type:

Language	Type	Cell access
Rust	`Table`	iterate `rows[i].cells[j]`
Python	`dict`	`row["cells"][i]["text"]`
Go	`Table`	`table.CellText(row, col)`
C#	`Table`	`table.CellText(row, col)`
Node.js	`Table` interface	`table.cells[row][col]`

Each cell carries text plus a bounding box so you can correlate the extraction back to coordinates on the page.
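
As a rough illustration of the Python cell access pattern in the table above, the sketch below flattens rows into a grid of strings. It runs on hand-written sample data rather than real extraction output. The helper name table_to_grid, the "bbox" key, and the (x, y, width, height) tuple layout are assumptions; only the row["cells"][i]["text"] access comes from the table.

```python
from typing import Any


def table_to_grid(rows: list[dict[str, Any]]) -> list[list[str]]:
    """Collect cell text row by row, following row["cells"][i]["text"]."""
    return [[cell["text"] for cell in row["cells"]] for row in rows]


# Hand-written sample rows; the "bbox" key and its tuple layout are assumptions.
sample_rows = [
    {"cells": [{"text": "Name", "bbox": (72, 700, 100, 12)},
               {"text": "Qty", "bbox": (172, 700, 40, 12)}]},
    {"cells": [{"text": "Bolt", "bbox": (72, 688, 100, 12)},
               {"text": "4", "bbox": (172, 688, 40, 12)}]},
]

grid = table_to_grid(sample_rows)
print(grid)  # prints: [['Name', 'Qty'], ['Bolt', '4']]
```

Keeping the bounding boxes alongside the text, instead of discarding them as this helper does, is what allows a cell to be mapped back to its position on the page.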

Migration from `doc.extract_*(page_index)`

Old (still supported):

doc = PdfDocument("paper.pdf")
for i in range(doc.page_count()):
    print(doc.extract_text(i))
    print(doc.to_markdown(i, detect_headings=True))
    print(doc.extract_tables(i))

New (v0.3.34+):

with PdfDocument("paper.pdf") as doc:
    for page in doc:
        print(page.text)
        print(page.markdown(detect_headings=True))
        print(page.tables)

Both styles stay supported; the Page style reads better for per-page pipelines and avoids repeated index bookkeeping.
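
One way to see the benefit for per-page pipelines is a helper that takes any iterable of page-like objects and reads only the .text attribute. The sketch below is hypothetical: page_previews is not part of pdf_oxide, and it is exercised with stand-in objects so it runs without a PDF. It assumes only that iterating a document yields pages exposing .text, as in the new-style example above.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator


def page_previews(pages: Iterable, width: int = 80) -> Iterator[tuple[int, str]]:
    """Yield (zero-based page number, preview) pairs using only page.text."""
    for number, page in enumerate(pages):
        # Slicing a str never raises, so short or empty pages are fine.
        yield number, page.text[:width]


# Stand-in pages; with the library installed, a PdfDocument would be passed instead.
fake_doc = [SimpleNamespace(text="Introduction to widgets"), SimpleNamespace(text="")]
previews = list(page_previews(fake_doc, width=12))
print(previews)  # prints: [(0, 'Introduction'), (1, '')]
```

Since the helper never touches a page index itself, the same function would work on a whole document or on any subset of pages collected into a list.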

Python API Reference
Rust API Reference
Node.js API Reference
Go API Reference
C# API Reference
Text Extraction — underlying extraction methods
Changelog — v0.3.34 Page API introduction
