What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Markdown Conversion

PDF Oxide converts PDF pages to clean, readable Markdown. The pipeline extracts text spans, groups them into lines, consults /StructTreeRoot to recover headings and list roles in tagged PDFs, detects multi-column separators and reading-order jumps, groups paragraphs, and emits Markdown syntax.

Since v0.3.36, for tagged PDFs the converter reads StructRole(Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of inferring the heading level from font size. Role information propagates through nested MCRs (H1 → Span → MCR, LI → LBody → Span → MCR). For untagged documents the geometric fallback still applies: bold text with a 5% larger font size is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while rejecting figure captions and years.
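The geometric fallback for untagged documents can be illustrated with a small sketch. This is not the library's Rust code: the regular expression, the helper fallback_heading_level, and the choice to measure the 5% increase against a body font size are all assumptions made for illustration. Years are rejected here simply by limiting numeric markers to three digits.

```python
import re
from typing import Optional

# Hypothetical pattern (an assumption, not the library's rule set): a 1-3 digit
# number, a single letter, or a Roman numeral, followed by "." or ")".
_MARKER = re.compile(r"^(?:\d{1,3}|[A-Za-z]|[ivxlcdm]+|[IVXLCDM]+)[.)]$")


def is_ordered_list_marker(token: str) -> bool:
    """Return True if the first token of a line looks like an ordered-list marker."""
    return bool(_MARKER.match(token))


def fallback_heading_level(font_size: float, body_size: float, bold: bool) -> Optional[int]:
    """Promote bold text at least 5% larger than the body size to H4 (assumed reading)."""
    if bold and font_size >= body_size * 1.05:
        return 4
    return None


for token in ["1.", "12.", "a)", "iv.", "A.", "2019.", "Fig."]:
    print(token, is_ordered_list_marker(token))
```

A four-digit token such as 2019. fails the digit limit, and Fig. fails because it is neither a single letter nor a Roman numeral, which approximates the "rejects figure captions and years" behavior described above.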

Multi-column: spans on the same baseline separated by more than max(3 × font_size, 30 pt) are treated as a column crossing. Backward jumps in X in the reading order (the column-major last-span→first-span pattern) now break paragraphs instead of merging them into meaningless tokens.
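The column-gap rule is simple arithmetic, sketched below. The function names and the coordinate parameters are hypothetical; only the max(3 × font_size, 30 pt) threshold comes from the description above.

```python
def column_gap_threshold(font_size: float) -> float:
    """Gap (in points) above which two same-baseline spans count as separate columns."""
    return max(3.0 * font_size, 30.0)


def is_column_crossing(prev_right_x: float, next_left_x: float, font_size: float) -> bool:
    """True if the horizontal gap between two same-baseline spans exceeds the threshold."""
    gap = next_left_x - prev_right_x
    return gap > column_gap_threshold(font_size)


# 12 pt text: threshold is 36 pt, so a 40 pt gap is a column crossing.
print(is_column_crossing(100.0, 140.0, 12.0))
# 8 pt text: 3 * 8 = 24 pt is below the 30 pt floor, so the threshold stays at 30 pt.
print(column_gap_threshold(8.0))
```

The 30 pt floor keeps very small text (footnotes, captions) from being split into columns by ordinary wide spacing.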

RTL: bidi reordering is disabled by default. The unconditional visual→logical reordering in earlier drafts broke PDFs that were already in logical order (the Hebrew name בנימין was reversed). Spurious **bold** markers around contextual Arabic glyphs are removed. When the input is in visual order, call text::bidi::reorder_visual_to_logical (Rust) manually.

Inline images are capped at 200 KB of base64 payload (added in v0.3.36). Images that exceed the limit emit an HTML comment with the original size; use image_output_dir to write them to disk.
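The inline-image cap can be pictured as follows. This is a minimal sketch, assuming that "200 KB" means 200 × 1024 bytes of base64 text and that the placeholder is a plain HTML comment; the function name and the exact comment wording are hypothetical and may differ from what the library emits.

```python
import base64

# Assumption: the cap applies to the base64 payload and "200 KB" means 200 * 1024 bytes.
INLINE_LIMIT = 200 * 1024


def inline_image_markdown(image_bytes: bytes, mime: str = "image/png", alt: str = "image") -> str:
    """Return a Markdown image with a data URI, or an HTML comment if the payload is too large."""
    payload = base64.b64encode(image_bytes).decode("ascii")
    if len(payload) > INLINE_LIMIT:
        # Hypothetical placeholder format that records the original size.
        return f"<!-- image omitted: {len(image_bytes)} bytes exceeds inline limit -->"
    return f"![{alt}](data:{mime};base64,{payload})"


print(inline_image_markdown(b"\x89PNG"))
print(inline_image_markdown(bytes(200 * 1024)))
```

Because base64 inflates data by roughly a third, an image of about 150 KB on disk is already near the cap under this reading.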

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);
doc.free();

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

API Reference

`to_markdown(page_index, ...) -> str`

Convert a single page to Markdown.

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Parameter	Type	Default	Description
`page_index`	`int` / `usize` / `number`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Preserve visual layout positioning
`detect_headings`	`bool`	`true`	Detect headings based on font size and weight
`include_images`	`bool`	`true`	Include images in output
`image_output_dir`	`str` / `None`	`None`	Directory in which to save extracted images (Python/Rust only). Not affected by the 200 KB inline limit.
`embed_images`	`bool`	`true`	Embed images as base64 data URIs (Python/Rust only). Payloads over 200 KB emit a placeholder HTML comment with the original size (v0.3.36).
`include_form_fields`	`bool`	`true`	Include form field values (Python/JS)

Returns: Markdown string for the page.

`to_markdown_all(...) -> str`

Convert all pages to Markdown, separated by horizontal rules (---).

Python Signature

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown_all(
    &mut self,
    options: &ConversionOptions,
) -> Result<String>

Parameter	Type	Default	Description
`preserve_layout`	`bool`	`false`	Preserve visual layout
`detect_headings`	`bool`	`true`	Detect headings
`include_images`	`bool`	`true`	Include images
`image_output_dir`	`str` / `None`	`None`	Image output directory
`embed_images`	`bool`	`true`	Embed images as base64

Returns: Markdown string for all pages joined with --- separators.
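Because the combined output uses --- separators, it can be split back into per-page chunks, for example before indexing. The sketch below is an assumption-laden helper, not part of the library: split_pages is a hypothetical name, and the exact whitespace around the separator in real output is not specified here, so the pattern tolerates surrounding spaces and blank lines.

```python
import re
from typing import List


def split_pages(markdown: str) -> List[str]:
    """Split combined Markdown on lines that consist solely of '---'."""
    parts = re.split(r"(?m)^[ \t]*---[ \t]*$", markdown)
    return [part.strip() for part in parts if part.strip()]


combined = "# Page one\n\ntext\n\n---\n\n# Page two\n"
for number, page in enumerate(split_pages(combined), start=1):
    print(number, repr(page))
```

A horizontal rule that appears inside a page's own content would also be treated as a separator, so page-by-page conversion with to_markdown is the safer choice when exact page boundaries matter.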

`to_markdown_with_ocr(page_index, model_path, options) -> str`

Convert a page to Markdown with OCR fallback for scanned pages. When the page has little or no extractable text, OCR is used to recognize text from the rendered page image. Requires the ocr feature.

Parameter	Type	Description
`page_index`	`usize`	Zero-based page index
`model_path`	`&str`	Path to the OCR model files
`options`	`&ConversionOptions`	Conversion options

Rust

let mut doc = PdfDocument::open("scanned.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown_with_ocr(0, "/path/to/models", &options)?;
println!("{}", md);
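The fallback decision ("little or no extractable text") can be sketched in a few lines. The threshold of 20 visible characters and both function names are hypothetical; the library's actual criterion is not documented here.

```python
from typing import Callable

# Hypothetical threshold: pages with fewer visible characters are treated as scanned.
MIN_NATIVE_CHARS = 20


def needs_ocr(native_text: str, min_chars: int = MIN_NATIVE_CHARS) -> bool:
    """True if the natively extracted text is too short to be trusted."""
    visible = "".join(native_text.split())  # drop all whitespace before counting
    return len(visible) < min_chars


def text_with_fallback(native_text: str, run_ocr: Callable[[], str]) -> str:
    """Use native text when available; otherwise call the (expensive) OCR step."""
    return run_ocr() if needs_ocr(native_text) else native_text


print(text_with_fallback("", lambda: "text recognized by OCR"))
```

Passing OCR as a callable means the recognition step only runs for pages that actually need it.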

ConversionOptions

The ConversionOptions struct controls all conversion behavior.

Field	Type	Default	Description
`preserve_layout`	`bool`	`false`	Preserve visual layout with positioning
`detect_headings`	`bool`	`true`	Auto-detect headings from font size clusters
`extract_tables`	`bool`	`false`	Extract tables (experimental)
`include_images`	`bool`	`true`	Include images in output
`image_output_dir`	`Option<String>`	`None`	Save images to this directory
`embed_images`	`bool`	`true`	Embed images as base64 data URIs
`reading_order_mode`	`ReadingOrderMode`	`Auto`	How to determine reading order
`bold_marker_behavior`	`BoldMarkerBehavior`	`Conservative`	Bold marker application strategy

How It Works

The Markdown conversion pipeline operates in several stages:

Text Extraction – Extracts TextSpan objects from the page content stream, capturing text, position, font, size, weight, and color.
Character Clustering – Groups characters into words based on inter-character gaps, then words into lines based on vertical proximity.
Reading Order – Determines reading order using either the Tagged PDF structure tree (preferred) or a graph-based spatial analysis of text block positions.
Heading Detection – When detect_headings is enabled, clusters font sizes across the page to identify heading levels. Larger and bolder text is mapped to #, ##, ### headings.
Formatting – Applies bold (**text**) and italic (*text*) markers based on font weight and style metadata.
Table Detection – Identifies tabular layouts using spatial analysis of aligned text blocks and emits GFM-style Markdown tables.
Whitespace Cleanup – Normalizes spacing, removes redundant blank lines, and ensures consistent paragraph breaks.
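The heading-detection stage above can be approximated with a short sketch. It assumes spans are (text, font_size) pairs, treats the size that covers the most characters as body text, and maps each distinct larger size to a heading level. The function names and the exact-match grouping of sizes are simplifications for illustration, not the library's clustering algorithm.

```python
from collections import Counter
from typing import Dict, List, Tuple

Span = Tuple[str, float]  # (text, font size in points); a simplified stand-in for TextSpan


def heading_levels(spans: List[Span], max_levels: int = 3) -> Dict[float, int]:
    """Map font sizes larger than the body size to heading levels 1..max_levels."""
    if not spans:
        return {}
    chars_per_size: Counter = Counter()
    for text, size in spans:
        chars_per_size[size] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]  # size covering the most characters
    larger = sorted({size for _, size in spans if size > body_size}, reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}


def render(spans: List[Span]) -> str:
    """Emit '#'-prefixed headings for larger sizes and plain paragraphs otherwise."""
    levels = heading_levels(spans)
    blocks = []
    for text, size in spans:
        level = levels.get(size)
        blocks.append(f"{'#' * level} {text}" if level else text)
    return "\n\n".join(blocks)


page = [
    ("Title", 24.0),
    ("Section", 16.0),
    ("Body text that is long enough to dominate the page.", 10.0),
    ("More body text.", 10.0),
]
print(render(page))
```

A real implementation would group nearly equal sizes (for example 11.9 pt and 12.0 pt) into one cluster and also weigh boldness, as the stage description indicates.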

Advanced Examples

Convert entire PDF to a Markdown file

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
md = doc.to_markdown_all(detect_headings=True)

with open("book.md", "w", encoding="utf-8") as f:
    f.write(md)

Node.js

const fs = require("node:fs");
const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("book.pdf");
const md = doc.toMarkdownAll();
fs.writeFileSync("book.md", md);
doc.close();

Go

doc, _ := pdfoxide.Open("book.pdf")
defer doc.Close()
md, _ := doc.ToMarkdownAll()
os.WriteFile("book.md", []byte(md), 0644)

C#

using var doc = PdfDocument.Open("book.pdf");
var md = doc.ToMarkdownAll();
File.WriteAllText("book.md", md);

WASM

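// writeFileSync comes from Node's fs module (this example assumes a Node.js host)
const { writeFileSync } = require("node:fs");

const doc = new WasmPdfDocument(bytes);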
const md = doc.toMarkdownAll(true);
writeFileSync("book.md", md);
doc.free();

Convert with images saved to a directory

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions {
    detect_headings: true,
    include_images: true,
    embed_images: false,
    image_output_dir: Some("output/images".to_string()),
    ..Default::default()
};

let md = doc.to_markdown_all(&options)?;
std::fs::write("output/report.md", &md)?;

Page-by-page conversion with progress

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
pages = doc.page_count()

parts = []
for i in range(pages):
    md = doc.to_markdown(i, detect_headings=True)
    parts.append(md)
    print(f"Converted page {i + 1}/{pages}")

full_md = "\n\n---\n\n".join(parts)
# Write as UTF-8 explicitly so non-ASCII text survives on every platform
with open("report.md", "w", encoding="utf-8") as f:
    f.write(full_md)

Disable heading detection for flat text

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
md = doc.to_markdown(0, detect_headings=False)
# All text rendered as paragraphs, no # headings

Related Pages

Text Extraction – Raw text and span extraction
HTML Conversion – Convert to HTML instead of Markdown
Image Extraction – Extract images separately