What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Go)

PDF Oxide is the fastest Go PDF library: 0.8ms mean text extraction, 5.8× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. One module covers extracting, creating, and editing PDFs. Reads are goroutine-safe via sync.RWMutex. MIT / Apache-2.0 licensed.

Installation

Two backends ship as of v0.3.38. Pick one:

Option A — CGo (static link, default)

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest

Requires Go 1.21+ with CGO_ENABLED=1 (the default) and a C toolchain on PATH. Exposes the full API surface. The installer fetches a pdf_oxide-go-ffi-<platform>.tar.gz static archive, verifies its SHA-256 checksum, and prints the CGO_CFLAGS / CGO_LDFLAGS values to export. Because the Rust core is statically linked, the resulting binary is self-contained: no runtime LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / PATH setup is needed. Just go build and ship.
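If you mirror the archive yourself or pass -skip-checksum, you may want to repeat the checksum step by hand. The following is a minimal sketch of SHA-256 verification using only the standard library. It is not the installer's actual code; the function name verifyChecksum and the stand-in archive bytes are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifyChecksum returns an error when the SHA-256 digest of data does not
// match wantHex. The comparison ignores case and surrounding whitespace.
// Hypothetical helper, not part of pdf_oxide.
func verifyChecksum(data []byte, wantHex string) error {
	sum := sha256.Sum256(data)
	got := hex.EncodeToString(sum[:])
	if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
	}
	return nil
}

func main() {
	// Stand-in bytes; in practice this would be the downloaded archive.
	archive := []byte("abc")
	// Stand-in for the published digest (this is the SHA-256 of "abc").
	const published = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"

	if err := verifyChecksum(archive, published); err != nil {
		fmt.Println("refusing to install:", err)
		return
	}
	fmt.Println("checksum ok")
}
```

For large archives, streaming the file through sha256.New() with io.Copy avoids holding the whole download in memory.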

Option B — purego (no C toolchain, `CGO_ENABLED=0`)

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest -shared

Added in v0.3.38 via ebitengine/purego. The installer fetches a pdf_oxide-go-ffi-shared-<platform>.tar.gz cdylib (libpdf_oxide.so / .dylib / .dll) and prints the env vars to export:

export CGO_ENABLED=0
export PDF_OXIDE_LIB_PATH="$HOME/.cache/pdf_oxide/v0.3.38/lib/linux_amd64/libpdf_oxide.so"

Backend selection is automatic via Go’s built-in cgo build tag: //go:build cgo → CGo API, //go:build !cgo → purego.
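Because the purego backend loads the shared library through the exported PDF_OXIDE_LIB_PATH variable, a missing or stale path is an easy mistake to make. The sketch below is an application-side preflight check, assuming you want to fail early with a readable message. The helper name checkSharedLib is hypothetical, and how pdf_oxide itself reports a bad path is not covered here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// checkSharedLib validates a candidate PDF_OXIDE_LIB_PATH value: it must be
// non-empty and point at an existing file rather than a directory.
// Hypothetical helper, not part of pdf_oxide.
func checkSharedLib(path string) error {
	if path == "" {
		return errors.New("PDF_OXIDE_LIB_PATH is not set")
	}
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("PDF_OXIDE_LIB_PATH: %w", err)
	}
	if info.IsDir() {
		return fmt.Errorf("PDF_OXIDE_LIB_PATH %q is a directory, not a library file", path)
	}
	return nil
}

func main() {
	if err := checkSharedLib(os.Getenv("PDF_OXIDE_LIB_PATH")); err != nil {
		fmt.Fprintln(os.Stderr, "purego backend unavailable:", err)
		os.Exit(1)
	}
	fmt.Println("shared library found")
}
```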

Purego surface (what compiles under !cgo): PdfDocument open (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, Page API, fonts, annotations, page elements, search, page dimensions, logging, plus PdfCreator.FromMarkdown / .FromHtml / .FromText for test fixtures.

CGo-only (compile-time error under !cgo): DocumentEditor, DocumentBuilder + FluentPageBuilder + EmbeddedFont, rendering (RenderPage, RenderPageZoom, RenderThumbnail, RenderPageRegion, RenderPageFit), barcodes (GenerateQRCode, GenerateBarcode), signatures (Signatures, Signature.Verify), TSA (TsaClient), OCR (OcrEngine), and SetFormFieldValue / FlattenForms.

Installer flags

Flag	Default	Purpose
`-version`	version baked into the module	Pin to a specific release
`-dir`	`os.UserCacheDir()/pdf_oxide/v<ver>`	Override install directory
`-shared`	off	Fetch the cdylib (purego backend) instead of staticlib
`-write-flags`	empty (just print env)	Directory to write a generated `cgo_flags.go`
`-env-only`	off	Skip download; only print env vars for an existing install
`-skip-checksum`	off	Skip SHA-256 verification (not recommended)

Cache locations (v0.3.38+)

The install root moved to os.UserCacheDir() to match Go’s own GOCACHE convention:

OS	Path
Linux	`$XDG_CACHE_HOME/pdf_oxide` or `~/.cache/pdf_oxide`
macOS	`~/Library/Caches/pdf_oxide`
Windows	`%LocalAppData%\pdf_oxide`
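To locate the install root from your own tooling, you can rebuild the default path the same way the -dir flag's default is described: os.UserCacheDir() joined with pdf_oxide/v<ver>. This is a small sketch under that assumption; installRoot is a hypothetical helper, and the version string is passed in by hand rather than read from the module.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installRoot joins a cache base directory with the pdf_oxide/v<version>
// layout. Hypothetical helper, not part of pdf_oxide.
func installRoot(cacheBase, version string) string {
	return filepath.Join(cacheBase, "pdf_oxide", "v"+version)
}

func main() {
	// os.UserCacheDir resolves to the per-OS locations in the table above.
	base, err := os.UserCacheDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "no user cache dir:", err)
		os.Exit(1)
	}
	fmt.Println(installRoot(base, "0.3.38"))
}
```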

Upgrading from v0.3.30 – v0.3.37: the first go build fails at link time (undefined reference to pdf_document_open ...) until the installer has run once into the new path. The old ~/.pdf_oxide/ directory is not auto-migrated; delete it manually to reclaim disk space.

Monorepo / source-tree builds: add -tags pdf_oxide_dev to point CGo at a local target/release/libpdf_oxide.a — no installer needed.

Prebuilt platform matrix: Linux x64/arm64, macOS x64/arm64 (Apple Silicon), Windows x64 (via x86_64-pc-windows-gnu).
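A build script can check the current target against that matrix before invoking the installer. The sketch below maps the matrix onto Go's GOOS/GOARCH names (x64 is amd64 in Go terms). The table and the hasPrebuilt helper are illustrative and would need updating if the published matrix changes.

```go
package main

import (
	"fmt"
	"runtime"
)

// prebuilt lists the GOOS/GOARCH pairs from the platform matrix above.
var prebuilt = map[string]bool{
	"linux/amd64":   true,
	"linux/arm64":   true,
	"darwin/amd64":  true,
	"darwin/arm64":  true,
	"windows/amd64": true,
}

// hasPrebuilt reports whether the matrix lists the given target.
// Hypothetical helper, not part of pdf_oxide.
func hasPrebuilt(goos, goarch string) bool {
	return prebuilt[goos+"/"+goarch]
}

func main() {
	if hasPrebuilt(runtime.GOOS, runtime.GOARCH) {
		fmt.Println("prebuilt archive listed for", runtime.GOOS, runtime.GOARCH)
	} else {
		fmt.Println("no prebuilt archive listed; a source build is likely needed")
	}
}
```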

Opening a PDF

package main

import (
    "fmt"
    "log"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("research-paper.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

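    // Check each error instead of discarding it.
    count, err := doc.PageCount()
    if err != nil {
        log.Fatal(err)
    }
    major, minor, err := doc.Version()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%d pages, PDF %d.%d\n", count, major, minor)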
}

Page API

Since v0.3.34 you can work page-first. doc.Page(i) returns a lightweight *Page handle that dispatches to the parent document.

page, _ := doc.Page(0)
text, _ := page.Text()
md, _   := page.Markdown()

pages, _ := doc.Pages()
for _, p := range pages {
    t, _ := p.Text()
    fmt.Printf("--- Page %d ---\n%s\n", p.Index+1, t)
}

Each Page exposes Text(), Markdown(), Html(), PlainText(), Chars(), Words(), Lines(), Tables(), Images(), Paths(), Fonts(), Annotations(), Info(), Search(), NeedsOcr(), and TextWithOcr().

Text Extraction

Single Page

text, err := doc.ExtractText(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)

All Pages

allText, err := doc.ExtractAllText()
if err != nil {
    log.Fatal(err)
}
fmt.Println(allText)

Walk Pages Manually

pages, _ := doc.Pages()
for _, p := range pages {
    text, err := p.Text()
    if err != nil {
        log.Printf("page %d: %v", p.Index, err)
        continue
    }
    fmt.Printf("--- Page %d ---\n%s\n", p.Index+1, text)
}

Structured Extraction

words, _  := doc.ExtractWords(0)        // []Word
lines, _  := doc.ExtractTextLines(0)    // []TextLine
chars, _  := doc.ExtractChars(0)        // []Char
tables, _ := doc.ExtractTables(0)       // []Table — rows + cells with bboxes (v0.3.34)
paths, _  := doc.ExtractPaths(0)        // []Path

for _, w := range words {
    fmt.Printf("%q at (%.1f, %.1f)\n", w.Text, w.X, w.Y)
}

for _, t := range tables {
    fmt.Printf("%dx%d (header=%v)\n", t.RowCount, t.ColCount, t.HasHeader)
    for r := 0; r < t.RowCount; r++ {
        for c := 0; c < t.ColCount; c++ {
            fmt.Printf("%s\t", t.CellText(r, c))
        }
        fmt.Println()
    }
}

Region-based extraction:

region, _ := doc.ExtractTextInRect(0, 50, 700, 200, 50) // x, y, w, h
words, _  := doc.ExtractWordsInRect(0, 50, 700, 200, 50)

Markdown Conversion

md, err := doc.ToMarkdown(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(md)

// All pages
allMd, _ := doc.ToMarkdownAll()

HTML Conversion

html, _  := doc.ToHtml(0)
allHtml, _ := doc.ToHtmlAll()

Image Extraction

import "os"

images, err := doc.Images(0)
if err != nil {
    log.Fatal(err)
}

for i, img := range images {
    fmt.Printf("Image %d: %dx%d %s %s %dbpc (%d bytes)\n",
        i, img.Width, img.Height, img.Format, img.Colorspace, img.BitsPerComponent, len(img.Data))
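    // Write the raw image bytes, using the reported format as the extension.
    name := fmt.Sprintf("image_%d.%s", i, img.Format)
    if err := os.WriteFile(name, img.Data, 0644); err != nil {
        log.Fatal(err)
    }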
}

Opening from Bytes and Readers

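// The three calls below are alternatives; use whichever matches your source.

// From bytes
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}
doc, err := pdfoxide.OpenFromBytes(data)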

// From any io.Reader
doc, err := pdfoxide.OpenReader(someReader)

// With password
doc, err := pdfoxide.OpenWithPassword("secure.pdf", "user-password")

PDF Creation

// From Markdown (works under purego)
pdf, _ := pdfoxide.FromMarkdown("# Hello\n\nBody text.")
defer pdf.Close()
pdf.Save("out.pdf")

// From HTML (works under purego)
htmlPdf, _ := pdfoxide.FromHtml("<h1>Invoice</h1><p>Amount: $42</p>")
defer htmlPdf.Close()
htmlPdf.Save("invoice.pdf")

// From text (works under purego)
txt, _ := pdfoxide.FromText("Plain text document.")
defer txt.Close()

// CGo-only beyond this point:

// From image
img, _ := pdfoxide.FromImage("photo.jpg")
defer img.Close()

// Merge several PDFs
merged, _ := pdfoxide.Merge([]string{"a.pdf", "b.pdf"})
os.WriteFile("merged.pdf", merged, 0644)

DocumentBuilder (CGo-only, v0.3.38)

The fluent DocumentBuilder API lands in Go in v0.3.38. Annotations, AcroForm widgets (TextField, Checkbox, ComboBox, RadioGroup, PushButton), graphics primitives (Rect, FilledRect, Line), embedded fonts (CJK / Cyrillic / Greek), and AES-256 encryption all ship here:

font, _ := pdfoxide.EmbeddedFontFromFile("DejaVuSans.ttf")
defer font.Close()

builder := pdfoxide.NewDocumentBuilder()
builder.RegisterEmbeddedFont("DejaVu", font)
builder.A4Page().
    Font("DejaVu", 12).At(72, 720).Text("Privet, mir!").
    Highlight(1.0, 1.0, 0.0).
    TextField("name", 150, 680, 200, 20, "Jane Doe").
    Checkbox("subscribe", 72, 650, 15, 15, true).
    Done()
_ = builder.SaveEncrypted("out.pdf", "user-pw", "owner-pw")

See DocumentBuilder Fluent API for the full method surface (same shape across all bindings).

Rendering

All rendering APIs are CGo-only (compile-time error under CGO_ENABLED=0).

// Format: 0 = PNG, 1 = JPEG
img, err := doc.RenderPage(0, 0)
if err != nil {
    log.Fatal(err)
}
defer img.Close()
img.SaveToFile("page.png")

// Zoom (2x)
zoomed, _ := doc.RenderPageZoom(0, 2.0, 0)
defer zoomed.Close()

// Thumbnail (200px width)
thumb, _ := doc.RenderThumbnail(0, 200, 0)
defer thumb.Close()

// Clipped region (v0.3.38)
region, _ := doc.RenderPageRegion(0, 72, 200, 468, 300, 0)
defer region.Close()

// Fit into a target box (v0.3.38)
fitted, _ := doc.RenderPageFit(0, 1024, 768, 0)
defer fitted.Close()

Search

// Search all pages (case-insensitive)
hits, _ := doc.SearchAll("configuration", false)
for _, r := range hits {
    fmt.Printf("page %d: %q at (%.0f, %.0f)\n", r.Page, r.Text, r.X, r.Y)
}

// Search one page
pageHits, _ := doc.SearchPage(0, "configuration", false)

Editing

DocumentEditor is CGo-only. Use it for metadata, page operations, annotations, and forms:

editor, err := pdfoxide.OpenEditor("in.pdf")
if err != nil {
    log.Fatal(err)
}
defer editor.Close()

// Metadata — one field at a time
_ = editor.SetTitle("Quarterly Report")
_ = editor.SetAuthor("Finance Team")

// Or apply several fields at once
_ = editor.ApplyMetadata(pdfoxide.Metadata{
    Title:   "Q1 2026 Report",
    Author:  "Finance Team",
    Subject: "Results",
})

// Page operations
_ = editor.SetPageRotation(0, 90)
_ = editor.MovePage(2, 0)
_ = editor.DeletePage(5)

// Forms
_ = editor.SetFormFieldValue("employee.name", "Jane Doe")
_ = editor.FlattenForms()

// Save
_ = editor.Save("out.pdf")
_ = editor.SaveEncrypted("secret.pdf", "user", "owner")

Barcodes

Barcode generation is CGo-only.

qr, _ := pdfoxide.GenerateQRCode("https://example.com", 0, 256)
defer qr.Close()
_ = os.WriteFile("qr.png", qr.PNGData(), 0644)

bc, _ := pdfoxide.GenerateBarcode("123456789", 0, 128)
defer bc.Close()

OCR

Build with the ocr feature to enable OCR on scanned pages:

go build -tags ocr ./...

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

See the OCR guide for complete recipes.

Concurrency

PdfDocument reads are goroutine-safe — multiple goroutines can share a single document for parallel page extraction:

import "sync"

var wg sync.WaitGroup
count, _ := doc.PageCount()
out := make(chan string, count)

for i := 0; i < count; i++ {
    wg.Add(1)
    go func(page int) {
        defer wg.Done()
        text, err := doc.ExtractText(page)
        if err == nil {
            out <- text
        }
    }(i)
}

go func() { wg.Wait(); close(out) }()

for text := range out {
    _ = text
}
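The channel-based fan-out above delivers text in completion order and starts one goroutine per page. When page order matters, or the document is large, one option is to write each result into a slice slot indexed by page and cap the number of goroutines in flight. The sketch below shows that pattern with a stand-in extract function so it runs without a PDF; extractOrdered is a hypothetical helper, not part of pdf_oxide.

```go
package main

import (
	"fmt"
	"sync"
)

// extractOrdered calls extract for pages 0..count-1 with at most workers
// goroutines in flight and returns the results indexed by page number.
// If any page fails, the error from the lowest-numbered failing page is
// returned together with whatever text was collected.
func extractOrdered(count, workers int, extract func(page int) (string, error)) ([]string, error) {
	if workers < 1 {
		workers = 1 // an unbuffered semaphore would block forever
	}
	texts := make([]string, count)
	errs := make([]error, count)
	sem := make(chan struct{}, workers) // bounds concurrency
	var wg sync.WaitGroup

	for i := 0; i < count; i++ {
		wg.Add(1)
		sem <- struct{}{} // wait for a free worker slot
		go func(page int) {
			defer wg.Done()
			defer func() { <-sem }()
			// Each goroutine writes only its own slot, so no mutex is needed.
			texts[page], errs[page] = extract(page)
		}(i)
	}
	wg.Wait()

	for _, err := range errs {
		if err != nil {
			return texts, err
		}
	}
	return texts, nil
}

func main() {
	// Stand-in for doc.ExtractText so the sketch runs on its own.
	fake := func(page int) (string, error) {
		return fmt.Sprintf("text of page %d", page+1), nil
	}
	texts, err := extractOrdered(4, 2, fake)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i, t := range texts {
		fmt.Printf("--- Page %d ---\n%s\n", i+1, t)
	}
}
```

With a real document, passing the method value directly, as in extractOrdered(count, 4, doc.ExtractText), should work, assuming ExtractText takes an int page index and returns (string, error) as in the examples above.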

DocumentEditor serializes writes internally, but don’t pipeline independent edits from multiple goroutines — collect changes on one goroutine. See the concurrency guide for patterns.

Error Handling

import "errors"

text, err := doc.ExtractText(0)
if err != nil {
    switch {
    case errors.Is(err, pdfoxide.ErrDocumentClosed):
        log.Print("document is closed")
    case errors.Is(err, pdfoxide.ErrInvalidPageIndex):
        log.Print("invalid page index")
    case errors.Is(err, pdfoxide.ErrExtractionFailed):
        log.Print("extraction failed")
    default:
        log.Printf("unexpected: %v", err)
    }
}

Available sentinel errors:

ErrInvalidPath        ErrDocumentNotFound   ErrInvalidFormat
ErrExtractionFailed   ErrParseError         ErrInvalidPageIndex
ErrSearchFailed       ErrInternal           ErrDocumentClosed
ErrEditorClosed       ErrCreatorClosed      ErrIndexOutOfBounds
ErrEmptyContent

Extract the numeric Code and Message with errors.As:

var e *pdfoxide.Error
if errors.As(err, &e) {
    fmt.Printf("code=%d message=%s\n", e.Code, e.Message)
}
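Both checks keep working when your own code wraps a library error with %w to add context. The sketch below demonstrates this with stand-in types so it runs without pdf_oxide: the Error struct, the ErrInvalidPageIndex value, and the code number 4 are all made up for illustration and only mirror the Code and Message fields named above.

```go
package main

import (
	"errors"
	"fmt"
)

// Error is a hypothetical stand-in for pdfoxide.Error.
type Error struct {
	Code    int
	Message string
}

func (e *Error) Error() string {
	return fmt.Sprintf("pdf_oxide error %d: %s", e.Code, e.Message)
}

// ErrInvalidPageIndex is a stand-in sentinel; the code value 4 is made up.
var ErrInvalidPageIndex = &Error{Code: 4, Message: "invalid page index"}

// extractPage simulates an application function that wraps the sentinel
// with %w to record which page was requested.
func extractPage(page, count int) error {
	if page < 0 || page >= count {
		return fmt.Errorf("extract page %d of %d: %w", page, count, ErrInvalidPageIndex)
	}
	return nil
}

// isInvalidPage reports whether err is, or wraps, the sentinel.
func isInvalidPage(err error) bool {
	return errors.Is(err, ErrInvalidPageIndex)
}

// codeOf pulls the numeric code out of err's chain, if an *Error is present.
func codeOf(err error) (int, bool) {
	var e *Error
	if errors.As(err, &e) {
		return e.Code, true
	}
	return 0, false
}

func main() {
	err := extractPage(7, 3)
	fmt.Println(err)
	fmt.Println("invalid page:", isInvalidPage(err))
	if code, ok := codeOf(err); ok {
		fmt.Println("code:", code)
	}
}
```

Wrapping with %v instead of %w would flatten the error into a plain string, and both errors.Is and errors.As would then stop matching.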

Next Steps

Python Getting Started — using PDF Oxide from Python
Go API Reference — full API documentation
Concurrency Guide — goroutine patterns
Text Extraction — detailed extraction options
PDF Creation — advanced creation
Package on pkg.go.dev — generated API docs