What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

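Go PDF Library: PDF Oxide

PDF Oxide is the fastest PDF library for Go: text extraction averages 0.8 ms, about 5× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. A single module covers extraction, generation, and editing; read operations are goroutine-safe via sync.RWMutex. Dual-licensed under MIT / Apache-2.0.

Installation

Since v0.3.38 there are two backends; choose one:

Option A: CGo (statically linked, default)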

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest

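Requires Go 1.21+ with CGo enabled (CGO_ENABLED=1 is the default) and a working C toolchain. This backend provides the full API. The installer downloads the pdf_oxide-go-ffi-<platform>.tar.gz static archive, verifies its SHA-256, and prints the CGO_CFLAGS / CGO_LDFLAGS values to export. The Rust core is linked as a static library, so the resulting binary is self-contained: no LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / PATH setup is needed at runtime. Run go build and ship the binary as-is.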

Option B: purego (no C toolchain, `CGO_ENABLED=0`)

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest -shared

Introduced in v0.3.38 via ebitengine/purego. The installer downloads the pdf_oxide-go-ffi-shared-<platform>.tar.gz shared library (libpdf_oxide.so / .dylib / .dll) and prints the environment variables to export:

export CGO_ENABLED=0
export PDF_OXIDE_LIB_PATH="$HOME/.cache/pdf_oxide/v0.3.38/lib/linux_amd64/libpdf_oxide.so"

The backend is selected automatically by Go's built-in cgo build tag: //go:build cgo → CGo API, //go:build !cgo → purego.
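To check which backend a given binary was built with, one option is to read the CGO_ENABLED setting that the Go toolchain records in the build info. This is a sketch under that assumption, using only `runtime/debug` (Go 1.18+); `backendFor` and `cgoSetting` are hypothetical helpers, not part of pdf_oxide.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// backendFor maps a CGO_ENABLED value to the backend that the
// cgo / !cgo build constraints would select.
func backendFor(cgoEnabled string) string {
	if cgoEnabled == "1" {
		return "cgo"
	}
	return "purego"
}

// cgoSetting returns the CGO_ENABLED value recorded in the binary's
// build info, or "" when the setting is not available.
func cgoSetting() string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return ""
	}
	for _, s := range info.Settings {
		if s.Key == "CGO_ENABLED" {
			return s.Value
		}
	}
	return ""
}

func main() {
	v := cgoSetting()
	if v == "" {
		fmt.Println("CGO_ENABLED not recorded in build info")
		return
	}
	fmt.Printf("CGO_ENABLED=%s -> %s backend\n", v, backendFor(v))
}
```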

Available under purego: opening a PdfDocument (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, the Page API, fonts, annotations, page elements, search, page dimensions, logging, and PdfCreator.FromMarkdown / .FromHtml / .FromText for building test fixtures.

CGo only (compile error under !cgo): DocumentEditor, DocumentBuilder + FluentPageBuilder + EmbeddedFont, rendering (RenderPage, RenderPageZoom, RenderThumbnail, RenderPageRegion, RenderPageFit), barcodes (GenerateQRCode, GenerateBarcode), signatures (Signatures, Signature.Verify), TSA (TsaClient), OCR (OcrEngine), and SetFormFieldValue / FlattenForms.

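Installer flags

Flag	Default	Purpose
`-version`	Version embedded in the module	Pin a specific release
`-dir`	`os.UserCacheDir()/pdf_oxide/v<ver>`	Override the install directory
`-shared`	Off	Download the cdylib (purego backend) instead of the staticlib
`-write-flags`	Empty (only print env vars)	Directory in which to write the generated `cgo_flags.go`
`-env-only`	Off	Skip the download; only print env vars for an already-installed version
`-skip-checksum`	Off	Skip SHA-256 verification (not recommended)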

Cache path (v0.3.38+)

The install root has moved to os.UserCacheDir(), in line with Go's own GOCACHE convention:

OS	Path
Linux	`$XDG_CACHE_HOME/pdf_oxide` or `~/.cache/pdf_oxide`
macOS	`~/Library/Caches/pdf_oxide`
Windows	`%LocalAppData%\pdf_oxide`

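Upgrading from v0.3.30 – v0.3.37: the first go build fails at link time (undefined reference to pdf_document_open ...) until you rerun the installer once so that it populates the new path. The old ~/.pdf_oxide/ directory is not migrated automatically; delete it manually to reclaim disk space.

Monorepo or source-tree builds: add -tags pdf_oxide_dev to point CGo at the local target/release/libpdf_oxide.a; no install script is needed.

Prebuilt platforms: Linux x64/arm64, macOS x64/arm64 (Apple Silicon), and Windows x64 (via x86_64-pc-windows-gnu).

Opening a PDF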

package main

import (
    "fmt"
    "log"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("research-paper.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    count, _ := doc.PageCount()
    major, minor, _ := doc.Version()
    fmt.Printf("%d pages, PDF %d.%d\n", count, major, minor)
}

Page API

Since v0.3.34 you can work page by page. doc.Page(i) returns a lightweight *Page handle whose calls are forwarded to the parent document.

page, _ := doc.Page(0)
text, _ := page.Text()
md, _   := page.Markdown()

pages, _ := doc.Pages()
for _, p := range pages {
    t, _ := p.Text()
    fmt.Printf("--- Page %d ---\n%s\n", p.Index+1, t)
}

Every Page exposes Text(), Markdown(), Html(), PlainText(), Chars(), Words(), Lines(), Tables(), Images(), Paths(), Fonts(), Annotations(), Info(), Search(), NeedsOcr(), and TextWithOcr().

Text extraction

Single page

text, err := doc.ExtractText(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)

All pages

allText, err := doc.ExtractAllText()
if err != nil {
    log.Fatal(err)
}
fmt.Println(allText)

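Iterating over pages manually

pages, _ := doc.Pages()
for _, p := range pages {
    text, err := p.Text()
    if err != nil {
        // Report the 1-based page number, matching the output below.
        log.Printf("page %d: %v", p.Index+1, err)
        continue
    }
    fmt.Printf("--- Page %d ---\n%s\n", p.Index+1, text)
}

Structured extraction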

words, _  := doc.ExtractWords(0)        // []Word
lines, _  := doc.ExtractTextLines(0)    // []TextLine
chars, _  := doc.ExtractChars(0)        // []Char
tables, _ := doc.ExtractTables(0)       // []Table: rows and cells with bboxes (v0.3.34)
paths, _  := doc.ExtractPaths(0)        // []Path

for _, w := range words {
    fmt.Printf("%q at (%.1f, %.1f)\n", w.Text, w.X, w.Y)
}

for _, t := range tables {
    fmt.Printf("%dx%d (header=%v)\n", t.RowCount, t.ColCount, t.HasHeader)
    for r := 0; r < t.RowCount; r++ {
        for c := 0; c < t.ColCount; c++ {
            fmt.Printf("%s\t", t.CellText(r, c))
        }
        fmt.Println()
    }
}

Extracting by region:

region, _ := doc.ExtractTextInRect(0, 50, 700, 200, 50) // x, y, width, height
words, _  := doc.ExtractWordsInRect(0, 50, 700, 200, 50)

Converting to Markdown

md, err := doc.ToMarkdown(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(md)

// All pages
allMd, _ := doc.ToMarkdownAll()

Converting to HTML

html, _  := doc.ToHtml(0)
allHtml, _ := doc.ToHtmlAll()

Image extraction

import "os"

images, err := doc.Images(0)
if err != nil {
    log.Fatal(err)
}

for i, img := range images {
    fmt.Printf("image %d: %dx%d %s %s %dbpc (%d bytes)\n",
        i, img.Width, img.Height, img.Format, img.Colorspace, img.BitsPerComponent, len(img.Data))
    os.WriteFile(fmt.Sprintf("image_%d.%s", i, img.Format), img.Data, 0644)
}

Opening from bytes and readers

// From bytes
data, _ := os.ReadFile("document.pdf")
doc, err := pdfoxide.OpenFromBytes(data)

// From any io.Reader
doc, err := pdfoxide.OpenReader(someReader)

// With a password
doc, err := pdfoxide.OpenWithPassword("secure.pdf", "user-password")

Generating PDFs

// From Markdown (also available under purego)
pdf, _ := pdfoxide.FromMarkdown("# Hello\n\nBody text.")
defer pdf.Close()
pdf.Save("out.pdf")

// From HTML (also available under purego)
htmlPdf, _ := pdfoxide.FromHtml("<h1>Invoice</h1><p>Amount: $42</p>")
defer htmlPdf.Close()
htmlPdf.Save("invoice.pdf")

// From text (also available under purego)
txt, _ := pdfoxide.FromText("A plain-text document.")
defer txt.Close()

// The APIs below are CGo only:

// From an image
img, _ := pdfoxide.FromImage("photo.jpg")
defer img.Close()

// Merge multiple PDFs
merged, _ := pdfoxide.Merge([]string{"a.pdf", "b.pdf"})
os.WriteFile("merged.pdf", merged, 0644)

DocumentBuilder (CGo only, v0.3.38)

The fluent DocumentBuilder API landed in Go in v0.3.38. It provides annotations, AcroForm widgets (TextField, Checkbox, ComboBox, RadioGroup, PushButton), graphics primitives (Rect, FilledRect, Line), embedded fonts (CJK / Cyrillic / Greek), and AES-256 encryption:

font, _ := pdfoxide.EmbeddedFontFromFile("DejaVuSans.ttf")
defer font.Close()

builder := pdfoxide.NewDocumentBuilder()
builder.RegisterEmbeddedFont("DejaVu", font)
builder.A4Page().
    Font("DejaVu", 12).At(72, 720).Text("Privet, mir!").
    Highlight(1.0, 1.0, 0.0).
    TextField("name", 150, 680, 200, 20, "Jane Doe").
    Checkbox("subscribe", 72, 650, 15, 15, true).
    Done()
_ = builder.SaveEncrypted("out.pdf", "user-pw", "owner-pw")

For the full method surface (identical across all bindings), see the DocumentBuilder fluent API.

Rendering

All rendering APIs are CGo only (compile error under CGO_ENABLED=0).

// Format: 0 = PNG, 1 = JPEG
img, err := doc.RenderPage(0, 0)
if err != nil {
    log.Fatal(err)
}
defer img.Close()
img.SaveToFile("page.png")

// Zoom (2×)
zoomed, _ := doc.RenderPageZoom(0, 2.0, 0)
defer zoomed.Close()

// Thumbnail (200 px wide)
thumb, _ := doc.RenderThumbnail(0, 200, 0)
defer thumb.Close()

// Crop region (v0.3.38)
region, _ := doc.RenderPageRegion(0, 72, 200, 468, 300, 0)
defer region.Close()

// Fit into a target box (v0.3.38)
fitted, _ := doc.RenderPageFit(0, 1024, 768, 0)
defer fitted.Close()
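The trailing 0 in each render call is the format code (0 = PNG, 1 = JPEG). Naming those codes locally can make call sites easier to read. The following is only a sketch: `imageFormat`, `formatPNG`, `formatJPEG`, and `outputName` are hypothetical names defined here, not pdfoxide identifiers, and it assumes the format argument accepts a plain integer.

```go
package main

import "fmt"

// imageFormat names the integer format codes (0 = PNG, 1 = JPEG).
type imageFormat int

const (
	formatPNG  imageFormat = 0
	formatJPEG imageFormat = 1
)

// ext returns the file extension for a format, or "" if the code is unknown.
func (f imageFormat) ext() string {
	switch f {
	case formatPNG:
		return "png"
	case formatJPEG:
		return "jpg"
	default:
		return ""
	}
}

// outputName builds a 1-based, zero-padded file name for a rendered page.
func outputName(page int, f imageFormat) (string, error) {
	ext := f.ext()
	if ext == "" {
		return "", fmt.Errorf("unknown render format %d", int(f))
	}
	return fmt.Sprintf("page_%03d.%s", page+1, ext), nil
}

func main() {
	for _, f := range []imageFormat{formatPNG, formatJPEG} {
		name, err := outputName(0, f)
		if err != nil {
			fmt.Println(err)
			continue
		}
		// int(f) is the value that would be passed as the format argument.
		fmt.Printf("format %d -> %s\n", int(f), name)
	}
}
```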

Search

// Search all pages (case-insensitive)
hits, _ := doc.SearchAll("configuration", false)
for _, r := range hits {
    fmt.Printf("page %d: %q at (%.0f, %.0f)\n", r.Page, r.Text, r.X, r.Y)
}

// Search a single page
pageHits, _ := doc.SearchPage(0, "configuration", false)

Editing

DocumentEditor is CGo only. Use it for metadata, page operations, annotations, and forms:

editor, err := pdfoxide.OpenEditor("in.pdf")
if err != nil {
    log.Fatal(err)
}
defer editor.Close()

// Metadata: set fields one at a time
_ = editor.SetTitle("Quarterly Report")
_ = editor.SetAuthor("Finance Team")

// Or set several fields at once
_ = editor.ApplyMetadata(pdfoxide.Metadata{
    Title:   "Q1 2026 Report",
    Author:  "Finance Team",
    Subject: "Performance",
})

// Page operations
_ = editor.SetPageRotation(0, 90)
_ = editor.MovePage(2, 0)
_ = editor.DeletePage(5)

// Forms
_ = editor.SetFormFieldValue("employee.name", "Jane Doe")
_ = editor.FlattenForms()

// Save
_ = editor.Save("out.pdf")
_ = editor.SaveEncrypted("secret.pdf", "user", "owner")

Barcodes

Barcode generation is CGo only.

qr, _ := pdfoxide.GenerateQRCode("https://example.com", 0, 256)
defer qr.Close()
_ = os.WriteFile("qr.png", qr.PNGData(), 0644)

bc, _ := pdfoxide.GenerateBarcode("123456789", 0, 128)
defer bc.Close()

OCR

To enable OCR for scanned pages, build with the ocr feature enabled:

go build -tags ocr ./...

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

See the OCR guide for a complete example.

Concurrency

PdfDocument reads are goroutine-safe: multiple goroutines can share one document and extract pages in parallel.

import "sync"

var wg sync.WaitGroup
count, _ := doc.PageCount()
out := make(chan string, count)

for i := 0; i < count; i++ {
    wg.Add(1)
    go func(page int) {
        defer wg.Done()
        text, err := doc.ExtractText(page)
        if err == nil {
            out <- text
        }
    }(i)
}

go func() { wg.Wait(); close(out) }()

for text := range out {
    _ = text
}

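DocumentEditor serializes writes internally, but do not pipeline unrelated edits from multiple goroutines; collect the changes in a single goroutine instead. See the concurrency guide for patterns.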
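One way to collect changes in a single goroutine is to have producers send edit closures over a channel and let one consumer apply them. The sketch below uses a stand-in `editor` type rather than pdfoxide's DocumentEditor; `edit` and `collectAndApply` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// editor is a stand-in for a mutable document handle (not the pdfoxide type).
type editor struct{ log []string }

func (e *editor) apply(op string) { e.log = append(e.log, op) }

// edit is one change to apply to the editor.
type edit func(*editor)

// collectAndApply runs the producers concurrently but applies every edit
// on the calling goroutine, so the editor is never touched in parallel.
func collectAndApply(e *editor, producers ...func(chan<- edit)) {
	edits := make(chan edit)
	var wg sync.WaitGroup
	for _, p := range producers {
		wg.Add(1)
		go func(p func(chan<- edit)) {
			defer wg.Done()
			p(edits)
		}(p)
	}
	// Close the channel once all producers have finished sending.
	go func() { wg.Wait(); close(edits) }()

	for ed := range edits {
		ed(e)
	}
}

func main() {
	e := &editor{}
	metadata := func(out chan<- edit) {
		out <- func(e *editor) { e.apply("set title") }
		out <- func(e *editor) { e.apply("set author") }
	}
	pages := func(out chan<- edit) {
		out <- func(e *editor) { e.apply("rotate page 1") }
	}
	collectAndApply(e, metadata, pages)
	fmt.Printf("applied %d edits on one goroutine\n", len(e.log))
}
```

Edits from different producers may interleave in any order; if ordering across producers matters, sort or sequence them before applying.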

Error handling

import "errors"

text, err := doc.ExtractText(0)
if err != nil {
    switch {
    case errors.Is(err, pdfoxide.ErrDocumentClosed):
        log.Print("document is closed")
    case errors.Is(err, pdfoxide.ErrInvalidPageIndex):
        log.Print("invalid page index")
    case errors.Is(err, pdfoxide.ErrExtractionFailed):
        log.Print("extraction failed")
    default:
        log.Printf("unexpected error: %v", err)
    }
}

Available sentinel errors:

ErrInvalidPath        ErrDocumentNotFound   ErrInvalidFormat
ErrExtractionFailed   ErrParseError         ErrInvalidPageIndex
ErrSearchFailed       ErrInternal           ErrDocumentClosed
ErrEditorClosed       ErrCreatorClosed      ErrIndexOutOfBounds
ErrEmptyContent

Use errors.As to retrieve the numeric Code and Message:

var e *pdfoxide.Error
if errors.As(err, &e) {
    fmt.Printf("code=%d message=%s\n", e.Code, e.Message)
}

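Next steps

Python quick start: using PDF Oxide from Python
Go API reference: full API documentation
Concurrency guide: goroutine patterns
Text extraction: more detailed extraction options
PDF generation: advanced generation
Package on pkg.go.dev: auto-generated API documentation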