What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

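Go PDF Library: PDF Oxide

PDF Oxide is the fastest PDF library for Go. Text extraction averages 0.8 ms per page, 5× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate across 3,830 PDFs. Extraction, creation, and editing come in a single module. Reads are goroutine-safe via sync.RWMutex. The license is MIT / Apache-2.0.

Installation

Since v0.3.38, two backends are available. Choose the one that fits your use case.

Option A: CGo (statically linked, default)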

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest

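Requires Go 1.21+, CGO_ENABLED=1 (the default), and a C toolchain on PATH. The full API is available. The installer fetches the platform-specific pdf_oxide-go-ffi-<platform>.tar.gz static archive, verifies its SHA-256, and prints the CGO_CFLAGS / CGO_LDFLAGS values to export. Because the Rust core is statically linked, the resulting binary is fully self-contained and needs no LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / PATH setup at run time. The output of go build can be distributed as-is.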
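For readers who download the archive by hand, the SHA-256 check can be reproduced with the standard library. This is a minimal sketch of digest verification in general, not the installer's actual code; verifySHA256 is a hypothetical helper, and the byte slice stands in for the contents of a downloaded archive.

```go
package main

import (
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "strings"
)

// verifySHA256 returns an error unless data hashes to wantHex.
// wantHex is the expected digest as a hex string (case-insensitive).
func verifySHA256(data []byte, wantHex string) error {
    sum := sha256.Sum256(data)
    got := hex.EncodeToString(sum[:])
    if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
        return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
    }
    return nil
}

func main() {
    // Stand-in for archive bytes; a real caller would read the downloaded file.
    archive := []byte("abc")
    const want = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    if err := verifySHA256(archive, want); err != nil {
        fmt.Println("verification failed:", err)
        return
    }
    fmt.Println("checksum ok")
}
```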

Option B: purego (no C toolchain required, `CGO_ENABLED=0`)

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest -shared

Added in v0.3.38, built on ebitengine/purego. The installer fetches the cdylib (libpdf_oxide.so / .dylib / .dll) from pdf_oxide-go-ffi-shared-<platform>.tar.gz and prints the environment variables to export.

export CGO_ENABLED=0
export PDF_OXIDE_LIB_PATH="$HOME/.cache/pdf_oxide/v0.3.38/lib/linux_amd64/libpdf_oxide.so"

The backend is selected automatically by Go's built-in cgo build tag: //go:build cgo selects the CGo API, and //go:build !cgo selects purego.

APIs available under purego (compile with !cgo): opening a PdfDocument (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, the Page API, fonts, annotations, page elements, search, page dimensions, logging, plus PdfCreator.FromMarkdown / .FromHtml / .FromText for test fixtures.

CGo only (compile error under !cgo): DocumentEditor, DocumentBuilder + FluentPageBuilder + EmbeddedFont, rendering (RenderPage, RenderPageZoom, RenderThumbnail, RenderPageRegion, RenderPageFit), barcodes (GenerateQRCode, GenerateBarcode), signatures (Signatures, Signature.Verify), TSA (TsaClient), OCR (OcrEngine), SetFormFieldValue / FlattenForms.

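Installer flags

Flag	Default	Purpose
`-version`	version embedded in the module	pin to a specific release
`-dir`	`os.UserCacheDir()/pdf_oxide/v<ver>`	override the install location
`-shared`	off	fetch the cdylib (purego backend) instead of the staticlib
`-write-flags`	empty (only prints the environment variables)	directory in which to write a generated `cgo_flags.go`
`-env-only`	off	skip the download and only print the environment variables for an existing install
`-skip-checksum`	off	skip SHA-256 verification (not recommended)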

Cache location (v0.3.38 and later)

The install location moved to os.UserCacheDir(), matching the convention Go itself uses for GOCACHE.

OS	Path
Linux	`$XDG_CACHE_HOME/pdf_oxide` or `~/.cache/pdf_oxide`
macOS	`~/Library/Caches/pdf_oxide`
Windows	`%LocalAppData%\pdf_oxide`

Upgrading from v0.3.30 – v0.3.37: the first go build fails at link time with undefined reference to pdf_document_open ..., so run the installer once against the new path. The legacy ~/.pdf_oxide/ directory is not migrated automatically. Delete it manually if you want to reclaim the disk space.

Building from a monorepo / source tree: with -tags pdf_oxide_dev, CGo links against the local target/release/libpdf_oxide.a. No installer is needed.

Prebuilt platforms: Linux x64/arm64, macOS x64/arm64 (Apple Silicon), Windows x64 (via x86_64-pc-windows-gnu).

Opening a PDF

package main

import (
    "fmt"
    "log"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("research-paper.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    count, _ := doc.PageCount()
    major, minor, _ := doc.Version()
    fmt.Printf("%d pages, PDF %d.%d\n", count, major, minor)
}

Page API

Since v0.3.34, per-page operations are available. doc.Page(i) returns a lightweight *Page handle that delegates its work to the parent document.

page, _ := doc.Page(0)
text, _ := page.Text()
md, _   := page.Markdown()

pages, _ := doc.Pages()
for _, p := range pages {
    t, _ := p.Text()
    fmt.Printf("--- page %d ---\n%s\n", p.Index+1, t)
}

Each Page provides Text(), Markdown(), Html(), PlainText(), Chars(), Words(), Lines(), Tables(), Images(), Paths(), Fonts(), Annotations(), Info(), Search(), NeedsOcr(), and TextWithOcr().

Text extraction

Single page

text, err := doc.ExtractText(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)

All pages

allText, err := doc.ExtractAllText()
if err != nil {
    log.Fatal(err)
}
fmt.Println(allText)

Iterating over pages manually

pages, _ := doc.Pages()
for _, p := range pages {
    text, err := p.Text()
    if err != nil {
        // Report the 1-based page number, matching the output below.
        log.Printf("page %d: %v", p.Index+1, err)
        continue
    }
    fmt.Printf("--- page %d ---\n%s\n", p.Index+1, text)
}

Structured extraction

words, _  := doc.ExtractWords(0)        // []Word
lines, _  := doc.ExtractTextLines(0)    // []TextLine
chars, _  := doc.ExtractChars(0)        // []Char
tables, _ := doc.ExtractTables(0)       // []Table with rows, cells, and bboxes (v0.3.34)
paths, _  := doc.ExtractPaths(0)        // []Path

for _, w := range words {
    fmt.Printf("%q @ (%.1f, %.1f)\n", w.Text, w.X, w.Y)
}

for _, t := range tables {
    fmt.Printf("%dx%d (header=%v)\n", t.RowCount, t.ColCount, t.HasHeader)
    for r := 0; r < t.RowCount; r++ {
        for c := 0; c < t.ColCount; c++ {
            fmt.Printf("%s\t", t.CellText(r, c))
        }
        fmt.Println()
    }
}
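A table walked cell by cell, as in the loop above, is often exported to CSV. The sketch below uses only the standard library and takes the cell accessor as a function, so it runs without a PDF; tableCSV is a hypothetical helper, and with a real table the accessor could wrap t.CellText with t.RowCount and t.ColCount as the dimensions.

```go
package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

// tableCSV renders a rows x cols grid as CSV, reading each cell through cell.
// Quoting of commas, quotes, and newlines is handled by encoding/csv.
func tableCSV(rows, cols int, cell func(r, c int) string) (string, error) {
    var sb strings.Builder
    cw := csv.NewWriter(&sb)
    record := make([]string, cols)
    for r := 0; r < rows; r++ {
        for c := 0; c < cols; c++ {
            record[c] = cell(r, c)
        }
        if err := cw.Write(record); err != nil {
            return "", err
        }
    }
    cw.Flush()
    if err := cw.Error(); err != nil {
        return "", err
    }
    return sb.String(), nil
}

func main() {
    // Stand-in grid; a real caller would read cells from an extracted table.
    grid := [][]string{{"Item", "Price"}, {"Widget, large", "42"}}
    out, err := tableCSV(2, 2, func(r, c int) string { return grid[r][c] })
    if err != nil {
        fmt.Println("csv export failed:", err)
        return
    }
    fmt.Print(out)
}
```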

Extraction by rectangular region:

region, _ := doc.ExtractTextInRect(0, 50, 700, 200, 50) // x, y, width, height
words, _  := doc.ExtractWordsInRect(0, 50, 700, 200, 50)

Markdown conversion

md, err := doc.ToMarkdown(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(md)

// All pages
allMd, _ := doc.ToMarkdownAll()

HTML conversion

html, _  := doc.ToHtml(0)
allHtml, _ := doc.ToHtmlAll()

Image extraction

import "os"

images, err := doc.Images(0)
if err != nil {
    log.Fatal(err)
}

for i, img := range images {
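    fmt.Printf("image %d: %dx%d %s %s %dbpc (%d bytes)\n",
        i, img.Width, img.Height, img.Format, img.Colorspace, img.BitsPerComponent, len(img.Data))
    // Write the raw image bytes, using the reported format as the file extension.
    name := fmt.Sprintf("image_%d.%s", i, img.Format)
    if err := os.WriteFile(name, img.Data, 0644); err != nil {
        log.Fatal(err)
    }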
}

Opening from bytes or an io.Reader

// From a byte slice
data, _ := os.ReadFile("document.pdf")
fromBytes, err := pdfoxide.OpenFromBytes(data)

// From any io.Reader
fromReader, err := pdfoxide.OpenReader(someReader)

// With a password
secured, err := pdfoxide.OpenWithPassword("secure.pdf", "user-password")

Creating PDFs

// From Markdown (also works under purego)
pdf, _ := pdfoxide.FromMarkdown("# Hello\n\nBody text.")
defer pdf.Close()
pdf.Save("out.pdf")

// From HTML (also works under purego)
htmlPdf, _ := pdfoxide.FromHtml("<h1>Invoice</h1><p>Amount: $42</p>")
defer htmlPdf.Close()
htmlPdf.Save("invoice.pdf")

// From plain text (also works under purego)
txt, _ := pdfoxide.FromText("A plain-text document.")
defer txt.Close()

// Everything below is CGo-only:

// From an image
img, _ := pdfoxide.FromImage("photo.jpg")
defer img.Close()

// Merge several PDFs
merged, _ := pdfoxide.Merge([]string{"a.pdf", "b.pdf"})
os.WriteFile("merged.pdf", merged, 0644)

DocumentBuilder (CGo only, v0.3.38)

The fluent DocumentBuilder API arrived in the Go binding in v0.3.38. It covers annotations, AcroForm widgets (TextField, Checkbox, ComboBox, RadioGroup, PushButton), shape primitives (Rect, FilledRect, Line), embedded fonts (CJK / Cyrillic / Greek), and AES-256 encryption.

font, _ := pdfoxide.EmbeddedFontFromFile("DejaVuSans.ttf")
defer font.Close()

builder := pdfoxide.NewDocumentBuilder()
builder.RegisterEmbeddedFont("DejaVu", font)
builder.A4Page().
    Font("DejaVu", 12).At(72, 720).Text("Privet, mir!").
    Highlight(1.0, 1.0, 0.0).
    TextField("name", 150, 680, 200, 20, "Jane Doe").
    Checkbox("subscribe", 72, 650, 15, 15, true).
    Done()
_ = builder.SaveEncrypted("out.pdf", "user-pw", "owner-pw")

For the method list shared by all bindings, see the DocumentBuilder fluent API reference.

Rendering

All rendering APIs are CGo-only (they fail to compile with CGO_ENABLED=0).

// Format: 0 = PNG, 1 = JPEG
img, err := doc.RenderPage(0, 0)
if err != nil {
    log.Fatal(err)
}
defer img.Close()
img.SaveToFile("page.png")

// Zoom (2×)
zoomed, _ := doc.RenderPageZoom(0, 2.0, 0)
defer zoomed.Close()

// Thumbnail (200 px wide)
thumb, _ := doc.RenderThumbnail(0, 200, 0)
defer thumb.Close()

// Clip to a region (v0.3.38)
region, _ := doc.RenderPageRegion(0, 72, 200, 468, 300, 0)
defer region.Close()

// Fit to the given dimensions (v0.3.38)
fitted, _ := doc.RenderPageFit(0, 1024, 768, 0)
defer fitted.Close()

Search

// Search all pages (case-insensitive)
hits, _ := doc.SearchAll("configuration", false)
for _, r := range hits {
    fmt.Printf("page %d: %q @ (%.0f, %.0f)\n", r.Page, r.Text, r.X, r.Y)
}

// Search a single page
pageHits, _ := doc.SearchPage(0, "configuration", false)

Editing

DocumentEditor is CGo-only. Use it to edit metadata, pages, annotations, and forms.

editor, err := pdfoxide.OpenEditor("in.pdf")
if err != nil {
    log.Fatal(err)
}
defer editor.Close()

// Metadata: set fields individually
_ = editor.SetTitle("Quarterly Report")
_ = editor.SetAuthor("Finance Team")

// Set several fields at once
_ = editor.ApplyMetadata(pdfoxide.Metadata{
    Title:   "2026 Q1 Report",
    Author:  "Finance Team",
    Subject: "Financial Results",
})

// Page operations
_ = editor.SetPageRotation(0, 90)
_ = editor.MovePage(2, 0)
_ = editor.DeletePage(5)

// Forms
_ = editor.SetFormFieldValue("employee.name", "Jane Doe")
_ = editor.FlattenForms()

// Save
_ = editor.Save("out.pdf")
_ = editor.SaveEncrypted("secret.pdf", "user", "owner")

Barcodes

Barcode generation is CGo-only.

qr, _ := pdfoxide.GenerateQRCode("https://example.com", 0, 256)
defer qr.Close()
_ = os.WriteFile("qr.png", qr.PNGData(), 0644)

bc, _ := pdfoxide.GenerateBarcode("123456789", 0, 128)
defer bc.Close()

OCR

To use OCR on scanned pages, build with the ocr feature enabled.

go build -tags ocr ./...

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

See the OCR guide for detailed recipes.

Concurrency

Reads on PdfDocument are goroutine-safe: multiple goroutines can share a single document and extract pages in parallel.

import "sync"

var wg sync.WaitGroup
count, _ := doc.PageCount()
out := make(chan string, count)

for i := 0; i < count; i++ {
    wg.Add(1)
    go func(page int) {
        defer wg.Done()
        text, err := doc.ExtractText(page)
        if err == nil {
            out <- text
        }
    }(i)
}

go func() { wg.Wait(); close(out) }()

for text := range out {
    _ = text
}

DocumentEditor serializes writes internally, but do not pipeline independent edits into it from multiple goroutines. Keep all mutations on a single goroutine. See the concurrency guide for patterns.

Error handling

import "errors"

text, err := doc.ExtractText(0)
if err != nil {
    switch {
    case errors.Is(err, pdfoxide.ErrDocumentClosed):
        log.Print("document is closed")
    case errors.Is(err, pdfoxide.ErrInvalidPageIndex):
        log.Print("invalid page index")
    case errors.Is(err, pdfoxide.ErrExtractionFailed):
        log.Print("extraction failed")
    default:
        log.Printf("unexpected: %v", err)
    }
}

Available sentinel errors:

ErrInvalidPath        ErrDocumentNotFound   ErrInvalidFormat
ErrExtractionFailed   ErrParseError         ErrInvalidPageIndex
ErrSearchFailed       ErrInternal           ErrDocumentClosed
ErrEditorClosed       ErrCreatorClosed      ErrIndexOutOfBounds
ErrEmptyContent

Use errors.As to retrieve the numeric Code and Message.

var e *pdfoxide.Error
if errors.As(err, &e) {
    fmt.Printf("code=%d message=%s\n", e.Code, e.Message)
}

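Next steps

Python getting started: using PDF Oxide from Python
Go API reference: complete API documentation
Concurrency guide: goroutine patterns
Text extraction: extraction options in detail
PDF creation: advanced creation topics
Package on pkg.go.dev: auto-generated API documentation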