What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

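Go PDF Library: PDF Oxide

PDF Oxide is the fastest PDF library for Go. Mean text extraction takes 0.8ms, 5.8× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. Extraction, creation, and editing ship in a single module, and read operations are goroutine-safe via sync.RWMutex. Licensed under MIT / Apache-2.0.

Installation

Two backends are available as of v0.3.38. Pick the one that fits your situation.

Option A: CGo (statically linked, default)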

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest

Requires Go 1.21 or later, CGO_ENABLED=1 (the default), and a C toolchain on PATH. The full API is available. The installer downloads the static archive for your platform, pdf_oxide-go-ffi-<platform>.tar.gz, verifies it with SHA-256, and prints the CGO_CFLAGS / CGO_LDFLAGS values to export. Because the Rust core is statically linked, the resulting binary is fully self-contained and needs no LD_LIBRARY_PATH / DYLD_LIBRARY_PATH / PATH setup at run time. Just go build and ship.
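The checksum step can be illustrated with a minimal sketch. This is not the installer's actual code: verifySHA256 and verifyBytes are hypothetical helpers, the payload is a stand-in for a downloaded archive, and only the Go standard library is used.

```go
package main

import (
    "bytes"
    "crypto/sha256"
    "encoding/hex"
    "fmt"
    "io"
    "strings"
)

// verifySHA256 streams r through SHA-256 and compares the digest with
// wantHex, ignoring case and surrounding whitespace.
// Hypothetical helper, not part of pdf_oxide.
func verifySHA256(r io.Reader, wantHex string) error {
    h := sha256.New()
    if _, err := io.Copy(h, r); err != nil {
        return fmt.Errorf("hashing archive: %w", err)
    }
    got := hex.EncodeToString(h.Sum(nil))
    if !strings.EqualFold(got, strings.TrimSpace(wantHex)) {
        return fmt.Errorf("checksum mismatch: got %s, want %s", got, wantHex)
    }
    return nil
}

// verifyBytes is a convenience wrapper for in-memory data.
func verifyBytes(data []byte, wantHex string) error {
    return verifySHA256(bytes.NewReader(data), wantHex)
}

func main() {
    // A small in-memory payload stands in for the .tar.gz archive, and the
    // expected digest is computed here instead of being read from a release.
    payload := []byte("archive contents")
    sum := sha256.Sum256(payload)
    want := hex.EncodeToString(sum[:])

    fmt.Println(verifyBytes(payload, want) == nil)            // true
    fmt.Println(verifyBytes([]byte("tampered"), want) == nil) // false
}
```

Streaming through io.Copy keeps memory use flat regardless of archive size, which is why the sketch takes an io.Reader and not a byte slice.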

Option B: purego (no C toolchain required, `CGO_ENABLED=0`)

go get github.com/yfedoseev/pdf_oxide/go
go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest -shared

This path was added in v0.3.38 via ebitengine/purego. The installer downloads the pdf_oxide-go-ffi-shared-<platform>.tar.gz cdylib (libpdf_oxide.so / .dylib / .dll) and prints the environment variables to export.

export CGO_ENABLED=0
export PDF_OXIDE_LIB_PATH="$HOME/.cache/pdf_oxide/v0.3.38/lib/linux_amd64/libpdf_oxide.so"
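With the purego backend the shared library is located at run time, so a mistyped PDF_OXIDE_LIB_PATH only surfaces when the program starts. A minimal startup check might look like the sketch below. It assumes the variable must point at an existing regular file; checkLibPath is a hypothetical helper and not part of pdf_oxide, and the library may apply its own lookup rules.

```go
package main

import (
    "errors"
    "fmt"
    "os"
)

// checkLibPath reports an error unless path names an existing regular file.
// Hypothetical helper; os.Stat follows symlinks, so a symlink to the
// library is accepted.
func checkLibPath(path string) error {
    if path == "" {
        return errors.New("PDF_OXIDE_LIB_PATH is not set")
    }
    info, err := os.Stat(path)
    if err != nil {
        return fmt.Errorf("PDF_OXIDE_LIB_PATH: %w", err)
    }
    if !info.Mode().IsRegular() {
        return fmt.Errorf("PDF_OXIDE_LIB_PATH: %s is not a regular file", path)
    }
    return nil
}

func main() {
    if err := checkLibPath(os.Getenv("PDF_OXIDE_LIB_PATH")); err != nil {
        fmt.Println("purego preflight failed:", err)
        return
    }
    fmt.Println("shared library found")
}
```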

The backend is selected automatically by Go's built-in cgo build tag: //go:build cgo → CGo API, //go:build !cgo → purego.

APIs supported under purego (compiled with !cgo): opening a PdfDocument (path / bytes / password), page count, version, text / Markdown / HTML / plain-text extraction, the Page API, fonts, annotations, page elements, search, page dimensions, and logging. PdfCreator.FromMarkdown / .FromHtml / .FromText are also included for test fixtures.

CGo only (compile error under !cgo): DocumentEditor, DocumentBuilder + FluentPageBuilder + EmbeddedFont, rendering (RenderPage, RenderPageZoom, RenderThumbnail, RenderPageRegion, RenderPageFit), barcodes (GenerateQRCode, GenerateBarcode), signatures (Signatures, Signature.Verify), TSA (TsaClient), OCR (OcrEngine), and SetFormFieldValue / FlattenForms.

Installer flags

Flag	Default	Purpose
`-version`	version pinned in the module	Pin to a specific release
`-dir`	`os.UserCacheDir()/pdf_oxide/v<ver>`	Set the install path explicitly
`-shared`	off	Download the cdylib (purego backend) instead of the staticlib
`-write-flags`	empty (only prints environment variables)	Directory in which to generate `cgo_flags.go`
`-env-only`	off	Skip the download and only print the environment variables for an existing install
`-skip-checksum`	off	Skip SHA-256 verification (not recommended)

Cache location (v0.3.38+)

The install root has moved to os.UserCacheDir(), matching Go's GOCACHE convention.

OS	Path
Linux	`$XDG_CACHE_HOME/pdf_oxide` or `~/.cache/pdf_oxide`
macOS	`~/Library/Caches/pdf_oxide`
Windows	`%LocalAppData%\pdf_oxide`
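To see where the install root resolves on a given machine, the documented default of os.UserCacheDir()/pdf_oxide/v<ver> can be reproduced with the standard library. This is a sketch only: installRoot is a hypothetical helper, and the lib/<GOOS>_<GOARCH> subdirectory is an assumption inferred from the Linux example path, not a documented layout.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime"
)

// installRoot joins a cache directory with the documented
// pdf_oxide/v<version> suffix. Hypothetical helper.
func installRoot(cacheDir, version string) string {
    return filepath.Join(cacheDir, "pdf_oxide", "v"+version)
}

func main() {
    cacheDir, err := os.UserCacheDir()
    if err != nil {
        fmt.Println("cannot resolve the user cache directory:", err)
        return
    }
    root := installRoot(cacheDir, "0.3.38")
    fmt.Println("install root:", root)

    // Assumption: platform directories follow the <GOOS>_<GOARCH> pattern
    // seen in the linux_amd64 example.
    fmt.Println("platform dir:", filepath.Join(root, "lib", runtime.GOOS+"_"+runtime.GOARCH))
}
```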

Upgrading from v0.3.30 through v0.3.37: until you run the installer once against the new path, go build fails at link time with undefined reference to pdf_document_open ... The old ~/.pdf_oxide/ directory is not migrated automatically, so delete it manually to reclaim disk space.

Monorepo / source-tree builds: add -tags pdf_oxide_dev and CGo links against the local target/release/libpdf_oxide.a. No installer run is needed.

Prebuilt platforms: Linux x64/arm64, macOS x64/arm64 (Apple Silicon), Windows x64 (using x86_64-pc-windows-gnu).

Opening a PDF

package main

import (
    "fmt"
    "log"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, err := pdfoxide.Open("research-paper.pdf")
    if err != nil {
        log.Fatal(err)
    }
    defer doc.Close()

    count, _ := doc.PageCount()
    major, minor, _ := doc.Version()
    fmt.Printf("%d pages, PDF %d.%d\n", count, major, minor)
}

Page API

As of v0.3.34 you can work page by page. doc.Page(i) returns a lightweight *Page handle that delegates to the parent document.

page, _ := doc.Page(0)
text, _ := page.Text()
md, _   := page.Markdown()

pages, _ := doc.Pages()
for _, p := range pages {
    t, _ := p.Text()
    fmt.Printf("--- page %d ---\n%s\n", p.Index+1, t)
}

Each Page exposes Text(), Markdown(), Html(), PlainText(), Chars(), Words(), Lines(), Tables(), Images(), Paths(), Fonts(), Annotations(), Info(), Search(), NeedsOcr(), and TextWithOcr().

Text extraction

Single page

text, err := doc.ExtractText(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)

All pages

allText, err := doc.ExtractAllText()
if err != nil {
    log.Fatal(err)
}
fmt.Println(allText)

Iterating over pages manually

pages, _ := doc.Pages()
for _, p := range pages {
    text, err := p.Text()
    if err != nil {
        // Report the 1-based page number, matching the header printed below.
        log.Printf("page %d: %v", p.Index+1, err)
        continue
    }
    fmt.Printf("--- page %d ---\n%s\n", p.Index+1, text)
}

Structured extraction

words, _  := doc.ExtractWords(0)        // []Word
lines, _  := doc.ExtractTextLines(0)    // []TextLine
chars, _  := doc.ExtractChars(0)        // []Char
tables, _ := doc.ExtractTables(0)       // []Table, with rows/cells and bboxes (v0.3.34)
paths, _  := doc.ExtractPaths(0)        // []Path

for _, w := range words {
    fmt.Printf("%q @ (%.1f, %.1f)\n", w.Text, w.X, w.Y)
}

for _, t := range tables {
    fmt.Printf("%dx%d (header=%v)\n", t.RowCount, t.ColCount, t.HasHeader)
    for r := 0; r < t.RowCount; r++ {
        for c := 0; c < t.ColCount; c++ {
            fmt.Printf("%s\t", t.CellText(r, c))
        }
        fmt.Println()
    }
}
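The tab-separated printing above breaks down when a cell contains tabs, commas, or newlines. One option is to route the same row/column walk through encoding/csv, which handles quoting. The sketch below uses a callback shaped like the CellText(r, c) call and assumes that call takes two ints and returns a string; tableToCSV is a hypothetical helper, and the grid is stand-in data so the example runs without the library.

```go
package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

// tableToCSV renders a rows×cols grid as CSV text. The cell callback is
// asked for each (row, column) pair in row-major order.
// Hypothetical helper, not part of pdf_oxide.
func tableToCSV(rows, cols int, cell func(r, c int) string) (string, error) {
    var sb strings.Builder
    w := csv.NewWriter(&sb)
    record := make([]string, cols)
    for r := 0; r < rows; r++ {
        for c := 0; c < cols; c++ {
            record[c] = cell(r, c)
        }
        if err := w.Write(record); err != nil {
            return "", err
        }
    }
    w.Flush()
    return sb.String(), w.Error()
}

func main() {
    // Stand-in data. With the library this could be, under the stated
    // assumption, tableToCSV(t.RowCount, t.ColCount, t.CellText).
    grid := [][]string{{"Item", "Price"}, {"Widget, large", "4.20"}}
    out, err := tableToCSV(2, 2, func(r, c int) string { return grid[r][c] })
    if err != nil {
        fmt.Println("csv error:", err)
        return
    }
    fmt.Print(out)
}
```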

Region-based extraction:

region, _ := doc.ExtractTextInRect(0, 50, 700, 200, 50) // x, y, width, height
words, _  := doc.ExtractWordsInRect(0, 50, 700, 200, 50)

Markdown conversion

md, err := doc.ToMarkdown(0)
if err != nil {
    log.Fatal(err)
}
fmt.Println(md)

// All pages
allMd, _ := doc.ToMarkdownAll()

HTML conversion

html, _ := doc.ToHtml(0)
allHtml, _ := doc.ToHtmlAll()

Image extraction

import "os"

images, err := doc.Images(0)
if err != nil {
    log.Fatal(err)
}

for i, img := range images {
    fmt.Printf("image %d: %dx%d %s %s %dbpc (%d bytes)\n",
        i, img.Width, img.Height, img.Format, img.Colorspace, img.BitsPerComponent, len(img.Data))
    name := fmt.Sprintf("image_%d.%s", i, img.Format)
    if err := os.WriteFile(name, img.Data, 0644); err != nil {
        log.Printf("writing %s: %v", name, err)
    }
}

Opening from bytes and readers

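// Each call returns its own document; check err and Close() each one.

// From bytes
data, err := os.ReadFile("document.pdf")
if err != nil {
    log.Fatal(err)
}
fromBytes, err := pdfoxide.OpenFromBytes(data)

// From any io.Reader
fromReader, err := pdfoxide.OpenReader(someReader)

// With a password
protected, err := pdfoxide.OpenWithPassword("secure.pdf", "user-password")

Creating PDFs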

// From Markdown (also works under purego)
pdf, _ := pdfoxide.FromMarkdown("# Hello\n\nThis is the body.")
defer pdf.Close()
pdf.Save("out.pdf")

// From HTML (also works under purego)
htmlPdf, _ := pdfoxide.FromHtml("<h1>Invoice</h1><p>Amount: $42</p>")
defer htmlPdf.Close()
htmlPdf.Save("invoice.pdf")

// From plain text (also works under purego)
txt, _ := pdfoxide.FromText("A plain text document.")
defer txt.Close()

// Everything below is CGo only:

// From an image
img, _ := pdfoxide.FromImage("photo.jpg")
defer img.Close()

// Merge multiple PDFs
merged, _ := pdfoxide.Merge([]string{"a.pdf", "b.pdf"})
os.WriteFile("merged.pdf", merged, 0644)

DocumentBuilder (CGo only, v0.3.38)

The fluent DocumentBuilder API came to Go in v0.3.38. Annotations, AcroForm widgets (TextField, Checkbox, ComboBox, RadioGroup, PushButton), graphics primitives (Rect, FilledRect, Line), embedded fonts (CJK / Cyrillic / Greek), and AES-256 encryption are all available through this API.

font, _ := pdfoxide.EmbeddedFontFromFile("DejaVuSans.ttf")
defer font.Close()

builder := pdfoxide.NewDocumentBuilder()
builder.RegisterEmbeddedFont("DejaVu", font)
builder.A4Page().
    Font("DejaVu", 12).At(72, 720).Text("Privet, mir!").
    Highlight(1.0, 1.0, 0.0).
    TextField("name", 150, 680, 200, 20, "Jane Doe").
    Checkbox("subscribe", 72, 650, 15, 15, true).
    Done()
_ = builder.SaveEncrypted("out.pdf", "user-pw", "owner-pw")
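The positions passed to At, TextField, and Checkbox above look like PDF points, the default PDF user-space unit of 1/72 inch. Assuming that is the unit the builder expects (the page does not state it), a small conversion helper makes millimetre-based layouts easier to reason about. mmToPt is a hypothetical helper, not part of the library.

```go
package main

import "fmt"

// pointsPerMM converts millimetres to PDF points: 72 points per inch,
// 25.4 millimetres per inch.
const pointsPerMM = 72.0 / 25.4

// mmToPt converts a length in millimetres to PDF points.
func mmToPt(mm float64) float64 {
    return mm * pointsPerMM
}

func main() {
    // A4 is 210 × 297 mm.
    fmt.Printf("A4: %.2f x %.2f pt\n", mmToPt(210), mmToPt(297)) // A4: 595.28 x 841.89 pt
    fmt.Printf("25 mm margin: %.2f pt\n", mmToPt(25))            // 25 mm margin: 70.87 pt
}
```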

See the DocumentBuilder fluent API for the full method list (the same shape across all bindings).

Rendering

All rendering APIs are CGo only (they fail to compile with CGO_ENABLED=0).

// Format: 0 = PNG, 1 = JPEG
img, err := doc.RenderPage(0, 0)
if err != nil {
    log.Fatal(err)
}
defer img.Close()
img.SaveToFile("page.png")

// Zoom (2×)
zoomed, _ := doc.RenderPageZoom(0, 2.0, 0)
defer zoomed.Close()

// Thumbnail (200px wide)
thumb, _ := doc.RenderThumbnail(0, 200, 0)
defer thumb.Close()

// Render only a cropped region (v0.3.38)
region, _ := doc.RenderPageRegion(0, 72, 200, 468, 300, 0)
defer region.Close()

// Render to fit a target box (v0.3.38)
fitted, _ := doc.RenderPageFit(0, 1024, 768, 0)
defer fitted.Close()

Search

// Search all pages (case-insensitive)
hits, _ := doc.SearchAll("configuration", false)
for _, r := range hits {
    fmt.Printf("page %d: %q @ (%.0f, %.0f)\n", r.Page, r.Text, r.X, r.Y)
}

// Search a single page
pageHits, _ := doc.SearchPage(0, "configuration", false)
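Search results often need to be summarized per page before being shown to a user. The sketch below counts matches by page using a stand-in hit type that mirrors only the Page and Text fields used in the loop above; the real result type and the type of its Page field (assumed int here) may differ, and hitsPerPage and sortedPages are hypothetical helpers.

```go
package main

import (
    "fmt"
    "sort"
)

// hit is a stand-in for the library's search result type.
type hit struct {
    Page int
    Text string
}

// hitsPerPage counts how many matches fall on each page.
func hitsPerPage(hits []hit) map[int]int {
    counts := make(map[int]int)
    for _, h := range hits {
        counts[h.Page]++
    }
    return counts
}

// sortedPages returns the page numbers present in counts in ascending
// order, since map iteration order is not defined.
func sortedPages(counts map[int]int) []int {
    pages := make([]int, 0, len(counts))
    for p := range counts {
        pages = append(pages, p)
    }
    sort.Ints(pages)
    return pages
}

func main() {
    hits := []hit{
        {Page: 2, Text: "configuration"},
        {Page: 0, Text: "Configuration"},
        {Page: 2, Text: "configuration"},
    }
    counts := hitsPerPage(hits)
    for _, p := range sortedPages(counts) {
        fmt.Printf("page %d: %d match(es)\n", p, counts[p])
    }
}
```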

Editing

DocumentEditor is CGo only. Use it for metadata, page operations, annotations, and form operations.

editor, err := pdfoxide.OpenEditor("in.pdf")
if err != nil {
    log.Fatal(err)
}
defer editor.Close()

// Metadata, one field at a time
_ = editor.SetTitle("Quarterly Report")
_ = editor.SetAuthor("Finance Team")

// Apply several fields at once
_ = editor.ApplyMetadata(pdfoxide.Metadata{
    Title:   "Q1 2026 Report",
    Author:  "Finance Team",
    Subject: "Earnings",
})

// Page operations
_ = editor.SetPageRotation(0, 90)
_ = editor.MovePage(2, 0)
_ = editor.DeletePage(5)

// Forms
_ = editor.SetFormFieldValue("employee.name", "Jane Doe")
_ = editor.FlattenForms()

// Save
_ = editor.Save("out.pdf")
_ = editor.SaveEncrypted("secret.pdf", "user", "owner")

Barcodes

Barcode generation is CGo only.

qr, _ := pdfoxide.GenerateQRCode("https://example.com", 0, 256)
defer qr.Close()
_ = os.WriteFile("qr.png", qr.PNGData(), 0644)

bc, _ := pdfoxide.GenerateBarcode("123456789", 0, 128)
defer bc.Close()

OCR

To use OCR on scanned pages, build with the ocr feature enabled.

go build -tags ocr ./...

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

See the OCR guide for detailed recipes.

Concurrency

Read operations on PdfDocument are goroutine-safe. Multiple goroutines can share one document and extract pages in parallel.

import "sync"

var wg sync.WaitGroup
count, _ := doc.PageCount()
out := make(chan string, count)

for i := 0; i < count; i++ {
    wg.Add(1)
    go func(page int) {
        defer wg.Done()
        text, err := doc.ExtractText(page)
        if err == nil {
            out <- text
        }
    }(i)
}

go func() { wg.Wait(); close(out) }()

for text := range out {
    _ = text
}
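The channel-based loop above delivers text in completion order and silently drops pages that fail. When page order matters, one common alternative is a bounded worker pool that writes each result into a slice slot indexed by page. The sketch below assumes an extractor with the shape func(int) (string, error), which matches how doc.ExtractText is called on this page; extractOrdered is a hypothetical helper, and a fake extractor stands in for the document so the example runs on its own.

```go
package main

import (
    "fmt"
    "sync"
)

// extractOrdered runs extract for pages 0..count-1 on at most `workers`
// goroutines and returns texts and errors indexed by page.
// Hypothetical helper, not part of pdf_oxide.
func extractOrdered(count, workers int, extract func(page int) (string, error)) ([]string, []error) {
    texts := make([]string, count)
    errs := make([]error, count)
    if workers < 1 {
        workers = 1
    }

    jobs := make(chan int)
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for page := range jobs {
                // Every page index is handled by exactly one goroutine,
                // so the slice slots need no extra locking.
                texts[page], errs[page] = extract(page)
            }
        }()
    }
    for i := 0; i < count; i++ {
        jobs <- i
    }
    close(jobs)
    wg.Wait()
    return texts, errs
}

func main() {
    // Stand-in extractor; with the library this could be doc.ExtractText.
    fake := func(page int) (string, error) {
        return fmt.Sprintf("text of page %d", page+1), nil
    }
    texts, errs := extractOrdered(4, 2, fake)
    for i, t := range texts {
        if errs[i] != nil {
            fmt.Printf("page %d failed: %v\n", i+1, errs[i])
            continue
        }
        fmt.Println(t)
    }
}
```

Capping the worker count keeps memory and CPU use predictable on large documents, while the per-page error slice lets the caller decide whether one failed page should abort the whole run.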

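DocumentEditor serializes writes internally, but do not pipeline independent modifications across multiple goroutines. Collect the changes and apply them from a single goroutine. See the concurrency guide for patterns.

Error handling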

import "errors"

text, err := doc.ExtractText(0)
if err != nil {
    switch {
    case errors.Is(err, pdfoxide.ErrDocumentClosed):
        log.Print("document is closed")
    case errors.Is(err, pdfoxide.ErrInvalidPageIndex):
        log.Print("invalid page index")
    case errors.Is(err, pdfoxide.ErrExtractionFailed):
        log.Print("extraction failed")
    default:
        log.Printf("unexpected error: %v", err)
    }
}

Available sentinel errors:

ErrInvalidPath        ErrDocumentNotFound   ErrInvalidFormat
ErrExtractionFailed   ErrParseError         ErrInvalidPageIndex
ErrSearchFailed       ErrInternal           ErrDocumentClosed
ErrEditorClosed       ErrCreatorClosed      ErrIndexOutOfBounds
ErrEmptyContent

Use errors.As to pull out the numeric Code and Message.

var e *pdfoxide.Error
if errors.As(err, &e) {
    fmt.Printf("code=%d message=%s\n", e.Code, e.Message)
}
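Callers usually want to attach the file name and page number to an error without losing the ability to match it with errors.Is. Wrapping with %w keeps the chain inspectable. In the sketch below, errInvalidPageIndex is a local stand-in for pdfoxide.ErrInvalidPageIndex so the example runs without the library, and withContext and isInvalidPage are hypothetical helpers.

```go
package main

import (
    "errors"
    "fmt"
)

// errInvalidPageIndex stands in for pdfoxide.ErrInvalidPageIndex.
var errInvalidPageIndex = errors.New("invalid page index")

// withContext prefixes err with the file name and page number. The %w verb
// keeps the original error reachable through errors.Is and errors.As.
func withContext(file string, page int, err error) error {
    if err == nil {
        return nil
    }
    return fmt.Errorf("%s: page %d: %w", file, page, err)
}

// isInvalidPage reports whether err wraps the invalid-page sentinel.
func isInvalidPage(err error) bool {
    return errors.Is(err, errInvalidPageIndex)
}

func main() {
    err := withContext("report.pdf", 12, errInvalidPageIndex)
    fmt.Println(err)                // report.pdf: page 12: invalid page index
    fmt.Println(isInvalidPage(err)) // true
}
```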

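Next steps

Getting started with Python: using PDF Oxide from Python
Go API reference: full API documentation
Concurrency guide: goroutine patterns
Text extraction: detailed extraction options
PDF creation: advanced creation features
pkg.go.dev package: auto-generated API documentation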