What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF OCR: Extract Text from Scanned PDFs with PDF Oxide

Extract text from scanned PDFs with the built-in OCR. Since v0.3.27, OCR is available in every language binding, including Python, Node.js, Go, C#, and Rust, through a unified FFI layer (pdf_ocr_engine_create, pdf_ocr_page_needs_ocr, pdf_ocr_extract_text).

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)

Node.js

const { PdfDocument, OcrEngine } = require("pdf-oxide");

const doc = new PdfDocument("scanned.pdf");
const ocr = new OcrEngine();
if (ocr.pageNeedsOcr(doc, 0)) {
  console.log(ocr.extractText(doc, 0));
}
ocr.close();
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("scanned.pdf")
defer doc.Close()

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

C#

using PdfOxide.Core;
using PdfOxide.Ocr;

using var doc = PdfDocument.Open("scanned.pdf");
using var ocr = new OcrEngine();

if (ocr.PageNeedsOcr(doc, 0))
{
    Console.WriteLine(ocr.ExtractText(doc, 0));
}

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};

let mut doc = PdfDocument::open("scanned.pdf")?;
let config = OcrConfig::default();
let engine = OcrEngine::new("models/det.onnx", "models/rec.onnx", "models/dict.txt", config)?;
let options = OcrExtractOptions::default();
let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), options)?;
println!("{text}");

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open("scanned.pdf")) {
    // Lean-tier bindings have no OCR engine handle — extractTextAuto
    // routes scanned pages through OCR automatically (graceful fallback).
    String text = doc.extractTextAuto(0);
    System.out.println(text);
}

PHP

<?php
use PdfOxide\PdfDocument;

$doc = PdfDocument::open("scanned.pdf");
// No OCR engine handle in the lean tier — extractTextAuto routes
// scanned pages through OCR automatically (graceful fallback).
echo $doc->extractTextAuto(0);

Ruby

require "pdf_oxide"

doc = PdfOxide::PdfDocument.open("scanned.pdf")
# No OCR engine handle in the lean tier — extract_text_auto routes
# scanned pages through OCR automatically (graceful fallback).
puts doc.extract_text_auto(0)

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("scanned.pdf");
auto engine = pdf_oxide::OcrEngine::create("det.onnx", "rec.onnx", "dict.txt");

if (doc.ocr_page_needs_ocr(0)) {
    std::string text = doc.ocr_extract_text(0, &engine);
    std::cout << text << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("scanned.pdf")
let engine = try OcrEngine.create(
    detModelPath: "det.onnx", recModelPath: "rec.onnx", dictPath: "dict.txt")

if try doc.ocrPageNeedsOcr(0) {
    let text = try doc.ocrExtractText(0, engine: engine)
    print(text)
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open("scanned.pdf").use { doc ->
    // Lean-tier bindings have no OCR engine handle — extractTextAuto
    // routes scanned pages through OCR automatically (graceful fallback).
    println(doc.extractTextAuto(0))
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('scanned.pdf');
final engine = OcrEngine.create('det.onnx', 'rec.onnx', 'dict.txt');

if (doc.pageNeedsOcr(0)) {
  print(doc.ocrExtractText(0, engine));
}
engine.close();
doc.close();

R

library(pdfoxide)

doc    <- pdf_open("scanned.pdf")
engine <- pdf_ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

if (pdf_ocr_page_needs_ocr(doc, 0)) {
  text <- pdf_ocr_extract_text(doc, 0, engine)
  cat(text)
}

Julia

using PdfOxide

doc    = open_document("scanned.pdf")
engine = ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

if page_needs_ocr(doc, 0)
    text = ocr_extract_text(doc, 0, engine)
    println(text)
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("scanned.pdf");
defer doc.deinit();
var engine = try pdf_oxide.OcrEngine.create("det.onnx", "rec.onnx", "dict.txt");
defer engine.deinit();

if (try doc.ocrPageNeedsOcr(0)) {
    const text = try doc.ocrExtractText(a, 0, engine);
    defer a.free(text);
    std.debug.print("{s}\n", .{text});
}

Scala

import fyi.oxide.pdf.PdfDocument

val doc = PdfDocument.open("scanned.pdf")
// Lean-tier bindings have no OCR engine handle — extractTextAuto
// routes scanned pages through OCR automatically (graceful fallback).
println(doc.extractTextAuto(0))
doc.close()

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "scanned.pdf")]
  ;; Lean-tier bindings have no OCR engine handle — the AutoExtractor
  ;; routes scanned pages through OCR automatically (graceful fallback).
  (let [ax (pdf/auto-extractor doc)]
    (println (pdf/auto-text ax))))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"scanned.pdf" error:&err];
POXOcrEngine *engine = [POXOcrEngine createWithDetModelPath:@"det.onnx"
                                              recModelPath:@"rec.onnx"
                                                  dictPath:@"dict.txt"
                                                     error:&err];

if ([doc pageNeedsOcr:0 error:&err]) {
    NSString *text = [doc ocrExtractText:0 engine:engine error:&err];
    NSLog(@"%@", text);
}

Elixir

{:ok, doc} = PdfOxide.open("scanned.pdf")
{:ok, engine} = PdfOxide.ocr_engine("det.onnx", "rec.onnx", "dict.txt")

case PdfOxide.ocr_page_needs_ocr(doc, 0) do
  {:ok, true} ->
    {:ok, text} = PdfOxide.ocr_extract_text(doc, 0, engine)
    IO.puts(text)

  _ ->
    :ok
end

PDF Oxide embeds PaddleOCR through ONNX Runtime: no Tesseract installation, no system dependencies, and no subprocess calls. The OCR engine runs directly in-process. The PP-OCRv3, PP-OCRv4, and PP-OCRv5 model families are supported.

Note: OCR is not available in WebAssembly (it requires the native ONNX Runtime). For Go, Node.js, C#, and Rust, build with the ocr feature enabled. Python wheels ship with OCR enabled by default.

Python PDF OCR Without Tesseract

Most Python PDF OCR solutions require Tesseract as a system dependency, and its setup is complicated and differs across operating systems and CI environments. PDF Oxide bundles the PaddleOCR models directly in the Python wheel.

No system dependencies: pip install pdf_oxide is all you need
No subprocess calls: OCR runs natively through ONNX Runtime
Three model families: PP-OCRv3, PP-OCRv4, PP-OCRv5
Automatic page detection: scanned pages and text pages are identified automatically

Comparison: PDF Oxide OCR vs PyMuPDF + Tesseract

	PDF Oxide	PyMuPDF + Tesseract
Installation	`pip install pdf_oxide`	`pip install pymupdf` + system Tesseract
OCR engine	PaddleOCR (ONNX)	Tesseract (subprocess)
Setup complexity	One line	OS-specific Tesseract installation required
CI/Docker	No extra setup	`apt-get install tesseract-ocr` required
Models included	Yes (bundled in the wheel)	No (separate download required)

Installation

Python

pip install pdf_oxide

The OCR models are included in the wheel. No additional download is needed.

Rust

[dependencies]
pdf_oxide = { version = "0.3", features = ["ocr"] }

Go

go build -tags ocr ./...

Node.js

npm install pdf-oxide --build-from-source -- --features ocr

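C#

The NuGet package ships with OCR enabled in its default Linux / macOS / Windows binaries, so no additional setup is needed.

When to Use OCR

Most PDFs contain embedded text, which extract_text() handles at 0.8ms per page. OCR is needed only in the following cases.

Scanned documents: paper documents scanned to PDF
Image-only PDFs: PDFs made from photos or screenshots
PDFs with text as images: some generators rasterize text
Hybrid pages: pages with both native text and scanned image regions

PP-OCR Model Versions

PDF Oxide supports three generations of PaddleOCR models. The default configuration is tuned for PP-OCRv3 and PP-OCRv4. The PP-OCRv5 server models need a different resize strategy.

PP-OCRv3 / PP-OCRv4 (default)

Mobile-optimized models that shrink the image to fit a maximum side length. Suitable for most documents.

Detection model: DBNet++ (lightweight)
Recognition model: SVTR
Resize strategy: MaxSide, which shrinks the longest side to 960px
Best for: general documents, mobile/edge deployment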

Python

from pdf_oxide import OcrConfig, OcrEngine

# Default config works with v3/v4 models
config = OcrConfig()
engine = OcrEngine("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrEngine};

// Default config: MaxSide { max_side: 960 }
let config = OcrConfig::default();
let engine = OcrEngine::new("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)?;

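PP-OCRv5 (server)

Server-grade models that preserve high resolution by enlarging the image when needed. Accuracy improves substantially on dense text and small-print documents.

Detection model: DBNet++ (server, large)
Recognition model: SVTR-v5
Resize strategy: MinSide, which guarantees the shortest side is at least 64px, capped at 4000px
Best for: high-precision extraction, server environments, dense text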

Python

from pdf_oxide import OcrConfig, OcrEngine

# v5 config: high-resolution input for server models
config = OcrConfig(use_v5=True)
engine = OcrEngine("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrEngine};

// v5 config: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();
let engine = OcrEngine::new("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)?;

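Model Comparison

Feature	PP-OCRv3/v4	PP-OCRv5
Resize strategy	`MaxSide` (shrink to 960px)	`MinSide` (enlarge, capped at 4000px)
Input resolution	Low (faster)	High (more accurate)
Detection model size	about 3 MB	about 12 MB
Recognition model size	about 12 MB	about 25 MB
Best for	Mobile, edge, general documents	Server, dense text, small print
`OcrConfig`	`OcrConfig()` / `OcrConfig::default()`	`OcrConfig(use_v5=True)` / `OcrConfig::v5()`

Page Type Detection

PDF Oxide automatically classifies pages to decide whether OCR is needed. The extract_text_ocr() function handles this internally, but you can also detect the page type manually.

Automatically Detecting Scanned Pages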

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("mixed.pdf")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    if len(text.strip()) < 50:
        # Likely scanned — use OCR
        text = doc.extract_text_ocr(i)
        print(f"Page {i + 1} (OCR): {text[:100]}...")
    else:
        print(f"Page {i + 1} (text): {text[:100]}...")

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{detect_page_type, PageType, OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};

let mut doc = PdfDocument::open("mixed.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;

for i in 0..doc.page_count() {
    let page_type = detect_page_type(&mut doc, i)?;
    match page_type {
        PageType::NativeText => {
            let text = doc.extract_text(i)?;
            // Preview by chars, not bytes: byte slicing can panic inside a multi-byte character
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (native): {}...", i + 1, preview);
        }
        PageType::ScannedPage => {
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (OCR): {}...", i + 1, preview);
        }
        PageType::HybridPage => {
            // Has both native text and scanned images; OCR output is merged with the native text
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (hybrid): {}...", i + 1, preview);
        }
    }
}

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open("mixed.pdf")) {
    for (int i = 0; i < doc.pageCount(); i++) {
        // extractTextAuto classifies each page and routes scanned
        // pages through OCR automatically (graceful fallback).
        System.out.printf("Page %d: %s%n", i + 1, doc.extractTextAuto(i));
    }
}

PHP

<?php
use PdfOxide\PdfDocument;

$doc = PdfDocument::open("mixed.pdf");
for ($i = 0; $i < $doc->pageCount(); $i++) {
    // extractTextAuto classifies each page and routes scanned
    // pages through OCR automatically (graceful fallback).
    printf("Page %d: %s\n", $i + 1, $doc->extractTextAuto($i));
}

Ruby

require "pdf_oxide"

doc = PdfOxide::PdfDocument.open("mixed.pdf")
doc.page_count.times do |i|
  # extract_text_auto classifies each page and routes scanned
  # pages through OCR automatically (graceful fallback).
  puts "Page #{i + 1}: #{doc.extract_text_auto(i)}"
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("mixed.pdf");
auto engine = pdf_oxide::OcrEngine::create("det.onnx", "rec.onnx", "dict.txt");

for (int i = 0; i < doc.page_count(); ++i) {
    std::string text = doc.ocr_page_needs_ocr(i)
        ? doc.ocr_extract_text(i, &engine)   // scanned / hybrid → OCR
        : doc.extract_text(i);               // native text
    std::cout << "Page " << (i + 1) << ": " << text << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("mixed.pdf")
let engine = try OcrEngine.create(
    detModelPath: "det.onnx", recModelPath: "rec.onnx", dictPath: "dict.txt")

for i in 0..<(try doc.pageCount()) {
    let text = try doc.ocrPageNeedsOcr(i)
        ? doc.ocrExtractText(i, engine: engine)   // scanned / hybrid → OCR
        : doc.extractText(i)                      // native text
    print("Page \(i + 1): \(text)")
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open("mixed.pdf").use { doc ->
    for (i in 0 until doc.pageCount()) {
        // extractTextAuto classifies each page and routes scanned
        // pages through OCR automatically (graceful fallback).
        println("Page ${i + 1}: ${doc.extractTextAuto(i)}")
    }
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('mixed.pdf');
final engine = OcrEngine.create('det.onnx', 'rec.onnx', 'dict.txt');

for (var i = 0; i < doc.pageCount; i++) {
  final text = doc.pageNeedsOcr(i)
      ? doc.ocrExtractText(i, engine)   // scanned / hybrid → OCR
      : doc.extractText(i);             // native text
  print('Page ${i + 1}: $text');
}
engine.close();
doc.close();

R

library(pdfoxide)

doc    <- pdf_open("mixed.pdf")
engine <- pdf_ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

for (i in seq_len(pdf_page_count(doc)) - 1) {
  text <- if (pdf_ocr_page_needs_ocr(doc, i)) {
    pdf_ocr_extract_text(doc, i, engine)   # scanned / hybrid -> OCR
  } else {
    pdf_extract_text(doc, i)               # native text
  }
  cat(sprintf("Page %d: %s\n", i + 1, text))
}

Julia

using PdfOxide

doc    = open_document("mixed.pdf")
engine = ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

for i in 0:(page_count(doc) - 1)
    text = page_needs_ocr(doc, i) ?
        ocr_extract_text(doc, i, engine) :   # scanned / hybrid -> OCR
        extract_text(doc, i)                 # native text
    println("Page $(i + 1): $text")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("mixed.pdf");
defer doc.deinit();
var engine = try pdf_oxide.OcrEngine.create("det.onnx", "rec.onnx", "dict.txt");
defer engine.deinit();

var i: i32 = 0;
const n = try doc.pageCount();
while (i < n) : (i += 1) {
    const text = if (try doc.ocrPageNeedsOcr(i))
        try doc.ocrExtractText(a, i, engine)   // scanned / hybrid → OCR
    else
        try doc.extractText(a, i);             // native text
    defer a.free(text);
    std.debug.print("Page {d}: {s}\n", .{ i + 1, text });
}

Scala

import fyi.oxide.pdf.PdfDocument

val doc = PdfDocument.open("mixed.pdf")
for (i <- 0 until doc.pageCount) {
  // extractTextAuto classifies each page and routes scanned
  // pages through OCR automatically (graceful fallback).
  println(s"Page ${i + 1}: ${doc.extractTextAuto(i)}")
}
doc.close()

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "mixed.pdf")]
  ;; The AutoExtractor classifies each page and routes scanned pages
  ;; through OCR automatically (graceful fallback).
  (println (pdf/auto-text (pdf/auto-extractor doc))))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];
POXOcrEngine *engine = [POXOcrEngine createWithDetModelPath:@"det.onnx"
                                              recModelPath:@"rec.onnx"
                                                  dictPath:@"dict.txt"
                                                     error:&err];

NSInteger n = [doc pageCountError:&err];
for (NSInteger i = 0; i < n; i++) {
    NSString *text = [doc pageNeedsOcr:i error:&err]
        ? [doc ocrExtractText:i engine:engine error:&err]   // scanned / hybrid → OCR
        : [doc extractText:i error:&err];                   // native text
    NSLog(@"Page %ld: %@", (long)(i + 1), text);
}

Elixir

{:ok, doc} = PdfOxide.open("mixed.pdf")
{:ok, engine} = PdfOxide.ocr_engine("det.onnx", "rec.onnx", "dict.txt")
{:ok, n} = PdfOxide.page_count(doc)

for i <- 0..(n - 1) do
  {:ok, text} =
    case PdfOxide.ocr_page_needs_ocr(doc, i) do
      {:ok, true} -> PdfOxide.ocr_extract_text(doc, i, engine)   # scanned / hybrid -> OCR
      _ -> PdfOxide.extract_text(doc, i)                         # native text
    end

  IO.puts("Page #{i + 1}: #{text}")
end

PageType Variants (Rust)

Variant	Description
`NativeText`	Page with embedded text; no OCR needed
`ScannedPage`	Fully scanned page (large image, little or no text); full OCR
`HybridPage`	Page with both native text and scanned image regions; native text and OCR results are merged

The needs_ocr() helper returns true for both ScannedPage and HybridPage.

use pdf_oxide::ocr::needs_ocr;

if needs_ocr(&mut doc, 0)? {
    let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), OcrExtractOptions::default())?;
}
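
The three-way classification above can be illustrated with a small standalone sketch. This is not PDF Oxide's detection code: the inputs (a native character count and the fraction of the page covered by images) and both thresholds are assumptions made up for illustration, with the 50-character cutoff borrowed from the Python fallback example earlier on this page. Only the rule that `needs_ocr` is true for scanned and hybrid pages comes from the text.

```python
from enum import Enum


class PageType(Enum):
    NATIVE_TEXT = "NativeText"
    SCANNED_PAGE = "ScannedPage"
    HYBRID_PAGE = "HybridPage"


def classify_page(native_chars: int, image_coverage: float,
                  min_chars: int = 50, min_coverage: float = 0.5) -> PageType:
    """Hypothetical heuristic: thresholds are illustrative, not the library's."""
    has_text = native_chars >= min_chars
    has_scan = image_coverage >= min_coverage
    if has_scan and has_text:
        return PageType.HYBRID_PAGE
    if has_scan:
        return PageType.SCANNED_PAGE
    return PageType.NATIVE_TEXT


def needs_ocr(page_type: PageType) -> bool:
    # True for both scanned and hybrid pages, as described above
    return page_type in (PageType.SCANNED_PAGE, PageType.HYBRID_PAGE)
```

For example, a page with 3 native characters and 95% image coverage classifies as `SCANNED_PAGE` under these assumed thresholds, and `needs_ocr` returns `True` for it.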

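How It Works

PDF Oxide renders the page to an image internally (300 DPI)
The image is resized according to the detection strategy (MaxSide for v3/v4, MinSide for v5)
The DBNet++ text detector locates text regions as four-point (quadrilateral) bounding boxes
The SVTR text recognizer reads the characters in each detected region
Results are sorted into reading order and assembled into text
For hybrid pages, the OCR text is merged with the native text

The entire pipeline runs in-process through ONNX Runtime. No external binaries, no subprocess calls, no temporary files.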
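
The reading-order step in the pipeline above can be sketched in a few lines. This is a simplified illustration, not the library's algorithm: it assumes each recognized region is reduced to a `(text, x, y)` tuple for its top-left corner, and the `line_tolerance` parameter is a hypothetical knob for deciding when two regions share a line.

```python
def reading_order(boxes, line_tolerance=10.0):
    """Join (text, x, y) boxes top to bottom, then left to right.

    Boxes whose y differs from the first box of the current line by at
    most line_tolerance are treated as being on the same line.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[2]):
        if lines and abs(box[2] - lines[-1][0][2]) <= line_tolerance:
            lines[-1].append(box)
        else:
            lines.append([box])
    return "\n".join(
        " ".join(b[0] for b in sorted(line, key=lambda b: b[1]))
        for line in lines
    )


boxes = [("world", 120, 12), ("Hello", 10, 10), ("Second", 10, 50), ("line", 90, 52)]
print(reading_order(boxes))
```

A plain sort by `(y, x)` would misorder regions on the same line whose baselines differ by a pixel or two, which is why the sketch groups by line first.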

OCR Configuration

Python

from pdf_oxide import OcrConfig, OcrEngine

# Default (v3/v4)
config = OcrConfig()

# PP-OCRv5 server models
config = OcrConfig(use_v5=True)

# Custom thresholds
config = OcrConfig(
    det_threshold=0.5,    # Detection confidence (0.0-1.0)
    box_threshold=0.7,    # Box confidence (0.0-1.0)
    rec_threshold=0.6,    # Recognition confidence (0.0-1.0)
    num_threads=8,        # ONNX Runtime threads
    max_candidates=500,   # Max text regions
)

# v5 with custom thresholds
config = OcrConfig(use_v5=True, det_threshold=0.4, num_threads=8)

engine = OcrEngine("det.onnx", "rec.onnx", "dict.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrConfigBuilder, DetResizeStrategy};

// Default (v3/v4): MaxSide { max_side: 960 }
let config = OcrConfig::default();

// PP-OCRv5: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();

// Custom builder
let config = OcrConfig::builder()
    .det_threshold(0.5)
    .box_threshold(0.7)
    .rec_threshold(0.6)
    .num_threads(8)
    .max_candidates(500)
    .detect_styles(true)        // Enable style detection from OCR geometry
    .build();

// Custom resize strategy
let config = OcrConfig::builder()
    .det_resize_strategy(DetResizeStrategy::MinSide {
        min_side: 128,
        max_side_limit: 6000,
    })
    .build();

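DetResizeStrategy (Rust)

Controls how the input image is resized before the detection model runs.

Variant	Fields	Description
`MaxSide`	`max_side: u32` (default: 960)	Shrinks the image so the longest side fits within `max_side`. Default for PP-OCRv3/v4.
`MinSide`	`min_side: u32` (default: 64), `max_side_limit: u32` (default: 4000)	Enlarges the image so the shortest side is at least `min_side`, capped by `max_side_limit`. Default for PP-OCRv5.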
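
To make the two strategies concrete, here is a minimal standalone sketch of the size arithmetic they describe. It is an illustration only: the function names are hypothetical, and details such as rounding or padding to a multiple of some stride (common in detection models) are assumptions that the real implementation may handle differently.

```python
def resize_max_side(width: int, height: int, max_side: int = 960) -> tuple[int, int]:
    """MaxSide: shrink so the longest side is at most max_side; never enlarge."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def resize_min_side(width: int, height: int, min_side: int = 64,
                    max_side_limit: int = 4000) -> tuple[int, int]:
    """MinSide: enlarge so the shortest side reaches min_side, then cap the longest side."""
    scale = 1.0
    shortest = min(width, height)
    if shortest < min_side:
        scale = min_side / shortest
    longest = max(width, height) * scale
    if longest > max_side_limit:
        scale *= max_side_limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(resize_max_side(1920, 1080))
print(resize_min_side(200, 32))
```

Under these assumptions a 1920×1080 image shrinks under MaxSide, while a 200×32 strip is enlarged under MinSide so its short side reaches 64px. A page already within the limits passes through unchanged in both functions.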

OcrConfig Fields

Field	Type	Default	Description
`det_threshold`	`f32`	`0.3`	Detection probability threshold
`box_threshold`	`f32`	`0.6`	Box confidence threshold
`rec_threshold`	`f32`	`0.5`	Recognition confidence threshold
`det_max_side`	`u32`	`960`	Maximum image size (v3/v4 compatibility)
`det_resize_strategy`	`DetResizeStrategy`	`MaxSide { 960 }`	Image resize strategy
`rec_target_height`	`u32`	`48`	Target height of recognition crops
`num_threads`	`usize`	`4`	Number of ONNX Runtime inference threads
`unclip_ratio`	`f32`	`1.5`	Box expansion ratio
`max_candidates`	`usize`	`1000`	Maximum number of detected text regions
`detect_styles`	`bool`	`true`	Detect font styles from OCR geometry
`det_model_path`	`Option<PathBuf>`	`None`	Custom detection model path
`rec_model_path`	`Option<PathBuf>`	`None`	Custom recognition model path
`dict_path`	`Option<PathBuf>`	`None`	Custom character dictionary path
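
As a rough illustration of how a confidence threshold from the table might be applied, the sketch below mirrors a few of the documented defaults in a plain dataclass and filters recognized text by `rec_threshold`. The `OcrSettings` class and `keep_spans` function are hypothetical stand-ins, not part of pdf_oxide, and whether the library discards low-confidence results in exactly this way is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrSettings:
    # Defaults copied from the OcrConfig table above
    det_threshold: float = 0.3
    box_threshold: float = 0.6
    rec_threshold: float = 0.5
    num_threads: int = 4
    max_candidates: int = 1000

    def __post_init__(self):
        # Confidence thresholds are probabilities, so reject values outside 0.0-1.0
        for name in ("det_threshold", "box_threshold", "rec_threshold"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0, got {value}")


def keep_spans(spans, settings):
    """Keep (text, confidence) pairs at or above rec_threshold (assumed behavior)."""
    return [(text, conf) for text, conf in spans if conf >= settings.rec_threshold]


spans = [("Invoice", 0.93), ("l1|", 0.31), ("Total", 0.78)]
print(keep_spans(spans, OcrSettings()))
```

Raising `rec_threshold` trades recall for precision: fewer garbled fragments survive, but faint or small text is more likely to be dropped.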

Custom Models

You can use your own ONNX models instead of the built-in ones.

Rust

use pdf_oxide::ocr::OcrConfig;

let config = OcrConfig::builder()
    .det_model_path("models/custom_det.onnx")
    .rec_model_path("models/custom_rec.onnx")
    .dict_path("models/custom_dict.txt")
    .build();

Style Detection

When detect_styles is enabled (the default), PDF Oxide infers font styles (bold, heading level) from OCR geometry (text size, spacing, position). This improves Markdown conversion quality for scanned pages.

let config = OcrConfig::builder()
    .detect_styles(true)    // Infer styles from text geometry
    .build();
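
One way to picture geometry-based style inference is to compare each line's height with the typical body-text height on the page. The sketch below is a hypothetical heuristic for illustration, not PDF Oxide's method: the function name and the two ratio thresholds are assumptions, and it looks only at height, ignoring the spacing and position cues mentioned above.

```python
from statistics import median


def infer_heading_levels(line_heights, h1_ratio=1.8, h2_ratio=1.3):
    """Return 1 or 2 for heading-sized lines and 0 for body text.

    Each line's height is compared with the median height, which is
    taken as the body-text size. The ratios are illustrative guesses.
    """
    if not line_heights:
        return []
    body = median(line_heights)
    levels = []
    for height in line_heights:
        ratio = height / body
        if ratio >= h1_ratio:
            levels.append(1)
        elif ratio >= h2_ratio:
            levels.append(2)
        else:
            levels.append(0)
    return levels


print(infer_heading_levels([40, 20, 20, 28, 20]))
```

Using the median rather than the mean keeps a few large headings from inflating the estimate of the body-text size.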

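OCR vs Tesseract

Feature	PDF Oxide OCR	Tesseract (via PyMuPDF)
Installation	`pip install pdf_oxide`	System package + pytesseract
System dependencies	None	Tesseract binary required
Runtime	ONNX (in-process)	Subprocess call
Model versions	PP-OCRv3, v4, v5	Tesseract LSTM
Languages	Multilingual	Language packs required
Setup complexity	None	Moderate
Detection model	DBNet++	Built into Tesseract
Recognition model	SVTR / SVTR-v5	Tesseract LSTM
High-resolution support	`MinSide` strategy (v5)	DPI setting
Page type detection	Automatic (native/scanned/hybrid)	Manual

Custom DPI

Control the rendering resolution used when PDF pages are converted to images for OCR.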

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")

# Rendered at the default 300 DPI, a good balance of accuracy and speed
text = doc.extract_text_ocr(0)

# No DPI argument is shown for this call; to raise the DPI for fine print,
# configure it via OcrExtractOptions in Rust

Rust

use pdf_oxide::ocr::OcrExtractOptions;

// 300 DPI (the default); higher values improve accuracy but run slower
let options = OcrExtractOptions::default().with_dpi(300.0);

// Lower DPI = faster but less accurate
let options = OcrExtractOptions::default().with_dpi(150.0);
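
To see what a DPI setting means for the image the OCR models receive, the following sketch converts a page size in PDF points to pixels. It relies only on the fact that the default PDF user-space unit is 1/72 inch; the `rendered_size` helper is a hypothetical name, not a pdf_oxide function, and the real renderer may round differently.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: float = 300.0) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI."""
    return (round(width_pt * dpi / POINTS_PER_INCH),
            round(height_pt * dpi / POINTS_PER_INCH))


# US Letter is 612 x 792 points
print(rendered_size(612, 792, dpi=300))
print(rendered_size(612, 792, dpi=150))
```

Halving the DPI quarters the pixel count, which is where most of the speed difference comes from. Note that the detection resize strategy is applied after rendering, so with the default MaxSide strategy a very high DPI is shrunk back toward 960px on the longest side.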

OCR Output Structure (Rust)

The OcrEngine::ocr_image() method returns detailed results that include per-span confidence scores.

use pdf_oxide::ocr::OcrEngine;

let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", Default::default())?;
let output = engine.ocr_image(&image)?;

// Full text in reading order
println!("{}", output.text_in_reading_order());

// Per-span details
for span in &output.spans {
    println!("Text: '{}' (confidence: {:.2})", span.text, span.confidence);
    println!("  Bounding box: {:?}", span.bounding_rect());
    println!("  Per-char confidence: {:?}", span.char_confidences);
}

// Overall confidence
println!("Total confidence: {:.2}", output.total_confidence);

OcrOutput Fields

Field / Method	Type	Description
`spans`	`Vec<OcrSpan>`	All recognized text regions
`total_confidence`	`f32`	Average confidence across all spans
`text()`	`String`	All text joined with spaces
`text_in_reading_order()`	`String`	Text sorted by position (top to bottom, left to right)

OcrSpan Fields

Field	Type	Description
`text`	`String`	Recognized text
`polygon`	`[[f32; 2]; 4]`	Quadrilateral bounding box (4 vertices)
`confidence`	`f32`	Overall confidence (0.0–1.0)
`char_confidences`	`Vec<f32>`	Per-character confidence scores
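
The relationship between these fields can be sketched with a small standalone model. The `Span` class and helper functions below are hypothetical Python stand-ins for illustration, not the pdf_oxide API: they show how an axis-aligned rectangle can be derived from a four-point polygon, and how an average confidence over spans would be computed. Returning 0.0 for an empty result is an assumption.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Span:
    text: str
    polygon: list[tuple[float, float]]  # four (x, y) corners
    confidence: float


def bounding_rect(polygon):
    """Axis-aligned (x_min, y_min, x_max, y_max) enclosing the polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def total_confidence(spans):
    """Mean of the span confidences; 0.0 when there are no spans (assumed)."""
    return fmean([s.confidence for s in spans]) if spans else 0.0


spans = [
    Span("Hello", [(10, 10), (60, 12), (60, 30), (10, 28)], 1.0),
    Span("world", [(70, 11), (120, 11), (120, 29), (70, 29)], 0.5),
]
print(bounding_rect(spans[0].polygon), total_confidence(spans))
```

The polygon keeps all four corners because detected text is often slightly rotated or skewed; the bounding rectangle is a convenience for layout work where that detail does not matter.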

Batch OCR Processing

Process a directory of scanned PDFs.

Python

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("scans/")
output_dir = Path("text-output/")
output_dir.mkdir(exist_ok=True)

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        pages = []
        for i in range(doc.page_count()):
            text = doc.extract_text(i)
            if len(text.strip()) < 50:
                text = doc.extract_text_ocr(i)
            pages.append(text)

        out_path = output_dir / pdf_path.with_suffix(".txt").name
        out_path.write_text("\n\n".join(pages), encoding="utf-8")
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr, needs_ocr};
use std::fs;
use std::path::Path;

let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;
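let options = OcrExtractOptions::default();
// fs::write does not create missing directories, so create the output directory first
fs::create_dir_all("text-output/")?;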

for entry in fs::read_dir("scans/")? {
    let path = entry?.path();
    if path.extension().map_or(false, |e| e == "pdf") {
        let mut doc = PdfDocument::open(path.to_str().unwrap())?;
        let mut all_text = String::new();
        for i in 0..doc.page_count() {
            let text = if needs_ocr(&mut doc, i)? {
                extract_text_with_ocr(&mut doc, i, Some(&engine), options.clone())?
            } else {
                doc.extract_text(i)?
            };
            all_text.push_str(&text);
            all_text.push_str("\n\n");
        }
        // Build "<stem>.txt" explicitly; with_extension would truncate stems that contain dots
        let mut name = path.file_stem().unwrap().to_os_string();
        name.push(".txt");
        fs::write(Path::new("text-output/").join(name), &all_text)?;
    }
}

Parallel OCR (Python)

from pdf_oxide import PdfDocument
from multiprocessing import Pool
from pathlib import Path

def ocr_pdf(pdf_path: str) -> dict:
    doc = PdfDocument(pdf_path)
    text = ""
    for i in range(doc.page_count()):
        text += doc.extract_text_ocr(i) + "\n"
    return {"file": pdf_path, "text": text}

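# The main guard is required where multiprocessing spawns workers (Windows, macOS)
if __name__ == "__main__":
    pdf_files = [str(p) for p in Path("scans/").glob("*.pdf")]

    with Pool(4) as pool:
        results = pool.map(ocr_pdf, pdf_files)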

From OCR to Markdown

Convert scanned pages to Markdown.

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned-report.pdf")

for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if len(md.strip()) < 50:
        # Scanned page — OCR then format
        text = doc.extract_text_ocr(i)
        md = text  # OCR output is plain text
    print(f"--- Page {i + 1} ---")
    print(md)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, needs_ocr, extract_text_with_ocr};

let mut doc = PdfDocument::open("scanned-report.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;

for i in 0..doc.page_count() {
    let text = if needs_ocr(&mut doc, i)? {
        extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?
    } else {
        doc.to_markdown(i, &Default::default())?
    };
    println!("--- Page {} ---\n{}", i + 1, text);
}

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open("scanned-report.pdf")) {
    for (int i = 0; i < doc.pageCount(); i++) {
        String md = doc.toMarkdown(i);
        if (md.strip().length() < 50) {
            // Scanned page — auto-routing returns OCR text (plain).
            md = doc.extractTextAuto(i);
        }
        System.out.printf("--- Page %d ---%n%s%n", i + 1, md);
    }
}

PHP

<?php
use PdfOxide\PdfDocument;

$doc = PdfDocument::open("scanned-report.pdf");
for ($i = 0; $i < $doc->pageCount(); $i++) {
    $md = $doc->toMarkdown($i);
    if (strlen(trim($md)) < 50) {
        // Scanned page — auto-routing returns OCR text (plain).
        $md = $doc->extractTextAuto($i);
    }
    printf("--- Page %d ---\n%s\n", $i + 1, $md);
}

Ruby

require "pdf_oxide"

doc = PdfOxide::PdfDocument.open("scanned-report.pdf")
doc.page_count.times do |i|
  md = doc.to_markdown(i)
  if md.strip.length < 50
    # Scanned page — auto-routing returns OCR text (plain).
    md = doc.extract_text_auto(i)
  end
  puts "--- Page #{i + 1} ---\n#{md}"
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("scanned-report.pdf");
auto engine = pdf_oxide::OcrEngine::create("det.onnx", "rec.onnx", "dict.txt");

for (int i = 0; i < doc.page_count(); ++i) {
    std::string text = doc.ocr_page_needs_ocr(i)
        ? doc.ocr_extract_text(i, &engine)   // scanned / hybrid → OCR
        : doc.to_markdown(i);                // native → Markdown
    std::cout << "--- Page " << (i + 1) << " ---\n" << text << "\n";
}

Swift

import PdfOxide

let doc = try Document.open("scanned-report.pdf")
let engine = try OcrEngine.create(
    detModelPath: "det.onnx", recModelPath: "rec.onnx", dictPath: "dict.txt")

for i in 0..<(try doc.pageCount()) {
    let text = try doc.ocrPageNeedsOcr(i)
        ? doc.ocrExtractText(i, engine: engine)   // scanned / hybrid → OCR
        : doc.toMarkdown(i)                       // native → Markdown
    print("--- Page \(i + 1) ---\n\(text)")
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open("scanned-report.pdf").use { doc ->
    for (i in 0 until doc.pageCount()) {
        var md = doc.toMarkdown(i)
        if (md.trim().length < 50) {
            // Scanned page — auto-routing returns OCR text (plain).
            md = doc.extractTextAuto(i)
        }
        println("--- Page ${i + 1} ---\n$md")
    }
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('scanned-report.pdf');
final engine = OcrEngine.create('det.onnx', 'rec.onnx', 'dict.txt');

for (var i = 0; i < doc.pageCount; i++) {
  final text = doc.pageNeedsOcr(i)
      ? doc.ocrExtractText(i, engine)   // scanned / hybrid → OCR
      : doc.toMarkdown(i);              // native → Markdown
  print('--- Page ${i + 1} ---\n$text');
}
engine.close();
doc.close();

R

library(pdfoxide)

doc    <- pdf_open("scanned-report.pdf")
engine <- pdf_ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

for (i in seq_len(pdf_page_count(doc)) - 1) {
  text <- if (pdf_ocr_page_needs_ocr(doc, i)) {
    pdf_ocr_extract_text(doc, i, engine)   # scanned / hybrid -> OCR
  } else {
    pdf_to_markdown(doc, i)                # native -> Markdown
  }
  cat(sprintf("--- Page %d ---\n%s\n", i + 1, text))
}

Julia

using PdfOxide

doc    = open_document("scanned-report.pdf")
engine = ocr_engine_create("det.onnx", "rec.onnx", "dict.txt")

for i in 0:(page_count(doc) - 1)
    text = page_needs_ocr(doc, i) ?
        ocr_extract_text(doc, i, engine) :   # scanned / hybrid -> OCR
        to_markdown(doc, i)                  # native -> Markdown
    println("--- Page $(i + 1) ---\n$text")
end

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("scanned-report.pdf");
defer doc.deinit();
var engine = try pdf_oxide.OcrEngine.create("det.onnx", "rec.onnx", "dict.txt");
defer engine.deinit();

var i: i32 = 0;
const n = try doc.pageCount();
while (i < n) : (i += 1) {
    const text = if (try doc.ocrPageNeedsOcr(i))
        try doc.ocrExtractText(a, i, engine)   // scanned / hybrid → OCR
    else
        try doc.toMarkdown(a, i);              // native → Markdown
    defer a.free(text);
    std.debug.print("--- Page {d} ---\n{s}\n", .{ i + 1, text });
}

Scala

import fyi.oxide.pdf.PdfDocument

val doc = PdfDocument.open("scanned-report.pdf")
for (i <- 0 until doc.pageCount) {
  var md = doc.toMarkdown(i)
  if (md.trim.length < 50) {
    // Scanned page — auto-routing returns OCR text (plain).
    md = doc.extractTextAuto(i)
  }
  println(s"--- Page ${i + 1} ---\n$md")
}
doc.close()

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "scanned-report.pdf")]
  ;; The AutoExtractor routes scanned pages through OCR automatically.
  (println (pdf/auto-text (pdf/auto-extractor doc))))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"scanned-report.pdf" error:&err];
POXOcrEngine *engine = [POXOcrEngine createWithDetModelPath:@"det.onnx"
                                              recModelPath:@"rec.onnx"
                                                  dictPath:@"dict.txt"
                                                     error:&err];

NSInteger n = [doc pageCountError:&err];
for (NSInteger i = 0; i < n; i++) {
    NSString *text = [doc pageNeedsOcr:i error:&err]
        ? [doc ocrExtractText:i engine:engine error:&err]   // scanned / hybrid → OCR
        : [doc toMarkdown:i error:&err];                    // native → Markdown
    NSLog(@"--- Page %ld ---\n%@", (long)(i + 1), text);
}

Elixir

{:ok, doc} = PdfOxide.open("scanned-report.pdf")
{:ok, engine} = PdfOxide.ocr_engine("det.onnx", "rec.onnx", "dict.txt")
{:ok, n} = PdfOxide.page_count(doc)

for i <- 0..(n - 1) do
  {:ok, text} =
    case PdfOxide.ocr_page_needs_ocr(doc, i) do
      {:ok, true} -> PdfOxide.ocr_extract_text(doc, i, engine)   # scanned / hybrid -> OCR
      _ -> PdfOxide.to_markdown(doc, i)                          # native -> Markdown
    end

  IO.puts("--- Page #{i + 1} ---\n#{text}")
end

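Performance Considerations

OCR is far slower than text extraction.

Operation	Typical speed
Text extraction	0.8ms per page
OCR (v3/v4)	200–1,000ms per page
OCR (v5 server)	500–2,000ms per page

OCR speed depends on page complexity, image resolution, text density, and model version. PP-OCRv5 is slower but more accurate. For large batches, consider parallel processing (see 'Batch OCR Processing' above).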
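
For capacity planning, the per-page figures in the table translate into batch times with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: `estimate_seconds` is a hypothetical helper, and it assumes throughput scales linearly with the number of worker processes, which real workloads rarely achieve once CPU cores or memory bandwidth are saturated.

```python
def estimate_seconds(pages: int, ms_per_page: float, workers: int = 1) -> float:
    """Rough wall-clock estimate, assuming perfect linear scaling across workers."""
    return pages * ms_per_page / 1000 / workers


pages = 1000
print("native text:", estimate_seconds(pages, 0.8), "s")
print("OCR v3/v4, 1 worker:", estimate_seconds(pages, 200), "to", estimate_seconds(pages, 1000), "s")
print("OCR v3/v4, 4 workers:", estimate_seconds(pages, 200, 4), "to", estimate_seconds(pages, 1000, 4), "s")
```

With the table's numbers, 1,000 scanned pages take roughly 200 to 1,000 seconds sequentially with the v3/v4 models, against under a second for native text extraction, which is why running OCR only on pages that need it matters for mixed documents.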

Loading Models from Bytes (Rust)

use pdf_oxide::ocr::{OcrEngine, OcrConfig};

let det_bytes = std::fs::read("models/det.onnx")?;
let rec_bytes = std::fs::read("models/rec.onnx")?;
let dict = std::fs::read_to_string("models/dict.txt")?;

let engine = OcrEngine::from_bytes(&det_bytes, &rec_bytes, &dict, OcrConfig::default())?;
