
OCR PDF — Extract Text from Scanned PDFs with PDF Oxide

Extract text from scanned PDFs with built-in OCR. As of v0.3.27, OCR is exposed to all language bindings — Python, Node.js, Go, C#, and Rust — through a unified FFI layer (pdf_ocr_engine_create, pdf_ocr_page_needs_ocr, pdf_ocr_extract_text).

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)

Node.js

const { PdfDocument, OcrEngine } = require("pdf-oxide");

const doc = new PdfDocument("scanned.pdf");
const ocr = new OcrEngine();
if (ocr.pageNeedsOcr(doc, 0)) {
  console.log(ocr.extractText(doc, 0));
}
ocr.close();
doc.close();

Go

import (
    "fmt" // needed for fmt.Println below

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("scanned.pdf")
defer doc.Close()

ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()

if ocr.NeedsOcr(doc, 0) {
    text, _ := ocr.ExtractTextWithOcr(doc, 0)
    fmt.Println(text)
}

C#

using PdfOxide.Core;
using PdfOxide.Ocr;

using var doc = PdfDocument.Open("scanned.pdf");
using var ocr = new OcrEngine();

if (ocr.PageNeedsOcr(doc, 0))
{
    Console.WriteLine(ocr.ExtractText(doc, 0));
}

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};

let mut doc = PdfDocument::open("scanned.pdf")?;
let config = OcrConfig::default();
let engine = OcrEngine::new("models/det.onnx", "models/rec.onnx", "models/dict.txt", config)?;
let options = OcrExtractOptions::default();
let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), options)?;
println!("{text}");

PDF Oxide includes PaddleOCR via ONNX Runtime — no Tesseract installation, no system dependencies, no subprocess calls. The OCR engine runs directly in the process. Supports PP-OCRv3, PP-OCRv4, and PP-OCRv5 model families.

Note: OCR is not available in WebAssembly (it requires native ONNX Runtime). For Go, Node.js, and Rust, build with the ocr feature. Python wheels and the C# NuGet package ship with OCR enabled by default.

Python PDF OCR Without Tesseract

Most Python PDF OCR solutions require installing Tesseract as a system dependency — a complex setup that varies across operating systems and CI environments. PDF Oxide includes PaddleOCR models directly in the Python wheel:

  • No system dependencies: pip install pdf_oxide is all you need
  • No subprocess calls — OCR runs natively via ONNX Runtime
  • Three model families — PP-OCRv3, PP-OCRv4, and PP-OCRv5
  • Automatic page detection — identifies which pages are scanned vs text-based

Comparison: PDF Oxide OCR vs PyMuPDF + Tesseract

| | PDF Oxide | PyMuPDF + Tesseract |
| --- | --- | --- |
| Install | pip install pdf_oxide | pip install pymupdf + system Tesseract |
| OCR engine | PaddleOCR (ONNX) | Tesseract (subprocess) |
| Setup complexity | One line | OS-specific Tesseract install |
| CI/Docker | No extra config | Requires apt-get install tesseract-ocr |
| Models included | Yes (in wheel) | No (separate download) |

Installation

Python

pip install pdf_oxide

The OCR models are included in the wheel. No additional downloads required.

Rust

[dependencies]
pdf_oxide = { version = "0.3", features = ["ocr"] }

Go

go build -tags ocr ./...

Node.js

npm install pdf-oxide --build-from-source -- --features ocr

C#

The NuGet package ships with OCR enabled in the default Linux / macOS / Windows binaries — no extra configuration needed.

When to Use OCR

Most PDFs contain embedded text that extract_text() handles at 0.8ms per page. OCR is only needed for:

  • Scanned documents — paper documents scanned to PDF
  • Image-only PDFs — PDFs created from photos or screenshots
  • PDFs with text as images — some generators rasterize text
  • Hybrid pages — pages with both native text and scanned image regions

PP-OCR Model Versions

PDF Oxide supports three generations of PaddleOCR models. The default configuration works with PP-OCRv3 and PP-OCRv4. PP-OCRv5 server models require a different resize strategy.

PP-OCRv3 / PP-OCRv4 (Default)

Mobile-optimized models that scale images down to fit within a maximum side length. Good for most documents.

  • Detection model: DBNet++ (lightweight)
  • Recognition model: SVTR
  • Resize strategy: MaxSide — scales the longest side down to 960px
  • Best for: standard documents, mobile/edge deployment

Python

from pdf_oxide import OcrConfig, OcrEngine

# Default config works with v3/v4 models
config = OcrConfig()
engine = OcrEngine("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrEngine};

// Default config: MaxSide { max_side: 960 }
let config = OcrConfig::default();
let engine = OcrEngine::new("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)?;

PP-OCRv5 (Server)

Server-grade models that preserve high resolution by scaling images up when needed. Significantly more accurate on dense or fine-print documents.

  • Detection model: DBNet++ (server, larger)
  • Recognition model: SVTR-v5
  • Resize strategy: MinSide — ensures the shortest side is at least 64px, caps at 4000px
  • Best for: high-accuracy extraction, server environments, dense text

Python

from pdf_oxide import OcrConfig, OcrEngine

# v5 config: high-resolution input for server models
config = OcrConfig(use_v5=True)
engine = OcrEngine("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrEngine};

// v5 config: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();
let engine = OcrEngine::new("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)?;

Model Comparison

| Feature | PP-OCRv3/v4 | PP-OCRv5 |
| --- | --- | --- |
| Resize strategy | MaxSide (scale down to 960px) | MinSide (scale up, cap at 4000px) |
| Input resolution | Lower (faster) | Higher (more accurate) |
| Detection model size | ~3 MB | ~12 MB |
| Recognition model size | ~12 MB | ~25 MB |
| Best for | Mobile, edge, standard docs | Server, dense text, fine print |
| OcrConfig | OcrConfig() / OcrConfig::default() | OcrConfig(use_v5=True) / OcrConfig::v5() |

Page Type Detection

PDF Oxide automatically classifies pages to determine whether OCR is needed. The extract_text_ocr() function handles this internally, but you can also detect page types manually.

Auto-Detect Scanned Pages

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("mixed.pdf")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    if len(text.strip()) < 50:
        # Likely scanned — use OCR
        text = doc.extract_text_ocr(i)
        print(f"Page {i + 1} (OCR): {text[:100]}...")
    else:
        print(f"Page {i + 1} (text): {text[:100]}...")

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{detect_page_type, PageType, OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};

let mut doc = PdfDocument::open("mixed.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;

for i in 0..doc.page_count() {
    let page_type = detect_page_type(&mut doc, i)?;
    match page_type {
        PageType::NativeText => {
            let text = doc.extract_text(i)?;
            // Preview by characters, not bytes: slicing at byte 100 can panic inside a multi-byte character
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (native): {}...", i + 1, preview);
        }
        PageType::ScannedPage => {
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (OCR): {}...", i + 1, preview);
        }
        PageType::HybridPage => {
            // Has both native text and scanned images; the OCR path merges both sources
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            let preview: String = text.chars().take(100).collect();
            println!("Page {} (hybrid): {}...", i + 1, preview);
        }
    }
}

PageType Variants (Rust)

| Variant | Description |
| --- | --- |
| NativeText | Page has embedded text; no OCR needed |
| ScannedPage | Page is fully scanned (large image, no/minimal text); full OCR |
| HybridPage | Page has both native text and large scanned images; merges native text with OCR results |

The needs_ocr() helper returns true for both ScannedPage and HybridPage:

use pdf_oxide::ocr::needs_ocr;

if needs_ocr(&mut doc, 0)? {
    let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), OcrExtractOptions::default())?;
}
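
The three page types can be pictured with a small decision function. The following is only a sketch of the idea, not PDF Oxide's actual classifier: the inputs (native_chars, image_coverage) and both thresholds are hypothetical values chosen for illustration.

```python
from enum import Enum


class PageType(Enum):
    NATIVE_TEXT = "NativeText"
    SCANNED_PAGE = "ScannedPage"
    HYBRID_PAGE = "HybridPage"


def classify_page(
    native_chars: int,
    image_coverage: float,
    min_chars: int = 50,        # hypothetical: below this, text counts as "minimal"
    min_coverage: float = 0.5,  # hypothetical: fraction of the page covered by images
) -> PageType:
    """Classify a page from its native text length and image coverage."""
    has_text = native_chars >= min_chars
    has_large_image = image_coverage >= min_coverage
    if has_large_image and has_text:
        return PageType.HYBRID_PAGE
    if has_large_image:
        return PageType.SCANNED_PAGE
    # No large image: there is nothing to OCR, even if the page is nearly blank.
    return PageType.NATIVE_TEXT


def needs_ocr(page_type: PageType) -> bool:
    """Mirror the documented helper: true for scanned and hybrid pages."""
    return page_type in (PageType.SCANNED_PAGE, PageType.HYBRID_PAGE)
```

Under these assumed thresholds, a page with 400 characters of native text and an image covering 60% of its area is classified as hybrid, so needs_ocr() returns True for it.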

How It Works

  1. PDF Oxide renders the page to an image internally (at 300 DPI)
  2. The image is resized according to the detection strategy (MaxSide for v3/v4, MinSide for v5)
  3. DBNet++ text detector locates text regions as quadrilateral bounding boxes
  4. SVTR text recognizer reads characters from each detected region
  5. Results are assembled into text with reading-order sorting
  6. For hybrid pages, OCR text is merged with native text

The entire pipeline runs in-process via ONNX Runtime. No external binaries, no subprocess calls, no temporary files.
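
Step 5 (reading-order sorting) can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the library's algorithm: spans are (text, polygon) pairs with four (x, y) corners in image coordinates (y grows downward), and two spans share a line when their top edges differ by at most half the height of the line's first span. The rect helper and the 0.5 tolerance are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Span = Tuple[str, List[Point]]  # (recognized text, four corner points)


def rect(x0: float, y0: float, x1: float, y1: float) -> List[Point]:
    """Build an axis-aligned quadrilateral (helper for the example only)."""
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]


def _top(poly: List[Point]) -> float:
    return min(y for _, y in poly)


def _left(poly: List[Point]) -> float:
    return min(x for x, _ in poly)


def _height(poly: List[Point]) -> float:
    return max(y for _, y in poly) - _top(poly)


def reading_order(spans: List[Span]) -> str:
    """Group spans into lines top-to-bottom, then order each line left-to-right."""
    lines: List[List[Span]] = []
    for span in sorted(spans, key=lambda s: _top(s[1])):
        if lines:
            anchor = lines[-1][0][1]  # polygon of the first span on the current line
            if abs(_top(span[1]) - _top(anchor)) <= 0.5 * _height(anchor):
                lines[-1].append(span)
                continue
        lines.append([span])
    return "\n".join(
        " ".join(text for text, _ in sorted(line, key=lambda s: _left(s[1])))
        for line in lines
    )
```

Grouping by line before sorting by x matters because detected boxes on the same text line rarely share an identical top coordinate; a plain sort on (y, x) would interleave them.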


OCR Configuration

Python

from pdf_oxide import OcrConfig, OcrEngine

# Default (v3/v4)
config = OcrConfig()

# PP-OCRv5 server models
config = OcrConfig(use_v5=True)

# Custom thresholds
config = OcrConfig(
    det_threshold=0.5,    # Detection confidence (0.0-1.0)
    box_threshold=0.7,    # Box confidence (0.0-1.0)
    rec_threshold=0.6,    # Recognition confidence (0.0-1.0)
    num_threads=8,        # ONNX Runtime threads
    max_candidates=500,   # Max text regions
)

# v5 with custom thresholds
config = OcrConfig(use_v5=True, det_threshold=0.4, num_threads=8)

engine = OcrEngine("det.onnx", "rec.onnx", "dict.txt", config)

Rust

use pdf_oxide::ocr::{OcrConfig, OcrConfigBuilder, DetResizeStrategy};

// Default (v3/v4): MaxSide { max_side: 960 }
let config = OcrConfig::default();

// PP-OCRv5: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();

// Custom builder
let config = OcrConfig::builder()
    .det_threshold(0.5)
    .box_threshold(0.7)
    .rec_threshold(0.6)
    .num_threads(8)
    .max_candidates(500)
    .detect_styles(true)        // Enable style detection from OCR geometry
    .build();

// Custom resize strategy
let config = OcrConfig::builder()
    .det_resize_strategy(DetResizeStrategy::MinSide {
        min_side: 128,
        max_side_limit: 6000,
    })
    .build();

DetResizeStrategy (Rust)

Controls how input images are resized before the detection model runs.

| Variant | Fields | Description |
| --- | --- | --- |
| MaxSide | max_side: u32 (default: 960) | Scale DOWN so the longest side fits within max_side. Default for PP-OCRv3/v4. |
| MinSide | min_side: u32 (default: 64), max_side_limit: u32 (default: 4000) | Scale UP so the shortest side is at least min_side, cap at max_side_limit. Default for PP-OCRv5. |
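
To make the two strategies concrete, here is a sketch of the size arithmetic they describe. It is an illustration of the rules as stated in the table, not the library's code: the function names are hypothetical, and the real implementation may round differently (detection models often require dimensions that are multiples of a fixed stride).

```python
from typing import Tuple


def max_side_size(width: int, height: int, max_side: int = 960) -> Tuple[int, int]:
    """MaxSide: shrink so the longest side fits within max_side; never enlarge."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)


def min_side_size(
    width: int, height: int, min_side: int = 64, max_side_limit: int = 4000
) -> Tuple[int, int]:
    """MinSide: enlarge so the shortest side reaches min_side, then cap the longest side."""
    scale = max(1.0, min_side / min(width, height))
    if max(width, height) * scale > max_side_limit:
        # The cap wins, so the shortest side may end up below min_side.
        scale = max_side_limit / max(width, height)
    return round(width * scale), round(height * scale)
```

For a 2480×3508 pixel page image, max_side_size returns (679, 960), while min_side_size leaves it at (2480, 3508). This is why the v5 configuration feeds the detector far more pixels.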

OcrConfig Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| det_threshold | f32 | 0.3 | Detection probability threshold |
| box_threshold | f32 | 0.6 | Box confidence threshold |
| rec_threshold | f32 | 0.5 | Recognition confidence threshold |
| det_max_side | u32 | 960 | Max image dimension (v3/v4 compat) |
| det_resize_strategy | DetResizeStrategy | MaxSide { 960 } | Image resize strategy |
| rec_target_height | u32 | 48 | Target height for recognition crops |
| num_threads | usize | 4 | ONNX Runtime inference threads |
| unclip_ratio | f32 | 1.5 | Box expansion ratio |
| max_candidates | usize | 1000 | Maximum text regions to detect |
| detect_styles | bool | true | Detect font styles from OCR geometry |
| det_model_path | Option<PathBuf> | None | Custom detection model path |
| rec_model_path | Option<PathBuf> | None | Custom recognition model path |
| dict_path | Option<PathBuf> | None | Custom character dictionary path |
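
The effect of unclip_ratio is easiest to see with numbers. In DB-style detection post-processing, a detected region is typically grown outward by a distance of area × unclip_ratio / perimeter, because the detector predicts a shrunken text kernel. The sketch below applies that formula to an axis-aligned rectangle; treating the box as a rectangle, and assuming PDF Oxide uses the same formula, are both assumptions made for illustration.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def unclip_rect(box: Rect, unclip_ratio: float = 1.5) -> Rect:
    """Expand a rectangle outward by area * unclip_ratio / perimeter on every side."""
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0
    area = width * height
    perimeter = 2 * (width + height)
    distance = area * unclip_ratio / perimeter
    return (x0 - distance, y0 - distance, x1 + distance, y1 + distance)
```

With the default ratio of 1.5, a 100×20 text box grows by 12.5 pixels on each side. A larger ratio helps when characters are clipped at the edges; a smaller one helps when neighboring lines bleed into each other.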

Custom Models

Use your own ONNX models instead of the bundled ones:

Rust

use pdf_oxide::ocr::OcrConfig;

let config = OcrConfig::builder()
    .det_model_path("models/custom_det.onnx")
    .rec_model_path("models/custom_rec.onnx")
    .dict_path("models/custom_dict.txt")
    .build();

Style Detection

When detect_styles is enabled (default), PDF Oxide infers font styles (bold, heading-level) from OCR geometry — text size, spacing, and position. This improves Markdown conversion output from scanned pages.

let config = OcrConfig::builder()
    .detect_styles(true)    // Infer styles from text geometry
    .build();

OCR vs Tesseract

| Feature | PDF Oxide OCR | Tesseract (via PyMuPDF) |
| --- | --- | --- |
| Installation | pip install pdf_oxide | System package + pytesseract |
| System dependencies | None | Tesseract binary required |
| Runtime | ONNX (in-process) | Subprocess call |
| Model versions | PP-OCRv3, v4, v5 | Tesseract LSTM |
| Languages | Multi-language | Requires language packs |
| Setup complexity | Zero | Moderate |
| Detection model | DBNet++ | Tesseract internal |
| Recognition model | SVTR / SVTR-v5 | Tesseract LSTM |
| High-res support | MinSide strategy (v5) | DPI setting |
| Page type detection | Automatic (native/scanned/hybrid) | Manual |

Custom DPI

Control rendering resolution when converting PDF pages to images for OCR:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")

# extract_text_ocr() renders at the default 300 DPI, a good balance of accuracy and speed
text = doc.extract_text_ocr(0)

# No DPI argument is shown for the Python call; to change DPI, use OcrExtractOptions in Rust (below)

Rust

use pdf_oxide::ocr::OcrExtractOptions;

// 300 DPI (the default): more accurate but slower than lower settings
let options = OcrExtractOptions::default().with_dpi(300.0);

// Lower DPI = faster but less accurate
let options = OcrExtractOptions::default().with_dpi(150.0);
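
The cost of a DPI setting follows from simple arithmetic: PDF page sizes are expressed in points (1/72 inch by default), so the rendered image has points / 72 × DPI pixels per side. The helper below is a hypothetical illustration of that calculation and is not part of the PDF Oxide API.

```python
from typing import Tuple


def page_pixels(width_pt: float, height_pt: float, dpi: float = 300.0) -> Tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (1 pt = 1/72 inch)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)
```

A US Letter page (612×792 pt) renders to 2550×3300 pixels at 300 DPI and 1275×1650 at 150 DPI. Halving the DPI therefore quarters the pixel count, which is where most of the speed difference comes from.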

OCR Output Structure (Rust)

The OcrEngine::ocr_image() method returns detailed results with per-span confidence scores:

use pdf_oxide::ocr::OcrEngine;

let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", Default::default())?;
let output = engine.ocr_image(&image)?;

// Full text in reading order
println!("{}", output.text_in_reading_order());

// Per-span details
for span in &output.spans {
    println!("Text: '{}' (confidence: {:.2})", span.text, span.confidence);
    println!("  Bounding box: {:?}", span.bounding_rect());
    println!("  Per-char confidence: {:?}", span.char_confidences);
}

// Overall confidence
println!("Total confidence: {:.2}", output.total_confidence);

OcrOutput Fields

| Field / Method | Type | Description |
| --- | --- | --- |
| spans | Vec<OcrSpan> | All recognized text regions |
| total_confidence | f32 | Average confidence across all spans |
| text() | String | All text concatenated with spaces |
| text_in_reading_order() | String | Text sorted by position (top-to-bottom, left-to-right) |

OcrSpan Fields

| Field | Type | Description |
| --- | --- | --- |
| text | String | Recognized text |
| polygon | [[f32; 2]; 4] | Quadrilateral bounding box (4 corners) |
| confidence | f32 | Overall confidence (0.0–1.0) |
| char_confidences | Vec<f32> | Per-character confidence scores |
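
Confidence scores are mainly useful for filtering and review. The sketch below shows one way to use them, with a plain Python dataclass standing in for OcrSpan; the Span class, the function names, and the 0.5 thresholds are hypothetical, and it assumes one confidence value per character of text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    """Stand-in for an OCR span: text plus span-level and per-character confidence."""
    text: str
    confidence: float
    char_confidences: List[float]


def mean_confidence(spans: List[Span]) -> float:
    """Average span confidence, the quantity described for total_confidence."""
    return sum(s.confidence for s in spans) / len(spans) if spans else 0.0


def keep_confident(spans: List[Span], min_confidence: float = 0.5) -> List[Span]:
    """Drop spans whose overall confidence is below the threshold."""
    return [s for s in spans if s.confidence >= min_confidence]


def suspicious_chars(span: Span, threshold: float = 0.5) -> List[Tuple[int, str, float]]:
    """List (index, character, confidence) for characters worth a manual check."""
    return [
        (i, ch, conf)
        for i, (ch, conf) in enumerate(zip(span.text, span.char_confidences))
        if conf < threshold
    ]
```

Per-character scores are handy for spotting likely substitutions (for example a 0 read in place of an o) in spans that are otherwise recognized well.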

Batch OCR Processing

Process a directory of scanned PDFs:

Python

from pdf_oxide import PdfDocument, PdfError
from pathlib import Path

pdf_dir = Path("scans/")
output_dir = Path("text-output/")
output_dir.mkdir(exist_ok=True)

for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        pages = []
        for i in range(doc.page_count()):
            text = doc.extract_text(i)
            if len(text.strip()) < 50:
                text = doc.extract_text_ocr(i)
            pages.append(text)

        out_path = output_dir / pdf_path.with_suffix(".txt").name
        out_path.write_text("\n\n".join(pages), encoding="utf-8")
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr, needs_ocr};
use std::fs;
use std::path::Path;

let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;
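let options = OcrExtractOptions::default();

// Create the output directory up front; fs::write fails if it does not exist
fs::create_dir_all("text-output/")?;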

for entry in fs::read_dir("scans/")? {
    let path = entry?.path();
    if path.extension().map_or(false, |e| e == "pdf") {
        let mut doc = PdfDocument::open(path.to_str().unwrap())?;
        let mut all_text = String::new();
        for i in 0..doc.page_count() {
            let text = if needs_ocr(&mut doc, i)? {
                extract_text_with_ocr(&mut doc, i, Some(&engine), options.clone())?
            } else {
                doc.extract_text(i)?
            };
            all_text.push_str(&text);
            all_text.push_str("\n\n");
        }
        // Swap only the final ".pdf" so names like "report.v2.pdf" become "report.v2.txt"
        let out_path = Path::new("text-output/")
            .join(path.with_extension("txt").file_name().unwrap());
        fs::write(out_path, &all_text)?;
    }
}

Parallel OCR (Python)

from pdf_oxide import PdfDocument
from multiprocessing import Pool
from pathlib import Path

def ocr_pdf(pdf_path: str) -> dict:
    doc = PdfDocument(pdf_path)
    text = ""
    for i in range(doc.page_count()):
        text += doc.extract_text_ocr(i) + "\n"
    return {"file": pdf_path, "text": text}

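# The __main__ guard is required on platforms that spawn worker processes (Windows, macOS)
if __name__ == "__main__":
    pdf_files = [str(p) for p in Path("scans/").glob("*.pdf")]

    with Pool(4) as pool:
        results = pool.map(ocr_pdf, pdf_files)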

OCR to Markdown

Convert scanned pages to Markdown:

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("scanned-report.pdf")

for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if len(md.strip()) < 50:
        # Scanned page — OCR then format
        text = doc.extract_text_ocr(i)
        md = text  # OCR output is plain text
    print(f"--- Page {i + 1} ---")
    print(md)

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, needs_ocr, extract_text_with_ocr};

let mut doc = PdfDocument::open("scanned-report.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;

for i in 0..doc.page_count() {
    let text = if needs_ocr(&mut doc, i)? {
        extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?
    } else {
        doc.to_markdown(i, &Default::default())?
    };
    println!("--- Page {} ---\n{}", i + 1, text);
}

Performance Considerations

OCR is significantly slower than text extraction:

| Operation | Typical Speed |
| --- | --- |
| Text extraction | 0.8ms per page |
| OCR (v3/v4) | 200–1,000ms per page |
| OCR (v5 server) | 500–2,000ms per page |

OCR speed depends on page complexity, image resolution, text density, and model version. PP-OCRv5 is slower but more accurate. For large batches, consider parallel processing (see Batch OCR Processing above).
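
For capacity planning, the per-page ranges above translate into rough batch estimates. This is a back-of-the-envelope sketch with a hypothetical helper: it assumes ideal linear scaling across worker processes and ignores model loading and page rendering overhead, so real runs will be slower.

```python
from typing import Tuple


def estimate_ocr_seconds(
    pages: int, ms_low: float, ms_high: float, workers: int = 1
) -> Tuple[float, float]:
    """Best-case and worst-case wall-clock seconds for OCR over `pages` pages."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return pages * ms_low / 1000 / workers, pages * ms_high / 1000 / workers
```

For 1,000 scanned pages with v3/v4 models (200 to 1,000 ms per page) and 4 workers, the estimate is 50 to 250 seconds. The same batch with v5 server models on a single worker comes to 500 to 2,000 seconds.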


Load Models from Bytes (Rust)

use pdf_oxide::ocr::{OcrEngine, OcrConfig};

let det_bytes = std::fs::read("models/det.onnx")?;
let rec_bytes = std::fs::read("models/rec.onnx")?;
let dict = std::fs::read_to_string("models/dict.txt")?;

let engine = OcrEngine::from_bytes(&det_bytes, &rec_bytes, &dict, OcrConfig::default())?;