OCR PDF — Extract Text from Scanned PDFs with PDF Oxide
Extract text from scanned PDFs with built-in OCR. As of v0.3.27, OCR is exposed to all language bindings — Python, Node.js, Go, C#, and Rust — through a unified FFI layer (pdf_ocr_engine_create, pdf_ocr_page_needs_ocr, pdf_ocr_extract_text).
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(0)
print(text)
Node.js
const { PdfDocument, OcrEngine } = require("pdf-oxide");
const doc = new PdfDocument("scanned.pdf");
const ocr = new OcrEngine();
if (ocr.pageNeedsOcr(doc, 0)) {
  console.log(ocr.extractText(doc, 0));
}
ocr.close();
doc.close();
Go
import pdfoxide "github.com/yfedoseev/pdf_oxide/go"
doc, _ := pdfoxide.Open("scanned.pdf")
defer doc.Close()
ocr, _ := pdfoxide.NewOcrEngine()
defer ocr.Close()
if ocr.NeedsOcr(doc, 0) {
	text, _ := ocr.ExtractTextWithOcr(doc, 0)
	fmt.Println(text)
}
C#
using PdfOxide.Core;
using PdfOxide.Ocr;
using var doc = PdfDocument.Open("scanned.pdf");
using var ocr = new OcrEngine();
if (ocr.PageNeedsOcr(doc, 0))
{
    Console.WriteLine(ocr.ExtractText(doc, 0));
}
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};
let mut doc = PdfDocument::open("scanned.pdf")?;
let config = OcrConfig::default();
let engine = OcrEngine::new("models/det.onnx", "models/rec.onnx", "models/dict.txt", config)?;
let options = OcrExtractOptions::default();
let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), options)?;
println!("{text}");
PDF Oxide includes PaddleOCR via ONNX Runtime — no Tesseract installation, no system dependencies, no subprocess calls. The OCR engine runs directly in the process. Supports PP-OCRv3, PP-OCRv4, and PP-OCRv5 model families.
Note: OCR is not available in WebAssembly (it requires native ONNX Runtime). For Go / Node.js / C# / Rust, build with the ocr feature. Python wheels ship with OCR enabled by default.
Python PDF OCR Without Tesseract
Most Python PDF OCR solutions require installing Tesseract as a system dependency — a complex setup that varies across operating systems and CI environments. PDF Oxide includes PaddleOCR models directly in the Python wheel:
- No system dependencies: pip install pdf_oxide is all you need
- No subprocess calls: OCR runs natively via ONNX Runtime
- Three model families: PP-OCRv3, PP-OCRv4, and PP-OCRv5
- Automatic page detection: identifies which pages are scanned vs text-based
Comparison: PDF Oxide OCR vs PyMuPDF + Tesseract
| | PDF Oxide | PyMuPDF + Tesseract |
|---|---|---|
| Install | pip install pdf_oxide | pip install pymupdf + system Tesseract |
| OCR engine | PaddleOCR (ONNX) | Tesseract (subprocess) |
| Setup complexity | One line | OS-specific Tesseract install |
| CI/Docker | No extra config | Requires apt-get install tesseract-ocr |
| Models included | Yes (in wheel) | No (separate download) |
Installation
Python
pip install pdf_oxide
The OCR models are included in the wheel. No additional downloads required.
Rust
[dependencies]
pdf_oxide = { version = "0.3", features = ["ocr"] }
Go
go build -tags ocr ./...
Node.js
npm install pdf-oxide --build-from-source -- --features ocr
C#
The NuGet package ships with OCR enabled in the default Linux / macOS / Windows binaries — no extra configuration needed.
When to Use OCR
Most PDFs contain embedded text that extract_text() handles at 0.8ms per page. OCR is only needed for:
- Scanned documents — paper documents scanned to PDF
- Image-only PDFs — PDFs created from photos or screenshots
- PDFs with text as images — some generators rasterize text
- Hybrid pages — pages with both native text and scanned image regions
PP-OCR Model Versions
PDF Oxide supports three generations of PaddleOCR models. The default configuration works with PP-OCRv3 and PP-OCRv4. PP-OCRv5 server models require a different resize strategy.
PP-OCRv3 / PP-OCRv4 (Default)
Mobile-optimized models that scale images down to fit within a maximum side length. Good for most documents.
- Detection model: DBNet++ (lightweight)
- Recognition model: SVTR
- Resize strategy: MaxSide, which scales the longest side down to 960px
- Best for: standard documents, mobile/edge deployment
Python
from pdf_oxide import OcrConfig, OcrEngine
# Default config works with v3/v4 models
config = OcrConfig()
engine = OcrEngine("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)
Rust
use pdf_oxide::ocr::{OcrConfig, OcrEngine};
// Default config: MaxSide { max_side: 960 }
let config = OcrConfig::default();
let engine = OcrEngine::new("det_v4.onnx", "rec_v4.onnx", "dict.txt", config)?;
PP-OCRv5 (Server)
Server-grade models that preserve high resolution by scaling images up when needed. Significantly more accurate on dense or fine-print documents.
- Detection model: DBNet++ (server, larger)
- Recognition model: SVTR-v5
- Resize strategy: MinSide, which ensures the shortest side is at least 64px and caps at 4000px
- Best for: high-accuracy extraction, server environments, dense text
Python
from pdf_oxide import OcrConfig, OcrEngine
# v5 config: high-resolution input for server models
config = OcrConfig(use_v5=True)
engine = OcrEngine("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)
Rust
use pdf_oxide::ocr::{OcrConfig, OcrEngine};
// v5 config: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();
let engine = OcrEngine::new("det_v5.onnx", "rec_v5.onnx", "dict_v5.txt", config)?;
Model Comparison
| Feature | PP-OCRv3/v4 | PP-OCRv5 |
|---|---|---|
| Resize strategy | MaxSide (scale down to 960px) | MinSide (scale up, cap at 4000px) |
| Input resolution | Lower (faster) | Higher (more accurate) |
| Detection model size | ~3 MB | ~12 MB |
| Recognition model size | ~12 MB | ~25 MB |
| Best for | Mobile, edge, standard docs | Server, dense text, fine print |
| OcrConfig | OcrConfig() / OcrConfig::default() | OcrConfig(use_v5=True) / OcrConfig::v5() |
Page Type Detection
PDF Oxide automatically classifies pages to determine whether OCR is needed. The extract_text_ocr() function handles this internally, but you can also detect page types manually.
Auto-Detect Scanned Pages
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("mixed.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    if len(text.strip()) < 50:
        # Likely scanned: use OCR
        text = doc.extract_text_ocr(i)
        print(f"Page {i + 1} (OCR): {text[:100]}...")
    else:
        print(f"Page {i + 1} (text): {text[:100]}...")
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{detect_page_type, PageType, OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr};
let mut doc = PdfDocument::open("mixed.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;
// Preview by characters, not bytes: slicing a &str at a byte offset panics
// when the offset falls inside a multi-byte UTF-8 character.
let preview = |text: &str| text.chars().take(100).collect::<String>();
for i in 0..doc.page_count() {
    let page_type = detect_page_type(&mut doc, i)?;
    match page_type {
        PageType::NativeText => {
            let text = doc.extract_text(i)?;
            println!("Page {} (native): {}...", i + 1, preview(&text));
        }
        PageType::ScannedPage => {
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            println!("Page {} (OCR): {}...", i + 1, preview(&text));
        }
        PageType::HybridPage => {
            // Has both native text and scanned images: merges both sources
            let text = extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?;
            println!("Page {} (hybrid): {}...", i + 1, preview(&text));
        }
    }
}
PageType Variants (Rust)
| Variant | Description |
|---|---|
| NativeText | Page has embedded text; no OCR needed |
| ScannedPage | Page is fully scanned (large image, no/minimal text); full OCR |
| HybridPage | Page has both native text and large scanned images; merges native text with OCR results |
The needs_ocr() helper returns true for both ScannedPage and HybridPage:
use pdf_oxide::ocr::needs_ocr;
if needs_ocr(&mut doc, 0)? {
    let text = extract_text_with_ocr(&mut doc, 0, Some(&engine), OcrExtractOptions::default())?;
}
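The length heuristic from the Python example above can be factored into a small helper. The sketch below is an illustration, not part of PDF Oxide: extract_page and FakeDoc are hypothetical names, the helper only assumes an object exposing the extract_text(i) and extract_text_ocr(i) methods shown earlier, and the 50-character cutoff is the same rough threshold, not a library constant.

```python
def extract_page(doc, index, min_chars=50):
    """Return (text, used_ocr), falling back to OCR when native text is sparse."""
    text = doc.extract_text(index)
    if len(text.strip()) >= min_chars:
        return text, False
    return doc.extract_text_ocr(index), True


class FakeDoc:
    """Hypothetical stand-in for PdfDocument so the helper runs without a PDF."""

    def __init__(self, native, ocr):
        self.native = native
        self.ocr = ocr

    def extract_text(self, index):
        return self.native[index]

    def extract_text_ocr(self, index):
        return self.ocr[index]


# Page 1 has plenty of native text; page 2 has only whitespace.
doc = FakeDoc(
    native=["A" * 80, "  \n "],
    ocr=["", "Text recovered by OCR"],
)
results = [extract_page(doc, i) for i in range(2)]
for i, (text, used_ocr) in enumerate(results):
    print(f"Page {i + 1}: ocr={used_ocr}, chars={len(text)}")
```

A plain character count will miss hybrid pages that carry some native text next to a large scanned image, which is the case the Rust PageType::HybridPage variant is meant to cover.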
How It Works
- PDF Oxide renders the page to an image internally (at 300 DPI)
- The image is resized according to the detection strategy (MaxSide for v3/v4, MinSide for v5)
- DBNet++ text detector locates text regions as quadrilateral bounding boxes
- SVTR text recognizer reads characters from each detected region
- Results are assembled into text with reading-order sorting
- For hybrid pages, OCR text is merged with native text
The entire pipeline runs in-process via ONNX Runtime. No external binaries, no subprocess calls, no temporary files.
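The reading-order step can be pictured with a small standalone sketch. It is not PDF Oxide's implementation. It assumes each detected region has been reduced to an axis-aligned box with y growing downward, groups boxes into a line when their tops differ by at most half a box height, and sorts each line left to right. Box, reading_order, and the tolerance value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    text: str
    x: float       # left edge
    y: float       # top edge (y grows downward)
    height: float


def reading_order(boxes, line_tolerance=0.5):
    """Group boxes into lines by vertical position, then sort each line by x."""
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        first = lines[-1][0] if lines else None
        if first is not None and abs(box.y - first.y) <= line_tolerance * first.height:
            lines[-1].append(box)   # same line as the previous group
        else:
            lines.append([box])     # start a new line
    ordered = []
    for line in lines:
        ordered.extend(sorted(line, key=lambda b: b.x))
    return ordered


boxes = [
    Box("world", x=120, y=11, height=20),
    Box("Second", x=10, y=50, height=20),
    Box("Hello", x=10, y=10, height=20),
    Box("line", x=90, y=52, height=20),
]
ordered_text = " ".join(b.text for b in reading_order(boxes))
print(ordered_text)
```

The line tolerance matters because detected boxes on the same printed line rarely share an identical top coordinate; a strict sort by y alone would interleave words from neighboring columns.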
OCR Configuration
Python
from pdf_oxide import OcrConfig, OcrEngine
# Default (v3/v4)
config = OcrConfig()
# PP-OCRv5 server models
config = OcrConfig(use_v5=True)
# Custom thresholds
config = OcrConfig(
    det_threshold=0.5,   # Detection confidence (0.0-1.0)
    box_threshold=0.7,   # Box confidence (0.0-1.0)
    rec_threshold=0.6,   # Recognition confidence (0.0-1.0)
    num_threads=8,       # ONNX Runtime threads
    max_candidates=500,  # Max text regions
)
# v5 with custom thresholds
config = OcrConfig(use_v5=True, det_threshold=0.4, num_threads=8)
engine = OcrEngine("det.onnx", "rec.onnx", "dict.txt", config)
Rust
use pdf_oxide::ocr::{OcrConfig, OcrConfigBuilder, DetResizeStrategy};
// Default (v3/v4): MaxSide { max_side: 960 }
let config = OcrConfig::default();
// PP-OCRv5: MinSide { min_side: 64, max_side_limit: 4000 }
let config = OcrConfig::v5();
// Custom builder
let config = OcrConfig::builder()
    .det_threshold(0.5)
    .box_threshold(0.7)
    .rec_threshold(0.6)
    .num_threads(8)
    .max_candidates(500)
    .detect_styles(true) // Enable style detection from OCR geometry
    .build();
// Custom resize strategy
let config = OcrConfig::builder()
    .det_resize_strategy(DetResizeStrategy::MinSide {
        min_side: 128,
        max_side_limit: 6000,
    })
    .build();
DetResizeStrategy (Rust)
Controls how input images are resized before the detection model runs.
| Variant | Fields | Description |
|---|---|---|
| MaxSide | max_side: u32 (default: 960) | Scale DOWN so the longest side fits within max_side. Default for PP-OCRv3/v4. |
| MinSide | min_side: u32 (default: 64), max_side_limit: u32 (default: 4000) | Scale UP so the shortest side is at least min_side, cap at max_side_limit. Default for PP-OCRv5. |
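To make the two strategies concrete, here is a minimal sketch of the arithmetic the table describes. It is an illustration only, not the library's code: the function names are hypothetical, the cap is assumed to apply to the longest side, and any further rounding the detector may require (for example to a fixed multiple) is ignored.

```python
def resize_max_side(width, height, max_side=960):
    """Scale down (never up) so the longest side fits within max_side."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)


def resize_min_side(width, height, min_side=64, max_side_limit=4000):
    """Scale up so the shortest side reaches min_side, then cap the longest side."""
    scale = max(1.0, min_side / min(width, height))
    if max(width, height) * scale > max_side_limit:
        # Assumption: the cap wins over the minimum when both cannot hold.
        scale = max_side_limit / max(width, height)
    return round(width * scale), round(height * scale)


a4_at_300_dpi = (2480, 3508)
print("MaxSide:", resize_max_side(*a4_at_300_dpi))
print("MinSide:", resize_min_side(*a4_at_300_dpi))
print("MinSide on a thin strip:", resize_min_side(40, 500))
```

Under these assumptions a full page rendered at 300 DPI is shrunk heavily by MaxSide but passes through MinSide unchanged, which matches the "lower vs higher input resolution" trade-off described for the two model families.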
OcrConfig Fields
| Field | Type | Default | Description |
|---|---|---|---|
| det_threshold | f32 | 0.3 | Detection probability threshold |
| box_threshold | f32 | 0.6 | Box confidence threshold |
| rec_threshold | f32 | 0.5 | Recognition confidence threshold |
| det_max_side | u32 | 960 | Max image dimension (v3/v4 compat) |
| det_resize_strategy | DetResizeStrategy | MaxSide { 960 } | Image resize strategy |
| rec_target_height | u32 | 48 | Target height for recognition crops |
| num_threads | usize | 4 | ONNX Runtime inference threads |
| unclip_ratio | f32 | 1.5 | Box expansion ratio |
| max_candidates | usize | 1000 | Maximum text regions to detect |
| detect_styles | bool | true | Detect font styles from OCR geometry |
| det_model_path | Option<PathBuf> | None | Custom detection model path |
| rec_model_path | Option<PathBuf> | None | Custom recognition model path |
| dict_path | Option<PathBuf> | None | Custom character dictionary path |
Custom Models
Use your own ONNX models instead of the bundled ones:
Rust
use pdf_oxide::ocr::OcrConfig;
let config = OcrConfig::builder()
    .det_model_path("models/custom_det.onnx")
    .rec_model_path("models/custom_rec.onnx")
    .dict_path("models/custom_dict.txt")
    .build();
Style Detection
When detect_styles is enabled (default), PDF Oxide infers font styles (bold, heading-level) from OCR geometry — text size, spacing, and position. This improves Markdown conversion output from scanned pages.
let config = OcrConfig::builder()
    .detect_styles(true) // Infer styles from text geometry
    .build();
OCR vs Tesseract
| Feature | PDF Oxide OCR | Tesseract (via PyMuPDF) |
|---|---|---|
| Installation | pip install pdf_oxide | System package + pytesseract |
| System dependencies | None | Tesseract binary required |
| Runtime | ONNX (in-process) | Subprocess call |
| Model versions | PP-OCRv3, v4, v5 | Tesseract LSTM |
| Languages | Multi-language | Requires language packs |
| Setup complexity | Zero | Moderate |
| Detection model | DBNet++ | Tesseract internal |
| Recognition model | SVTR / SVTR-v5 | Tesseract LSTM |
| High-res support | MinSide strategy (v5) | DPI setting |
| Page type detection | Automatic (native/scanned/hybrid) | Manual |
Custom DPI
Control rendering resolution when converting PDF pages to images for OCR:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("scanned.pdf")
# Default is 300 DPI, a good balance of accuracy and speed
text = doc.extract_text_ocr(0)
# Higher DPI helps on fine print; the DPI itself is configured
# via OcrExtractOptions in Rust, so this call still uses the default
text = doc.extract_text_ocr(0)
Rust
use pdf_oxide::ocr::OcrExtractOptions;
// 300 DPI (the default): better accuracy but slower than lower settings
let options = OcrExtractOptions::default().with_dpi(300.0);
// Lower DPI = faster but less accurate
let options = OcrExtractOptions::default().with_dpi(150.0);
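The cost of a DPI setting is easiest to see as pixel counts. The sketch below is plain arithmetic, not a PDF Oxide call: it assumes the default PDF user-space unit of 1/72 inch and a US Letter page of 612 x 792 points, and rendered_size is a hypothetical helper name.

```python
POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def rendered_size(width_pt, height_pt, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )


letter = (612, 792)  # US Letter in points
sizes = {dpi: rendered_size(*letter, dpi) for dpi in (150, 300)}
for dpi, (w, h) in sizes.items():
    print(f"{dpi} DPI -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Halving the DPI quarters the number of pixels the renderer has to produce, which is why lower settings are faster but leave fewer pixels per character for small print.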
OCR Output Structure (Rust)
The OcrEngine::ocr_image() method returns detailed results with per-span confidence scores:
use pdf_oxide::ocr::OcrEngine;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", Default::default())?;
let output = engine.ocr_image(&image)?;
// Full text in reading order
println!("{}", output.text_in_reading_order());
// Per-span details
for span in &output.spans {
    println!("Text: '{}' (confidence: {:.2})", span.text, span.confidence);
    println!("  Bounding box: {:?}", span.bounding_rect());
    println!("  Per-char confidence: {:?}", span.char_confidences);
}
// Overall confidence
println!("Total confidence: {:.2}", output.total_confidence);
OcrOutput Fields
| Field / Method | Type | Description |
|---|---|---|
| spans | Vec<OcrSpan> | All recognized text regions |
| total_confidence | f32 | Average confidence across all spans |
| text() | String | All text concatenated with spaces |
| text_in_reading_order() | String | Text sorted by position (top-to-bottom, left-to-right) |
OcrSpan Fields
| Field | Type | Description |
|---|---|---|
| text | String | Recognized text |
| polygon | [[f32; 2]; 4] | Quadrilateral bounding box (4 corners) |
| confidence | f32 | Overall confidence (0.0–1.0) |
| char_confidences | Vec<f32> | Per-character confidence scores |
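As an illustration of how these fields are typically consumed, the sketch below mirrors the span shape in plain Python. It does not use the PDF Oxide bindings: Span, bounding_rect, and filter_spans are hypothetical stand-ins, and the (min_x, min_y, max_x, max_y) return shape is an assumption, since this page does not state what the Rust bounding_rect() returns.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Span:
    text: str
    polygon: list      # four [x, y] corners of the quadrilateral
    confidence: float  # overall confidence, 0.0 to 1.0


def bounding_rect(polygon):
    """Axis-aligned rectangle (min_x, min_y, max_x, max_y) enclosing the quad."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def filter_spans(spans, min_confidence=0.5):
    """Keep only spans at or above the confidence cutoff."""
    return [s for s in spans if s.confidence >= min_confidence]


spans = [
    Span("Invoice", [[10, 10], [110, 12], [110, 40], [10, 38]], 0.97),
    Span("#1O24", [[130, 11], [200, 12], [200, 40], [130, 39]], 0.42),
]
kept = filter_spans(spans)
average = mean(s.confidence for s in spans)
print([s.text for s in kept], bounding_rect(spans[0].polygon), round(average, 3))
```

Filtering on the per-span score is a common way to drop doubtful reads before the text is stored; the average over all spans plays the role that total_confidence plays in the table above.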
Batch OCR Processing
Process a directory of scanned PDFs:
Python
from pdf_oxide import PdfDocument, PdfError
from pathlib import Path
pdf_dir = Path("scans/")
output_dir = Path("text-output/")
output_dir.mkdir(exist_ok=True)
for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        pages = []
        for i in range(doc.page_count()):
            text = doc.extract_text(i)
            if len(text.strip()) < 50:
                text = doc.extract_text_ocr(i)
            pages.append(text)
        out_path = output_dir / pdf_path.with_suffix(".txt").name
        out_path.write_text("\n\n".join(pages), encoding="utf-8")
    except PdfError as e:
        print(f"Error: {pdf_path.name}: {e}")
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, extract_text_with_ocr, needs_ocr};
use std::fs;
use std::path::Path;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;
let options = OcrExtractOptions::default();
// Create the output directory up front; fs::write fails if it does not exist
fs::create_dir_all("text-output/")?;
for entry in fs::read_dir("scans/")? {
    let path = entry?.path();
    if path.extension().map_or(false, |e| e == "pdf") {
        let mut doc = PdfDocument::open(path.to_str().unwrap())?;
        let mut all_text = String::new();
        for i in 0..doc.page_count() {
            let text = if needs_ocr(&mut doc, i)? {
                extract_text_with_ocr(&mut doc, i, Some(&engine), options.clone())?
            } else {
                doc.extract_text(i)?
            };
            all_text.push_str(&text);
            all_text.push_str("\n\n");
        }
        let out_path = Path::new("text-output/")
            .join(path.file_stem().unwrap())
            .with_extension("txt");
        fs::write(out_path, &all_text)?;
    }
}
Parallel OCR (Python)
from pdf_oxide import PdfDocument
from multiprocessing import Pool
from pathlib import Path
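def ocr_pdf(pdf_path: str) -> dict:
    # Each worker opens its own document; only the path crosses the process boundary
    doc = PdfDocument(pdf_path)
    text = ""
    for i in range(doc.page_count()):
        text += doc.extract_text_ocr(i) + "\n"
    return {"file": pdf_path, "text": text}
# The guard is required where workers are started with "spawn" (Windows, macOS)
if __name__ == "__main__":
    pdf_files = [str(p) for p in Path("scans/").glob("*.pdf")]
    with Pool(4) as pool:
        results = pool.map(ocr_pdf, pdf_files)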
OCR to Markdown
Convert scanned pages to Markdown:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("scanned-report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i, detect_headings=True)
    if len(md.strip()) < 50:
        # Scanned page: OCR then format
        text = doc.extract_text_ocr(i)
        md = text  # OCR output is plain text
    print(f"--- Page {i + 1} ---")
    print(md)
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, needs_ocr, extract_text_with_ocr};
let mut doc = PdfDocument::open("scanned-report.pdf")?;
let engine = OcrEngine::new("det.onnx", "rec.onnx", "dict.txt", OcrConfig::default())?;
for i in 0..doc.page_count() {
    let text = if needs_ocr(&mut doc, i)? {
        extract_text_with_ocr(&mut doc, i, Some(&engine), OcrExtractOptions::default())?
    } else {
        doc.to_markdown(i, &Default::default())?
    };
    println!("--- Page {} ---\n{}", i + 1, text);
}
Performance Considerations
OCR is significantly slower than text extraction:
| Operation | Typical Speed |
|---|---|
| Text extraction | 0.8ms per page |
| OCR (v3/v4) | 200–1,000ms per page |
| OCR (v5 server) | 500–2,000ms per page |
OCR speed depends on page complexity, image resolution, text density, and model version. PP-OCRv5 is slower but more accurate. For large batches, consider parallel processing (see Batch OCR Processing above).
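For capacity planning, the figures in the table translate into rough batch times. The sketch below is back-of-the-envelope arithmetic only: estimate_seconds is a hypothetical helper, and it assumes throughput scales linearly with the number of worker processes, which real workloads will not fully achieve.

```python
def estimate_seconds(pages, ms_per_page, workers=1):
    """Idealized wall-clock time, assuming perfect scaling across workers."""
    return pages * ms_per_page / 1000 / workers


pages = 10_000
native = estimate_seconds(pages, 0.8)           # text extraction, single process
low = estimate_seconds(pages, 200, workers=4)   # OCR v3/v4, fast end of the range
high = estimate_seconds(pages, 1000, workers=4) # OCR v3/v4, slow end of the range
print(f"native text: {native:.0f} s")
print(f"OCR on 4 workers: {low:.0f} to {high:.0f} s")
```

The gap between the two results is the practical argument for checking whether a page needs OCR before running it.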
Load Models from Bytes (Rust)
use pdf_oxide::ocr::{OcrEngine, OcrConfig};
let det_bytes = std::fs::read("models/det.onnx")?;
let rec_bytes = std::fs::read("models/rec.onnx")?;
let dict = std::fs::read_to_string("models/dict.txt")?;
let engine = OcrEngine::from_bytes(&det_bytes, &rec_bytes, &dict, OcrConfig::default())?;
Related Pages
- Text Extraction — standard text extraction
- Markdown Conversion — Markdown with heading detection
- Page Rendering — render pages to images (used internally by OCR)
- Batch Processing — parallel processing patterns
- Extract Text from PDF — text extraction guide