Classify PDF Pages — Text vs Scanned
Before you extract, you usually want to know what kind of page you are looking at: does it have a usable native text layer, or is it a scan that needs OCR? PDF Oxide answers that with a cheap preflight — classify_page and classify_document inspect the PDF’s internals (glyph counts, image area, codec, invisible-text ratio, garbled-glyph ratio) without running OCR and without rasterising the page.
The classification is explainable: every verdict carries a confidence score, a typed reason code, and the raw signals that drove the decision — so you can route pages to the right extractor (native text vs OCR) and log why.
Binding coverage. Classification is exposed in Rust, Go, C#, Swift, and WASM/JavaScript. The Python and Node N-API bindings do not expose
classify_page/classify_documentin v0.3.69 — from those runtimes, use the auto-extraction path or bridge through the Rust core / CLI.
How do I classify a single PDF page?
classify_page takes a 0-based page index and returns a PageClassification — in the C-ABI bindings (Go, C#, Swift, WASM) it comes back as a JSON string you deserialize.
Rust
use pdf_oxide::PdfDocument;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("mixed.pdf")?;
// PdfDocument::classify_page(&self, page: usize)
// -> Result<pdf_oxide::extractors::auto::PageClassification>
let result = doc.classify_page(0)?;
println!("page {} is {:?} (confidence {:.2})",
result.page, result.kind, result.confidence);
println!("reason: {:?}", result.reason);
println!("glyphs={} image_area={:.2} garbled={:.2}",
result.signals.text_glyph_count,
result.signals.image_area_ratio,
result.signals.garbled_ratio);
Ok(())
}
Go
package main
import (
"encoding/json"
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("mixed.pdf")
if err != nil {
log.Fatal(err)
}
defer doc.Close()
// func (doc *PdfDocument) ClassifyPage(pageIndex int) (string, error)
raw, err := doc.ClassifyPage(0)
if err != nil {
log.Fatal(err)
}
var pc struct {
Page int `json:"page"`
Kind string `json:"kind"`
Confidence float64 `json:"confidence"`
Reason string `json:"reason"`
}
if err := json.Unmarshal([]byte(raw), &pc); err != nil {
log.Fatal(err)
}
fmt.Printf("page %d is %s (%.2f): %s\n", pc.Page, pc.Kind, pc.Confidence, pc.Reason)
}
C#
using System;
using System.Text.Json;
using PdfOxide.Core;
using var doc = PdfDocument.Open("mixed.pdf");
// string PdfDocument.ClassifyPage(int pageIndex)
string raw = doc.ClassifyPage(0);
using var json = JsonDocument.Parse(raw);
var root = json.RootElement;
Console.WriteLine(
$"page {root.GetProperty("page").GetInt32()} is " +
$"{root.GetProperty("kind").GetString()} " +
$"({root.GetProperty("confidence").GetDouble():F2}): " +
$"{root.GetProperty("reason").GetString()}");
Swift
import PdfOxide
let doc = try PdfDocument(path: "mixed.pdf")
// func classifyPage(_ pageIndex: Int) throws -> String (JSON)
let json = try doc.classifyPage(0)
print(json)
JavaScript (WASM)
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();
const bytes = new Uint8Array(await (await fetch("mixed.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);
// WasmPdfDocument.classifyPage(pageIndex) -> JSON string
const pc = JSON.parse(doc.classifyPage(0));
console.log(`page ${pc.page} is ${pc.kind} (${pc.confidence}): ${pc.reason}`);
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
// PageClass AutoExtractor.classifyPageKind(int pageIndex)
PageClass kind = auto.classifyPageKind(0);
System.out.println("page 0 is " + kind); // TEXT_LAYER / SCANNED / IMAGE_TEXT / MIXED / EMPTY
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('mixed.pdf');
$auto = AutoExtractor::of($doc);
// string AutoExtractor::classifyPageKind(int $pageIndex)
$kind = $auto->classifyPageKind(0);
echo "page 0 is {$kind}\n"; // text_layer / scanned / image_text / mixed / empty
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('mixed.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
# AutoExtractor#classify_page(page_index)
# => { reason:, kind:, confidence:, classification: }
pc = auto.classify_page(0)
puts "page 0 is #{pc[:kind]} (#{pc[:confidence]}): #{pc[:reason]}"
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("mixed.pdf");
// std::string Document::classify_page(int page_index) — JSON
std::string json = doc.classify_page(0);
std::cout << json << '\n';
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('mixed.pdf');
// String PdfDocument.classifyPage(int page) — JSON
final json = doc.classifyPage(0);
print(json);
}
R
library(pdfoxide)
doc <- pdf_open("mixed.pdf")
# pdf_classify_page(doc, page) — JSON PageClassification
json <- pdf_classify_page(doc, 0)
cat(json, "\n")
Julia
using PdfOxide
doc = open_document("mixed.pdf")
# classify_page(doc, page) -> JSON string
json = classify_page(doc, 0)
println(json)
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("mixed.pdf");
defer doc.deinit();
// classifyPage(alloc, page_index) -> caller-owned JSON bytes
const json = try doc.classifyPage(alloc, 0);
defer alloc.free(json);
std.debug.print("{s}\n", .{json});
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];
// -classifyPage:error: -> JSON NSString
NSString *json = [doc classifyPage:0 error:&err];
NSLog(@"%@", json);
Elixir
{:ok, doc} = PdfOxide.open("mixed.pdf")
# PdfOxide.classify_page(doc, page) -> JSON string
json = PdfOxide.classify_page(doc, 0)
IO.puts(json)
How do I classify a whole PDF document at once?
classify_document runs the same cheap preflight over every page and rolls it up: a per-page kind list, the 0-based pages_needing_ocr indices, and an aggregate summary. The decision is per-page — PDF Oxide never forces one document mode onto a mixed PDF.
Rust
use pdf_oxide::PdfDocument;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("report.pdf")?;
// PdfDocument::classify_document(&self)
// -> Result<pdf_oxide::extractors::auto::DocumentClassification>
let dc = doc.classify_document()?;
println!("summary: {:?}", dc.summary);
println!("pages needing OCR: {:?}", dc.pages_needing_ocr);
for (i, kind) in dc.pages.iter().enumerate() {
println!(" page {i}: {kind:?}");
}
Ok(())
}
Go
package main
import (
"encoding/json"
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("report.pdf")
if err != nil {
log.Fatal(err)
}
defer doc.Close()
// func (doc *PdfDocument) ClassifyDocument() (string, error)
raw, err := doc.ClassifyDocument()
if err != nil {
log.Fatal(err)
}
var dc struct {
Pages []string `json:"pages"`
PagesNeedingOCR []int `json:"pages_needing_ocr"`
Summary string `json:"summary"`
}
if err := json.Unmarshal([]byte(raw), &dc); err != nil {
log.Fatal(err)
}
fmt.Printf("summary=%s ocr_pages=%v\n", dc.Summary, dc.PagesNeedingOCR)
}
C#
using System;
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
// string PdfDocument.ClassifyDocument()
string raw = doc.ClassifyDocument();
Console.WriteLine(raw);
Swift
import PdfOxide
let doc = try PdfDocument(path: "report.pdf")
// func classifyDocument() throws -> String (JSON)
let json = try doc.classifyDocument()
print(json)
JavaScript (WASM)
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();
const bytes = new Uint8Array(await (await fetch("report.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);
// WasmPdfDocument.classifyDocument() -> JSON string
const dc = JSON.parse(doc.classifyDocument());
console.log("pages needing OCR:", dc.pages_needing_ocr);
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
// List<PageClass> AutoExtractor.classifyDocumentKinds()
List<PageClass> kinds = auto.classifyDocumentKinds();
for (int i = 0; i < kinds.size(); i++) {
System.out.println("page " + i + ": " + kinds.get(i));
}
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);
// array<int,string> AutoExtractor::classifyDocumentKinds()
$kinds = $auto->classifyDocumentKinds();
foreach ($kinds as $i => $kind) {
echo "page {$i}: {$kind}\n";
}
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
# AutoExtractor#classify_document => decoded JSON envelope
dc = auto.classify_document
puts "pages needing OCR: #{dc['pages_needing_ocr']}"
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("report.pdf");
// std::string Document::classify_document() — JSON
std::string json = doc.classify_document();
std::cout << json << '\n';
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('report.pdf');
// String PdfDocument.classifyDocument() — JSON
final json = doc.classifyDocument();
print(json);
}
R
library(pdfoxide)
doc <- pdf_open("report.pdf")
# pdf_classify_document(doc) — JSON DocumentClassification
json <- pdf_classify_document(doc)
cat(json, "\n")
Julia
using PdfOxide
doc = open_document("report.pdf")
# classify_document(doc) -> JSON string
json = classify_document(doc)
println(json)
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("report.pdf");
defer doc.deinit();
// classifyDocument(alloc) -> caller-owned JSON bytes
const json = try doc.classifyDocument(alloc);
defer alloc.free(json);
std.debug.print("{s}\n", .{json});
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
// -classifyDocumentWithError: -> JSON NSString
NSString *json = [doc classifyDocumentWithError:&err];
NSLog(@"%@", json);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
# PdfOxide.classify_document(doc) -> JSON string
json = PdfOxide.classify_document(doc)
IO.puts(json)
What does the classification JSON look like?
classify_page returns a PageClassification:
{
"page": 0,
"kind": "text_layer",
"confidence": 0.97,
"reason": "native_text_high_confidence",
"signals": {
"text_glyph_count": 1840,
"text_area_ratio": 0.62,
"image_area_ratio": 0.0,
"codec": "none",
"invisible_text_ratio": 0.0,
"garbled_ratio": 0.0,
"fragmented_word_ratio": 0.01,
"consecutive_repeat_ratio": 0.0,
"vector_path_density": 0.04,
"has_reliable_structure": true,
"producer_prior": "authoring",
"page_is_empty": false
}
}
classify_document returns a DocumentClassification:
{
"pages": ["text_layer", "scanned", "image_text"],
"pages_needing_ocr": [1],
"summary": "mixed"
}
Page kinds (kind)
| Kind | Meaning | Recommended route |
|---|---|---|
text_layer |
Usable native text dominates | Extract the text layer |
scanned |
Image-dominated, no/garbled text | OCR the page |
image_text |
Native text and text-bearing image regions | Hybrid: native + region OCR |
mixed |
Heterogeneous within the page (text + image-table/figure) | Auto-route per region |
empty |
Blank / near-empty — not an error | Skip |
Document summary (summary)
mostly_text, mostly_scanned, mixed, or empty.
Reason codes (reason)
The reason is a frozen, append-only snake_case token. Common values: ok, native_text_high_confidence, no_text_layer_present, text_layer_below_threshold, glyph_mapping_missing, encrypted_no_extract_permission, image_table_reconstructed, image_table_no_structure.
How do I route extraction based on classification?
The point of the cheap preflight is to avoid paying for OCR on pages that don’t need it. Classify first, then extract only the OCR pages with the heavier path:
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::PageKind;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("report.pdf")?;
let dc = doc.classify_document()?;
for (page, kind) in dc.pages.iter().enumerate() {
match kind {
PageKind::TextLayer => {
// Fast, free native path — no OCR cost.
let text = doc.extract_text(page)?;
println!("=== page {page} (native) ===\n{text}");
}
PageKind::Scanned | PageKind::ImageText | PageKind::Mixed => {
println!("=== page {page} needs OCR ===");
// route to your OCR / auto-extract pipeline here
}
PageKind::Empty => { /* skip */ }
}
}
Ok(())
}
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
List<PageClass> kinds = auto.classifyDocumentKinds();
for (int page = 0; page < kinds.size(); page++) {
switch (kinds.get(page)) {
case TEXT_LAYER -> {
// Fast, free native path — no OCR cost.
String text = doc.extractText(page);
System.out.println("=== page " + page + " (native) ===\n" + text);
}
case SCANNED, IMAGE_TEXT, MIXED ->
System.out.println("=== page " + page + " needs OCR ===");
case EMPTY -> { /* skip */ }
}
}
}
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
dc = auto.classify_document
dc['pages'].each_with_index do |kind, page|
case kind
when 'text_layer'
# Fast, free native path — no OCR cost.
text = doc.extract_text(page)
puts "=== page #{page} (native) ===\n#{text}"
when 'scanned', 'image_text', 'mixed'
puts "=== page #{page} needs OCR ==="
when 'empty'
# skip
end
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <nlohmann/json.hpp> // any JSON lib
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("report.pdf");
auto dc = nlohmann::json::parse(doc.classify_document());
int page = 0;
for (const auto& kind : dc["pages"]) {
if (kind == "text_layer") {
// Fast, free native path — no OCR cost.
std::cout << "=== page " << page << " (native) ===\n"
<< doc.extract_text(page) << '\n';
} else if (kind == "scanned" || kind == "image_text" || kind == "mixed") {
std::cout << "=== page " << page << " needs OCR ===\n";
}
++page;
}
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);
foreach ($auto->classifyDocumentKinds() as $page => $kind) {
if ($kind === 'text_layer') {
// Fast, free native path — no OCR cost.
echo "=== page {$page} (native) ===\n" . $doc->extractText($page) . "\n";
} elseif (in_array($kind, ['scanned', 'image_text', 'mixed'], true)) {
echo "=== page {$page} needs OCR ===\n";
}
}
Dart
import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('report.pdf');
final dc = jsonDecode(doc.classifyDocument()) as Map<String, dynamic>;
final pages = (dc['pages'] as List).cast<String>();
for (var page = 0; page < pages.length; page++) {
final kind = pages[page];
if (kind == 'text_layer') {
// Fast, free native path — no OCR cost.
print('=== page $page (native) ===\n${doc.extractText(page)}');
} else if (kind == 'scanned' || kind == 'image_text' || kind == 'mixed') {
print('=== page $page needs OCR ===');
}
}
}
R
library(pdfoxide)
library(jsonlite)
doc <- pdf_open("report.pdf")
dc <- fromJSON(pdf_classify_document(doc))
for (page in seq_along(dc$pages)) {
kind <- dc$pages[[page]]
idx <- page - 1L # 0-based page index
if (kind == "text_layer") {
# Fast, free native path — no OCR cost.
cat(sprintf("=== page %d (native) ===\n%s\n", idx, pdf_extract_text(doc, idx)))
} else if (kind %in% c("scanned", "image_text", "mixed")) {
cat(sprintf("=== page %d needs OCR ===\n", idx))
}
}
Julia
using PdfOxide
using JSON
doc = open_document("report.pdf")
dc = JSON.parse(classify_document(doc))
for (page, kind) in enumerate(dc["pages"])
idx = page - 1 # 0-based page index
if kind == "text_layer"
# Fast, free native path — no OCR cost.
println("=== page $idx (native) ===\n", extract_text(doc, idx))
elseif kind in ("scanned", "image_text", "mixed")
println("=== page $idx needs OCR ===")
end
end
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("report.pdf");
defer doc.deinit();
const dc = try doc.classifyDocument(alloc);
defer alloc.free(dc);
const parsed = try std.json.parseFromSlice(std.json.Value, alloc, dc, .{});
defer parsed.deinit();
const pages = parsed.value.object.get("pages").?.array;
for (pages.items, 0..) |kind_val, page| {
const kind = kind_val.string;
const idx: i32 = @intCast(page);
if (std.mem.eql(u8, kind, "text_layer")) {
// Fast, free native path — no OCR cost.
const text = try doc.extractText(alloc, idx);
defer alloc.free(text);
std.debug.print("=== page {d} (native) ===\n{s}\n", .{ idx, text });
} else if (std.mem.eql(u8, kind, "scanned") or
std.mem.eql(u8, kind, "image_text") or
std.mem.eql(u8, kind, "mixed"))
{
std.debug.print("=== page {d} needs OCR ===\n", .{idx});
}
}
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *json = [doc classifyDocumentWithError:&err];
NSDictionary *dc = [NSJSONSerialization JSONObjectWithData:[json dataUsingEncoding:NSUTF8StringEncoding]
options:0 error:&err];
NSArray<NSString *> *pages = dc[@"pages"];
[pages enumerateObjectsUsingBlock:^(NSString *kind, NSUInteger page, BOOL *stop) {
if ([kind isEqualToString:@"text_layer"]) {
// Fast, free native path — no OCR cost.
NSString *text = [doc extractText:(NSInteger)page error:nil];
NSLog(@"=== page %lu (native) ===\n%@", (unsigned long)page, text);
} else if ([kind isEqualToString:@"scanned"] ||
[kind isEqualToString:@"image_text"] ||
[kind isEqualToString:@"mixed"]) {
NSLog(@"=== page %lu needs OCR ===", (unsigned long)page);
}
}];
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
dc = doc |> PdfOxide.classify_document() |> Jason.decode!()
dc["pages"]
|> Enum.with_index()
|> Enum.each(fn {kind, page} ->
case kind do
"text_layer" ->
# Fast, free native path — no OCR cost.
IO.puts("=== page #{page} (native) ===\n#{PdfOxide.extract_text(doc, page)}")
k when k in ["scanned", "image_text", "mixed"] ->
IO.puts("=== page #{page} needs OCR ===")
_ ->
:ok
end
end)
The native classification preflight is essentially free relative to extraction — it never rasterises and never runs OCR, so you can run it across an entire corpus to decide which pages are worth the OCR budget. PDF Oxide’s native text extraction itself runs at a 0.8ms mean / 100% pass rate in the published benchmark, so the classify-then-extract path keeps the fast common case fast.
A note on encrypted PDFs
classify_page and classify_document fail closed on an encrypted document you have not authenticated — they return an EncryptedPdf error rather than silently reporting empty. Authenticate first (see Encrypt & Decrypt PDFs) before classifying. Non-security per-page failures degrade gracefully to empty.
FAQ
Does classification run OCR?
No. classify_page / classify_document are pure inspection of the PDF internals — no OCR, no rasterisation. That is what makes them cheap enough to run across a whole corpus as a preflight.
Is classification available in Python or Node? Not in v0.3.69. The methods are exposed in Rust, Go, C#, Swift, and WASM/JavaScript. From Python/Node, use auto-extraction or bridge through the Rust core / CLI.
How accurate is the text_layer vs scanned call?
The classifier blends multiple signals (glyph count, image area, raster codec, invisible-text ratio, garbled/fragmented ratios) and applies an enriched text-quality gate, so an unusable born-digital text layer (column-scramble, (cid:NN) garbage, per-glyph fragmentation) is downgraded to scanned with the typed reason instead of being trusted.
Why is the result JSON in Go / C# / Swift? Those bindings cross the C ABI, which returns the classification as a malloc’d JSON string. Deserialize it with your standard JSON library — the field names and enum tokens are frozen and stable across releases.
Related Pages
- OCR Scanned PDFs — extract text from the pages classification flags as
scanned - Offline Model Management — prefetch OCR models for air-gapped routing
- Extraction Profiles — tune native text extraction per document type
- Extract Text from a PDF — the native fast path for
text_layerpages