Skip to content

Classify PDF Pages — Text vs Scanned

Before you extract, you usually want to know what kind of page you are looking at: does it have a usable native text layer, or is it a scan that needs OCR? PDF Oxide answers that with a cheap preflightclassify_page and classify_document inspect the PDF’s internals (glyph counts, image area, codec, invisible-text ratio, garbled-glyph ratio) without running OCR and without rasterising the page.

The classification is explainable: every verdict carries a confidence score, a typed reason code, and the raw signals that drove the decision — so you can route pages to the right extractor (native text vs OCR) and log why.

Binding coverage. Classification is exposed in Rust, Go, C#, Swift, and WASM/JavaScript. The Python and Node N-API bindings do not expose classify_page / classify_document in v0.3.69 — from those runtimes, use the auto-extraction path or bridge through the Rust core / CLI.

How do I classify a single PDF page?

classify_page takes a 0-based page index and returns a PageClassification — in the C-ABI bindings (Go, C#, Swift, WASM) it comes back as a JSON string you deserialize.

Rust

use pdf_oxide::PdfDocument;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("mixed.pdf")?;

    // PdfDocument::classify_page(&self, page: usize)
    //   -> Result<pdf_oxide::extractors::auto::PageClassification>
    let result = doc.classify_page(0)?;

    println!("page {} is {:?} (confidence {:.2})",
        result.page, result.kind, result.confidence);
    println!("reason: {:?}", result.reason);
    println!("glyphs={} image_area={:.2} garbled={:.2}",
        result.signals.text_glyph_count,
        result.signals.image_area_ratio,
        result.signals.garbled_ratio);
    Ok(())
}

Go

package main

import (
	"encoding/json"
	"fmt"
	"log"

	pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
	doc, err := pdfoxide.Open("mixed.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	// func (doc *PdfDocument) ClassifyPage(pageIndex int) (string, error)
	raw, err := doc.ClassifyPage(0)
	if err != nil {
		log.Fatal(err)
	}

	var pc struct {
		Page       int     `json:"page"`
		Kind       string  `json:"kind"`
		Confidence float64 `json:"confidence"`
		Reason     string  `json:"reason"`
	}
	if err := json.Unmarshal([]byte(raw), &pc); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("page %d is %s (%.2f): %s\n", pc.Page, pc.Kind, pc.Confidence, pc.Reason)
}

C#

using System;
using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed.pdf");

// string PdfDocument.ClassifyPage(int pageIndex)
string raw = doc.ClassifyPage(0);

using var json = JsonDocument.Parse(raw);
var root = json.RootElement;
Console.WriteLine(
    $"page {root.GetProperty("page").GetInt32()} is " +
    $"{root.GetProperty("kind").GetString()} " +
    $"({root.GetProperty("confidence").GetDouble():F2}): " +
    $"{root.GetProperty("reason").GetString()}");

Swift

import PdfOxide

let doc = try PdfDocument(path: "mixed.pdf")

// func classifyPage(_ pageIndex: Int) throws -> String   (JSON)
let json = try doc.classifyPage(0)
print(json)

JavaScript (WASM)

import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();
const bytes = new Uint8Array(await (await fetch("mixed.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);

// WasmPdfDocument.classifyPage(pageIndex) -> JSON string
const pc = JSON.parse(doc.classifyPage(0));
console.log(`page ${pc.page} is ${pc.kind} (${pc.confidence}): ${pc.reason}`);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);

    // PageClass AutoExtractor.classifyPageKind(int pageIndex)
    PageClass kind = auto.classifyPageKind(0);
    System.out.println("page 0 is " + kind);   // TEXT_LAYER / SCANNED / IMAGE_TEXT / MIXED / EMPTY
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed.pdf');
$auto = AutoExtractor::of($doc);

// string AutoExtractor::classifyPageKind(int $pageIndex)
$kind = $auto->classifyPageKind(0);
echo "page 0 is {$kind}\n";   // text_layer / scanned / image_text / mixed / empty

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('mixed.pdf')
auto = PdfOxide::AutoExtractor.new(doc)

# AutoExtractor#classify_page(page_index)
#   => { reason:, kind:, confidence:, classification: }
pc = auto.classify_page(0)
puts "page 0 is #{pc[:kind]} (#{pc[:confidence]}): #{pc[:reason]}"

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("mixed.pdf");

    // std::string Document::classify_page(int page_index) — JSON
    std::string json = doc.classify_page(0);
    std::cout << json << '\n';
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('mixed.pdf');

  // String PdfDocument.classifyPage(int page) — JSON
  final json = doc.classifyPage(0);
  print(json);
}

R

library(pdfoxide)

doc <- pdf_open("mixed.pdf")

# pdf_classify_page(doc, page) — JSON PageClassification
json <- pdf_classify_page(doc, 0)
cat(json, "\n")

Julia

using PdfOxide

doc = open_document("mixed.pdf")

# classify_page(doc, page) -> JSON string
json = classify_page(doc, 0)
println(json)

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("mixed.pdf");
    defer doc.deinit();

    // classifyPage(alloc, page_index) -> caller-owned JSON bytes
    const json = try doc.classifyPage(alloc, 0);
    defer alloc.free(json);
    std.debug.print("{s}\n", .{json});
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];

// -classifyPage:error: -> JSON NSString
NSString *json = [doc classifyPage:0 error:&err];
NSLog(@"%@", json);

Elixir

{:ok, doc} = PdfOxide.open("mixed.pdf")

# PdfOxide.classify_page(doc, page) -> JSON string
json = PdfOxide.classify_page(doc, 0)
IO.puts(json)

How do I classify a whole PDF document at once?

classify_document runs the same cheap preflight over every page and rolls it up: a per-page kind list, the 0-based pages_needing_ocr indices, and an aggregate summary. The decision is per-page — PDF Oxide never forces one document mode onto a mixed PDF.

Rust

use pdf_oxide::PdfDocument;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("report.pdf")?;

    // PdfDocument::classify_document(&self)
    //   -> Result<pdf_oxide::extractors::auto::DocumentClassification>
    let dc = doc.classify_document()?;

    println!("summary: {:?}", dc.summary);
    println!("pages needing OCR: {:?}", dc.pages_needing_ocr);
    for (i, kind) in dc.pages.iter().enumerate() {
        println!("  page {i}: {kind:?}");
    }
    Ok(())
}

Go

package main

import (
	"encoding/json"
	"fmt"
	"log"

	pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
	doc, err := pdfoxide.Open("report.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	// func (doc *PdfDocument) ClassifyDocument() (string, error)
	raw, err := doc.ClassifyDocument()
	if err != nil {
		log.Fatal(err)
	}

	var dc struct {
		Pages           []string `json:"pages"`
		PagesNeedingOCR []int    `json:"pages_needing_ocr"`
		Summary         string   `json:"summary"`
	}
	if err := json.Unmarshal([]byte(raw), &dc); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("summary=%s ocr_pages=%v\n", dc.Summary, dc.PagesNeedingOCR)
}

C#

using System;
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

// string PdfDocument.ClassifyDocument()
string raw = doc.ClassifyDocument();
Console.WriteLine(raw);

Swift

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// func classifyDocument() throws -> String   (JSON)
let json = try doc.classifyDocument()
print(json)

JavaScript (WASM)

import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();
const bytes = new Uint8Array(await (await fetch("report.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);

// WasmPdfDocument.classifyDocument() -> JSON string
const dc = JSON.parse(doc.classifyDocument());
console.log("pages needing OCR:", dc.pages_needing_ocr);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);

    // List<PageClass> AutoExtractor.classifyDocumentKinds()
    List<PageClass> kinds = auto.classifyDocumentKinds();
    for (int i = 0; i < kinds.size(); i++) {
        System.out.println("page " + i + ": " + kinds.get(i));
    }
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);

// array<int,string> AutoExtractor::classifyDocumentKinds()
$kinds = $auto->classifyDocumentKinds();
foreach ($kinds as $i => $kind) {
    echo "page {$i}: {$kind}\n";
}

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)

# AutoExtractor#classify_document => decoded JSON envelope
dc = auto.classify_document
puts "pages needing OCR: #{dc['pages_needing_ocr']}"

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("report.pdf");

    // std::string Document::classify_document() — JSON
    std::string json = doc.classify_document();
    std::cout << json << '\n';
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');

  // String PdfDocument.classifyDocument() — JSON
  final json = doc.classifyDocument();
  print(json);
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")

# pdf_classify_document(doc) — JSON DocumentClassification
json <- pdf_classify_document(doc)
cat(json, "\n")

Julia

using PdfOxide

doc = open_document("report.pdf")

# classify_document(doc) -> JSON string
json = classify_document(doc)
println(json)

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("report.pdf");
    defer doc.deinit();

    // classifyDocument(alloc) -> caller-owned JSON bytes
    const json = try doc.classifyDocument(alloc);
    defer alloc.free(json);
    std.debug.print("{s}\n", .{json});
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];

// -classifyDocumentWithError: -> JSON NSString
NSString *json = [doc classifyDocumentWithError:&err];
NSLog(@"%@", json);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")

# PdfOxide.classify_document(doc) -> JSON string
json = PdfOxide.classify_document(doc)
IO.puts(json)

What does the classification JSON look like?

classify_page returns a PageClassification:

{
  "page": 0,
  "kind": "text_layer",
  "confidence": 0.97,
  "reason": "native_text_high_confidence",
  "signals": {
    "text_glyph_count": 1840,
    "text_area_ratio": 0.62,
    "image_area_ratio": 0.0,
    "codec": "none",
    "invisible_text_ratio": 0.0,
    "garbled_ratio": 0.0,
    "fragmented_word_ratio": 0.01,
    "consecutive_repeat_ratio": 0.0,
    "vector_path_density": 0.04,
    "has_reliable_structure": true,
    "producer_prior": "authoring",
    "page_is_empty": false
  }
}

classify_document returns a DocumentClassification:

{
  "pages": ["text_layer", "scanned", "image_text"],
  "pages_needing_ocr": [1],
  "summary": "mixed"
}

Page kinds (kind)

Kind Meaning Recommended route
text_layer Usable native text dominates Extract the text layer
scanned Image-dominated, no/garbled text OCR the page
image_text Native text and text-bearing image regions Hybrid: native + region OCR
mixed Heterogeneous within the page (text + image-table/figure) Auto-route per region
empty Blank / near-empty — not an error Skip

Document summary (summary)

mostly_text, mostly_scanned, mixed, or empty.

Reason codes (reason)

The reason is a frozen, append-only snake_case token. Common values: ok, native_text_high_confidence, no_text_layer_present, text_layer_below_threshold, glyph_mapping_missing, encrypted_no_extract_permission, image_table_reconstructed, image_table_no_structure.

How do I route extraction based on classification?

The point of the cheap preflight is to avoid paying for OCR on pages that don’t need it. Classify first, then extract only the OCR pages with the heavier path:

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::PageKind;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("report.pdf")?;
    let dc = doc.classify_document()?;

    for (page, kind) in dc.pages.iter().enumerate() {
        match kind {
            PageKind::TextLayer => {
                // Fast, free native path — no OCR cost.
                let text = doc.extract_text(page)?;
                println!("=== page {page} (native) ===\n{text}");
            }
            PageKind::Scanned | PageKind::ImageText | PageKind::Mixed => {
                println!("=== page {page} needs OCR ===");
                // route to your OCR / auto-extract pipeline here
            }
            PageKind::Empty => { /* skip */ }
        }
    }
    Ok(())
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);
    List<PageClass> kinds = auto.classifyDocumentKinds();

    for (int page = 0; page < kinds.size(); page++) {
        switch (kinds.get(page)) {
            case TEXT_LAYER -> {
                // Fast, free native path — no OCR cost.
                String text = doc.extractText(page);
                System.out.println("=== page " + page + " (native) ===\n" + text);
            }
            case SCANNED, IMAGE_TEXT, MIXED ->
                System.out.println("=== page " + page + " needs OCR ===");
            case EMPTY -> { /* skip */ }
        }
    }
}

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
dc   = auto.classify_document

dc['pages'].each_with_index do |kind, page|
  case kind
  when 'text_layer'
    # Fast, free native path — no OCR cost.
    text = doc.extract_text(page)
    puts "=== page #{page} (native) ===\n#{text}"
  when 'scanned', 'image_text', 'mixed'
    puts "=== page #{page} needs OCR ==="
  when 'empty'
    # skip
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <nlohmann/json.hpp>   // any JSON lib
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("report.pdf");

    auto dc = nlohmann::json::parse(doc.classify_document());
    int page = 0;
    for (const auto& kind : dc["pages"]) {
        if (kind == "text_layer") {
            // Fast, free native path — no OCR cost.
            std::cout << "=== page " << page << " (native) ===\n"
                      << doc.extract_text(page) << '\n';
        } else if (kind == "scanned" || kind == "image_text" || kind == "mixed") {
            std::cout << "=== page " << page << " needs OCR ===\n";
        }
        ++page;
    }
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc  = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);

foreach ($auto->classifyDocumentKinds() as $page => $kind) {
    if ($kind === 'text_layer') {
        // Fast, free native path — no OCR cost.
        echo "=== page {$page} (native) ===\n" . $doc->extractText($page) . "\n";
    } elseif (in_array($kind, ['scanned', 'image_text', 'mixed'], true)) {
        echo "=== page {$page} needs OCR ===\n";
    }
}

Dart

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  final dc = jsonDecode(doc.classifyDocument()) as Map<String, dynamic>;

  final pages = (dc['pages'] as List).cast<String>();
  for (var page = 0; page < pages.length; page++) {
    final kind = pages[page];
    if (kind == 'text_layer') {
      // Fast, free native path — no OCR cost.
      print('=== page $page (native) ===\n${doc.extractText(page)}');
    } else if (kind == 'scanned' || kind == 'image_text' || kind == 'mixed') {
      print('=== page $page needs OCR ===');
    }
  }
}

R

library(pdfoxide)
library(jsonlite)

doc <- pdf_open("report.pdf")
dc  <- fromJSON(pdf_classify_document(doc))

for (page in seq_along(dc$pages)) {
  kind <- dc$pages[[page]]
  idx  <- page - 1L   # 0-based page index
  if (kind == "text_layer") {
    # Fast, free native path — no OCR cost.
    cat(sprintf("=== page %d (native) ===\n%s\n", idx, pdf_extract_text(doc, idx)))
  } else if (kind %in% c("scanned", "image_text", "mixed")) {
    cat(sprintf("=== page %d needs OCR ===\n", idx))
  }
}

Julia

using PdfOxide
using JSON

doc = open_document("report.pdf")
dc  = JSON.parse(classify_document(doc))

for (page, kind) in enumerate(dc["pages"])
    idx = page - 1   # 0-based page index
    if kind == "text_layer"
        # Fast, free native path — no OCR cost.
        println("=== page $idx (native) ===\n", extract_text(doc, idx))
    elseif kind in ("scanned", "image_text", "mixed")
        println("=== page $idx needs OCR ===")
    end
end

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("report.pdf");
    defer doc.deinit();

    const dc = try doc.classifyDocument(alloc);
    defer alloc.free(dc);

    const parsed = try std.json.parseFromSlice(std.json.Value, alloc, dc, .{});
    defer parsed.deinit();

    const pages = parsed.value.object.get("pages").?.array;
    for (pages.items, 0..) |kind_val, page| {
        const kind = kind_val.string;
        const idx: i32 = @intCast(page);
        if (std.mem.eql(u8, kind, "text_layer")) {
            // Fast, free native path — no OCR cost.
            const text = try doc.extractText(alloc, idx);
            defer alloc.free(text);
            std.debug.print("=== page {d} (native) ===\n{s}\n", .{ idx, text });
        } else if (std.mem.eql(u8, kind, "scanned") or
            std.mem.eql(u8, kind, "image_text") or
            std.mem.eql(u8, kind, "mixed"))
        {
            std.debug.print("=== page {d} needs OCR ===\n", .{idx});
        }
    }
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];

NSString *json = [doc classifyDocumentWithError:&err];
NSDictionary *dc = [NSJSONSerialization JSONObjectWithData:[json dataUsingEncoding:NSUTF8StringEncoding]
                                                   options:0 error:&err];

NSArray<NSString *> *pages = dc[@"pages"];
[pages enumerateObjectsUsingBlock:^(NSString *kind, NSUInteger page, BOOL *stop) {
    if ([kind isEqualToString:@"text_layer"]) {
        // Fast, free native path — no OCR cost.
        NSString *text = [doc extractText:(NSInteger)page error:nil];
        NSLog(@"=== page %lu (native) ===\n%@", (unsigned long)page, text);
    } else if ([kind isEqualToString:@"scanned"] ||
               [kind isEqualToString:@"image_text"] ||
               [kind isEqualToString:@"mixed"]) {
        NSLog(@"=== page %lu needs OCR ===", (unsigned long)page);
    }
}];

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")

dc = doc |> PdfOxide.classify_document() |> Jason.decode!()

dc["pages"]
|> Enum.with_index()
|> Enum.each(fn {kind, page} ->
  case kind do
    "text_layer" ->
      # Fast, free native path — no OCR cost.
      IO.puts("=== page #{page} (native) ===\n#{PdfOxide.extract_text(doc, page)}")

    k when k in ["scanned", "image_text", "mixed"] ->
      IO.puts("=== page #{page} needs OCR ===")

    _ ->
      :ok
  end
end)

The native classification preflight is essentially free relative to extraction — it never rasterises and never runs OCR, so you can run it across an entire corpus to decide which pages are worth the OCR budget. PDF Oxide’s native text extraction itself runs at a 0.8ms mean / 100% pass rate in the published benchmark, so the classify-then-extract path keeps the fast common case fast.

A note on encrypted PDFs

classify_page and classify_document fail closed on an encrypted document you have not authenticated — they return an EncryptedPdf error rather than silently reporting empty. Authenticate first (see Encrypt & Decrypt PDFs) before classifying. Non-security per-page failures degrade gracefully to empty.

FAQ

Does classification run OCR? No. classify_page / classify_document are pure inspection of the PDF internals — no OCR, no rasterisation. That is what makes them cheap enough to run across a whole corpus as a preflight.

Is classification available in Python or Node? Not in v0.3.69. The methods are exposed in Rust, Go, C#, Swift, and WASM/JavaScript. From Python/Node, use auto-extraction or bridge through the Rust core / CLI.

How accurate is the text_layer vs scanned call? The classifier blends multiple signals (glyph count, image area, raster codec, invisible-text ratio, garbled/fragmented ratios) and applies an enriched text-quality gate, so an unusable born-digital text layer (column-scramble, (cid:NN) garbage, per-glyph fragmentation) is downgraded to scanned with the typed reason instead of being trusted.

Why is the result JSON in Go / C# / Swift? Those bindings cross the C ABI, which returns the classification as a malloc’d JSON string. Deserialize it with your standard JSON library — the field names and enum tokens are frozen and stable across releases.