PDF 页面分类 — 文本层 vs 扫描件
在提取文本之前,通常需要先了解当前页面是什么类型:是有可用的原生文本层,还是需要 OCR 处理的扫描图像?PDF Oxide 通过低成本预检来回答这个问题 — classify_page 和 classify_document 检查 PDF 内部结构(字形数量、图像面积、编解码器、不可见文本比率、乱码字形比率),无需运行 OCR,也无需光栅化页面。
分类结果是可解释的:每个判定都携带置信度分数、类型化的 reason 代码,以及驱动决策的原始 signals,让你能将页面路由到正确的提取器(原生文本 vs OCR)并记录原因。
绑定覆盖范围。 分类功能在 Rust、Go、C#、Swift 和 WASM/JavaScript 中可用。Python 和 Node N-API 绑定在 v0.3.69 中不暴露
classify_page/classify_document— 从这些运行时可使用自动提取路径,或通过 Rust 核心 / CLI 桥接。
如何对单个 PDF 页面进行分类?
classify_page 接受从 0 开始的页面索引,返回 PageClassification。在 C-ABI 绑定(Go、C#、Swift、WASM)中,它以需要反序列化的 JSON 字符串形式返回。
Rust
use pdf_oxide::PdfDocument;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("mixed.pdf")?;
// PdfDocument::classify_page(&self, page: usize)
// -> Result<pdf_oxide::extractors::auto::PageClassification>
let result = doc.classify_page(0)?;
println!("page {} is {:?} (confidence {:.2})",
result.page, result.kind, result.confidence);
println!("reason: {:?}", result.reason);
println!("glyphs={} image_area={:.2} garbled={:.2}",
result.signals.text_glyph_count,
result.signals.image_area_ratio,
result.signals.garbled_ratio);
Ok(())
}
Go
package main
import (
"encoding/json"
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("mixed.pdf")
if err != nil {
log.Fatal(err)
}
defer doc.Close()
// func (doc *PdfDocument) ClassifyPage(pageIndex int) (string, error)
raw, err := doc.ClassifyPage(0)
if err != nil {
log.Fatal(err)
}
var pc struct {
Page int `json:"page"`
Kind string `json:"kind"`
Confidence float64 `json:"confidence"`
Reason string `json:"reason"`
}
if err := json.Unmarshal([]byte(raw), &pc); err != nil {
log.Fatal(err)
}
fmt.Printf("page %d is %s (%.2f): %s\n", pc.Page, pc.Kind, pc.Confidence, pc.Reason)
}
C#
using System;
using System.Text.Json;
using PdfOxide.Core;
using var doc = PdfDocument.Open("mixed.pdf");
// string PdfDocument.ClassifyPage(int pageIndex)
string raw = doc.ClassifyPage(0);
using var json = JsonDocument.Parse(raw);
var root = json.RootElement;
Console.WriteLine(
$"page {root.GetProperty("page").GetInt32()} is " +
$"{root.GetProperty("kind").GetString()} " +
$"({root.GetProperty("confidence").GetDouble():F2}): " +
$"{root.GetProperty("reason").GetString()}");
Swift
import PdfOxide
let doc = try PdfDocument(path: "mixed.pdf")
// func classifyPage(_ pageIndex: Int) throws -> String (JSON)
let json = try doc.classifyPage(0)
print(json)
JavaScript (WASM)
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();
const bytes = new Uint8Array(await (await fetch("mixed.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);
// WasmPdfDocument.classifyPage(pageIndex) -> JSON string
const pc = JSON.parse(doc.classifyPage(0));
console.log(`page ${pc.page} is ${pc.kind} (${pc.confidence}): ${pc.reason}`);
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
// PageClass AutoExtractor.classifyPageKind(int pageIndex)
PageClass kind = auto.classifyPageKind(0);
System.out.println("page 0 is " + kind); // TEXT_LAYER / SCANNED / IMAGE_TEXT / MIXED / EMPTY
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('mixed.pdf');
$auto = AutoExtractor::of($doc);
// string AutoExtractor::classifyPageKind(int $pageIndex)
$kind = $auto->classifyPageKind(0);
echo "page 0 is {$kind}\n"; // text_layer / scanned / image_text / mixed / empty
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('mixed.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
# AutoExtractor#classify_page(page_index)
# => { reason:, kind:, confidence:, classification: }
pc = auto.classify_page(0)
puts "page 0 is #{pc[:kind]} (#{pc[:confidence]}): #{pc[:reason]}"
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("mixed.pdf");
// std::string Document::classify_page(int page_index) — JSON
std::string json = doc.classify_page(0);
std::cout << json << '\n';
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('mixed.pdf');
// String PdfDocument.classifyPage(int page) — JSON
final json = doc.classifyPage(0);
print(json);
}
R
library(pdfoxide)
doc <- pdf_open("mixed.pdf")
# pdf_classify_page(doc, page) — JSON PageClassification
json <- pdf_classify_page(doc, 0)
cat(json, "\n")
Julia
using PdfOxide
doc = open_document("mixed.pdf")
# classify_page(doc, page) -> JSON string
json = classify_page(doc, 0)
println(json)
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("mixed.pdf");
defer doc.deinit();
// classifyPage(alloc, page_index) -> caller-owned JSON bytes
const json = try doc.classifyPage(alloc, 0);
defer alloc.free(json);
std.debug.print("{s}\n", .{json});
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];
// -classifyPage:error: -> JSON NSString
NSString *json = [doc classifyPage:0 error:&err];
NSLog(@"%@", json);
Elixir
{:ok, doc} = PdfOxide.open("mixed.pdf")
# PdfOxide.classify_page(doc, page) -> JSON string
json = PdfOxide.classify_page(doc, 0)
IO.puts(json)
如何一次性对整个 PDF 文档进行分类?
classify_document 对每个页面执行相同的低成本预检并汇总结果:每页的 kind 列表、从 0 开始的 pages_needing_ocr 索引,以及聚合的 summary。判定是按页进行的 — PDF Oxide 不会将单一文档模式强加给混合型 PDF。
Rust
use pdf_oxide::PdfDocument;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("report.pdf")?;
// PdfDocument::classify_document(&self)
// -> Result<pdf_oxide::extractors::auto::DocumentClassification>
let dc = doc.classify_document()?;
println!("summary: {:?}", dc.summary);
println!("pages needing OCR: {:?}", dc.pages_needing_ocr);
for (i, kind) in dc.pages.iter().enumerate() {
println!(" page {i}: {kind:?}");
}
Ok(())
}
Go
package main
import (
"encoding/json"
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("report.pdf")
if err != nil {
log.Fatal(err)
}
defer doc.Close()
// func (doc *PdfDocument) ClassifyDocument() (string, error)
raw, err := doc.ClassifyDocument()
if err != nil {
log.Fatal(err)
}
var dc struct {
Pages []string `json:"pages"`
PagesNeedingOCR []int `json:"pages_needing_ocr"`
Summary string `json:"summary"`
}
if err := json.Unmarshal([]byte(raw), &dc); err != nil {
log.Fatal(err)
}
fmt.Printf("summary=%s ocr_pages=%v\n", dc.Summary, dc.PagesNeedingOCR)
}
C#
using System;
using PdfOxide.Core;
using var doc = PdfDocument.Open("report.pdf");
// string PdfDocument.ClassifyDocument()
string raw = doc.ClassifyDocument();
Console.WriteLine(raw);
Swift
import PdfOxide
let doc = try PdfDocument(path: "report.pdf")
// func classifyDocument() throws -> String (JSON)
let json = try doc.classifyDocument()
print(json)
JavaScript (WASM)
import init, { WasmPdfDocument } from "pdf-oxide-wasm";
await init();
const bytes = new Uint8Array(await (await fetch("report.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);
// WasmPdfDocument.classifyDocument() -> JSON string
const dc = JSON.parse(doc.classifyDocument());
console.log("pages needing OCR:", dc.pages_needing_ocr);
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
// List<PageClass> AutoExtractor.classifyDocumentKinds()
List<PageClass> kinds = auto.classifyDocumentKinds();
for (int i = 0; i < kinds.size(); i++) {
System.out.println("page " + i + ": " + kinds.get(i));
}
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);
// array<int,string> AutoExtractor::classifyDocumentKinds()
$kinds = $auto->classifyDocumentKinds();
foreach ($kinds as $i => $kind) {
echo "page {$i}: {$kind}\n";
}
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
# AutoExtractor#classify_document => decoded JSON envelope
dc = auto.classify_document
puts "pages needing OCR: #{dc['pages_needing_ocr']}"
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("report.pdf");
// std::string Document::classify_document() — JSON
std::string json = doc.classify_document();
std::cout << json << '\n';
}
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('report.pdf');
// String PdfDocument.classifyDocument() — JSON
final json = doc.classifyDocument();
print(json);
}
R
library(pdfoxide)
doc <- pdf_open("report.pdf")
# pdf_classify_document(doc) — JSON DocumentClassification
json <- pdf_classify_document(doc)
cat(json, "\n")
Julia
using PdfOxide
doc = open_document("report.pdf")
# classify_document(doc) -> JSON string
json = classify_document(doc)
println(json)
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("report.pdf");
defer doc.deinit();
// classifyDocument(alloc) -> caller-owned JSON bytes
const json = try doc.classifyDocument(alloc);
defer alloc.free(json);
std.debug.print("{s}\n", .{json});
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
// -classifyDocumentWithError: -> JSON NSString
NSString *json = [doc classifyDocumentWithError:&err];
NSLog(@"%@", json);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
# PdfOxide.classify_document(doc) -> JSON string
json = PdfOxide.classify_document(doc)
IO.puts(json)
分类 JSON 的格式是什么?
classify_page 返回 PageClassification:
{
"page": 0,
"kind": "text_layer",
"confidence": 0.97,
"reason": "native_text_high_confidence",
"signals": {
"text_glyph_count": 1840,
"text_area_ratio": 0.62,
"image_area_ratio": 0.0,
"codec": "none",
"invisible_text_ratio": 0.0,
"garbled_ratio": 0.0,
"fragmented_word_ratio": 0.01,
"consecutive_repeat_ratio": 0.0,
"vector_path_density": 0.04,
"has_reliable_structure": true,
"producer_prior": "authoring",
"page_is_empty": false
}
}
classify_document 返回 DocumentClassification:
{
"pages": ["text_layer", "scanned", "image_text"],
"pages_needing_ocr": [1],
"summary": "mixed"
}
页面类型(kind)
| 类型 | 含义 | 推荐路由 |
|---|---|---|
text_layer |
可用的原生文本层占主导 | 提取文本层 |
scanned |
图像为主,无文本或文本乱码 | 对该页进行 OCR |
image_text |
原生文本且存在含文字的图像区域 | 混合处理:原生 + 区域 OCR |
mixed |
页面内部异构(文本 + 图像表格/图形) | 按区域自动路由 |
empty |
空白 / 近乎空白 — 不是错误 | 跳过 |
文档摘要(summary)
mostly_text、mostly_scanned、mixed 或 empty。
原因代码(reason)
原因是一个冻结的、仅追加的 snake_case 令牌。常见值:ok、native_text_high_confidence、no_text_layer_present、text_layer_below_threshold、glyph_mapping_missing、encrypted_no_extract_permission、image_table_reconstructed、image_table_no_structure。
如何根据分类结果路由提取?
低成本预检的意义在于避免对不需要 OCR 的页面支付 OCR 开销。先分类,再仅对 OCR 页面走更重的处理路径:
Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::PageKind;
fn main() -> pdf_oxide::Result<()> {
let doc = PdfDocument::open("report.pdf")?;
let dc = doc.classify_document()?;
for (page, kind) in dc.pages.iter().enumerate() {
match kind {
PageKind::TextLayer => {
// Fast, free native path — no OCR cost.
let text = doc.extract_text(page)?;
println!("=== page {page} (native) ===\n{text}");
}
PageKind::Scanned | PageKind::ImageText | PageKind::Mixed => {
println!("=== page {page} needs OCR ===");
// route to your OCR / auto-extract pipeline here
}
PageKind::Empty => { /* skip */ }
}
}
Ok(())
}
Java
import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;
try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
AutoExtractor auto = AutoExtractor.of(doc);
List<PageClass> kinds = auto.classifyDocumentKinds();
for (int page = 0; page < kinds.size(); page++) {
switch (kinds.get(page)) {
case TEXT_LAYER -> {
// Fast, free native path — no OCR cost.
String text = doc.extractText(page);
System.out.println("=== page " + page + " (native) ===\n" + text);
}
case SCANNED, IMAGE_TEXT, MIXED ->
System.out.println("=== page " + page + " needs OCR ===");
case EMPTY -> { /* skip */ }
}
}
}
Ruby
require 'pdf_oxide'
doc = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
dc = auto.classify_document
dc['pages'].each_with_index do |kind, page|
case kind
when 'text_layer'
# Fast, free native path — no OCR cost.
text = doc.extract_text(page)
puts "=== page #{page} (native) ===\n#{text}"
when 'scanned', 'image_text', 'mixed'
puts "=== page #{page} needs OCR ==="
when 'empty'
# skip
end
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <nlohmann/json.hpp> // any JSON lib
#include <iostream>
int main() {
auto doc = pdf_oxide::Document::open("report.pdf");
auto dc = nlohmann::json::parse(doc.classify_document());
int page = 0;
for (const auto& kind : dc["pages"]) {
if (kind == "text_layer") {
// Fast, free native path — no OCR cost.
std::cout << "=== page " << page << " (native) ===\n"
<< doc.extract_text(page) << '\n';
} else if (kind == "scanned" || kind == "image_text" || kind == "mixed") {
std::cout << "=== page " << page << " needs OCR ===\n";
}
++page;
}
}
PHP
<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;
$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);
foreach ($auto->classifyDocumentKinds() as $page => $kind) {
if ($kind === 'text_layer') {
// Fast, free native path — no OCR cost.
echo "=== page {$page} (native) ===\n" . $doc->extractText($page) . "\n";
} elseif (in_array($kind, ['scanned', 'image_text', 'mixed'], true)) {
echo "=== page {$page} needs OCR ===\n";
}
}
Dart
import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';
void main() {
final doc = PdfDocument.open('report.pdf');
final dc = jsonDecode(doc.classifyDocument()) as Map<String, dynamic>;
final pages = (dc['pages'] as List).cast<String>();
for (var page = 0; page < pages.length; page++) {
final kind = pages[page];
if (kind == 'text_layer') {
// Fast, free native path — no OCR cost.
print('=== page $page (native) ===\n${doc.extractText(page)}');
} else if (kind == 'scanned' || kind == 'image_text' || kind == 'mixed') {
print('=== page $page needs OCR ===');
}
}
}
R
library(pdfoxide)
library(jsonlite)
doc <- pdf_open("report.pdf")
dc <- fromJSON(pdf_classify_document(doc))
for (page in seq_along(dc$pages)) {
kind <- dc$pages[[page]]
idx <- page - 1L # 0-based page index
if (kind == "text_layer") {
# Fast, free native path — no OCR cost.
cat(sprintf("=== page %d (native) ===\n%s\n", idx, pdf_extract_text(doc, idx)))
} else if (kind %in% c("scanned", "image_text", "mixed")) {
cat(sprintf("=== page %d needs OCR ===\n", idx))
}
}
Julia
using PdfOxide
using JSON
doc = open_document("report.pdf")
dc = JSON.parse(classify_document(doc))
for (page, kind) in enumerate(dc["pages"])
idx = page - 1 # 0-based page index
if kind == "text_layer"
# Fast, free native path — no OCR cost.
println("=== page $idx (native) ===\n", extract_text(doc, idx))
elseif kind in ("scanned", "image_text", "mixed")
println("=== page $idx needs OCR ===")
end
end
Zig
const std = @import("std");
const pdf = @import("pdf_oxide");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
const alloc = gpa.allocator();
var doc = try pdf.Document.open("report.pdf");
defer doc.deinit();
const dc = try doc.classifyDocument(alloc);
defer alloc.free(dc);
const parsed = try std.json.parseFromSlice(std.json.Value, alloc, dc, .{});
defer parsed.deinit();
const pages = parsed.value.object.get("pages").?.array;
for (pages.items, 0..) |kind_val, page| {
const kind = kind_val.string;
const idx: i32 = @intCast(page);
if (std.mem.eql(u8, kind, "text_layer")) {
// Fast, free native path — no OCR cost.
const text = try doc.extractText(alloc, idx);
defer alloc.free(text);
std.debug.print("=== page {d} (native) ===\n{s}\n", .{ idx, text });
} else if (std.mem.eql(u8, kind, "scanned") or
std.mem.eql(u8, kind, "image_text") or
std.mem.eql(u8, kind, "mixed"))
{
std.debug.print("=== page {d} needs OCR ===\n", .{idx});
}
}
}
Objective-C
#import <POXPdfOxide.h>
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *json = [doc classifyDocumentWithError:&err];
NSDictionary *dc = [NSJSONSerialization JSONObjectWithData:[json dataUsingEncoding:NSUTF8StringEncoding]
options:0 error:&err];
NSArray<NSString *> *pages = dc[@"pages"];
[pages enumerateObjectsUsingBlock:^(NSString *kind, NSUInteger page, BOOL *stop) {
if ([kind isEqualToString:@"text_layer"]) {
// Fast, free native path — no OCR cost.
NSString *text = [doc extractText:(NSInteger)page error:nil];
NSLog(@"=== page %lu (native) ===\n%@", (unsigned long)page, text);
} else if ([kind isEqualToString:@"scanned"] ||
[kind isEqualToString:@"image_text"] ||
[kind isEqualToString:@"mixed"]) {
NSLog(@"=== page %lu needs OCR ===", (unsigned long)page);
}
}];
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
dc = doc |> PdfOxide.classify_document() |> Jason.decode!()
dc["pages"]
|> Enum.with_index()
|> Enum.each(fn {kind, page} ->
case kind do
"text_layer" ->
# Fast, free native path — no OCR cost.
IO.puts("=== page #{page} (native) ===\n#{PdfOxide.extract_text(doc, page)}")
k when k in ["scanned", "image_text", "mixed"] ->
IO.puts("=== page #{page} needs OCR ===")
_ ->
:ok
end
end)
原生分类预检相对于提取操作几乎是零成本的 — 它既不光栅化也不运行 OCR,因此可以在整个文档集上运行,以决定哪些页面值得投入 OCR 预算。PDF Oxide 的原生文本提取本身在公开基准测试中达到 0.8ms 均值 / 100% 通过率,因此先分类再提取的路径能让快速的常见情况保持高效。
关于加密 PDF 的说明
对于尚未通过认证的加密文档,classify_page 和 classify_document 会安全失败 — 它们返回 EncryptedPdf 错误,而不是静默报告 empty。请先完成认证(参见加密与解密 PDF),再进行分类。非安全类的单页失败会优雅降级为 empty。
常见问题
分类会运行 OCR 吗?
不会。classify_page / classify_document 是对 PDF 内部的纯粹检查 — 没有 OCR,没有光栅化。正是这一点使它们足够轻量,可以作为预检在整个文档集上运行。
Python 或 Node 中能使用分类功能吗? v0.3.69 中不支持。这些方法在 Rust、Go、C#、Swift 和 WASM/JavaScript 中可用。从 Python/Node 中,请使用自动提取,或通过 Rust 核心 / CLI 桥接。
text_layer vs scanned 的判定有多准确?
分类器融合多个信号(字形数量、图像面积、光栅编解码器、不可见文本比率、乱码/碎片化比率),并应用增强的文本质量门控。因此,不可用的 born-digital 文本层(列序混乱、(cid:NN) 乱码、逐字形碎片化)会被降级为 scanned 并附上类型化的原因代码,而非被信任。
为什么 Go / C# / Swift 中结果是 JSON? 这些绑定跨越了 C ABI,分类结果以 malloc 的 JSON 字符串形式返回。请用标准 JSON 库反序列化 — 字段名和枚举令牌在各版本间已冻结且稳定。
相关页面
- 扫描件 PDF 的 OCR 处理 — 从分类标记为
scanned的页面提取文本 - 离线模型管理 — 为离线路由预取 OCR 模型
- 提取配置文件 — 按文档类型调优原生文本提取
- 从 PDF 中提取文本 —
text_layer页面的原生快速路径