What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

PDF 页面分类 — 文本层 vs 扫描件

在提取文本之前，通常需要先了解当前页面是什么类型：是有可用的原生文本层，还是需要 OCR 处理的扫描图像？PDF Oxide 通过低成本预检来回答这个问题 — classify_page 和 classify_document 检查 PDF 内部结构（字形数量、图像面积、编解码器、不可见文本比率、乱码字形比率），无需运行 OCR，也无需光栅化页面。

分类结果是可解释的：每个判定都携带置信度分数、类型化的 reason 代码，以及驱动决策的原始 signals，让你能将页面路由到正确的提取器（原生文本 vs OCR）并记录原因。

绑定覆盖范围。 分类功能在 Rust、Go、C#、Swift 和 WASM/JavaScript 中可用。Python 和 Node N-API 绑定在 v0.3.69 中不暴露 classify_page / classify_document — 从这些运行时可使用自动提取路径，或通过 Rust 核心 / CLI 桥接。

如何对单个 PDF 页面进行分类？

classify_page 接受从 0 开始的页面索引，返回 PageClassification。在 C-ABI 绑定（Go、C#、Swift、WASM）中，它以需要反序列化的 JSON 字符串形式返回。

Rust

use pdf_oxide::PdfDocument;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("mixed.pdf")?;

    // PdfDocument::classify_page(&self, page: usize)
    //   -> Result<pdf_oxide::extractors::auto::PageClassification>
    let result = doc.classify_page(0)?;

    println!("page {} is {:?} (confidence {:.2})",
        result.page, result.kind, result.confidence);
    println!("reason: {:?}", result.reason);
    println!("glyphs={} image_area={:.2} garbled={:.2}",
        result.signals.text_glyph_count,
        result.signals.image_area_ratio,
        result.signals.garbled_ratio);
    Ok(())
}

package main

import (
	"encoding/json"
	"fmt"
	"log"

	pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
	doc, err := pdfoxide.Open("mixed.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	// func (doc *PdfDocument) ClassifyPage(pageIndex int) (string, error)
	raw, err := doc.ClassifyPage(0)
	if err != nil {
		log.Fatal(err)
	}

	var pc struct {
		Page       int     `json:"page"`
		Kind       string  `json:"kind"`
		Confidence float64 `json:"confidence"`
		Reason     string  `json:"reason"`
	}
	if err := json.Unmarshal([]byte(raw), &pc); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("page %d is %s (%.2f): %s\n", pc.Page, pc.Kind, pc.Confidence, pc.Reason)
}

using System;
using System.Text.Json;
using PdfOxide.Core;

using var doc = PdfDocument.Open("mixed.pdf");

// string PdfDocument.ClassifyPage(int pageIndex)
string raw = doc.ClassifyPage(0);

using var json = JsonDocument.Parse(raw);
var root = json.RootElement;
Console.WriteLine(
    $"page {root.GetProperty("page").GetInt32()} is " +
    $"{root.GetProperty("kind").GetString()} " +
    $"({root.GetProperty("confidence").GetDouble():F2}): " +
    $"{root.GetProperty("reason").GetString()}");

Swift

import PdfOxide

let doc = try PdfDocument(path: "mixed.pdf")

// func classifyPage(_ pageIndex: Int) throws -> String   (JSON)
let json = try doc.classifyPage(0)
print(json)

JavaScript (WASM)

import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();
const bytes = new Uint8Array(await (await fetch("mixed.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);

// WasmPdfDocument.classifyPage(pageIndex) -> JSON string
const pc = JSON.parse(doc.classifyPage(0));
console.log(`page ${pc.page} is ${pc.kind} (${pc.confidence}): ${pc.reason}`);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("mixed.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);

    // PageClass AutoExtractor.classifyPageKind(int pageIndex)
    PageClass kind = auto.classifyPageKind(0);
    System.out.println("page 0 is " + kind);   // TEXT_LAYER / SCANNED / IMAGE_TEXT / MIXED / EMPTY
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('mixed.pdf');
$auto = AutoExtractor::of($doc);

// string AutoExtractor::classifyPageKind(int $pageIndex)
$kind = $auto->classifyPageKind(0);
echo "page 0 is {$kind}\n";   // text_layer / scanned / image_text / mixed / empty

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('mixed.pdf')
auto = PdfOxide::AutoExtractor.new(doc)

# AutoExtractor#classify_page(page_index)
#   => { reason:, kind:, confidence:, classification: }
pc = auto.classify_page(0)
puts "page 0 is #{pc[:kind]} (#{pc[:confidence]}): #{pc[:reason]}"

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("mixed.pdf");

    // std::string Document::classify_page(int page_index) — JSON
    std::string json = doc.classify_page(0);
    std::cout << json << '\n';
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('mixed.pdf');

  // String PdfDocument.classifyPage(int page) — JSON
  final json = doc.classifyPage(0);
  print(json);
}

library(pdfoxide)

doc <- pdf_open("mixed.pdf")

# pdf_classify_page(doc, page) — JSON PageClassification
json <- pdf_classify_page(doc, 0)
cat(json, "\n")

Julia

using PdfOxide

doc = open_document("mixed.pdf")

# classify_page(doc, page) -> JSON string
json = classify_page(doc, 0)
println(json)

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("mixed.pdf");
    defer doc.deinit();

    // classifyPage(alloc, page_index) -> caller-owned JSON bytes
    const json = try doc.classifyPage(alloc, 0);
    defer alloc.free(json);
    std.debug.print("{s}\n", .{json});
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"mixed.pdf" error:&err];

// -classifyPage:error: -> JSON NSString
NSString *json = [doc classifyPage:0 error:&err];
NSLog(@"%@", json);

Elixir

{:ok, doc} = PdfOxide.open("mixed.pdf")

# PdfOxide.classify_page(doc, page) -> JSON string
json = PdfOxide.classify_page(doc, 0)
IO.puts(json)

如何一次性对整个 PDF 文档进行分类？

classify_document 对每个页面执行相同的低成本预检并汇总结果：每页的 kind 列表、从 0 开始的 pages_needing_ocr 索引，以及聚合的 summary。判定是按页进行的 — PDF Oxide 不会将单一文档模式强加给混合型 PDF。

Rust

use pdf_oxide::PdfDocument;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("report.pdf")?;

    // PdfDocument::classify_document(&self)
    //   -> Result<pdf_oxide::extractors::auto::DocumentClassification>
    let dc = doc.classify_document()?;

    println!("summary: {:?}", dc.summary);
    println!("pages needing OCR: {:?}", dc.pages_needing_ocr);
    for (i, kind) in dc.pages.iter().enumerate() {
        println!("  page {i}: {kind:?}");
    }
    Ok(())
}

package main

import (
	"encoding/json"
	"fmt"
	"log"

	pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
	doc, err := pdfoxide.Open("report.pdf")
	if err != nil {
		log.Fatal(err)
	}
	defer doc.Close()

	// func (doc *PdfDocument) ClassifyDocument() (string, error)
	raw, err := doc.ClassifyDocument()
	if err != nil {
		log.Fatal(err)
	}

	var dc struct {
		Pages           []string `json:"pages"`
		PagesNeedingOCR []int    `json:"pages_needing_ocr"`
		Summary         string   `json:"summary"`
	}
	if err := json.Unmarshal([]byte(raw), &dc); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("summary=%s ocr_pages=%v\n", dc.Summary, dc.PagesNeedingOCR)
}

using System;
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

// string PdfDocument.ClassifyDocument()
string raw = doc.ClassifyDocument();
Console.WriteLine(raw);

Swift

import PdfOxide

let doc = try PdfDocument(path: "report.pdf")

// func classifyDocument() throws -> String   (JSON)
let json = try doc.classifyDocument()
print(json)

JavaScript (WASM)

import init, { WasmPdfDocument } from "pdf-oxide-wasm";

await init();
const bytes = new Uint8Array(await (await fetch("report.pdf")).arrayBuffer());
const doc = WasmPdfDocument.fromBytes(bytes);

// WasmPdfDocument.classifyDocument() -> JSON string
const dc = JSON.parse(doc.classifyDocument());
console.log("pages needing OCR:", dc.pages_needing_ocr);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);

    // List<PageClass> AutoExtractor.classifyDocumentKinds()
    List<PageClass> kinds = auto.classifyDocumentKinds();
    for (int i = 0; i < kinds.size(); i++) {
        System.out.println("page " + i + ": " + kinds.get(i));
    }
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);

// array<int,string> AutoExtractor::classifyDocumentKinds()
$kinds = $auto->classifyDocumentKinds();
foreach ($kinds as $i => $kind) {
    echo "page {$i}: {$kind}\n";
}

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)

# AutoExtractor#classify_document => decoded JSON envelope
dc = auto.classify_document
puts "pages needing OCR: #{dc['pages_needing_ocr']}"

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("report.pdf");

    // std::string Document::classify_document() — JSON
    std::string json = doc.classify_document();
    std::cout << json << '\n';
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');

  // String PdfDocument.classifyDocument() — JSON
  final json = doc.classifyDocument();
  print(json);
}

library(pdfoxide)

doc <- pdf_open("report.pdf")

# pdf_classify_document(doc) — JSON DocumentClassification
json <- pdf_classify_document(doc)
cat(json, "\n")

Julia

using PdfOxide

doc = open_document("report.pdf")

# classify_document(doc) -> JSON string
json = classify_document(doc)
println(json)

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("report.pdf");
    defer doc.deinit();

    // classifyDocument(alloc) -> caller-owned JSON bytes
    const json = try doc.classifyDocument(alloc);
    defer alloc.free(json);
    std.debug.print("{s}\n", .{json});
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];

// -classifyDocumentWithError: -> JSON NSString
NSString *json = [doc classifyDocumentWithError:&err];
NSLog(@"%@", json);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")

# PdfOxide.classify_document(doc) -> JSON string
json = PdfOxide.classify_document(doc)
IO.puts(json)

分类 JSON 的格式是什么？

classify_page 返回 PageClassification：

{
  "page": 0,
  "kind": "text_layer",
  "confidence": 0.97,
  "reason": "native_text_high_confidence",
  "signals": {
    "text_glyph_count": 1840,
    "text_area_ratio": 0.62,
    "image_area_ratio": 0.0,
    "codec": "none",
    "invisible_text_ratio": 0.0,
    "garbled_ratio": 0.0,
    "fragmented_word_ratio": 0.01,
    "consecutive_repeat_ratio": 0.0,
    "vector_path_density": 0.04,
    "has_reliable_structure": true,
    "producer_prior": "authoring",
    "page_is_empty": false
  }
}

classify_document 返回 DocumentClassification：

{
  "pages": ["text_layer", "scanned", "image_text"],
  "pages_needing_ocr": [1],
  "summary": "mixed"
}

页面类型（`kind`）

类型	含义	推荐路由
`text_layer`	可用的原生文本层占主导	提取文本层
`scanned`	图像为主，无文本或文本乱码	对该页进行 OCR
`image_text`	原生文本且存在含文字的图像区域	混合处理：原生 + 区域 OCR
`mixed`	页面内部异构（文本 + 图像表格/图形）	按区域自动路由
`empty`	空白 / 近乎空白 — 不是错误	跳过

文档摘要（`summary`）

mostly_text、mostly_scanned、mixed 或 empty。

原因代码（`reason`）

原因是一个冻结的、仅追加的 snake_case 令牌。常见值：ok、native_text_high_confidence、no_text_layer_present、text_layer_below_threshold、glyph_mapping_missing、encrypted_no_extract_permission、image_table_reconstructed、image_table_no_structure。

如何根据分类结果路由提取？

低成本预检的意义在于避免对不需要 OCR 的页面支付 OCR 开销。先分类，再仅对 OCR 页面走更重的处理路径：

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::auto::PageKind;

fn main() -> pdf_oxide::Result<()> {
    let doc = PdfDocument::open("report.pdf")?;
    let dc = doc.classify_document()?;

    for (page, kind) in dc.pages.iter().enumerate() {
        match kind {
            PageKind::TextLayer => {
                // Fast, free native path — no OCR cost.
                let text = doc.extract_text(page)?;
                println!("=== page {page} (native) ===\n{text}");
            }
            PageKind::Scanned | PageKind::ImageText | PageKind::Mixed => {
                println!("=== page {page} needs OCR ===");
                // route to your OCR / auto-extract pipeline here
            }
            PageKind::Empty => { /* skip */ }
        }
    }
    Ok(())
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.AutoExtractor;
import fyi.oxide.pdf.auto.PageClass;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    AutoExtractor auto = AutoExtractor.of(doc);
    List<PageClass> kinds = auto.classifyDocumentKinds();

    for (int page = 0; page < kinds.size(); page++) {
        switch (kinds.get(page)) {
            case TEXT_LAYER -> {
                // Fast, free native path — no OCR cost.
                String text = doc.extractText(page);
                System.out.println("=== page " + page + " (native) ===\n" + text);
            }
            case SCANNED, IMAGE_TEXT, MIXED ->
                System.out.println("=== page " + page + " needs OCR ===");
            case EMPTY -> { /* skip */ }
        }
    }
}

Ruby

require 'pdf_oxide'

doc  = PdfOxide::PdfDocument.open('report.pdf')
auto = PdfOxide::AutoExtractor.new(doc)
dc   = auto.classify_document

dc['pages'].each_with_index do |kind, page|
  case kind
  when 'text_layer'
    # Fast, free native path — no OCR cost.
    text = doc.extract_text(page)
    puts "=== page #{page} (native) ===\n#{text}"
  when 'scanned', 'image_text', 'mixed'
    puts "=== page #{page} needs OCR ==="
  when 'empty'
    # skip
  end
end

C++

#include <pdf_oxide/pdf_oxide.hpp>
#include <nlohmann/json.hpp>   // any JSON lib
#include <iostream>

int main() {
    auto doc = pdf_oxide::Document::open("report.pdf");

    auto dc = nlohmann::json::parse(doc.classify_document());
    int page = 0;
    for (const auto& kind : dc["pages"]) {
        if (kind == "text_layer") {
            // Fast, free native path — no OCR cost.
            std::cout << "=== page " << page << " (native) ===\n"
                      << doc.extract_text(page) << '\n';
        } else if (kind == "scanned" || kind == "image_text" || kind == "mixed") {
            std::cout << "=== page " << page << " needs OCR ===\n";
        }
        ++page;
    }
}

PHP

<?php
use PdfOxide\PdfDocument;
use PdfOxide\AutoExtractor;

$doc  = PdfDocument::open('report.pdf');
$auto = AutoExtractor::of($doc);

foreach ($auto->classifyDocumentKinds() as $page => $kind) {
    if ($kind === 'text_layer') {
        // Fast, free native path — no OCR cost.
        echo "=== page {$page} (native) ===\n" . $doc->extractText($page) . "\n";
    } elseif (in_array($kind, ['scanned', 'image_text', 'mixed'], true)) {
        echo "=== page {$page} needs OCR ===\n";
    }
}

Dart

import 'dart:convert';
import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  final dc = jsonDecode(doc.classifyDocument()) as Map<String, dynamic>;

  final pages = (dc['pages'] as List).cast<String>();
  for (var page = 0; page < pages.length; page++) {
    final kind = pages[page];
    if (kind == 'text_layer') {
      // Fast, free native path — no OCR cost.
      print('=== page $page (native) ===\n${doc.extractText(page)}');
    } else if (kind == 'scanned' || kind == 'image_text' || kind == 'mixed') {
      print('=== page $page needs OCR ===');
    }
  }
}

library(pdfoxide)
library(jsonlite)

doc <- pdf_open("report.pdf")
dc  <- fromJSON(pdf_classify_document(doc))

for (page in seq_along(dc$pages)) {
  kind <- dc$pages[[page]]
  idx  <- page - 1L   # 0-based page index
  if (kind == "text_layer") {
    # Fast, free native path — no OCR cost.
    cat(sprintf("=== page %d (native) ===\n%s\n", idx, pdf_extract_text(doc, idx)))
  } else if (kind %in% c("scanned", "image_text", "mixed")) {
    cat(sprintf("=== page %d needs OCR ===\n", idx))
  }
}

Julia

using PdfOxide
using JSON

doc = open_document("report.pdf")
dc  = JSON.parse(classify_document(doc))

for (page, kind) in enumerate(dc["pages"])
    idx = page - 1   # 0-based page index
    if kind == "text_layer"
        # Fast, free native path — no OCR cost.
        println("=== page $idx (native) ===\n", extract_text(doc, idx))
    elseif kind in ("scanned", "image_text", "mixed")
        println("=== page $idx needs OCR ===")
    end
end

Zig

const std = @import("std");
const pdf = @import("pdf_oxide");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const alloc = gpa.allocator();

    var doc = try pdf.Document.open("report.pdf");
    defer doc.deinit();

    const dc = try doc.classifyDocument(alloc);
    defer alloc.free(dc);

    const parsed = try std.json.parseFromSlice(std.json.Value, alloc, dc, .{});
    defer parsed.deinit();

    const pages = parsed.value.object.get("pages").?.array;
    for (pages.items, 0..) |kind_val, page| {
        const kind = kind_val.string;
        const idx: i32 = @intCast(page);
        if (std.mem.eql(u8, kind, "text_layer")) {
            // Fast, free native path — no OCR cost.
            const text = try doc.extractText(alloc, idx);
            defer alloc.free(text);
            std.debug.print("=== page {d} (native) ===\n{s}\n", .{ idx, text });
        } else if (std.mem.eql(u8, kind, "scanned") or
            std.mem.eql(u8, kind, "image_text") or
            std.mem.eql(u8, kind, "mixed"))
        {
            std.debug.print("=== page {d} needs OCR ===\n", .{idx});
        }
    }
}

Objective-C

#import <POXPdfOxide.h>

NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];

NSString *json = [doc classifyDocumentWithError:&err];
NSDictionary *dc = [NSJSONSerialization JSONObjectWithData:[json dataUsingEncoding:NSUTF8StringEncoding]
                                                   options:0 error:&err];

NSArray<NSString *> *pages = dc[@"pages"];
[pages enumerateObjectsUsingBlock:^(NSString *kind, NSUInteger page, BOOL *stop) {
    if ([kind isEqualToString:@"text_layer"]) {
        // Fast, free native path — no OCR cost.
        NSString *text = [doc extractText:(NSInteger)page error:nil];
        NSLog(@"=== page %lu (native) ===\n%@", (unsigned long)page, text);
    } else if ([kind isEqualToString:@"scanned"] ||
               [kind isEqualToString:@"image_text"] ||
               [kind isEqualToString:@"mixed"]) {
        NSLog(@"=== page %lu needs OCR ===", (unsigned long)page);
    }
}];

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")

dc = doc |> PdfOxide.classify_document() |> Jason.decode!()

dc["pages"]
|> Enum.with_index()
|> Enum.each(fn {kind, page} ->
  case kind do
    "text_layer" ->
      # Fast, free native path — no OCR cost.
      IO.puts("=== page #{page} (native) ===\n#{PdfOxide.extract_text(doc, page)}")

    k when k in ["scanned", "image_text", "mixed"] ->
      IO.puts("=== page #{page} needs OCR ===")

    _ ->
      :ok
  end
end)

原生分类预检相对于提取操作几乎是零成本的 — 它既不光栅化也不运行 OCR，因此可以在整个文档集上运行，以决定哪些页面值得投入 OCR 预算。PDF Oxide 的原生文本提取本身在公开基准测试中达到 0.8ms 均值 / 100% 通过率，因此先分类再提取的路径能让快速的常见情况保持高效。

关于加密 PDF 的说明

对于尚未通过认证的加密文档，classify_page 和 classify_document 会安全失败 — 它们返回 EncryptedPdf 错误，而不是静默报告 empty。请先完成认证（参见加密与解密 PDF），再进行分类。非安全类的单页失败会优雅降级为 empty。

常见问题

分类会运行 OCR 吗？ 不会。classify_page / classify_document 是对 PDF 内部的纯粹检查 — 没有 OCR，没有光栅化。正是这一点使它们足够轻量，可以作为预检在整个文档集上运行。

Python 或 Node 中能使用分类功能吗？ v0.3.69 中不支持。这些方法在 Rust、Go、C#、Swift 和 WASM/JavaScript 中可用。从 Python/Node 中，请使用自动提取，或通过 Rust 核心 / CLI 桥接。

text_layer vs scanned 的判定有多准确？ 分类器融合多个信号（字形数量、图像面积、光栅编解码器、不可见文本比率、乱码/碎片化比率），并应用增强的文本质量门控。因此，不可用的 born-digital 文本层（列序混乱、(cid:NN) 乱码、逐字形碎片化）会被降级为 scanned 并附上类型化的原因代码，而非被信任。

为什么 Go / C# / Swift 中结果是 JSON？ 这些绑定跨越了 C ABI，分类结果以 malloc 的 JSON 字符串形式返回。请用标准 JSON 库反序列化 — 字段名和枚举令牌在各版本间已冻结且稳定。

PDF 页面分类 — 文本层 vs 扫描件

如何对单个 PDF 页面进行分类？

如何一次性对整个 PDF 文档进行分类？

分类 JSON 的格式是什么？

页面类型（kind）

文档摘要（summary）

原因代码（reason）

如何根据分类结果路由提取？

关于加密 PDF 的说明

常见问题

相关页面

页面类型（`kind`）

文档摘要（`summary`）

原因代码（`reason`）