What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Kotlin API Reference

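PDF Oxide provides idiomatic Kotlin/JVM bindings (Android-compatible). They are a thin facade over the mature fyi.oxide:pdf-oxide Java binding, which owns the single JNI native bridge (the pdf_oxide_jni crate). The Kotlin module contains no native code: it re-exports the Java types (PdfDocument, Pdf, PdfPage, DocumentEditor, PdfSigner, PdfValidator, AutoExtractor, and the geometry / text / table / search value types) and layers Kotlin sugar on top: extension functions that turn Optional<T> into T?, plus use { } for AutoCloseable handles.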

// build.gradle.kts
dependencies {
    implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")
}

The JNI native library (libpdf_oxide_jni) is not bundled with the package. Load it with System.loadLibrary("pdf_oxide_jni") (put the .so/.dylib on java.library.path, or under jniLibs/<abi>/ on Android), or point the Java NativeLoader at it with -Dfyi.oxide.pdf.lib.path=<path>.

For the Java API, see the Java API Reference. For the Rust API, see the Rust API Reference. For type details, see Types and Enums.

import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull

Pdf.fromMarkdown("# Hello\n\nbody\n").use { pdf ->
    PdfDocument.open(pdf.save()).use { doc ->
        println(doc.pageCount())
        println(doc.extractText(0))
        println(doc.toMarkdown())
        println(doc.page(0).words().map { it.text() })
        println(doc.producerOrNull() ?: "(no producer)")   // Optional -> nullable
    }
}

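All handles (PdfDocument, Pdf, DocumentEditor) implement AutoCloseable, so Kotlin's use { } block releases native memory deterministically. Failures throw PdfException (or one of its subclasses); see Exceptions.

PdfDocument

The main read-only entry point for working with PDFs: opening, extracting, converting, rendering, searching, and inspecting form fields. Instances hold native memory and must be closed; wrap them in use { }.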

import fyi.oxide.pdf.PdfDocument

Factory Methods

PdfDocument.open(path: Path): PdfDocument

Opens a PDF from a filesystem path.

PdfDocument.open(path: String): PdfDocument

Opens a PDF from a path string.

PdfDocument.open(bytes: ByteArray): PdfDocument

Opens a PDF from in-memory bytes (for example, bytes downloaded from S3 or received over HTTP).

PdfDocument.open(path: Path, password: String): PdfDocument

Opens an encrypted PDF from a path using the user or owner password.

PdfDocument.open(path: String, password: String): PdfDocument

Opens an encrypted PDF from a path string using a password.

PdfDocument.open(bytes: ByteArray, password: String): PdfDocument

Opens an encrypted PDF from bytes using a password.

PdfDocument.open(stream: InputStream): PdfDocument

Opens a PDF by reading all bytes from an InputStream.

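Static One-Shot Calls

PdfDocument.extractText(path: String): String
PdfDocument.extractText(path: Path): String

Opens the file, extracts all text, and closes it in a single call. Suited to simple cases that do not need a live handle.

Authentication

doc.authenticate(password: String): Boolean
doc.authenticate(password: ByteArray): Boolean

Authenticates an encrypted document after it has been opened. Returns true when the password matches.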

Document Info

doc.pageCount(): Int

The number of pages in the document.

doc.producer(): Optional<String>
doc.creator(): Optional<String>

The document's /Producer and /Creator metadata. For null-based access, use the Kotlin producerOrNull() / creatorOrNull() extension functions.

val doc.isOpen: Boolean

Whether the native handle is still open (the Kotlin property view of the Java isOpen() getter).

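Text Extraction

doc.extractText(pageIndex: Int): String

Extracts plain text from a single page, addressed by zero-based index.

doc.extractTextAuto(pageIndex: Int): String

Extracts text with an automatically chosen strategy (falls back to OCR for scanned pages when the OCR feature is available).

doc.extractStructured(page: Int): String

Extracts a structured (JSON) representation of the page's text and layout.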

Conversion

doc.toMarkdown(): String
doc.toMarkdown(pageIndex: Int): String

Converts the whole document or a single page to Markdown.

doc.toHtml(): String
doc.toHtml(pageIndex: Int): String

Converts the whole document or a single page to HTML.

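Search

doc.search(query: String): List<SearchMatch>

Searches the document for a literal string. Returns the matches on each page together with their bounding boxes.

doc.search(query: String, caseInsensitive: Boolean, regex: Boolean, maxResults: Int): List<SearchMatch>

Searches with control over case-insensitivity, regex matching, and a result cap (maxResults = 0 means unlimited).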
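As a small illustration of the four-argument overload, the sketch below counts matches per page. It relies only on the signatures listed above; the helper name matchCountsByPage is hypothetical and not part of the library. The arguments are passed positionally because Kotlin does not allow named arguments when calling Java methods.

```kotlin
import fyi.oxide.pdf.PdfDocument

// Hypothetical helper (not part of PDF Oxide): number of matches per zero-based page index.
fun matchCountsByPage(doc: PdfDocument, query: String): Map<Int, Int> =
    doc.search(query, true, false, 0)   // caseInsensitive = true, regex = false, maxResults = 0 (unlimited)
        .groupingBy { it.pageIndex() }  // SearchMatch.pageIndex() as documented under Search Types
        .eachCount()
```

Pages without any match are simply absent from the returned map.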

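Forms

doc.formFields(): List<FormField>

Returns all AcroForm fields with their type, value, widget bounds, and page index. See FormField.

Rendering

doc.render(pageIndex: Int): ByteArray
doc.render(pageIndex: Int, dpi: Int): ByteArray

Renders a page to PNG image bytes at the default DPI or at a specified DPI.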
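To pick a DPI, it can help to estimate the output size first. PDF user space defaults to 72 points per inch, so a length in points maps to points × dpi / 72 pixels. The sketch below is plain arithmetic and does not call the library; the function names are hypothetical, and rounding to the nearest pixel is an assumption, since the renderer may round differently or account for page rotation.

```kotlin
import kotlin.math.roundToInt

// PDF user space uses 72 points per inch by default, so pixels = points * dpi / 72.
// Rounding to the nearest pixel is an assumption; the actual renderer may differ.
fun pixelsFor(points: Double, dpi: Int): Int = (points * dpi / 72.0).roundToInt()

// Estimated pixel dimensions for a page whose size is given in PDF points.
fun renderedSize(widthPt: Double, heightPt: Double, dpi: Int): Pair<Int, Int> =
    pixelsFor(widthPt, dpi) to pixelsFor(heightPt, dpi)
```

For a US Letter page (612 × 792 points), renderedSize(612.0, 792.0, 150) gives 1275 × 1650 pixels.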

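Page Access

doc.page(index: Int): PdfPage

Returns a lazy PdfPage handle for the given zero-based index.

doc.pages(): List<PdfPage>

Returns all pages as a list.

doc.pagesStream(): Stream<PdfPage>

Returns all pages as a Java Stream, convenient for stream-style processing.

Lifecycle

doc.close()

Releases native memory. Idempotent: calling it again is a no-op. Prefer use { }.

PdfPage

A lazy page handle returned by PdfDocument.page(), pages(), or pagesStream(). Every accessor forwards to the parent document when called.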

PdfDocument.open(bytes).use { doc ->
    val page = doc.page(0)
    val words = page.words()
    val tables = page.tables()
}

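Geometry

page.parent(): PdfDocument
page.index(): Int
page.mediaBox(): BBox
page.cropBox(): BBox
page.width(): Double
page.height(): Double
page.rotation(): Int

The parent document, the zero-based index, the MediaBox / CropBox rectangles, the dimensions in PDF points, and the page rotation in degrees.

Content Extraction

page.text(): String

Extracts all text on the page.

page.text(region: BBox): String

Extracts the text inside a bounding-box region.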
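A region can be derived from the page's own geometry. The sketch below reads a band along the top edge of a page, for example to capture a running header. It is a sketch under two assumptions: that BBox follows the default PDF user-space convention (origin at the bottom-left, y increasing upward), and that the helper name topBandText and the 72-point default are illustrative only.

```kotlin
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox

// Hypothetical helper: text inside a horizontal band along the top edge of a page.
// Assumes y grows upward, so the top edge of the page is mediaBox().y1().
fun topBandText(doc: PdfDocument, pageIndex: Int, bandHeight: Double = 72.0): String {
    val page = doc.page(pageIndex)
    val box = page.mediaBox()
    val region = BBox(box.x0(), box.y1() - bandHeight, box.x1(), box.y1())
    return page.text(region)
}
```

If the library turns out to use a top-left origin, the band would instead run from y0() to y0() + bandHeight.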

page.words(): List<TextWord>
page.lines(): List<TextLine>
page.chars(): List<TextChar>

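Structured text at word, line, and character granularity.

page.images(): List<ExtractedImage>
page.tables(): List<Table>
page.annotations(): List<Annotation>

Extracted images, detected tables, and page annotations.

Pdf

Creates PDFs from source formats, splits them by bookmarks, and serializes them. Implements AutoCloseable.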

import fyi.oxide.pdf.Pdf

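Factory Methods

Pdf.fromMarkdown(markdown: String): Pdf

Creates a PDF from Markdown content.

Pdf.fromHtml(html: String): Pdf

Creates a PDF from HTML content.

Pdf.fromImages(images: List<ByteArray>): Pdf

Creates a multi-page PDF from a list of image byte arrays, one page per image.

Splitting

pdf.planSplitByBookmarks(opts: SplitByBookmarksOptions): List<BookmarkSegment>

Plans a split by outline bookmarks without producing any output. Returns the segments that would be created (title, page range, filename).

pdf.splitByBookmarks(opts: SplitByBookmarksOptions): List<ByteArray>

Splits into multiple PDFs by bookmark level. Returns one byte array per segment.

Pdf.planSplitByBookmarksCount(sourcePdf: ByteArray, level: Int): Int

Static helper: counts how many segments a split by bookmarks at the given level would produce.

Pdf.splitByBookmarksFromBytes(sourcePdf: ByteArray, level: Int): Array<ByteArray>

Static helper: splits the source PDF bytes directly by bookmark level.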
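The static helper returns raw bytes, so writing the segments to disk is left to the caller. A minimal sketch, assuming the segments come back in document order; the helper name splitToDirectory and the part-NNN.pdf naming scheme are hypothetical.

```kotlin
import fyi.oxide.pdf.Pdf
import java.nio.file.Files
import java.nio.file.Path

// Hypothetical helper: split by bookmark level and write each segment into outDir.
// Returns the paths of the files that were written.
fun splitToDirectory(source: ByteArray, level: Int, outDir: Path): List<Path> {
    Files.createDirectories(outDir)
    return Pdf.splitByBookmarksFromBytes(source, level).mapIndexed { i, part ->
        val target = outDir.resolve("part-%03d.pdf".format(i + 1))
        Files.write(target, part)   // Files.write returns the target path
    }
}
```

When segment titles are needed for the filenames, planSplitByBookmarks() exposes them through BookmarkSegment.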

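Saving

pdf.save(): ByteArray

Serializes the PDF to bytes.

pdf.saveTo(out: Path)

Writes the PDF to a file.

val pdf.isOpen: Boolean
pdf.close()

Lifecycle (the Kotlin isOpen property and close()). Prefer use { }.

DocumentEditor

A mutable editor for redaction, form filling, metadata scrubbing, and incremental saves. Implements AutoCloseable. Setter methods return this to support fluent chaining.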

import fyi.oxide.pdf.DocumentEditor

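Factory Methods

DocumentEditor.open(path: Path): DocumentEditor
DocumentEditor.open(path: String): DocumentEditor
DocumentEditor.open(bytes: ByteArray): DocumentEditor

Opens a document for editing from a path or from in-memory bytes.

Form Filling

editor.setFormField(name: String, value: String): DocumentEditor

Sets the value of a text / choice field by its fully qualified name.

editor.setFormField(name: String, checked: Boolean): DocumentEditor

Sets the state of a checkbox / radio field by name.

Redaction

editor.addRedaction(pageIndex: Int, region: BBox): DocumentEditor

Queues a redaction over a rectangular region of a page.

editor.redactionCount(pageIndex: Int): Int
editor.redactionCount(): Int

The number of queued redactions on one page, or across the whole document.

editor.applyRedactionsDestructive(): RedactResult

Permanently applies all queued redactions and removes the underlying content beneath them. Returns a RedactResult with the number applied and the oracle verification status.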

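Metadata

editor.scrubMetadata(): DocumentEditor

Strips document metadata (the Info dictionary and XMP) for privacy.

Saving

editor.save(): ByteArray
editor.saveTo(out: Path)

Serializes the edited document as a full rewrite.

editor.saveIncremental(): ByteArray
editor.saveIncrementalTo(out: Path)

Serializes as an incremental update (appends the changes and preserves the original bytes).

val editor.isOpen: Boolean
editor.close()

Lifecycle. Prefer use { }.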

AutoExtractor

An adaptive extraction pipeline that classifies pages (text layer vs. scanned), applies OCR where needed, and outputs text / Markdown / HTML with confidence scores.

import fyi.oxide.pdf.AutoExtractor

Factory Methods

AutoExtractor.of(doc: PdfDocument): AutoExtractor
AutoExtractor.of(doc: PdfDocument, config: AutoExtractConfig): AutoExtractor

Creates an extractor for a document, optionally with a custom AutoExtractConfig.

AutoExtractor.fast(doc: PdfDocument): AutoExtractor
AutoExtractor.balanced(doc: PdfDocument): AutoExtractor
AutoExtractor.highFidelity(doc: PdfDocument): AutoExtractor

Preset configurations that trade off speed against fidelity.

Extraction

extractor.extractText(): String
extractor.extractTextForPage(pageIndex: Int): String

Plain-text extraction for the whole document or a single page.

extractor.extractDocument(): AutoResult
extractor.extractPage(pageIndex: Int): AutoResult

Full adaptive extraction returning an AutoResult (text, optional Markdown/HTML, reason, confidence, OCR flag, regions).

extractor.extractAutoDocument(): AutoResult
extractor.extractAutoPage(pageIndex: Int): AutoResult

Auto-mode variants of document-level and page-level extraction.

extractor.extractDocumentJson(): String
extractor.extractPageJson(pageIndex: Int): String

Extraction results serialized as a JSON string.

Classification

extractor.classifyDocument(): ClassifyResult
extractor.classifyPage(pageIndex: Int): ClassifyResult

Classifies the document or a single page and returns a ClassifyResult (the class of each page, plus lists of pages that need OCR, contain charts, or are encrypted).

extractor.classifyPageKind(pageIndex: Int): PageClass
extractor.classifyDocumentKinds(): List<PageClass>

Returns the PageClass (TEXT_LAYER / SCANNED / MIXED) of one page or of all pages.

Accessors

extractor.document(): PdfDocument
extractor.config(): AutoExtractConfig

The wrapped document and the configuration in effect.

MarkdownConverter

A stateless, thread-safe converter from a PdfDocument to Markdown or HTML.

import fyi.oxide.pdf.MarkdownConverter

MarkdownConverter.toMarkdown(doc: PdfDocument): String
MarkdownConverter.toMarkdown(doc: PdfDocument, pageIndex: Int): String
MarkdownConverter.toHtml(doc: PdfDocument): String
MarkdownConverter.toHtml(doc: PdfDocument, pageIndex: Int): String

Converts the whole document or a single page to Markdown / HTML.

PdfSigner

Digitally signs and verifies PDFs using a PKCS#12 keystore (PAdES B-B / B-T / B-LT levels).

import fyi.oxide.pdf.PdfSigner

PdfSigner.fromPkcs12(keystore: Path, password: String): PdfSigner
PdfSigner.fromPkcs12(keystoreBytes: ByteArray, password: String): PdfSigner

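Loads a signer from a PKCS#12 keystore on disk or in memory.

signer.sign(pdf: ByteArray, opts: SignOptions): ByteArray

Signs PDF bytes with the given SignOptions (level, reason, location, contact info, TSA URL). Returns the signed PDF.

signer.verify(pdf: ByteArray): Boolean

Verifies every signature in the PDF. Returns true when each signature is cryptographically valid.

PdfSigner.classifyLevel(pdf: ByteArray): SignatureLevel

Static helper: detects the PAdES conformance level of a signed PDF.

PdfValidator

Stateless, thread-safe validation against PDF/A, PDF/X, and PDF/UA conformance levels.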

import fyi.oxide.pdf.PdfValidator

PdfValidator.isPdfA(doc: PdfDocument, level: PdfALevel): Boolean
PdfValidator.isPdfUa(doc: PdfDocument, level: PdfUaLevel): Boolean

Fast boolean conformance checks.

PdfValidator.validatePdfA(doc: PdfDocument, level: PdfALevel): ValidationResult
PdfValidator.validatePdfX(doc: PdfDocument, level: PdfXLevel): ValidationResult
PdfValidator.validatePdfUa(doc: PdfDocument, level: PdfUaLevel): ValidationResult

Full validation returning a ValidationResult (with the list of violations).

PdfPolicy

Global security policy controls that govern which cryptographic algorithms are allowed.

import fyi.oxide.pdf.PdfPolicy

PdfPolicy.current(): PolicyMode
PdfPolicy.set(mode: PolicyMode)
PdfPolicy.compat(): PolicyMode
PdfPolicy.strict(): PolicyMode
PdfPolicy.fipsStrict(): PolicyMode

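Reads or sets the PolicyMode in effect, and returns the built-in compat / strict / FIPS-strict modes.

Kotlin Extensions

The only API surface the Kotlin facade adds: Optional<T> to T? converters, plus a generic orNull() helper. Import them from fyi.oxide.pdf.

fun <T : Any> Optional<T>.orNull(): T?

Generic: an empty Optional becomes null.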
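For readers unfamiliar with the pattern, an extension like this can be written in one line on top of java.util.Optional. The sketch below is a stand-in that shows the idea, not the library's actual source; it uses the hypothetical name orNullSketch so it cannot clash with the real orNull().

```kotlin
import java.util.Optional

// Stand-in for illustration only: an empty Optional becomes null, a present value is unwrapped.
fun <T : Any> Optional<T>.orNullSketch(): T? = orElse(null)

// Typical call-site pattern: supply a fallback with the Elvis operator.
fun describeProducer(producer: Optional<String>): String =
    producer.orNullSketch() ?: "(no producer)"
```

The nullable form composes with ?., ?:, and let, which is why the facade offers it alongside the Optional-returning Java accessors.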

fun PdfDocument.producerOrNull(): String?
fun PdfDocument.creatorOrNull(): String?

The document's /Producer and /Creator, or null if absent.

fun FormField.valueOrNull(): String?
fun FormField.bboxOrNull(): BBox?

The form field's value and widget bounding box, or null if absent.

fun Annotation.contentsOrNull(): String?
fun Annotation.uriOrNull(): String?

The annotation's /Contents and link target URI, or null if absent.

fun AutoResult.markdownOrNull(): String?
fun AutoResult.htmlOrNull(): String?

The Markdown / HTML rendering of an auto-extraction result, or null if none was produced.

fun ValidationViolation.pageIndexOrNull(): Int?

The page index the violation applies to, or null for document-level rules.

Geometry Types

BBox

An axis-aligned bounding box in PDF points.

BBox(x0: Double, y0: Double, x1: Double, y1: Double)

Accessor	Type	Description
`x0()`, `y0()`, `x1()`, `y1()`	`Double`	Corner coordinates
`width()`	`Double`	`x1 - x0`
`height()`	`Double`	`y1 - y0`

Color

An 8-bit RGBA color with the named constants Color.BLACK, Color.WHITE, and Color.TRANSPARENT.

Color(r: Int, g: Int, b: Int, a: Int)
Color(r: Int, g: Int, b: Int)            // a = 255

Accessors: r(): Int, g(): Int, b(): Int, a(): Int.

Point

Point(x: Double, y: Double)

Accessors: x(): Double, y(): Double.

Rect

A rectangle given by position plus size.

Rect(x: Double, y: Double, width: Double, height: Double)

Accessors: x(), y(), width(), height() (all Double), plus toBBox(): BBox.

Text Types

TextChar

A single extracted character.

TextChar(codepoint: Int, bbox: BBox, confidence: Float)

Accessors: codepoint(): Int, bbox(): BBox, confidence(): Float, asString(): String.
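The codepoint is an Int rather than a Char because a JVM Char holds only one UTF-16 unit, and code points above U+FFFF need a surrogate pair. The sketch below shows the conversion in plain Kotlin with hypothetical helper names; asString() presumably does something equivalent, but that is an assumption.

```kotlin
// Converts a Unicode code point to a String; code points above U+FFFF become a surrogate pair.
fun codepointToString(codepoint: Int): String =
    StringBuilder().appendCodePoint(codepoint).toString()

// Rebuilds text from a list of code points, e.g. the codepoint() values of a run of characters.
fun joinCodepoints(codepoints: List<Int>): String =
    codepoints.joinToString(separator = "") { codepointToString(it) }
```

For example, joinCodepoints(listOf(0x50, 0x44, 0x46)) yields "PDF", while a single emoji such as U+1F600 produces a String of length 2.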

TextWord

TextWord(text: String, bbox: BBox, confidence: Float)

Accessors: text(): String, bbox(): BBox, confidence(): Float.

TextLine

TextLine(text: String, bbox: BBox, words: List<TextWord>)

Accessors: text(): String, bbox(): BBox, words(): List<TextWord>.

TextSpan

A run of text with identical styling.

TextSpan(text: String, bbox: BBox, style: TextStyle)

Accessors: text(): String, bbox(): BBox, style(): TextStyle.

TextStyle

TextStyle(font: String?, size: Double, color: Color, bold: Boolean, italic: Boolean)

Accessors: font(): String?, size(): Double, color(): Color, bold(): Boolean, italic(): Boolean.

Table Types

Table

Table(bbox: BBox, rows: Int, cols: Int, cells: List<TableCell>)

Accessors: bbox(): BBox, rows(): Int, cols(): Int, cells(): List<TableCell>.

TableCell

TableCell(text: String, bbox: BBox, row: Int, col: Int, rowSpan: Int, colSpan: Int)

Accessors: text(): String, bbox(): BBox, row(): Int, col(): Int, rowSpan(): Int, colSpan(): Int.

Search Types

SearchMatch

SearchMatch(pageIndex: Int, bbox: BBox, text: String)

Accessors: pageIndex(): Int, bbox(): BBox, text(): String.

SearchResult

SearchResult(query: String, matches: List<SearchMatch>)

Accessors: query(): String, matches(): List<SearchMatch>, count(): Int, isEmpty(): Boolean.

SearchOptions

Immutable options built with a fluent builder. SearchOptions.DEFAULT is the default instance.

SearchOptions.builder()
    .withCaseSensitive(true)
    .withWholeWord(true)
    .withRegex(false)
    .withMaxResults(50)
    .build()

Accessors: caseSensitive(): Boolean, wholeWord(): Boolean, regex(): Boolean, maxResults(): Optional<Int>. Builder methods: withCaseSensitive(Boolean), withWholeWord(Boolean), withRegex(Boolean), withMaxResults(Int) / withMaxResults(Int?), build().

Note: this type is not yet wired into PdfDocument.search(). Use the caseInsensitive/regex/maxResults overload above instead.

Form Types

FormField

FormField(name: String, type: FormFieldType, value: String?, bbox: BBox?, pageIndex: Int)

Accessors: name(): String, type(): FormFieldType, value(): Optional<String>, bbox(): Optional<BBox>, pageIndex(): Int. For null-based access, use valueOrNull() / bboxOrNull().

Annotation Types

Annotation

Annotation(type: AnnotationType, pageIndex: Int, bbox: BBox, contents: String?, uri: String?)

Accessors: type(): AnnotationType, pageIndex(): Int, bbox(): BBox, contents(): Optional<String>, uri(): Optional<String>. For null-based access, use contentsOrNull() / uriOrNull().

Image Types

ExtractedImage

ExtractedImage(bytes: ByteArray, format: ImageFormat, bbox: BBox, width: Int, height: Int)

Accessors: bytes(): ByteArray, format(): ImageFormat, bbox(): BBox, width(): Int, height(): Int.

Auto-Extraction Types

AutoResult

The result of one adaptive extraction.

result.text(): String
result.markdown(): Optional<String>
result.html(): Optional<String>
result.reason(): ExtractReason
result.confidence(): Double
result.ocrUsed(): Boolean
result.regions(): List<RegionResult>
result.pagesNeedingOcr(): List<Int>

For null-based access to the rendered output, use markdownOrNull() / htmlOrNull().

RegionResult

Per-region extraction details for one region inside an AutoResult.

region.pageIndex(): Int
region.bbox(): BBox
region.text(): String
region.reason(): ExtractReason
region.confidence(): Double
region.ocrUsed(): Boolean
region.table(): Optional<Table>

ClassifyResult

result.pages(): List<PageClass>
result.pagesNeedingOcr(): List<Int>
result.pagesWithChart(): List<Int>
result.pagesEncrypted(): List<Int>

AutoExtractConfig

An immutable configuration built with a fluent builder; AutoExtractConfig.DEFAULT is the default. Use toBuilder() to turn an existing configuration back into a builder.

AutoExtractConfig.builder()
    .withMode(ExtractMode.AUTO)
    .withForceOcrPages(listOf(2, 5))
    .withMinOcrConfidence(0.6)
    .withOcrLanguages("eng", "deu")
    .withPasswords("secret")
    .withTopMarginFraction(0.05)
    .withBottomMarginFraction(0.05)
    .withAllowSingleColumnTables(true)
    .withOcrInlineImages(false)
    .withCancelToken("token-id")
    .build()

The accessor for every field returns Optional<...>: mode(), forceOcrPages(), minOcrConfidence(), ocrLanguages(), passwords(), topMarginFraction(), bottomMarginFraction(), allowSingleColumnTables(), ocrInlineImages(), cancelToken(). The builder setters accept both boxed-nullable and primitive overloads (for example withMinOcrConfidence(Double?) and withTopMarginFraction(double)), and offer vararg forms such as withOcrLanguages(vararg String) / withPasswords(vararg String).

Compliance Types

ValidationResult

ValidationResult(valid: Boolean, violations: List<ValidationViolation>)

Accessors: valid(): Boolean, violations(): List<ValidationViolation>.

ValidationViolation

ValidationViolation(ruleId: String, description: String, pageIndex: Int?)

Accessors: ruleId(): String, description(): String, pageIndex(): Optional<Int>. For null-based access, use pageIndexOrNull().

Metadata Types

DocumentInfo

DocumentInfo(/* title, author, subject, keywords, creator, producer, creationDate, modificationDate */)

All accessors return Optional<String>: title(), author(), subject(), keywords(), creator(), producer(), creationDate(), modificationDate().

XmpMetadata

The raw XMP packet. XmpMetadata.EMPTY is the empty instance.

XmpMetadata(xml: String)

Accessors: xml(): String, isEmpty(): Boolean.

Security and Redaction Types

SecurityPolicy

An immutable policy built with a fluent builder.

SecurityPolicy.builder()
    .withMode(PolicyMode.STRICT)
    .allow("algorithm-id")
    .deny("algorithm-id")
    .build()

Accessors: mode(): PolicyMode, additionalAllow(): List<String>, additionalDeny(): List<String>. Builder methods: withMode(PolicyMode), allow(String), deny(String), build().

RedactResult

RedactResult(regionsApplied: Int, oracleVerified: Boolean)

Accessors: regionsApplied(): Int, oracleVerified(): Boolean.

Signature Types

SignOptions

Immutable signing options built with a fluent builder.

SignOptions.builder()
    .withLevel(SignatureLevel.B_T)
    .withReason("Approved")
    .withLocation("HQ")
    .withContactInfo("ops@example.com")
    .withTsaUrl("https://freetsa.org/tsr")
    .build()

Accessors: level(): SignatureLevel, reason(): Optional<String>, location(): Optional<String>, contactInfo(): Optional<String>, tsaUrl(): Optional<String>. Builder methods: withLevel, withReason, withLocation, withContactInfo, withTsaUrl, build().

Split Types

BookmarkSegment

BookmarkSegment(title: String, firstPage: Int, lastPage: Int, filename: String)

Accessors: title(): String, firstPage(): Int, lastPage(): Int, filename(): String.

SplitByBookmarksOptions

Immutable options built with a fluent builder.

SplitByBookmarksOptions.builder()
    .withLevel(1)
    .withFilenamePrefix("chapter-")
    .build()

Accessors: level(): Int, filenamePrefix(): Optional<String>. Builder methods: withLevel(Int), withFilenamePrefix(String?), build().

Enums

Enum	Values
`FormFieldType`	`TEXT`, `CHECKBOX`, `RADIO`, `CHOICE`
`AnnotationType`	`HIGHLIGHT`, `TEXT`, `LINK`, `STAMP`, `UNDERLINE`, `STRIKEOUT`, `SQUIGGLY`, `FREE_TEXT`, `LINE`, `SQUARE`, `CIRCLE`, `FILE_ATTACHMENT`
`ImageFormat`	`JPEG`, `PNG`, `CCITT`, `RAW`
`ExtractMode`	`TEXT_ONLY`, `AUTO`
`ExtractReason`	`OK`, `SCANNED_NO_TEXT_LAYER`, `GLYPH_MAPPING_MISSING`, `ENCRYPTED_NO_EXTRACT_PERMISSION`, `IMAGE_TABLE_NO_STRUCTURE`, `CHART_NOT_TRANSCRIBED`, `OCR_REQUESTED_BUT_UNAVAILABLE`, `OCR_LOW_CONFIDENCE`, `EMPTY`
`PageClass`	`TEXT_LAYER`, `SCANNED`, `MIXED`
`PixelFormat`	`RGBA_8888`, `RGB_888`, `GRAY_8`, `PNG`
`PolicyMode`	`COMPAT`, `STRICT`
`SignatureLevel`	`B_B`, `B_T`, `B_LT`
`PdfALevel`	`A_1B`, `A_1A`, `A_2B`, `A_2A`, `A_2U`, `A_3B`, `A_3A`, `A_3U`, `A_4`, `A_4E`, `A_4F`
`PdfXLevel`	`X_1A_2001`, `X_1A_2003`, `X_3_2002`, `X_3_2003`, `X_4`, `X_4P`, `X_5G`, `X_5N`, `X_5PG`, `X_6`, `X_6P`, `X_6N`
`PdfUaLevel`	`UA_1`, `UA_2` (each exposes `code(): Int`)
`PdfErrorKind`	`PARSE`, `ENCRYPTED`, `PERMISSION`, `IO`, `OCR_UNAVAILABLE`, `SIGNATURE`, `INVALID_STATE`, `UNSUPPORTED`, `OTHER`

Exceptions

Every failure throws PdfException (an unchecked exception) or one of its category-specific subclasses. The kind() accessor returns a PdfErrorKind.

import fyi.oxide.pdf.exception.PdfException

try {
    PdfDocument.open(bytes).use { doc ->
        println(doc.extractText(0))
    }
} catch (e: PdfException) {
    println("PDF error [${e.kind()}]: ${e.message}")
}

PdfException(message: String)
PdfException(kind: PdfErrorKind, message: String)
PdfException(kind: PdfErrorKind, message: String, cause: Throwable)

e.kind(): PdfErrorKind

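Exception	Cause
`PdfParseException`	The PDF is malformed or corrupted
`PdfEncryptedException`	An encrypted document was opened without a valid password
`PdfPermissionException`	The operation is blocked by the document's permissions
`PdfIoException`	An underlying I/O failure
`PdfOcrUnavailableException`	OCR was requested, but the `ocr` feature was not compiled in
`PdfSignatureException`	Signing or signature verification failed
`PdfInvalidStateException`	The operation is invalid for the current handle state
`PdfUnsupportedException`	An unsupported feature or format

Complete Example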

import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.DocumentEditor
import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.geometry.BBox
import fyi.oxide.pdf.producerOrNull

// --- Creation ---
val bytes = Pdf.fromMarkdown("# Report\n\nGenerated by PDF Oxide.").use { it.save() }

// --- Extraction ---
PdfDocument.open(bytes).use { doc ->
    println("Pages: ${doc.pageCount()}")
    println("Producer: ${doc.producerOrNull() ?: "(none)"}")

    val page = doc.page(0)
    println("Words: ${page.words().map { it.text() }}")
    println("Tables: ${page.tables().size}")

    // Case-insensitive search
    val matches = doc.search("Report", true, false, 0)  // caseInsensitive, regex, maxResults (0 = unlimited); positional, since Java methods do not accept named arguments
    matches.forEach { m -> println("p${m.pageIndex()} '${m.text()}' @ ${m.bbox()}") }

    // Adaptive extraction
    val result = AutoExtractor.balanced(doc).extractDocument()
    println("confidence=${result.confidence()} ocr=${result.ocrUsed()}")
}

// --- Editing: redact + fill forms ---
DocumentEditor.open(bytes).use { editor ->
    editor.setFormField("name", "Jane Doe")
        .addRedaction(0, BBox(72.0, 700.0, 272.0, 720.0))
        .scrubMetadata()
    val redaction = editor.applyRedactionsDestructive()
    println("Redacted ${redaction.regionsApplied()} regions")
    val out: ByteArray = editor.save()
}

Other Language Bindings

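PDF Oxide provides native bindings for all major ecosystems: Rust, Python, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next Steps

Types and Enums: all shared types and enums
Page API Reference: consistent page-by-page iteration across bindings
Kotlin Quick Start: tutorial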