What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Prebuilt wheels cover Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).

pip install pdf_oxide

For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or the WASM API Reference. For type details, see Types and Enums.

PdfDocument

The core class for opening, extracting from, editing, and saving PDF files.

from pdf_oxide import PdfDocument

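Constructor

PdfDocument(path: str, password: str | None = None)

Parameter	Type	Description
`path`	`str`	Path to the PDF file
`password`	`str \| None`	Optional password for an encrypted PDF (default: `None`)

Pass password= to open an encrypted PDF in one step. Alternatively, call doc.authenticate(password) after opening.

Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.

Class Methods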

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

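Opens a PDF from in-memory bytes (for example, downloaded from S3 or received over HTTP). Accepts an optional password for encrypted PDFs.

Parameter	Type	Description
`data`	`bytes`	Raw PDF file bytes
`password`	`str \| None`	Optional password for an encrypted PDF (default: `None`)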

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

Methods

General

Method	Return type	Description
`version()`	`tuple[int, int]`	PDF version as `(major, minor)`, for example `(1, 7)`
`authenticate(password)`	`bool`	Authenticate an encrypted PDF with the user or owner password

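Document Info

doc.page_count() -> int

Returns the number of pages in the document.

doc.has_structure_tree() -> bool

Checks whether the document is a Tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticates with a password after opening. Returns True on success.

Text Extraction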

doc.extract_text(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None,
    extract_tables: bool = True
) -> str

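Extracts plain text from a single page. Page indices are 0-based. Optionally crop to a given region, exclude named optional-content layers or ink/separation names, and toggle table reconstruction.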

doc.extract_chars(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None
) -> list[TextChar]

Extracts per-character position information and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]

Extracts text spans with font metadata. Each span is a contiguous run of text with identical styling. For multi-column PDFs, pass reading_order="column_aware".

doc.extract_words(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextWord]

Extracts text grouped into words, with bounding boxes. Returns a list of TextWord objects.

doc.extract_text_lines(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    line_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextLine]

Extracts text grouped into lines. Returns a list of TextLine objects.

doc.extract_page_text(page: int, reading_order: str | None = None) -> dict

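Extracts spans, characters, and page dimensions in a single call. The returned dict contains the keys spans, chars, page_width, page_height, and text. This is more efficient than calling extract_spans() and extract_chars() separately.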
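As a rough illustration of working with that dict, the sketch below estimates the body font size of a page and lists spans set in a larger size. It assumes the spans entry holds objects with the text and font_size attributes documented under TextSpan; dominant_font_size and heading_candidates are hypothetical helper names, not part of pdf_oxide.

```python
from collections import Counter


def dominant_font_size(page_data):
    """Return the font size that covers the most characters, or None."""
    sizes = Counter()
    for span in page_data["spans"]:
        # Weight each size by the amount of text set in it.
        sizes[round(span.font_size, 1)] += len(span.text)
    return sizes.most_common(1)[0][0] if sizes else None


def heading_candidates(page_data):
    """Return the text of spans set larger than the dominant size."""
    body = dominant_font_size(page_data)
    if body is None:
        return []
    return [s.text for s in page_data["spans"] if s.font_size > body]
```

A typical call would be heading_candidates(doc.extract_page_text(0)); the result is only a heuristic and is no substitute for the detect_headings option of the conversion methods.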

doc.page_layout_params(page: int) -> LayoutParams

Computes adaptive layout parameters for a page (word/line gap thresholds, median metrics, column count). See LayoutParams.

doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion

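Creates a cropped-region handle for extracting text, words, lines, tables, images, and paths within bbox. See PdfPageRegion.

Automatic Extraction and Classification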

doc.extract_text_auto(page: int) -> str

Automatically selects the best extraction strategy for the page (native text vs. OCR) and returns plain text.

doc.extract_page_auto(page: int, options_json: str | None = None) -> str

Automatically extracts the page and returns a JSON document; pass a JSON string as options_json to tune the pipeline.

doc.classify_page(page: int) -> str

Classifies a single page (for example "text", "scanned", or "mixed").

doc.classify_document() -> str

Classifies the whole document by sampling pages.

doc.has_text_layer(page: int) -> bool

Checks whether the page already has an extractable native text layer (as opposed to needing OCR).
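When more control is wanted than extract_text_auto() offers, the same routing can be written by hand. The following is a minimal sketch, assuming doc is an open PdfDocument and that the OCR feature is available in the installed build; extract_document_text is a hypothetical helper name.

```python
def extract_document_text(doc):
    """Collect text for every page, using OCR only where no text layer exists."""
    pages = []
    for index in range(doc.page_count()):
        if doc.has_text_layer(index):
            # Native text is present, so ordinary extraction is enough.
            pages.append(doc.extract_text(index))
        else:
            # No text layer: fall back to OCR with the default engine.
            pages.append(doc.extract_text_ocr(index))
    return pages
```

Keeping the decision explicit makes it easy to log which pages went through OCR or to skip them entirely.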

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts a page to plain text according to the layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to Markdown.

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to HTML.

Office Conversion

Method	Return type	Description
`to_docx(path)`	–	Convert the PDF to a Word document file
`to_docx_bytes()`	`bytes`	Convert the PDF to DOCX bytes
`to_pptx(path)`	–	Convert the PDF to a PowerPoint file
`to_pptx_bytes()`	`bytes`	Convert the PDF to PPTX bytes
`to_xlsx(path)`	–	Convert the PDF to an Excel workbook file
`to_xlsx_bytes()`	`bytes`	Convert the PDF to XLSX bytes

Image Extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extracts all images on a page, including images in the content stream and images inside nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extracts raw image bytes from a page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

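Searches for text across all pages. Set max_results=0 for an unlimited number of results. Returns a list of matches with page number, text, and coordinates.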

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

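Searches for text on a single page.

Metadata Editing

Method	Parameters	Description
`set_title(title)`	`str`	Set the document title
`set_author(author)`	`str`	Set the document author
`set_subject(subject)`	`str`	Set the document subject
`set_keywords(keywords)`	`str`	Set the document keywords

Page Rotation

Method	Parameters	Returns	Description
`page_rotation(page)`	`int`	`int`	Get the current rotation (0, 90, 180, 270)
`set_page_rotation(page, degrees)`	`int, int`	–	Set an absolute rotation
`rotate_page(page, degrees)`	`int, int`	–	Add a rotation on top of the current angle
`rotate_all_pages(degrees)`	`int`	–	Rotate all pages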
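The difference between an absolute and a relative rotation comes down to modular arithmetic. The sketch below shows the angle one would expect page_rotation() to report after rotate_page(); it assumes rotations are multiples of 90 degrees and are normalized into the range 0–270, which matches the values listed in the table but is not confirmed for every input. combined_rotation is a hypothetical helper, not a library function.

```python
def combined_rotation(current, delta):
    """Return the angle expected after adding delta to the current rotation."""
    if current % 90 or delta % 90:
        raise ValueError("rotations must be multiples of 90 degrees")
    # Wrap into 0, 90, 180, or 270; negative deltas rotate the other way.
    return (current + delta) % 360
```

For example, a page at 270 degrees rotated by a further 180 ends up at 90 under these assumptions.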

Page Dimensions

Method	Parameters	Returns	Description
`page_media_box(page)`	`int`	`tuple[float, float, float, float]`	Get the MediaBox `(llx, lly, urx, ury)`
`set_page_media_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set the MediaBox
`page_crop_box(page)`	`int`	`tuple \| None`	Get the CropBox (`None` if not set)
`set_page_crop_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set the CropBox
`crop_margins(left, right, top, bottom)`	`float, float, float, float`	–	Crop the margins of all pages

Erase / Whiteout

Method	Parameters	Description
`erase_region(page, llx, lly, urx, ury)`	`int, float, float, float, float`	Erase a rectangular region
`erase_regions(page, rects)`	`int, list[tuple]`	Erase multiple regions
`clear_erase_regions(page)`	`int`	Clear pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

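Gets annotation metadata for a page (type, rect, contents, and so on).

Method	Parameters	Returns	Description
`flatten_page_annotations(page)`	`int`	–	Flatten the annotations on a page
`flatten_all_annotations()`	–	–	Flatten all annotations
`is_page_marked_for_flatten(page)`	`int`	`bool`	Check whether the page is marked for flattening
`unmark_page_for_flatten(page)`	`int`	–	Remove the page's pending-flatten mark

Redaction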

doc.add_redaction(
    page: int,
    rect: tuple[float, float, float, float],
    fill: tuple[float, float, float] | None = None
) -> None

Marks a rectangular region for redaction, with an optional RGB fill color.

doc.redaction_count(page: int) -> int

Returns the number of pending redactions on the page.

doc.apply_redactions_destructive(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True,
    fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None

Destructively applies all redactions, removing the underlying content, and optionally scrubs metadata, JavaScript, and embedded files.

doc.sanitize_document(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True
) -> None

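Sanitizes the document without redacting any regions: strips metadata, JavaScript, and/or embedded files.

Method	Parameters	Returns	Description
`apply_page_redactions(page)`	`int`	–	Apply the redactions on a page
`apply_all_redactions()`	–	–	Apply all pending redactions
`is_page_marked_for_redaction(page)`	`int`	`bool`	Check whether the page is marked for redaction
`unmark_page_for_redaction(page)`	`int`	–	Remove the page's pending-redaction mark

Layers and Inks

Method	Parameters	Returns	Description
`get_layers()`	–	`list[str]`	List optional-content (OCG) layer names
`get_page_inks(page)`	`int`	`list[str]`	List the ink / separation colorant names on a page
`get_page_inks_deep(page)`	`int`	`list[str]`	List inks, including those nested in Form XObjects

Header / Footer Cleanup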

doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int

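Detects and removes repeating headers, footers, or page artifacts across the document. threshold is the cross-page repetition ratio. Returns the number of elements removed.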
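To make the notion of a cross-page repetition ratio concrete, the sketch below computes it over plain lists of lines. It is an illustration of the idea only and is not the detection algorithm pdf_oxide uses, which is not described here; repeated_lines and its input format are assumptions made for the example.

```python
from collections import Counter


def repeated_lines(pages, threshold=0.8):
    """Return lines that occur on at least `threshold` of the pages.

    `pages` is a list with one list of line strings per page.
    """
    if not pages:
        return set()
    counts = Counter()
    for lines in pages:
        # Count each distinct line once per page.
        counts.update(set(lines))
    return {line for line, n in counts.items() if n / len(pages) >= threshold}
```

With threshold=0.8, a running title that appears on four of five pages qualifies, while a line that appears on only two does not.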

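Method	Parameters	Description
`erase_header(page)`	`int`	Erase the header region detected on a page
`edit_header(page)`	`int`	Mark the header region for editing
`erase_footer(page)`	`int`	Erase the footer region detected on a page
`edit_footer(page)`	`int`	Mark the footer region for editing
`erase_artifacts(page)`	`int`	Erase the artifacts detected on a page
`sync_editor_erasures()`	–	Flush pending header/footer/artifact erasures into the editor

Form Fields

doc.get_form_fields() -> list[FormField]

Gets all form fields. See FormField for the properties.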

doc.get_form_field_value(name: str) -> str | bool | list | None

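Gets a form field value by name. Returns a Python type that matches the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Sets a form field value by name.

doc.has_xfa() -> bool

Checks whether the document contains XFA forms.

doc.export_form_data(path: str, format: str = "fdf") -> None

Exports form data to a file. Supported formats: "fdf" and "xfdf".

Method	Parameters	Description
`flatten_forms()`	–	Flatten all form fields into the page content
`flatten_forms_on_page(page)`	`int`	Flatten the forms on the given page

Image Manipulation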

doc.page_images(page: int) -> list[dict]

Gets image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method	Parameters	Description
`reposition_image(page, name, x, y)`	`int, str, float, float`	Move an image
`resize_image(page, name, width, height)`	`int, str, float, float`	Resize an image
`set_image_bounds(page, name, x, y, width, height)`	`int, str, float, float, float, float`	Set the image position and size
`clear_image_modifications(page)`	`int`	Clear pending image modifications
`has_image_modifications(page)`	`int` → `bool`	Check whether there are pending image modifications
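The listing and resizing calls above can be combined to cap image widths while keeping each aspect ratio. This is a minimal sketch, assuming doc is an open PdfDocument and that bounds is ordered [x, y, width, height] as described; shrink_wide_images is a hypothetical helper name.

```python
def shrink_wide_images(doc, page, max_width):
    """Scale down every image on `page` that is wider than `max_width` points."""
    changed = []
    for image in doc.page_images(page):
        x, y, width, height = image["bounds"]
        if width > max_width:
            # Apply the same factor to the height to preserve the aspect ratio.
            scale = max_width / width
            doc.resize_image(page, image["name"], max_width, height * scale)
            changed.append(image["name"])
    return changed
```

The function returns the names of the images it touched, so the caller can decide whether the document needs saving.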

Document Operations

doc.merge_from(source: str | PdfDocument) -> int

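Merges pages from another PDF. Accepts a file path or a PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attaches a file to the PDF.

doc.get_outline() -> list[dict] | None

Gets the document bookmarks / table of contents. Returns None when there is no outline.

doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]

Gets the vector paths (lines, curves, shapes) on a page.

doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]

Gets the axis-aligned rectangles on a page (from filled/stroked paths).

doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]

Gets the straight line segments on a page.

doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]

Detects and extracts tables. Each table is a dict with rows and cells (text + bounding box). Pass table_settings to tune the detection strategy.

doc.extract_structured(page: int) -> str

Extracts a page as a structured JSON document (logical reading order, blocks, and roles).

doc.page_labels() -> list[dict]

Gets the page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Gets the XMP metadata as a dict with fields such as dc_title, dc_creator, and xmp_create_date. Returns None when there is no XMP metadata.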

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extracts text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine, or None to use the default engine.

Page Extraction and Reordering

doc.extract_pages(pages: list[int], output: str) -> None

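Extracts the given page indices into a new PDF file at output.

doc.extract_pages_to_bytes(pages: list[int]) -> bytes

Extracts the given page indices into a new PDF and returns it as bytes.

doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes

Extracts one or more (start, end) page ranges into a new PDF and returns it as bytes.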
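When the pages to keep are known as individual indices, they can be collapsed into (start, end) tuples first. The sketch below assumes that both ends of a range are inclusive, which the description above does not state, so it is worth checking against a small document before relying on it. to_page_ranges is a hypothetical helper name.

```python
def to_page_ranges(indices):
    """Collapse page indices into sorted (start, end) ranges, ends inclusive."""
    ranges = []
    for index in sorted(set(indices)):
        if ranges and index == ranges[-1][1] + 1:
            # Extend the current run by one page.
            ranges[-1] = (ranges[-1][0], index)
        else:
            # Start a new run.
            ranges.append((index, index))
    return ranges
```

The result can then be passed as the ranges argument, for example doc.extract_page_ranges_to_bytes(to_page_ranges([0, 1, 2, 5])).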

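Method	Parameters	Description
`select_pages(pages)`	`list[int]`	Keep only the listed pages, in the given order
`delete_page(index)`	`int`	Delete a single page
`move_page(from_index, to_index)`	`int, int`	Move a page to a new position

Compliance and Validation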

doc.validate_pdf_a(level: str = "1b") -> dict

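Validates against a PDF/A conformance level (for example "1b", "2b", or "3b"). Returns a report dict.

doc.convert_to_pdf_a(level: str = "2b") -> dict

Converts the document to PDF/A and returns a conversion report dict.

doc.validate_pdf_ua() -> dict

Validates against the PDF/UA (accessibility) requirements.

doc.validate_pdf_x(level: str = "1a_2001") -> dict

Validates against a PDF/X (print production) conformance level.

Permissions and Warnings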

doc.permissions() -> dict

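Returns the document's encryption permission flags (print, copy, modify, annotate, and so on).

doc.structured_warnings() -> list

Returns the warnings collected during structured / tagged content extraction.

doc.flatten_warnings() -> list[str]

Returns the warnings collected while flattening forms or annotations.

Signatures and Document Security Store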

doc.signatures() -> list[Signature]

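Returns all digital signatures in the document. See Signature.

doc.signature_count() -> int

Returns the number of digital signatures.

doc.dss() -> Dss | None

Returns the Document Security Store (LTV material) parsed from the document, or None. See Dss.

Page API (v0.3.34)

PdfDocument is iterable and indexable, returning lazy Page objects. See Page.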

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages

DOM Access

doc.page(index: int) -> PdfPage

Gets a DOM-like page handle for element-level editing. See PdfPage.

doc.save_page(page: PdfPage) -> None

Saves a modified PdfPage back into the document.

Rendering

doc.render_page(
    page: int,
    dpi: int | None = None,
    format: str | None = None,
    background: tuple[float, float, float, float] | None = None,
    transparent: bool = False,
    render_annotations: bool | None = None,
    jpeg_quality: int | None = None,
    excluded_layers: list[str] | None = None
) -> bytes

Renders a page to PNG or JPEG bytes, with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.
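Choosing a dpi value is easier with an estimate of the resulting image size. PDF user space is 72 points per inch, so the sketch below scales the MediaBox by dpi / 72. It assumes the renderer uses the MediaBox with no rotation or CropBox applied and rounds to the nearest pixel; the real output may differ by a pixel. pixel_size is a hypothetical helper name.

```python
def pixel_size(media_box, dpi):
    """Estimate the rendered (width, height) in pixels for a MediaBox."""
    llx, lly, urx, ury = media_box
    # 72 PDF points make up one inch.
    scale = dpi / 72.0
    return round((urx - llx) * scale), round((ury - lly) * scale)
```

For instance, pixel_size(doc.page_media_box(0), 144) gives the size to expect from doc.render_page(0, dpi=144) under these assumptions.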

doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap

Renders a page to a raw RGBA RenderedPixmap (a named tuple with width, height, and data).

doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]

Renders per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).

doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate

Renders the separation plate for a single named ink.

Method	Return type	Description
`render_page_fit(page, fit_width, fit_height, format=0)`	`bytes`	Render the page scaled to fit a pixel box
`flatten_to_images(dpi=150)`	`bytes`	Flatten all pages into an image-based PDF

Saving

doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None

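Saves the PDF to a file. Stream compression, garbage collection of unused objects, and linearization (fast web view) can each be toggled.

doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes

Serializes the PDF to bytes with the same options as save().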

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Saves with AES-256 password protection and permission controls. If owner_password is None, the user password is used.
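A common case is a file that can be opened and printed but not copied or changed. The sketch below wraps save_encrypted() with that permission set. It assumes doc is an open PdfDocument and that the binding accepts the parameters as keyword arguments, as the signature suggests; save_read_only is a hypothetical helper name.

```python
def save_read_only(doc, path, user_password, owner_password):
    """Save an encrypted copy that allows printing but nothing else."""
    doc.save_encrypted(
        path,
        user_password,
        owner_password=owner_password,
        allow_print=True,
        allow_copy=False,
        allow_modify=False,
        allow_annotate=False,
    )
```

Passing a separate owner password matters here: if the owner password falls back to the user password, anyone who can open the file also holds the credentials that lift the restrictions.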

doc.to_bytes_encrypted(
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> bytes

Serializes an AES-256 encrypted PDF to bytes.

Page

A lazy page handle returned by doc[i] or by iterating over a PdfDocument. All properties are computed on access and forwarded to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property	Type	Description
`index`	`int`	0-based page index
`width`, `height`	`float`	Page dimensions in PDF points
`bbox`	`tuple[float, 4]`	`(llx, lly, urx, ury)`
`text`	`str`	Extracted plain text
`chars`, `words`, `lines`, `spans`	`list[...]`	Structured text
`tables`	`list[dict]`	Tables with rows + cells (text + bounding box)
`images`, `paths`, `annotations`	`list[...]`	Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
            render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

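Lazy page objects are also available via doc.pages() (an iterator equivalent to iterating over the document directly).

PdfPage

A DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page().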

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property	Type	Description
`index`	`int`	0-based page index
`width`	`float`	Page width in PDF points
`height`	`float`	Page height in PDF points

Methods

page.children() -> list[PdfElement]

Gets all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Finds all text elements that contain the given substring.

page.find_images() -> list[PdfImage]

Finds all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Gets a specific element by ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replaces the text content of the element identified by the PdfTextId.

page.annotations() -> list[PdfAnnotation]

Gets all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Adds a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Adds a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Adds a sticky-note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Removes an annotation by index. Returns True if it was removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Adds new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Removes an element by ID. Returns True if it was removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")

Pdf

A unified class for creating PDFs from a variety of source formats.

from pdf_oxide import Pdf

Factory Methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from plain text.

Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF by rendering Markdown through a named CSS/layout template.

Pdf.from_image(path: str) -> Pdf

Creates a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Opens an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, over HTTP, or from a database.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

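Creates a multi-page PDF from several image files, one page per image.

Pdf.from_image_bytes(data: bytes) -> Pdf

Creates a single-page PDF from image bytes.

Pdf.merge(paths: list[str]) -> Pdf

Merges several PDF files (by path) into a single Pdf.

Methods

pdf.save(path: str) -> None

Saves the PDF to a file.

pdf.to_bytes() -> bytes

Gets the PDF content as bytes.

len(pdf) -> int

Gets the size of the PDF in bytes (via __len__).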

PdfText

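Represents a text element on a page. Returned by PdfPage.find_text_containing().

Property	Type	Description
`id`	`PdfTextId`	Unique element identifier
`value`	`str`	Text content
`text`	`str`	Text content (alias of `value`)
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the text is bold
`is_italic`	`bool`	Whether the text is italic

Methods

Method	Parameters	Returns	Description
`contains(needle)`	`str`	`bool`	Check whether the text contains a substring
`starts_with(prefix)`	`str`	`bool`	Check whether the text starts with a prefix
`ends_with(suffix)`	`str`	`bool`	Check whether the text ends with a suffix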

PdfImage

Represents an image element on a page. Returned by PdfPage.find_images().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`width`	`int`	Image width (pixels)
`height`	`int`	Image height (pixels)
`aspect_ratio`	`float`	Width / height ratio

PdfAnnotation

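Represents an annotation on a page. Returned by PdfPage.annotations().

Property	Type	Description
`subtype`	`str`	Annotation type (for example `"Link"`, `"Highlight"`, `"Text"`)
`rect`	`tuple[float, float, float, float]`	Position `(x0, y0, x1, y1)`
`contents`	`str \| None`	Text contents, if any
`color`	`tuple[float, float, float] \| None`	RGB color, if any
`is_modified`	`bool`	Whether the annotation has been modified
`is_new`	`bool`	Whether the annotation was newly added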

PdfElement

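A generic element wrapper. Returned by PdfPage.children().

Method	Returns	Description
`is_text()`	`bool`	Check whether the element is text
`is_image()`	`bool`	Check whether the element is an image
`is_path()`	`bool`	Check whether the element is a vector path
`is_table()`	`bool`	Check whether the element is a table
`is_structure()`	`bool`	Check whether the element is a structure element
`as_text()`	`PdfText \| None`	The element as a `PdfText`, or `None` if it is not text
`as_image()`	`PdfImage \| None`	The element as a `PdfImage`, or `None` if it is not an image

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box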

TextChar

Represents a single character with position information and font metadata. Returned by PdfDocument.extract_chars().

from pdf_oxide import TextChar  # or access via PdfDocument

Property	Type	Description
`char`	`str`	Unicode character
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`font_weight`	`str`	Font weight (`"thin"`, `"light"`, `"normal"`, `"medium"`, `"semi-bold"`, `"bold"`, `"extra-bold"`, `"black"`)
`is_italic`	`bool`	Whether the character is italic
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0
`rotation_degrees`	`float`	Character rotation (degrees)
`origin_x`	`float`	X coordinate of the text origin
`origin_y`	`float`	Y coordinate of the text origin
`advance_width`	`float`	Glyph advance width
`mcid`	`int \| None`	Marked-content ID, if any

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")

TextSpan

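Represents a run of text that shares the same font and style. Returned by PdfDocument.extract_spans().

Property	Type	Description
`text`	`str`	Text content
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the span is bold
`is_italic`	`bool`	Whether the span is italic
`is_monospace`	`bool`	Whether the font is monospaced (Courier, Consolas, and so on)
`char_widths`	`list[float]`	Per-glyph advance widths, for precise bounding boxes
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0

Example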

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

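Image Extraction

extract_images() returns ImageInfo objects with image metadata. For raw image data that is easy to write to disk, use extract_image_bytes().

extract_image_bytes() Return Format

Each dict returned by extract_image_bytes() has the following keys:

Key	Type	Description
`width`	`int`	Image width (pixels)
`height`	`int`	Image height (pixels)
`data`	`bytes`	Raw image data
`format`	`str`	Image format (for example `"png"`, `"jpeg"`)

Example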

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a single text search match. Returned by search() and search_page().

Property	Type	Description
`page`	`int`	0-based page index
`text`	`str`	Matched text
`x`	`float`	X coordinate (PDF points)
`y`	`float`	Y coordinate (PDF points)
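Because each match carries its page index, results from a document-wide search are easy to bucket per page. A minimal sketch, assuming results is the list returned by doc.search() and using only the four attributes listed above; hits_by_page is a hypothetical helper name.

```python
from collections import defaultdict


def hits_by_page(results):
    """Group search matches as {page_index: [(text, x, y), ...]}."""
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit.page].append((hit.text, hit.x, hit.y))
    # Return a plain dict so missing pages raise KeyError instead of growing it.
    return dict(grouped)
```

For example, hits_by_page(doc.search("invoice", case_insensitive=True)) yields one entry for every page that contains a match.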

FormField

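Represents a form field. Returned by PdfDocument.get_form_fields().

Property	Type	Description
`name`	`str`	Fully qualified field name
`field_type`	`str`	Field type: `"text"`, `"button"`, `"choice"`, `"signature"`, or `"unknown"`
`value`	`str \| bool \| …`	Field value
`tooltip`	`str \| None`	Tooltip text, if any
`bounds`	`tuple[float, float, float, float] \| None`	Field bounds, if any
`flags`	`int \| None`	Field flags, if any
`max_length`	`int \| None`	Maximum length, if any
`is_readonly`	`bool`	Whether the field is read-only
`is_required`	`bool`	Whether the field is required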

TextWord

A run of text grouped as a word. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().

Property	Type	Description
`text`	`str`	Word text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the word is bold
`is_italic`	`bool`	Whether the word is italic
`chars`	`list[TextChar]`	The characters that make up the word

TextLine

A run of text grouped as a line. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().

Property	Type	Description
`text`	`str`	Line text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`words`	`list[TextWord]`	The words in the line
`chars`	`list[TextChar]`	The characters in the line

PdfPageRegion

A cropped region of a page. Returned by PdfDocument.within() and PdfPage.region().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Region bounds

Methods

region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list

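Extraction methods scoped to the region's bounding box.

LayoutParams

Adaptive layout parameters computed for a page. Returned by PdfDocument.page_layout_params().

Property	Type	Description
`word_gap_threshold`	`float`	Word gap threshold (points)
`line_gap_threshold`	`float`	Line gap threshold (points)
`median_char_width`	`float`	Median character width
`median_font_size`	`float`	Median font size
`median_line_spacing`	`float`	Median line spacing
`column_count`	`int`	Number of text columns detected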

ExtractionProfile

A tunable text extraction profile passed to extract_words() / extract_text_lines().

from pdf_oxide import ExtractionProfile

Static Constructors

ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str]   # names of all built-in profiles

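Properties

Property	Type	Description
`name`	`str`	Profile name
`tj_offset_threshold`	`float`	TJ array offset threshold for word splitting
`word_margin_ratio`	`float`	Word margin ratio
`space_threshold_em_ratio`	`float`	Space width threshold (em ratio)
`space_char_multiplier`	`float`	Space character multiplier
`use_adaptive_threshold`	`bool`	Whether adaptive thresholds are enabled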

OfficeConverter

Converts Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

OfficeConverter()   # instances are stateless; the conversion methods are also usable as static methods

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Converts a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Converts Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Converts an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Converts Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Converts a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Converts PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Detects the format automatically and converts any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")

Graphics Classes

The following classes are available for advanced PDF creation with graphics:

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
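Several methods in this reference take colors as plain (r, g, b) float tuples in the 0.0–1.0 range rather than Color objects, for example the fill of add_redaction() and the color of add_highlight(). The sketch below converts a hex string to such a tuple in pure Python. It is an illustration and does not claim to reproduce what Color.from_hex() does internally; hex_to_rgb is a hypothetical helper name.

```python
def hex_to_rgb(value):
    """Convert '#rrggbb' (or 'rrggbb') to an (r, g, b) tuple of 0.0-1.0 floats."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected a color like '#rrggbb'")
    # Each pair of hex digits is one 8-bit channel.
    return tuple(int(digits[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
```

The result can be passed wherever a tuple[float, float, float] color is expected.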

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR Classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

engine = OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

config = OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

DocumentBuilder

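A fluent builder that assembles a PDF page by page. See the examples below and Creating from Scratch.

from pdf_oxide import DocumentBuilder

Document-Level Methods

Method	Parameters	Description
`DocumentBuilder()`	–	Construct a new builder
`title(title)`	`str`	Set the document title
`author(author)`	`str`	Set the document author
`subject(subject)`	`str`	Set the document subject
`keywords(keywords)`	`str`	Set the document keywords
`creator(creator)`	`str`	Set the name of the creating application
`on_open(script)`	`str`	Set the document-level JavaScript action run on open
`tagged_pdf_ua1()`	–	Emit a Tagged PDF/UA-1 accessible document
`language(lang)`	`str`	Set the document language (for example `"en-US"`)
`role_map(custom, standard)`	`str, str`	Map a custom structure tag to a standard tag
`register_embedded_font(name, font)`	`str, EmbeddedFont`	Register a font (consumes the `EmbeddedFont`)

Page Factories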

builder.a4_page() -> FluentPageBuilder       # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder   # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder

Output

builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes

FluentPageBuilder

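Buffers page-level operations until done(). Returned by DocumentBuilder.a4_page() / letter_page() / page(). Every method returns self for chaining; done() commits the page and returns the parent DocumentBuilder.

Text and Layout

Method	Parameters	Description
`font(name, size)`	`str, float`	Set the current font and size
`at(x, y)`	`float, float`	Move the cursor to an absolute position
`text(text)`	`str`	Draw text at the cursor
`heading(level, text)`	`int, str`	Draw a heading (levels 1–6)
`paragraph(text)`	`str`	Draw a paragraph with automatic line wrapping
`space(points)`	`float`	Advance by vertical space
`horizontal_rule()`	–	Draw a horizontal rule
`columns(column_count, gap_pt, text)`	`int, float, str`	Lay out text in balanced columns
`footnote(ref_mark, note_text)`	`str, str`	Inline reference mark + footnote at the bottom of the page
`new_page_same_size()`	–	Start a new page with the same dimensions
`measure(text) -> float`	`str`	Measure the width of the rendered text (points)
`remaining_space() -> float`	–	Vertical space remaining on the page

Inline Text Runs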

page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()

Links and Actions

page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)

Markup Annotations

page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)

AcroForm Widgets

page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)

Graphics

page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)

Images and barcodes

page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)

barcode_type: 0 = Code128, 1 = Code39, 2 = EAN13, 3 = EAN8, 4 = UPCA, 5 = ITF, 6 = Code93, 7 = Codabar.
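Bare integer codes are easy to mix up, so it can help to name them. The enum below is a hypothetical convenience that simply mirrors the codes listed above; it is not exported by pdf_oxide. Because IntEnum members are integers, they should be usable wherever barcode_type expects an int.

```python
from enum import IntEnum


class BarcodeType(IntEnum):
    """Hypothetical helper mirroring the documented barcode_type codes."""
    CODE128 = 0
    CODE39 = 1
    EAN13 = 2
    EAN8 = 3
    UPCA = 4
    ITF = 5
    CODE93 = 6
    CODABAR = 7


def barcode_code(name: str) -> int:
    """Resolve a symbology name such as "EAN-13" to its integer code."""
    # Drop separators and case so "Code 128", "code128" and "CODE-128" all match.
    key = "".join(ch for ch in name if ch.isalnum()).upper()
    try:
        return int(BarcodeType[key])
    except KeyError:
        raise ValueError(f"unknown barcode type: {name!r}") from None
```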

Tables

page.table(table: Table)
page.streaming_table(
    columns: list[Column],
    repeat_header: bool = False,
    mode: str = "fixed",
    sample_rows: int = 50,
    min_col_width_pt: float = 20.0,
    max_col_width_pt: float = 400.0,
    max_rowspan: int = 1,
    batch_size: int = 256
) -> StreamingTable

Commit

page.done() -> DocumentBuilder

EmbeddedFont

A TTF/OTF font registered with DocumentBuilder.

from pdf_oxide import EmbeddedFont

EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont

Attribute	Type	Description
`name`	`str`	Name the font was registered under

Tables

Value objects for the streaming table API.

Align

from pdf_oxide import Align

Align.LEFT     # 0
Align.CENTER   # 1
Align.RIGHT    # 2

Column

from pdf_oxide import Column

Column(header: str, width: float = 100.0, align: Align | int | None = None)

Attribute	Type	Description
`header`	`str`	Column header text
`width`	`float`	Column width, in points
`align`	`int`	Cell alignment

Table

from pdf_oxide import Table

Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)

A buffered table consumed by FluentPageBuilder.table(). When has_header=True, the column headers are rendered as a styled header row.
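Table expects every cell as a string, while application data often arrives as dictionaries. A minimal sketch of that conversion follows; records_to_rows is a hypothetical helper, and the choice to render missing or None values as empty strings is an assumption of this example.

```python
from typing import Iterable, List, Mapping, Sequence, Tuple


def records_to_rows(records: Iterable[Mapping[str, object]],
                    fields: Sequence[str]) -> Tuple[List[str], List[List[str]]]:
    """Turn dict records into (headers, rows) made only of strings.

    Missing keys and None values become empty strings.
    """
    headers = list(fields)
    rows = []
    for record in records:
        row = []
        for field in headers:
            value = record.get(field)
            row.append("" if value is None else str(value))
        rows.append(row)
    return headers, rows
```

The output maps onto the constructors above: Table([Column(h) for h in headers], rows, has_header=True).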

StreamingTable

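A row-by-row streaming table handle returned by FluentPageBuilder.streaming_table(), for very large tables that cannot be materialized all at once.

Method	Parameters	Description
`push_row(cells)`	`list[str]`	Append a row of cell strings
`push_row_span(cells)`	`list[tuple[str, int]]`	Append a row of `(text, colspan)` cells
`flush()`	–	Flush the current batch
`finish()`	–	Finish the table and return the `FluentPageBuilder`
`column_count()`	– → `int`	Number of columns
`pending_row_count()`	– → `int`	Number of rows buffered but not yet committed
`batch_count()`	– → `int`	Number of completed batches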
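A typical use of the handle is to feed it from a generator so the full row set never sits in memory. The sketch below relies only on the push_row(), column_count() and finish() methods listed above. feed_rows is a hypothetical helper, and rejecting rows whose length differs from column_count() is a design choice of this example, not documented library behavior.

```python
from typing import Any, Iterable, Sequence


def feed_rows(table: Any, rows: Iterable[Sequence[object]]) -> Any:
    """Push rows into a StreamingTable-like object, then finish it.

    Hypothetical helper: `table` only needs push_row(), column_count()
    and finish().
    """
    width = table.column_count()
    for number, row in enumerate(rows, start=1):
        # push_row() takes strings, so stringify and blank out None values.
        cells = ["" if cell is None else str(cell) for cell in row]
        if len(cells) != width:
            raise ValueError(
                f"row {number}: expected {width} cells, got {len(cells)}")
        table.push_row(cells)
    # finish() ends the table and hands back the page builder for chaining.
    return table.finish()
```

Because rows is consumed lazily, it can be a database cursor or a csv.reader over a large file.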

Page templates

Repeating header and footer artifacts applied across pages.

Artifact / ArtifactStyle

from pdf_oxide import Artifact, ArtifactStyle

Artifact()                       # empty artifact
Artifact.center(text: str)       # centered artifact text
artifact.with_left(text: str)    # add left-aligned text

style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()

Header / Footer

from pdf_oxide import Header, Footer

Header()                  # or Header.center(text: str)
Footer()                  # or Footer.center(text: str)

PageTemplate

from pdf_oxide import PageTemplate, Header, Footer

template = (PageTemplate()
    .header(Header.center("Confidential"))
    .footer(Footer.center("Page")))

Digital signatures

Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures feature (and optionally tsa-client) in the Rust build.

Certificate

from pdf_oxide import Certificate

Certificate.load(data: bytes) -> Certificate                       # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate   # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential

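Method	Returns	Description
`subject()`	`str`	Certificate subject DN
`issuer()`	`str`	Certificate issuer DN
`serial()`	`str`	Serial number
`validity()`	`tuple[int, int]`	`(not_before, not_after)` as Unix timestamps
`is_valid()`	`bool`	Whether the certificate is currently within its validity window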
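validity() returns raw Unix timestamps, which are awkward to read or compare directly. The sketch below converts them with the standard library only. Both function names are hypothetical, and it assumes the timestamps are seconds since the epoch in UTC.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple


def validity_window(validity: Tuple[int, int]) -> Tuple[datetime, datetime]:
    """Convert (not_before, not_after) Unix timestamps to aware UTC datetimes."""
    not_before, not_after = validity
    return (datetime.fromtimestamp(not_before, tz=timezone.utc),
            datetime.fromtimestamp(not_after, tz=timezone.utc))


def days_until_expiry(validity: Tuple[int, int],
                      now: Optional[datetime] = None) -> float:
    """Days left before not_after; negative once the certificate has expired."""
    _, not_after = validity_window(validity)
    current = now if now is not None else datetime.now(timezone.utc)
    return (not_after - current).total_seconds() / 86400.0
```

For example, days_until_expiry(cert.validity()) < 30 is a simple trigger for a renewal warning.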

Signature

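Returned by PdfDocument.signatures().

Attribute / method	Type	Description
`signer_name`	`str | None`	Signer name
`reason`	`str | None`	Signing reason
`location`	`str | None`	Signing location
`contact_info`	`str | None`	Signer contact information
`signing_time`	`int | None`	Signing time
`covers_whole_document`	`bool`	Whether the signature covers the entire file
`pades_level`	`PadesLevel`	Detected PAdES baseline (B-B/B-T/B-LT)
`verify()`	`bool`	Cryptographically verify the signature
`verify_detached(pdf_data)`	`bool`	Verify and also check `messageDigest` against the file bytes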

Timestamp

from pdf_oxide import Timestamp

Timestamp.parse(data: bytes) -> Timestamp

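Attribute / method	Type	Description
`time`	`int`	Timestamp time (Unix)
`serial`	`str`	TSA response serial number
`policy_oid`	`str`	TSA policy OID
`tsa_name`	`str`	TSA name
`hash_algorithm`	`int`	Hash algorithm code of the message imprint
`message_imprint`	`bytes`	The hashed message imprint
`verify()`	`bool`	Verify the timestamp token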

TsaClient

from pdf_oxide import TsaClient

client = TsaClient(
    url: str,
    username: str | None = None,
    password: str | None = None,
    timeout_seconds: int = 30,
    hash_algorithm: int = 2,
    use_nonce: bool = True,
    cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp

PadesLevel

from pdf_oxide import PadesLevel

PadesLevel.B_B     # baseline
PadesLevel.B_T     # + trusted timestamp
PadesLevel.B_LT    # + long-term validation material
PadesLevel.B_LTA   # + archival timestamp

RevocationMaterial

from pdf_oxide import RevocationMaterial

RevocationMaterial(
    certs: list[bytes] | None = None,
    crls: list[bytes] | None = None,
    ocsps: list[bytes] | None = None
)

DER-encoded certificates, CRLs, and OCSP responses for B-LT signatures.

Dss

The parsed Document Security Store, returned by PdfDocument.dss().

Attribute	Type	Description
`certs`	`list[bytes]`	Document-level certificate DER blobs
`crls`	`list[bytes]`	CRL DER blobs
`ocsps`	`list[bytes]`	OCSP response DER blobs
`vri`	`list[str]`	Per-signature VRI keys (hex SHA-1 of `/Contents`)
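Since each VRI key is described as the hex SHA-1 of a signature's /Contents, the key for a given signature can be recomputed with hashlib and looked up in vri. This is a sketch under assumptions: both function names are hypothetical, the comparison ignores letter case because the casing of the returned keys is not stated here, and whether any padding in /Contents is part of the hashed bytes depends on the producer.

```python
import hashlib
from typing import Iterable


def vri_key(signature_contents: bytes) -> str:
    """Hex SHA-1 digest of a signature's /Contents bytes, in upper case."""
    return hashlib.sha1(signature_contents).hexdigest().upper()


def has_vri_entry(vri_keys: Iterable[str], signature_contents: bytes) -> bool:
    """True if the keys include an entry for this signature (case-insensitive)."""
    wanted = vri_key(signature_contents)
    return any(key.upper() == wanted for key in vri_keys)
```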

Module-level functions

from pdf_oxide import (
    sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
    generate_barcode_svg, generate_qr_svg,
    plan_split_by_bookmarks, split_by_bookmarks,
)

Signing

sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes

Signs raw PDF bytes with a loaded signing Certificate and returns the signed PDF.

sign_pdf_bytes_pades(
    pdf_data: bytes,
    cert: Certificate,
    level: PadesLevel,
    tsa_url: str | None = None,
    reason: str | None = None,
    location: str | None = None,
    revocation: RevocationMaterial | None = None
) -> bytes

Signs raw PDF bytes at a given PAdES baseline level. B_T and B_LT require tsa_url.
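Because a missing tsa_url only matters for some levels, it can be worth checking the arguments before any signing work starts. The wrapper below is a hypothetical sketch: sign_at_level and the string level names are inventions of this example, while the sign_pdf_bytes_pades call follows the signature shown above. The rule it enforces is only the one stated in the text (B_T and B_LT need a TSA URL).

```python
from typing import Any, Optional

KNOWN_LEVELS = {"B_B", "B_T", "B_LT", "B_LTA"}
# Levels that the text above says need a timestamp authority URL.
LEVELS_REQUIRING_TSA = {"B_T", "B_LT"}


def sign_at_level(pdf_data: bytes, cert: Any, level_name: str,
                  tsa_url: Optional[str] = None) -> bytes:
    """Validate the arguments, then sign at the named PAdES level."""
    if level_name not in KNOWN_LEVELS:
        raise ValueError(f"unknown PAdES level: {level_name!r}")
    if level_name in LEVELS_REQUIRING_TSA and not tsa_url:
        raise ValueError(f"{level_name} requires tsa_url")
    # Imported lazily so the checks above run even where the signatures
    # feature (or pdf_oxide itself) is unavailable.
    from pdf_oxide import PadesLevel, sign_pdf_bytes_pades
    level = getattr(PadesLevel, level_name)
    return sign_pdf_bytes_pades(pdf_data, cert, level, tsa_url=tsa_url)
```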

has_document_timestamp(pdf_data: bytes) -> bool

Returns whether the PDF carries a document-level RFC 3161 archival timestamp (PAdES-B-LTA).

Barcodes

generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str

Generates a 1D barcode or a QR code as an SVG string. Requires the barcodes feature.

Splitting by bookmarks

plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]

Plans or performs a split of a PDF at bookmark boundaries. plan_* returns only the segment metadata; split_* returns each segment paired with its PDF bytes.
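The (metadata, bytes) pairs returned by split_by_bookmarks() usually end up on disk. A minimal sketch of that step follows. write_segments is a hypothetical helper, and the files are numbered rather than named from the metadata because the keys of the metadata dict are not listed here.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def write_segments(segments: Iterable[Tuple[dict, bytes]],
                   out_dir: str) -> List[Path]:
    """Write each segment's PDF bytes to out_dir as segment_001.pdf, and so on."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (_metadata, pdf_bytes) in enumerate(segments, start=1):
        path = target / f"segment_{index:03d}.pdf"
        path.write_bytes(pdf_bytes)
        written.append(path)
    return written
```

Usage would look like write_segments(split_by_bookmarks(src_bytes, level=1), "chapters").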

OCR model prefetching

prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool

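Prefetch OCR models for offline or air-gapped environments, inspect the model manifest (JSON), and check whether the current build can download models.

Logging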

setup_logging() -> None
set_log_level(level: str) -> None     # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None

Engine tuning

set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool

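Adjust the per-stream operator limit (a guard against adversarial input) and the U+FFFD preservation of unmapped glyphs. Both functions return the previous value.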
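Returning the previous value makes it straightforward to change a setting temporarily and put it back afterwards. A small sketch of that pattern as a context manager is shown here. temporary_setting is a hypothetical helper that works with any setter following the same contract.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def temporary_setting(setter: Callable[[Any], Any], value: Any) -> Iterator[Any]:
    """Apply `value` through `setter`, then restore the previous value on exit.

    `setter` must return the value it replaced, as both functions above do.
    """
    previous = setter(value)
    try:
        yield previous
    finally:
        # Restore even if the body raised.
        setter(previous)
```

For example, with temporary_setting(set_max_ops_per_stream, 50_000): wraps the processing of one untrusted file; the limit of 50,000 is an arbitrary illustrative number, not a recommended value.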

Cryptographic governance

crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None                 # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None      # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str                      # CycloneDX 1.6 CBOM (JSON)

Async API

async/await wrappers that run blocking operations in a thread pool. Each method corresponds one-to-one to its synchronous counterpart.

import asyncio

from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter

async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()
    # Or open from bytes and use as an async context manager:
    with open("input.pdf", "rb") as f:
        pdf_bytes = f.read()
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()

asyncio.run(main())

Class	Constructor	Notes
`AsyncPdfDocument`	`await AsyncPdfDocument.open(path, password=None)`, `await AsyncPdfDocument.from_bytes(data, password=None)`	All `PdfDocument` methods are available as awaitables; supports `async with` and `.close()`
`AsyncPdf`	Wraps the `Pdf` factory methods	`await pdf.save(path)`, `await pdf.to_bytes()`
`AsyncOfficeConverter`	Wraps the `OfficeConverter` static methods	For example, `await AsyncOfficeConverter.from_docx(path)`
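To make the "blocking call in a thread pool" idea concrete, here is the general asyncio pattern using only the standard library. It illustrates the technique and is not the pdf_oxide implementation: blocking_extract is a stand-in for any synchronous call, and time.sleep simulates the work.

```python
import asyncio
import time
from typing import List


def blocking_extract(page: int) -> str:
    """Stand-in for a synchronous, blocking call such as text extraction."""
    time.sleep(0.05)  # simulate work that would otherwise stall the event loop
    return f"text of page {page}"


async def extract_all(pages: List[int]) -> List[str]:
    """Run the blocking calls in the default thread pool and await them together."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, blocking_extract, page)
               for page in pages]
    # gather() returns results in the order the futures were passed in.
    return await asyncio.gather(*futures)


if __name__ == "__main__":
    print(asyncio.run(extract_all([0, 1, 2])))
```

The event loop stays responsive while the worker threads run, which is the property the async wrappers above are described as providing.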

Error handling

PdfError

All PDF-specific errors raise PdfError:

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

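Common error scenarios:

Exception	Cause
`PdfError`	Malformed PDF, encrypted PDF opened without a password, parse failure
`FileNotFoundError`	The file does not exist
`IndexError`	Page index is beyond `page_count()`
`ValueError`	Invalid argument (for example, a negative page index)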
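The IndexError and ValueError rows can be avoided altogether by checking the index against page_count() first. A minimal sketch follows; extract_text_or_none is a hypothetical helper that needs only the page_count() and extract_text() methods, and returning None for a bad index is a design choice of this example.

```python
from typing import Any, Optional


def extract_text_or_none(doc: Any, index: int) -> Optional[str]:
    """Return the page text, or None when `index` is not a valid page index."""
    if isinstance(index, bool) or not isinstance(index, int):
        return None
    if index < 0 or index >= doc.page_count():
        return None
    # PdfError from a malformed page is deliberately left to propagate.
    return doc.extract_text(index)
```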

Complete example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

What's new in v0.3.38

`DocumentBuilder` / `FluentPageBuilder` / `EmbeddedFont`

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(size)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext(100, 500, 200, 50, "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 types)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, 0.9, 0.9, 0.9)
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

See Creating from HTML.

Signature verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # datetime | None
    sig.verify()                     # "Valid" | "Invalid" | "Unknown"
    sig.verify_detached(pdf_bytes)   # adds messageDigest check

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

See Digital signatures for details.

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates are in PDF points, measured from the bottom-left corner.
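Regions taken from image or UI coordinates usually have a top-left origin, so they need flipping before being passed to render_page_region(). The arithmetic is sketched below. top_left_to_pdf_region is a hypothetical helper, and it assumes (x, y) denotes the lower-left corner of the region and that page_height is the page height in points (792 for US Letter).

```python
from typing import Tuple


def top_left_to_pdf_region(left: float, top: float, width: float, height: float,
                           page_height: float) -> Tuple[float, float, float, float]:
    """Convert a top-left-origin box to (x, y, w, h) with a bottom-left origin.

    Assumption: y is the distance from the page bottom to the box's lower edge.
    """
    return (left, page_height - top - height, width, height)
```

A box 72 points in from the top-left of a Letter page, 200 wide and 100 tall, becomes (72, 620, 200, 100).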

`Pdf` flattening

doc.flatten_to_images(dpi: int = 150) -> bytes

Other Language Bindings

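PDF Oxide provides native bindings for all major ecosystems: Rust, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next steps

Types and Enums: all shared types and enums
Page API reference: page-by-page iteration that is consistent across bindings
Python quick start: tutorial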