What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Prebuilt wheels are available for Python 3.8 through 3.14 on Linux, macOS, and Windows (x86_64 / ARM64).

pip install pdf_oxide

For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or the WASM API Reference. For type details, see Types and Enums.

PdfDocument

The central class for opening, extracting from, editing, and saving PDF files.

from pdf_oxide import PdfDocument

Constructor

PdfDocument(path: str, password: str | None = None)

Parameter	Type	Description
`path`	`str`	Path to the PDF file
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)

Pass password= to open an encrypted PDF in a single call. Alternatively, call doc.authenticate(password) after opening.

Raises FileNotFoundError if the file does not exist, and PdfError if the file is not a valid PDF.
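As a minimal sketch of the open-then-authenticate flow, the helper below tries a list of candidate passwords against an already opened document. The helper name `unlock` and the candidate list are hypothetical; only `authenticate()` and its `bool` return value come from this reference.

```python
def unlock(doc, candidates):
    """Return the first password accepted by doc.authenticate(), or None."""
    for password in candidates:
        # authenticate() is documented to return True on success.
        if doc.authenticate(password):
            return password
    return None

# Usage, assuming pdf_oxide is installed and "locked.pdf" exists:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("locked.pdf")
#   accepted = unlock(doc, ["user-pw", "owner-pw"])
```

When the password is known up front, passing `password=` to the constructor is the shorter route.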

Class Methods

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

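Opens a PDF from in-memory bytes (for example, data downloaded from S3 or received over HTTP). Accepts an optional password for encrypted PDFs.

Parameter	Type	Description
`data`	`bytes`	Raw bytes of the PDF file
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)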

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

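Methods

General

Method	Return type	Description
`version()`	`tuple[int, int]`	Returns the PDF version as `(major, minor)` (e.g. `(1, 7)`)
`authenticate(password)`	`bool`	Authenticates an encrypted PDF with the user or owner password

Document Info

doc.page_count() -> int

Returns the number of pages in the document.

doc.has_structure_tree() -> bool

Checks whether the document is a tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticates with a password after opening. Returns True if authentication succeeds.

Text Extraction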

doc.extract_text(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None,
    extract_tables: bool = True
) -> str

Extracts plain text from a single page. Page indices are 0-based. Optionally, you can crop to a region, exclude named optional content layers or ink/separation names, and toggle table reconstruction.
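Because `extract_text()` works one page at a time, whole-document extraction is a loop over `page_count()`. The sketch below shows one way to do that; the helper name `extract_document_text` and the form-feed separator are illustrative choices, and `region` is passed through unchanged because its coordinate convention is not stated here.

```python
def extract_document_text(doc, region=None, extract_tables=True):
    """Join the plain text of every page, separated by form feeds."""
    pages = []
    for index in range(doc.page_count()):  # page indices are 0-based
        pages.append(
            doc.extract_text(index, region=region, extract_tables=extract_tables)
        )
    return "\f".join(pages)

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   text = extract_document_text(PdfDocument("paper.pdf"))
```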

doc.extract_chars(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None
) -> list[TextChar]

Extracts per-character position information and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]

Extracts text spans with font metadata. Each span is a run of consecutive text with the same style. For multi-column PDFs, pass reading_order="column_aware".

doc.extract_words(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextWord]

Extracts text grouped into words, with bounding boxes. Returns a list of TextWord objects.

doc.extract_text_lines(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    line_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextLine]

Extracts text grouped into lines. Returns a list of TextLine objects.

doc.extract_page_text(page: int, reading_order: str | None = None) -> dict

Extracts spans, characters, and page dimensions in a single pass. Returns a dict with the keys spans, chars, page_width, page_height, and text. More efficient than calling extract_spans() and extract_chars() separately.

doc.page_layout_params(page: int) -> LayoutParams

Computes adaptive layout parameters for the page (word/line gap thresholds, various medians, column count). See LayoutParams.

doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion

Creates a cropped-region handle for extracting text, words, lines, tables, images, and paths within bbox. See PdfPageRegion.

Auto Extraction and Classification

doc.extract_text_auto(page: int) -> str

Automatically selects the best extraction method for the page (native text or OCR) and returns plain text.

doc.extract_page_auto(page: int, options_json: str | None = None) -> str

Auto-extracts the page and returns a JSON document. Pass a JSON string as options_json to tune the pipeline.

doc.classify_page(page: int) -> str

Classifies a single page (e.g. "text", "scanned", "mixed").

doc.classify_document() -> str

Classifies the whole document by sampling its pages.

doc.has_text_layer(page: int) -> bool

Checks whether the page already has an extractable native text layer (that is, whether OCR is needed).
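The native-versus-OCR decision can also be made by hand with `has_text_layer()`. The sketch below spells that decision out using only documented calls; it is not a description of how `extract_text_auto()` is implemented, the helper name `page_text` is hypothetical, and the OCR branch assumes a build with the ocr feature.

```python
def page_text(doc, page):
    """Use the native text layer when present, otherwise fall back to OCR."""
    if doc.has_text_layer(page):
        return doc.extract_text(page)
    # Leaving engine unset uses the default OCR engine.
    return doc.extract_text_ocr(page)

# Usage, assuming pdf_oxide is installed with OCR support:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("scan.pdf")
#   print(page_text(doc, 0))
```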

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts a page to plain text, with layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to Markdown.

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to HTML.

Office Conversion

Method	Return type	Description
`to_docx(path)`	–	Converts the PDF to a Word document file
`to_docx_bytes()`	`bytes`	Converts the PDF to DOCX bytes
`to_pptx(path)`	–	Converts the PDF to a PowerPoint file
`to_pptx_bytes()`	`bytes`	Converts the PDF to PPTX bytes
`to_xlsx(path)`	–	Converts the PDF to an Excel workbook file
`to_xlsx_bytes()`	`bytes`	Converts the PDF to XLSX bytes

Image Extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extracts all images from a page, including images in content streams and in nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extracts the raw bytes of the images on a page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Searches for text across all pages. Set max_results=0 for unlimited results. Returns a list of matches with page number, text, and coordinates.

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Searches for text on a single page.
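Search results come back as a flat list, so a common follow-up is grouping them by page. The sketch below does that with the documented `page`, `text`, `x`, and `y` attributes of SearchResult; the helper name `hits_by_page` is hypothetical, and keyword options are simply forwarded to `doc.search()`.

```python
from collections import defaultdict

def hits_by_page(doc, pattern, **options):
    """Group search matches as {page_index: [(text, x, y), ...]}."""
    grouped = defaultdict(list)
    for hit in doc.search(pattern, **options):
        grouped[hit.page].append((hit.text, hit.x, hit.y))
    return dict(grouped)

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("contract.pdf")
#   hits_by_page(doc, "liability", case_insensitive=True, whole_word=True)
```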

Metadata Editing

Method	Parameter	Description
`set_title(title)`	`str`	Sets the document title
`set_author(author)`	`str`	Sets the document author
`set_subject(subject)`	`str`	Sets the document subject
`set_keywords(keywords)`	`str`	Sets the document keywords

Page Rotation

Method	Parameters	Returns	Description
`page_rotation(page)`	`int`	`int`	Gets the current rotation (0, 90, 180, 270)
`set_page_rotation(page, degrees)`	`int, int`	–	Sets the absolute rotation
`rotate_page(page, degrees)`	`int, int`	–	Adds to the current rotation
`rotate_all_pages(degrees)`	`int`	–	Rotates all pages
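The difference between the absolute `set_page_rotation()` and the relative `rotate_page()` matters when normalizing a document. As a small sketch, the hypothetical helper below resets every rotated page to 0 degrees using the absolute setter, so the result does not depend on each page's starting rotation.

```python
def reset_rotation(doc):
    """Set every rotated page back to 0 degrees; return the changed indices."""
    changed = []
    for index in range(doc.page_count()):
        if doc.page_rotation(index) != 0:
            # Absolute setter: the outcome is 0 regardless of the old value.
            doc.set_page_rotation(index, 0)
            changed.append(index)
    return changed

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("scan.pdf")
#   reset_rotation(doc)
#   doc.save("scan_upright.pdf")
```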

Page Dimensions

Method	Parameters	Returns	Description
`page_media_box(page)`	`int`	`tuple[float, float, float, float]`	Gets the MediaBox `(llx, lly, urx, ury)`
`set_page_media_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Sets the MediaBox
`page_crop_box(page)`	`int`	`tuple \| None`	Gets the CropBox (`None` if not set)
`set_page_crop_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Sets the CropBox
`crop_margins(left, right, top, bottom)`	`float, float, float, float`	–	Crops margins on all pages

Erase / Whiteout

Method	Parameters	Description
`erase_region(page, llx, lly, urx, ury)`	`int, float, float, float, float`	Erases a rectangular region
`erase_regions(page, rects)`	`int, list[tuple]`	Erases multiple regions
`clear_erase_regions(page)`	`int`	Clears pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

Gets annotation metadata for a page (type, rectangle, contents, and so on).

Method	Parameters	Returns	Description
`flatten_page_annotations(page)`	`int`	–	Flattens the annotations on a page
`flatten_all_annotations()`	–	–	Flattens all annotations
`is_page_marked_for_flatten(page)`	`int`	`bool`	Checks whether the page is marked for flattening
`unmark_page_for_flatten(page)`	`int`	–	Removes the flatten mark from the page

Redaction

doc.add_redaction(
    page: int,
    rect: tuple[float, float, float, float],
    fill: tuple[float, float, float] | None = None
) -> None

Marks a rectangular region for redaction, with an optional RGB fill color.

doc.redaction_count(page: int) -> int

Returns the number of pending redactions on the page.

doc.apply_redactions_destructive(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True,
    fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None

Destructively applies all redactions, removing the underlying content. Optionally also removes metadata, JavaScript, and embedded files.

doc.sanitize_document(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True
) -> None

Sanitizes the document without redacting any regions. Removes metadata, JavaScript, and embedded files.

Method	Parameters	Returns	Description
`apply_page_redactions(page)`	`int`	–	Applies the redactions on a page
`apply_all_redactions()`	–	–	Applies all pending redactions
`is_page_marked_for_redaction(page)`	`int`	`bool`	Checks whether the page is marked for redaction
`unmark_page_for_redaction(page)`	`int`	–	Removes the redaction mark from the page
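Redaction is a two-phase process: regions are marked first and applied afterwards. The sketch below strings the documented calls together; the helper name `redact` is hypothetical, and the rectangles are passed through as given because the coordinate convention of `rect` is not stated in this reference.

```python
def redact(doc, page, rects, fill=(0.0, 0.0, 0.0)):
    """Mark rects on one page, apply destructively, return the marked count."""
    for rect in rects:
        doc.add_redaction(page, rect, fill)
    pending = doc.redaction_count(page)
    # With its default arguments this call also scrubs metadata,
    # JavaScript, and embedded files from the document.
    doc.apply_redactions_destructive()
    return pending

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("report.pdf")
#   redact(doc, 0, [(72.0, 700.0, 300.0, 720.0)])
#   doc.save("report_redacted.pdf")
```

The change only reaches disk once the document is saved.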

Layers and Inks

Method	Parameters	Returns	Description
`get_layers()`	–	`list[str]`	Lists optional content (OCG) layer names
`get_page_inks(page)`	`int`	`list[str]`	Lists ink/separation colorant names on the page
`get_page_inks_deep(page)`	`int`	`list[str]`	Lists inks, including those nested in Form XObjects

Header / Footer Cleanup

doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int

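Detects and removes headers, footers, and page artifacts that repeat across the document. threshold is the repetition rate across pages. Returns the number of elements removed.

Method	Parameters	Description
`erase_header(page)`	`int`	Erases the header region detected on the page
`edit_header(page)`	`int`	Marks the header region for editing
`erase_footer(page)`	`int`	Erases the footer region detected on the page
`edit_footer(page)`	`int`	Marks the footer region for editing
`erase_artifacts(page)`	`int`	Erases the artifacts detected on the page
`sync_editor_erasures()`	–	Propagates pending header/footer/artifact erasures to the editor

Form Fields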

doc.get_form_fields() -> list[FormField]

Gets all form fields. See FormField for the properties.

doc.get_form_field_value(name: str) -> str | bool | list | None

Gets the value of a form field by name. Returns the Python type appropriate to the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Sets the value of a form field by name.

doc.has_xfa() -> bool

Checks whether the document contains an XFA form.

doc.export_form_data(path: str, format: str = "fdf") -> None

Exports form data to a file. Supported formats are "fdf" and "xfdf".

Method	Parameters	Description
`flatten_forms()`	–	Flattens all form fields into page content
`flatten_forms_on_page(page)`	`int`	Flattens the forms on the given page
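A typical form-filling pass reads the field names first and then writes values. The sketch below is one way to do that with `get_form_fields()` and `set_form_field_value()`; the helper name `fill_form` and the policy of skipping unknown names are assumptions made for illustration, since this reference does not say what happens when a name does not exist.

```python
def fill_form(doc, values):
    """Set known fields from a {name: value} dict; return the skipped names."""
    known = {field.name for field in doc.get_form_fields()}
    skipped = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
        else:
            skipped.append(name)
    return skipped

# Usage, assuming pdf_oxide is installed (field names are hypothetical):
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("form.pdf")
#   fill_form(doc, {"applicant.name": "Ada", "agree": True})
#   doc.save("form_filled.pdf")
```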

Image Manipulation

doc.page_images(page: int) -> list[dict]

Gets image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method	Parameters	Description
`reposition_image(page, name, x, y)`	`int, str, float, float`	Moves an image
`resize_image(page, name, width, height)`	`int, str, float, float`	Resizes an image
`set_image_bounds(page, name, x, y, width, height)`	`int, str, float, float, float, float`	Sets an image's position and size
`clear_image_modifications(page)`	`int`	Clears pending image modifications
`has_image_modifications(page)`	`int` → `bool`	Checks whether there are pending image modifications

Document Operations

doc.merge_from(source: str | PdfDocument) -> int

Merges pages from another PDF. Accepts a file path or a PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attaches a file to the PDF.

doc.get_outline() -> list[dict] | None

Gets the document's bookmarks / table of contents. Returns None if no outline exists.

doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]

Gets vector paths (lines, curves, shapes) from a page.

doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]

Gets the axis-aligned rectangles on a page (from filled/stroked paths).

doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]

Gets the straight line segments on a page.

doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]

Detects and extracts tables. Each table is a dict with rows and cells (text + bounding box). Pass table_settings to tune the detection method.

doc.extract_structured(page: int) -> str

Extracts a page as a structured JSON document (logical reading order, blocks, roles).

doc.page_labels() -> list[dict]

Gets page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Gets XMP metadata as a dictionary with fields such as dc_title, dc_creator, and xmp_create_date. Returns None if no XMP metadata exists.

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extracts text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine, or None to use the default engine.

Page Extraction and Reordering

doc.extract_pages(pages: list[int], output: str) -> None

Extracts the specified page indices into a new PDF file at output.

doc.extract_pages_to_bytes(pages: list[int]) -> bytes

Extracts the specified page indices as a new PDF and returns it as bytes.

doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes

Extracts one or more (start, end) page ranges as a new PDF and returns it as bytes.
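One use of the in-memory extractors is splitting a long document into fixed-size parts. The sketch below uses `extract_pages_to_bytes()` with explicit index lists, which sidesteps the question of whether the `end` of a `(start, end)` range is inclusive (not stated here). The helper name `split_every` is hypothetical.

```python
def split_every(doc, size):
    """Split a document into consecutive chunks of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    total = doc.page_count()
    parts = []
    for start in range(0, total, size):
        indices = list(range(start, min(start + size, total)))
        parts.append(doc.extract_pages_to_bytes(indices))
    return parts

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("book.pdf")
#   for n, data in enumerate(split_every(doc, 50)):
#       with open(f"book_part_{n}.pdf", "wb") as f:
#           f.write(data)
```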

Method	Parameters	Description
`select_pages(pages)`	`list[int]`	Keeps only the specified pages, in the specified order
`delete_page(index)`	`int`	Deletes a single page
`move_page(from_index, to_index)`	`int, int`	Moves a page to a new position

Compliance and Validation

doc.validate_pdf_a(level: str = "1b") -> dict

Validates against a PDF/A conformance level (e.g. "1b", "2b", "3b"). Returns a report dict.

doc.convert_to_pdf_a(level: str = "2b") -> dict

Converts the document to PDF/A and returns a conversion report dict.

doc.validate_pdf_ua() -> dict

Validates against PDF/UA (accessibility) requirements.

doc.validate_pdf_x(level: str = "1a_2001") -> dict

Validates against a PDF/X (print production) conformance level.

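Permissions and Warnings

doc.permissions() -> dict

Returns the document's encryption permission flags (print, copy, modify, annotate, and so on).

doc.structured_warnings() -> list

Returns warnings collected during extraction of structured/tagged content.

doc.flatten_warnings() -> list[str]

Returns warnings collected during form/annotation flattening.

Signatures and Document Security Store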

doc.signatures() -> list[Signature]

Returns all digital signatures in the document. See Signature.

doc.signature_count() -> int

Returns the number of digital signatures.

doc.dss() -> Dss | None

Returns the document's parsed Document Security Store (LTV material), or None if absent. See Dss.

Page API (v0.3.34)

PdfDocument is iterable and indexable, returning lazily evaluated Page objects. See Page.

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages
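As a brief sketch of the iteration protocol, the hypothetical helper below walks the document with a plain `for` loop and uses the `index` and `text` properties of each Page to find pages that contain any non-whitespace text.

```python
def non_empty_pages(doc):
    """Return the indices of pages whose extracted text is not blank."""
    return [page.index for page in doc if page.text.strip()]

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   with PdfDocument("paper.pdf") as doc:
#       print(non_empty_pages(doc))
```

Because Page properties are computed on access, only `text` is evaluated for each page in this loop.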

DOM Access

doc.page(index: int) -> PdfPage

Gets a DOM-like page handle for element-level editing. See PdfPage.

doc.save_page(page: PdfPage) -> None

Writes a modified PdfPage back to the document.

Rendering

doc.render_page(
    page: int,
    dpi: int | None = None,
    format: str | None = None,
    background: tuple[float, float, float, float] | None = None,
    transparent: bool = False,
    render_annotations: bool | None = None,
    jpeg_quality: int | None = None,
    excluded_layers: list[str] | None = None
) -> bytes

Renders a page to PNG or JPEG bytes, with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.

doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap

Renders a page to a raw RGBA RenderedPixmap (a named tuple with width, height, and data).

doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]

Renders per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).

doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate

Renders a single separation plate for the named ink.

Method	Return type	Description
`render_page_fit(page, fit_width, fit_height, format=0)`	`bytes`	Renders the page scaled to fit within a pixel box
`flatten_to_images(dpi=150)`	`bytes`	Flattens all pages into an image-based PDF

Saving

doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None

Saves the PDF to a file. You can toggle stream compression, garbage collection of unused objects, and linearization (fast web view).

doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes

Serializes the PDF to bytes with the same options as save().

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Saves with AES-256 password protection and permission controls. If owner_password is None, the user password is used.

doc.to_bytes_encrypted(
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> bytes

Serializes an AES-256 encrypted PDF to bytes.

Page

A lazy page handle returned by doc[i] or by iterating a PdfDocument. All properties are computed on access and dispatched to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property	Type	Description
`index`	`int`	0-based page index
`width`, `height`	`float`	Page dimensions (PDF points)
`bbox`	`tuple[float, float, float, float]`	`(llx, lly, urx, ury)`
`text`	`str`	Extracted plain text
`chars`, `words`, `lines`, `spans`	`list[...]`	Structured text
`tables`	`list[dict]`	Tables with rows and cells (text + bounding box)
`images`, `paths`, `annotations`	`list[...]`	Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
            render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

This lazy page object is also exposed through doc.pages(), an iterator equivalent to iterating the document directly.

PdfPage

A DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page().

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property	Type	Description
`index`	`int`	0-based page index
`width`	`float`	Page width (PDF points)
`height`	`float`	Page height (PDF points)

Methods

page.children() -> list[PdfElement]

Gets all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Finds all text elements containing the given substring.

page.find_images() -> list[PdfImage]

Finds all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Gets a specific element by ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replaces the text content of the element identified by the PdfTextId.

page.annotations() -> list[PdfAnnotation]

Gets all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Adds a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Adds a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Adds a sticky note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Removes an annotation by index. Returns True if it was removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Adds new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Removes an element by ID. Returns True if it was removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")

Pdf

A unified class for creating PDFs from various source formats.

from pdf_oxide import Pdf

Factory Methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from plain text.

Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from Markdown rendered through a named CSS/layout template.

Pdf.from_image(path: str) -> Pdf

Creates a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Opens an existing PDF from in-memory bytes for editing. Useful for loading PDFs downloaded from S3, over HTTP, or from a database.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

Creates a multi-page PDF from multiple image files, one image per page.

Pdf.from_image_bytes(data: bytes) -> Pdf

Creates a single-page PDF from image bytes.

Pdf.merge(paths: list[str]) -> Pdf

Merges multiple PDF files (given by path) into a single Pdf.

Methods

pdf.save(path: str) -> None

Saves the PDF to a file.

pdf.to_bytes() -> bytes

Gets the PDF content as bytes.

len(pdf) -> int

Gets the size of the PDF in bytes (via __len__).

PdfText

Represents a text element on a page. Returned by PdfPage.find_text_containing().

Property	Type	Description
`id`	`PdfTextId`	Unique element identifier
`value`	`str`	Text content
`text`	`str`	Text content (alias for `value`)
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the text is bold
`is_italic`	`bool`	Whether the text is italic

Methods

Method	Parameter	Returns	Description
`contains(needle)`	`str`	`bool`	Checks whether the text contains a substring
`starts_with(prefix)`	`str`	`bool`	Checks whether the text starts with a prefix
`ends_with(suffix)`	`str`	`bool`	Checks whether the text ends with a suffix

PdfImage

Represents an image element on a page. Returned by PdfPage.find_images().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`width`	`int`	Image width (pixels)
`height`	`int`	Image height (pixels)
`aspect_ratio`	`float`	Width / height ratio

PdfAnnotation

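Represents an annotation on a page. Returned by PdfPage.annotations().

Property	Type	Description
`subtype`	`str`	Annotation type (e.g. `"Link"`, `"Highlight"`, `"Text"`)
`rect`	`tuple[float, float, float, float]`	Position `(x0, y0, x1, y1)`
`contents`	`str \| None`	Annotation text contents, if any
`color`	`tuple[float, float, float] \| None`	Annotation color, if any
`is_modified`	`bool`	Whether the annotation has been modified
`is_new`	`bool`	Whether the annotation was newly added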

PdfElement

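A generic element wrapper. Returned by PdfPage.children().

Method	Returns	Description
`is_text()`	`bool`	Checks whether the element is text
`is_image()`	`bool`	Checks whether the element is an image
`is_path()`	`bool`	Checks whether the element is a vector path
`is_table()`	`bool`	Checks whether the element is a table
`is_structure()`	`bool`	Checks whether the element is a structure element
`as_text()`	`PdfText \| None`	Returns the element as a PdfText, or None if it is not text
`as_image()`	`PdfImage \| None`	Returns the element as a PdfImage, or None if it is not an image

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box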

TextChar

Represents a single character with position information and font metadata. Returned by PdfDocument.extract_chars().

from pdf_oxide import TextChar  # or access via PdfDocument

Attribute	Type	Description
`char`	`str`	Unicode character
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`font_weight`	`str`	Weight (`"thin"`, `"light"`, `"normal"`, `"medium"`, `"semi-bold"`, `"bold"`, `"extra-bold"`, `"black"`)
`is_italic`	`bool`	Whether the character is italic
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0 to 1.0
`rotation_degrees`	`float`	Character rotation (degrees)
`origin_x`	`float`	X position of the text origin
`origin_y`	`float`	Y position of the text origin
`advance_width`	`float`	Glyph advance width
`mcid`	`int \| None`	Marked-content ID, if any

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")

TextSpan

Represents a run of text sharing the same font and style. Returned by PdfDocument.extract_spans().

Attribute	Type	Description
`text`	`str`	Text content
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the span is bold
`is_italic`	`bool`	Whether the span is italic
`is_monospace`	`bool`	Whether the font is monospaced (Courier, Consolas, and so on)
`char_widths`	`list[float]`	Per-glyph advance widths for precise bounding-box calculation
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0 to 1.0

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

Image Extraction

extract_images() returns ImageInfo objects carrying image metadata. For raw image data suitable for saving to disk, use extract_image_bytes().

Return format of extract_image_bytes()

Each dict returned by extract_image_bytes() has the following keys.

Key	Type	Description
`width`	`int`	Image width (pixels)
`height`	`int`	Image height (pixels)
`data`	`bytes`	Raw image data
`format`	`str`	Image format (e.g. `"png"`, `"jpeg"`)

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a text search match. Returned by search() and search_page().

Attribute	Type	Description
`page`	`int`	0-based page index
`text`	`str`	Matched text
`x`	`float`	X position (PDF points)
`y`	`float`	Y position (PDF points)

FormField

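Represents a form field. Returned by PdfDocument.get_form_fields().

Property	Type	Description
`name`	`str`	Fully qualified field name
`field_type`	`str`	Field type: `"text"`, `"button"`, `"choice"`, `"signature"`, `"unknown"`
`value`	`str \| bool`	Current field value (the type depends on the field type; see get_form_field_value())
`tooltip`	`str \| None`	Tooltip text, if any
`bounds`	`tuple[float, float, float, float] \| None`	Field bounds, if any
`flags`	`int \| None`	Field flags, if any
`max_length`	`int \| None`	Maximum length, if any
`is_readonly`	`bool`	Whether the field is read-only
`is_required`	`bool`	Whether the field is required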

TextWord

A run of text grouped into a word. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().

Property	Type	Description
`text`	`str`	Word text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size (points)
`is_bold`	`bool`	Whether the word is bold
`is_italic`	`bool`	Whether the word is italic
`chars`	`list[TextChar]`	Constituent characters

TextLine

A run of text grouped into a line. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().

Property	Type	Description
`text`	`str`	Line text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`words`	`list[TextWord]`	Words in the line
`chars`	`list[TextChar]`	Characters in the line

PdfPageRegion

A cropped region of a page. Returned by PdfDocument.within() and PdfPage.region().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounds of the region

Methods

region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list

The extraction methods are scoped to the region's bounding box.

LayoutParams

Adaptive layout parameters computed for a page. Returned by PdfDocument.page_layout_params().

Property	Type	Description
`word_gap_threshold`	`float`	Inter-word gap threshold (points)
`line_gap_threshold`	`float`	Inter-line gap threshold (points)
`median_char_width`	`float`	Median character width
`median_font_size`	`float`	Median font size
`median_line_spacing`	`float`	Median line spacing
`column_count`	`int`	Number of detected text columns
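The computed thresholds can be fed back into `extract_words()` through its `word_gap_threshold` keyword. The sketch below scales the page's own adaptive threshold before passing it on; the helper name `words_with_scaled_gap` and the `scale` parameter are hypothetical, and how a larger or smaller threshold changes word grouping should be checked against real output.

```python
def words_with_scaled_gap(doc, page, scale=1.0):
    """Extract words using the page's adaptive gap threshold times `scale`."""
    params = doc.page_layout_params(page)
    threshold = params.word_gap_threshold * scale
    return doc.extract_words(page, word_gap_threshold=threshold)

# Usage, assuming pdf_oxide is installed:
#   from pdf_oxide import PdfDocument
#   doc = PdfDocument("paper.pdf")
#   words = words_with_scaled_gap(doc, 0, scale=1.5)
```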

ExtractionProfile

A tunable text extraction profile to pass to extract_words() / extract_text_lines().

from pdf_oxide import ExtractionProfile

Static Constructors

ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str]   # names of all built-in profiles

Properties

Property	Type	Description
`name`	`str`	Profile name
`tj_offset_threshold`	`float`	Word-break threshold for TJ array offsets
`word_margin_ratio`	`float`	Word margin ratio
`space_threshold_em_ratio`	`float`	Space width threshold (em ratio)
`space_char_multiplier`	`float`	Space character multiplier
`use_adaptive_threshold`	`bool`	Whether adaptive thresholding is enabled

OfficeConverter

Converts Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

OfficeConverter()   # instances are stateless; the conversion methods are also usable as static methods

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Converts a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Converts Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Converts an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Converts Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Converts a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Converts PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Auto-detects the format and converts any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")

Graphics Classes

These classes are available for advanced PDF creation with graphics.

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
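Color components are floats in the range 0.0 to 1.0, while hex strings carry 8-bit channels. The pure-Python sketch below shows the conventional conversion between the two; it illustrates the arithmetic only and is not a description of how `Color.from_hex()` is implemented. The helper name `hex_to_rgb` is hypothetical.

```python
def hex_to_rgb(value):
    """Convert "#rrggbb" to an (r, g, b) tuple of floats in 0.0-1.0."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected a 6-digit hex color such as #ff0000")
    # Each pair of hex digits is one 8-bit channel; divide by 255 to normalize.
    return tuple(int(digits[i:i + 2], 16) / 255.0 for i in (0, 2, 4))

# hex_to_rgb("#ff0000") gives (1.0, 0.0, 0.0), which could then be passed
# to the documented constructor as Color(1.0, 0.0, 0.0).
```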

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR Classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

DocumentBuilder

A fluent builder for assembling a PDF page by page. See the examples below and Creating from Scratch.

from pdf_oxide import DocumentBuilder

Document-level methods

Method	Parameters	Description
`DocumentBuilder()`	–	Construct a new builder
`title(title)`	`str`	Set the document title
`author(author)`	`str`	Set the document author
`subject(subject)`	`str`	Set the document subject
`keywords(keywords)`	`str`	Set the document keywords
`creator(creator)`	`str`	Set the name of the creating application
`on_open(script)`	`str`	Set a document-level JavaScript action that runs on open
`tagged_pdf_ua1()`	–	Emit a tagged, accessible PDF/UA-1 document
`language(lang)`	`str`	Set the document language (e.g. `"en-US"`)
`role_map(custom, standard)`	`str, str`	Map a custom structure tag to a standard tag
`register_embedded_font(name, font)`	`str, EmbeddedFont`	Register a font (consumes the `EmbeddedFont`)

Page factories

builder.a4_page() -> FluentPageBuilder       # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder   # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder

Output

builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes

FluentPageBuilder

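Buffers page-level operations until done() is called. Returned by DocumentBuilder.a4_page() / letter_page() / page(). Unless a different return type is shown, every method returns self for chaining. done() commits the page and returns the parent DocumentBuilder.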

Text and layout

Method	Parameters	Description
`font(name, size)`	`str, float`	Set the current font and size
`at(x, y)`	`float, float`	Move the cursor to an absolute position
`text(text)`	`str`	Draw text at the cursor position
`heading(level, text)`	`int, str`	Draw a heading (levels 1 to 6)
`paragraph(text)`	`str`	Draw a paragraph with line wrapping
`space(points)`	`float`	Advance the cursor vertically
`horizontal_rule()`	–	Draw a horizontal rule
`columns(column_count, gap_pt, text)`	`int, float, str`	Flow text into balanced multiple columns
`footnote(ref_mark, note_text)`	`str, str`	Inline reference mark plus a note at the bottom of the page
`new_page_same_size()`	–	Start a new page with the same dimensions
`measure(text) -> float`	`str`	Measure the rendered text width in points
`remaining_space() -> float`	–	Remaining vertical space on the page
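
The two measuring methods make simple manual layout decisions possible. The sketch below is one way to use them, assuming `page` is a FluentPageBuilder (or any object exposing the same `measure`, `remaining_space`, and `new_page_same_size` methods). The helper names `right_aligned_x` and `ensure_room` are hypothetical and not part of pdf_oxide.

```python
def right_aligned_x(page, text, right_edge):
    """Return the x position at which `text` ends exactly at `right_edge`.

    `page` is any object with a `measure(text) -> float` method, such as a
    FluentPageBuilder with a font already selected.
    """
    return right_edge - page.measure(text)


def ensure_room(page, needed_pt):
    """Start a new same-size page when fewer than `needed_pt` points remain.

    Returns the builder to keep chaining on.
    """
    if page.remaining_space() < needed_pt:
        page = page.new_page_same_size()
    return page
```

A caller might write `page.at(right_aligned_x(page, "Total", 540), 100).text("Total")`, where 540 is an illustrative right margin in points.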

Inline runs

page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()

Links and actions

page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)

Markup annotations

page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)

AcroForm widgets

page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)
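
The `buttons` argument of radio_group() is a list of five-element tuples. The sketch below builds such a list for a horizontal row of buttons. It assumes the tuple order is (value, x, y, w, h), inferred from the other widget signatures rather than stated in this reference. `radio_row` is a hypothetical helper, not a pdf_oxide function.

```python
def radio_row(values, x, y, size=15.0, gap=33.0):
    """Lay out one square radio button per value in a horizontal row.

    Returns a list of (value, x, y, w, h) tuples. The field order is an
    assumption inferred from the other widget signatures.
    """
    return [(value, x + i * (size + gap), y, size, size)
            for i, value in enumerate(values)]


buttons = radio_row(["free", "pro"], x=72.0, y=340.0)
# The list can then be handed to the page builder:
# page.radio_group("tier", buttons, selected="pro")
```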

Graphics

page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)

Images and barcodes

page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)

barcode_type: 0=Code128, 1=Code39, 2=EAN13, 3=EAN8, 4=UPCA, 5=ITF, 6=Code93, 7=Codabar.
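
Because barcode_type is a bare integer, call sites can be hard to read. One option is a small local enum that mirrors the codes listed above. `BarcodeType` and `barcode_type_from_name` are hypothetical names defined here for illustration. This reference does not say that pdf_oxide exports an equivalent.

```python
from enum import IntEnum


class BarcodeType(IntEnum):
    """Local mirror of the barcode_type codes documented above."""
    CODE128 = 0
    CODE39 = 1
    EAN13 = 2
    EAN8 = 3
    UPCA = 4
    ITF = 5
    CODE93 = 6
    CODABAR = 7


def barcode_type_from_name(name):
    """Look up a code from a loosely formatted name such as 'EAN-13'."""
    key = name.replace("-", "").replace(" ", "").upper()
    return BarcodeType[key]  # raises KeyError for unknown names
```

Passing `int(BarcodeType.EAN13)` keeps the call explicit about handing a plain integer to barcode_1d() or generate_barcode_svg().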

Tables

page.table(table: Table)
page.streaming_table(
    columns: list[Column],
    repeat_header: bool = False,
    mode: str = "fixed",
    sample_rows: int = 50,
    min_col_width_pt: float = 20.0,
    max_col_width_pt: float = 400.0,
    max_rowspan: int = 1,
    batch_size: int = 256
) -> StreamingTable

Commit

page.done() -> DocumentBuilder

EmbeddedFont

A TTF/OTF font registered with a DocumentBuilder.

from pdf_oxide import EmbeddedFont

EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont

Property	Type	Description
`name`	`str`	Registered name of the font

Tables

Value objects for the fluent table API.

Align

from pdf_oxide import Align

Align.LEFT     # 0
Align.CENTER   # 1
Align.RIGHT    # 2

Column

from pdf_oxide import Column

Column(header: str, width: float = 100.0, align: Align | int | None = None)

Property	Type	Description
`header`	`str`	Column header text
`width`	`float`	Column width in points
`align`	`int`	Cell alignment

Table

from pdf_oxide import Table

Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)

A buffered table consumed by FluentPageBuilder.table(). When has_header=True, the column headers are rendered as a styled header row.

StreamingTable

A row-streaming table handle returned by FluentPageBuilder.streaming_table(), intended for tables too large to materialize at once.

Method	Parameters	Description
`push_row(cells)`	`list[str]`	Append a row of cell strings
`push_row_span(cells)`	`list[tuple[str, int]]`	Append a row of `(text, colspan)` cells
`flush()`	–	Flush the current batch
`finish()`	–	Finish the table and return the `FluentPageBuilder`
`column_count()`	– → `int`	Number of columns
`pending_row_count()`	– → `int`	Number of rows buffered but not yet committed
`batch_count()`	– → `int`	Number of completed batches
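
A typical streaming loop pushes rows from an iterator and then hands control back to the page builder. The sketch below assumes `page` is a FluentPageBuilder and `columns` is a list of Column objects. `stream_rows` is a hypothetical helper name, and the row-width check is an added safeguard rather than documented library behavior.

```python
def stream_rows(page, columns, rows):
    """Push every row from `rows` into a streaming table.

    Returns whatever finish() returns, which this reference documents as
    the FluentPageBuilder.
    """
    table = page.streaming_table(columns, repeat_header=True)
    width = table.column_count()
    for n, row in enumerate(rows, start=1):
        cells = [str(cell) for cell in row]  # push_row expects strings
        if len(cells) != width:
            raise ValueError(f"row {n} has {len(cells)} cells, expected {width}")
        table.push_row(cells)
    return table.finish()
```

Since `rows` can be any iterable, it may be a generator reading from a database cursor or a CSV file, so the full data set never has to sit in memory.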

Page templates

Repeating header and footer artifacts applied across pages.

Artifact / ArtifactStyle

from pdf_oxide import Artifact, ArtifactStyle

Artifact()                       # empty artifact
Artifact.center(text: str)       # centered artifact text
artifact.with_left(text: str)    # add left-aligned text

style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()

Header / Footer

from pdf_oxide import Header, Footer

Header()                  # or Header.center(text: str)
Footer()                  # or Footer.center(text: str)

PageTemplate

from pdf_oxide import PageTemplate, Header, Footer

template = (PageTemplate()
    .header(Header.center("Confidential"))
    .footer(Footer.center("Page")))

Digital signatures

Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures feature (and optionally tsa-client) in the Rust build.

Certificate

from pdf_oxide import Certificate

Certificate.load(data: bytes) -> Certificate                       # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate   # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential

Method	Returns	Description
`subject()`	`str`	Subject DN of the certificate
`issuer()`	`str`	Issuer DN of the certificate
`serial()`	`str`	Serial number
`validity()`	`tuple[int, int]`	`(not_before, not_after)` as Unix timestamps
`is_valid()`	`bool`	Whether the certificate is currently within its validity period
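
Since validity() returns raw Unix timestamps, converting them to timezone-aware datetimes is usually the first step in any expiry check. The sketch below assumes the values are seconds since the epoch in UTC. `validity_window` and `days_until_expiry` are hypothetical helpers that accept any object with a `validity()` method.

```python
from datetime import datetime, timezone


def validity_window(cert):
    """Return (not_before, not_after) as timezone-aware UTC datetimes."""
    not_before, not_after = cert.validity()
    return (datetime.fromtimestamp(not_before, tz=timezone.utc),
            datetime.fromtimestamp(not_after, tz=timezone.utc))


def days_until_expiry(cert, now=None):
    """Whole days left before not_after (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    _, not_after = validity_window(cert)
    return (not_after - now).days
```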

Signature

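Returned by PdfDocument.signatures().

Property / Method	Type	Description
`signer_name`	`str \| None`	Signer name, if present
`reason`	`str \| None`	Reason for signing, if present
`location`	`str \| None`	Signing location, if present
`contact_info`	`str \| None`	Signer contact information, if present
`signing_time`	`int \| None`	Signing time, if present
`covers_whole_document`	`bool`	Whether the signature covers the entire file
`pades_level`	`PadesLevel`	Detected PAdES baseline level (B-B/B-T/B-LT)
`verify()`	`bool`	Cryptographically verify the signature
`verify_detached(pdf_data)`	`bool`	Verify, including a check of `messageDigest` against the file bytes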

Timestamp

from pdf_oxide import Timestamp

Timestamp.parse(data: bytes) -> Timestamp

Property / Method	Type	Description
`time`	`int`	Timestamp time (Unix)
`serial`	`str`	Serial number of the TSA response
`policy_oid`	`str`	TSA policy OID
`tsa_name`	`str`	TSA name
`hash_algorithm`	`int`	Hash algorithm code of the message imprint
`message_imprint`	`bytes`	Hashed message imprint
`verify()`	`bool`	Verify the timestamp token

TsaClient

from pdf_oxide import TsaClient

client = TsaClient(
    url: str,
    username: str | None = None,
    password: str | None = None,
    timeout_seconds: int = 30,
    hash_algorithm: int = 2,
    use_nonce: bool = True,
    cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp

PadesLevel

from pdf_oxide import PadesLevel

PadesLevel.B_B     # baseline
PadesLevel.B_T     # + trusted timestamp
PadesLevel.B_LT    # + long-term validation material
PadesLevel.B_LTA   # + archival timestamp

RevocationMaterial

from pdf_oxide import RevocationMaterial

RevocationMaterial(
    certs: list[bytes] | None = None,
    crls: list[bytes] | None = None,
    ocsps: list[bytes] | None = None
)

DER-encoded certificates, CRLs, and OCSP responses for B-LT signatures.

Dss

A parsed Document Security Store, returned by PdfDocument.dss().

Property	Type	Description
`certs`	`list[bytes]`	Document-level certificate DER blobs
`crls`	`list[bytes]`	CRL DER blobs
`ocsps`	`list[bytes]`	OCSP response DER blobs
`vri`	`list[str]`	Per-signature VRI keys (the SHA-1 of `/Contents` in hexadecimal)
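
A VRI key can be recomputed locally to check whether the DSS holds validation material for a given signature. The sketch below hashes the signature bytes with SHA-1 and compares case-insensitively. Which exact bytes the library hashes (for example, whether trailing zero padding in `/Contents` is included) is not stated in this reference, so treat the input as an assumption. `vri_key` and `has_vri_entry` are hypothetical helpers.

```python
import hashlib


def vri_key(contents: bytes) -> str:
    """Uppercase hex SHA-1 of the signature bytes."""
    return hashlib.sha1(contents).hexdigest().upper()


def has_vri_entry(dss, contents: bytes) -> bool:
    """True if `dss.vri` lists a key for these signature bytes.

    `dss` is any object with a `vri` attribute holding hex strings.
    """
    wanted = vri_key(contents)
    return any(key.upper() == wanted for key in dss.vri)
```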

Module-level functions

from pdf_oxide import (
    sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
    generate_barcode_svg, generate_qr_svg,
    plan_split_by_bookmarks, split_by_bookmarks,
)

Signing

sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes

Signs raw PDF bytes with a loaded signing Certificate and returns the signed PDF.

sign_pdf_bytes_pades(
    pdf_data: bytes,
    cert: Certificate,
    level: PadesLevel,
    tsa_url: str | None = None,
    reason: str | None = None,
    location: str | None = None,
    revocation: RevocationMaterial | None = None
) -> bytes

Signs raw PDF bytes at a PAdES baseline level. B_T and B_LT require tsa_url.

has_document_timestamp(pdf_data: bytes) -> bool

Returns whether the PDF has a document-level RFC 3161 archive timestamp (PAdES-B-LTA).

Barcodes

generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str

Generates a 1D barcode or a QR code as an SVG string. Requires the barcodes feature.

Split by bookmarks

plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]

Plans or performs a split of a PDF at bookmark boundaries. plan_split_by_bookmarks returns only the segment metadata. split_by_bookmarks returns each segment paired with its PDF bytes.
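
The `(metadata, bytes)` pairs returned by split_by_bookmarks() can be written straight to disk. The sketch below names files by position because the keys of the metadata dict are not listed in this reference. `write_segments` is a hypothetical helper.

```python
from pathlib import Path


def write_segments(segments, out_dir):
    """Write each (metadata, pdf_bytes) pair to out_dir as segment_NNN.pdf.

    Returns the list of written paths in segment order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (_meta, pdf_bytes) in enumerate(segments, start=1):
        path = out / f"segment_{index:03d}.pdf"
        path.write_bytes(pdf_bytes)
        paths.append(path)
    return paths
```

Usage would look like `write_segments(split_by_bookmarks(src_bytes, level=1), "chapters")`, where `"chapters"` is an illustrative output directory.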

OCR model provisioning

prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool

Provision OCR models for offline or air-gapped use, inspect the model manifest (JSON), and check whether this build can download models.

Logging

setup_logging() -> None
set_log_level(level: str) -> None     # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None

Engine tuning

set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool

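Tune the per-stream operator limit (a safeguard against adversarial input) and whether unmapped glyphs are preserved as U+FFFD. Both functions return the previous value.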
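
Because both setters return the previous value, a temporary override can be wrapped in a context manager that restores it afterwards. The sketch below takes the setter as a parameter so it works with either function. `temporary_setting` is a hypothetical helper, and it assumes the settings are process-wide, which this reference does not state.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(setter, value):
    """Apply `value` via `setter`, then restore the previous value on exit.

    `setter` must return the value that was in effect before the call,
    as set_max_ops_per_stream and set_preserve_unmapped_glyphs do.
    """
    previous = setter(value)
    try:
        yield previous
    finally:
        setter(previous)
```

For example, `with temporary_setting(set_max_ops_per_stream, 1_000_000): ...` would raise the limit only for the enclosed block. The figure is illustrative.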

Crypto governance

crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None                 # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None      # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str                      # CycloneDX 1.6 CBOM (JSON)

Async API

async/await wrappers that run blocking operations in a thread pool. Their methods mirror the synchronous versions.

from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter

async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()
    # Or use as an async context manager:
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()

Class	Constructor	Notes
`AsyncPdfDocument`	`await AsyncPdfDocument.open(path, password=None)`, `await AsyncPdfDocument.from_bytes(data, password=None)`	All `PdfDocument` methods are available as awaitables. Supports `async with` and `.close()`
`AsyncPdf`	Wraps the `Pdf` factory methods	`await pdf.save(path)`, `await pdf.to_bytes()`
`AsyncOfficeConverter`	Wraps the `OfficeConverter` static methods	e.g. `await AsyncOfficeConverter.from_docx(path)`
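
Since the wrappers hand work to a thread pool, several documents can be processed concurrently with asyncio.gather. The sketch below takes the opener as a parameter (for example `AsyncPdfDocument.open`) and caps concurrency with a semaphore. `first_page_texts` is a hypothetical helper, and the default limit of 4 is arbitrary.

```python
import asyncio


async def first_page_texts(open_doc, paths, max_concurrency=4):
    """Extract page 0 from each path concurrently, preserving input order.

    `open_doc` is an async callable returning an object with awaitable
    extract_text(index) and close() methods.
    """
    gate = asyncio.Semaphore(max_concurrency)

    async def one(path):
        async with gate:
            doc = await open_doc(path)
            try:
                return await doc.extract_text(0)
            finally:
                await doc.close()

    return await asyncio.gather(*(one(p) for p in paths))
```

It could be driven with `asyncio.run(first_page_texts(AsyncPdfDocument.open, ["a.pdf", "b.pdf"]))`, where the file names are placeholders.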

Error handling

PdfError

All PDF-specific failures raise PdfError.

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

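Common error scenarios:

Exception	Cause
`PdfError`	Malformed PDF, encrypted PDF opened without a password, parse failure
`FileNotFoundError`	The file does not exist
`IndexError`	The page index is out of range (at or beyond `page_count()`)
`ValueError`	Invalid argument (e.g. a negative page index)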

Complete example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

Additions in v0.3.38

`DocumentBuilder` / `FluentPageBuilder` / `EmbeddedFont`

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(size)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext((100, 500, 200, 50), "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 types)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, (0.9, 0.9, 0.9))
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

See Create from HTML.

Signature verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # datetime | None
    sig.verify()                     # "Valid" | "Invalid" | "Unknown"
    sig.verify_detached(pdf_bytes)   # adds messageDigest check

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

See Digital signatures for details.

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates are in PDF points with the origin at the bottom-left.
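
Rectangles picked from a rasterized preview usually arrive in pixels with a top-left origin, so they need converting before being passed to render_page_region(). The sketch below does that conversion at 72 points per inch. It assumes `(x, y)` denotes the region's lower-left corner, which this reference implies but does not spell out. `pixel_rect_to_pdf_region` is a hypothetical helper.

```python
def pixel_rect_to_pdf_region(left_px, top_px, width_px, height_px,
                             page_height_pt, dpi=72.0):
    """Convert a top-left-origin pixel rectangle to (x, y, w, h) in PDF points.

    The result uses a bottom-left origin, with (x, y) taken to be the
    rectangle's lower-left corner.
    """
    x = left_px * 72.0 / dpi
    w = width_px * 72.0 / dpi
    h = height_px * 72.0 / dpi
    top_pt = top_px * 72.0 / dpi
    y = page_height_pt - (top_pt + h)  # flip the vertical axis
    return x, y, w, h
```

For a US Letter page (792 pt tall) previewed at 150 DPI, the call `pixel_rect_to_pdf_region(150, 300, 300, 150, 792.0, dpi=150)` yields the four values to pass as `x, y, w, h`.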

Flattening a `Pdf`

doc.flatten_to_images(dpi: int = 150) -> bytes

Bindings for other languages

PDF Oxide provides native bindings for every major ecosystem: Rust, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir.

Next steps

Types and enums: all shared types and enums
Page API reference: consistent page-by-page iteration across bindings
Getting started with Python: tutorial