Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Pre-built wheels are available for Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).

pip install pdf_oxide

For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or WASM API Reference. For type details, see Types & Enums.


PdfDocument

The primary class for opening, extracting, editing, and saving PDF files.

from pdf_oxide import PdfDocument

Constructor

PdfDocument(path: str, password: str | None = None)
Parameter Type Description
path str Path to the PDF file
password str | None Optional password for encrypted PDFs (default: None)

Pass password= to open an encrypted PDF in one step. Alternatively, call doc.authenticate(password) after opening.

Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.

Class Methods

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

Open a PDF from in-memory bytes (e.g., downloaded from S3, received via HTTP). Accepts an optional password for encrypted PDFs.

Parameter Type Description
data bytes Raw PDF file bytes
password str | None Optional password for encrypted PDFs (default: None)

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

Methods

General

Method Return Type Description
version() tuple[int, int] PDF version as (major, minor) (e.g., (1, 7))
authenticate(password) bool Authenticate an encrypted PDF with user or owner password

Document Info

doc.page_count() -> int

Return the number of pages in the document.

doc.has_structure_tree() -> bool

Check if the document is a Tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticate with a password after opening. Returns True if authentication succeeded.
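
As a minimal sketch of post-open authentication, the helper below tries several candidate passwords in turn. The name unlock is hypothetical and not part of pdf_oxide. It relies only on authenticate() returning True on success, as described above, and assumes that a failed attempt leaves the document usable for another attempt.

```python
from typing import Iterable, Optional


def unlock(doc, candidates: Iterable[str]) -> Optional[str]:
    """Try each candidate password; return the first one that is accepted.

    `doc` is any object exposing authenticate(password) -> bool,
    such as an opened PdfDocument. Returns None if nothing matched.
    """
    for password in candidates:
        # authenticate() returns True when the password is accepted.
        if doc.authenticate(password):
            return password
    return None
```

With a real document this might be called as unlock(doc, ["user-pw", "owner-pw"]).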

Text Extraction

doc.extract_text(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None,
    extract_tables: bool = True
) -> str

Extract plain text from a single page. Pages are zero-indexed. Optionally clip to a region, exclude named optional-content layers or ink/separation names, and toggle table reconstruction.
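
Because extract_text() works on one zero-indexed page at a time, whole-document extraction is a loop over page_count(). The sketch below shows one way to write that loop; extract_all_text and its separator parameter are hypothetical names, not part of the library.

```python
from typing import List


def extract_all_text(doc, separator: str = "\n\n") -> str:
    """Join the plain text of every page, in page order.

    `doc` is any object with page_count() and extract_text(page),
    such as a PdfDocument.
    """
    pages: List[str] = []
    # Pages are zero-indexed, so range(page_count()) visits each one once.
    for index in range(doc.page_count()):
        pages.append(doc.extract_text(index))
    return separator.join(pages)
```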

doc.extract_chars(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None
) -> list[TextChar]

Extract per-character positioning and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]

Extract text spans with font metadata. Each span is a run of identically-styled text. Pass reading_order="column_aware" for multi-column PDFs.

doc.extract_words(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextWord]

Extract word-grouped text with bounding boxes. Returns a list of TextWord objects.
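
The bounding boxes make it possible to post-filter words that have already been extracted. The following sketch keeps only words whose bbox lies fully inside a given box. The helper name words_inside is hypothetical, and the code assumes every bbox is ordered as (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]


def words_inside(words: Iterable, box: BBox) -> List:
    """Return the words whose bounding box lies entirely inside `box`.

    Each word only needs a `bbox` attribute shaped like (x0, y0, x1, y1).
    """
    bx0, by0, bx1, by1 = box
    selected = []
    for word in words:
        x0, y0, x1, y1 = word.bbox
        # Keep the word only if all four edges fall within the box.
        if x0 >= bx0 and y0 >= by0 and x1 <= bx1 and y1 <= by1:
            selected.append(word)
    return selected
```

When the area of interest is known before extraction, passing region= to extract_words() avoids the extra pass.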

doc.extract_text_lines(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    line_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextLine]

Extract line-grouped text. Returns a list of TextLine objects.

doc.extract_page_text(page: int, reading_order: str | None = None) -> dict

Extract spans, characters, and page dimensions in a single pass. Returns a dict with keys: spans, chars, page_width, page_height, text. More efficient than calling extract_spans() and extract_chars() separately.

doc.page_layout_params(page: int) -> LayoutParams

Compute adaptive layout parameters (word/line gap thresholds, median metrics, column count) for a page. See LayoutParams.

doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion

Create a clipped region handle for extracting text, words, lines, tables, images, and paths inside bbox. See PdfPageRegion.

Auto Extraction & Classification

doc.extract_text_auto(page: int) -> str

Auto-select the best extraction strategy (native text vs. OCR) for a page and return plain text.

doc.extract_page_auto(page: int, options_json: str | None = None) -> str

Auto-extract a page and return a JSON document; pass a JSON options_json string to tune the pipeline.

doc.classify_page(page: int) -> str

Classify a single page (e.g. "text", "scanned", "mixed").

doc.classify_document() -> str

Classify the whole document by sampling its pages.

doc.has_text_layer(page: int) -> bool

Check whether a page already has an extractable native text layer (vs. requiring OCR).
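
One use of has_text_layer() is to decide up front which pages need OCR. The sketch below collects the indices of pages without a native text layer. pages_without_text_layer is a hypothetical helper name, and the code uses only page_count() and has_text_layer() as documented above.

```python
from typing import List


def pages_without_text_layer(doc) -> List[int]:
    """Return zero-based indices of pages that lack a native text layer."""
    return [
        index
        for index in range(doc.page_count())
        if not doc.has_text_layer(index)
    ]
```

The resulting indices could then be passed one at a time to an OCR-based extraction call.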

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert a page to plain text with layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to Markdown.

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to HTML.

Office Conversion

Method Return Type Description
to_docx(path) Convert the PDF to a Word document file
to_docx_bytes() bytes Convert the PDF to DOCX bytes
to_pptx(path) Convert the PDF to a PowerPoint file
to_pptx_bytes() bytes Convert the PDF to PPTX bytes
to_xlsx(path) Convert the PDF to an Excel workbook file
to_xlsx_bytes() bytes Convert the PDF to XLSX bytes

Image Extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extract all images from a page, including images in content streams and nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extract raw image bytes from a page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text across all pages. Set max_results=0 for unlimited results. Returns a list of matches with page number, text, and coordinates.

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text on a single page.
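
Search results carry the page index of each match, so they are easy to summarize. As a small sketch, the hypothetical helper below counts matches per page. It reads only the page attribute of each result and makes no other assumption about the result objects.

```python
from collections import Counter
from typing import Dict, Iterable


def matches_per_page(results: Iterable) -> Dict[int, int]:
    """Map each zero-based page index to its number of matches."""
    # Counter tallies how often each page index occurs in the results.
    return dict(Counter(result.page for result in results))
```

A typical call might look like matches_per_page(doc.search("invoice", case_insensitive=True)).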

Metadata Editing

Method Parameters Description
set_title(title) str Set document title
set_author(author) str Set document author
set_subject(subject) str Set document subject
set_keywords(keywords) str Set document keywords

Page Rotation

Method Parameters Returns Description
page_rotation(page) int int Get current rotation (0, 90, 180, 270)
set_page_rotation(page, degrees) int, int Set absolute rotation
rotate_page(page, degrees) int, int Add to current rotation
rotate_all_pages(degrees) int Rotate all pages
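
The difference between the absolute setter and the relative one comes down to simple arithmetic. The sketch below is not library code: rotated is a hypothetical helper, and wrapping the result into the range 0 to 359 is an assumption about how rotate_page() combines values. The multiple-of-90 check reflects the PDF specification's rule for a page's Rotate entry.

```python
def rotated(current: int, delta: int) -> int:
    """Return the rotation reached by adding `delta` degrees to `current`.

    The result is wrapped into 0..359 (an assumption, see the lead-in).
    """
    if delta % 90 != 0:
        raise ValueError("page rotation must change in multiples of 90 degrees")
    # Python's % always yields a non-negative result for a positive modulus,
    # so negative deltas wrap around correctly.
    return (current + delta) % 360
```

Under these assumptions, rotating a page that is currently at 270 by another 180 degrees would leave it at 90.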

Page Dimensions

Method Parameters Returns Description
page_media_box(page) int tuple[float, float, float, float] Get MediaBox (llx, lly, urx, ury)
set_page_media_box(page, llx, lly, urx, ury) int, float, float, float, float Set MediaBox
page_crop_box(page) int tuple | None Get CropBox (None if not set)
set_page_crop_box(page, llx, lly, urx, ury) int, float, float, float, float Set CropBox
crop_margins(left, right, top, bottom) float, float, float, float Crop all page margins

Erase / Whiteout

Method Parameters Description
erase_region(page, llx, lly, urx, ury) int, float, float, float, float Erase a rectangular region
erase_regions(page, rects) int, list[tuple] Erase multiple regions
clear_erase_regions(page) int Clear pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

Get annotation metadata (type, rect, contents, etc.) for a page.

Method Parameters Returns Description
flatten_page_annotations(page) int Flatten annotations on a page
flatten_all_annotations() Flatten all annotations
is_page_marked_for_flatten(page) int bool Check if page is marked for flatten
unmark_page_for_flatten(page) int Unmark a page for flatten

Redaction

doc.add_redaction(
    page: int,
    rect: tuple[float, float, float, float],
    fill: tuple[float, float, float] | None = None
) -> None

Mark a rectangular region for redaction with an optional RGB fill color.

doc.redaction_count(page: int) -> int

Return the number of pending redactions on a page.

doc.apply_redactions_destructive(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True,
    fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None

Apply all redactions destructively, removing underlying content and optionally scrubbing metadata, JavaScript, and embedded files.
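
The redaction calls are meant to be used in sequence: mark regions, check what is pending, then apply. The sketch below strings those steps together for a single page. redact_regions is a hypothetical helper, and it calls apply_redactions_destructive() with its documented defaults, which also scrub metadata, JavaScript, and embedded files.

```python
from typing import Iterable, Tuple

Rect = Tuple[float, float, float, float]
RGB = Tuple[float, float, float]


def redact_regions(doc, page: int, rects: Iterable[Rect],
                   fill: RGB = (0.0, 0.0, 0.0)) -> int:
    """Queue redactions on one page and apply them if any are pending.

    Returns the number of redactions that were pending on the page
    just before applying.
    """
    for rect in rects:
        doc.add_redaction(page, rect, fill=fill)
    pending = doc.redaction_count(page)
    if pending:
        # Destructive: the underlying content is removed, not just covered.
        doc.apply_redactions_destructive()
    return pending
```

The document still has to be written out afterwards, for example with doc.save().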

doc.sanitize_document(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True
) -> None

Sanitize the document without redacting regions: strip metadata, JavaScript, and/or embedded files.

Method Parameters Returns Description
apply_page_redactions(page) int Apply redactions on a page
apply_all_redactions() Apply all pending redactions
is_page_marked_for_redaction(page) int bool Check if page is marked for redaction
unmark_page_for_redaction(page) int Unmark a page for redaction

Layers & Inks

Method Parameters Returns Description
get_layers() list[str] List optional-content (OCG) layer names
get_page_inks(page) int list[str] List ink / separation colorant names on a page
get_page_inks_deep(page) int list[str] List inks including those nested in Form XObjects

Headers, Footers & Artifacts

doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int

Detect and remove repeating headers, footers, or page artifacts across the document. threshold is the cross-page repetition ratio. Returns the number of elements removed.

Method Parameters Description
erase_header(page) int Erase the detected header region on a page
edit_header(page) int Mark the header region for editing
erase_footer(page) int Erase the detected footer region on a page
edit_footer(page) int Mark the footer region for editing
erase_artifacts(page) int Erase detected artifacts on a page
sync_editor_erasures() Flush pending header/footer/artifact erasures into the editor

Form Fields

doc.get_form_fields() -> list[FormField]

Get all form fields. See FormField for properties.

doc.get_form_field_value(name: str) -> str | bool | list | None

Get a form field value by name. Returns the appropriate Python type based on the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Set a form field value by name.

doc.has_xfa() -> bool

Check if the document contains XFA forms.

doc.export_form_data(path: str, format: str = "fdf") -> None

Export form data to a file. Supported formats: "fdf" and "xfdf".

Method Parameters Description
flatten_forms() Flatten all form fields into page content
flatten_forms_on_page(page) int Flatten forms on a specific page
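
A common pattern is to fill several fields from a mapping before exporting or flattening. The sketch below shows one possible shape for that. fill_form is a hypothetical helper, and skipping read-only fields is a design choice of this example rather than documented library behavior.

```python
from typing import Dict, List, Union


def fill_form(doc, values: Dict[str, Union[str, bool]]) -> List[str]:
    """Set values for known, writable fields; return the names that were set."""
    updated: List[str] = []
    for field in doc.get_form_fields():
        # Only touch fields the caller supplied a value for,
        # and leave read-only fields alone.
        if field.name in values and not field.is_readonly:
            doc.set_form_field_value(field.name, values[field.name])
            updated.append(field.name)
    return updated
```

Names in values that do not match any field are ignored, so comparing the returned list against the mapping's keys shows what was skipped.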

Image Manipulation

doc.page_images(page: int) -> list[dict]

Get image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method Parameters Description
reposition_image(page, name, x, y) int, str, float, float Move an image
resize_image(page, name, width, height) int, str, float, float Resize an image
set_image_bounds(page, name, x, y, width, height) int, str, float, float, float, float Set image position and size
clear_image_modifications(page) int Clear pending image modifications
has_image_modifications(page) int bool Check for pending image modifications

Document Operations

doc.merge_from(source: str | PdfDocument) -> int

Merge pages from another PDF. Accepts a file path or PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attach a file to the PDF.

doc.get_outline() -> list[dict] | None

Get document bookmarks / table of contents. Returns None if no outline exists.

doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]

Get vector paths (lines, curves, shapes) from a page.

doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]

Get axis-aligned rectangles (from filled/stroked paths) on a page.

doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]

Get straight line segments on a page.

doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]

Detect and extract tables. Each table is a dict with rows and cells (text + bounding boxes). Pass table_settings to tune detection strategy.

doc.extract_structured(page: int) -> str

Extract the page as a structured JSON document (logical reading order, blocks, and roles).

doc.page_labels() -> list[dict]

Get page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Get XMP metadata as a dictionary with fields like dc_title, dc_creator, xmp_create_date, etc. Returns None if no XMP metadata exists.

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extract text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine or None for the default engine.

Page Extraction & Reordering

doc.extract_pages(pages: list[int], output: str) -> None

Extract the given page indices into a new PDF file at output.

doc.extract_pages_to_bytes(pages: list[int]) -> bytes

Extract the given page indices into a new PDF returned as bytes.
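
Since extract_pages_to_bytes() accepts an explicit list of indices, splitting a document into fixed-size parts only requires building those lists. The following is a sketch under that reading; split_every is a hypothetical helper name.

```python
from typing import List


def split_every(doc, pages_per_chunk: int) -> List[bytes]:
    """Split a document into consecutive PDFs of at most `pages_per_chunk` pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    total = doc.page_count()
    chunks: List[bytes] = []
    for start in range(0, total, pages_per_chunk):
        # The last chunk may be shorter than the others.
        indices = list(range(start, min(start + pages_per_chunk, total)))
        chunks.append(doc.extract_pages_to_bytes(indices))
    return chunks
```

Each returned bytes object is a standalone PDF that could be written to disk or reopened with PdfDocument.from_bytes().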

doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes

Extract one or more (start, end) page ranges into a new PDF returned as bytes.

Method Parameters Description
select_pages(pages) list[int] Keep only the listed pages, in the given order
delete_page(index) int Delete a single page
move_page(from_index, to_index) int, int Move a page to a new position

Compliance & Validation

doc.validate_pdf_a(level: str = "1b") -> dict

Validate against a PDF/A conformance level (e.g. "1b", "2b", "3b"). Returns a report dict.

doc.convert_to_pdf_a(level: str = "2b") -> dict

Convert the document to PDF/A and return a conversion report dict.

doc.validate_pdf_ua() -> dict

Validate against PDF/UA (accessibility) requirements.

doc.validate_pdf_x(level: str = "1a_2001") -> dict

Validate against a PDF/X (print-production) conformance level.

Permissions & Warnings

doc.permissions() -> dict

Return the document’s encryption permission flags (print, copy, modify, annotate, etc.).

doc.structured_warnings() -> list

Return warnings collected during structured / tagged-content extraction.

doc.flatten_warnings() -> list[str]

Return warnings collected during form/annotation flattening.

Signatures & Document Security Store

doc.signatures() -> list[Signature]

Return all digital signatures in the document. See Signature.

doc.signature_count() -> int

Return the number of digital signatures.

doc.dss() -> Dss | None

Return the document’s parsed Document Security Store (LTV material), or None. See Dss.

Page API (v0.3.34)

PdfDocument is iterable and indexable, returning lazy Page objects. See Page.

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages
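
As a short sketch of the iteration protocol, the generator below walks the document and yields only pages that contain visible text. non_empty_pages is a hypothetical helper; it uses the index and text properties of the lazy Page objects.

```python
from typing import Iterator, Tuple


def non_empty_pages(doc) -> Iterator[Tuple[int, str]]:
    """Yield (index, text) for every page whose text is not just whitespace."""
    for page in doc:
        text = page.text  # computed lazily on access
        if text.strip():
            yield page.index, text
```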

DOM Access

doc.page(index: int) -> PdfPage

Get a DOM-like page handle for element-level editing. See PdfPage.

doc.save_page(page: PdfPage) -> None

Save a modified PdfPage back to the document.

Rendering

doc.render_page(
    page: int,
    dpi: int | None = None,
    format: str | None = None,
    background: tuple[float, float, float, float] | None = None,
    transparent: bool = False,
    render_annotations: bool | None = None,
    jpeg_quality: int | None = None,
    excluded_layers: list[str] | None = None
) -> bytes

Render a page to PNG or JPEG bytes with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.
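
The accepted values for format are not listed here, so when saving rendered output it can be safer to inspect the returned bytes than to guess. The sketch below picks a file extension from the standard PNG and JPEG file signatures. image_extension is a hypothetical helper and does not call into pdf_oxide.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file header
JPEG_SIGNATURE = b"\xff\xd8\xff"      # JPEG start-of-image marker plus next marker byte


def image_extension(data: bytes) -> str:
    """Return "png" or "jpg" based on the leading bytes of `data`."""
    if data.startswith(PNG_SIGNATURE):
        return "png"
    if data.startswith(JPEG_SIGNATURE):
        return "jpg"
    raise ValueError("data is neither PNG nor JPEG")
```

A caller might combine this with render_page() to name the output file, for example f"page_0.{image_extension(data)}".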

doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap

Render a page to a raw RGBA RenderedPixmap (named tuple with width, height, data).
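
To illustrate how a raw RGBA buffer can be addressed, the sketch below reads one pixel from a pixmap. It assumes the data is tightly packed with 4 bytes per pixel, stored row by row with no padding between rows; that layout is an assumption of this example and is not stated above. pixel_at is a hypothetical helper.

```python
from typing import Tuple


def pixel_at(pixmap, x: int, y: int) -> Tuple[int, int, int, int]:
    """Return the (r, g, b, a) bytes of the pixel in column x, row y."""
    if not (0 <= x < pixmap.width and 0 <= y < pixmap.height):
        raise IndexError("pixel coordinates out of range")
    # Assumed layout: rows stored one after another, 4 bytes per pixel.
    offset = (y * pixmap.width + x) * 4
    r, g, b, a = pixmap.data[offset:offset + 4]
    return r, g, b, a
```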

doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]

Render per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).

doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate

Render a single named ink separation plate.

Method Return Type Description
render_page_fit(page, fit_width, fit_height, format=0) bytes Render a page scaled to fit a pixel box
flatten_to_images(dpi=150) bytes Flatten all pages to image-based PDF

Saving

doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None

Save the PDF to a file. Toggle stream compression, dead-object garbage collection, and linearization (fast web view).

doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes

Serialize the PDF to bytes with the same options as save().

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Save with AES-256 password protection and permission controls. If owner_password is None, the user password is also used as the owner password.

doc.to_bytes_encrypted(
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> bytes

Serialize an AES-256 encrypted PDF to bytes.


Page

A lazy page handle returned by doc[i] or iteration over PdfDocument. All properties are computed on access and dispatch to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property Type Description
index int Zero-based page index
width, height float Page dimensions in PDF points
bbox tuple[float, float, float, float] (llx, lly, urx, ury)
text str Extracted plain text
chars, words, lines, spans list[...] Structured text
tables list[dict] Tables with rows + cells (text + bboxes)
images, paths, annotations list[...] Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
            render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

Lazy Page objects are also available through doc.pages(), an iterator equivalent to iterating the document directly.


PdfPage

DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page().

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property Type Description
index int Zero-based page index
width float Page width in PDF points
height float Page height in PDF points

Methods

page.children() -> list[PdfElement]

Get all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Find all text elements containing the given substring.

page.find_images() -> list[PdfImage]

Find all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Get a specific element by its ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replace the text content of an element identified by its PdfTextId.

page.annotations() -> list[PdfAnnotation]

Get all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Add a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Add a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Add a sticky note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Remove an annotation by index. Returns True if removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Add new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Remove an element by its ID. Returns True if removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")

Pdf

The unified class for creating PDFs from various source formats.

from pdf_oxide import Pdf

Factory Methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from plain text.

Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from Markdown rendered through a named CSS/layout template.

Pdf.from_image(path: str) -> Pdf

Create a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Open an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, HTTP, or databases.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

Create a multi-page PDF from multiple image files, one page per image.

Pdf.from_image_bytes(data: bytes) -> Pdf

Create a single-page PDF from image bytes.

Pdf.merge(paths: list[str]) -> Pdf

Merge multiple PDF files (by path) into a single Pdf.

Methods

pdf.save(path: str) -> None

Save the PDF to a file.

pdf.to_bytes() -> bytes

Get the PDF content as bytes.

len(pdf) -> int

Get the PDF size in bytes (via __len__).


PdfText

Represents a text element on a page. Returned by PdfPage.find_text_containing().

Property Type Description
id PdfTextId Unique element identifier
value str Text content
text str Text content (alias for value)
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
is_bold bool Whether text is bold
is_italic bool Whether text is italic

Methods

Method Parameters Returns Description
contains(needle) str bool Check if text contains substring
starts_with(prefix) str bool Check if text starts with prefix
ends_with(suffix) str bool Check if text ends with suffix

PdfImage

Represents an image element on a page. Returned by PdfPage.find_images().

Property Type Description
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
width int Image width in pixels
height int Image height in pixels
aspect_ratio float Width / height ratio

PdfAnnotation

Represents an annotation on a page. Returned by PdfPage.annotations().

Property Type Description
subtype str Annotation type (e.g., "Link", "Highlight", "Text")
rect tuple[float, float, float, float] Position (x0, y0, x1, y1)
contents str | None Annotation text contents, if any
color tuple[float, float, float] | None RGB color, if set
is_modified bool Whether the annotation has been modified
is_new bool Whether the annotation is newly added

PdfElement

Generic element wrapper. Returned by PdfPage.children().

Method Returns Description
is_text() bool Check if element is text
is_image() bool Check if element is an image
is_path() bool Check if element is a vector path
is_table() bool Check if element is a table
is_structure() bool Check if element is a structure element
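as_text() PdfText | None Return the element as PdfText, or None if it is not text
as_image() PdfImage | None Return the element as PdfImage, or None if it is not an image

Property Type Description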
bbox tuple[float, float, float, float] Bounding box

TextChar

Represents a single character with positioning and font metadata. Returned by PdfDocument.extract_chars().

from pdf_oxide import TextChar  # or access via PdfDocument

Attribute Type Description
char str The Unicode character
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
font_weight str Weight ("thin", "light", "normal", "medium", "semi-bold", "bold", "extra-bold", "black")
is_italic bool Whether the character is italic
color tuple[float, float, float] RGB color (r, g, b), values 0.0–1.0
rotation_degrees float Character rotation in degrees
origin_x float Text origin X position
origin_y float Text origin Y position
advance_width float Glyph advance width
mcid int | None Marked-content ID (MCID), if any

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")

TextSpan

Represents a run of text sharing the same font and style. Returned by PdfDocument.extract_spans().

Attribute Type Description
text str The text content
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
is_bold bool Whether the span is bold
is_italic bool Whether the span is italic
is_monospace bool Whether the font is fixed-width (Courier, Consolas, etc.)
char_widths list[float] Per-glyph advance widths for accurate bounding boxes
color tuple[float, float, float] RGB color (r, g, b), values 0.0–1.0

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

Image Extraction

extract_images() returns ImageInfo objects with image metadata. Use extract_image_bytes() for raw image data suitable for saving to disk.

extract_image_bytes() Return Format

Each dict returned by extract_image_bytes() has the following keys:

Key Type Description
width int Image width in pixels
height int Image height in pixels
data bytes Raw image data
format str Image format (e.g., "png", "jpeg")

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a text search match. Returned by search() and search_page().

Attribute Type Description
page int Zero-based page index
text str Matched text
x float X position in PDF points
y float Y position in PDF points

FormField

Represents a form field. Returned by PdfDocument.get_form_fields().

Property Type Description
name str Fully qualified field name
field_type str Field type: "text", "button", "choice", "signature", or "unknown"
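value str | bool | list | None Current field value
tooltip str | None Tooltip text, if any
bounds tuple[float, float, float, float] | None Field bounding box, if available
flags int | None Field flags, if present
max_length int | None Maximum text length, if set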
is_readonly bool Whether the field is read-only
is_required bool Whether the field is required

TextWord

A word-grouped run of text. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().

Property Type Description
text str The word text
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
is_bold bool Whether the word is bold
is_italic bool Whether the word is italic
chars list[TextChar] Constituent characters

TextLine

A line-grouped run of text. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().

Property Type Description
text str The line text
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
words list[TextWord] Words in the line
chars list[TextChar] Characters in the line

PdfPageRegion

A clipped region of a page. Returned by PdfDocument.within() and PdfPage.region().

Property Type Description
bbox tuple[float, float, float, float] The region’s bounds

Methods

region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list

Extraction methods scoped to the region’s bounding box.


LayoutParams

Computed adaptive layout parameters for a page. Returned by PdfDocument.page_layout_params().

Property Type Description
word_gap_threshold float Inter-word gap threshold in points
line_gap_threshold float Inter-line gap threshold in points
median_char_width float Median character width
median_font_size float Median font size
median_line_spacing float Median line spacing
column_count int Detected number of text columns
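
One way these parameters might be used is to choose a reading order per page. The sketch below requests column-aware ordering only when more than one column is detected. extract_spans_by_layout is a hypothetical helper, and treating column_count > 1 as the trigger is a design choice of this example.

```python
def extract_spans_by_layout(doc, page: int):
    """Extract spans, asking for column-aware order on multi-column pages."""
    params = doc.page_layout_params(page)
    # Leave reading_order at its default (None) for single-column pages.
    reading_order = "column_aware" if params.column_count > 1 else None
    return doc.extract_spans(page, reading_order=reading_order)
```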

ExtractionProfile

A tunable text-extraction profile passed to extract_words() / extract_text_lines().

from pdf_oxide import ExtractionProfile

Static Constructors

ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str]   # names of all built-in profiles

Properties

Property Type Description
name str Profile name
tj_offset_threshold float TJ array offset word-break threshold
word_margin_ratio float Word margin ratio
space_threshold_em_ratio float Space-width threshold (em ratio)
space_char_multiplier float Space-character multiplier
use_adaptive_threshold bool Whether adaptive thresholds are enabled

OfficeConverter

Convert Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

OfficeConverter()   # instances are stateless; the conversion methods are also usable as static methods

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Convert a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Convert Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Convert an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Convert Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Convert a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Convert PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Auto-detect format and convert any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")

Graphics Classes

These classes are available for advanced PDF creation with graphics:

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
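Several builder methods (for example inline_color() and filled_rect()) take a color as separate r, g, b floats in the 0.0-1.0 range rather than as a Color object. A minimal sketch of converting a hex string into that form follows. hex_to_rgb is a hypothetical helper, not part of pdf_oxide, and it only illustrates the usual channel arithmetic; it makes no claim about how Color.from_hex is implemented.

```python
def hex_to_rgb(hex_color: str) -> tuple[float, float, float]:
    """Convert '#rrggbb' (leading '#' optional) to floats in the 0.0-1.0 range."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    # Each channel is two hex digits (0-255), scaled down to 0.0-1.0.
    r, g, b = (int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return (r, g, b)


print(hex_to_rgb("#ff0000"))  # (1.0, 0.0, 0.0)
```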

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR Classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

engine = OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

config = OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

DocumentBuilder

Fluent builder for composing PDFs page by page. See the example below and Create from scratch.

from pdf_oxide import DocumentBuilder

Document-Level Methods

Method Parameters Description
DocumentBuilder() Construct a new builder
title(title) str Set document title
author(author) str Set document author
subject(subject) str Set document subject
keywords(keywords) str Set document keywords
creator(creator) str Set the producing application name
on_open(script) str Set a document-level open JavaScript action
tagged_pdf_ua1() Emit a Tagged PDF/UA-1 accessible document
language(lang) str Set the document language (e.g. "en-US")
role_map(custom, standard) str, str Map a custom structure tag to a standard one
register_embedded_font(name, font) str, EmbeddedFont Register a font (consumes the EmbeddedFont)

Page Factories

builder.a4_page() -> FluentPageBuilder       # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder   # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder
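Page dimensions are given in PDF points (1 pt = 1/72 inch). For custom sizes specified in millimetres, a small conversion helper can feed builder.page(). This is a sketch; mm_to_pt is a hypothetical name and not part of pdf_oxide.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A4 is 210 x 297 mm, which rounds to the 595 x 842 pt quoted above.
print(round(mm_to_pt(210)), round(mm_to_pt(297)))  # 595 842

# Hypothetical usage for an A5 page:
# builder.page(mm_to_pt(148), mm_to_pt(210))
```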

Output

builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes

FluentPageBuilder

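Buffers page-level operations until done(). Returned by DocumentBuilder.a4_page() / letter_page() / page(). Unless a different return type is shown (for example measure(), remaining_space(), and streaming_table()), each method returns self for chaining; done() commits the page and returns the parent DocumentBuilder.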

Text & Layout

Method Parameters Description
font(name, size) str, float Set the current font and size
at(x, y) float, float Move the cursor to an absolute position
text(text) str Draw text at the cursor
heading(level, text) int, str Draw a heading (level 1–6)
paragraph(text) str Draw a wrapped paragraph
space(points) float Advance vertical space
horizontal_rule() Draw a horizontal divider
columns(column_count, gap_pt, text) int, float, str Balanced multi-column text flow
footnote(ref_mark, note_text) str, str Inline reference mark + bottom-of-page note
new_page_same_size() Start a fresh page with the same dimensions
measure(text) -> float str Measure rendered text width in points
remaining_space() -> float Remaining vertical space on the page

Inline Runs

page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()
page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)

Markup Annotations

page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)

AcroForm Widgets

page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)
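radio_group() takes its buttons as a list of 5-tuples. Judging by the signature, each tuple is a string followed by x, y, w, h; the sketch below assumes the string is the button's value. radio_row is a hypothetical helper (not part of pdf_oxide) that lays out a horizontal row of equally sized buttons.

```python
def radio_row(values, x, y, size=15.0, gap=33.0):
    """Build (value, x, y, w, h) tuples for a horizontal row of radio buttons."""
    step = size + gap  # distance between the left edges of adjacent buttons
    return [(value, x + i * step, y, size, size) for i, value in enumerate(values)]


buttons = radio_row(["free", "pro"], x=72, y=340)
print(buttons)  # [('free', 72.0, 340, 15.0, 15.0), ('pro', 120.0, 340, 15.0, 15.0)]

# Hypothetical usage:
# page.radio_group("tier", buttons, selected="pro")
```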

Graphics

page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)

Images & Barcodes

page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)

barcode_type: 0=Code128, 1=Code39, 2=EAN13, 3=EAN8, 4=UPCA, 5=ITF, 6=Code93, 7=Codabar.
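Bare integer codes are easy to mix up at call sites. One option is to mirror the table above in an IntEnum on the caller's side. BarcodeType is a hypothetical name defined here, not something exported by pdf_oxide; because IntEnum members are ints, they should be usable wherever barcode_type: int is expected.

```python
from enum import IntEnum


class BarcodeType(IntEnum):
    """Caller-side names for the documented barcode_type codes."""
    CODE128 = 0
    CODE39 = 1
    EAN13 = 2
    EAN8 = 3
    UPCA = 4
    ITF = 5
    CODE93 = 6
    CODABAR = 7


print(int(BarcodeType.EAN13))  # 2

# Hypothetical usage:
# page.barcode_1d(BarcodeType.EAN13, "5901234123457", 72, 600, 200, 60)
```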

Tables

page.table(table: Table)
page.streaming_table(
    columns: list[Column],
    repeat_header: bool = False,
    mode: str = "fixed",
    sample_rows: int = 50,
    min_col_width_pt: float = 20.0,
    max_col_width_pt: float = 400.0,
    max_rowspan: int = 1,
    batch_size: int = 256
) -> StreamingTable

Commit

page.done() -> DocumentBuilder

EmbeddedFont

A TTF/OTF font registered with a DocumentBuilder.

from pdf_oxide import EmbeddedFont

EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont

Property Type Description
name str The font’s registered name

Tables

Value objects for the fluent table API.

Align

from pdf_oxide import Align

Align.LEFT     # 0
Align.CENTER   # 1
Align.RIGHT    # 2

Column

from pdf_oxide import Column

Column(header: str, width: float = 100.0, align: Align | int | None = None)

Property Type Description
header str Column header text
width float Column width in points
align int Cell alignment

Table

from pdf_oxide import Table

Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)

A buffered table consumed by FluentPageBuilder.table(). With has_header=True, the column headers render as a styled header row.
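Table expects rows as list[list[str]], so records held as dicts need flattening first. A minimal sketch, assuming every cell should be stringified and missing values rendered as empty strings; rows_from_records is a hypothetical helper, not part of pdf_oxide.

```python
def rows_from_records(records, keys):
    """Turn a list of dicts into list[list[str]], one inner list per record."""
    rows = []
    for record in records:
        # Missing or None values become "" so every row has the same width.
        rows.append(["" if record.get(key) is None else str(record[key]) for key in keys])
    return rows


records = [{"sku": "A-1", "qty": 3}, {"sku": "B-2", "qty": None}]
print(rows_from_records(records, ["sku", "qty"]))  # [['A-1', '3'], ['B-2', '']]
```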

StreamingTable

A row-streaming table handle returned by FluentPageBuilder.streaming_table(), for tables too large to materialize at once.

Method Parameters Description
push_row(cells) list[str] Append a row of cell strings
push_row_span(cells) list[tuple[str, int]] Append a row of (text, colspan) cells
flush() Flush the current batch
finish() Finish the table, returning the FluentPageBuilder
column_count() – → int Number of columns
pending_row_count() – → int Rows buffered but not yet committed
batch_count() – → int Number of completed batches
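Because rows are pushed one at a time, the source can be a generator rather than a fully built list. The sketch below relies only on the push_row() and finish() methods listed above; stream_rows is a hypothetical helper, and it is written against any object exposing those two methods.

```python
def stream_rows(table, rows):
    """Push each row into a StreamingTable-like object, then finish the table."""
    for row in rows:
        # push_row() is documented to take list[str], so stringify every cell.
        table.push_row([str(cell) for cell in row])
    # finish() is documented to return the FluentPageBuilder for further chaining.
    return table.finish()


# Hypothetical usage with a lazily generated data source:
# table = page.streaming_table(columns, repeat_header=True)
# page = stream_rows(table, ((i, i * i) for i in range(100_000)))
```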

Page Templates

Repeating header/footer artifacts applied across pages.

Artifact / ArtifactStyle

from pdf_oxide import Artifact, ArtifactStyle

Artifact()                       # empty artifact
Artifact.center(text: str)       # centered artifact text
artifact.with_left(text: str)    # add left-aligned text

style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()

Header / Footer

from pdf_oxide import Header, Footer

Header()                  # or Header.center(text: str)
Footer()                  # or Footer.center(text: str)

PageTemplate

from pdf_oxide import PageTemplate, Header, Footer

template = (PageTemplate()
    .header(Header.center("Confidential"))
    .footer(Footer.center("Page")))

Digital Signatures

Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures (and optionally tsa-client) features in the Rust build.

Certificate

from pdf_oxide import Certificate

Certificate.load(data: bytes) -> Certificate                       # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate   # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential

Method Returns Description
subject() str Certificate subject DN
issuer() str Certificate issuer DN
serial() str Serial number
validity() tuple[int, int] (not_before, not_after) Unix timestamps
is_valid() bool Whether the certificate is currently within its validity window
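validity() returns raw Unix timestamps, which are awkward to read or compare against a chosen date. A sketch of converting them, assuming the values are seconds since the epoch in UTC; validity_window and is_within are hypothetical helpers, not part of pdf_oxide.

```python
from datetime import datetime, timezone


def validity_window(validity):
    """Convert (not_before, not_after) Unix timestamps to aware UTC datetimes."""
    not_before, not_after = validity
    return (
        datetime.fromtimestamp(not_before, tz=timezone.utc),
        datetime.fromtimestamp(not_after, tz=timezone.utc),
    )


def is_within(validity, moment=None):
    """Check whether a moment (default: now) falls inside the validity window."""
    start, end = validity_window(validity)
    if moment is None:
        moment = datetime.now(timezone.utc)
    return start <= moment <= end


# Hypothetical usage:
# start, end = validity_window(cert.validity())
```

Unlike is_valid(), which checks the current time, is_within() can test an arbitrary moment, such as a recorded signing time.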

Signature

Returned by PdfDocument.signatures().

Property / Method Type Description
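signer_name str | None Signer name, if present
reason str | None Reason for signing, if present
location str | None Signing location, if present
contact_info str | None Signer contact information, if present
signing_time int | None Signing time, if present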
covers_whole_document bool Whether the signature covers the entire file
pades_level PadesLevel Detected PAdES baseline (B-B/B-T/B-LT)
verify() bool Verify the signature cryptographically
verify_detached(pdf_data) bool Verify including the messageDigest against the file bytes

Timestamp

from pdf_oxide import Timestamp

Timestamp.parse(data: bytes) -> Timestamp

Property / Method Type Description
time int Timestamp time (Unix)
serial str TSA response serial number
policy_oid str TSA policy OID
tsa_name str TSA name
hash_algorithm int Message-imprint hash algorithm code
message_imprint bytes The hashed message imprint
verify() bool Verify the timestamp token

TsaClient

from pdf_oxide import TsaClient

client = TsaClient(
    url: str,
    username: str | None = None,
    password: str | None = None,
    timeout_seconds: int = 30,
    hash_algorithm: int = 2,
    use_nonce: bool = True,
    cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp

PadesLevel

from pdf_oxide import PadesLevel

PadesLevel.B_B     # baseline
PadesLevel.B_T     # + trusted timestamp
PadesLevel.B_LT    # + long-term validation material
PadesLevel.B_LTA   # + archival timestamp

RevocationMaterial

from pdf_oxide import RevocationMaterial

RevocationMaterial(
    certs: list[bytes] | None = None,
    crls: list[bytes] | None = None,
    ocsps: list[bytes] | None = None
)

DER-encoded certificates, CRLs, and OCSP responses for B-LT signing.

Dss

A parsed Document Security Store, returned by PdfDocument.dss().

Property Type Description
certs list[bytes] Document-level certificate DER blobs
crls list[bytes] CRL DER blobs
ocsps list[bytes] OCSP response DER blobs
vri list[str] Per-signature VRI keys (hex SHA-1 of /Contents)
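To match a signature to its VRI entry, compute the same key from the signature's /Contents bytes. A minimal sketch; vri_key is a hypothetical helper, and obtaining the raw /Contents bytes is outside its scope. PAdES specifies uppercase hex for these keys, but the casing pdf_oxide returns is not confirmed here, so compare case-insensitively.

```python
import hashlib


def vri_key(signature_contents: bytes) -> str:
    """Return the uppercase hex SHA-1 digest used as a VRI dictionary key."""
    return hashlib.sha1(signature_contents).hexdigest().upper()


print(vri_key(b"abc"))  # A9993E364706816ABA3E25717850C26C9CD0D89D

# Hypothetical usage:
# known = {key.upper() for key in doc.dss().vri}
# has_vri = vri_key(contents_bytes) in known
```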

Module-Level Functions

from pdf_oxide import (
    sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
    generate_barcode_svg, generate_qr_svg,
    plan_split_by_bookmarks, split_by_bookmarks,
)

Signing

sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes

Sign raw PDF bytes with a loaded signing Certificate and return the signed PDF.

sign_pdf_bytes_pades(
    pdf_data: bytes,
    cert: Certificate,
    level: PadesLevel,
    tsa_url: str | None = None,
    reason: str | None = None,
    location: str | None = None,
    revocation: RevocationMaterial | None = None
) -> bytes

Sign raw PDF bytes at a PAdES baseline level. B_T/B_LT require a tsa_url.

has_document_timestamp(pdf_data: bytes) -> bool

Whether the PDF carries a document-level RFC 3161 archival timestamp (PAdES-B-LTA).

Barcodes

generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str

Generate a 1D barcode or QR code as an SVG string. Requires the barcodes feature.

Split by Bookmarks

plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]

Plan or perform a split of a PDF at bookmark boundaries. plan_* returns segment metadata only; split_* returns each segment paired with its PDF bytes.
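split_by_bookmarks() hands back bytes rather than writing files. A sketch of persisting the segments follows. Since the keys of the metadata dict are not listed here, the example names files by position instead of by bookmark title; write_segments is a hypothetical helper.

```python
from pathlib import Path


def write_segments(segments, out_dir):
    """Write each (metadata, pdf_bytes) pair to out_dir and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (_meta, pdf_bytes) in enumerate(segments, start=1):
        path = out / f"segment_{index:03d}.pdf"
        path.write_bytes(pdf_bytes)
        paths.append(path)
    return paths


# Hypothetical usage:
# with open("book.pdf", "rb") as f:
#     segments = split_by_bookmarks(f.read(), level=1)
# write_segments(segments, "chapters")
```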

OCR Model Provisioning

prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool

Provision OCR models for offline/air-gapped use, inspect the model manifest (JSON), and check whether this build can download models.

Logging

setup_logging() -> None
set_log_level(level: str) -> None     # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None

Engine Tuning

set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool

Adjust the per-stream operator cap (adversarial-input protection) and U+FFFD preservation for unmapped glyphs. Both return the previous value.
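Because both setters return the previous value, a temporary override can be wrapped in a context manager that restores the old setting on exit. This is a sketch; temporarily is a hypothetical helper that works with any setter following the "return the previous value" convention.

```python
from contextlib import contextmanager


@contextmanager
def temporarily(setter, value):
    """Apply a setting for the duration of a with-block, then restore it."""
    previous = setter(value)  # the setter returns the value it replaced
    try:
        yield previous
    finally:
        setter(previous)  # restore even if the block raised


# Hypothetical usage:
# from pdf_oxide import set_max_ops_per_stream
# with temporarily(set_max_ops_per_stream, 50_000):
#     text = doc.extract_text(0)
```

These settings appear to be process-wide, so the override also affects other threads while the block is active.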

Cryptographic Governance

crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None                 # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None      # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str                      # CycloneDX 1.6 CBOM (JSON)

Asynchronous API

The asynchronous classes are async/await wrappers that run the blocking operations in a thread pool. Their methods mirror the synchronous counterparts.

import asyncio

from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter

async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()

    # Or use as an async context manager (here the bytes are read from disk):
    with open("input.pdf", "rb") as f:
        pdf_bytes = f.read()
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()

asyncio.run(main())

Class Constructors Notes
AsyncPdfDocument await AsyncPdfDocument.open(path, password=None), await AsyncPdfDocument.from_bytes(data, password=None) All PdfDocument methods are available as awaitables; supports async with and .close()
AsyncPdf wraps Pdf factory methods await pdf.save(path), await pdf.to_bytes()
AsyncOfficeConverter wraps OfficeConverter static methods e.g. await AsyncOfficeConverter.from_docx(path)
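The thread-pool pattern can be illustrated with the standard library alone. In the sketch below, blocking_extract is a stand-in for a synchronous extraction call; this is not how pdf_oxide implements its wrappers, only the general technique of offloading blocking work so several calls can overlap. asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_extract(page: int) -> str:
    """Stand-in for a blocking call such as a synchronous text extraction."""
    time.sleep(0.01)
    return f"text of page {page}"


async def extract_all(pages):
    # Each blocking call runs in a worker thread; gather() preserves input order.
    tasks = [asyncio.to_thread(blocking_extract, page) for page in pages]
    return await asyncio.gather(*tasks)


result = asyncio.run(extract_all(range(3)))
print(result)  # ['text of page 0', 'text of page 1', 'text of page 2']
```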

Error Handling

PdfError

All PDF-specific failures raise PdfError; missing files and invalid arguments raise the standard Python exceptions listed below:

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

Common error scenarios:

Exception Cause
PdfError Malformed PDF, encrypted without password, parse failure
FileNotFoundError File does not exist
IndexError Page index exceeds page_count()
ValueError Invalid argument (e.g., negative page index)
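When page indices come from user input, validating them up front gives clearer messages than catching exceptions afterwards. The sketch below mirrors the exception types in the table; check_page_index is a hypothetical helper, not a pdf_oxide function.

```python
def check_page_index(index: int, page_count: int) -> int:
    """Validate a zero-based page index, raising the same exception types as the table."""
    if index < 0:
        raise ValueError(f"page index must be non-negative, got {index}")
    if index >= page_count:
        raise IndexError(f"page index {index} out of range for {page_count} pages")
    return index


# Hypothetical usage:
# text = doc.extract_text(check_page_index(requested, doc.page_count()))
```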

Complete Example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

v0.3.38 additions

DocumentBuilder / FluentPageBuilder / EmbeddedFont

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(width, height)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext(100, 500, 200, 50, "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 types)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, 0.9, 0.9, 0.9)
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

See Create from HTML.

Signature verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str | None
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # int | None
    sig.verify()                     # bool
    sig.verify_detached(pdf_bytes)   # bool; also checks messageDigest against the file bytes

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

See Digital Signatures for details.

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates in PDF points from lower-left.
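Regions picked in a viewer or image are usually measured from the top-left corner, so they need flipping into the PDF coordinate system first. A sketch of that conversion, assuming (x, y) names the lower-left corner of the region and that the page height in points is known; top_left_to_pdf is a hypothetical helper.

```python
def top_left_to_pdf(x, y_top, w, h, page_height):
    """Convert a top-left-origin rectangle to lower-left-origin PDF coordinates."""
    # The rectangle's bottom edge sits y_top + h below the top of the page.
    return (x, page_height - y_top - h, w, h)


# A 200 x 100 pt box, 72 pt from the top and left of a Letter page (792 pt tall):
print(top_left_to_pdf(72, 72, 200, 100, page_height=792))  # (72, 620, 200, 100)

# Hypothetical usage:
# x, y, w, h = top_left_to_pdf(72, 72, 200, 100, page_height=792)
# png = doc.render_page_region(0, x, y, w, h, format=0)
```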

Pdf flatten

doc.flatten_to_images(dpi: int = 150) -> bytes

Other Language Bindings

Besides Python, PDF Oxide ships native bindings for Rust, Node.js, WASM, C#, Go, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.

Next Steps