What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF-to-Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Pre-built wheels are available for Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).

pip install pdf_oxide

For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or WASM API Reference. For type details, see Types & Enums.

PdfDocument

The primary class for opening, extracting, editing, and saving PDF files.

from pdf_oxide import PdfDocument

Constructor

PdfDocument(path: str, password: str | None = None)

Parameter	Type	Description
`path`	`str`	Path to the PDF file
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)

Pass password= to open an encrypted PDF in one step, or call doc.authenticate(password) after opening.

Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.

Class Methods

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

Open a PDF from in-memory bytes (e.g., downloaded from S3, received via HTTP). Accepts an optional password for encrypted PDFs.

Parameter	Type	Description
`data`	`bytes`	Raw PDF file bytes
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

Methods

General

Method	Return Type	Description
`version()`	`tuple[int, int]`	PDF version as `(major, minor)` (e.g., `(1, 7)`)
`authenticate(password)`	`bool`	Authenticate an encrypted PDF with user or owner password

Document Info

doc.page_count() -> int

Return the number of pages in the document.

doc.has_structure_tree() -> bool

Check if the document is a Tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticate with a password after opening. Returns True if authentication succeeded.
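
A minimal sketch of the open-then-authenticate flow, assuming doc is a PdfDocument opened without a password. The helper name first_working_password and the candidate list are hypothetical; only authenticate() comes from the API above.

```python
from typing import Iterable, Optional


def first_working_password(doc, candidates: Iterable[str]) -> Optional[str]:
    """Try each candidate and return the first password that unlocks doc.

    `doc` is expected to be a pdf_oxide.PdfDocument; only its documented
    authenticate(password) -> bool method is used.
    """
    for password in candidates:
        if doc.authenticate(password):
            return password
    # None of the candidates was accepted.
    return None
```

Calling first_working_password(doc, ["2023-archive", "2024-archive"]) returns the accepted password, or None if every candidate is rejected.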

Text Extraction

doc.extract_text(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None,
    extract_tables: bool = True
) -> str

Extract plain text from a single page. Pages are zero-indexed. Optionally clip to a region, exclude named optional-content layers or ink/separation names, and toggle table reconstruction.
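
Because extract_text() works on one page at a time, whole-document extraction is a loop over page_count(). The sketch below assumes doc is an open PdfDocument; the helper name extract_document_text and the separator argument are hypothetical, and region is passed through unchanged because its coordinate convention is not stated in this reference.

```python
def extract_document_text(doc, region=None, extract_tables=True, separator="\n\n"):
    """Concatenate the plain text of every page of a PdfDocument.

    Uses only page_count() and extract_text() as documented above.
    """
    pages = []
    for index in range(doc.page_count()):  # pages are zero-indexed
        pages.append(
            doc.extract_text(index, region=region, extract_tables=extract_tables)
        )
    return separator.join(pages)
```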

doc.extract_chars(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None
) -> list[TextChar]

Extract per-character positioning and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]

Extract text spans with font metadata. Each span is a run of identically-styled text. Pass reading_order="column_aware" for multi-column PDFs.
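
Span font sizes are enough for a rough heading pass. The sketch below is one possible heuristic, not how the library's own heading detection works: it treats any span at least ratio times the median font size as a heading candidate. The helper name and the 1.2 default are assumptions; text and font_size are the TextSpan attributes documented later in this reference.

```python
from statistics import median


def heading_candidates(spans, ratio=1.2):
    """Return spans whose font size is at least `ratio` times the median size.

    `spans` is the list returned by doc.extract_spans(page).
    Whitespace-only spans are ignored so they do not skew the median.
    """
    visible = [span for span in spans if span.text.strip()]
    if not visible:
        return []
    body_size = median(span.font_size for span in visible)
    return [span for span in visible if span.font_size >= body_size * ratio]
```

For example, heading_candidates(doc.extract_spans(0)) returns the larger-than-body spans on the first page.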

doc.extract_words(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextWord]

Extract word-grouped text with bounding boxes. Returns a list of TextWord objects.

doc.extract_text_lines(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    line_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextLine]

Extract line-grouped text. Returns a list of TextLine objects.

doc.extract_page_text(page: int, reading_order: str | None = None) -> dict

Extract spans, characters, and page dimensions in a single pass. Returns a dict with keys: spans, chars, page_width, page_height, text. More efficient than calling extract_spans() and extract_chars() separately.

doc.page_layout_params(page: int) -> LayoutParams

Compute adaptive layout parameters (word/line gap thresholds, median metrics, column count) for a page. See LayoutParams.

doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion

Create a clipped region handle for extracting text, words, lines, tables, images, and paths inside bbox. See PdfPageRegion.

Auto Extraction & Classification

doc.extract_text_auto(page: int) -> str

Auto-select the best extraction strategy (native text vs. OCR) for a page and return plain text.

doc.extract_page_auto(page: int, options_json: str | None = None) -> str

Auto-extract a page and return a JSON document; pass a JSON options_json string to tune the pipeline.

doc.classify_page(page: int) -> str

Classify a single page (e.g. "text", "scanned", "mixed").

doc.classify_document() -> str

Classify the whole document by sampling its pages.

doc.has_text_layer(page: int) -> bool

Check whether a page already has an extractable native text layer (vs. requiring OCR).
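
The routing that extract_text_auto() is described as doing can also be written by hand when you want control over the fallback. This is a sketch under assumptions, not the library's internal logic: the helper name and the min_chars cutoff are hypothetical, and extract_text_ocr() (documented under OCR) needs a build with OCR support.

```python
def text_with_ocr_fallback(doc, page, min_chars=1):
    """Return native text for a page, falling back to OCR when it has none.

    A page counts as "has text" only if it reports a text layer and the
    extracted text contains at least `min_chars` non-whitespace-trimmed chars.
    """
    if doc.has_text_layer(page):
        text = doc.extract_text(page)
        if len(text.strip()) >= min_chars:
            return text
    # No usable native text: fall back to OCR.
    return doc.extract_text_ocr(page)
```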

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert a page to plain text with layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to Markdown.
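
For retrieval pipelines it is often useful to keep the page number next to each chunk rather than convert the whole document at once. A minimal sketch using the per-page to_markdown() call; the helper name and the record layout are hypothetical.

```python
def markdown_chunks(doc, detect_headings=True):
    """Return one {"page": index, "markdown": text} record per non-empty page."""
    chunks = []
    for index in range(doc.page_count()):
        md = doc.to_markdown(
            index,
            detect_headings=detect_headings,
            include_images=False,  # text-only chunks for indexing
        )
        if md.strip():  # skip pages that produce no text
            chunks.append({"page": index, "markdown": md})
    return chunks
```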

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to HTML.

Office Conversion

Method	Return Type	Description
`to_docx(path)`	–	Convert the PDF to a Word document file
`to_docx_bytes()`	`bytes`	Convert the PDF to DOCX bytes
`to_pptx(path)`	–	Convert the PDF to a PowerPoint file
`to_pptx_bytes()`	`bytes`	Convert the PDF to PPTX bytes
`to_xlsx(path)`	–	Convert the PDF to an Excel workbook file
`to_xlsx_bytes()`	`bytes`	Convert the PDF to XLSX bytes

Image Extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extract all images from a page, including images in content streams and nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extract raw image bytes from a page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text across all pages. Set max_results=0 for unlimited results. Returns a list of matches with page number, text, and coordinates.

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text on a single page.
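
Search results come back as a flat list, so grouping by page is a common follow-up. The sketch below relies only on the page, text, x, and y attributes listed under SearchResult; the helper name hits_by_page is hypothetical.

```python
from collections import defaultdict


def hits_by_page(doc, pattern, case_insensitive=True):
    """Group document-wide search hits as {page_index: [(text, x, y), ...]}."""
    grouped = defaultdict(list)
    for hit in doc.search(pattern, case_insensitive=case_insensitive):
        grouped[hit.page].append((hit.text, hit.x, hit.y))
    return dict(grouped)
```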

Metadata Editing

Method	Parameters	Description
`set_title(title)`	`str`	Set document title
`set_author(author)`	`str`	Set document author
`set_subject(subject)`	`str`	Set document subject
`set_keywords(keywords)`	`str`	Set document keywords

Page Rotation

Method	Parameters	Returns	Description
`page_rotation(page)`	`int`	`int`	Get current rotation (0, 90, 180, 270)
`set_page_rotation(page, degrees)`	`int, int`	–	Set absolute rotation
`rotate_page(page, degrees)`	`int, int`	–	Add to current rotation
`rotate_all_pages(degrees)`	`int`	–	Rotate all pages
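
The difference between the absolute and relative calls matters when normalizing scans. A small sketch that resets every rotated page to 0 degrees with set_page_rotation(); the helper name is hypothetical.

```python
def reset_rotation(doc):
    """Set every rotated page back to 0 degrees and return the indices changed."""
    changed = []
    for index in range(doc.page_count()):
        if doc.page_rotation(index) % 360 != 0:
            # Absolute set: the result does not depend on the current rotation.
            doc.set_page_rotation(index, 0)
            changed.append(index)
    return changed
```

Using rotate_page() here instead would add to the existing rotation rather than replace it.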

Page Dimensions

Method	Parameters	Returns	Description
`page_media_box(page)`	`int`	`tuple[float, float, float, float]`	Get MediaBox `(llx, lly, urx, ury)`
`set_page_media_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set MediaBox
`page_crop_box(page)`	`int`	`tuple \| None`	Get CropBox, or `None` if not set
`set_page_crop_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set CropBox
`crop_margins(left, right, top, bottom)`	`float, float, float, float`	–	Crop all page margins

Erase / Whiteout

Method	Parameters	Description
`erase_region(page, llx, lly, urx, ury)`	`int, float, float, float, float`	Erase a rectangular region
`erase_regions(page, rects)`	`int, list[tuple]`	Erase multiple regions
`clear_erase_regions(page)`	`int`	Clear pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

Get annotation metadata (type, rect, contents, etc.) for a page.

Method	Parameters	Returns	Description
`flatten_page_annotations(page)`	`int`	–	Flatten annotations on a page
`flatten_all_annotations()`	–	–	Flatten all annotations
`is_page_marked_for_flatten(page)`	`int`	`bool`	Check if page is marked for flatten
`unmark_page_for_flatten(page)`	`int`	–	Unmark a page for flatten

Redaction

doc.add_redaction(
    page: int,
    rect: tuple[float, float, float, float],
    fill: tuple[float, float, float] | None = None
) -> None

Mark a rectangular region for redaction with an optional RGB fill color.

doc.redaction_count(page: int) -> int

Return the number of pending redactions on a page.

doc.apply_redactions_destructive(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True,
    fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None

Apply all redactions destructively, removing underlying content and optionally scrubbing metadata, JavaScript, and embedded files.
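
Redaction is a two-step process: mark regions, then apply. The sketch below strings the documented calls together. The helper name is hypothetical, and the rect tuples are assumed to follow the same (llx, lly, urx, ury) order used by the page box methods, which this reference does not state explicitly.

```python
def redact_and_apply(doc, page, rects, fill=(0.0, 0.0, 0.0)):
    """Mark each rect on `page` for redaction, apply destructively, return the count.

    `rects` is an iterable of 4-float tuples (coordinate order is an assumption).
    """
    for rect in rects:
        doc.add_redaction(page, rect, fill=fill)
    pending = doc.redaction_count(page)  # read before applying
    doc.apply_redactions_destructive(
        scrub_metadata=True,
        remove_javascript=True,
        remove_embedded_files=True,
    )
    return pending
```

The changes still need to be written out afterwards with doc.save() or doc.to_bytes().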

doc.sanitize_document(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True
) -> None

Sanitize the document without redacting regions: strip metadata, JavaScript, and/or embedded files.

Method	Parameters	Returns	Description
`apply_page_redactions(page)`	`int`	–	Apply redactions on a page
`apply_all_redactions()`	–	–	Apply all pending redactions
`is_page_marked_for_redaction(page)`	`int`	`bool`	Check if page is marked for redaction
`unmark_page_for_redaction(page)`	`int`	–	Unmark a page for redaction

Layers & Inks

Method	Parameters	Returns	Description
`get_layers()`	–	`list[str]`	List optional-content (OCG) layer names
`get_page_inks(page)`	`int`	`list[str]`	List ink / separation colorant names on a page
`get_page_inks_deep(page)`	`int`	`list[str]`	List inks including those nested in Form XObjects

Header / Footer Cleanup

doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int

Detect and remove repeating headers, footers, or page artifacts across the document. threshold is the cross-page repetition ratio. Returns the number of elements removed.

Method	Parameters	Description
`erase_header(page)`	`int`	Erase the detected header region on a page
`edit_header(page)`	`int`	Mark the header region for editing
`erase_footer(page)`	`int`	Erase the detected footer region on a page
`edit_footer(page)`	`int`	Mark the footer region for editing
`erase_artifacts(page)`	`int`	Erase detected artifacts on a page
`sync_editor_erasures()`	–	Flush pending header/footer/artifact erasures into the editor

Form Fields

doc.get_form_fields() -> list[FormField]

Get all form fields. See FormField for properties.

doc.get_form_field_value(name: str) -> str | bool | list | None

Get a form field value by name. Returns the appropriate Python type based on the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Set a form field value by name.
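
A sketch of filling several fields from a dict while guarding against misspelled names. It uses get_form_fields() to learn which names exist before calling set_form_field_value(); the helper name fill_form and the idea of returning the unmatched names are assumptions for illustration.

```python
def fill_form(doc, values):
    """Set each value whose name matches an existing field.

    Returns the list of names in `values` that the document does not contain,
    so typos surface instead of being silently ignored.
    """
    known = {field.name for field in doc.get_form_fields()}
    missing = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
        else:
            missing.append(name)
    return missing
```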

doc.has_xfa() -> bool

Check if the document contains XFA forms.

doc.export_form_data(path: str, format: str = "fdf") -> None

Export form data to a file. Supported formats: "fdf" and "xfdf".

Method	Parameters	Description
`flatten_forms()`	–	Flatten all form fields into page content
`flatten_forms_on_page(page)`	`int`	Flatten forms on a specific page

Image Manipulation

doc.page_images(page: int) -> list[dict]

Get image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method	Parameters	Description
`reposition_image(page, name, x, y)`	`int, str, float, float`	Move an image
`resize_image(page, name, width, height)`	`int, str, float, float`	Resize an image
`set_image_bounds(page, name, x, y, width, height)`	`int, str, float, float, float, float`	Set image position and size
`clear_image_modifications(page)`	`int`	Clear pending image modifications
`has_image_modifications(page)`	`int` → `bool`	Check for pending image modifications

Document Operations

doc.merge_from(source: str | PdfDocument) -> int

Merge pages from another PDF. Accepts a file path or PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attach a file to the PDF.

doc.get_outline() -> list[dict] | None

Get document bookmarks / table of contents. Returns None if no outline exists.

doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]

Get vector paths (lines, curves, shapes) from a page.

doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]

Get axis-aligned rectangles (from filled/stroked paths) on a page.

doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]

Get straight line segments on a page.

doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]

Detect and extract tables. Each table is a dict with rows and cells (text + bounding boxes). Pass table_settings to tune detection strategy.

doc.extract_structured(page: int) -> str

Extract the page as a structured JSON document (logical reading order, blocks, and roles).

doc.page_labels() -> list[dict]

Get page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Get XMP metadata as a dictionary with fields like dc_title, dc_creator, xmp_create_date, etc. Returns None if no XMP metadata exists.

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extract text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine or None for the default engine.

Page Extraction & Reordering

doc.extract_pages(pages: list[int], output: str) -> None

Extract the given page indices into a new PDF file at output.

doc.extract_pages_to_bytes(pages: list[int]) -> bytes

Extract the given page indices into a new PDF returned as bytes.
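
One use of extract_pages_to_bytes() is splitting a long document into fixed-size parts. A minimal sketch, assuming page indices are zero-based as elsewhere in this reference; the helper name split_every is hypothetical.

```python
def split_every(doc, pages_per_chunk):
    """Split a PdfDocument into consecutive PDFs of at most `pages_per_chunk` pages.

    Returns a list of bytes objects, one per output PDF.
    """
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    total = doc.page_count()
    chunks = []
    for start in range(0, total, pages_per_chunk):
        # The final chunk may be shorter than pages_per_chunk.
        indices = list(range(start, min(start + pages_per_chunk, total)))
        chunks.append(doc.extract_pages_to_bytes(indices))
    return chunks
```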

doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes

Extract one or more (start, end) page ranges into a new PDF returned as bytes.

Method	Parameters	Description
`select_pages(pages)`	`list[int]`	Keep only the listed pages, in the given order
`delete_page(index)`	`int`	Delete a single page
`move_page(from_index, to_index)`	`int, int`	Move a page to a new position

Compliance & Validation

doc.validate_pdf_a(level: str = "1b") -> dict

Validate against a PDF/A conformance level (e.g. "1b", "2b", "3b"). Returns a report dict.

doc.convert_to_pdf_a(level: str = "2b") -> dict

Convert the document to PDF/A and return a conversion report dict.

doc.validate_pdf_ua() -> dict

Validate against PDF/UA (accessibility) requirements.

doc.validate_pdf_x(level: str = "1a_2001") -> dict

Validate against a PDF/X (print-production) conformance level.

Permissions & Warnings

doc.permissions() -> dict

Return the document’s encryption permission flags (print, copy, modify, annotate, etc.).

doc.structured_warnings() -> list

Return warnings collected during structured / tagged-content extraction.

doc.flatten_warnings() -> list[str]

Return warnings collected during form/annotation flattening.

Signatures & Document Security Store

doc.signatures() -> list[Signature]

Return all digital signatures in the document. See Signature.

doc.signature_count() -> int

Return the number of digital signatures.

doc.dss() -> Dss | None

Return the document’s parsed Document Security Store (LTV material), or None. See Dss.

Page API (v0.3.34)

PdfDocument is iterable and indexable, returning lazy Page objects. See Page.

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages

DOM Access

doc.page(index: int) -> PdfPage

Get a DOM-like page handle for element-level editing. See PdfPage.

doc.save_page(page: PdfPage) -> None

Save a modified PdfPage back to the document.

Rendering

doc.render_page(
    page: int,
    dpi: int | None = None,
    format: str | None = None,
    background: tuple[float, float, float, float] | None = None,
    transparent: bool = False,
    render_annotations: bool | None = None,
    jpeg_quality: int | None = None,
    excluded_layers: list[str] | None = None
) -> bytes

Render a page to PNG or JPEG bytes with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.
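
Since render_page() returns either PNG or JPEG bytes and the accepted format values are not listed here, one defensive option is to pick the file extension from the data itself. The sketch below checks the standard PNG and JPEG file signatures; the helper names and the 150 DPI default are hypothetical.

```python
from pathlib import Path

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG file signature
JPEG_MAGIC = b"\xff\xd8\xff"      # JPEG start-of-image marker


def image_extension(data: bytes) -> str:
    """Guess a file extension from the leading bytes of rendered image data."""
    if data.startswith(PNG_MAGIC):
        return ".png"
    if data.startswith(JPEG_MAGIC):
        return ".jpg"
    return ".bin"  # unknown format: keep the bytes, do not guess


def render_to_file(doc, page, out_stem, dpi=150):
    """Render `page` with doc.render_page() and write it next to `out_stem`."""
    data = doc.render_page(page, dpi=dpi)
    path = Path(str(out_stem) + image_extension(data))
    path.write_bytes(data)
    return path
```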

doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap

Render a page to a raw RGBA RenderedPixmap (named tuple with width, height, data).

doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]

Render per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).

doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate

Render a single named ink separation plate.

Method	Return Type	Description
`render_page_fit(page, fit_width, fit_height, format=0)`	`bytes`	Render a page scaled to fit a pixel box
`flatten_to_images(dpi=150)`	`bytes`	Flatten all pages to image-based PDF

Saving

doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None

Save the PDF to a file. Toggle stream compression, dead-object garbage collection, and linearization (fast web view).

doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes

Serialize the PDF to bytes with the same options as save().

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Save with AES-256 password protection and permission controls. If owner_password is None, the user password is used.
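
Because the owner password falls back to the user password when omitted, permission flags are typically only meaningful when the two differ: anyone holding the owner password gets full access. The sketch below enforces that before saving a locked-down copy; the helper name and the choice to disable all four permissions are assumptions for illustration.

```python
def save_locked_down(doc, path, user_password, owner_password):
    """Save an encrypted copy that can be opened but not printed, copied, or edited."""
    if not owner_password or owner_password == user_password:
        raise ValueError("use a distinct owner password so restrictions are meaningful")
    doc.save_encrypted(
        path,
        user_password,
        owner_password=owner_password,
        allow_print=False,
        allow_copy=False,
        allow_modify=False,
        allow_annotate=False,
    )
```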

doc.to_bytes_encrypted(
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> bytes

Serialize an AES-256 encrypted PDF to bytes.

Page

A lazy page handle returned by doc[i] or iteration over PdfDocument. All properties are computed on access and dispatch to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property	Type	Description
`index`	`int`	Zero-based page index
`width`, `height`	`float`	Page dimensions in PDF points
`bbox`	`tuple[float, float, float, float]`	`(llx, lly, urx, ury)`
`text`	`str`	Extracted plain text
`chars`, `words`, `lines`, `spans`	`list[...]`	Structured text
`tables`	`list[dict]`	Tables with rows + cells (text + bboxes)
`images`, `paths`, `annotations`	`list[...]`	Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
            render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

Lazy page objects are also available from doc.pages(), an iterator equivalent to iterating the document directly.

PdfPage

DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page().

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property	Type	Description
`index`	`int`	Zero-based page index
`width`	`float`	Page width in PDF points
`height`	`float`	Page height in PDF points

Methods

page.children() -> list[PdfElement]

Get all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Find all text elements containing the given substring.

page.find_images() -> list[PdfImage]

Find all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Get a specific element by its ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replace the text content of an element identified by its PdfTextId.

page.annotations() -> list[PdfAnnotation]

Get all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Add a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Add a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Add a sticky note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Remove an annotation by index. Returns True if removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Add new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Remove an element by its ID. Returns True if removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")

Pdf

The unified class for creating PDFs from various source formats.

from pdf_oxide import Pdf

Factory Methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from plain text.

Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from Markdown rendered through a named CSS/layout template.

Pdf.from_image(path: str) -> Pdf

Create a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Open an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, HTTP, or databases.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

Create a multi-page PDF from multiple image files, one page per image.

Pdf.from_image_bytes(data: bytes) -> Pdf

Create a single-page PDF from image bytes.

Pdf.merge(paths: list[str]) -> Pdf

Merge multiple PDF files (by path) into a single Pdf.

Methods

pdf.save(path: str) -> None

Save the PDF to a file.

pdf.to_bytes() -> bytes

Get the PDF content as bytes.

len(pdf) -> int

Get the PDF size in bytes (via __len__).

PdfText

Represents a text element on a page. Returned by PdfPage.find_text_containing().

Property	Type	Description
`id`	`PdfTextId`	Unique element identifier
`value`	`str`	Text content
`text`	`str`	Text content (alias for `value`)
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether text is bold
`is_italic`	`bool`	Whether text is italic

Methods

Method	Parameters	Returns	Description
`contains(needle)`	`str`	`bool`	Check if text contains substring
`starts_with(prefix)`	`str`	`bool`	Check if text starts with prefix
`ends_with(suffix)`	`str`	`bool`	Check if text ends with suffix

PdfImage

Represents an image element on a page. Returned by PdfPage.find_images().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`width`	`int`	Image width in pixels
`height`	`int`	Image height in pixels
`aspect_ratio`	`float`	Width / height ratio

PdfAnnotation

Represents an annotation on a page. Returned by PdfPage.annotations().

Property	Type	Description
`subtype`	`str`	Annotation type (e.g., `"Link"`, `"Highlight"`, `"Text"`)
`rect`	`tuple[float, float, float, float]`	Position `(x0, y0, x1, y1)`
`contents`	`str \| None`	Annotation text contents, if any
`color`	`tuple[float, float, float] \| None`	RGB color, if set
`is_modified`	`bool`	Whether the annotation has been modified
`is_new`	`bool`	Whether the annotation is newly added

PdfElement

Generic element wrapper. Returned by PdfPage.children().

Method	Returns	Description
`is_text()`	`bool`	Check if element is text
`is_image()`	`bool`	Check if element is an image
`is_path()`	`bool`	Check if element is a vector path
`is_table()`	`bool`	Check if element is a table
`is_structure()`	`bool`	Check if element is a structure element
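`as_text()`	`PdfText \| None`	Return the element as `PdfText`, or `None` if it is not text
`as_image()`	`PdfImage \| None`	Return the element as `PdfImage`, or `None` if it is not an image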

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box

TextChar

Represents a single character with positioning and font metadata. Returned by PdfDocument.extract_chars().

from pdf_oxide import TextChar  # or access via PdfDocument

Attribute	Type	Description
`char`	`str`	The Unicode character
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`font_weight`	`str`	Weight (`"thin"`, `"light"`, `"normal"`, `"medium"`, `"semi-bold"`, `"bold"`, `"extra-bold"`, `"black"`)
`is_italic`	`bool`	Whether the character is italic
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0
`rotation_degrees`	`float`	Character rotation in degrees
`origin_x`	`float`	Text origin X position
`origin_y`	`float`	Text origin Y position
`advance_width`	`float`	Glyph advance width
`mcid`	`int \| None`	Marked-content ID, if any

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")

TextSpan

Represents a run of text sharing the same font and style. Returned by PdfDocument.extract_spans().

Attribute	Type	Description
`text`	`str`	The text content
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether the span is bold
`is_italic`	`bool`	Whether the span is italic
`is_monospace`	`bool`	Whether the font is fixed-width (Courier, Consolas, etc.)
`char_widths`	`list[float]`	Per-glyph advance widths for accurate bounding boxes
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

Image Extraction

extract_images() returns ImageInfo objects with image metadata. Use extract_image_bytes() for raw image data suitable for saving to disk.

extract_image_bytes() Return Format

Each dict returned by extract_image_bytes() has the following keys:

Key	Type	Description
`width`	`int`	Image width in pixels
`height`	`int`	Image height in pixels
`data`	`bytes`	Raw image data
`format`	`str`	Image format (e.g., `"png"`, `"jpeg"`)

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a text search match. Returned by search() and search_page().

Attribute	Type	Description
`page`	`int`	Zero-based page index
`text`	`str`	Matched text
`x`	`float`	X position in PDF points
`y`	`float`	Y position in PDF points

FormField

Represents a form field. Returned by PdfDocument.get_form_fields().

Property	Type	Description
`name`	`str`	Fully qualified field name
`field_type`	`str`	Field type: `"text"`, `"button"`, `"choice"`, `"signature"`, or `"unknown"`
`value`	`str \| bool \| list \| None`	Current field value (same types as `get_form_field_value()`)
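`tooltip`	`str \| None`	Tooltip text, if any
`bounds`	`tuple[float, float, float, float] \| None`	Field bounding box, if any
`flags`	`int \| None`	Field flags, if present
`max_length`	`int \| None`	Maximum text length, if set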
`is_readonly`	`bool`	Whether the field is read-only
`is_required`	`bool`	Whether the field is required

TextWord

A word-grouped run of text. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().

Property	Type	Description
`text`	`str`	The word text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether the word is bold
`is_italic`	`bool`	Whether the word is italic
`chars`	`list[TextChar]`	Constituent characters

TextLine

A line-grouped run of text. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().

Property	Type	Description
`text`	`str`	The line text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`words`	`list[TextWord]`	Words in the line
`chars`	`list[TextChar]`	Characters in the line
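Because words and lines share the `(x0, y0, x1, y1)` bounding-box layout, combining several boxes into one enclosing box is a simple min/max. The sketch below assumes `x0 <= x1` and `y0 <= y1` in every box; `union_bbox` is a hypothetical helper, not a PDF Oxide function.

```python
def union_bbox(boxes):
    """Smallest (x0, y0, x1, y1) box enclosing every box in the iterable."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("union_bbox() needs at least one box")
    x0s, y0s, x1s, y1s = zip(*boxes)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


# Two word boxes on the same line (hypothetical coordinates, in points).
word_boxes = [(10.0, 700.0, 50.0, 712.0), (55.0, 698.0, 90.0, 712.0)]
line_box = union_bbox(word_boxes)
print(line_box)
```

With real data this could be called as `union_bbox(w.bbox for w in line.words)`.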

PdfPageRegion

A clipped region of a page. Returned by PdfDocument.within() and PdfPage.region().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	The region’s bounds

Methods

region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list

Extraction methods scoped to the region’s bounding box.

LayoutParams

Computed adaptive layout parameters for a page. Returned by PdfDocument.page_layout_params().

Property	Type	Description
`word_gap_threshold`	`float`	Inter-word gap threshold in points
`line_gap_threshold`	`float`	Inter-line gap threshold in points
`median_char_width`	`float`	Median character width
`median_font_size`	`float`	Median font size
`median_line_spacing`	`float`	Median line spacing
`column_count`	`int`	Detected number of text columns
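To give a feel for what a gap threshold such as `word_gap_threshold` controls, here is a minimal sketch of gap-based word splitting. It is an illustration only and is not PDF Oxide's actual algorithm; the `(text, x0, x1)` tuples and the function name are assumptions made for the example.

```python
def split_into_words(chars, word_gap_threshold):
    """Split characters on one line into words wherever the horizontal gap
    between consecutive characters exceeds word_gap_threshold (points).

    chars: list of (text, x0, x1) tuples sorted left to right.
    """
    words, current, prev_x1 = [], "", None
    for text, x0, x1 in chars:
        if prev_x1 is not None and x0 - prev_x1 > word_gap_threshold:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Hypothetical character positions: a 5 pt gap separates "Hi" from "yo".
chars = [("H", 0.0, 5.0), ("i", 5.0, 7.0), ("y", 12.0, 17.0), ("o", 17.0, 22.0)]
words = split_into_words(chars, word_gap_threshold=2.0)
print(words)
```

Raising the threshold above the 5 pt gap would merge the two words into one, which is the kind of trade-off an adaptive threshold is meant to settle per page.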

ExtractionProfile

A tunable text-extraction profile passed to extract_words() / extract_text_lines().

from pdf_oxide import ExtractionProfile

Static Constructors

ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str]   # names of all built-in profiles

Properties

Property	Type	Description
`name`	`str`	Profile name
`tj_offset_threshold`	`float`	TJ array offset word-break threshold
`word_margin_ratio`	`float`	Word margin ratio
`space_threshold_em_ratio`	`float`	Space-width threshold (em ratio)
`space_char_multiplier`	`float`	Space-character multiplier
`use_adaptive_threshold`	`bool`	Whether adaptive thresholds are enabled

OfficeConverter

Convert Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

OfficeConverter()   # instances are stateless; the conversion methods are also usable as static methods

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Convert a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Convert Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Convert an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Convert Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Convert a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Convert PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Auto-detect format and convert any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")

Graphics Classes

These classes are available for advanced PDF creation with graphics:

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
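The constructor takes components in the 0.0-1.0 range, while hex strings encode each channel as 0-255. The sketch below shows that conversion for a six-digit `#rrggbb` string. It is a stand-alone illustration: `hex_to_rgb` is a hypothetical helper, and which input forms `Color.from_hex` itself accepts (shorthand, missing `#`) is not covered here.

```python
def hex_to_rgb(value: str):
    """Convert '#rrggbb' into an (r, g, b) tuple of floats in 0.0-1.0."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {value!r}")
    return tuple(int(digits[i:i + 2], 16) / 255.0 for i in (0, 2, 4))


red = hex_to_rgb("#ff0000")
print(red)
```

The resulting tuple can be unpacked into the constructor, for example `Color(*hex_to_rgb("#336699"))`.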

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState, BlendMode

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR Classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

engine = OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

config = OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

DocumentBuilder

Fluent builder for composing PDFs page by page. See the example below and Create from scratch.

from pdf_oxide import DocumentBuilder

Document-Level Methods

Method	Parameters	Description
`DocumentBuilder()`	–	Construct a new builder
`title(title)`	`str`	Set document title
`author(author)`	`str`	Set document author
`subject(subject)`	`str`	Set document subject
`keywords(keywords)`	`str`	Set document keywords
`creator(creator)`	`str`	Set the producing application name
`on_open(script)`	`str`	Set a document-level open JavaScript action
`tagged_pdf_ua1()`	–	Emit a Tagged PDF/UA-1 accessible document
`language(lang)`	`str`	Set the document language (e.g. `"en-US"`)
`role_map(custom, standard)`	`str, str`	Map a custom structure tag to a standard one
`register_embedded_font(name, font)`	`str, EmbeddedFont`	Register a font (consumes the `EmbeddedFont`)

Page Factories

builder.a4_page() -> FluentPageBuilder       # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder   # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder
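Page dimensions are given in PDF points, where 1 pt is 1/72 inch. For custom sizes specified in millimetres or inches, a small conversion helper is handy. The function names below are hypothetical and not part of PDF Oxide.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def inch_to_pt(inches: float) -> float:
    """Convert inches to PDF points."""
    return inches * POINTS_PER_INCH


# A4 is 210 x 297 mm; US Letter is 8.5 x 11 in.
a4 = (round(mm_to_pt(210)), round(mm_to_pt(297)))
letter = (round(inch_to_pt(8.5)), round(inch_to_pt(11)))
print(a4, letter)
```

The rounded results match the sizes noted for `a4_page()` and `letter_page()`; for an A5 page one could pass `mm_to_pt(148)` and `mm_to_pt(210)` to `builder.page()`.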

Output

builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes

FluentPageBuilder

Buffers page-level operations until done(). Returned by DocumentBuilder.a4_page() / letter_page() / page(). Methods return self for chaining unless a different return type is shown (measure(), remaining_space(), streaming_table()); done() commits the page and returns the parent DocumentBuilder.

Text & Layout

Method	Parameters	Description
`font(name, size)`	`str, float`	Set the current font and size
`at(x, y)`	`float, float`	Move the cursor to an absolute position
`text(text)`	`str`	Draw text at the cursor
`heading(level, text)`	`int, str`	Draw a heading (level 1–6)
`paragraph(text)`	`str`	Draw a wrapped paragraph
`space(points)`	`float`	Advance vertical space
`horizontal_rule()`	–	Draw a horizontal divider
`columns(column_count, gap_pt, text)`	`int, float, str`	Balanced multi-column text flow
`footnote(ref_mark, note_text)`	`str, str`	Inline reference mark + bottom-of-page note
`new_page_same_size()`	–	Start a fresh page with the same dimensions
`measure(text) -> float`	`str`	Measure rendered text width in points
`remaining_space() -> float`	–	Remaining vertical space on the page

Inline Runs

page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()

Links & Actions

page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)

Markup Annotations

page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)

AcroForm Widgets

page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)

Graphics

page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)

Images & Barcodes

page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)

barcode_type: 0=Code128, 1=Code39, 2=EAN13, 3=EAN8, 4=UPCA, 5=ITF, 6=Code93, 7=Codabar.
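Bare integers are easy to mix up, so calling code may prefer named constants. The sketch below mirrors the mapping above in a local `IntEnum`. The class name `BarcodeType` is an assumption made for this example and is not exported by PDF Oxide.

```python
from enum import IntEnum


class BarcodeType(IntEnum):
    """Local names for the integer barcode_type codes listed above."""
    CODE128 = 0
    CODE39 = 1
    EAN13 = 2
    EAN8 = 3
    UPCA = 4
    ITF = 5
    CODE93 = 6
    CODABAR = 7


code = int(BarcodeType.EAN13)
print(code, BarcodeType(5).name)
```

Passing `int(BarcodeType.EAN13)` keeps the call site readable while still handing the binding a plain `int`.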

Tables

page.table(table: Table)
page.streaming_table(
    columns: list[Column],
    repeat_header: bool = False,
    mode: str = "fixed",
    sample_rows: int = 50,
    min_col_width_pt: float = 20.0,
    max_col_width_pt: float = 400.0,
    max_rowspan: int = 1,
    batch_size: int = 256
) -> StreamingTable

Commit

page.done() -> DocumentBuilder

EmbeddedFont

A TTF/OTF font registered with a DocumentBuilder.

from pdf_oxide import EmbeddedFont

EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont

Property	Type	Description
`name`	`str`	The font’s registered name

Tables

Value objects for the fluent table API.

Align

from pdf_oxide import Align

Align.LEFT     # 0
Align.CENTER   # 1
Align.RIGHT    # 2

Column

from pdf_oxide import Column

Column(header: str, width: float = 100.0, align: Align | int | None = None)

Property	Type	Description
`header`	`str`	Column header text
`width`	`float`	Column width in points
`align`	`int`	Cell alignment

Table

from pdf_oxide import Table

Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)

A buffered table consumed by FluentPageBuilder.table(). With has_header=True, the column headers render as a styled header row.

StreamingTable

A row-streaming table handle returned by FluentPageBuilder.streaming_table(), for tables too large to materialize at once.

Method	Parameters	Description
`push_row(cells)`	`list[str]`	Append a row of cell strings
`push_row_span(cells)`	`list[tuple[str, int]]`	Append a row of `(text, colspan)` cells
`flush()`	–	Flush the current batch
`finish()`	–	Finish the table, returning the `FluentPageBuilder`
`column_count() -> int`	–	Number of columns
`pending_row_count() -> int`	–	Rows buffered but not yet committed
`batch_count() -> int`	–	Number of completed batches

Page Templates

Repeating header/footer artifacts applied across pages.

Artifact / ArtifactStyle

from pdf_oxide import Artifact, ArtifactStyle

Artifact()                       # empty artifact
Artifact.center(text: str)       # centered artifact text
artifact.with_left(text: str)    # add left-aligned text

style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()

Header / Footer

from pdf_oxide import Header, Footer

Header()                  # or Header.center(text: str)
Footer()                  # or Footer.center(text: str)

PageTemplate

from pdf_oxide import PageTemplate, Header, Footer

template = (PageTemplate()
    .header(Header.center("Confidential"))
    .footer(Footer.center("Page")))

Digital Signatures

Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures (and optionally tsa-client) features in the Rust build.

Certificate

from pdf_oxide import Certificate

Certificate.load(data: bytes) -> Certificate                       # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate   # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential

Method	Returns	Description
`subject()`	`str`	Certificate subject DN
`issuer()`	`str`	Certificate issuer DN
`serial()`	`str`	Serial number
`validity()`	`tuple[int, int]`	`(not_before, not_after)` Unix timestamps
`is_valid()`	`bool`	Whether the certificate is currently within its validity window
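The `(not_before, not_after)` pair from `validity()` can also be inspected directly, for example to warn before a certificate expires. The sketch below assumes the timestamps are seconds since the Unix epoch; both helper names are hypothetical.

```python
import time


def validity_status(validity, now=None):
    """Classify a (not_before, not_after) pair relative to `now` (Unix seconds)."""
    not_before, not_after = validity
    now = time.time() if now is None else now
    if now < not_before:
        return "not yet valid"
    if now > not_after:
        return "expired"
    return "valid"


def days_remaining(validity, now=None):
    """Days until not_after; negative once the certificate has expired."""
    _, not_after = validity
    now = time.time() if now is None else now
    return (not_after - now) / 86400


# Hypothetical validity window with a fixed clock so the result is reproducible.
window = (1_700_000_000, 1_800_000_000)
status = validity_status(window, now=1_750_000_000)
print(status, days_remaining(window, now=1_800_000_000 - 86400))
```

With a loaded certificate this would be called as `validity_status(cert.validity())`.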

Signature

Returned by PdfDocument.signatures().

Property / Method	Type	Description
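`signer_name`	`str | None`	Signer name, if present
`reason`	`str | None`	Stated reason for signing, if present
`location`	`str | None`	Signing location, if present
`contact_info`	`str | None`	Signer contact information, if present
`signing_time`	`int | None`	Signing time, if present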
`covers_whole_document`	`bool`	Whether the signature covers the entire file
`pades_level`	`PadesLevel`	Detected PAdES baseline (B-B/B-T/B-LT)
`verify()`	`bool`	Verify the signature cryptographically
`verify_detached(pdf_data)`	`bool`	Verify including the `messageDigest` against the file bytes

Timestamp

from pdf_oxide import Timestamp

Timestamp.parse(data: bytes) -> Timestamp

Property / Method	Type	Description
`time`	`int`	Timestamp time (Unix)
`serial`	`str`	TSA response serial number
`policy_oid`	`str`	TSA policy OID
`tsa_name`	`str`	TSA name
`hash_algorithm`	`int`	Message-imprint hash algorithm code
`message_imprint`	`bytes`	The hashed message imprint
`verify()`	`bool`	Verify the timestamp token

TsaClient

from pdf_oxide import TsaClient

client = TsaClient(
    url: str,
    username: str | None = None,
    password: str | None = None,
    timeout_seconds: int = 30,
    hash_algorithm: int = 2,
    use_nonce: bool = True,
    cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp

PadesLevel

from pdf_oxide import PadesLevel

PadesLevel.B_B     # baseline
PadesLevel.B_T     # + trusted timestamp
PadesLevel.B_LT    # + long-term validation material
PadesLevel.B_LTA   # + archival timestamp

RevocationMaterial

from pdf_oxide import RevocationMaterial

RevocationMaterial(
    certs: list[bytes] | None = None,
    crls: list[bytes] | None = None,
    ocsps: list[bytes] | None = None
)

DER-encoded certificates, CRLs, and OCSP responses for B-LT signing.

Dss

A parsed Document Security Store, returned by PdfDocument.dss().

Property	Type	Description
`certs`	`list[bytes]`	Document-level certificate DER blobs
`crls`	`list[bytes]`	CRL DER blobs
`ocsps`	`list[bytes]`	OCSP response DER blobs
`vri`	`list[str]`	Per-signature VRI keys (hex SHA-1 of `/Contents`)
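To match a `vri` entry to a particular signature, the key can be recomputed from that signature's `/Contents` bytes. The sketch below uses the uppercase hex form that PAdES specifies for VRI keys. Whether PDF Oxide returns the keys in upper or lower case is an assumption to check, so the comparison helper ignores case. Both function names are hypothetical.

```python
import hashlib


def vri_key(signature_contents: bytes) -> str:
    """Uppercase hex SHA-1 digest of a signature's /Contents bytes."""
    return hashlib.sha1(signature_contents).hexdigest().upper()


def has_vri_for(vri_keys, signature_contents: bytes) -> bool:
    """Case-insensitive check that a VRI key exists for the given signature."""
    wanted = vri_key(signature_contents)
    return any(key.upper() == wanted for key in vri_keys)


key = vri_key(b"abc")  # stand-in bytes, not a real CMS signature
print(key)
```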

Module-Level Functions

from pdf_oxide import (
    sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
    generate_barcode_svg, generate_qr_svg,
    plan_split_by_bookmarks, split_by_bookmarks,
)

Signing

sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes

Sign raw PDF bytes with a loaded signing Certificate and return the signed PDF.

sign_pdf_bytes_pades(
    pdf_data: bytes,
    cert: Certificate,
    level: PadesLevel,
    tsa_url: str | None = None,
    reason: str | None = None,
    location: str | None = None,
    revocation: RevocationMaterial | None = None
) -> bytes

Sign raw PDF bytes at a PAdES baseline level. B_T/B_LT require a tsa_url.

has_document_timestamp(pdf_data: bytes) -> bool

Whether the PDF carries a document-level RFC 3161 archival timestamp (PAdES-B-LTA).

Barcodes

generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str

Generate a 1D barcode or QR code as an SVG string. Requires the barcodes feature.

Split by Bookmarks

plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]

Plan or perform a split of a PDF at bookmark boundaries. plan_* returns segment metadata only; split_* returns each segment paired with its PDF bytes.
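Since `split_by_bookmarks` returns `(metadata, bytes)` pairs, writing the segments to disk is a short loop. The sketch below names files by position because the keys of the metadata dict are not listed here; the `"title"` key in the stand-in data and the `write_segments` helper are hypothetical.

```python
import tempfile
from pathlib import Path


def write_segments(segments, out_dir):
    """Write each (metadata, pdf_bytes) pair to out_dir as segment_NNN.pdf."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (_meta, pdf_bytes) in enumerate(segments, start=1):
        path = out / f"segment_{index:03d}.pdf"
        path.write_bytes(pdf_bytes)
        paths.append(path)
    return paths


# Stand-in segments; real ones would come from split_by_bookmarks(src_bytes).
fake_segments = [({"title": "Intro"}, b"%PDF-a"), ({"title": "Body"}, b"%PDF-b")]
with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in write_segments(fake_segments, tmp)]
print(names)
```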

OCR Model Provisioning

prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool

Provision OCR models for offline/air-gapped use, inspect the model manifest (JSON), and check whether this build can download models.

Logging

setup_logging() -> None
set_log_level(level: str) -> None     # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None

Engine Tuning

set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool

Adjust the per-stream operator cap (adversarial-input protection) and U+FFFD preservation for unmapped glyphs. Both return the previous value.

Cryptographic Governance

crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None                 # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None      # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str                      # CycloneDX 1.6 CBOM (JSON)

Asynchronous API

The asynchronous API provides async/await wrappers that run blocking operations in a thread pool. Methods mirror their synchronous counterparts.

from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter

import asyncio

async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()
    # Or open from bytes and use as an async context manager:
    with open("input.pdf", "rb") as f:
        pdf_bytes = f.read()
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()

asyncio.run(main())

Class	Constructors	Notes
`AsyncPdfDocument`	`await AsyncPdfDocument.open(path, password=None)`, `await AsyncPdfDocument.from_bytes(data, password=None)`	All `PdfDocument` methods are available as awaitables; supports `async with` and `.close()`
`AsyncPdf`	wraps `Pdf` factory methods	`await pdf.save(path)`, `await pdf.to_bytes()`
`AsyncOfficeConverter`	wraps `OfficeConverter` static methods	e.g. `await AsyncOfficeConverter.from_docx(path)`
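For readers unfamiliar with the thread-pool pattern these wrappers are described as using, here is a minimal standard-library sketch. It is not PDF Oxide's implementation; `run_blocking` and `slow_extract` are hypothetical names, and `slow_extract` merely simulates a blocking call.

```python
import asyncio
import functools
import time


async def run_blocking(func, *args, **kwargs):
    """Run a blocking callable in the default thread pool and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))


def slow_extract(page: int) -> str:
    """Stand-in for a blocking extraction call."""
    time.sleep(0.05)
    return f"text of page {page}"


async def main():
    # The three calls overlap in worker threads; gather keeps argument order.
    return await asyncio.gather(*(run_blocking(slow_extract, i) for i in range(3)))


pages = asyncio.run(main())
print(pages)
```

The event loop stays free to service other tasks while the worker threads block, which is the main benefit of wrapping synchronous calls this way.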

Error Handling

PdfError

All PDF-specific errors raise PdfError:

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

Common error scenarios:

Exception	Cause
`PdfError`	Malformed PDF, encrypted without password, parse failure
`FileNotFoundError`	File does not exist
`IndexError`	Page index is out of range (greater than or equal to `page_count()`)
`ValueError`	Invalid argument (e.g., negative page index)

Complete Example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

v0.3.38 additions

`DocumentBuilder` / `FluentPageBuilder` / `EmbeddedFont`

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(width, height)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext(100, 500, 200, 50, "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 types)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, 0.9, 0.9, 0.9)
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

See Create from HTML.

Signature verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str | None
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # int | None
    sig.verify()                     # bool
    sig.verify_detached(pdf_bytes)   # adds messageDigest check

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

See Digital Signatures for details.

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates in PDF points from lower-left.
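Region coordinates picked in an image viewer usually have a top-left origin and are measured in pixels, so they need converting before being passed as PDF points from the lower-left. The sketch below shows that conversion. It assumes `(x, y)` names the region's lower-left corner, and `pixel_rect_to_pdf` is a hypothetical helper.

```python
def pixel_rect_to_pdf(x_px, y_px, w_px, h_px, page_height_pt, dpi=72.0):
    """Convert a top-left-origin pixel rectangle into a lower-left-origin
    rectangle in PDF points, returned as (x, y, w, h)."""
    scale = 72.0 / dpi              # points per pixel at the given DPI
    w = w_px * scale
    h = h_px * scale
    x = x_px * scale
    # Flip the vertical axis: measure from the page bottom to the rect's bottom edge.
    y = page_height_pt - y_px * scale - h
    return (x, y, w, h)


# Top-left 144 x 144 px square of a Letter page rendered at 144 DPI.
region = pixel_rect_to_pdf(0, 0, 144, 144, page_height_pt=792, dpi=144)
print(region)
```

The resulting tuple could then be unpacked into a call such as `doc.render_page_region(0, *region)`.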

`Pdf` flatten

doc.flatten_to_images(dpi: int = 150) -> bytes

Other Language Bindings

Besides Python, PDF Oxide ships native bindings for Rust, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.

Next Steps

Types & Enums — all shared types and enums
Page API Reference — consistent per-page iteration across bindings
Getting Started with Python — tutorial