Python API Reference
PDF Oxide provides native Python bindings built with PyO3. Pre-built wheels are available for Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).
pip install pdf_oxide
For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or WASM API Reference. For type details, see Types & Enums.
PdfDocument
The primary class for opening, extracting, editing, and saving PDF files.
from pdf_oxide import PdfDocument
Constructor
PdfDocument(path: str, password: str | None = None)
| Parameter | Type | Description |
|---|---|---|
| path | str | Path to the PDF file |
| password | str \| None | Optional password for encrypted PDFs (default: None) |
Pass password= to open an encrypted PDF in one step. Alternatively, open the file first and then call doc.authenticate(password).
Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.
Class Methods
PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument
Open a PDF from in-memory bytes (e.g., downloaded from S3, received via HTTP). Accepts an optional password for encrypted PDFs.
| Parameter | Type | Description |
|---|---|---|
| data | bytes | Raw PDF file bytes |
| password | str \| None | Optional password for encrypted PDFs (default: None) |
from pdf_oxide import PdfDocument
# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)
# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")
Methods
General
| Method | Return Type | Description |
|---|---|---|
| version() | tuple[int, int] | PDF version as (major, minor) (e.g., (1, 7)) |
| authenticate(password) | bool | Authenticate an encrypted PDF with user or owner password |
Document Info
doc.page_count() -> int
Return the number of pages in the document.
doc.has_structure_tree() -> bool
Check if the document is a Tagged PDF with a structure tree.
Authentication
doc.authenticate(password: str) -> bool
Authenticate with a password after opening. Returns True if authentication succeeded.
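A minimal sketch of the open-then-authenticate flow, assuming `doc` is an already opened PdfDocument. The helper name `unlock_with_candidates` is hypothetical and not part of pdf_oxide; it relies only on the documented `authenticate(password) -> bool` call.

```python
from typing import Iterable, Optional


def unlock_with_candidates(doc, candidates: Iterable[str]) -> Optional[str]:
    """Try each password in turn and return the first one that authenticates.

    `doc` is assumed to expose authenticate(password) -> bool as documented
    above. This helper is illustrative only and not part of pdf_oxide.
    """
    for password in candidates:
        if doc.authenticate(password):
            return password
    # None of the candidates was accepted.
    return None
```

Returning the accepted password (rather than a bare bool) lets the caller record which credential worked.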
Text Extraction
doc.extract_text(
page: int,
region: tuple[float, float, float, float] | None = None,
exclude_layers: list[str] | None = None,
exclude_inks: list[str] | None = None,
extract_tables: bool = True
) -> str
Extract plain text from a single page. Pages are zero-indexed. Optionally clip to a region, exclude named optional-content layers or ink/separation names, and toggle table reconstruction.
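Because extract_text() works on one zero-indexed page at a time, whole-document extraction is a loop over page_count(). The following is a sketch under that assumption; `extract_all_text` is a hypothetical helper name, and `doc` is assumed to be an open PdfDocument.

```python
def extract_all_text(doc, separator: str = "\n\n") -> str:
    """Join the plain text of every page, in page order.

    Uses only the documented page_count() and extract_text(page) calls.
    The helper itself is illustrative and not part of pdf_oxide.
    """
    pages = (doc.extract_text(index) for index in range(doc.page_count()))
    return separator.join(pages)
```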
doc.extract_chars(
page: int,
region: tuple[float, float, float, float] | None = None,
exclude_layers: list[str] | None = None,
exclude_inks: list[str] | None = None
) -> list[TextChar]
Extract per-character positioning and font metadata. Returns a list of TextChar objects.
doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]
Extract text spans with font metadata. Each span is a run of identically-styled text. Pass reading_order="column_aware" for multi-column PDFs.
doc.extract_words(
page: int,
*,
include_artifacts: bool = True,
region: tuple | None = None,
word_gap_threshold: float | None = None,
profile: ExtractionProfile | None = None
) -> list[TextWord]
Extract word-grouped text with bounding boxes. Returns a list of TextWord objects.
doc.extract_text_lines(
page: int,
*,
include_artifacts: bool = True,
region: tuple | None = None,
word_gap_threshold: float | None = None,
line_gap_threshold: float | None = None,
profile: ExtractionProfile | None = None
) -> list[TextLine]
Extract line-grouped text. Returns a list of TextLine objects.
doc.extract_page_text(page: int, reading_order: str | None = None) -> dict
Extract spans, characters, and page dimensions in a single pass. Returns a dict with keys: spans, chars, page_width, page_height, text. More efficient than calling extract_spans() and extract_chars() separately.
doc.page_layout_params(page: int) -> LayoutParams
Compute adaptive layout parameters (word/line gap thresholds, median metrics, column count) for a page. See LayoutParams.
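One possible use of the computed parameters is to feed a scaled threshold back into extract_words(). This is a sketch, not a recommended tuning recipe: the helper name `extract_words_with_scaled_gap` and the `scale` parameter are hypothetical, and `doc` is assumed to be an open PdfDocument.

```python
def extract_words_with_scaled_gap(doc, page: int, scale: float = 1.0):
    """Extract words using the page's adaptive word gap, multiplied by `scale`.

    Relies on the documented page_layout_params(page).word_gap_threshold and
    the keyword-only word_gap_threshold argument of extract_words().
    Illustrative helper; not part of pdf_oxide.
    """
    params = doc.page_layout_params(page)
    threshold = params.word_gap_threshold * scale
    return doc.extract_words(page, word_gap_threshold=threshold)
```

A `scale` below 1.0 would split words more eagerly and a value above 1.0 would merge them more eagerly, assuming the threshold is the minimum gap that separates two words.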
doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion
Create a clipped region handle for extracting text, words, lines, tables, images, and paths inside bbox. See PdfPageRegion.
Auto Extraction & Classification
doc.extract_text_auto(page: int) -> str
Auto-select the best extraction strategy (native text vs. OCR) for a page and return plain text.
doc.extract_page_auto(page: int, options_json: str | None = None) -> str
Auto-extract a page and return a JSON document; pass a JSON options_json string to tune the pipeline.
doc.classify_page(page: int) -> str
Classify a single page (e.g. "text", "scanned", "mixed").
doc.classify_document() -> str
Classify the whole document by sampling its pages.
doc.has_text_layer(page: int) -> bool
Check whether a page already has an extractable native text layer (vs. requiring OCR).
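The check above can drive a manual fallback between native extraction and OCR. This is only a sketch of the idea (extract_text_auto() already makes a similar choice internally); `text_with_ocr_fallback` is a hypothetical name, and the OCR branch assumes a build with the ocr feature enabled.

```python
def text_with_ocr_fallback(doc, page: int) -> str:
    """Return native text when a text layer exists, otherwise OCR the page.

    Uses the documented has_text_layer(), extract_text(), and
    extract_text_ocr() calls. Illustrative helper; not part of pdf_oxide.
    """
    if doc.has_text_layer(page):
        return doc.extract_text(page)
    # No native text layer: fall back to the default OCR engine.
    return doc.extract_text_ocr(page)
```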
Conversion
doc.to_plain_text(
page: int,
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None
) -> str
Convert a page to plain text with layout options.
doc.to_plain_text_all(
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None
) -> str
Convert all pages to plain text.
doc.to_markdown(
page: int,
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
include_form_fields: bool = True
) -> str
Convert a page to Markdown.
doc.to_markdown_all(
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
include_form_fields: bool = True
) -> str
Convert all pages to Markdown.
doc.to_html(
page: int,
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
include_form_fields: bool = True
) -> str
Convert a page to HTML.
doc.to_html_all(
preserve_layout: bool = False,
detect_headings: bool = True,
include_images: bool = True,
image_output_dir: str | None = None,
embed_images: bool = True,
include_form_fields: bool = True
) -> str
Convert all pages to HTML.
Office Conversion
| Method | Return Type | Description |
|---|---|---|
| to_docx(path) | – | Convert the PDF to a Word document file |
| to_docx_bytes() | bytes | Convert the PDF to DOCX bytes |
| to_pptx(path) | – | Convert the PDF to a PowerPoint file |
| to_pptx_bytes() | bytes | Convert the PDF to PPTX bytes |
| to_xlsx(path) | – | Convert the PDF to an Excel workbook file |
| to_xlsx_bytes() | bytes | Convert the PDF to XLSX bytes |
Image Extraction
doc.extract_images(page: int) -> list[ImageInfo]
Extract all images from a page, including images in content streams and nested Form XObjects.
doc.extract_image_bytes(page: int) -> list[dict]
Extract raw image bytes from a page. Each dict contains width, height, data (bytes), and format.
Search
doc.search(
pattern: str,
case_insensitive: bool = False,
literal: bool = False,
whole_word: bool = False,
max_results: int = 0
) -> list[SearchResult]
Search for text across all pages. Set max_results=0 for unlimited results. Returns a list of matches with page number, text, and coordinates.
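Since each match carries its page index and matched text, results are easy to group per page. The sketch below assumes `doc` is an open PdfDocument and that search() accepts case_insensitive as a keyword argument; `matches_by_page` is a hypothetical helper name.

```python
from collections import defaultdict


def matches_by_page(doc, pattern: str, case_insensitive: bool = True) -> dict:
    """Group matched text by zero-based page index.

    Relies on the documented SearchResult attributes `page` and `text`.
    Illustrative helper; not part of pdf_oxide.
    """
    grouped = defaultdict(list)
    for match in doc.search(pattern, case_insensitive=case_insensitive):
        grouped[match.page].append(match.text)
    return dict(grouped)
```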
doc.search_page(
page: int,
pattern: str,
case_insensitive: bool = False,
literal: bool = False,
whole_word: bool = False,
max_results: int = 0
) -> list[SearchResult]
Search for text on a single page.
Metadata Editing
| Method | Parameters | Description |
|---|---|---|
| set_title(title) | str | Set document title |
| set_author(author) | str | Set document author |
| set_subject(subject) | str | Set document subject |
| set_keywords(keywords) | str | Set document keywords |
Page Rotation
| Method | Parameters | Returns | Description |
|---|---|---|---|
| page_rotation(page) | int | int | Get current rotation (0, 90, 180, 270) |
| set_page_rotation(page, degrees) | int, int | – | Set absolute rotation |
| rotate_page(page, degrees) | int, int | – | Add to current rotation |
| rotate_all_pages(degrees) | int | – | Rotate all pages |
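The difference between setting and adding a rotation can be illustrated with plain arithmetic. This sketch assumes the result is normalized to 0, 90, 180, or 270 by wrapping modulo 360; the reference above does not state how the library handles sums outside that range, so treat the wrapping as an assumption. `rotated` is a hypothetical helper, not a pdf_oxide function.

```python
def rotated(current: int, delta: int) -> int:
    """Return the rotation expected after adding `delta` to `current`.

    Assumes multiples of 90 degrees and modulo-360 normalization
    (an assumption, not confirmed by the reference above).
    """
    if current % 90 != 0 or delta % 90 != 0:
        raise ValueError("rotations must be multiples of 90 degrees")
    # Python's % always yields a non-negative result for a positive modulus,
    # so negative deltas wrap correctly (e.g. 0 + -90 gives 270).
    return (current + delta) % 360
```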
Page Dimensions
| Method | Parameters | Returns | Description |
|---|---|---|---|
| page_media_box(page) | int | tuple[float, float, float, float] | Get MediaBox (llx, lly, urx, ury) |
| set_page_media_box(page, llx, lly, urx, ury) | int, float, float, float, float | – | Set MediaBox |
| page_crop_box(page) | int | `tuple \| None` | |
| set_page_crop_box(page, llx, lly, urx, ury) | int, float, float, float, float | – | Set CropBox |
| crop_margins(left, right, top, bottom) | float, float, float, float | – | Crop all page margins |
Erase / Whiteout
| Method | Parameters | Description |
|---|---|---|
| erase_region(page, llx, lly, urx, ury) | int, float, float, float, float | Erase a rectangular region |
| erase_regions(page, rects) | int, list[tuple] | Erase multiple regions |
| clear_erase_regions(page) | int | Clear pending erase operations |
Annotations
doc.get_annotations(page: int) -> list[dict]
Get annotation metadata (type, rect, contents, etc.) for a page.
| Method | Parameters | Returns | Description |
|---|---|---|---|
| flatten_page_annotations(page) | int | – | Flatten annotations on a page |
| flatten_all_annotations() | – | – | Flatten all annotations |
| is_page_marked_for_flatten(page) | int | bool | Check if page is marked for flatten |
| unmark_page_for_flatten(page) | int | – | Unmark a page for flatten |
Redaction
doc.add_redaction(
page: int,
rect: tuple[float, float, float, float],
fill: tuple[float, float, float] | None = None
) -> None
Mark a rectangular region for redaction with an optional RGB fill color.
doc.redaction_count(page: int) -> int
Return the number of pending redactions on a page.
doc.apply_redactions_destructive(
scrub_metadata: bool = True,
remove_javascript: bool = True,
remove_embedded_files: bool = True,
fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None
Apply all redactions destructively, removing underlying content and optionally scrubbing metadata, JavaScript, and embedded files.
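The redaction calls above combine into a mark, apply, save sequence. The following is a sketch of that sequence under the assumption that `doc` is an open PdfDocument and that saving after apply_redactions_destructive() writes the redacted content; `redact_regions` is a hypothetical helper name.

```python
def redact_regions(doc, page: int, rects, out_path: str) -> int:
    """Mark each rect for redaction, apply destructively, and save.

    `rects` is an iterable of (float, float, float, float) tuples as accepted
    by add_redaction(). Returns the number of redactions that were pending on
    the page before they were applied. Illustrative; not part of pdf_oxide.
    """
    for rect in rects:
        doc.add_redaction(page, rect)
    pending = doc.redaction_count(page)
    # Defaults also scrub metadata, JavaScript, and embedded files.
    doc.apply_redactions_destructive()
    doc.save(out_path)
    return pending
```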
doc.sanitize_document(
scrub_metadata: bool = True,
remove_javascript: bool = True,
remove_embedded_files: bool = True
) -> None
Sanitize the document without redacting regions: strip metadata, JavaScript, and/or embedded files.
| Method | Parameters | Returns | Description |
|---|---|---|---|
| apply_page_redactions(page) | int | – | Apply redactions on a page |
| apply_all_redactions() | – | – | Apply all pending redactions |
| is_page_marked_for_redaction(page) | int | bool | Check if page is marked for redaction |
| unmark_page_for_redaction(page) | int | – | Unmark a page for redaction |
Layers & Inks
| Method | Parameters | Returns | Description |
|---|---|---|---|
| get_layers() | – | list[str] | List optional-content (OCG) layer names |
| get_page_inks(page) | int | list[str] | List ink / separation colorant names on a page |
| get_page_inks_deep(page) | int | list[str] | List inks including those nested in Form XObjects |
Header / Footer Cleanup
doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int
Detect and remove repeating headers, footers, or page artifacts across the document. threshold is the cross-page repetition ratio. Returns the number of elements removed.
| Method | Parameters | Description |
|---|---|---|
| erase_header(page) | int | Erase the detected header region on a page |
| edit_header(page) | int | Mark the header region for editing |
| erase_footer(page) | int | Erase the detected footer region on a page |
| edit_footer(page) | int | Mark the footer region for editing |
| erase_artifacts(page) | int | Erase detected artifacts on a page |
| sync_editor_erasures() | – | Flush pending header/footer/artifact erasures into the editor |
Form Fields
doc.get_form_fields() -> list[FormField]
Get all form fields. See FormField for properties.
doc.get_form_field_value(name: str) -> str | bool | list | None
Get a form field value by name. Returns the appropriate Python type based on the field type.
doc.set_form_field_value(name: str, value: str | bool) -> None
Set a form field value by name.
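A sketch of filling several fields at once while skipping names the document does not contain. It assumes `doc` is an open PdfDocument and uses the documented get_form_fields() and set_form_field_value() calls; `fill_form` is a hypothetical helper name.

```python
def fill_form(doc, values: dict) -> list:
    """Set every field in `values` that exists in the document.

    Returns the list of names that were not found, so the caller can report
    them. Relies on the documented FormField.name property. Illustrative
    helper; not part of pdf_oxide.
    """
    known = {field.name for field in doc.get_form_fields()}
    missing = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
        else:
            missing.append(name)
    return missing
```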
doc.has_xfa() -> bool
Check if the document contains XFA forms.
doc.export_form_data(path: str, format: str = "fdf") -> None
Export form data to a file. Supported formats: "fdf" and "xfdf".
| Method | Parameters | Description |
|---|---|---|
| flatten_forms() | – | Flatten all form fields into page content |
| flatten_forms_on_page(page) | int | Flatten forms on a specific page |
Image Manipulation
doc.page_images(page: int) -> list[dict]
Get image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.
| Method | Parameters | Description |
|---|---|---|
| reposition_image(page, name, x, y) | int, str, float, float | Move an image |
| resize_image(page, name, width, height) | int, str, float, float | Resize an image |
| set_image_bounds(page, name, x, y, width, height) | int, str, float, float, float, float | Set image position and size |
| clear_image_modifications(page) | int | Clear pending image modifications |
| has_image_modifications(page) | int → bool | Check for pending image modifications |
Document Operations
doc.merge_from(source: str | PdfDocument) -> int
Merge pages from another PDF. Accepts a file path or PdfDocument instance. Returns the number of pages merged.
doc.embed_file(name: str, data: bytes) -> None
Attach a file to the PDF.
doc.get_outline() -> list[dict] | None
Get document bookmarks / table of contents. Returns None if no outline exists.
doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]
Get vector paths (lines, curves, shapes) from a page.
doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]
Get axis-aligned rectangles (from filled/stroked paths) on a page.
doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]
Get straight line segments on a page.
doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]
Detect and extract tables. Each table is a dict with rows and cells (text + bounding boxes). Pass table_settings to tune detection strategy.
doc.extract_structured(page: int) -> str
Extract the page as a structured JSON document (logical reading order, blocks, and roles).
doc.page_labels() -> list[dict]
Get page label ranges. Each dict contains start_page, style, prefix, and start_value.
doc.xmp_metadata() -> dict | None
Get XMP metadata as a dictionary with fields like dc_title, dc_creator, xmp_create_date, etc. Returns None if no XMP metadata exists.
OCR
doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str
Extract text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine or None for the default engine.
Page Extraction & Reordering
doc.extract_pages(pages: list[int], output: str) -> None
Extract the given page indices into a new PDF file at output.
doc.extract_pages_to_bytes(pages: list[int]) -> bytes
Extract the given page indices into a new PDF returned as bytes.
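As an illustration, a document can be split into single-page PDFs by calling extract_pages_to_bytes() once per page. The sketch assumes `doc` is an open PdfDocument and that page indices are zero-based, as elsewhere in this reference; `split_into_single_pages` is a hypothetical helper name.

```python
def split_into_single_pages(doc) -> list:
    """Return one PDF (as bytes) per page of the document, in page order.

    Uses the documented page_count() and extract_pages_to_bytes(pages) calls.
    Illustrative helper; not part of pdf_oxide.
    """
    return [
        doc.extract_pages_to_bytes([index])
        for index in range(doc.page_count())
    ]
```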
doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes
Extract one or more (start, end) page ranges into a new PDF returned as bytes.
| Method | Parameters | Description |
|---|---|---|
| select_pages(pages) | list[int] | Keep only the listed pages, in the given order |
| delete_page(index) | int | Delete a single page |
| move_page(from_index, to_index) | int, int | Move a page to a new position |
Compliance & Validation
doc.validate_pdf_a(level: str = "1b") -> dict
Validate against a PDF/A conformance level (e.g. "1b", "2b", "3b"). Returns a report dict.
doc.convert_to_pdf_a(level: str = "2b") -> dict
Convert the document to PDF/A and return a conversion report dict.
doc.validate_pdf_ua() -> dict
Validate against PDF/UA (accessibility) requirements.
doc.validate_pdf_x(level: str = "1a_2001") -> dict
Validate against a PDF/X (print-production) conformance level.
Permissions & Warnings
doc.permissions() -> dict
Return the document’s encryption permission flags (print, copy, modify, annotate, etc.).
doc.structured_warnings() -> list
Return warnings collected during structured / tagged-content extraction.
doc.flatten_warnings() -> list[str]
Return warnings collected during form/annotation flattening.
Signatures & Document Security Store
doc.signatures() -> list[Signature]
Return all digital signatures in the document. See Signature.
doc.signature_count() -> int
Return the number of digital signatures.
doc.dss() -> Dss | None
Return the document’s parsed Document Security Store (LTV material), or None. See Dss.
Page API (v0.3.34)
PdfDocument is iterable and indexable, returning lazy Page objects. See Page.
len(doc) # number of pages
doc[i] # page at index i (negative indexing supported)
doc[-1] # last page
for page in doc: ... # iterate pages
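A short sketch of the iteration protocol, assuming `doc` is an open PdfDocument whose pages expose the lazy `index` and `text` properties. `longest_page` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def longest_page(doc) -> Optional[Tuple[int, str]]:
    """Return (index, text) for the page with the most extracted text.

    Uses len(doc) and iteration over the document. Returns None for an empty
    document. Illustrative helper; not part of pdf_oxide.
    """
    if len(doc) == 0:
        return None
    best = max(doc, key=lambda page: len(page.text))
    return best.index, best.text
```

Because page properties are computed on access, the key function above triggers text extraction for every page.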
DOM Access
doc.page(index: int) -> PdfPage
Get a DOM-like page handle for element-level editing. See PdfPage.
doc.save_page(page: PdfPage) -> None
Save a modified PdfPage back to the document.
Rendering
doc.render_page(
page: int,
dpi: int | None = None,
format: str | None = None,
background: tuple[float, float, float, float] | None = None,
transparent: bool = False,
render_annotations: bool | None = None,
jpeg_quality: int | None = None,
excluded_layers: list[str] | None = None
) -> bytes
Render a page to PNG or JPEG bytes with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.
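When choosing a DPI it helps to estimate the output size. PDF points are 1/72 inch, so a page dimension in points scales by dpi / 72. The sketch below is plain arithmetic; the exact rounding the renderer applies is not stated in this reference, so the use of round() is an assumption. `pixel_size` is a hypothetical helper, not a pdf_oxide function.

```python
from typing import Tuple


def pixel_size(width_pt: float, height_pt: float, dpi: int) -> Tuple[int, int]:
    """Estimate the rendered size in pixels of a page measured in PDF points.

    One PDF point is 1/72 inch, so pixels = points * dpi / 72.
    Rounding to the nearest integer is an assumption about the renderer.
    """
    scale = dpi / 72.0
    return round(width_pt * scale), round(height_pt * scale)
```

For example, a US Letter page (612 x 792 points) doubles in each dimension when rendered at 144 DPI compared with 72 DPI.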
doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap
Render a page to a raw RGBA RenderedPixmap (named tuple with width, height, data).
doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]
Render per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).
doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate
Render a single named ink separation plate.
| Method | Return Type | Description |
|---|---|---|
| render_page_fit(page, fit_width, fit_height, format=0) | bytes | Render a page scaled to fit a pixel box |
| flatten_to_images(dpi=150) | bytes | Flatten all pages to image-based PDF |
Saving
doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None
Save the PDF to a file. Toggle stream compression, dead-object garbage collection, and linearization (fast web view).
doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes
Serialize the PDF to bytes with the same options as save().
doc.save_encrypted(
path: str,
user_password: str,
owner_password: str | None = None,
allow_print: bool = True,
allow_copy: bool = True,
allow_modify: bool = True,
allow_annotate: bool = True
) -> None
Save with AES-256 password protection and permission controls. If owner_password is None, the user password is used.
doc.to_bytes_encrypted(
user_password: str,
owner_password: str | None = None,
allow_print: bool = True,
allow_copy: bool = True,
allow_modify: bool = True,
allow_annotate: bool = True
) -> bytes
Serialize an AES-256 encrypted PDF to bytes.
Page
A lazy page handle returned by doc[i] or iteration over PdfDocument. All properties are computed on access and dispatch to the parent document.
from pdf_oxide import PdfDocument
with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)
Properties (lazy)
| Property | Type | Description |
|---|---|---|
| index | int | Zero-based page index |
| width, height | float | Page dimensions in PDF points |
| bbox | tuple[float, float, float, float] | (llx, lly, urx, ury) |
| text | str | Extracted plain text |
| chars, words, lines, spans | list[...] | Structured text |
| tables | list[dict] | Tables with rows + cells (text + bboxes) |
| images, paths, annotations | list[...] | Page content |
Methods
page.markdown(preserve_layout=False, detect_headings=True,
include_images=False, image_output_dir=None,
embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion
The same lazy Page objects are also available through doc.pages(), an iterator equivalent to iterating the document directly.
PdfPage
DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page().
from pdf_oxide import PdfDocument
doc = PdfDocument("file.pdf")
page = doc.page(0)
Properties
| Property | Type | Description |
|---|---|---|
| index | int | Zero-based page index |
| width | float | Page width in PDF points |
| height | float | Page height in PDF points |
Methods
page.children() -> list[PdfElement]
Get all elements on the page.
page.find_text_containing(needle: str) -> list[PdfText]
Find all text elements containing the given substring.
page.find_images() -> list[PdfImage]
Find all image elements on the page.
page.get_element(element_id: str) -> PdfElement | None
Get a specific element by its ID.
page.set_text(text_id: PdfTextId, new_text: str) -> None
Replace the text content of an element identified by its PdfTextId.
page.annotations() -> list[PdfAnnotation]
Get all annotations on the page.
page.add_link(x: float, y: float, width: float, height: float, url: str) -> str
Add a URL link annotation. Returns the annotation ID.
page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str
Add a highlight annotation with an RGB color. Returns the annotation ID.
page.add_note(x: float, y: float, text: str) -> str
Add a sticky note annotation. Returns the annotation ID.
page.remove_annotation(index: int) -> bool
Remove an annotation by index. Returns True if removed.
page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId
Add new text to the page. Returns a PdfTextId for later reference.
page.remove_element(element_id: PdfTextId) -> bool
Remove an element by its ID. Returns True if removed.
Example
from pdf_oxide import PdfDocument
doc = PdfDocument("invoice.pdf")
page = doc.page(0)
# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")
doc.save_page(page)
doc.save("invoice_updated.pdf")
Pdf
The unified class for creating PDFs from various source formats.
from pdf_oxide import Pdf
Factory Methods
Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf
Create a PDF from Markdown content.
Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf
Create a PDF from HTML content.
Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf
Create a PDF from plain text.
Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf
Create a PDF from Markdown rendered through a named CSS/layout template.
Pdf.from_image(path: str) -> Pdf
Create a single-page PDF from an image file (JPEG, PNG).
Pdf.from_bytes(data: bytes) -> Pdf
Open an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, HTTP, or databases.
from pdf_oxide import Pdf
pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")
Pdf.from_images(paths: list[str]) -> Pdf
Create a multi-page PDF from multiple image files, one page per image.
Pdf.from_image_bytes(data: bytes) -> Pdf
Create a single-page PDF from image bytes.
Pdf.merge(paths: list[str]) -> Pdf
Merge multiple PDF files (by path) into a single Pdf.
Methods
pdf.save(path: str) -> None
Save the PDF to a file.
pdf.to_bytes() -> bytes
Get the PDF content as bytes.
len(pdf) -> int
Get the PDF size in bytes (via __len__).
PdfText
Represents a text element on a page. Returned by PdfPage.find_text_containing().
| Property | Type | Description |
|---|---|---|
| id | PdfTextId | Unique element identifier |
| value | str | Text content |
| text | str | Text content (alias for value) |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| font_name | str | PostScript font name |
| font_size | float | Font size in points |
| is_bold | bool | Whether text is bold |
| is_italic | bool | Whether text is italic |
Methods
| Method | Parameters | Returns | Description |
|---|---|---|---|
| contains(needle) | str | bool | Check if text contains substring |
| starts_with(prefix) | str | bool | Check if text starts with prefix |
| ends_with(suffix) | str | bool | Check if text ends with suffix |
PdfImage
Represents an image element on a page. Returned by PdfPage.find_images().
| Property | Type | Description |
|---|---|---|
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| width | int | Image width in pixels |
| height | int | Image height in pixels |
| aspect_ratio | float | Width / height ratio |
PdfAnnotation
Represents an annotation on a page. Returned by PdfPage.annotations().
| Property | Type | Description |
|---|---|---|
| subtype | str | Annotation type (e.g., "Link", "Highlight", "Text") |
| rect | tuple[float, float, float, float] | Position (x0, y0, x1, y1) |
| contents | `str \| None` | |
| color | `tuple[float, float, float] \| None` | |
| is_modified | bool | Whether the annotation has been modified |
| is_new | bool | Whether the annotation is newly added |
PdfElement
Generic element wrapper. Returned by PdfPage.children().
| Method | Returns | Description |
|---|---|---|
| is_text() | bool | Check if element is text |
| is_image() | bool | Check if element is an image |
| is_path() | bool | Check if element is a vector path |
| is_table() | bool | Check if element is a table |
| is_structure() | bool | Check if element is a structure element |
| as_text() | `PdfText \| None` | |
| as_image() | `PdfImage \| None` | |

| Property | Type | Description |
|---|---|---|
| bbox | tuple[float, float, float, float] | Bounding box |
TextChar
Represents a single character with positioning and font metadata. Returned by PdfDocument.extract_chars().
from pdf_oxide import TextChar # or access via PdfDocument
| Attribute | Type | Description |
|---|---|---|
| char | str | The Unicode character |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| font_name | str | PostScript font name |
| font_size | float | Font size in points |
| font_weight | str | Weight ("thin", "light", "normal", "medium", "semi-bold", "bold", "extra-bold", "black") |
| is_italic | bool | Whether the character is italic |
| color | tuple[float, float, float] | RGB color (r, g, b), values 0.0–1.0 |
| rotation_degrees | float | Character rotation in degrees |
| origin_x | float | Text origin X position |
| origin_y | float | Text origin Y position |
| advance_width | float | Glyph advance width |
| mcid | `int \| None` | |
Example
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)
for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")
TextSpan
Represents a run of text sharing the same font and style. Returned by PdfDocument.extract_spans().
| Attribute | Type | Description |
|---|---|---|
| text | str | The text content |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| font_name | str | PostScript font name |
| font_size | float | Font size in points |
| is_bold | bool | Whether the span is bold |
| is_italic | bool | Whether the span is italic |
| is_monospace | bool | Whether the font is fixed-width (Courier, Consolas, etc.) |
| char_widths | list[float] | Per-glyph advance widths for accurate bounding boxes |
| color | tuple[float, float, float] | RGB color (r, g, b), values 0.0–1.0 |
Example
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)
for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")
Image Extraction
extract_images() returns ImageInfo objects with image metadata. Use extract_image_bytes() for raw image data suitable for saving to disk.
extract_image_bytes() Return Format
Each dict returned by extract_image_bytes() has the following keys:
| Key | Type | Description |
|---|---|---|
| width | int | Image width in pixels |
| height | int | Image height in pixels |
| data | bytes | Raw image data |
| format | str | Image format (e.g., "png", "jpeg") |
Example
from pdf_oxide import PdfDocument
doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
SearchResult
Represents a text search match. Returned by search() and search_page().
| Attribute | Type | Description |
|---|---|---|
| page | int | Zero-based page index |
| text | str | Matched text |
| x | float | X position in PDF points |
| y | float | Y position in PDF points |
FormField
Represents a form field. Returned by PdfDocument.get_form_fields().
| Property | Type | Description |
|---|---|---|
| name | str | Fully qualified field name |
| field_type | str | Field type: "text", "button", "choice", "signature", or "unknown" |
| value | `str \| bool \| …` | |
| tooltip | `str \| None` | |
| bounds | `tuple[float, float, float, float] \| None` | |
| flags | `int \| None` | |
| max_length | `int \| None` | |
| is_readonly | bool | Whether the field is read-only |
| is_required | bool | Whether the field is required |
TextWord
A word-grouped run of text. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().
| Property | Type | Description |
|---|---|---|
| text | str | The word text |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| font_name | str | PostScript font name |
| font_size | float | Font size in points |
| is_bold | bool | Whether the word is bold |
| is_italic | bool | Whether the word is italic |
| chars | list[TextChar] | Constituent characters |
TextLine
A line-grouped run of text. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().
| Property | Type | Description |
|---|---|---|
| text | str | The line text |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
| words | list[TextWord] | Words in the line |
| chars | list[TextChar] | Characters in the line |
PdfPageRegion
A clipped region of a page. Returned by PdfDocument.within() and PdfPage.region().
| Property | Type | Description |
|---|---|---|
| bbox | tuple[float, float, float, float] | The region’s bounds |
Methods
region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list
Extraction methods scoped to the region’s bounding box.
LayoutParams
Computed adaptive layout parameters for a page. Returned by PdfDocument.page_layout_params().
| Property | Type | Description |
|---|---|---|
| word_gap_threshold | float | Inter-word gap threshold in points |
| line_gap_threshold | float | Inter-line gap threshold in points |
| median_char_width | float | Median character width |
| median_font_size | float | Median font size |
| median_line_spacing | float | Median line spacing |
| column_count | int | Detected number of text columns |
ExtractionProfile
A tunable text-extraction profile passed to extract_words() / extract_text_lines().
from pdf_oxide import ExtractionProfile
Static Constructors
ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str] # names of all built-in profiles
Properties
| Property | Type | Description |
|---|---|---|
| name | str | Profile name |
| tj_offset_threshold | float | TJ array offset word-break threshold |
| word_margin_ratio | float | Word margin ratio |
| space_threshold_em_ratio | float | Space-width threshold (em ratio) |
| space_char_multiplier | float | Space-character multiplier |
| use_adaptive_threshold | bool | Whether adaptive thresholds are enabled |
OfficeConverter
Convert Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.
from pdf_oxide import OfficeConverter
OfficeConverter() # instances are stateless; the conversion methods are also usable as static methods
Methods
OfficeConverter.from_docx(path: str) -> Pdf
Convert a Word document to a Pdf object.
OfficeConverter.from_docx_bytes(data: bytes) -> Pdf
Convert Word document bytes to a Pdf object.
OfficeConverter.from_xlsx(path: str) -> Pdf
Convert an Excel spreadsheet to a Pdf object.
OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf
Convert Excel spreadsheet bytes to a Pdf object.
OfficeConverter.from_pptx(path: str) -> Pdf
Convert a PowerPoint presentation to a Pdf object.
OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf
Convert PowerPoint presentation bytes to a Pdf object.
OfficeConverter.convert(path: str) -> Pdf
Auto-detect format and convert any supported Office document to a Pdf object.
Example
from pdf_oxide import OfficeConverter
pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")
# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")
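If you prefer to reject unsupported files before calling the converter, a small dispatch helper is one option. The sketch below only relies on the method names documented above; `convert_office_file` is a hypothetical helper, and the converter is passed in as an argument so the function can be exercised without the `office` feature.

```python
from pathlib import Path

# File extension -> documented OfficeConverter method name.
_METHODS = {".docx": "from_docx", ".xlsx": "from_xlsx", ".pptx": "from_pptx"}


def convert_office_file(converter, path):
    """Call the converter method matching the file extension of `path`."""
    suffix = Path(path).suffix.lower()
    if suffix not in _METHODS:
        raise ValueError(f"Unsupported Office format: {path!r}")
    return getattr(converter, _METHODS[suffix])(str(path))


# Intended usage (requires pdf_oxide built with the office feature):
#   from pdf_oxide import OfficeConverter
#   pdf = convert_office_file(OfficeConverter, "report.docx")
#   pdf.save("report.pdf")
```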
Graphics Classes
These classes are available for advanced PDF creation with graphics:
Color
from pdf_oxide import Color
Color(r: float, g: float, b: float) # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
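The relationship between a hex string and the 0.0-1.0 component range can be sketched in plain Python. This shows the usual `#rrggbb` conversion; whether `Color.from_hex` performs exactly this computation is an assumption, and `hex_to_rgb01` is a hypothetical helper name.

```python
def hex_to_rgb01(hex_color):
    """Convert "#rrggbb" to an (r, g, b) tuple of floats in 0.0-1.0."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected #rrggbb, got {hex_color!r}")
    # Each pair of hex digits is one 8-bit channel, scaled to 0.0-1.0.
    return tuple(int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4))


red = hex_to_rgb01("#ff0000")

# The tuple can then be unpacked into the documented constructor:
#   Color(*hex_to_rgb01("#336699"))
```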
BlendMode
from pdf_oxide import BlendMode
BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()
ExtGState
from pdf_oxide import ExtGState
gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5) # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())
gs = ExtGState.semi_transparent() # Preset
LineCap / LineJoin
from pdf_oxide import LineCap, LineJoin
LineCap.BUTT() # Default
LineCap.ROUND()
LineCap.SQUARE()
LineJoin.MITER() # Default
LineJoin.ROUND()
LineJoin.BEVEL()
Gradients
from pdf_oxide import LinearGradient, RadialGradient, Color
# Linear gradient (fluent API)
grad = (LinearGradient()
.start(0, 0)
.end(100, 0)
.add_stop(0.0, Color.red())
.add_stop(1.0, Color.blue()))
# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))
# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))
PatternPresets
from pdf_oxide import PatternPresets, Color
PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)
OCR Classes
Requires the ocr feature in the Rust build.
OcrEngine
from pdf_oxide import OcrEngine, OcrConfig
engine = OcrEngine(
det_model_path: str,
rec_model_path: str,
dict_path: str,
config: OcrConfig | None = None
)
OcrConfig
from pdf_oxide import OcrConfig
config = OcrConfig(
det_threshold: float | None = None,
box_threshold: float | None = None,
rec_threshold: float | None = None,
num_threads: int | None = None,
max_candidates: int | None = None,
use_v5: bool = False
)
DocumentBuilder
Fluent builder for composing PDFs page by page. See the example below and Create from scratch.
from pdf_oxide import DocumentBuilder
Document-Level Methods
| Method | Parameters | Description |
|---|---|---|
| DocumentBuilder() | – | Construct a new builder |
| title(title) | str | Set document title |
| author(author) | str | Set document author |
| subject(subject) | str | Set document subject |
| keywords(keywords) | str | Set document keywords |
| creator(creator) | str | Set the producing application name |
| on_open(script) | str | Set a document-level open JavaScript action |
| tagged_pdf_ua1() | – | Emit a Tagged PDF/UA-1 accessible document |
| language(lang) | str | Set the document language (e.g. "en-US") |
| role_map(custom, standard) | str, str | Map a custom structure tag to a standard one |
| register_embedded_font(name, font) | str, EmbeddedFont | Register a font (consumes the EmbeddedFont) |
Page Factories
builder.a4_page() -> FluentPageBuilder # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder
Output
builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes
FluentPageBuilder
Buffers page-level operations until done(). Returned by DocumentBuilder.a4_page() / letter_page() / page(). Every method returns self for chaining; done() commits the page and returns the parent DocumentBuilder.
Text & Layout
| Method | Parameters | Description |
|---|---|---|
| font(name, size) | str, float | Set the current font and size |
| at(x, y) | float, float | Move the cursor to an absolute position |
| text(text) | str | Draw text at the cursor |
| heading(level, text) | int, str | Draw a heading (level 1–6) |
| paragraph(text) | str | Draw a wrapped paragraph |
| space(points) | float | Advance vertical space |
| horizontal_rule() | – | Draw a horizontal divider |
| columns(column_count, gap_pt, text) | int, float, str | Balanced multi-column text flow |
| footnote(ref_mark, note_text) | str, str | Inline reference mark + bottom-of-page note |
| new_page_same_size() | – | Start a fresh page with the same dimensions |
| measure(text) -> float | str | Measure rendered text width in points |
| remaining_space() -> float | – | Remaining vertical space on the page |
Inline Runs
page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()
Links & Actions
page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)
Markup Annotations
page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)
AcroForm Widgets
page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)
Graphics
page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
Images & Barcodes
page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)
barcode_type: 0=Code128, 1=Code39, 2=EAN13, 3=EAN8, 4=UPCA, 5=ITF, 6=Code93, 7=Codabar.
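Because `barcode_type` is a bare integer, a lookup table can keep call sites readable. The sketch below encodes the mapping listed above; the `BARCODE_TYPES` dictionary and `barcode_type_code` helper are hypothetical conveniences, not part of pdf_oxide.

```python
# Integer codes as documented for barcode_1d().
BARCODE_TYPES = {
    "code128": 0,
    "code39": 1,
    "ean13": 2,
    "ean8": 3,
    "upca": 4,
    "itf": 5,
    "code93": 6,
    "codabar": 7,
}


def barcode_type_code(name):
    """Map a human-readable symbology name (e.g. "EAN-13") to its integer code."""
    key = name.replace("-", "").replace("_", "").replace(" ", "").lower()
    if key not in BARCODE_TYPES:
        raise ValueError(f"unknown barcode type: {name!r}")
    return BARCODE_TYPES[key]


# Intended usage on a FluentPageBuilder:
#   page.barcode_1d(barcode_type_code("EAN-13"), "5901234123457", 72, 600, 200, 60)
```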
Tables
page.table(table: Table)
page.streaming_table(
columns: list[Column],
repeat_header: bool = False,
mode: str = "fixed",
sample_rows: int = 50,
min_col_width_pt: float = 20.0,
max_col_width_pt: float = 400.0,
max_rowspan: int = 1,
batch_size: int = 256
) -> StreamingTable
Commit
page.done() -> DocumentBuilder
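The chaining contract (page methods return the page builder, `done()` returns the parent `DocumentBuilder`) lends itself to small reusable page functions. A minimal sketch, using only methods documented above; `add_title_page` is a hypothetical helper and the coordinates assume a lower-left origin in points.

```python
def add_title_page(builder, title, body):
    """Append one A4 page to `builder` and return what done() returns."""
    return (
        builder.a4_page()       # FluentPageBuilder, 595 x 842 pt
        .at(72, 770)            # assumed to be near the top-left corner
        .heading(1, title)
        .paragraph(body)
        .horizontal_rule()
        .done()                 # commits the page, back to the DocumentBuilder
    )


# Intended usage:
#   from pdf_oxide import DocumentBuilder
#   builder = DocumentBuilder().title("Report")
#   add_title_page(builder, "Report", "Generated by PDF Oxide.").save("report.pdf")
```

Because the function takes the builder as an argument and returns the result of `done()`, several such functions can be composed to assemble a multi-page document.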
EmbeddedFont
A TTF/OTF font registered with a DocumentBuilder.
from pdf_oxide import EmbeddedFont
EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont
| Property | Type | Description |
|---|---|---|
| name | str | The font’s registered name |
Tables
Value objects for the fluent table API.
Align
from pdf_oxide import Align
Align.LEFT # 0
Align.CENTER # 1
Align.RIGHT # 2
Column
from pdf_oxide import Column
Column(header: str, width: float = 100.0, align: Align | int | None = None)
| Property | Type | Description |
|---|---|---|
| header | str | Column header text |
| width | float | Column width in points |
| align | int | Cell alignment |
Table
from pdf_oxide import Table
Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)
A buffered table consumed by FluentPageBuilder.table(). With has_header=True, the column headers render as a styled header row.
StreamingTable
A row-streaming table handle returned by FluentPageBuilder.streaming_table(), for tables too large to materialize at once.
| Method | Parameters | Description |
|---|---|---|
| push_row(cells) | list[str] | Append a row of cell strings |
| push_row_span(cells) | list[tuple[str, int]] | Append a row of (text, colspan) cells |
| flush() | – | Flush the current batch |
| finish() | – | Finish the table, returning the FluentPageBuilder |
| column_count() -> int | – | Number of columns |
| pending_row_count() -> int | – | Rows buffered but not yet committed |
| batch_count() -> int | – | Number of completed batches |
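A typical streaming loop pushes rows from an iterator and then calls `finish()`. The sketch below uses only the methods in the table; `stream_rows` is a hypothetical helper, and the up-front cell-count check is an added safeguard rather than documented library behaviour.

```python
def stream_rows(table, rows):
    """Push each row into a StreamingTable-like object, then finish it.

    `rows` may be any iterable (including a generator), so the full
    data set never has to be held in memory at once.
    """
    expected = table.column_count()
    for index, row in enumerate(rows):
        cells = [str(cell) for cell in row]
        if len(cells) != expected:
            raise ValueError(
                f"row {index} has {len(cells)} cells, expected {expected}"
            )
        table.push_row(cells)
    return table.finish()


# Intended usage:
#   table = page.streaming_table([Column("ID"), Column("Name")], repeat_header=True)
#   page = stream_rows(table, database_cursor)   # database_cursor is hypothetical
```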
Page Templates
Repeating header/footer artifacts applied across pages.
Artifact / ArtifactStyle
from pdf_oxide import Artifact, ArtifactStyle
Artifact() # empty artifact
Artifact.center(text: str) # centered artifact text
artifact.with_left(text: str) # add left-aligned text
style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()
Header / Footer
from pdf_oxide import Header, Footer
Header() # or Header.center(text: str)
Footer() # or Footer.center(text: str)
PageTemplate
from pdf_oxide import PageTemplate, Header, Footer
template = (PageTemplate()
.header(Header.center("Confidential"))
.footer(Footer.center("Page")))
Digital Signatures
Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures (and optionally tsa-client) features in the Rust build.
Certificate
from pdf_oxide import Certificate
Certificate.load(data: bytes) -> Certificate # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential
| Method | Returns | Description |
|---|---|---|
| subject() | str | Certificate subject DN |
| issuer() | str | Certificate issuer DN |
| serial() | str | Serial number |
| validity() | tuple[int, int] | (not_before, not_after) Unix timestamps |
| is_valid() | bool | Whether the certificate is currently within its validity window |
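Since `validity()` returns raw Unix timestamps, a little standard-library code turns them into dates and an expiry countdown. This sketch assumes the timestamps are seconds since the epoch in UTC; `validity_window` and `days_until_expiry` are hypothetical helper names.

```python
from datetime import datetime, timezone


def validity_window(validity):
    """Convert a (not_before, not_after) timestamp pair to aware UTC datetimes."""
    not_before, not_after = validity
    return (
        datetime.fromtimestamp(not_before, tz=timezone.utc),
        datetime.fromtimestamp(not_after, tz=timezone.utc),
    )


def days_until_expiry(validity, now=None):
    """Days until not_after, rounded down (negative once expired)."""
    _, not_after = validity_window(validity)
    if now is None:
        now = datetime.now(timezone.utc)
    return (not_after - now).days


# Intended usage:
#   cert = Certificate.load(der_bytes)
#   if days_until_expiry(cert.validity()) < 30:
#       print("certificate expires soon")
```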
Signature
Returned by PdfDocument.signatures().
| Property / Method | Type | Description |
|---|---|---|
| signer_name | `str \| None` | Signer name, if present |
| reason | `str \| None` | Stated reason for signing, if present |
| location | `str \| None` | Signing location, if present |
| contact_info | `str \| None` | Signer contact information, if present |
| signing_time | `int \| None` | Signing time as a Unix timestamp, if present |
| covers_whole_document | bool | Whether the signature covers the entire file |
| pades_level | PadesLevel | Detected PAdES baseline (B-B/B-T/B-LT) |
| verify() | bool | Verify the signature cryptographically |
| verify_detached(pdf_data) | bool | Verify including the messageDigest against the file bytes |
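For reporting, the properties above can be flattened into a plain dictionary. The sketch is duck-typed and assumes `signing_time` is a Unix timestamp in seconds; `describe_signature` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def describe_signature(sig):
    """Build a JSON-friendly summary from a Signature-like object."""
    signed_at = None
    if sig.signing_time is not None:
        # Assumed to be seconds since the epoch, rendered as ISO 8601 in UTC.
        signed_at = datetime.fromtimestamp(
            sig.signing_time, tz=timezone.utc
        ).isoformat()
    return {
        "signer": sig.signer_name or "(unknown)",
        "reason": sig.reason,
        "signed_at": signed_at,
        "covers_whole_document": sig.covers_whole_document,
        "valid": sig.verify(),
    }


# Intended usage:
#   for sig in PdfDocument("signed.pdf").signatures():
#       print(describe_signature(sig))
```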
Timestamp
from pdf_oxide import Timestamp
Timestamp.parse(data: bytes) -> Timestamp
| Property / Method | Type | Description |
|---|---|---|
| time | int | Timestamp time (Unix) |
| serial | str | TSA response serial number |
| policy_oid | str | TSA policy OID |
| tsa_name | str | TSA name |
| hash_algorithm | int | Message-imprint hash algorithm code |
| message_imprint | bytes | The hashed message imprint |
| verify() | bool | Verify the timestamp token |
TsaClient
from pdf_oxide import TsaClient
client = TsaClient(
url: str,
username: str | None = None,
password: str | None = None,
timeout_seconds: int = 30,
hash_algorithm: int = 2,
use_nonce: bool = True,
cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp
PadesLevel
from pdf_oxide import PadesLevel
PadesLevel.B_B # baseline
PadesLevel.B_T # + trusted timestamp
PadesLevel.B_LT # + long-term validation material
PadesLevel.B_LTA # + archival timestamp
RevocationMaterial
from pdf_oxide import RevocationMaterial
RevocationMaterial(
certs: list[bytes] | None = None,
crls: list[bytes] | None = None,
ocsps: list[bytes] | None = None
)
DER-encoded certificates, CRLs, and OCSP responses for B-LT signing.
Dss
A parsed Document Security Store, returned by PdfDocument.dss().
| Property | Type | Description |
|---|---|---|
| certs | list[bytes] | Document-level certificate DER blobs |
| crls | list[bytes] | CRL DER blobs |
| ocsps | list[bytes] | OCSP response DER blobs |
| vri | list[str] | Per-signature VRI keys (hex SHA-1 of /Contents) |
Module-Level Functions
from pdf_oxide import (
sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
generate_barcode_svg, generate_qr_svg,
plan_split_by_bookmarks, split_by_bookmarks,
)
Signing
sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes
Sign raw PDF bytes with a loaded signing Certificate and return the signed PDF.
sign_pdf_bytes_pades(
pdf_data: bytes,
cert: Certificate,
level: PadesLevel,
tsa_url: str | None = None,
reason: str | None = None,
location: str | None = None,
revocation: RevocationMaterial | None = None
) -> bytes
Sign raw PDF bytes at a PAdES baseline level. B_T/B_LT require a tsa_url.
has_document_timestamp(pdf_data: bytes) -> bool
Returns whether the PDF carries a document-level RFC 3161 archival timestamp (PAdES-B-LTA).
Barcodes
generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str
Generate a 1D barcode or QR code as an SVG string. Requires the barcodes feature.
Split by Bookmarks
plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]
Plan or perform a split of a PDF at bookmark boundaries. plan_* returns segment metadata only; split_* returns each segment paired with its PDF bytes.
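Since `split_by_bookmarks` returns `(metadata, bytes)` pairs, writing the segments to disk is a short loop. The sketch below deliberately ignores the metadata dictionary because its keys are not listed here; `write_segments` and the `part_NNN.pdf` naming scheme are assumptions made for the example.

```python
from pathlib import Path


def write_segments(segments, out_dir, stem="part"):
    """Write each (metadata, pdf_bytes) pair to out_dir and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (_metadata, pdf_bytes) in enumerate(segments, start=1):
        target = out_dir / f"{stem}_{index:03d}.pdf"
        target.write_bytes(pdf_bytes)
        written.append(target)
    return written


# Intended usage:
#   from pdf_oxide import split_by_bookmarks
#   with open("book.pdf", "rb") as fh:
#       segments = split_by_bookmarks(fh.read(), level=1)
#   write_segments(segments, "chapters")
```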
OCR Model Provisioning
prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool
Provision OCR models for offline/air-gapped use, inspect the model manifest (JSON), and check whether this build can download models.
Logging
setup_logging() -> None
set_log_level(level: str) -> None # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None
Engine Tuning
set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool
Adjust the per-stream operator cap (adversarial-input protection) and U+FFFD preservation for unmapped glyphs. Both return the previous value.
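Because both setters return the previous value, a setting can be changed for one block of work and then restored. A minimal sketch of that pattern; `temporarily` is a hypothetical helper that works with any setter following the "returns the previous value" convention.

```python
from contextlib import contextmanager


@contextmanager
def temporarily(setter, value):
    """Apply `value` via `setter`, then restore the previous value on exit."""
    previous = setter(value)
    try:
        yield previous
    finally:
        # Restore even if the body raised.
        setter(previous)


# Intended usage:
#   from pdf_oxide import set_preserve_unmapped_glyphs
#   with temporarily(set_preserve_unmapped_glyphs, True):
#       text = doc.extract_text(0)
```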
Cryptographic Governance
crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str # CycloneDX 1.6 CBOM (JSON)
Asynchronous API
async/await wrappers that run blocking operations in a thread pool. Methods mirror their synchronous counterparts.
from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter
async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()
    # Or use as an async context manager:
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()
| Class | Constructors | Notes |
|---|---|---|
| AsyncPdfDocument | await AsyncPdfDocument.open(path, password=None), await AsyncPdfDocument.from_bytes(data, password=None) | All PdfDocument methods are available as awaitables; supports async with and .close() |
| AsyncPdf | wraps Pdf factory methods | await pdf.save(path), await pdf.to_bytes() |
| AsyncOfficeConverter | wraps OfficeConverter static methods | e.g. await AsyncOfficeConverter.from_docx(path) |
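The thread-pool pattern behind such wrappers can be sketched with the standard library alone. This is a generic illustration, not pdf_oxide's implementation: `slow_extract` is a hypothetical stand-in for a blocking call such as `PdfDocument.extract_text(page)`, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import time


def slow_extract(page_index):
    """Stand-in for a blocking extraction call."""
    time.sleep(0.05)
    return f"text of page {page_index}"


async def extract_pages(page_indices):
    """Run the blocking calls in worker threads and await them together."""
    tasks = [asyncio.to_thread(slow_extract, i) for i in page_indices]
    # gather() preserves the order of its arguments in the result list.
    return await asyncio.gather(*tasks)


pages = asyncio.run(extract_pages(range(3)))
```

Offloading to threads keeps the event loop responsive while the blocking work runs, which is the same motivation the async classes above describe.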
Error Handling
PdfError
All PDF-specific errors raise PdfError:
from pdf_oxide import PdfDocument, PdfError
try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")
Common error scenarios:
| Exception | Cause |
|---|---|
| PdfError | Malformed PDF, encrypted without password, parse failure |
| FileNotFoundError | File does not exist |
| IndexError | Page index exceeds page_count() |
| ValueError | Invalid argument (e.g., negative page index) |
Complete Example
from pdf_oxide import PdfDocument, Pdf
# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")
# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")
# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
title="Report", author="PDF Oxide")
pdf.save("report.pdf")
# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)
# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)
# Form filling
fields = doc.get_form_fields()
for field in fields:
    print(f"{field.name} ({field.field_type}) = {field.value}")
doc.set_form_field_value("name", "John Doe")
# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")
doc.save("output.pdf")
# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")
v0.3.38 additions
DocumentBuilder / FluentPageBuilder / EmbeddedFont
from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType
font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)
(DocumentBuilder()
.register_embedded_font("DejaVu", font)
.letter_page() # or .a4_page() / .page(width, height)
.at(72, 720).font("DejaVu", 12).text("Hello")
.heading(1, "Title")
.paragraph("Body text with automatic wrapping")
# Annotations (15 methods)
.link_url("https://example.com")
.link_page(2)
.link_named("glossary")
.highlight((1.0, 1.0, 0.0))
.underline((0.0, 0.0, 1.0))
.strikeout((1.0, 0.0, 0.0))
.squiggly((1.0, 0.5, 0.0))
.sticky_note("Review this")
.stamp(StampType.APPROVED)
.freetext(100, 500, 200, 50, "Comment")
.watermark("DRAFT")
.watermark_confidential()
.watermark_draft()
# AcroForm widgets (5 types)
.text_field("name", 150, 400, 200, 20, "Jane Doe")
.checkbox("agree", 72, 380, 15, 15, True)
.combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
.radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
.push_button("submit", 72, 300, 80, 25, "Submit")
# Graphics primitives
.rect(50, 270, 500, 2)
.filled_rect(50, 260, 500, 2, 0.9, 0.9, 0.9)
.line(50, 250, 550, 250)
.done()
.save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes
HTML + CSS pipeline
Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf
See Create from HTML.
Signature verification
from pdf_oxide import PdfDocument, Timestamp, TsaClient
doc = PdfDocument("signed.pdf")
doc.signature_count() # int
for sig in doc.signatures():
    sig.signer_name # str | None
    sig.reason # str | None
    sig.location # str | None
    sig.signing_time # int | None (Unix timestamp)
    sig.verify() # bool
    sig.verify_detached(pdf_bytes) # bool; also checks the messageDigest
# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint
# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
username=None, password=None,
timeout_seconds=30, hash_algorithm=2,
use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)
See Digital Signatures for details.
Rendering
doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes
format: 0 = PNG, 1 = JPEG. Coordinates in PDF points from lower-left.
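Many image tools measure from the top-left corner, so a rectangle often needs flipping before it is passed to `render_page_region`. The arithmetic is sketched below, assuming `y` denotes the lower edge of the region and that the page height in points is known; `top_left_to_pdf_rect` is a hypothetical helper name.

```python
def top_left_to_pdf_rect(x, y_top, w, h, page_height):
    """Convert a top-left-origin rectangle to a lower-left-origin (x, y, w, h)."""
    # The rectangle's lower edge sits y_top + h below the top of the page.
    return (x, page_height - y_top - h, w, h)


# A 200 x 100 pt box placed 72 pt from the top-left of a Letter page (792 pt tall).
region = top_left_to_pdf_rect(72, 72, 200, 100, page_height=792)

# Intended usage:
#   png = doc.render_page_region(0, *region, format=0)
```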
Pdf flatten
doc.flatten_to_images(dpi: int = 150) -> bytes
Other Language Bindings
PDF Oxide also ships native bindings for Rust, Node.js, WASM, C#, Go, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, and Elixir.
Next Steps
- Types & Enums — all shared types and enums
- Page API Reference — consistent per-page iteration across bindings
- Getting Started with Python — tutorial