Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Prebuilt wheels are available for Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).

pip install pdf_oxide

For the Rust API, see the [Rust API Reference](/docs/reference/api). For the JavaScript API, see the [JavaScript API Reference](/docs/reference/javascript-api). For type details, see [Types & Enums](/docs/reference/types).


PdfDocument

The primary class for opening, extracting, editing, and saving PDF files.

from pdf_oxide import PdfDocument

Constructor

PdfDocument(path: str, password: str | None = None)
Parameter Type Description
path str Path to the PDF file
password str | None Optional password for encrypted PDFs (default: None)

Pass password= to open an encrypted PDF in one step. Alternatively, open the file first and then call doc.authenticate(password).

Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.

Class Methods

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

Open a PDF from in-memory bytes (e.g., downloaded from S3, received via HTTP). Accepts an optional password for encrypted PDFs.

Parameter Type Description
data bytes Raw PDF file bytes
password str | None Optional password for encrypted PDFs (default: None)

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

Methods

General

Method Return Type Description
version() tuple[int, int] PDF version as (major, minor) (e.g., (1, 7))
authenticate(password) bool Authenticate an encrypted PDF with user or owner password

Document Info

doc.page_count() -> int

Return the number of pages in the document.

doc.has_structure_tree() -> bool

Check if the document is a Tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticate with a password after opening. Returns True if authentication succeeded.
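
The open-then-authenticate path can be sketched as below. This is a minimal illustration, not part of pdf_oxide: the helper name `try_passwords` and the idea of trying several candidates are assumptions, and only `authenticate(password) -> bool` comes from the reference above.

```python
from __future__ import annotations

from typing import Iterable


def try_passwords(doc, candidates: Iterable[str]) -> str | None:
    """Return the first candidate accepted by doc.authenticate(), else None.

    `doc` is expected to be an opened pdf_oxide.PdfDocument (any object with
    an authenticate(password) -> bool method works). Hypothetical helper.
    """
    for password in candidates:
        if doc.authenticate(password):
            return password  # stop at the first password that is accepted
    return None
```

With a real document this would be called as `try_passwords(PdfDocument("locked.pdf"), ["", "secret"])`; a `None` result means none of the candidates worked.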

Text Extraction

doc.extract_text(page: int) -> str

Extract plain text from a single page. Pages are zero-indexed.
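
Because pages are zero-indexed, page numbers as a reader would state them need to be shifted by one. A small sketch of that conversion follows; the helper name `text_for_page_numbers` is hypothetical, and it relies only on `page_count()` and `extract_text()` as documented here.

```python
from __future__ import annotations


def text_for_page_numbers(doc, first: int, last: int) -> list[str]:
    """Extract text for 1-based page numbers first..last (inclusive).

    `doc` is expected to be a pdf_oxide.PdfDocument. Hypothetical helper.
    """
    count = doc.page_count()
    if not 1 <= first <= last <= count:
        raise ValueError(f"page range {first}-{last} is outside 1-{count}")
    # Page number n (as printed in a viewer) corresponds to index n - 1.
    return [doc.extract_text(n - 1) for n in range(first, last + 1)]
```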

doc.extract_chars(page: int) -> list[TextChar]

Extract per-character positioning and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, reading_order: str | None = None) -> list[TextSpan]

Extract text spans with font metadata. Each span is a run of identically-styled text. Pass reading_order="column_aware" for multi-column PDFs.
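
One possible use of span-level font metadata is a rough heading detector. The sketch below is a heuristic under stated assumptions: the helper name `heading_candidates` and the 14-point threshold are illustrative choices, while `text`, `font_size`, and `is_bold` are the TextSpan attributes listed in this reference.

```python
from __future__ import annotations


def heading_candidates(doc, page: int, min_size: float = 14.0) -> list[str]:
    """Return span texts that look like headings (large or bold).

    Hypothetical helper; the size threshold is an arbitrary starting point.
    """
    spans = doc.extract_spans(page, reading_order="column_aware")
    return [
        span.text.strip()
        for span in spans
        # Skip whitespace-only spans, then keep large or bold runs.
        if span.text.strip() and (span.font_size >= min_size or span.is_bold)
    ]
```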

doc.extract_page_text(page: int) -> dict

Extract spans, characters, and page dimensions in a single pass. Returns a dict with keys: spans, chars, page_width, page_height, text. More efficient than calling extract_spans() and extract_chars() separately.
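
A short sketch of consuming the returned dict, assuming the keys listed above and that `text` is a string. The helper name `page_summary` and the shape of its result are hypothetical.

```python
def page_summary(doc, page: int) -> dict:
    """Summarise one page from a single extract_page_text() call.

    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    data = doc.extract_page_text(page)
    return {
        "size_pt": (data["page_width"], data["page_height"]),
        "span_count": len(data["spans"]),
        "char_count": len(data["chars"]),
        "preview": data["text"][:80],  # first 80 characters of the page text
    }
```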

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert a page to plain text with layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Convert all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to Markdown.

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Convert all pages to HTML.
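
As an illustration of the per-page converters, the sketch below writes one Markdown file per page. The helper name `export_pages_as_markdown` and the `page_001.md` naming scheme are assumptions; the `to_markdown()` keyword arguments are the ones shown in the signature above.

```python
from __future__ import annotations

from pathlib import Path


def export_pages_as_markdown(doc, out_dir: str) -> list[Path]:
    """Write each page of `doc` to out_dir/page_NNN.md and return the paths.

    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index in range(doc.page_count()):
        markdown = doc.to_markdown(index, detect_headings=True, include_images=False)
        path = target / f"page_{index + 1:03d}.md"  # 1-based file names
        path.write_text(markdown, encoding="utf-8")
        written.append(path)
    return written
```

When a single combined file is enough, `to_markdown_all()` avoids the loop entirely.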

Image Extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extract all images from a page, including images in content streams and nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extract raw image bytes from a page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text across all pages. Set max_results=0 for unlimited results. Returns a list of matches with page number, text, and coordinates.

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Search for text on a single page.
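
Search results can be grouped by page with a few lines of plain Python. This is a sketch: the helper name `hits_by_page` is hypothetical, and it uses only the `page` and `text` attributes of SearchResult described in this reference.

```python
from __future__ import annotations

from collections import defaultdict


def hits_by_page(doc, pattern: str, limit: int = 0) -> dict[int, list[str]]:
    """Group matched texts by zero-based page index.

    Hypothetical helper; limit=0 is passed through as "unlimited".
    """
    results = doc.search(pattern, case_insensitive=True, max_results=limit)
    grouped = defaultdict(list)
    for hit in results:
        grouped[hit.page].append(hit.text)
    return dict(grouped)
```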

Metadata Editing

Method Parameters Description
set_title(title) str Set document title
set_author(author) str Set document author
set_subject(subject) str Set document subject
set_keywords(keywords) str Set document keywords

Page Rotation

Method Parameters Returns Description
page_rotation(page) int int Get current rotation (0, 90, 180, 270)
set_page_rotation(page, degrees) int, int - Set absolute rotation
rotate_page(page, degrees) int, int - Add to current rotation
rotate_all_pages(degrees) int - Rotate all pages
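
The difference between the absolute setter and the relative `rotate_page()` matters when normalising a document. The sketch below resets every rotated page to 0 degrees with `set_page_rotation()`. The helper name `reset_rotations` is hypothetical.

```python
def reset_rotations(doc) -> list[int]:
    """Set every rotated page back to 0 degrees; return the indices changed.

    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    changed = []
    for index in range(doc.page_count()):
        if doc.page_rotation(index) != 0:
            # Absolute: the result is 0 regardless of the current value.
            # rotate_page(index, 0) would instead leave the page unchanged.
            doc.set_page_rotation(index, 0)
            changed.append(index)
    return changed
```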

Page Dimensions

Method Parameters Returns Description
page_media_box(page) int tuple[float, float, float, float] Get MediaBox (llx, lly, urx, ury)
set_page_media_box(page, llx, lly, urx, ury) int, float, float, float, float - Set MediaBox
page_crop_box(page) int tuple | None Get CropBox, or None if not set
set_page_crop_box(page, llx, lly, urx, ury) int, float, float, float, float - Set CropBox
crop_margins(left, right, top, bottom) float, float, float, float - Crop all page margins
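
Box tuples are corner coordinates, not width and height, so the page size has to be derived. A minimal sketch, assuming the `(llx, lly, urx, ury)` order shown for `page_media_box()` and the standard 72 PDF points per inch; the helper name `page_size_inches` is hypothetical.

```python
from __future__ import annotations

POINTS_PER_INCH = 72.0  # one PDF point is 1/72 inch


def page_size_inches(doc, page: int) -> tuple[float, float]:
    """Return (width, height) of the MediaBox in inches. Hypothetical helper."""
    llx, lly, urx, ury = doc.page_media_box(page)
    return ((urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH)
```

For a US Letter MediaBox of `(0, 0, 612, 792)` this gives `(8.5, 11.0)`.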

Erase / Whiteout

Method Parameters Description
erase_region(page, llx, lly, urx, ury) int, float, float, float, float Erase a rectangular region
erase_regions(page, rects) int, list[tuple] Erase multiple regions
clear_erase_regions(page) int Clear pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

Get annotation metadata (type, rect, contents, etc.) for a page.

Method Parameters Returns Description
flatten_page_annotations(page) int - Flatten annotations on a page
flatten_all_annotations() - - Flatten all annotations
is_page_marked_for_flatten(page) int bool Check if page is marked for flatten
unmark_page_for_flatten(page) int - Unmark a page for flatten

Redaction

Method Parameters Returns Description
apply_page_redactions(page) int - Apply redactions on a page
apply_all_redactions() - - Apply all pending redactions
is_page_marked_for_redaction(page) int bool Check if page is marked for redaction
unmark_page_for_redaction(page) int - Unmark a page for redaction

Form Fields

doc.get_form_fields() -> list[FormField]

Get all form fields. See FormField for properties.

doc.get_form_field_value(name: str) -> str | bool | list | None

Get a form field value by name. Returns the appropriate Python type based on the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Set a form field value by name.

doc.has_xfa() -> bool

Check if the document contains XFA forms.

doc.export_form_data(path: str, format: str = "fdf") -> None

Export form data to a file. Supported formats: "fdf" and "xfdf".

Method Parameters Description
flatten_forms() - Flatten all form fields into page content
flatten_forms_on_page(page) int Flatten forms on a specific page
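
A form-filling pass can be sketched with the calls above. The helper name `fill_form` and its skip-on-missing policy are assumptions; it uses `get_form_fields()`, the `name` and `is_readonly` properties of FormField, and `set_form_field_value()`.

```python
def fill_form(doc, values: dict) -> list[str]:
    """Set values for fields that exist and are writable.

    Returns the names that were skipped (unknown or read-only).
    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    writable = {f.name for f in doc.get_form_fields() if not f.is_readonly}
    skipped = []
    for name, value in values.items():
        if name in writable:
            doc.set_form_field_value(name, value)
        else:
            skipped.append(name)
    return skipped
```

Calling `doc.flatten_forms()` afterwards would bake the filled values into the page content before saving.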

Image Manipulation

doc.page_images(page: int) -> list[dict]

Get image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method Parameters Description
reposition_image(page, name, x, y) int, str, float, float Move an image
resize_image(page, name, width, height) int, str, float, float Resize an image
set_image_bounds(page, name, x, y, width, height) int, str, float, float, float, float Set image position and size
clear_image_modifications(page) int Clear pending image modifications
has_image_modifications(page) int Check for pending image modifications (returns bool)

Document Operations

doc.merge_from(source: str | PdfDocument) -> int

Merge pages from another PDF. Accepts a file path or PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attach a file to the PDF.

doc.get_outline() -> list[dict] | None

Get document bookmarks / table of contents. Returns None if no outline exists.

doc.extract_paths(page: int) -> list[dict]

Get vector paths (lines, curves, shapes) from a page.

doc.page_labels() -> list[dict]

Get page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Get XMP metadata as a dictionary with fields like dc_title, dc_creator, xmp_create_date, etc. Returns None if no XMP metadata exists.

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extract text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine or None for the default engine.

Page API (v0.3.34)

PdfDocument is iterable and indexable, returning lazy PdfPage objects. See PdfPage.

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages
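
Iteration combines naturally with the lazy page properties. The sketch below lists pages with no extractable text, which could be candidates for OCR. The helper name `pages_without_text` is hypothetical; `index` and `text` are the PdfPage properties documented in this reference.

```python
from __future__ import annotations


def pages_without_text(doc) -> list[int]:
    """Indices of pages whose extracted text is empty or whitespace only.

    Hypothetical helper; `doc` is expected to be an iterable PdfDocument.
    """
    return [page.index for page in doc if not page.text.strip()]
```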

DOM Access

doc.page(index: int) -> EditorPage

Get a DOM-like page handle for element-level editing. See EditorPage. The class was renamed from PdfPage to EditorPage in v0.3.34 to avoid collision with the new page abstraction.

doc.save_page(page: EditorPage) -> None

Save a modified EditorPage back to the document.

Rendering

Method Return Type Description
render_page(page, dpi=None, format=None) bytes Render a page to PNG or JPEG bytes
flatten_to_images(dpi=150) bytes Flatten all pages to image-based PDF

Saving

doc.save(path: str) -> None

Save the PDF to a file.

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Save with AES-256 password protection and permission controls. If owner_password is None, the user password is used.
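
A sketch of a restrictive save using the permission flags above. The helper name `save_view_only` and its policy (printing allowed, everything else denied) are illustrative. The check for distinct passwords is an assumption based on the note that the owner password falls back to the user password, in which case anyone who can open the file would hold owner rights.

```python
def save_view_only(doc, path: str, user_password: str, owner_password: str) -> None:
    """Save `doc` so that readers can view and print but not copy or edit.

    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    if owner_password == user_password:
        raise ValueError("owner and user passwords should differ")
    doc.save_encrypted(
        path,
        user_password,
        owner_password=owner_password,
        allow_print=True,
        allow_copy=False,
        allow_modify=False,
        allow_annotate=False,
    )
```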


PdfPage (v0.3.34)

A lazy page handle returned by doc[i] or iteration over PdfDocument. All properties are computed on access and dispatch to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property Type Description
index int Zero-based page index
width, height float Page dimensions in PDF points
bbox tuple[float, float, float, float] Bounding box (llx, lly, urx, ury)
text str Extracted plain text
chars, words, lines, spans list[...] Structured text
tables list[dict] Tables with rows + cells (text + bboxes)
images, paths, annotations list[...] Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None) -> bytes
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

EditorPage

DOM-like page handle for element-level access and editing. Obtained via PdfDocument.page(). Renamed from PdfPage in v0.3.34.

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property Type Description
index int Zero-based page index
width float Page width in PDF points
height float Page height in PDF points

Methods

page.children() -> list[PdfElement]

Get all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Find all text elements containing the given substring.

page.find_images() -> list[PdfImage]

Find all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Get a specific element by its ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replace the text content of an element identified by its PdfTextId.

page.annotations() -> list[PdfAnnotation]

Get all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Add a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Add a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Add a sticky note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Remove an annotation by index. Returns True if removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Add new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Remove an element by its ID. Returns True if removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")
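
Annotation methods take `x, y, width, height`, while text elements expose a corner-based `bbox`. The sketch below converts between the two to highlight every match. The helper name `highlight_matches` is hypothetical, and it assumes that `add_highlight()` interprets `(x, y)` as the lower-left corner, matching `(x0, y0)` of the bounding box.

```python
from __future__ import annotations


def highlight_matches(page, needle: str, color=(1.0, 1.0, 0.0)) -> list[str]:
    """Highlight every text element containing `needle`; return annotation IDs.

    Hypothetical helper; `page` is expected to be an EditorPage from doc.page().
    """
    ids = []
    for text in page.find_text_containing(needle):
        x0, y0, x1, y1 = text.bbox
        # Convert corner coordinates to origin plus size (assumed convention).
        ids.append(page.add_highlight(x0, y0, x1 - x0, y1 - y0, color))
    return ids
```

As with the example above, the page still has to be written back with `doc.save_page(page)` before saving the document.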

Pdf

The unified class for creating PDFs from various source formats.

from pdf_oxide import Pdf

Factory Methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Create a PDF from plain text.

Pdf.from_image(path: str) -> Pdf

Create a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Open an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, HTTP, or databases.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

Create a multi-page PDF from multiple image files, one page per image.

Pdf.from_image_bytes(data: bytes) -> Pdf

Create a single-page PDF from image bytes.

Methods

pdf.save(path: str) -> None

Save the PDF to a file.

pdf.to_bytes() -> bytes

Get the PDF content as bytes.

len(pdf) -> int

Get the PDF size in bytes (via __len__).


PdfText

Represents a text element on a page. Returned by EditorPage.find_text_containing().

Property Type Description
id PdfTextId Unique element identifier
value str Text content
text str Text content (alias for value)
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
is_bold bool Whether text is bold
is_italic bool Whether text is italic

Methods

Method Parameters Returns Description
contains(needle) str bool Check if text contains substring
starts_with(prefix) str bool Check if text starts with prefix
ends_with(suffix) str bool Check if text ends with suffix

PdfImage

Represents an image element on a page. Returned by EditorPage.find_images().

Property Type Description
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
width int Image width in pixels
height int Image height in pixels
aspect_ratio float Width / height ratio

PdfAnnotation

Represents an annotation on a page. Returned by EditorPage.annotations().

Property Type Description
subtype str Annotation type (e.g., "Link", "Highlight", "Text")
rect tuple[float, float, float, float] Position (x0, y0, x1, y1)
contents str | None Annotation text contents, or None if absent
color tuple[float, float, float] | None RGB color, or None if not set
is_modified bool Whether the annotation has been modified
is_new bool Whether the annotation is newly added

PdfElement

Generic element wrapper. Returned by EditorPage.children().

Method Returns Description
is_text() bool Check if element is text
is_image() bool Check if element is an image
is_path() bool Check if element is a vector path
is_table() bool Check if element is a table
is_structure() bool Check if element is a structure element
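as_text() PdfText | None Return the element as PdfText, or None if it is not text
as_image() PdfImage | None Return the element as PdfImage, or None if it is not an image

Property Type Description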
bbox tuple[float, float, float, float] Bounding box

TextChar

Represents a single character with positioning and font metadata. Returned by PdfDocument.extract_chars().

from pdf_oxide import TextChar  # or access via PdfDocument

Attribute Type Description
char str The Unicode character
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
font_weight str Weight ("thin", "light", "normal", "medium", "semi-bold", "bold", "extra-bold", "black")
is_italic bool Whether the character is italic
color tuple[float, float, float] RGB color (r, g, b), values 0.0–1.0
rotation_degrees float Character rotation in degrees
origin_x float Text origin X position
origin_y float Text origin Y position
advance_width float Glyph advance width
mcid int | None Marked-content identifier (MCID), or None if the character is not in tagged content

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")
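
Per-character metadata also lends itself to aggregate statistics. The sketch below counts characters per font and size. The helper name `font_usage` and the rounding to one decimal place are assumptions; `char`, `font_name`, and `font_size` are the TextChar attributes listed above.

```python
from __future__ import annotations

from collections import Counter


def font_usage(chars) -> list[tuple[tuple[str, float], int]]:
    """Count non-whitespace characters per (font_name, font_size).

    Returns pairs sorted from most to least used. Hypothetical helper;
    `chars` is expected to be the list returned by extract_chars().
    """
    counts = Counter(
        (ch.font_name, round(ch.font_size, 1))
        for ch in chars
        if not ch.char.isspace()
    )
    return counts.most_common()
```

The most frequent entry is usually a reasonable guess for the body font, which helps pick thresholds for heading detection.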

TextSpan

Represents a run of text sharing the same font and style. Returned by PdfDocument.extract_spans().

Attribute Type Description
text str The text content
bbox tuple[float, float, float, float] Bounding box (x0, y0, x1, y1)
font_name str PostScript font name
font_size float Font size in points
is_bold bool Whether the span is bold
is_italic bool Whether the span is italic
is_monospace bool Whether the font is fixed-width (Courier, Consolas, etc.)
char_widths list[float] Per-glyph advance widths for accurate bounding boxes
color tuple[float, float, float] RGB color (r, g, b), values 0.0–1.0

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

Image Extraction

extract_images() returns ImageInfo objects with image metadata. Use extract_image_bytes() for raw image data suitable for saving to disk.

extract_image_bytes() Return Format

Each dict returned by extract_image_bytes() has the following keys:

Key Type Description
width int Image width in pixels
height int Image height in pixels
data bytes Raw image data
format str Image format (e.g., "png", "jpeg")

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a text search match. Returned by search() and search_page().

Attribute Type Description
page int Zero-based page index
text str Matched text
x float X position in PDF points
y float Y position in PDF points

FormField

Represents a form field. Returned by PdfDocument.get_form_fields().

Property Type Description
name str Fully qualified field name
field_type str Field type: "text", "button", "choice", "signature", or "unknown"
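value str | bool Current field value (type depends on the field type; see get_form_field_value())
tooltip str | None Tooltip text, or None if absent
bounds tuple[float, float, float, float] | None Field rectangle, or None if absent
flags int | None Field flags, or None if absent
max_length int | None Maximum text length, or None if not set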
is_readonly bool Whether the field is read-only
is_required bool Whether the field is required

OfficeConverter

Convert Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Convert a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Convert Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Convert an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Convert Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Convert a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Convert PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Auto-detect format and convert any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")
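
Batch conversion can be sketched by passing the converter in as a callable, which keeps the helper independent of whether the office feature is available. The helper name `convert_folder` and the suffix filter are assumptions; in real use `convert` would be `OfficeConverter.convert`.

```python
from __future__ import annotations

from pathlib import Path

OFFICE_SUFFIXES = {".docx", ".xlsx", ".pptx"}


def convert_folder(convert, src_dir: str, out_dir: str) -> list[Path]:
    """Convert every Office file in src_dir to a PDF in out_dir.

    `convert` is a callable taking a path and returning an object with
    save(path), for example OfficeConverter.convert. Hypothetical helper.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for source in sorted(Path(src_dir).iterdir()):
        if source.suffix.lower() not in OFFICE_SUFFIXES:
            continue  # ignore anything that is not DOCX, XLSX, or PPTX
        pdf = convert(str(source))
        target = out / (source.stem + ".pdf")
        pdf.save(str(target))
        written.append(target)
    return written
```

A call would look like `convert_folder(OfficeConverter.convert, "inbox", "pdfs")`.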

Graphics Classes

These classes are available for advanced PDF creation with graphics:

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()
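
To make the 0.0-1.0 component range concrete, here is the usual arithmetic for turning a hex string into such components. This is plain Python for illustration, not the library's implementation; the function name `hex_to_rgb_floats` is hypothetical, and it is an assumption that `Color.from_hex()` follows the same mapping.

```python
from __future__ import annotations


def hex_to_rgb_floats(hex_color: str) -> tuple[float, float, float]:
    """Convert '#rrggbb' to (r, g, b) floats in the range 0.0-1.0."""
    digits = hex_color.lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    # Each pair of hex digits is one 8-bit channel; divide by 255 to normalise.
    r, g, b = (int(digits[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
    return (r, g, b)
```

Under that assumption, `Color.from_hex("#ff0000")` and `Color(1.0, 0.0, 0.0)` would describe the same red.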

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR Classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

Error Handling

PdfError

All PDF-specific errors raise PdfError:

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

Common error scenarios:

Exception Cause
PdfError Malformed PDF, encrypted without password, parse failure
FileNotFoundError File does not exist
IndexError Page index out of range (pages are zero-indexed, so valid indices are 0 to page_count() - 1)
ValueError Invalid argument (e.g., negative page index)
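
For batch jobs it can be convenient to treat a bad page index as "no text" while still letting genuine PDF failures surface. A minimal sketch, assuming the exception mapping in the table above; the helper name `extract_text_or_default` is hypothetical.

```python
def extract_text_or_default(doc, page: int, default: str = "") -> str:
    """Return the page text, or `default` if the index is invalid.

    Only IndexError and ValueError are swallowed; PdfError (malformed or
    encrypted input) is deliberately left to propagate. Hypothetical helper.
    """
    try:
        return doc.extract_text(page)
    except (IndexError, ValueError):
        return default
```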

Complete Example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

Added in v0.3.38

DocumentBuilder / FluentPageBuilder / EmbeddedFont

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# or: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(size)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext((100, 500, 200, 50), "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 kinds)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, (0.9, 0.9, 0.9))
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# or: .save("out.pdf") / .build() -> bytes
# or: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS Pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

For details, see Create from HTML.

Signature Verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # datetime | None
    sig.verify()                     # "Valid" | "Invalid" | "Unknown"
    sig.verify_detached(pdf_bytes)   # additionally checks the messageDigest

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind the `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

For details, see Digital Signatures.
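
The per-signature calls above can be folded into a simple report. This is a sketch: the helper name `signature_report` is hypothetical, and it assumes `verify()` returns one of the three strings shown in the comments ("Valid", "Invalid", "Unknown").

```python
from __future__ import annotations


def signature_report(doc) -> dict[str, list[str]]:
    """Group signer names by verification status.

    Hypothetical helper; `doc` is expected to be a pdf_oxide.PdfDocument.
    """
    report: dict[str, list[str]] = {"Valid": [], "Invalid": [], "Unknown": []}
    for sig in doc.signatures():
        status = sig.verify()
        # setdefault keeps the helper tolerant of any unexpected status string.
        report.setdefault(status, []).append(sig.signer_name)
    return report
```

A document could then be accepted only when `report["Invalid"]` and `report["Unknown"]` are both empty and `report["Valid"]` is not.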

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates are in PDF points, measured from the bottom-left corner.
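
Rectangles coming from screen or image tools are usually measured from the top-left corner, so they need flipping before being passed to `render_page_region()`. A sketch of that arithmetic follows; the helper name `top_left_rect_to_pdf` is hypothetical, and it assumes `(x, y)` denotes the lower-left corner of the region.

```python
from __future__ import annotations


def top_left_rect_to_pdf(
    x: float, top: float, width: float, height: float, page_height: float
) -> tuple[float, float, float, float]:
    """Convert a top-left-origin rectangle to bottom-left-origin (x, y, w, h).

    `top` is the distance from the top edge of the page to the top of the
    rectangle, in PDF points.
    """
    y = page_height - top - height  # distance from the page bottom to the rect bottom
    return (x, y, width, height)
```

For a 792-point-tall page, a 100 x 50 box placed 20 points below the top edge maps to `(10, 722, 100, 50)`, which could then be passed as `doc.render_page_region(0, 10, 722, 100, 50, format=0)`.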

PDF Flattening

doc.flatten_to_images(dpi: int = 150) -> bytes

Next Steps