What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Python API Reference

PDF Oxide provides native Python bindings built with PyO3. Prebuilt wheels are available for Python 3.8–3.14 on Linux, macOS, and Windows (x86_64 and ARM64).

pip install pdf_oxide

For the Rust API, see the Rust API Reference. For the JavaScript API, see the Node.js API Reference or the WASM API Reference. For details on types, see Types and Enums.

PdfDocument

The main class for opening PDF files, extracting from them, editing them, and saving them.

from pdf_oxide import PdfDocument

Constructor

PdfDocument(path: str, password: str | None = None)

Parameter	Type	Description
`path`	`str`	Path to the PDF file
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)

Passing password= opens an encrypted PDF in a single step. Alternatively, open the document first and then call doc.authenticate(password).

Raises FileNotFoundError if the file does not exist. Raises PdfError if the file is not a valid PDF.

Class methods

PdfDocument.from_bytes(data: bytes, password: str | None = None) -> PdfDocument

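Opens a PDF from in-memory bytes (for example, downloaded from S3 or received over HTTP). Accepts an optional password for encrypted PDFs.

Parameter	Type	Description
`data`	`bytes`	Raw bytes of the PDF file
`password`	`str \| None`	Optional password for encrypted PDFs (default: `None`)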

from pdf_oxide import PdfDocument

# Open PDF from bytes (e.g., downloaded from S3)
doc = PdfDocument.from_bytes(pdf_bytes)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

Methods

General

Method	Return type	Description
`version()`	`tuple[int, int]`	Returns the PDF version as `(major, minor)`, e.g. `(1, 7)`
`authenticate(password)`	`bool`	Authenticates an encrypted PDF with the user or owner password

Document information

doc.page_count() -> int

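Returns the number of pages in the document.

doc.has_structure_tree() -> bool

Checks whether the document is a Tagged PDF with a structure tree.

Authentication

doc.authenticate(password: str) -> bool

Authenticates with a password after the document has been opened. Returns True on success.

Text extraction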

doc.extract_text(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None,
    extract_tables: bool = True
) -> str

Extracts plain text from a single page. Page indices are zero-based. Optionally crops to region, excludes named optional content layers or ink/separation names, and toggles table reconstruction.
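As a minimal sketch of the zero-based indexing described above, the helper below loops over page_count() and calls extract_text() once per page. The name extract_all_text and its separator parameter are illustrative and not part of pdf_oxide; doc is assumed to be an already opened PdfDocument.

```python
def extract_all_text(doc, separator="\n\n"):
    """Join the plain text of every page of a PdfDocument-like object.

    `doc` is assumed to expose page_count() and extract_text(page) as
    documented above; this helper itself is hypothetical.
    """
    pages = []
    for index in range(doc.page_count()):  # page indices are zero-based
        pages.append(doc.extract_text(index))
    return separator.join(pages)
```

With a real file this would be called as extract_all_text(PdfDocument("paper.pdf")).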

doc.extract_chars(
    page: int,
    region: tuple[float, float, float, float] | None = None,
    exclude_layers: list[str] | None = None,
    exclude_inks: list[str] | None = None
) -> list[TextChar]

Extracts per-character positions and font metadata. Returns a list of TextChar objects.

doc.extract_spans(page: int, region: tuple | None = None, reading_order: str | None = None) -> list[TextSpan]

Extracts text spans with font metadata. Each span is a run of consecutive text that shares the same style. For multi-column PDFs, pass reading_order="column_aware".

doc.extract_words(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextWord]

Extracts word-level text with bounding boxes. Returns a list of TextWord objects.

doc.extract_text_lines(
    page: int,
    *,
    include_artifacts: bool = True,
    region: tuple | None = None,
    word_gap_threshold: float | None = None,
    line_gap_threshold: float | None = None,
    profile: ExtractionProfile | None = None
) -> list[TextLine]

Extracts line-level text. Returns a list of TextLine objects.

doc.extract_page_text(page: int, reading_order: str | None = None) -> dict

Extracts spans, characters, and page dimensions in a single pass. Returns a dict with the keys spans, chars, page_width, page_height, and text. This is more efficient than calling extract_spans() and extract_chars() separately.

doc.page_layout_params(page: int) -> LayoutParams

Computes adaptive layout parameters for a page (word/line gap thresholds, median metrics, column count). See LayoutParams.

doc.within(page: int, bbox: tuple[float, float, float, float]) -> PdfPageRegion

Creates a clipped-region handle for extracting text, words, lines, tables, images, and paths inside bbox. See PdfPageRegion.

Automatic extraction and classification

doc.extract_text_auto(page: int) -> str

Automatically selects the best extraction strategy for the page (native text versus OCR) and returns plain text.

doc.extract_page_auto(page: int, options_json: str | None = None) -> str

Automatically extracts a page and returns a JSON document. Pass a JSON string as options_json to tune the pipeline.

doc.classify_page(page: int) -> str

Classifies a single page (e.g. "text", "scanned", "mixed").

doc.classify_document() -> str

Classifies the whole document by sampling pages.

doc.has_text_layer(page: int) -> bool

Checks whether the page already has an extractable native text layer (as opposed to needing OCR).
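The native-versus-OCR decision can also be made by hand. The sketch below checks has_text_layer() and falls back to extract_text_ocr() with the default engine; it is not a description of how extract_text_auto() works internally. The helper name text_with_ocr_fallback is hypothetical, and the OCR branch assumes a build with the ocr feature.

```python
def text_with_ocr_fallback(doc, page):
    """Return native text when the page has a text layer, else OCR text.

    `doc` is assumed to be a PdfDocument; only methods documented in this
    reference are called.
    """
    if doc.has_text_layer(page):
        return doc.extract_text(page)
    # No native text layer: fall back to OCR with the default engine.
    return doc.extract_text_ocr(page)
```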

Conversion

doc.to_plain_text(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts a page to plain text with layout options.

doc.to_plain_text_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None
) -> str

Converts all pages to plain text.

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to Markdown.

doc.to_markdown_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to Markdown.

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts a page to HTML.

doc.to_html_all(
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
    include_form_fields: bool = True
) -> str

Converts all pages to HTML.

Office conversion

Method	Return type	Description
`to_docx(path)`	–	Convert the PDF to a Word document file
`to_docx_bytes()`	`bytes`	Convert the PDF to DOCX bytes
`to_pptx(path)`	–	Convert the PDF to a PowerPoint file
`to_pptx_bytes()`	`bytes`	Convert the PDF to PPTX bytes
`to_xlsx(path)`	–	Convert the PDF to an Excel workbook file
`to_xlsx_bytes()`	`bytes`	Convert the PDF to XLSX bytes

Image extraction

doc.extract_images(page: int) -> list[ImageInfo]

Extracts all images on the page, including images in the content stream and in nested Form XObjects.

doc.extract_image_bytes(page: int) -> list[dict]

Extracts raw image bytes from the page. Each dict contains width, height, data (bytes), and format.

Search

doc.search(
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Searches for text across all pages. Set max_results=0 to leave the number of results unlimited. Returns a list of matches with page number, text, and coordinates.
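Because search() returns one flat list for the whole document, a common follow-up is to group matches by page. The sketch below relies only on the page, text, x, and y attributes listed for SearchResult in this reference; the helper name matches_by_page is hypothetical.

```python
from collections import defaultdict


def matches_by_page(doc, pattern, **search_options):
    """Group search hits by zero-based page index.

    Extra keyword arguments (case_insensitive, literal, whole_word,
    max_results) are forwarded to doc.search() unchanged.
    """
    grouped = defaultdict(list)
    for hit in doc.search(pattern, **search_options):
        grouped[hit.page].append((hit.text, hit.x, hit.y))
    return dict(grouped)
```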

doc.search_page(
    page: int,
    pattern: str,
    case_insensitive: bool = False,
    literal: bool = False,
    whole_word: bool = False,
    max_results: int = 0
) -> list[SearchResult]

Searches for text on a single page.

Metadata editing

Method	Parameters	Description
`set_title(title)`	`str`	Set the document title
`set_author(author)`	`str`	Set the document author
`set_subject(subject)`	`str`	Set the document subject
`set_keywords(keywords)`	`str`	Set the document keywords

Page rotation

Method	Parameters	Returns	Description
`page_rotation(page)`	`int`	`int`	Get the current rotation angle (0, 90, 180, 270)
`set_page_rotation(page, degrees)`	`int, int`	–	Set an absolute rotation angle
`rotate_page(page, degrees)`	`int, int`	–	Add to the current rotation
`rotate_all_pages(degrees)`	`int`	–	Rotate all pages

Page dimensions

Method	Parameters	Returns	Description
`page_media_box(page)`	`int`	`tuple[float, float, float, float]`	Get the MediaBox `(llx, lly, urx, ury)`
`set_page_media_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set the MediaBox
`page_crop_box(page)`	`int`	`tuple \| None`	Get the CropBox
`set_page_crop_box(page, llx, lly, urx, ury)`	`int, float, float, float, float`	–	Set the CropBox
`crop_margins(left, right, top, bottom)`	`float, float, float, float`	–	Crop margins on all pages
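Page boxes are given as lower-left and upper-right corners, so width and height have to be derived. The sketch below does that arithmetic on a `(llx, lly, urx, ury)` tuple such as the one page_media_box() returns. The helper names are hypothetical, and page rotation is deliberately ignored here.

```python
def box_size(box):
    """Return (width, height) in PDF points for an (llx, lly, urx, ury) box."""
    llx, lly, urx, ury = box
    return (urx - llx, ury - lly)


def is_landscape(box):
    """True when the box is wider than it is tall (rotation not considered)."""
    width, height = box_size(box)
    return width > height
```

For example, box_size(doc.page_media_box(0)) gives the unrotated page size of the first page.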

Erase / whiteout

Method	Parameters	Description
`erase_region(page, llx, lly, urx, ury)`	`int, float, float, float, float`	Erase a rectangular region
`erase_regions(page, rects)`	`int, list[tuple]`	Erase multiple regions
`clear_erase_regions(page)`	`int`	Cancel pending erase operations

Annotations

doc.get_annotations(page: int) -> list[dict]

Gets annotation metadata for a page (type, rectangle, contents, and so on).

Method	Parameters	Returns	Description
`flatten_page_annotations(page)`	`int`	–	Flatten the annotations on a page
`flatten_all_annotations()`	–	–	Flatten all annotations
`is_page_marked_for_flatten(page)`	`int`	`bool`	Check whether the page is marked for flattening
`unmark_page_for_flatten(page)`	`int`	–	Remove the flatten mark from a page

Redaction

doc.add_redaction(
    page: int,
    rect: tuple[float, float, float, float],
    fill: tuple[float, float, float] | None = None
) -> None

Marks a rectangular region for redaction, with an optional RGB fill color.

doc.redaction_count(page: int) -> int

Returns the number of pending redactions on the page.

doc.apply_redactions_destructive(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True,
    fill: tuple[float, float, float] = (0.0, 0.0, 0.0)
) -> None

Permanently applies all redactions, removing the underlying content, and optionally scrubs metadata, JavaScript, and embedded files.
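The mark-then-apply flow can be sketched as follows: mark each rectangle with add_redaction(), read back the pending count, then apply everything destructively with the default options. The helper name redact_regions is hypothetical, and the rectangles are assumed to be `(float, float, float, float)` tuples in the form add_redaction() expects.

```python
def redact_regions(doc, page, rects, fill=None):
    """Mark `rects` on `page` for redaction, apply them, and return how
    many redactions were pending on that page just before applying.

    `doc` is assumed to be a PdfDocument. The change is permanent once
    apply_redactions_destructive() has run and the document is saved.
    """
    for rect in rects:
        doc.add_redaction(page, rect, fill=fill)
    pending = doc.redaction_count(page)
    doc.apply_redactions_destructive()  # defaults also scrub metadata, JS, attachments
    return pending
```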

doc.sanitize_document(
    scrub_metadata: bool = True,
    remove_javascript: bool = True,
    remove_embedded_files: bool = True
) -> None

Sanitizes the document without redacting any regions. Removes metadata, JavaScript, and embedded files.

Method	Parameters	Returns	Description
`apply_page_redactions(page)`	`int`	–	Apply the redactions on a page
`apply_all_redactions()`	–	–	Apply all pending redactions
`is_page_marked_for_redaction(page)`	`int`	`bool`	Check whether the page is marked for redaction
`unmark_page_for_redaction(page)`	`int`	–	Remove the redaction mark from a page

Layers and inks

Method	Parameters	Returns	Description
`get_layers()`	–	`list[str]`	List optional content (OCG) layer names
`get_page_inks(page)`	`int`	`list[str]`	List the ink/separation colorant names on a page
`get_page_inks_deep(page)`	`int`	`list[str]`	List inks, including those nested in Form XObjects

Header / footer cleanup

doc.remove_headers(threshold: float = 0.8) -> int
doc.remove_footers(threshold: float = 0.8) -> int
doc.remove_artifacts(threshold: float = 0.8) -> int

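Detects and removes headers, footers, and page artifacts that repeat across the document. threshold is the repetition ratio across pages. Returns the number of elements removed.

Method	Parameters	Description
`erase_header(page)`	`int`	Erase the detected header region on a page
`edit_header(page)`	`int`	Mark the header region for editing
`erase_footer(page)`	`int`	Erase the detected footer region on a page
`edit_footer(page)`	`int`	Mark the footer region for editing
`erase_artifacts(page)`	`int`	Erase the detected artifacts on a page
`sync_editor_erasures()`	–	Apply pending header/footer/artifact erasures to the editor

Form fields

doc.get_form_fields() -> list[FormField]

Gets all form fields. See FormField for the attributes.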

doc.get_form_field_value(name: str) -> str | bool | list | None

Gets a form field value by name. Returns the appropriate Python type for the field type.

doc.set_form_field_value(name: str, value: str | bool) -> None

Sets a form field value by name.

doc.has_xfa() -> bool

Checks whether the document contains an XFA form.

doc.export_form_data(path: str, format: str = "fdf") -> None

Exports form data to a file. Supported formats: "fdf" and "xfdf".
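A typical use of these methods is to fill several fields at once while skipping names the form does not define. The sketch below assumes only get_form_fields(), the FormField.name attribute, and set_form_field_value(); the helper name fill_form is hypothetical.

```python
def fill_form(doc, values):
    """Set every field in `values` that exists in the document.

    `values` maps fully qualified field names to str or bool values.
    Returns the list of field names that were actually updated.
    """
    known = {field.name for field in doc.get_form_fields()}
    updated = []
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
            updated.append(name)
    return updated
```

After filling, the data could be written out with doc.export_form_data("data.fdf") or the document saved with doc.save(...).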

Method	Parameters	Description
`flatten_forms()`	–	Flatten all form fields into page content
`flatten_forms_on_page(page)`	`int`	Flatten the forms on a specific page

Image manipulation

doc.page_images(page: int) -> list[dict]

Gets image names and bounds for positioning operations. Each dict contains name, bounds [x, y, width, height], and matrix.

Method	Parameters	Description
`reposition_image(page, name, x, y)`	`int, str, float, float`	Move an image
`resize_image(page, name, width, height)`	`int, str, float, float`	Resize an image
`set_image_bounds(page, name, x, y, width, height)`	`int, str, float, float, float, float`	Set an image's position and size
`clear_image_modifications(page)`	`int`	Cancel pending image modifications
`has_image_modifications(page)`	`int` → `bool`	Check for pending image modifications

Document operations

doc.merge_from(source: str | PdfDocument) -> int

Merges pages from another PDF. Accepts a file path or a PdfDocument instance. Returns the number of pages merged.

doc.embed_file(name: str, data: bytes) -> None

Attaches a file to the PDF.

doc.get_outline() -> list[dict] | None

Gets the document's bookmarks / table of contents. Returns None if there is no outline.

doc.extract_paths(page: int, region: tuple | None = None) -> list[dict]

Gets vector paths (lines, curves, shapes) from the page.

doc.extract_rects(page: int, region: tuple | None = None) -> list[dict]

Gets axis-aligned rectangles (derived from fill/stroke paths) from the page.

doc.extract_lines(page: int, region: tuple | None = None) -> list[dict]

Gets straight line segments from the page.

doc.extract_tables(page: int, region: tuple | None = None, table_settings: dict | None = None) -> list[dict]

Detects and extracts tables. Each table is a dict with rows and cells (text + bounding box). Pass table_settings to tune the detection strategy.

doc.extract_structured(page: int) -> str

Extracts the page as a structured JSON document (logical reading order, blocks, roles).

doc.page_labels() -> list[dict]

Gets the page label ranges. Each dict contains start_page, style, prefix, and start_value.

doc.xmp_metadata() -> dict | None

Gets the XMP metadata as a dictionary with fields such as dc_title, dc_creator, and xmp_create_date. Returns None if there is no XMP metadata.

OCR

doc.extract_text_ocr(page: int, engine: OcrEngine | None = None) -> str

Extracts text using OCR. Requires the ocr feature in the Rust build. Pass a custom OcrEngine, or None for the default engine.

Page extraction and reordering

doc.extract_pages(pages: list[int], output: str) -> None

Extracts the given page indices into a new PDF file at output.

doc.extract_pages_to_bytes(pages: list[int]) -> bytes

Extracts the given page indices into a new PDF returned as bytes.
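One way to use extract_pages_to_bytes() is to split a document into fixed-size chunks. The sketch below builds explicit lists of zero-based indices and passes each list to the method. Both helper names are hypothetical, and doc is assumed to be a PdfDocument.

```python
def chunk_page_indices(page_count, chunk_size):
    """Split range(page_count) into consecutive lists of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        list(range(start, min(start + chunk_size, page_count)))
        for start in range(0, page_count, chunk_size)
    ]


def split_document(doc, chunk_size):
    """Return one PDF (as bytes) per chunk of pages."""
    return [
        doc.extract_pages_to_bytes(pages)
        for pages in chunk_page_indices(doc.page_count(), chunk_size)
    ]
```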

doc.extract_page_ranges_to_bytes(ranges: list[tuple[int, int]]) -> bytes

Extracts one or more (start, end) page ranges into a new PDF returned as bytes.

Method	Parameters	Description
`select_pages(pages)`	`list[int]`	Keep only the listed pages, in the given order
`delete_page(index)`	`int`	Delete a single page
`move_page(from_index, to_index)`	`int, int`	Move a page to a new position

Compliance and validation

doc.validate_pdf_a(level: str = "1b") -> dict

Validates against a PDF/A conformance level (e.g. "1b", "2b", "3b"). Returns a report dict.

doc.convert_to_pdf_a(level: str = "2b") -> dict

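Converts the document to PDF/A and returns a conversion report dict.

doc.validate_pdf_ua() -> dict

Validates against PDF/UA (accessibility) requirements.

doc.validate_pdf_x(level: str = "1a_2001") -> dict

Validates against a PDF/X (print production) conformance level.

Permissions and warnings

doc.permissions() -> dict

Returns the document's encryption permission flags (print, copy, modify, annotate, and so on).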

doc.structured_warnings() -> list

Returns warnings collected during structured/tagged content extraction.

doc.flatten_warnings() -> list[str]

Returns warnings collected during form/annotation flattening.

Signatures and Document Security Store

doc.signatures() -> list[Signature]

Returns all digital signatures in the document. See Signature.

doc.signature_count() -> int

Returns the number of digital signatures.

doc.dss() -> Dss | None

Returns the Document Security Store (LTV material) parsed from the document, or None. See Dss.

Page API (v0.3.34)

PdfDocument supports iteration and indexing, returning lazy Page objects. See Page.

len(doc)                  # number of pages
doc[i]                    # page at index i (negative indexing supported)
doc[-1]                   # last page
for page in doc: ...      # iterate pages
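As a small illustration of the iteration and negative-indexing protocol, the sketch below walks a document and summarizes each lazy Page using only the index, width, height, and text properties listed in the Page section. The helper names are hypothetical.

```python
def page_summaries(doc):
    """Return (index, width, height, character count) for every page."""
    return [
        (page.index, page.width, page.height, len(page.text))
        for page in doc  # PdfDocument is iterable and yields Page objects
    ]


def last_page_text(doc):
    """Return the text of the final page, or '' for an empty document."""
    return doc[-1].text if len(doc) else ""
```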

DOM access

doc.page(index: int) -> PdfPage

Gets a DOM-like page handle for element-level editing. See PdfPage.

doc.save_page(page: PdfPage) -> None

Saves a modified PdfPage back to the document.

Rendering

doc.render_page(
    page: int,
    dpi: int | None = None,
    format: str | None = None,
    background: tuple[float, float, float, float] | None = None,
    transparent: bool = False,
    render_annotations: bool | None = None,
    jpeg_quality: int | None = None,
    excluded_layers: list[str] | None = None
) -> bytes

Renders a page to PNG or JPEG bytes, with control over DPI, background, transparency, annotation rendering, JPEG quality, and excluded layers.
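To estimate output dimensions before rendering, recall that one PDF point is 1/72 inch, so the pixel size scales with dpi / 72. The sketch below shows that arithmetic only. The helper name is hypothetical, and the renderer's own rounding may differ by a pixel.

```python
def rendered_size_px(width_pt, height_pt, dpi):
    """Approximate pixel dimensions of a page rendered at `dpi`.

    PDF user space uses 72 points per inch, so the scale factor is dpi / 72.
    """
    scale = dpi / 72.0
    return (round(width_pt * scale), round(height_pt * scale))
```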

doc.render_pixmap(page: int, dpi: int | None = None) -> RenderedPixmap

Renders a page to a raw RGBA RenderedPixmap (a named tuple with width, height, and data).

doc.render_separations(page: int, dpi: int | None = None) -> list[SeparationPlate]

Renders the per-ink separation plates for a page. Returns a list of SeparationPlate named tuples (name, width, height, data).

doc.render_separation(page: int, ink_name: str, dpi: int | None = None) -> SeparationPlate

Renders a single named ink separation plate.

Method	Return type	Description
`render_page_fit(page, fit_width, fit_height, format=0)`	`bytes`	Render a page scaled to fit a pixel box
`flatten_to_images(dpi=150)`	`bytes`	Flatten all pages into an image-based PDF

Saving

doc.save(path: str, compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> None

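Saves the PDF to a file. Stream compression, garbage collection of unused objects, and linearization (fast web view) can each be turned on or off.

doc.to_bytes(compress: bool = True, garbage_collect: bool = True, linearize: bool = False) -> bytes

Serializes the PDF to bytes with the same options as save().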

doc.save_encrypted(
    path: str,
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> None

Saves with AES-256 password protection and permission controls. If owner_password is None, the user password is used.

doc.to_bytes_encrypted(
    user_password: str,
    owner_password: str | None = None,
    allow_print: bool = True,
    allow_copy: bool = True,
    allow_modify: bool = True,
    allow_annotate: bool = True
) -> bytes

Serializes an AES-256 encrypted PDF to bytes.

Page

A lazy page handle returned by doc[i] or by iterating a PdfDocument. All properties are computed on access and dispatched to the parent document.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    page = doc[0]
    text = page.text
    md = page.markdown(detect_headings=True)

Properties (lazy)

Property	Type	Description
`index`	`int`	Zero-based page index
`width`, `height`	`float`	Page dimensions in PDF points
`bbox`	`tuple[float, float, float, float]`	`(llx, lly, urx, ury)`
`text`	`str`	Extracted plain text
`chars`, `words`, `lines`, `spans`	`list[...]`	Structured text
`tables`	`list[dict]`	Tables with rows and cells (text + bounding box)
`images`, `paths`, `annotations`	`list[...]`	Page content

Methods

page.markdown(preserve_layout=False, detect_headings=True,
              include_images=False, image_output_dir=None,
              embed_images=True, include_form_fields=True) -> str
page.plain_text(...) -> str
page.html(...) -> str
page.render(dpi=None, format=None, background=None, transparent=False,
            render_annotations=None, jpeg_quality=None, excluded_layers=None) -> bytes
page.render_pixmap(dpi=None) -> RenderedPixmap
page.search(pattern, case_insensitive=False, literal=False,
            whole_word=False, max_results=100) -> list
page.region(x, y, width, height) -> PdfPageRegion

Lazy page objects are also exposed through doc.pages() (an iterator equivalent to iterating the document directly).

PdfPage

A DOM-like page handle for element-level access and editing. Obtained through PdfDocument.page().

from pdf_oxide import PdfDocument

doc = PdfDocument("file.pdf")
page = doc.page(0)

Properties

Property	Type	Description
`index`	`int`	Zero-based page index
`width`	`float`	Page width in PDF points
`height`	`float`	Page height in PDF points

Methods

page.children() -> list[PdfElement]

Gets all elements on the page.

page.find_text_containing(needle: str) -> list[PdfText]

Finds all text elements that contain the given substring.

page.find_images() -> list[PdfImage]

Finds all image elements on the page.

page.get_element(element_id: str) -> PdfElement | None

Gets a specific element by ID.

page.set_text(text_id: PdfTextId, new_text: str) -> None

Replaces the text content of the element identified by PdfTextId.

page.annotations() -> list[PdfAnnotation]

Gets all annotations on the page.

page.add_link(x: float, y: float, width: float, height: float, url: str) -> str

Adds a URL link annotation. Returns the annotation ID.

page.add_highlight(x: float, y: float, width: float, height: float, color: tuple[float, float, float]) -> str

Adds a highlight annotation with an RGB color. Returns the annotation ID.

page.add_note(x: float, y: float, text: str) -> str

Adds a sticky note annotation. Returns the annotation ID.

page.remove_annotation(index: int) -> bool

Removes an annotation by index. Returns True if it was removed.

page.add_text(text: str, x: float, y: float, font_size: float = 12.0) -> PdfTextId

Adds new text to the page. Returns a PdfTextId for later reference.

page.remove_element(element_id: PdfTextId) -> bool

Removes an element by ID. Returns True if it was removed.

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("invoice.pdf")
page = doc.page(0)

# Find and replace text
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")

# Add a link
page.add_link(100, 700, 200, 20, "https://example.com")

doc.save_page(page)
doc.save("invoice_updated.pdf")

Pdf

A unified class for creating PDFs from various source formats.

from pdf_oxide import Pdf

Factory methods

Pdf.from_markdown(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from Markdown content.

Pdf.from_html(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from HTML content.

Pdf.from_text(content: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from plain text.

Pdf.from_markdown_with_template(content: str, template: str, title: str | None = None, author: str | None = None) -> Pdf

Creates a PDF from Markdown rendered through a named CSS/layout template.

Pdf.from_image(path: str) -> Pdf

Creates a single-page PDF from an image file (JPEG, PNG).

Pdf.from_bytes(data: bytes) -> Pdf

Opens an existing PDF from in-memory bytes for modification. Useful for loading PDFs downloaded from S3, HTTP, or a database.

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")

Pdf.from_images(paths: list[str]) -> Pdf

Creates a multi-page PDF from several image files, one page per image.

Pdf.from_image_bytes(data: bytes) -> Pdf

Creates a single-page PDF from image bytes.

Pdf.merge(paths: list[str]) -> Pdf

Merges multiple PDF files (by path) into a single Pdf.

Methods

pdf.save(path: str) -> None

Saves the PDF to a file.

pdf.to_bytes() -> bytes

Gets the PDF content as bytes.

len(pdf) -> int

Gets the PDF size in bytes (via __len__).

PdfText

Represents a text element on a page. Returned by PdfPage.find_text_containing().

Property	Type	Description
`id`	`PdfTextId`	Unique element identifier
`value`	`str`	Text content
`text`	`str`	Text content (alias of `value`)
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether the text is bold
`is_italic`	`bool`	Whether the text is italic

Methods

Method	Parameters	Returns	Description
`contains(needle)`	`str`	`bool`	Check whether the text contains a substring
`starts_with(prefix)`	`str`	`bool`	Check whether the text starts with a prefix
`ends_with(suffix)`	`str`	`bool`	Check whether the text ends with a suffix

PdfImage

Represents an image element on a page. Returned by PdfPage.find_images().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`width`	`int`	Image width in pixels
`height`	`int`	Image height in pixels
`aspect_ratio`	`float`	Width / height ratio

PdfAnnotation

Represents an annotation on a page. Returned by PdfPage.annotations().

Property	Type	Description
`subtype`	`str`	Annotation type (e.g. `"Link"`, `"Highlight"`, `"Text"`)
`rect`	`tuple[float, float, float, float]`	Position `(x0, y0, x1, y1)`
`contents`	`str \| None`	Annotation contents
`color`	`tuple[float, float, float] \| None`	Annotation color
`is_modified`	`bool`	Whether the annotation has been modified
`is_new`	`bool`	Whether the annotation was newly added

PdfElement

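A generic element wrapper. Returned by PdfPage.children().

Method	Returns	Description
`is_text()`	`bool`	Check whether the element is text
`is_image()`	`bool`	Check whether the element is an image
`is_path()`	`bool`	Check whether the element is a vector path
`is_table()`	`bool`	Check whether the element is a table
`is_structure()`	`bool`	Check whether the element is a structure element
`as_text()`	`PdfText \| None`	The element as PdfText
`as_image()`	`PdfImage \| None`	The element as PdfImage

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounding box

TextChar

Represents a single character with position and font metadata. Returned by PdfDocument.extract_chars().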

from pdf_oxide import TextChar  # or access via PdfDocument

Property	Type	Description
`char`	`str`	Unicode character
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`font_weight`	`str`	Weight (`"thin"`, `"light"`, `"normal"`, `"medium"`, `"semi-bold"`, `"bold"`, `"extra-bold"`, `"black"`)
`is_italic`	`bool`	Whether the character is italic
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0
`rotation_degrees`	`float`	Character rotation angle
`origin_x`	`float`	Text origin X position
`origin_y`	`float`	Text origin Y position
`advance_width`	`float`	Glyph advance width
`mcid`	`int \| None`	Marked content ID

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:5]:
    print(f"'{ch.char}' at bbox={ch.bbox} "
          f"font={ch.font_name} size={ch.font_size:.1f} "
          f"weight={ch.font_weight} italic={ch.is_italic}")

TextSpan

Represents a run of consecutive text that shares the same font and style. Returned by PdfDocument.extract_spans().

Property	Type	Description
`text`	`str`	Text content
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether the span is bold
`is_italic`	`bool`	Whether the span is italic
`is_monospace`	`bool`	Whether the font is fixed-width (Courier, Consolas, etc.)
`char_widths`	`list[float]`	Per-glyph advance widths for precise bounding boxes
`color`	`tuple[float, float, float]`	RGB color `(r, g, b)`, values 0.0–1.0

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f} "
          f"bold={span.is_bold} italic={span.is_italic} color={span.color}")

Image extraction

extract_images() returns ImageInfo objects that carry image metadata. To get raw image data suitable for saving to disk, use extract_image_bytes().

extract_image_bytes() return format

Each dict returned by extract_image_bytes() has the following keys.

Key	Type	Description
`width`	`int`	Image width in pixels
`height`	`int`	Image height in pixels
`data`	`bytes`	Raw image data
`format`	`str`	Image format (e.g. `"png"`, `"jpeg"`)

Example

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']}")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

SearchResult

Represents a text search match. Returned by search() and search_page().

Property	Type	Description
`page`	`int`	Zero-based page index
`text`	`str`	Matched text
`x`	`float`	X position in PDF points
`y`	`float`	Y position in PDF points

FormField

Represents a form field. Returned by PdfDocument.get_form_fields().

Property	Type	Description
`name`	`str`	Fully qualified field name
`field_type`	`str`	Field type: `"text"`, `"button"`, `"choice"`, `"signature"`, or `"unknown"`
`value`	`str \| bool \| …`	Field value
`tooltip`	`str \| None`	Tooltip text
`bounds`	`tuple[float, float, float, float] \| None`	Field bounds
`flags`	`int \| None`	Field flags
`max_length`	`int \| None`	Maximum length
`is_readonly`	`bool`	Whether the field is read-only
`is_required`	`bool`	Whether the field is required

TextWord

Text grouped at word level. Returned by PdfDocument.extract_words() and PdfPageRegion.extract_words().

Property	Type	Description
`text`	`str`	Word text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`font_name`	`str`	PostScript font name
`font_size`	`float`	Font size in points
`is_bold`	`bool`	Whether the word is bold
`is_italic`	`bool`	Whether the word is italic
`chars`	`list[TextChar]`	Constituent characters

TextLine

Text grouped at line level. Returned by PdfDocument.extract_text_lines() and PdfPageRegion.extract_text_lines().

Property	Type	Description
`text`	`str`	Line text
`bbox`	`tuple[float, float, float, float]`	Bounding box `(x0, y0, x1, y1)`
`words`	`list[TextWord]`	Words in the line
`chars`	`list[TextChar]`	Characters in the line

PdfPageRegion

A clipped region of a page. Returned by PdfDocument.within() and PdfPage.region().

Property	Type	Description
`bbox`	`tuple[float, float, float, float]`	Bounds of the region

Methods

region.extract_text() -> str
region.extract_words() -> list[TextWord]
region.extract_text_lines() -> list[TextLine]
region.extract_tables(table_settings: dict | None = None) -> list[dict]
region.extract_images() -> list
region.extract_paths() -> list

Extraction methods scoped to the region's bounding box.

LayoutParams

Adaptive layout parameters computed for a page. Returned by PdfDocument.page_layout_params().

Property	Type	Description
`word_gap_threshold`	`float`	Gap threshold between words, in points
`line_gap_threshold`	`float`	Gap threshold between lines, in points
`median_char_width`	`float`	Median character width
`median_font_size`	`float`	Median font size
`median_line_spacing`	`float`	Median line spacing
`column_count`	`int`	Number of detected text columns
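One possible use of column_count is to switch extract_spans() to column-aware reading order only when a page really has more than one column. The sketch below assumes the page_layout_params() and extract_spans() signatures given earlier in this reference; the helper name is hypothetical.

```python
def spans_in_reading_order(doc, page):
    """Extract spans, requesting column-aware order on multi-column pages."""
    params = doc.page_layout_params(page)
    if params.column_count > 1:
        return doc.extract_spans(page, reading_order="column_aware")
    return doc.extract_spans(page)  # single column: library default order
```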

ExtractionProfile

A tunable text extraction profile passed to extract_words() / extract_text_lines().

from pdf_oxide import ExtractionProfile

Static constructors

ExtractionProfile.conservative()
ExtractionProfile.aggressive()
ExtractionProfile.balanced()
ExtractionProfile.academic()
ExtractionProfile.policy()
ExtractionProfile.form()
ExtractionProfile.government()
ExtractionProfile.scanned_ocr()
ExtractionProfile.adaptive()
ExtractionProfile.available() -> list[str]   # names of all built-in profiles

Properties

Property	Type	Description
`name`	`str`	Profile name
`tj_offset_threshold`	`float`	TJ array offset threshold for word splitting
`word_margin_ratio`	`float`	Word margin ratio
`space_threshold_em_ratio`	`float`	Space width threshold (em ratio)
`space_char_multiplier`	`float`	Space character multiplier
`use_adaptive_threshold`	`bool`	Whether adaptive thresholds are enabled

OfficeConverter

Converts Office documents (DOCX, XLSX, PPTX) to PDF. Requires the office feature in the Rust build.

from pdf_oxide import OfficeConverter

OfficeConverter()   # instances are stateless; the conversion methods are also usable as static methods

Methods

OfficeConverter.from_docx(path: str) -> Pdf

Converts a Word document to a Pdf object.

OfficeConverter.from_docx_bytes(data: bytes) -> Pdf

Converts Word document bytes to a Pdf object.

OfficeConverter.from_xlsx(path: str) -> Pdf

Converts an Excel spreadsheet to a Pdf object.

OfficeConverter.from_xlsx_bytes(data: bytes) -> Pdf

Converts Excel spreadsheet bytes to a Pdf object.

OfficeConverter.from_pptx(path: str) -> Pdf

Converts a PowerPoint presentation to a Pdf object.

OfficeConverter.from_pptx_bytes(data: bytes) -> Pdf

Converts PowerPoint presentation bytes to a Pdf object.

OfficeConverter.convert(path: str) -> Pdf

Detects the format automatically and converts any supported Office document to a Pdf object.

Example

from pdf_oxide import OfficeConverter

pdf = OfficeConverter.from_docx("report.docx")
pdf.save("report.pdf")

# Or use convert() for auto-detection
pdf = OfficeConverter.convert("spreadsheet.xlsx")
pdf.save("spreadsheet.pdf")

Graphics classes

The following classes are available for advanced PDF generation with graphics.

Color

from pdf_oxide import Color

Color(r: float, g: float, b: float)  # RGB, values 0.0-1.0
Color.from_hex("#ff0000")
Color.black()
Color.white()
Color.red()
Color.green()
Color.blue()

BlendMode

from pdf_oxide import BlendMode

BlendMode.NORMAL()
BlendMode.MULTIPLY()
BlendMode.SCREEN()
BlendMode.OVERLAY()
BlendMode.DARKEN()
BlendMode.LIGHTEN()
BlendMode.COLOR_DODGE()
BlendMode.COLOR_BURN()
BlendMode.HARD_LIGHT()
BlendMode.SOFT_LIGHT()
BlendMode.DIFFERENCE()
BlendMode.EXCLUSION()

ExtGState

from pdf_oxide import ExtGState

gs = ExtGState()
gs = gs.fill_alpha(0.5)
gs = gs.stroke_alpha(0.8)
gs = gs.alpha(0.5)  # Set both fill and stroke
gs = gs.blend_mode(BlendMode.MULTIPLY())

gs = ExtGState.semi_transparent()  # Preset

LineCap / LineJoin

from pdf_oxide import LineCap, LineJoin

LineCap.BUTT()       # Default
LineCap.ROUND()
LineCap.SQUARE()

LineJoin.MITER()     # Default
LineJoin.ROUND()
LineJoin.BEVEL()

Gradients

from pdf_oxide import LinearGradient, RadialGradient, Color

# Linear gradient (fluent API)
grad = (LinearGradient()
    .start(0, 0)
    .end(100, 0)
    .add_stop(0.0, Color.red())
    .add_stop(1.0, Color.blue()))

# Convenience constructors
hgrad = LinearGradient.horizontal(200, Color.red(), Color.blue())
vgrad = LinearGradient.vertical(100, Color(1, 1, 0), Color(0, 0, 1))

# Radial gradient
rgrad = RadialGradient.centered(50, 50, 50)
rgrad = rgrad.add_stop(0.0, Color(1, 1, 0))
rgrad = rgrad.add_stop(1.0, Color(1, 0, 0))

PatternPresets

from pdf_oxide import PatternPresets, Color

PatternPresets.horizontal_stripes(width, height, stripe_height, color)
PatternPresets.vertical_stripes(width, height, stripe_width, color)
PatternPresets.checkerboard(size, color1, color2)
PatternPresets.dots(spacing, radius, color)
PatternPresets.diagonal_lines(size, line_width, color)
PatternPresets.crosshatch(size, line_width, color)

OCR classes

Requires the ocr feature in the Rust build.

OcrEngine

from pdf_oxide import OcrEngine, OcrConfig

engine = OcrEngine(
    det_model_path: str,
    rec_model_path: str,
    dict_path: str,
    config: OcrConfig | None = None
)

OcrConfig

from pdf_oxide import OcrConfig

config = OcrConfig(
    det_threshold: float | None = None,
    box_threshold: float | None = None,
    rec_threshold: float | None = None,
    num_threads: int | None = None,
    max_candidates: int | None = None,
    use_v5: bool = False
)

DocumentBuilder

A fluent builder for composing pages one at a time. See the example below and Creating from Scratch.

from pdf_oxide import DocumentBuilder

Document-level methods

Method	Parameters	Description
`DocumentBuilder()`	–	Create a new builder
`title(title)`	`str`	Set the document title
`author(author)`	`str`	Set the document author
`subject(subject)`	`str`	Set the document subject
`keywords(keywords)`	`str`	Set the document keywords
`creator(creator)`	`str`	Set the creating application name
`on_open(script)`	`str`	Set a document-level open JavaScript action
`tagged_pdf_ua1()`	–	Produce a Tagged PDF/UA-1 accessible document
`language(lang)`	`str`	Set the document language (e.g. `"en-US"`)
`role_map(custom, standard)`	`str, str`	Map a custom structure tag to a standard tag
`register_embedded_font(name, font)`	`str, EmbeddedFont`	Register a font (consumes the `EmbeddedFont`)

Page factories

builder.a4_page() -> FluentPageBuilder       # 595 x 842 pt
builder.letter_page() -> FluentPageBuilder   # 612 x 792 pt
builder.page(width: float, height: float) -> FluentPageBuilder
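
The factories above take page dimensions in PDF points (1 pt = 1/72 inch). As a minimal sketch, assuming you start from millimetre sizes, a custom size for page(width, height) could be computed as follows. The helper name mm_to_pt is hypothetical and is not part of pdf_oxide.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A5 is 148 x 210 mm; the results could be passed to builder.page(width, height).
a5_width_pt = mm_to_pt(148)
a5_height_pt = mm_to_pt(210)
print(f"A5: {a5_width_pt:.1f} x {a5_height_pt:.1f} pt")
```

Rounding mm_to_pt(210) and mm_to_pt(297) gives the 595 x 842 pt listed for a4_page().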

Output

builder.build() -> bytes
builder.save(path: str) -> None
builder.save_encrypted(path: str, user_password: str, owner_password: str) -> None
builder.to_bytes_encrypted(user_password: str, owner_password: str) -> bytes

FluentPageBuilder

Buffers page-level operations until done() is called. Returned by DocumentBuilder.a4_page() / letter_page() / page(). Unless a different return type is shown, every method returns self for chaining; done() commits the page and returns the parent DocumentBuilder.

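Text and layout

Method	Parameters	Description
`font(name, size)`	`str, float`	Set the current font and size
`at(x, y)`	`float, float`	Move the cursor to an absolute position
`text(text)`	`str`	Draw text at the cursor position
`heading(level, text)`	`int, str`	Draw a heading (levels 1–6)
`paragraph(text)`	`str`	Draw a paragraph with automatic line wrapping
`space(points)`	`float`	Advance the cursor vertically by the given number of points
`horizontal_rule()`	–	Draw a horizontal rule
`columns(column_count, gap_pt, text)`	`int, float, str`	Flow text into balanced columns
`footnote(ref_mark, note_text)`	`str, str`	Inline reference mark plus a footnote at the bottom of the page
`new_page_same_size()`	–	Start a new page with the same size
`measure(text) -> float`	`str`	Measure the rendered text width in points
`remaining_space() -> float`	–	Vertical space remaining on the page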

Inline runs

page.inline(text: str)
page.inline_bold(text: str)
page.inline_italic(text: str)
page.inline_color(text: str, r: float, g: float, b: float)
page.newline()

Links and actions

page.link_url(url: str)
page.link_page(page: int)
page.link_named(name: str)
page.link_javascript(script: str)
page.on_open(script: str)
page.on_close(script: str)
page.field_keystroke(script: str)
page.field_format(script: str)
page.field_validate(script: str)
page.field_calculate(script: str)

Markup annotations

page.highlight(color: tuple[float, float, float])
page.underline(color: tuple[float, float, float])
page.strikeout(color: tuple[float, float, float])
page.squiggly(color: tuple[float, float, float])
page.sticky_note(text: str)
page.sticky_note_at(x: float, y: float, text: str)
page.watermark(text: str)
page.watermark_confidential()
page.watermark_draft()
page.stamp(name: str)
page.freetext(x: float, y: float, w: float, h: float, text: str)

AcroForm widgets

page.text_field(name: str, x: float, y: float, w: float, h: float, default_value: str | None = None)
page.checkbox(name: str, x: float, y: float, w: float, h: float, checked: bool = False)
page.combo_box(name: str, x: float, y: float, w: float, h: float, options: list[str], selected: str | None = None)
page.radio_group(name: str, buttons: list[tuple[str, float, float, float, float]], selected: str | None = None)
page.push_button(name: str, x: float, y: float, w: float, h: float, caption: str)
page.signature_field(name: str, x: float, y: float, w: float, h: float)

Graphics

page.rect(x: float, y: float, w: float, h: float)
page.filled_rect(x: float, y: float, w: float, h: float, r: float, g: float, b: float)
page.line(x1: float, y1: float, x2: float, y2: float)
page.text_in_rect(x: float, y: float, w: float, h: float, text: str, align: int | None = None)
page.stroke_rect(x, y, w, h, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_rect_dashed(x, y, w, h, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)
page.stroke_line(x1, y1, x2, y2, width=1.0, color=(0.0, 0.0, 0.0))
page.stroke_line_dashed(x1, y1, x2, y2, dash, width=1.0, color=(0.0, 0.0, 0.0), phase=0.0)

Images and barcodes

page.image_with_alt(bytes: bytes, x: float, y: float, w: float, h: float, alt_text: str)
page.image_artifact(bytes: bytes, x: float, y: float, w: float, h: float)
page.barcode_1d(barcode_type: int, data: str, x: float, y: float, w: float, h: float)
page.barcode_qr(data: str, x: float, y: float, size: float)

barcode_type: 0=Code128, 1=Code39, 2=EAN13, 3=EAN8, 4=UPCA, 5=ITF, 6=Code93, 7=Codabar.
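
Bare integer codes are easy to mix up at call sites. A minimal sketch, assuming the codes listed above, wraps them in an IntEnum. The class name BarcodeType is hypothetical and is not exported by pdf_oxide.

```python
from enum import IntEnum


class BarcodeType(IntEnum):
    """Hypothetical names for the integer barcode_type codes listed above."""
    CODE128 = 0
    CODE39 = 1
    EAN13 = 2
    EAN8 = 3
    UPCA = 4
    ITF = 5
    CODE93 = 6
    CODABAR = 7


# IntEnum members are ints, so they should be usable wherever barcode_type: int
# is expected, e.g. page.barcode_1d(BarcodeType.EAN13, "5901234123457", 72, 600, 200, 60)
print(int(BarcodeType.EAN13))
```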

Tables

page.table(table: Table)
page.streaming_table(
    columns: list[Column],
    repeat_header: bool = False,
    mode: str = "fixed",
    sample_rows: int = 50,
    min_col_width_pt: float = 20.0,
    max_col_width_pt: float = 400.0,
    max_rowspan: int = 1,
    batch_size: int = 256
) -> StreamingTable

Commit

page.done() -> DocumentBuilder

EmbeddedFont

A TTF/OTF font registered with a DocumentBuilder.

from pdf_oxide import EmbeddedFont

EmbeddedFont.from_file(path: str) -> EmbeddedFont
EmbeddedFont.from_bytes(data: bytes, name: str | None = None) -> EmbeddedFont

Property	Type	Description
`name`	`str`	Registered name of the font

Tables

Value objects for the fluent table API.

Align

from pdf_oxide import Align

Align.LEFT     # 0
Align.CENTER   # 1
Align.RIGHT    # 2

Column

from pdf_oxide import Column

Column(header: str, width: float = 100.0, align: Align | int | None = None)

Property	Type	Description
`header`	`str`	Column header text
`width`	`float`	Column width in points
`align`	`int`	Cell alignment

Table

from pdf_oxide import Table

Table(columns: list[Column], rows: list[list[str]], has_header: bool = False)

A buffered table consumed by FluentPageBuilder.table(). When has_header=True, the column headers are rendered as a styled header row.
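
Because rows is typed as list[list[str]], non-string data has to be converted first. The sketch below shows one way to do that for a list of dicts. The helper rows_from_records and the sample records are illustrative assumptions, not part of pdf_oxide.

```python
def rows_from_records(records, keys):
    """Turn a list of dicts into list[list[str]], one inner list per row.

    Missing or None values become empty strings.
    """
    return [
        ["" if rec.get(key) is None else str(rec[key]) for key in keys]
        for rec in records
    ]


records = [
    {"item": "Widget", "qty": 3, "price": 9.5},
    {"item": "Gadget", "qty": 1, "price": None},
]
rows = rows_from_records(records, ["item", "qty", "price"])
print(rows)
```

The resulting rows could then be passed as the rows argument of Table(columns, rows, has_header=True).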

StreamingTable

A row-streaming table handle returned by FluentPageBuilder.streaming_table(), intended for tables too large to materialize at once.

Method	Parameters	Description
`push_row(cells)`	`list[str]`	Append a row of cell strings
`push_row_span(cells)`	`list[tuple[str, int]]`	Append a row of `(text, colspan)` cells
`flush()`	–	Flush the current batch
`finish()`	–	Complete the table and return the `FluentPageBuilder`
`column_count()`	– → `int`	Number of columns
`pending_row_count()`	– → `int`	Rows buffered but not yet committed
`batch_count()`	– → `int`	Number of completed batches
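
A minimal sketch of feeding rows from an iterable into such a handle, so the whole data set never has to sit in memory. The function stream_rows is a hypothetical helper. It relies only on the push_row() and finish() methods listed above and assumes finish() commits any rows still pending.

```python
from typing import Iterable, Sequence


def stream_rows(table, rows: Iterable[Sequence[object]]):
    """Push rows into a StreamingTable-like handle, then finish it.

    `table` is any object exposing push_row(cells) and finish().
    Returns whatever finish() returns (documented as the FluentPageBuilder).
    """
    for row in rows:
        # Cells are documented as strings, so convert each value explicitly.
        table.push_row([str(cell) for cell in row])
    return table.finish()
```

With pdf_oxide this might be called as page = stream_rows(page.streaming_table(columns, repeat_header=True), row_generator), where columns and row_generator are supplied by the caller.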

Page templates

Repeating header/footer artifacts applied across multiple pages.

Artifact / ArtifactStyle

from pdf_oxide import Artifact, ArtifactStyle

Artifact()                       # empty artifact
Artifact.center(text: str)       # centered artifact text
artifact.with_left(text: str)    # add left-aligned text

style = ArtifactStyle()
style = style.font(name: str, size: float)
style = style.bold()

Header / Footer

from pdf_oxide import Header, Footer

Header()                  # or Header.center(text: str)
Footer()                  # or Footer.center(text: str)

PageTemplate

from pdf_oxide import PageTemplate, Header, Footer

template = (PageTemplate()
    .header(Header.center("Confidential"))
    .footer(Footer.center("Page")))

Digital signatures

Sign, timestamp, and verify PDFs (PAdES / LTV). Requires the signatures feature (and optionally the tsa-client feature) in the Rust build.

Certificate

from pdf_oxide import Certificate

Certificate.load(data: bytes) -> Certificate                       # DER certificate (verify only)
Certificate.load_pem(cert_pem: str, key_pem: str) -> Certificate   # signing credential
Certificate.load_pkcs12(data: bytes, password: str) -> Certificate # PKCS#12 / .p12 signing credential

Method	Returns	Description
`subject()`	`str`	Certificate subject DN
`issuer()`	`str`	Certificate issuer DN
`serial()`	`str`	Serial number
`validity()`	`tuple[int, int]`	`(not_before, not_after)` Unix timestamps
`is_valid()`	`bool`	Whether the certificate is currently within its validity period
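
The (not_before, not_after) pair is easier to read as datetimes. The sketch below converts such a tuple and computes the days left before expiry. It assumes the timestamps are seconds since the Unix epoch in UTC; both helper names and the sample tuple are made up for illustration.

```python
from datetime import datetime, timezone


def validity_window(validity):
    """Convert (not_before, not_after) Unix timestamps to aware UTC datetimes."""
    not_before, not_after = validity
    return (
        datetime.fromtimestamp(not_before, tz=timezone.utc),
        datetime.fromtimestamp(not_after, tz=timezone.utc),
    )


def days_until_expiry(validity, now=None):
    """Whole days remaining before not_after (negative once expired)."""
    now = now or datetime.now(timezone.utc)
    _, not_after = validity_window(validity)
    return (not_after - now).days


# Sample tuple in the documented shape; with pdf_oxide it would come from cert.validity().
start, end = validity_window((1704067200, 1735689600))
print(start.isoformat(), end.isoformat())
```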

Signature

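Returned by PdfDocument.signatures().

Property / Method	Type	Description
`signer_name`	`str \| None`	Signer name, if present
`reason`	`str \| None`	Reason for signing, if present
`location`	`str \| None`	Signing location, if present
`contact_info`	`str \| None`	Signer contact information, if present
`signing_time`	`int \| None`	Signing time, if present
`covers_whole_document`	`bool`	Whether the signature covers the entire file
`pades_level`	`PadesLevel`	Detected PAdES baseline (B-B/B-T/B-LT)
`verify()`	`bool`	Cryptographically verify the signature
`verify_detached(pdf_data)`	`bool`	Verify against the file bytes, including the `messageDigest`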

Timestamp

from pdf_oxide import Timestamp

Timestamp.parse(data: bytes) -> Timestamp

Property / Method	Type	Description
`time`	`int`	Timestamp time (Unix)
`serial`	`str`	TSA response serial number
`policy_oid`	`str`	TSA policy OID
`tsa_name`	`str`	TSA name
`hash_algorithm`	`int`	Message imprint hash algorithm code
`message_imprint`	`bytes`	Hashed message imprint
`verify()`	`bool`	Verify the timestamp token

TsaClient

from pdf_oxide import TsaClient

client = TsaClient(
    url: str,
    username: str | None = None,
    password: str | None = None,
    timeout_seconds: int = 30,
    hash_algorithm: int = 2,
    use_nonce: bool = True,
    cert_req: bool = True
)
client.request_timestamp(data: bytes) -> Timestamp
client.request_timestamp_hash(digest: bytes, algorithm: int = 2) -> Timestamp

PadesLevel

from pdf_oxide import PadesLevel

PadesLevel.B_B     # baseline
PadesLevel.B_T     # + trusted timestamp
PadesLevel.B_LT    # + long-term validation material
PadesLevel.B_LTA   # + archival timestamp

RevocationMaterial

from pdf_oxide import RevocationMaterial

RevocationMaterial(
    certs: list[bytes] | None = None,
    crls: list[bytes] | None = None,
    ocsps: list[bytes] | None = None
)

DER-encoded certificates, CRLs, and OCSP responses for B-LT signatures.

Dss

The parsed Document Security Store returned by PdfDocument.dss().

Property	Type	Description
`certs`	`list[bytes]`	Document-level certificate DER blobs
`crls`	`list[bytes]`	CRL DER blobs
`ocsps`	`list[bytes]`	OCSP response DER blobs
`vri`	`list[str]`	Per-signature VRI keys (hex SHA-1 of `/Contents`)
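
A VRI key can be recomputed from a signature's /Contents bytes to match a signature to its validation data. This is a minimal sketch using the standard library. The helper name vri_key is hypothetical, and the uppercase form follows the PDF DSS convention; whether pdf_oxide reports upper or lower case is not stated here, so comparing case-insensitively is the safer choice.

```python
import hashlib


def vri_key(signature_contents: bytes) -> str:
    """Return the uppercase hex SHA-1 digest of a signature's /Contents bytes."""
    return hashlib.sha1(signature_contents).hexdigest().upper()


def has_vri_entry(vri_keys, signature_contents: bytes) -> bool:
    """Check, ignoring case, whether a list of VRI keys covers this signature."""
    wanted = vri_key(signature_contents)
    return any(key.upper() == wanted for key in vri_keys)


# Placeholder bytes; real input would be the raw signature value from the PDF.
print(vri_key(b"example signature bytes"))
```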

Module-level functions

from pdf_oxide import (
    sign_pdf_bytes, sign_pdf_bytes_pades, has_document_timestamp,
    generate_barcode_svg, generate_qr_svg,
    plan_split_by_bookmarks, split_by_bookmarks,
)

Signing

sign_pdf_bytes(pdf_data: bytes, cert: Certificate, reason: str | None = None, location: str | None = None) -> bytes

Signs the source PDF bytes with a loaded signing Certificate and returns the signed PDF.

sign_pdf_bytes_pades(
    pdf_data: bytes,
    cert: Certificate,
    level: PadesLevel,
    tsa_url: str | None = None,
    reason: str | None = None,
    location: str | None = None,
    revocation: RevocationMaterial | None = None
) -> bytes

Signs the source PDF bytes at a PAdES baseline level. B_T/B_LT require tsa_url.

has_document_timestamp(pdf_data: bytes) -> bool

Returns whether the PDF contains a document-level RFC 3161 archive timestamp (PAdES-B-LTA).

Barcodes

generate_barcode_svg(barcode_type: int, data: str) -> str
generate_qr_svg(data: str, error_correction: int, size: int) -> str

Generates a 1D barcode or QR code as an SVG string. Requires the barcodes feature.

Splitting by bookmarks

plan_split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[dict]
split_by_bookmarks(src_bytes: bytes, title_prefix: str | None = None, ignore_case: bool = False, level: int = 1, include_front_matter: bool = True) -> list[tuple[dict, bytes]]

Plans or performs a PDF split at bookmark boundaries. plan_* returns only the segment metadata; split_* returns each segment together with its PDF bytes.
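
A minimal sketch of persisting the list[tuple[dict, bytes]] result, one file per segment. The helper write_segments and the part_NNN.pdf naming are assumptions. The metadata dict is ignored because its keys are not specified here, and the demo uses stand-in bytes rather than real PDF segments.

```python
import tempfile
from pathlib import Path


def write_segments(segments, out_dir):
    """Write each (metadata, pdf_bytes) pair to out_dir as part_001.pdf, part_002.pdf, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (_meta, pdf_bytes) in enumerate(segments, start=1):
        path = out / f"part_{index:03d}.pdf"
        path.write_bytes(pdf_bytes)
        paths.append(path)
    return paths


# Stand-in data in the documented shape; real input would come from split_by_bookmarks().
with tempfile.TemporaryDirectory() as tmp:
    written = write_segments([({}, b"%PDF-a"), ({}, b"%PDF-b")], tmp)
    names = [p.name for p in written]
    second_payload = written[1].read_bytes()
print(names)
```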

OCR model provisioning

prefetch_models(languages: list[str]) -> str
model_manifest() -> str
prefetch_available() -> bool

Provision OCR models for offline/air-gapped use, inspect the model manifest (JSON), and check whether this build can download models.

Logging

setup_logging() -> None
set_log_level(level: str) -> None     # "off" | "error" | "warn" | "info" | "debug" | "trace"
get_log_level() -> str
disable_logging() -> None

Engine tuning

set_max_ops_per_stream(limit: int | None) -> int | None
set_preserve_unmapped_glyphs(preserve: bool) -> bool

Adjust the per-stream operator cap (protection against malicious input) and the preservation of U+FFFD for unmapped glyphs. Both functions return the previous value.
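
Because both setters return the previous value, a setting can be changed for one block of work and then restored. The sketch below shows that pattern with a generic context manager. temporary_setting is a hypothetical helper that accepts any setter with this return-the-old-value behaviour.

```python
from contextlib import contextmanager


@contextmanager
def temporary_setting(setter, value):
    """Apply setter(value) and restore the previous value on exit.

    Relies on the setter returning the value that was in effect before the call.
    """
    previous = setter(value)
    try:
        yield previous
    finally:
        # Restore even if the body raised.
        setter(previous)
```

With pdf_oxide this might be used as with temporary_setting(set_max_ops_per_stream, 50_000): followed by the extraction calls that need the tighter cap; the limit 50_000 is an arbitrary example value.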

Crypto governance

crypto_active_provider() -> str
crypto_available_providers() -> list[str]
crypto_use_fips() -> None                 # install the FIPS aws-lc-rs provider (requires the fips feature)
crypto_set_policy(spec: str) -> None      # e.g. "strict" or "compat;deny:rc4@write"
crypto_policy() -> str
crypto_inventory() -> list[str]
crypto_cbom() -> str                      # CycloneDX 1.6 CBOM (JSON)

Async API

async/await wrappers that run blocking work on a thread pool. Each method corresponds one-to-one to its synchronous counterpart.

from pdf_oxide import AsyncPdfDocument, AsyncPdf, AsyncOfficeConverter

async def main():
    doc = await AsyncPdfDocument.open("input.pdf")
    text = await doc.extract_text(0)
    await doc.close()
    # Or use as an async context manager:
    async with await AsyncPdfDocument.from_bytes(pdf_bytes) as doc:
        md = await doc.to_markdown_all()

Class	Constructor	Notes
`AsyncPdfDocument`	`await AsyncPdfDocument.open(path, password=None)`, `await AsyncPdfDocument.from_bytes(data, password=None)`	Every `PdfDocument` method is available as an awaitable; supports `async with` and `.close()`
`AsyncPdf`	Wraps the `Pdf` factory methods	`await pdf.save(path)`, `await pdf.to_bytes()`
`AsyncOfficeConverter`	Wraps the `OfficeConverter` static methods	e.g. `await AsyncOfficeConverter.from_docx(path)`
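
The general pattern of running blocking work in a thread pool can be sketched with the standard library alone. This illustrates the technique only and is not pdf_oxide's implementation. blocking_extract is a stand-in for a synchronous call, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def blocking_extract(page_index: int) -> str:
    """Stand-in for a blocking call such as synchronous text extraction."""
    time.sleep(0.05)
    return f"text of page {page_index}"


async def extract_many(page_indices):
    """Run the blocking calls in worker threads and gather results in input order."""
    tasks = [asyncio.to_thread(blocking_extract, i) for i in page_indices]
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many([0, 1, 2]))
print(results)
```

The event loop stays free while the worker threads sleep, so the three calls overlap instead of running back to back.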

Error handling

PdfError

All PDF-related errors raise PdfError.

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("file.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
except IndexError:
    print("Page index out of range")

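Common error scenarios:

Exception	Cause
`PdfError`	Malformed PDF, encrypted PDF opened without a password, parsing failure
`FileNotFoundError`	The file does not exist
`IndexError`	Page index is out of range for `page_count()`
`ValueError`	Invalid argument (e.g. a negative page index)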
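
Page-index errors can also be caught before the call reaches the library. The sketch below validates the index up front and raises the same exception types as the table. extract_text_checked is a hypothetical wrapper that uses only page_count() and extract_text() from the document object.

```python
def extract_text_checked(doc, page_index: int) -> str:
    """Validate page_index against doc.page_count() before extracting text."""
    if page_index < 0:
        raise ValueError(f"page index must be non-negative, got {page_index}")
    count = doc.page_count()
    if page_index >= count:
        raise IndexError(f"page index {page_index} out of range for {count} pages")
    return doc.extract_text(page_index)
```

This gives error messages that include the offending index and the page count, which helps when processing many files in a batch.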

Complete example

from pdf_oxide import PdfDocument, Pdf

# --- Extraction ---
doc = PdfDocument("input.pdf")
print(f"Pages: {doc.page_count()}")

for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"Page {i + 1}: {len(text)} characters")

# Character-level analysis
chars = doc.extract_chars(0)
fonts = set(ch.font_name for ch in chars)
print(f"Fonts on page 1: {fonts}")

# Image extraction
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    with open(f"extracted_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

# --- Creation ---
pdf = Pdf.from_markdown("# Report\n\nGenerated by PDF Oxide.",
                        title="Report", author="PDF Oxide")
pdf.save("report.pdf")

# --- Editing ---
doc = PdfDocument("document.pdf")
doc.set_title("Updated Title")
doc.set_author("New Author")
doc.rotate_all_pages(90)

# Search and replace via DOM
page = doc.page(0)
for text in page.find_text_containing("DRAFT"):
    page.set_text(text.id, "FINAL")
doc.save_page(page)

# Form filling
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
doc.set_form_field_value("name", "John Doe")

# Merge another PDF
merged_count = doc.merge_from("appendix.pdf")
print(f"Merged {merged_count} pages")

doc.save("output.pdf")

# --- Search ---
results = doc.search("configuration", case_insensitive=True)
for r in results:
    print(f"Page {r.page + 1}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

v0.3.38 additions

`DocumentBuilder` / `FluentPageBuilder` / `EmbeddedFont`

from pdf_oxide import DocumentBuilder, EmbeddedFont, StampType

font = EmbeddedFont.from_file("DejaVuSans.ttf")
# Alt: EmbeddedFont.from_bytes(data: bytes, name: str | None = None)

(DocumentBuilder()
    .register_embedded_font("DejaVu", font)
    .letter_page()           # or .a4_page() / .page(width, height)
        .at(72, 720).font("DejaVu", 12).text("Hello")
        .heading(1, "Title")
        .paragraph("Body text with automatic wrapping")
        # Annotations (15 methods)
        .link_url("https://example.com")
        .link_page(2)
        .link_named("glossary")
        .highlight((1.0, 1.0, 0.0))
        .underline((0.0, 0.0, 1.0))
        .strikeout((1.0, 0.0, 0.0))
        .squiggly((1.0, 0.5, 0.0))
        .sticky_note("Review this")
        .stamp(StampType.APPROVED)
        .freetext(100, 500, 200, 50, "Comment")
        .watermark("DRAFT")
        .watermark_confidential()
        .watermark_draft()
        # AcroForm widgets (5 types)
        .text_field("name", 150, 400, 200, 20, "Jane Doe")
        .checkbox("agree", 72, 380, 15, 15, True)
        .combo_box("country", 150, 360, 200, 20, ["US", "UK"], "US")
        .radio_group("tier", [("free", 72, 340, 15, 15), ("pro", 120, 340, 15, 15)], "pro")
        .push_button("submit", 72, 300, 80, 25, "Submit")
        # Graphics primitives
        .rect(50, 270, 500, 2)
        .filled_rect(50, 260, 500, 2, 0.9, 0.9, 0.9)
        .line(50, 250, 550, 250)
    .done()
    .save_encrypted("out.pdf", "user-pw", "owner-pw"))
# Alt: .save("out.pdf") / .build() -> bytes
# Alt: .to_bytes_encrypted("user-pw", "owner-pw") -> bytes

HTML + CSS pipeline

Pdf.from_html_css(html: str, css: str, font_bytes: bytes) -> Pdf
Pdf.from_html_css_with_fonts(html: str, css: str, fonts: list[tuple[str, bytes]]) -> Pdf

See Creating from HTML.

Signature verification

from pdf_oxide import PdfDocument, Timestamp, TsaClient

doc = PdfDocument("signed.pdf")
doc.signature_count()                # int
for sig in doc.signatures():
    sig.signer_name                  # str
    sig.reason                       # str | None
    sig.location                     # str | None
    sig.signing_time                 # datetime | None
    sig.verify()                     # "Valid" | "Invalid" | "Unknown"
    sig.verify_detached(pdf_bytes)   # adds messageDigest check

# Timestamp
ts = Timestamp.parse(tst_bytes)
ts.time, ts.serial, ts.policy_oid, ts.tsa_name, ts.hash_algorithm, ts.message_imprint

# TSA client (behind `tsa-client` feature)
client = TsaClient(url="https://freetsa.org/tsr",
                   username=None, password=None,
                   timeout_seconds=30, hash_algorithm=2,
                   use_nonce=True, cert_req=True)
ts = client.request_timestamp(pdf_bytes)
ts = client.request_timestamp_hash(digest, algorithm=2)

See Digital signatures for details.

Rendering

doc.render_page_region(page: int, x: float, y: float, w: float, h: float, format: int = 0) -> bytes
doc.render_page_fit(page: int, fit_width: int, fit_height: int, format: int = 0) -> bytes

format: 0 = PNG, 1 = JPEG. Coordinates are in PDF points with the origin at the bottom-left corner.
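
Regions picked in an image viewer usually have a top-left origin and pixel units, so they need converting before being passed to a bottom-left, point-based API. The sketch below does that conversion. The helper name is hypothetical, and it assumes (x, y) denotes the region's lower-left corner, which is not stated here.

```python
def top_left_px_to_pdf_rect(x_px, y_px, w_px, h_px, page_height_pt, dpi=72.0):
    """Convert a top-left-origin pixel rectangle to a bottom-left-origin one in points.

    Returns (x, y, w, h), where (x, y) is the rectangle's lower-left corner.
    """
    scale = 72.0 / dpi  # points per pixel
    x = x_px * scale
    w = w_px * scale
    h = h_px * scale
    # Flip the vertical axis: distance from the page bottom to the box's lower edge.
    y = page_height_pt - (y_px * scale) - h
    return (x, y, w, h)


# A 300 x 150 px box at (150, 300) px on a 150-DPI raster of a letter page (792 pt tall).
rect = top_left_px_to_pdf_rect(150, 300, 300, 150, page_height_pt=792, dpi=150)
print(rect)
```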

`Pdf` flattening

doc.flatten_to_images(dpi: int = 150) -> bytes

Other Language Bindings

PDF Oxide provides native bindings for every major ecosystem: Rust, Node.js, WASM, C#, Golang, Java, PHP, Ruby, C++, Swift, Kotlin, Dart, R, Julia, Zig, Scala, Clojure, Objective-C, Elixir

Next steps

Types & Enums: all shared types and enums
Page API reference: consistent page-by-page iteration across bindings
Getting started with Python: tutorial