What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

JavaScript API Reference

PDF Oxide provides WebAssembly bindings for JavaScript and TypeScript. The npm package pdf-oxide-wasm works in Node.js, browsers, bundlers, Deno, and Cloudflare Workers.

npm install pdf-oxide-wasm

Multi-Target Packaging (v0.3.38)

pdf-oxide-wasm now ships three builds side by side through conditional exports in package.json. Choose the subpath that matches your runtime. In most environments, the top-level import resolves to the correct build automatically through the exports field.

Subpath	Target
`pdf-oxide-wasm/nodejs`	Node.js (CommonJS + ESM)
`pdf-oxide-wasm/bundler`	Vite, webpack, Rollup, esbuild, Bun
`pdf-oxide-wasm/web`	Browsers, Deno, Cloudflare Workers

// Node.js
import { WasmPdfDocument } from "pdf-oxide-wasm/nodejs";

// Vite / webpack / Rollup
import init, { WasmPdfDocument } from "pdf-oxide-wasm/bundler";
await init();

// Browser / Deno / Workers
import init, { WasmPdfDocument } from "pdf-oxide-wasm/web";
await init();

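This change also fixes the ReferenceError: Can't find variable: __dirname error that occurred in browser bundlers before v0.3.38.

For the Rust API, see the Rust API Reference; for the Python API, see the Python API Reference; for type definitions, see Types & Enums.

WasmPdfDocument

The main class for opening, extracting from, editing, and saving PDFs.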

import { WasmPdfDocument } from "pdf-oxide-wasm";

Constructor

`new WasmPdfDocument(data)`

Loads a PDF document from raw bytes.

Parameter	Type	Description
`data`	`Uint8Array`	PDF file contents

Throws: an Error if the PDF is invalid or cannot be parsed.

import { readFileSync } from "node:fs";

const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);
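
Because the constructor takes a `Uint8Array`, bytes that arrive in another form (an `ArrayBuffer` from `fetch`, a Node.js `Buffer`, another typed array) may need normalizing first. The sketch below shows one way to do that. `toUint8Array` is a hypothetical helper name and is not part of pdf-oxide-wasm.

```javascript
// Normalize common binary inputs to a Uint8Array.
// `toUint8Array` is a hypothetical helper, not part of pdf-oxide-wasm.
function toUint8Array(input) {
  if (input instanceof Uint8Array) {
    // Also covers Node.js Buffer, which extends Uint8Array.
    return input;
  }
  if (input instanceof ArrayBuffer) {
    // e.g. the result of `await response.arrayBuffer()` in a browser.
    return new Uint8Array(input);
  }
  if (ArrayBuffer.isView(input)) {
    // Other typed arrays and DataView: view the same memory region as bytes.
    return new Uint8Array(input.buffer, input.byteOffset, input.byteLength);
  }
  throw new TypeError("Expected Uint8Array, ArrayBuffer, or a typed-array view");
}
```

The result could then be passed on as `new WasmPdfDocument(toUint8Array(data))`.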

Core Read-Only

`pageCount() -> number`

Gets the number of pages in the document.

`version() -> Uint8Array`

Gets the PDF version as [major, minor].

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);

`authenticate(password) -> boolean`

Decrypts an encrypted PDF. Returns true if authentication succeeds.

Parameter	Type	Description
`password`	`string`	Password string
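
Since `authenticate` reports success as a boolean, a caller can try several candidate passwords in turn. The following is a minimal sketch under the assumption that a wrong password yields `false` rather than an exception. `unlockWithAny` is a hypothetical helper name.

```javascript
// Try candidate passwords in order; return the one that worked, or null.
// `doc` is any object exposing authenticate(password) -> boolean.
// Assumption: a wrong password returns false instead of throwing.
function unlockWithAny(doc, candidates) {
  for (const password of candidates) {
    if (doc.authenticate(password)) {
      return password;
    }
  }
  return null;
}
```

A `null` result means none of the candidates worked and the document should be treated as still locked.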

`hasStructureTree() -> boolean`

Checks whether the document is a tagged PDF with a structure tree.

Text Extraction

`extractText(pageIndex) -> string`

Extracts plain text from a single page.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

const text = doc.extractText(0);

`extractAllText() -> string`

Extracts plain text from all pages, separated by form feed characters.
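
The form feed separator (`"\f"`, U+000C) makes it possible to recover per-page text from the combined string. A minimal sketch, with `splitPages` as a hypothetical helper name:

```javascript
// Split the combined text back into one string per page.
// Pages are separated by the form feed character "\f" (U+000C).
function splitPages(allText) {
  return allText.split("\f");
}
```

Comparing the resulting array length against `pageCount()` is a cheap sanity check, since a page whose own text contains a form feed would throw the split off.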

`extractChars(pageIndex) -> Array`

Extracts individual characters with precise positioning and font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

Returns: an array of objects with the following fields:

Field	Type	Description
`char`	`string`	Character
`bbox`	`{x, y, width, height}`	Bounding box
`font_name`	`string`	Font name
`font_size`	`number`	Font size (points)
`font_weight`	`string`	Font weight (Normal, Bold, etc.)
`is_italic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const chars = doc.extractChars(0);
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y})`);
}
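
Character records with bounding boxes can be regrouped into lines of text. The sketch below clusters characters whose `bbox.y` values lie within a tolerance and orders each cluster by `bbox.x`. The helper name and the 2-point default tolerance are assumptions for illustration, and lines are returned in first-seen order because the direction of the y axis is not stated here.

```javascript
// Group character records into lines by vertical position.
// Assumes each record has `char` and `bbox: { x, y, ... }` as in the table above.
// `groupCharsIntoLines` and the default tolerance are illustrative choices.
function groupCharsIntoLines(chars, tolerance = 2) {
  const lines = [];
  for (const c of chars) {
    // Reuse an existing line whose y is within `tolerance` points.
    let line = lines.find((l) => Math.abs(l.y - c.bbox.y) <= tolerance);
    if (!line) {
      line = { y: c.bbox.y, chars: [] };
      lines.push(line);
    }
    line.chars.push(c);
  }
  // Within each line, order characters left to right and join them.
  return lines.map((l) => ({
    y: l.y,
    text: l.chars
      .slice()
      .sort((a, b) => a.bbox.x - b.bbox.x)
      .map((c) => c.char)
      .join(""),
  }));
}
```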

`extractSpans(pageIndex) -> Array`

Extracts styled text spans with font metadata.

Parameter	Type	Description
`pageIndex`	`number`	Zero-based page number

Returns: an array of objects with the following fields:

Field	Type	Description
`text`	`string`	Text content
`bbox`	`{x, y, width, height}`	Bounding box
`font_name`	`string`	Font name
`font_size`	`number`	Font size (points)
`font_weight`	`string`	Font weight (Normal, Bold, etc.)
`is_italic`	`boolean`	Italic flag
`color`	`{r, g, b}`	RGB color (0.0–1.0)

const result = doc.extractPageText(0);
console.log(`Page: ${result.pageWidth}x${result.pageHeight} pt`);
for (const span of result.spans) {
  console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize}`);
}

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" size=${span.fontSize}`);
}
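
Span font sizes can be used to pick out likely headings. The sketch below treats the font size that covers the most characters as body text and returns every span set larger than that. It is an illustrative heuristic, not a description of how the library detects headings. The field table lists `font_size` while the examples above read `fontSize`, so the sketch accepts either spelling rather than assuming one.

```javascript
// Read the font size under either spelling seen in this reference.
function spanSize(span) {
  return span.font_size ?? span.fontSize;
}

// Return spans whose font size is larger than the dominant (body) size.
// `headingCandidates` is a hypothetical helper for illustration only.
function headingCandidates(spans) {
  // Weight each font size by the number of characters set in it.
  const weight = new Map();
  for (const s of spans) {
    const size = spanSize(s);
    weight.set(size, (weight.get(size) ?? 0) + s.text.length);
  }
  // The size covering the most characters is taken as body text.
  let bodySize = 0;
  let best = -1;
  for (const [size, count] of weight) {
    if (count > best) {
      best = count;
      bodySize = size;
    }
  }
  return spans.filter((s) => spanSize(s) > bodySize);
}
```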

Format Conversion

`toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts a single page to Markdown.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`detectHeadings`	`boolean`	`true`	Detect headings from font size
`includeImages`	`boolean`	`true`	Include images
`includeFormFields`	`boolean`	`true`	Include form field values

`toMarkdownAll(detectHeadings?, includeImages?, includeFormFields?) -> string`

Converts all pages to Markdown.

`toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts a single page to HTML.

Parameter	Type	Default	Description
`pageIndex`	`number`	–	Zero-based page number
`preserveLayout`	`boolean`	`false`	Preserve the visual layout
`detectHeadings`	`boolean`	`true`	Detect headings
`includeFormFields`	`boolean`	`true`	Include form field values

`toHtmlAll(preserveLayout?, detectHeadings?, includeFormFields?) -> string`

Converts all pages to HTML.

`toPlainText(pageIndex) -> string`

Converts a single page to plain text.

`toPlainTextAll() -> string`

Converts all pages to plain text.
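
The conversion methods take their options as positional booleans, which is easy to misorder at a call site. One possible remedy is a thin wrapper that accepts named options and fills in the defaults listed in the `toMarkdown` table. `pageToMarkdown` is a hypothetical helper name and is not part of the package.

```javascript
// Named options instead of positional booleans.
// `pageToMarkdown` is a hypothetical wrapper; defaults mirror the table above.
function pageToMarkdown(doc, pageIndex, options = {}) {
  const {
    detectHeadings = true,
    includeImages = true,
    includeFormFields = true,
  } = options;
  // Forward in the documented positional order.
  return doc.toMarkdown(pageIndex, detectHeadings, includeImages, includeFormFields);
}
```

A call such as `pageToMarkdown(doc, 0, { includeImages: false })` then reads unambiguously.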

Search

`search(pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text across all pages.

Parameter	Type	Default	Description
`pattern`	`string`	–	Search pattern (string or regular expression)
`caseInsensitive`	`boolean`	`false`	Case-insensitive search
`literal`	`boolean`	`false`	Treat the pattern as a literal string
`wholeWord`	`boolean`	`false`	Match whole words only
`maxResults`	`number`	–	Maximum number of results to return

Returns: an array of objects with the following fields:

Field	Type	Description
`page`	`number`	Page number
`text`	`string`	Matched text
`bbox`	`object`	Bounding box
`start_index`	`number`	Start index in the page text
`end_index`	`number`	End index in the page text

`searchPage(pageIndex, pattern, caseInsensitive?, literal?, wholeWord?, maxResults?) -> Array`

Searches for text within a single page.
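
Results from `search` arrive as a flat array, so a common follow-up step is grouping them by page. A minimal sketch, assuming only the `page` field from the result table. `groupHitsByPage` is a hypothetical helper name.

```javascript
// Group search hits by page number, keeping hits in their original order.
// Assumes each hit has a numeric `page` field, as in the result table above.
function groupHitsByPage(hits) {
  const byPage = new Map();
  for (const hit of hits) {
    if (!byPage.has(hit.page)) {
      byPage.set(hit.page, []);
    }
    byPage.get(hit.page).push(hit);
  }
  return byPage;
}
```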

Image Information

`extractImages(pageIndex) -> Array`

Gets image metadata for a page.

Field	Type	Description
`width`	`number`	Image width (pixels)
`height`	`number`	Image height (pixels)
`color_space`	`string`	Color space (e.g., `DeviceRGB`)
`bits_per_component`	`number`	Bits per color channel
`bbox`	`object`	Position on the page
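
Combining an image's pixel size with its placement on the page gives its effective resolution. The sketch below assumes that `bbox` carries `width` and `height` in PDF points (1 point = 1/72 inch), like the `{x, y, width, height}` boxes in the text extraction tables. The image table only types it as `object`, so treat that shape as an assumption.

```javascript
// Effective resolution of a placed image, in pixels per inch.
// Assumption: `image.bbox` has `width` and `height` in points (1 pt = 1/72 in).
// `effectiveDpi` is a hypothetical helper name.
function effectiveDpi(image) {
  return {
    horizontal: image.width / (image.bbox.width / 72),
    vertical: image.height / (image.bbox.height / 72),
  };
}
```

For example, a 600-pixel-wide image placed 144 points (2 inches) wide works out to 300 pixels per inch horizontally.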

`extractImageBytes(pageIndex) -> Array`

Extracts raw image bytes from a page. Returns an array of objects with the following fields:

Field	Type	Description
`width`	`number`	Image width (pixels)
`height`	`number`	Image height (pixels)
`data`	`Uint8Array`	Raw image bytes
`format`	`string`	Image format

`pageImages(pageIndex) -> Array`

Gets image names and bounds for positioning operations.

Field	Type	Description
`name`	`string`	XObject name
`bounds`	`number[]`	Bounds `[x, y, width, height]`
`matrix`	`number[]`	Transformation matrix `[a, b, c, d, e, f]`

Document Structure

`getOutline() -> Array | null`

Gets the document bookmarks / table of contents. Returns null if no outline exists.

`getAnnotations(pageIndex) -> Array`

Gets annotation metadata for a page (type, rectangle, contents, etc.).

`extractPaths(pageIndex) -> Array`

Gets vector paths (lines, curves, shapes) from a page.

`pageLabels() -> Array`

Gets page label ranges. Returns an array of objects with the following fields:

Field	Type	Description
`start_page`	`number`	First page of this range
`style`	`string`	Numbering style
`prefix`	`string`	Label prefix
`start_value`	`number`	Starting number
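
Each label range applies from its `start_page` until the next range begins, so finding a page's label means picking the range with the greatest `start_page` at or below that page and counting up from `start_value`. The sketch below does only that arithmetic. It assumes `start_page` is zero-based like `pageIndex`, and it returns `style` unchanged because the possible style strings are not listed here. `labelNumberFor` is a hypothetical helper name.

```javascript
// Find the label range covering a page and compute the page's number in it.
// Assumption: `start_page` is zero-based, like `pageIndex` elsewhere.
// Rendering the number in the given `style` (roman, letters, ...) is left out.
function labelNumberFor(ranges, pageIndex) {
  let current = null;
  for (const r of ranges) {
    // Keep the range with the greatest start_page that is still <= pageIndex.
    if (r.start_page <= pageIndex && (!current || r.start_page > current.start_page)) {
      current = r;
    }
  }
  if (!current) {
    return null;
  }
  return {
    prefix: current.prefix ?? "",
    style: current.style,
    value: current.start_value + (pageIndex - current.start_page),
  };
}
```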

`xmpMetadata() -> object | null`

Gets XMP metadata. Returns null if none exists. Object fields:

Field	Type	Description
`dc_title`	`string \| null`	Document title
`dc_creator`	`string[] \| null`	List of creators
`dc_description`	`string \| null`	Description
`xmp_creator_tool`	`string \| null`	Creator tool
`xmp_create_date`	`string \| null`	Creation date
`xmp_modify_date`	`string \| null`	Modification date
`pdf_producer`	`string \| null`	PDF producer

Form Fields

`getFormFields() -> Array`

Gets all form fields, including name, type, value, and flags.

Field	Type	Description
`name`	`string`	Field name
`field_type`	`string`	Field type (text, checkbox, etc.)
`value`	`string`	Current value
`flags`	`number`	Field flags

const fields = doc.getFormFields();
for (const f of fields) {
  console.log(`${f.name} (${f.field_type}) = ${f.value}`);
}

`hasXfa() -> boolean`

Checks whether the document contains an XFA form.

`getFormFieldValue(name) -> any`

Gets a form field value by name. Returns a string, boolean, or null depending on the field type.

Parameter	Type	Description
`name`	`string`	Field name

`setFormFieldValue(name, value) -> void`

Sets a form field value by name.

Parameter	Type	Description
`name`	`string`	Field name
`value`	`string \| boolean`	New field value

`exportFormData(format?) -> Uint8Array`

Exports form data as FDF (default) or XFDF.

Parameter	Type	Default	Description
`format`	`string`	`"fdf"`	Export format: `"fdf"` or `"xfdf"`
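
The read and write methods above can be combined to fill a form from a plain object. The sketch below sets only the fields that `getFormFields()` reports and returns the names it skipped, so a misspelled key shows up instead of being passed through. `fillForm` is a hypothetical helper name, and what `setFormFieldValue` does with an unknown name is not assumed.

```javascript
// Fill form fields from a { name: value } object.
// Only names reported by getFormFields() are set; the rest are returned.
// `fillForm` is a hypothetical helper, not part of pdf-oxide-wasm.
function fillForm(doc, values) {
  const known = new Set(doc.getFormFields().map((f) => f.name));
  const skipped = [];
  for (const [name, value] of Object.entries(values)) {
    if (known.has(name)) {
      doc.setFormFieldValue(name, value);
    } else {
      skipped.push(name);
    }
  }
  return skipped;
}
```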

Editing

Metadata

Method	Parameters	Description
`setTitle(title)`	`string`	Set the document title
`setAuthor(author)`	`string`	Set the document author
`setSubject(subject)`	`string`	Set the document subject
`setKeywords(keywords)`	`string`	Set the document keywords

Page Rotation

Method	Parameters	Description
`pageRotation(pageIndex)`	`number`	Get the current rotation (0, 90, 180, 270)
`setPageRotation(pageIndex, degrees)`	`number, number`	Set an absolute rotation
`rotatePage(pageIndex, degrees)`	`number, number`	Add to the current rotation
`rotateAllPages(degrees)`	`number`	Rotate all pages
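
The difference between the absolute and relative methods comes down to modular arithmetic on multiples of 90. The sketch below shows the value a relative rotation would be expected to produce. It is an illustration of that arithmetic only, and whether the library itself normalizes out-of-range values is not stated here. Both function names are hypothetical.

```javascript
// Normalize any multiple of 90 degrees into 0, 90, 180, or 270.
function normalizeRotation(degrees) {
  if (degrees % 90 !== 0) {
    throw new RangeError("Rotation must be a multiple of 90 degrees");
  }
  // The double modulo keeps negative inputs in the 0..359 range.
  return ((degrees % 360) + 360) % 360;
}

// Expected result of adding `delta` to `current` (the relative case).
function addRotation(current, delta) {
  return normalizeRotation(current + delta);
}
```

For example, adding 180 to a page already at 270 gives 90, while an absolute set to 180 gives 180 regardless of the previous value.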

Page Dimensions

Method	Parameters	Description
`pageMediaBox(pageIndex)`	`number`	Get the MediaBox `[llx, lly, urx, ury]`
`setPageMediaBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the MediaBox
`pageCropBox(pageIndex)`	`number`	Get the CropBox (may be null)
`setPageCropBox(pageIndex, llx, lly, urx, ury)`	`number, ...`	Set the CropBox
`cropMargins(left, right, top, bottom)`	`number, ...`	Crop margins on all pages
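
A box given as `[llx, lly, urx, ury]` holds two corners rather than a size, so width and height have to be derived. The sketch below does that and also converts to millimetres, using the PDF convention of 72 points per inch. `boxSize` is a hypothetical helper name.

```javascript
// Derive width and height from a [llx, lly, urx, ury] box.
// PDF user space defaults to 72 points per inch; 1 inch = 25.4 mm.
function boxSize(box) {
  const [llx, lly, urx, ury] = box;
  const width = Math.abs(urx - llx);
  const height = Math.abs(ury - lly);
  return {
    width,
    height,
    widthMm: (width / 72) * 25.4,
    heightMm: (height / 72) * 25.4,
  };
}
```

A US Letter MediaBox of `[0, 0, 612, 792]` comes out as 612 by 792 points, or about 215.9 by 279.4 mm.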

Erase / Whiteout

Method	Parameters	Description
`eraseRegion(pageIndex, llx, lly, urx, ury)`	`number, ...`	Erase a region
`eraseRegions(pageIndex, rects)`	`number, Float32Array`	Erase multiple regions
`clearEraseRegions(pageIndex)`	`number`	Clear pending erasures
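
`eraseRegions` takes its rectangles as a single `Float32Array`. The table does not spell out the layout, so the sketch below assumes the most likely one: four consecutive floats per rectangle in the same `llx, lly, urx, ury` order that `eraseRegion` uses. Verify that assumption against the package's type definitions before relying on it. `packRects` is a hypothetical helper name.

```javascript
// Pack rectangle objects into one flat Float32Array.
// ASSUMED layout: [llx, lly, urx, ury, llx, lly, urx, ury, ...].
function packRects(rects) {
  const out = new Float32Array(rects.length * 4);
  rects.forEach((r, i) => {
    out.set([r.llx, r.lly, r.urx, r.ury], i * 4);
  });
  return out;
}
```

Under that assumption, a call would look like `doc.eraseRegions(0, packRects(rects))`.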

Annotations and Redaction

Method	Parameters	Description
`flattenPageAnnotations(pageIndex)`	`number`	Flatten annotations on a page
`flattenAllAnnotations()`	–	Flatten all annotations
`applyPageRedactions(pageIndex)`	`number`	Apply redactions on a page
`applyAllRedactions()`	–	Apply all redactions

Form Flattening

Method	Parameters	Description
`flattenForms()`	–	Flatten all form fields into page content
`flattenFormsOnPage(pageIndex)`	`number`	Flatten forms on a specific page

Merge and Embed

`mergeFrom(data) -> number`

Merges pages from another PDF. Returns the number of pages merged.

Parameter	Type	Description
`data`	`Uint8Array`	Source PDF file bytes

`embedFile(name, data) -> void`

Attaches a file to the PDF.

Parameter	Type	Description
`name`	`string`	Attachment file name
`data`	`Uint8Array`	File contents

Image Manipulation

Method	Parameters	Description
`repositionImage(pageIndex, name, x, y)`	`number, string, number, number`	Move an image
`resizeImage(pageIndex, name, w, h)`	`number, string, number, number`	Resize an image
`setImageBounds(pageIndex, name, x, y, w, h)`	`number, string, ...`	Set image bounds

Rendering

Method	Parameters	Returns	Description
`renderPage(pageIndex, dpi?)`	`number, number`	`Uint8Array`	Render a page to PNG bytes
`flattenToImages(dpi?)`	`number`	`Uint8Array`	Flatten all pages into an image-based PDF

Saving

`saveToBytes() -> Uint8Array`

Saves the edited PDF to bytes.

`saveEncryptedToBytes(password, ownerPassword?, allowPrint?, allowCopy?, allowModify?, allowAnnotate?) -> Uint8Array`

Saves with AES-256 encryption.

Parameter	Type	Default	Description
`password`	`string`	–	User password
`ownerPassword`	`string`	user password	Owner password
`allowPrint`	`boolean`	`true`	Allow printing
`allowCopy`	`boolean`	`true`	Allow copying
`allowModify`	`boolean`	`false`	Allow modification
`allowAnnotate`	`boolean`	`true`	Allow annotation

`free()`

Frees WASM memory. Always call this when you are done with the document.
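
Because `free()` has to run even when extraction throws, a `try`/`finally` wrapper is one way to make the cleanup hard to forget. The sketch below is synchronous, which matches the method signatures in this reference. `withDocument` is a hypothetical helper name.

```javascript
// Run `fn` with the document and always release its WASM memory afterwards.
// `doc` is any object exposing free(); `withDocument` is a hypothetical helper.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    // Runs on both normal return and thrown errors.
    doc.free();
  }
}
```

Usage might look like `const text = withDocument(new WasmPdfDocument(bytes), (d) => d.extractText(0));`. If `fn` returns a promise, the document would be freed before the promise settles, so an async variant would need to `await` inside the `try`.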

WasmPdf

A factory class for creating new PDFs.

import { WasmPdf } from "pdf-oxide-wasm";

Static Methods

`WasmPdf.fromMarkdown(content, title?, author?) -> WasmPdf`

Creates a PDF from Markdown text.

Parameter	Type	Default	Description
`content`	`string`	–	Markdown content
`title`	`string`	–	Document title
`author`	`string`	–	Document author

`WasmPdf.fromHtml(content, title?, author?) -> WasmPdf`

Creates a PDF from HTML.

`WasmPdf.fromText(content, title?, author?) -> WasmPdf`

Creates a PDF from plain text.

`WasmPdf.fromImageBytes(data) -> WasmPdf`

Creates a single-page PDF from image bytes.

Parameter	Type	Description
`data`	`Uint8Array`	Image file bytes (JPEG, PNG)

`WasmPdf.fromMultipleImageBytes(imagesArray) -> WasmPdf`

Creates a multi-page PDF from multiple images, one page per image.

Parameter	Type	Description
`imagesArray`	`Uint8Array[]`	Array of image file bytes

Instance Methods

`toBytes() -> Uint8Array`

Gets the PDF as bytes.

`size -> number`

PDF size in bytes (read-only property).

import { writeFileSync } from "node:fs";

const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
console.log(`PDF size: ${pdf.size} bytes`);
writeFileSync("output.pdf", pdf.toBytes());

Feature Support

Some features require native dependencies and are not available in WebAssembly:

Feature	WASM	Notes
Text extraction	Yes	Full support
Structured extraction	Yes	Chars, spans
PDF creation	Yes	Markdown, HTML, text, images
PDF editing	Yes	Metadata, rotation, dimensions, erase
Form fields	Yes	Read, write, export, flatten
Search	Yes	Full regex support
Encryption	Yes	AES-256 read and write
Annotations	Yes	Read, flatten, redact
Merge PDFs	Yes	Merge pages from another PDF
Embedded files	Yes	Attach files to PDFs
Page labels	Yes	Read page label ranges
XMP metadata	Yes	Read XMP metadata
OCR	No	Requires native ONNX Runtime
Digital signatures	No	Requires native cryptography libraries
Page rendering	No	Requires native tiny-skia
Barcodes	No	Requires native rendering
Office conversion	No	Requires native LibreOffice

Error Handling

Every method that can fail throws a JavaScript Error object:

try {
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2]));
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}
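
When many files are processed in a batch, it can be more convenient to turn the thrown Error into a value than to wrap every call site in `try`/`catch`. The sketch below shows one such pattern. `tryOpen` is a hypothetical helper that takes a factory function, so it carries no dependency on the package itself.

```javascript
// Convert a throwing constructor call into a result object.
// `open` is a zero-argument function that creates the document.
// `tryOpen` is a hypothetical helper, not part of pdf-oxide-wasm.
function tryOpen(open) {
  try {
    return { ok: true, doc: open(), error: null };
  } catch (e) {
    // Non-Error throwables are stringified so `error` is always a string.
    const message = e instanceof Error ? e.message : String(e);
    return { ok: false, doc: null, error: message };
  }
}
```

A call might look like `const result = tryOpen(() => new WasmPdfDocument(bytes));`, followed by a check of `result.ok` before using `result.doc`.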

TypeScript

Full type definitions are included in the package:

import { WasmPdfDocument, WasmPdf } from "pdf-oxide-wasm";

const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");