What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Node.js PDF Library: PDF Oxide

PDF Oxide is the fastest PDF library for Node.js. Mean text extraction time is 0.8ms, 5× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. It provides extraction, creation, and editing in a single package and ships with TypeScript type definitions. Licensed under MIT / Apache-2.0.

Running in the browser, Deno, Bun, or Cloudflare Workers? Use the WASM build: same API, no native binary. The native addon on this page is for Node.js and Electron only.

Installation

npm install pdf-oxide

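Requirements: Node.js 18 or later. No system dependencies or Rust toolchain are needed. Prebuilt .node addons for Linux (glibc + musl) x64/arm64, macOS x64/arm64, and Windows x64/arm64 are downloaded automatically through platform-specific optionalDependencies. Nothing is compiled at install time.

Opening a PDF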

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("research-paper.pdf");
console.log(`Pages: ${doc.getPageCount()}`);

const { major, minor } = doc.getVersion();
console.log(`PDF version: ${major}.${minor}`);

doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("research-paper.pdf");
const pageCount: number = doc.getPageCount();
const { major, minor }: { major: number; minor: number } = doc.getVersion();
console.log(`${pageCount} pages, PDF ${major}.${minor}`);
doc.close();

On Node.js 22 and later, a using declaration handles cleanup automatically.

{
  using doc = new PdfDocument("report.pdf");
  const text = doc.extractText(0);
} // doc.close() is called automatically

Page API

Since v0.3.34, PdfDocument is iterable, and doc.page(i) returns a PdfPage with cached width / height / rotation. Page-level extraction methods are provided as well.

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
for (const page of doc) {
  console.log(`Page ${page.index}: ${page.width}x${page.height} (rotation ${page.rotation})`);
  const md = page.markdown();
  const tables = page.tables();       // rows and cells, with bboxes
}
doc.close();

Indexing: doc.page(0), doc.page(-1) (last page). Page methods: text(), markdown(), html(), plainText(), words(), lines(), tables(), images(), paths(), annotations(), fonts(), search(query, caseSensitive).
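As a minimal sketch of how negative indexing and the page methods might be combined, the helper below summarizes the first and last pages. It uses only doc.page(i), width, height, and text() as listed above; firstAndLast is a hypothetical name, and the helper assumes the document has at least one page.

```javascript
// Hypothetical helper: summarize the first and last page of a document.
// `doc` is assumed to expose page(i) as described above.
function firstAndLast(doc) {
  const summarize = (page) => ({
    size: `${page.width}x${page.height}`,
    preview: page.text().slice(0, 80), // first 80 characters of the page text
  });
  return { first: summarize(doc.page(0)), last: summarize(doc.page(-1)) };
}
```

Called as firstAndLast(new PdfDocument("paper.pdf")), it returns two small objects that are convenient for logging. The caller still owns the document and should close it.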

Text extraction

Single page

JavaScript

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const text = doc.extractText(0);
console.log(text);
doc.close();

TypeScript

import { PdfDocument } from "pdf-oxide";

const doc: PdfDocument = new PdfDocument("report.pdf");
const text: string = doc.extractText(0);
console.log(text);
doc.close();

All pages

const doc = new PdfDocument("book.pdf");
const pageCount = doc.getPageCount();

for (let i = 0; i < pageCount; i++) {
  console.log(`--- Page ${i + 1} ---`);
  console.log(doc.extractText(i));
}

doc.close();

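Async extraction

Every synchronous method has an *Async counterpart that runs on the libuv thread pool. Use these in HTTP handlers and other concurrent server code so that extraction does not block the event loop.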

const { PdfDocument } = require("pdf-oxide");

async function extract(path) {
  const doc = new PdfDocument(path);
  try {
    return await doc.extractTextAsync(0);
  } finally {
    doc.close();
  }
}

For patterns such as fanning out per page with Promise.all, see the async guide.
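A per-page fan-out could look roughly like the sketch below. It assumes only getPageCount() and extractTextAsync(i) as shown above; extractAllPages is a hypothetical name, and the async guide remains the reference for the recommended pattern.

```javascript
// Sketch: extract every page concurrently through the *Async methods.
// `doc` is assumed to expose getPageCount() and extractTextAsync(i).
async function extractAllPages(doc) {
  const pageCount = doc.getPageCount();
  // Start one extraction per page; each returns a promise.
  const jobs = Array.from({ length: pageCount }, (_, i) => doc.extractTextAsync(i));
  // Promise.all preserves input order, so result[i] is the text of page i.
  return Promise.all(jobs);
}
```

Real parallelism is bounded by the libuv thread pool, which defaults to 4 threads and can be resized with the UV_THREADPOOL_SIZE environment variable. Close the document only after the returned promise has settled.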

Structured extraction

Character-level and span-level extraction returns position and font metadata.

const chars = doc.extractChars(0);
for (const ch of chars.slice(0, 10)) {
  console.log(`'${ch.char}' at (${ch.x.toFixed(1)}, ${ch.y.toFixed(1)}) ` +
              `size=${ch.fontSize.toFixed(1)} font=${ch.fontName}`);
}

const spans = doc.extractSpans(0);
for (const span of spans) {
  console.log(`"${span.text}" font=${span.fontName} size=${span.fontSize}`);
}

Word and line extraction with adjustable thresholds:

const words = doc.extractWords(0);
const lines = doc.extractTextLines(0, { wordGapThreshold: 2.5, lineGapThreshold: 1.2 });
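One way to put the font metadata to work is a rough heading heuristic: treat the font size that covers the most characters as body text, and flag spans that are clearly larger. This is an illustrative sketch, not part of the library. largeSpans and its ratio parameter are hypothetical, and the input is assumed to have the text and fontSize fields shown above.

```javascript
// Sketch: return spans whose font size is clearly larger than the body text.
// `spans` is assumed to be the array returned by doc.extractSpans(pageIndex).
function largeSpans(spans, ratio = 1.2) {
  if (spans.length === 0) return [];
  // Weight each font size by the number of characters set in it.
  const weight = new Map();
  for (const s of spans) {
    weight.set(s.fontSize, (weight.get(s.fontSize) || 0) + s.text.length);
  }
  // The size covering the most characters is taken as the body size.
  let bodySize = 0;
  let best = -1;
  for (const [size, w] of weight) {
    if (w > best) {
      best = w;
      bodySize = size;
    }
  }
  return spans.filter((s) => s.fontSize >= bodySize * ratio);
}
```

Font sizes that differ only by rounding (for example 9.96 and 10) are counted separately here; rounding before grouping may work better on real documents.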

Markdown conversion

JavaScript

const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);

// All pages
const allMd = doc.toMarkdownAll();

TypeScript

const md: string = doc.toMarkdown(0, { detectHeadings: true });
const allMd: string = doc.toMarkdownAll();
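For LLM or RAG ingestion it is often handy to keep the page number next to each chunk. The sketch below builds one Markdown chunk per page using getPageCount() and toMarkdown() as shown above; markdownChunks is a hypothetical helper, and skipping empty pages is a design choice rather than library behavior.

```javascript
// Sketch: one Markdown chunk per page, tagged with its 1-based page number.
// `doc` is assumed to expose getPageCount() and toMarkdown(pageIndex, options).
function markdownChunks(doc) {
  const chunks = [];
  for (let i = 0; i < doc.getPageCount(); i++) {
    const markdown = doc.toMarkdown(i, { detectHeadings: true }).trim();
    // Skip pages that produce no text (for example, blank or image-only pages).
    if (markdown.length > 0) chunks.push({ page: i + 1, markdown });
  }
  return chunks;
}
```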

HTML conversion

const html = doc.toHtml(0);
const allHtml = doc.toHtmlAll();

Image extraction

const { writeFileSync } = require("fs");

const doc = new PdfDocument("brochure.pdf");
const images = doc.extractImages(0);

for (const [i, img] of images.entries()) {
  console.log(`Image ${i}: ${img.width}x${img.height} ${img.format} (${img.data.length} bytes)`);
  writeFileSync(`image_${i}.${img.format}`, img.data);
}

doc.close();

Images in PDFs that use Indexed color are expanded to RGB automatically. 1/2/4/8 bpc index palettes and RGB, Grayscale, and CMYK base color spaces are all supported.
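To make "expanded to RGB" concrete, here is an illustrative sketch of palette expansion for an RGB-based palette. It is not the library's implementation. expandIndexed and its parameters are hypothetical, and it assumes the usual PDF sample layout: indices packed most-significant bits first, with each row padded to a byte boundary.

```javascript
// Illustration only: expand a packed Indexed image into RGB bytes.
// data:    Uint8Array of packed palette indices (rows padded to whole bytes)
// palette: Uint8Array of RGB triples, 3 bytes per entry
// bpc:     bits per component: 1, 2, 4, or 8
function expandIndexed(data, palette, bpc, width, height) {
  const out = new Uint8Array(width * height * 3);
  const rowBytes = Math.ceil((width * bpc) / 8);
  const mask = (1 << bpc) - 1;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const bit = x * bpc;
      const byte = data[y * rowBytes + (bit >> 3)];
      // Samples are packed most-significant bits first within each byte.
      const shift = 8 - bpc - (bit & 7);
      const index = (byte >> shift) & mask;
      // Copy the palette entry for this pixel into the output.
      out.set(palette.subarray(index * 3, index * 3 + 3), (y * width + x) * 3);
    }
  }
  return out;
}
```

Grayscale and CMYK palettes follow the same lookup idea with 1 or 4 bytes per entry, followed by a conversion to RGB.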

Opening from bytes

A PDF can be opened from in-memory bytes. This is useful for data fetched from S3, HTTP, or a database.

const { PdfDocument } = require("pdf-oxide");
const { readFileSync } = require("fs");

const bytes = readFileSync("document.pdf");
const doc = PdfDocument.openFromBytes(bytes);
const text = doc.extractText(0);
doc.close();
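When the bytes come from a network source, a cheap check before parsing can catch error pages or truncated downloads early. The sketch below tests for the "%PDF-" signature at the start of the buffer. looksLikePdf is a hypothetical helper, not part of PDF Oxide, and because some valid files carry a few bytes before the header, a failed check is best treated as a hint rather than proof.

```javascript
// Sketch: check whether a buffer starts with the "%PDF-" signature.
// Works with Buffer, Uint8Array, or a plain array of byte values.
function looksLikePdf(bytes) {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}
```

A typical call site would run looksLikePdf(bytes) first and only then pass the bytes to PdfDocument.openFromBytes(bytes).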

Password-protected PDFs

const doc = PdfDocument.openWithPassword("confidential.pdf", "secret");
const text = doc.extractText(0);
doc.close();

You can also authenticate after opening.

const doc = new PdfDocument("confidential.pdf");
if (doc.authenticate("secret")) {
  console.log(doc.extractText(0));
}
doc.close();

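AES-256 (V=5, R=6) PDFs are fully supported, including push-button widget captions and an object cache that is correctly invalidated after deferred authentication.

Creating PDFs

The Pdf class provides factory methods that create PDFs from a variety of input formats.

From Markdown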

const { Pdf } = require("pdf-oxide");
const { writeFileSync } = require("fs");

const pdf = Pdf.fromMarkdown("# Hello World\n\nThis is a PDF.");
writeFileSync("output.pdf", pdf.toBytes());

From HTML

const pdf = Pdf.fromHtml("<h1>Invoice</h1><p>Amount due: $42.00</p>");
writeFileSync("invoice.pdf", pdf.toBytes());

From plain text

const pdf = Pdf.fromText("Plain text document.\n\nSecond paragraph.");
writeFileSync("notes.pdf", pdf.toBytes());

From an image

const pdf = Pdf.fromImage("scan.jpg");
writeFileSync("scan.pdf", pdf.toBytes());

Search

const doc = new PdfDocument("manual.pdf");

// Search all pages
const results = doc.searchAll("configuration", { caseSensitive: false });
for (const r of results) {
  console.log(`Page ${r.page}: "${r.text}" at (${r.x.toFixed(0)}, ${r.y.toFixed(0)})`);
}

// Search a single page
const pageResults = doc.searchPage(0, "configuration");
doc.close();
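Search results often need to be grouped before display. The sketch below buckets hits by page, assuming each result carries the page field used in the example above; hitsByPage is a hypothetical helper.

```javascript
// Sketch: group search hits by page.
// `results` is assumed to be the array returned by doc.searchAll().
function hitsByPage(results) {
  const byPage = new Map();
  for (const r of results) {
    if (!byPage.has(r.page)) byPage.set(r.page, []);
    byPage.get(r.page).push(r);
  }
  return byPage; // Map of page -> array of hits, in first-seen page order
}
```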

For streaming search over large documents, use SearchStream.

const { PdfDocument, SearchStream, SearchManager } = require("pdf-oxide");

const doc = new PdfDocument("large.pdf");
const manager = new SearchManager(doc);
const stream = new SearchStream(manager, "invoice");

stream.on("data", (r) => console.log(`page ${r.pageIndex + 1}: ${r.text}`));
stream.on("end", () => doc.close());

See the Node.js streams guide for details.

Editing

Use DocumentEditor to edit metadata, pages, annotations, and form fields.

const { DocumentEditor } = require("pdf-oxide");

const editor = DocumentEditor.open("document.pdf");

// Metadata
editor.setTitle("Updated Title");
editor.setAuthor("Jane Doe");

// Page operations
editor.rotatePage(0, 90);
editor.deletePage(5);
editor.movePage(2, 0);

// Forms
editor.setFormFieldValue("employee.name", "Jane Doe");
editor.flattenForms();

editor.save("edited.pdf");
editor.close();

OCR

To apply OCR to scanned pages, enable the ocr feature at install time.

npm install pdf-oxide --build-from-source -- --features ocr

const { PdfDocument, OcrEngine } = require("pdf-oxide");

const doc = new PdfDocument("scanned.pdf");
const ocr = new OcrEngine();

if (ocr.pageNeedsOcr(doc, 0)) {
  const text = ocr.extractText(doc, 0);
  console.log(text);
}

ocr.close();
doc.close();

See the OCR guide for an end-to-end recipe.
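For documents that mix born-digital and scanned pages, a per-page fallback keeps OCR off the pages that do not need it. This is a minimal sketch built on pageNeedsOcr(), ocr.extractText(), and doc.extractText() as used above; pageText is a hypothetical helper.

```javascript
// Sketch: use the embedded text layer when present, OCR otherwise.
// `doc` and `ocr` are assumed to behave like PdfDocument and OcrEngine above.
function pageText(doc, ocr, pageIndex) {
  return ocr.pageNeedsOcr(doc, pageIndex)
    ? ocr.extractText(doc, pageIndex) // scanned page: run OCR
    : doc.extractText(pageIndex);     // text layer available: plain extraction
}
```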

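Thread safety

PdfDocument is Send + Sync, so a single document can be shared across multiple worker threads to read pages in parallel. The *Async family handles this automatically using the libuv thread pool. For manual worker setup, see the concurrency documentation.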

Error handling

Every method throws an exception on failure.

const { PdfDocument } = require("pdf-oxide");

let doc;
try {
  doc = new PdfDocument("document.pdf");
  const text = doc.extractText(0);
} catch (err) {
  console.error(`Extraction failed: ${err.message}`);
} finally {
  // Close the document even when extraction throws.
  if (doc) doc.close();
}
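Because any method can throw, it helps to keep cleanup in one place. The sketch below wraps open, use, and close in a small helper. withDocument is a hypothetical name, not part of the PDF Oxide API, and it assumes only that the opened object has a close() method.

```javascript
// Sketch: run `fn` against a document and always close it afterwards.
// `open` is any function returning an object with close(), for example
// () => new PdfDocument("document.pdf").
function withDocument(open, fn) {
  const doc = open();
  try {
    return fn(doc);
  } finally {
    // Runs on both success and failure; the original error still propagates.
    doc.close();
  }
}
```

Usage might look like withDocument(() => new PdfDocument("document.pdf"), (doc) => doc.extractText(0)). The helper is synchronous: if fn returns a promise, the document is closed before that promise settles, so an async variant would need to await fn(doc) inside the try block.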

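Next steps

Getting started with Python: using PDF Oxide from Python
Getting started with WASM: browser / Deno / Bun / edge runtimes
Node.js API reference: full documentation of the native API
Async guide: *Async methods and Promise.all patterns
Node.js streams: SearchStream and more
Text extraction: detailed options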