What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a mean text extraction time of 0.8ms: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). These figures come from a benchmark on 3,830 real-world PDFs, which PDF Oxide completed with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

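Getting Started with PDF Oxide (R)

PDF Oxide provides idiomatic R bindings for fast PDF text, Markdown, and HTML extraction. Text extraction averages 0.8ms with a 100% pass rate across 3,830 PDFs, and the binding is built on the same Rust core as every other binding. The R package wraps the pdf_oxide C ABI through R's .Call interface, and document handles are R external pointers that the garbage collector releases. Page indices are 0-based to match the underlying engine.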

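Installation

The R package links against the default-feature cdylib. Build the native library first, then install the package, pointing it at the locations of the header and the cdylib.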

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. install the R package
PDF_OXIDE_INCLUDE_DIR="$PWD/include" PDF_OXIDE_LIB_DIR="$PWD/target/release" \
  R CMD INSTALL r/

At run time, set the library path so that the dynamic linker can find the cdylib.

LD_LIBRARY_PATH="$PWD/target/release" Rscript your_script.R

Opening a PDF

Open a file with pdf_open(), then inspect its metadata. pdf_version() returns a named list containing major and minor.

library(pdfoxide)

doc <- pdf_open("research-paper.pdf")

pdf_page_count(doc)               # number of pages
v <- pdf_version(doc)
cat("PDF version:", paste(v$major, v$minor, sep = "."), "\n")
pdf_is_encrypted(doc)             # logical

Text extraction

Use pdf_extract_text() to extract the text of a single page, identified by its 0-based index, in reading order.

library(pdfoxide)

doc <- pdf_open("report.pdf")
text <- pdf_extract_text(doc, 0)  # 0-based page index
cat(text)

Use pdf_page_count() to iterate over every page.

doc <- pdf_open("book.pdf")
for (page in seq_len(pdf_page_count(doc)) - 1L) {   # 0-based indices
  cat("--- Page", page + 1L, "---\n")
  cat(pdf_extract_text(doc, page), "\n")
}
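
When the page texts are needed as data rather than printed output, the same loop can be written to return a character vector. The following is a minimal sketch, not part of the package: collect_pages() and its extract_fn parameter are hypothetical names, and a stub extractor stands in for pdf_extract_text() so the example runs without the native library.

```r
# Hypothetical helper: collect the text of every page into a character vector.
# `n_pages` stands in for pdf_page_count(doc) and `extract_fn` for
# pdf_extract_text(); both are passed in so the sketch runs with a stub.
collect_pages <- function(doc, n_pages, extract_fn) {
  idx <- seq_len(n_pages) - 1L   # 0-based page indices; empty when n_pages is 0
  vapply(idx, function(i) extract_fn(doc, i), character(1))
}

# Stub document and extractor (made up for illustration only).
fake_doc <- c("first page", "second page", "third page")
fake_extract <- function(doc, page) doc[[page + 1L]]

pages <- collect_pages(fake_doc, length(fake_doc), fake_extract)
print(pages)
```

With the real binding, the call would presumably look like collect_pages(doc, pdf_page_count(doc), pdf_extract_text). Because seq_len(0) is empty, a document with no pages yields an empty vector instead of an error.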

Markdown and HTML

Convert a single page to Markdown or HTML, or convert the whole document in one call.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

md  <- pdf_to_markdown(doc, 0)    # one page as Markdown
html <- pdf_to_html(doc, 0)       # one page as HTML

all_md   <- pdf_to_markdown_all(doc)    # whole document
all_text <- pdf_to_plain_text_all(doc)  # whole document, plain text

cat(all_md)

Words, characters, and lines

Element extraction returns a list of records, each carrying a bounding box that gives its position. Each bbox is a named list with x, y, width, and height.

library(pdfoxide)

doc <- pdf_open("paper.pdf")

# Positioned words — each has $text, $bbox, $font_name, $font_size, $bold
words <- pdf_extract_words(doc, 0)
for (w in head(words, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) font=%s bold=%s\n",
              w$text, w$bbox$x, w$bbox$y, w$font_name, w$bold))
}

# Reading-order lines — each has $text, $bbox, $word_count
lines <- pdf_extract_text_lines(doc, 0)
for (ln in head(lines, 5)) {
  cat(sprintf("[%d words] %s\n", ln$word_count, ln$text))
}

# Positioned characters — $character is the Unicode codepoint (integer)
chars <- pdf_extract_chars(doc, 0)
for (ch in head(chars, 10)) {
  cat(sprintf("'%s' at (%.1f, %.1f) size=%.1f\n",
              intToUtf8(ch$character), ch$bbox$x, ch$bbox$y, ch$font_size))
}
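
Lists of records are awkward to filter or sort, so it can help to flatten them into a data frame. The sketch below assumes word records with the fields listed above ($text, $bbox, $font_name, $font_size, $bold). words_to_df() is a hypothetical helper, and the mock records are made up so the example runs without a PDF.

```r
# Hypothetical helper: flatten word records into one row per word.
words_to_df <- function(words) {
  data.frame(
    text      = vapply(words, function(w) w$text, character(1)),
    x         = vapply(words, function(w) w$bbox$x, numeric(1)),
    y         = vapply(words, function(w) w$bbox$y, numeric(1)),
    width     = vapply(words, function(w) w$bbox$width, numeric(1)),
    height    = vapply(words, function(w) w$bbox$height, numeric(1)),
    font_size = vapply(words, function(w) w$font_size, numeric(1)),
    bold      = vapply(words, function(w) w$bold, logical(1)),
    stringsAsFactors = FALSE
  )
}

# Mock records shaped like the documented fields (values are invented).
mock_words <- list(
  list(text = "Annual", bbox = list(x = 72, y = 700, width = 40, height = 12),
       font_name = "Helvetica", font_size = 12, bold = TRUE),
  list(text = "report", bbox = list(x = 116, y = 700, width = 36, height = 12),
       font_name = "Helvetica", font_size = 12, bold = FALSE)
)

words_df <- words_to_df(mock_words)
bold_words <- words_df$text[words_df$bold]   # keep only the bold words
print(words_df)
```

Using vapply() with an explicit type template makes the helper fail early if a record is missing a field or holds an unexpected type, which is usually preferable to silently building a malformed data frame.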

Tables

pdf_extract_tables() returns the tables detected on a page. Each table record has row_count, col_count, and has_header, along with cells, a 1-indexed character matrix accessed as tbl$cells[row, col].

library(pdfoxide)

doc <- pdf_open("statement.pdf")
tables <- pdf_extract_tables(doc, 0)

for (tbl in tables) {
  cat(sprintf("Table: %d rows x %d cols (header=%s)\n",
              tbl$row_count, tbl$col_count, tbl$has_header))
  for (r in seq_len(tbl$row_count)) {
    cat(paste(tbl$cells[r, ], collapse = " | "), "\n")
  }
}
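
For further analysis, a table record can be turned into a data frame, promoting the first row to column names when has_header is TRUE. This is a sketch under the record layout described above. table_to_df() is a hypothetical helper and the mock table is invented for illustration.

```r
# Hypothetical helper: convert one table record into a data frame.
table_to_df <- function(tbl) {
  cells <- tbl$cells
  if (isTRUE(tbl$has_header) && tbl$row_count > 1) {
    body <- cells[-1, , drop = FALSE]   # drop = FALSE keeps a matrix for 1-row bodies
    df <- as.data.frame(body, stringsAsFactors = FALSE)
    names(df) <- make.names(cells[1, ], unique = TRUE)  # header row becomes column names
  } else {
    df <- as.data.frame(cells, stringsAsFactors = FALSE)
  }
  df
}

# Mock record with the documented fields (values are invented).
mock_tbl <- list(
  row_count = 3L, col_count = 2L, has_header = TRUE,
  cells = matrix(c("Item",   "Amount",
                   "Widget", "42.00",
                   "Gadget", "13.50"),
                 nrow = 3, byrow = TRUE)
)

tbl_df <- table_to_df(mock_tbl)
tbl_df$Amount <- as.numeric(tbl_df$Amount)   # cells arrive as character strings
print(tbl_df)
```

Since cells is a character matrix, numeric columns need an explicit conversion such as as.numeric() before arithmetic.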

Search

Search a single page with pdf_search() or the whole document with pdf_search_all(). Both functions accept an optional case_sensitive flag (default FALSE) and return records containing text, page, and bbox.

library(pdfoxide)

doc <- pdf_open("manual.pdf")

# Whole document
hits <- pdf_search_all(doc, "configuration")
for (h in hits) {
  cat(sprintf("Page %d: '%s' at (%.0f, %.0f)\n",
              h$page, h$text, h$bbox$x, h$bbox$y))
}

# Single page, case-sensitive
page_hits <- pdf_search(doc, 0, "Configuration", case_sensitive = TRUE)
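
To summarize where a term occurs, the hit records can be counted per page. The following sketch relies only on the page field described above. hits_per_page() is a hypothetical helper, and the mock hits are invented so the example runs on its own. Whether page is 0-based in these records is not stated here, so the counts are keyed by whatever value the record carries.

```r
# Hypothetical helper: count search hits per page.
hits_per_page <- function(hits) {
  pages <- vapply(hits, function(h) as.integer(h$page), integer(1))
  table(pages)   # names are the page values, entries are hit counts
}

# Mock hit records with the documented fields (values are invented).
mock_hits <- list(
  list(text = "configuration", page = 0L,
       bbox = list(x = 72, y = 640, width = 80, height = 12)),
  list(text = "Configuration", page = 0L,
       bbox = list(x = 72, y = 410, width = 82, height = 12)),
  list(text = "configuration", page = 3L,
       bbox = list(x = 90, y = 520, width = 80, height = 12))
)

counts <- hits_per_page(mock_hits)
print(counts)
```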

Opening from bytes

A PDF held in memory can be opened with pdf_open_from_bytes(). This is useful when the data comes from S3, HTTP, or a database. The function takes a raw vector.

library(pdfoxide)

bytes <- readBin("report.pdf", "raw", file.info("report.pdf")$size)
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

Password-protected PDFs

Open an encrypted document with pdf_open_with_password(), or open it first and then call pdf_authenticate(). pdf_authenticate() returns TRUE on success and FALSE if the password is wrong.

library(pdfoxide)

doc <- pdf_open_with_password("confidential.pdf", "secret")
cat(pdf_extract_text(doc, 0))

Creating PDFs

The builder functions create a pdfoxide_pdf from Markdown, HTML, or plain text. Save it to a path with pdf_save(), or serialize it to a raw vector with pdf_to_bytes(). The serialized vector can be reopened with pdf_open_from_bytes().

library(pdfoxide)

pdf <- pdf_from_markdown("# Hello World\n\nThis is a PDF.\n")
pdf_save(pdf, "output.pdf")

pdf_from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>") |>
  pdf_save("invoice.pdf")

pdf_from_text("Plain text document.\n\nSecond paragraph.") |>
  pdf_save("notes.pdf")

# Round-trip through bytes
bytes <- pdf_to_bytes(pdf_from_markdown("# In memory\n\nbody\n"))
doc <- pdf_open_from_bytes(bytes)
cat(pdf_extract_text(doc, 0))

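Next steps

Getting Started with Python – Using PDF Oxide from Python
Getting Started with Rust – The underlying Rust crate
Text Extraction – Detailed extraction options and usage examples
PDF Creation – Advanced creation with builders, encryption, and metadata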