What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Scala)

PDF Oxide is the fastest PDF library on the JVM, with text extraction built in: a mean of 0.8ms across 3,830 PDFs and a 100% pass rate. The Scala 3 bindings are a thin, idiomatic facade over the mature Java bindings. They add no native code; they layer on Scala extension methods that turn java.util.Optional[T] into Option[T] and java.util.List[T] into Seq[T]. AutoCloseable handles work directly with scala.util.Using.
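To make the facade idea concrete, here is a minimal sketch of how extension methods of this kind can be written with the standard scala.jdk converters. JavaStyleDoc and its methods are hypothetical stand-ins for a Java-style binding class, not the pdf_oxide API; only the conversion pattern is the point.

```scala
import java.util.{List as JList, Optional}
import scala.jdk.CollectionConverters.*
import scala.jdk.OptionConverters.*

// Hypothetical stand-in for a Java-style binding class (not part of pdf_oxide).
final class JavaStyleDoc:
  def producer(): Optional[String] = Optional.of("demo-producer")
  def creator(): Optional[String] = Optional.empty()
  def words(): JList[String] = JList.of("Hello", "pdf_oxide")

// Thin facade: no new logic, only conversions to Scala types.
extension (doc: JavaStyleDoc)
  def producerOption: Option[String] = doc.producer().toScala
  def creatorOption: Option[String] = doc.creator().toScala
  def wordsSeq: Seq[String] = doc.words().asScala.toSeq
```

Because the extensions only wrap existing calls, the Java object stays the single source of truth and the Scala layer carries no state of its own.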

Installation

Add the dependency to build.sbt.

libraryDependencies += "fyi.oxide" % "pdf-oxide" % "0.3.69"

The Scala facade depends on the fyi.oxide:pdf-oxide Java binding, which owns the single JNI native bridge. Scala 3.3 or later is required.

Quick Start

Create a PDF from Markdown, then open that PDF and extract the text back out. Using.resource closes each handle automatically.

import fyi.oxide.pdf.{Pdf, PdfDocument, producerOption}
import scala.util.Using

Using.resource(Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Scala** binding.\n")): pdf =>
  Using.resource(PdfDocument.open(pdf.save())): doc =>
    println(s"pages:    ${doc.pageCount()}")
    println(s"producer: ${doc.producerOption.getOrElse("(none)")}")
    println(doc.extractText(0))

Pdf.fromMarkdown returns a Pdf handle, and pdf.save() serializes it to an Array[Byte]. PdfDocument.open takes those bytes and exposes the document API.
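The handle lifecycle with nested Using.resource calls can be illustrated without the library. The sketch below uses a hypothetical TracedHandle class that records when it is opened and closed; it shows that the inner resource is released before the outer one, once the body has finished.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Using

// Hypothetical handle that logs its lifecycle; stands in for a native-backed handle.
final class TracedHandle(name: String, log: ListBuffer[String]) extends AutoCloseable:
  log += s"open $name"
  override def close(): Unit =
    log += s"close $name"
    ()

// Opens two nested handles and returns the recorded event order.
def nestedLifecycle(): List[String] =
  val log = ListBuffer.empty[String]
  Using.resource(TracedHandle("outer", log)): outer =>
    Using.resource(TracedHandle("inner", log)): inner =>
      log += "work"
  log.toList
```

Calling nestedLifecycle() yields the events in the order open outer, open inner, work, close inner, close outer. Using.resource also closes the handle when the body throws, which is the main reason to prefer it over a manual close() call.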

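Text Extraction

Plain Text

Extract plain text from any page by its zero-based index. In the examples below, pdfBytes is an Array[Byte] holding a PDF, such as the result of pdf.save().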

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  assert(doc.isOpen)
  val text = doc.extractText(0)
  println(text)

Markdown and HTML

Convert the whole document to Markdown or HTML in a single call.

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.toMarkdown())
  println(doc.toHtml())

Page Elements

doc.page(i) returns a PdfPage. The facade exposes each element extractor as a Scala Seq through *Seq extension methods: wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, annotationsSeq. Each TextWord carries its own text and bbox.

import fyi.oxide.pdf.{PdfDocument, wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, annotationsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val page = doc.page(0)
  println(s"size: ${page.width()} x ${page.height()}")

  page.wordsSeq.take(8).foreach { w =>
    println(s"  ${w.text} @ ${w.bbox}  (w=${w.bbox.width})")
  }

  println(s"lines:       ${page.linesSeq.size}")
  println(s"chars:       ${page.charsSeq.size}")
  println(s"tables:      ${page.tablesSeq.size}")
  println(s"images:      ${page.imagesSeq.size}")
  println(s"annotations: ${page.annotationsSeq.size}")

You can also iterate over all pages as a Seq with doc.pagesSeq (its size matches doc.pageCount()).

import fyi.oxide.pdf.{PdfDocument, pagesSeq, wordsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  doc.pagesSeq.zipWithIndex.foreach { (page, i) =>
    println(s"page $i: ${page.wordsSeq.size} words")
  }
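Once elements arrive as a Scala Seq, ordinary collection operations apply. The sketch below picks the widest words on a page. Box and Word are hypothetical stand-ins that mirror only the text, bbox, and width members used above, and the Double width type is an assumption.

```scala
// Hypothetical stand-ins mirroring the members used above (not the library's types).
final case class Box(width: Double)
final case class Word(text: String, bbox: Box)

// Returns the n widest words, widest first.
def widest(words: Seq[Word], n: Int): Seq[Word] =
  words.sortBy(w => -w.bbox.width).take(n)
```

With the real bindings, the same sortBy and take calls would presumably be applied to page.wordsSeq directly.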

Search

doc.searchSeq(query) returns a Seq[SearchMatch]. Each match exposes its text.

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val matches = doc.searchSeq("Hello")
  println(s"${matches.size} match(es)")
  matches.foreach(m => println(s"  ${m.text}"))

Metadata as Option

Nullable document metadata is surfaced as Option[String] through producerOption and creatorOption, so missing values can be handled the idiomatic Scala way.

import fyi.oxide.pdf.{PdfDocument, producerOption, creatorOption, formFieldsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.producerOption.getOrElse("(unknown producer)"))
  println(doc.creatorOption.getOrElse("(unknown creator)"))

  // Form fields also come back as a Seq:
  println(s"form fields: ${doc.formFieldsSeq.size}")
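Beyond getOrElse, an Option[String] can be normalised before display. The helper below is a hypothetical illustration, not part of the facade. It treats blank strings the same as missing values, which may or may not suit a given document set.

```scala
// Hypothetical helper: formats an optional metadata value for display.
// Blank or whitespace-only values are treated as missing.
def describe(label: String, value: Option[String]): String =
  value.map(_.trim).filter(_.nonEmpty) match
    case Some(v) => s"$label: $v"
    case None    => s"$label: (not set)"
```

It could be called as describe("producer", doc.producerOption), assuming producerOption returns Option[String] as described above.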

Rendering

doc.render(i) rasterizes a page and returns the encoded image bytes.

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val png = doc.render(0)
  java.nio.file.Files.write(java.nio.file.Path.of("page-0.png"), png)
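The example writes the bytes to a .png file, which suggests PNG output, though the encoding is not stated explicitly. If that assumption matters, a quick sanity check against the fixed 8-byte PNG signature is cheap. looksLikePng is a hypothetical helper, not a library call.

```scala
// Every PNG file starts with this fixed 8-byte signature.
val pngSignature: Array[Byte] =
  Array(0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

// Hypothetical helper: true when the byte array begins with the PNG signature.
def looksLikePng(bytes: Array[Byte]): Boolean =
  bytes.length >= pngSignature.length &&
    bytes.take(pngSignature.length).sameElements(pngSignature)
```

A check such as looksLikePng(png) before writing the file helps avoid saving mislabelled data if the renderer is ever configured for a different format.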

Automatic Extraction

AutoExtractor.of(doc).extractDocument() returns an AutoResult containing the extracted text, optional markdown/html renderings, and the list of pages that still need OCR, all exposed idiomatically through the facade (markdownOption, htmlOption, pagesNeedingOcrSeq).

import fyi.oxide.pdf.{PdfDocument, AutoExtractor, markdownOption, htmlOption, pagesNeedingOcrSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val result = AutoExtractor.of(doc).extractDocument()
  println(result.text)
  result.markdownOption.foreach(println)
  result.htmlOption.foreach(println)
  println(s"pages needing OCR: ${result.pagesNeedingOcrSeq}")
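A common follow-up is to pick the richest available rendering and fall back to plain text. The sketch below uses a hypothetical ExtractionView case class that mirrors the fields described above. It is not the library's AutoResult, and the Int page-index type is an assumption.

```scala
// Hypothetical stand-in mirroring the described fields (not the library's AutoResult).
final case class ExtractionView(
  text: String,
  markdown: Option[String],
  html: Option[String],
  pagesNeedingOcr: Seq[Int]
)

// Prefer Markdown, then HTML, then plain text.
def bestOutput(r: ExtractionView): String =
  r.markdown.orElse(r.html).getOrElse(r.text)

// True when no page was flagged as still needing OCR.
def isComplete(r: ExtractionView): Boolean =
  r.pagesNeedingOcr.isEmpty
```

The orElse chain keeps the preference order explicit, so changing the fallback policy is a one-line edit.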

Editing

DocumentEditor.open opens an existing PDF for structural editing. The example below scrubs the metadata and then serializes the result back to bytes.

import fyi.oxide.pdf.DocumentEditor
import scala.util.Using

Using.resource(DocumentEditor.open(pdfBytes)): editor =>
  assert(editor.isOpen)
  editor.scrubMetadata()
  val cleaned: Array[Byte] = editor.save()
  java.nio.file.Files.write(java.nio.file.Path.of("scrubbed.pdf"), cleaned)

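Next Steps

Getting Started with Rust – Using PDF Oxide from Rust
Getting Started with Python – Using PDF Oxide from Python
Text Extraction – Detailed extraction options and recipes
PDF Creation – Advanced creation, encryption, and metadata
Editing – Modifying existing PDFs, annotations, and form fields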