What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Scala)

PDF Oxide is the fastest PDF library for the JVM with built-in text extraction, averaging 0.8ms across 3,830 PDFs with a 100% pass rate. The Scala 3 binding is a thin, idiomatic facade over the mature Java binding. It adds no native code. Instead, it layers Scala extension methods on top that convert java.util.Optional[T] to Option[T] and java.util.List[T] to Seq[T]. AutoCloseable handles work directly with scala.util.Using.
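To make the facade idea concrete, here is a minimal sketch of how extension methods can layer Scala types over Java-style accessors. It uses only the Scala standard library. JavaStyleDoc and its accessors are hypothetical stand-ins invented for this illustration, not PDF Oxide classes, and the real facade may be implemented differently.

```scala
import scala.jdk.CollectionConverters.*
import scala.jdk.OptionConverters.*

// Hypothetical stand-in for a Java-style handle; not part of PDF Oxide.
final class JavaStyleDoc(producerValue: String, wordValues: java.util.List[String]):
  def producer(): java.util.Optional[String] = java.util.Optional.ofNullable(producerValue)
  def words(): java.util.List[String] = wordValues

// Extension methods add Scala-typed views without touching the wrapped class.
extension (doc: JavaStyleDoc)
  def producerOption: Option[String] = doc.producer().toScala
  def wordsSeq: Seq[String] = doc.words().asScala.toSeq
```

Because the extensions are ordinary top-level definitions, callers opt in by importing them by name, which matches the import style used in the samples on this page.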

Installation

Add the dependency to build.sbt.

libraryDependencies += "fyi.oxide" % "pdf-oxide" % "0.3.69"

The Scala facade depends on the fyi.oxide:pdf-oxide Java binding, which provides the only JNI native bridge. Scala 3.3 or later is required.

Quick Start

Create a PDF from Markdown, then open it and extract its text. Using.resource closes each handle automatically.

import fyi.oxide.pdf.{Pdf, PdfDocument, producerOption}
import scala.util.Using

Using.resource(Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Scala** binding.\n")): pdf =>
  Using.resource(PdfDocument.open(pdf.save())): doc =>
    println(s"pages:    ${doc.pageCount()}")
    println(s"producer: ${doc.producerOption.getOrElse("(none)")}")
    println(doc.extractText(0))

Pdf.fromMarkdown returns a Pdf handle. pdf.save() serializes it to an Array[Byte]. PdfDocument.open takes those bytes and exposes the document API.
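The handle lifecycle in the quick start can be illustrated with the standard library alone. The sketch below uses a hypothetical TracedHandle in place of Pdf and PdfDocument to show that nested Using.resource blocks close the inner handle first, and that Using(...) captures a failure as a Try instead of throwing.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.{Try, Using}

// Hypothetical handle that records when it is closed; stands in for Pdf / PdfDocument.
final class TracedHandle(name: String, log: ListBuffer[String]) extends AutoCloseable:
  def close(): Unit = log += s"close $name"

// Nests two handles the way the quick start does and returns the event log.
def nestedLifecycle(): List[String] =
  val log = ListBuffer.empty[String]
  Using.resource(TracedHandle("pdf", log)): pdf =>
    Using.resource(TracedHandle("doc", log)): doc =>
      log += "work"
  log.toList

// Using(...) closes the handle and returns the outcome as a Try.
def safeLifecycle(fail: Boolean): Try[String] =
  val log = ListBuffer.empty[String]
  Using(TracedHandle("doc", log)): doc =>
    if fail then throw IllegalStateException("boom") else "ok"
```

Using.resource suits scripts where an exception should propagate. Using(...) may be the better fit when the caller wants to handle a failed open or extraction as a value.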

Text Extraction

Plain Text

Extract plain text from any page by its zero-based index.

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  assert(doc.isOpen)
  val text = doc.extractText(0)
  println(text)
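Since extraction is per page, collecting the text of a whole document means looping over the zero-based indices. The helper below is a sketch of that loop. extractAllText is a hypothetical name, and the page count and extractor are passed in as plain values so the helper does not depend on any PDF Oxide type.

```scala
// Joins the text of every page, visiting indices 0 until pageCount in order.
// `extract` is whatever function returns the text of one page.
def extractAllText(pageCount: Int, extract: Int => String, separator: String = "\n\n"): String =
  (0 until pageCount).map(extract).mkString(separator)
```

With an open document, a call along the lines of extractAllText(doc.pageCount(), i => doc.extractText(i)) should work, assuming pageCount() returns an Int as the samples here suggest.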

Markdown and HTML

Convert the entire document to Markdown or HTML in a single call.

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.toMarkdown())
  println(doc.toHtml())

Page Elements

doc.page(i) returns a PdfPage. The facade exposes each element extractor as a Scala Seq through *Seq extension methods: wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, and annotationsSeq. Each TextWord carries text and bbox.

import fyi.oxide.pdf.{PdfDocument, wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, annotationsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val page = doc.page(0)
  println(s"size: ${page.width()} x ${page.height()}")

  page.wordsSeq.take(8).foreach { w =>
    println(s"  ${w.text} @ ${w.bbox}  (w=${w.bbox.width})")
  }

  println(s"lines:       ${page.linesSeq.size}")
  println(s"chars:       ${page.charsSeq.size}")
  println(s"tables:      ${page.tablesSeq.size}")
  println(s"images:      ${page.imagesSeq.size}")
  println(s"annotations: ${page.annotationsSeq.size}")

You can also iterate over all pages as a Seq with doc.pagesSeq (its size equals doc.pageCount()).

import fyi.oxide.pdf.{PdfDocument, pagesSeq, wordsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  doc.pagesSeq.zipWithIndex.foreach { (page, i) =>
    println(s"page $i: ${page.wordsSeq.size} words")
  }

Search

doc.searchSeq(query) returns a Seq[SearchMatch]. Each match exposes text.

import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val matches = doc.searchSeq("Hello")
  println(s"${matches.size} match(es)")
  matches.foreach(m => println(s"  ${m.text}"))

Metadata as Option

Document metadata that may be null surfaces as Option[String] through producerOption and creatorOption, so missing values can be handled idiomatically in Scala.

import fyi.oxide.pdf.{PdfDocument, producerOption, creatorOption, formFieldsSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.producerOption.getOrElse("(unknown producer)"))
  println(doc.creatorOption.getOrElse("(unknown creator)"))

  // Form fields also come back as a Seq:
  println(s"form fields: ${doc.formFieldsSeq.size}")
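Once metadata arrives as Option[String], the usual Option combinators apply. The sketch below combines two optional values into one label. originLabel is a hypothetical helper, and treating blank strings as missing is an assumption of this example rather than documented library behavior.

```scala
// Picks the most useful origin label from two optional metadata values.
// Blank strings are treated as missing (an assumption of this sketch).
def originLabel(producer: Option[String], creator: Option[String]): String =
  val clean = (o: Option[String]) => o.map(_.trim).filter(_.nonEmpty)
  (clean(producer), clean(creator)) match
    case (Some(p), Some(c)) if p != c => s"$c (written by $p)"
    case (Some(p), _)                 => p
    case (None, Some(c))              => c
    case (None, None)                 => "(unknown origin)"
```

It could be called as originLabel(doc.producerOption, doc.creatorOption) inside the Using.resource block.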

Rendering

doc.render(i) rasterizes a page and returns the encoded image bytes.

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val png = doc.render(0)
  java.nio.file.Files.write(java.nio.file.Path.of("page-0.png"), png)
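Rendering every page follows the same pattern as rendering one. The sketch below writes one file per page and returns the paths. writePageImages is a hypothetical helper, the renderer is injected as a plain function, and the .png extension is an assumption carried over from the sample file name above rather than a documented output format.

```scala
import java.nio.file.{Files, Path}

// Writes one image file per page into outDir and returns the written paths.
// `render` is any function that returns the encoded bytes for a page index.
def writePageImages(pageCount: Int, render: Int => Array[Byte], outDir: Path): Seq[Path] =
  Files.createDirectories(outDir)
  (0 until pageCount).map: i =>
    Files.write(outDir.resolve(s"page-$i.png"), render(i))
```

With an open document, this might be called as writePageImages(doc.pageCount(), i => doc.render(i), Path.of("out")).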

Auto Extraction

AutoExtractor.of(doc).extractDocument() returns an AutoResult. It contains the extracted text, optional markdown and html renderings, and the list of pages that still need OCR, all exposed idiomatically through the facade (markdownOption, htmlOption, pagesNeedingOcrSeq).

import fyi.oxide.pdf.{PdfDocument, AutoExtractor, markdownOption, htmlOption, pagesNeedingOcrSeq}
import scala.util.Using

Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val result = AutoExtractor.of(doc).extractDocument()
  println(result.text)
  result.markdownOption.foreach(println)
  result.htmlOption.foreach(println)
  println(s"pages needing OCR: ${result.pagesNeedingOcrSeq}")
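A caller often wants a single body of text plus a flag for whether OCR is still outstanding. The sketch below shows one way to derive that from the three pieces described above. ExtractionSummary and summarize are hypothetical names for this illustration. The element type of the OCR page list is not specified here, so the helper accepts Seq[Any].

```scala
// Hypothetical summary type for this sketch; not part of PDF Oxide.
final case class ExtractionSummary(body: String, format: String, needsOcr: Boolean)

// Prefers a non-empty Markdown rendering, falls back to plain text,
// and flags the result when any page is still waiting for OCR.
def summarize(text: String, markdown: Option[String], pagesNeedingOcr: Seq[Any]): ExtractionSummary =
  val (body, format) = markdown.filter(_.nonEmpty) match
    case Some(md) => (md, "markdown")
    case None     => (text, "text")
  ExtractionSummary(body, format, pagesNeedingOcr.nonEmpty)
```

Given a result from extractDocument(), a call such as summarize(result.text, result.markdownOption, result.pagesNeedingOcrSeq) would fit this shape.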

Editing

DocumentEditor.open opens an existing PDF for structural edits. Here we strip the metadata and write the result back to bytes.

import fyi.oxide.pdf.DocumentEditor
import scala.util.Using

Using.resource(DocumentEditor.open(pdfBytes)): editor =>
  assert(editor.isOpen)
  editor.scrubMetadata()
  val cleaned: Array[Byte] = editor.save()
  java.nio.file.Files.write(java.nio.file.Path.of("scrubbed.pdf"), cleaned)

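Next Steps

Getting Started with Rust – use PDF Oxide from Rust
Getting Started with Python – use PDF Oxide from Python
Text Extraction – extraction options and recipes in detail
PDF Creation – advanced creation, encryption, and metadata
Editing – editing existing PDFs, annotations, and form fields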