What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Markdown Conversion

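PDF Oxide converts PDF pages into clean, readable Markdown. The conversion pipeline extracts text spans, clusters them into lines, looks up heading and list roles from /StructTreeRoot for tagged PDFs, detects multi-column gutters and reversed-x reading-order wraps, groups paragraphs, and emits Markdown syntax.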

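Since v0.3.36, for tagged PDFs the converter reads StructRole (Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of re-deriving heading levels from font size. Role information propagates through nested MCRs (H1 → Span → MCR, LI → LBody → Span → MCR). For untagged documents the geometric fallback still applies: bold text with a 5% size increase is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while excluding figure captions and years.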
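
The untagged-document fallback described above can be approximated in a few lines of Python. This is only an illustrative sketch: `looks_like_ordered_marker` and `fallback_heading_level` are hypothetical names, the regular expression is a guess at what a marker check might accept, and the "at least 5% larger" comparison is an assumed reading of the rule. It excludes years by limiting numeric markers to three digits and does not attempt figure-caption detection.

```python
import re
from typing import Optional

# Hypothetical stand-in for the untagged-PDF fallback; not PDF Oxide's actual code.
# Strict lowercase/uppercase Roman numeral (the lookahead prevents an empty match).
_ROMAN = r"(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})"

# A marker is 1-3 digits, a single letter, or a Roman numeral, followed by '.' or ')'.
_MARKER = re.compile(r"^(?:\d{1,3}|[a-z]|" + _ROMAN + r")[.)]$", re.IGNORECASE)


def looks_like_ordered_marker(token: str) -> bool:
    """Return True for tokens such as '1.', '12.', 'a)', 'iv.' or 'A.'."""
    return bool(_MARKER.match(token.strip()))


def fallback_heading_level(font_size: float, body_size: float, bold: bool) -> Optional[int]:
    """Promote bold text at least 5% larger than body text to H4 (assumed threshold)."""
    if bold and font_size >= body_size * 1.05:
        return 4
    return None
```

Limiting numeric markers to three digits is one simple way to keep a token like "2019." from starting a list; a real implementation may use different rules.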

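Multi-column handling: spans on the same baseline separated by more than max(3 × font_size, 30 pt) are treated as crossing columns. A reversed-x reading-order wrap (a column-major jump from a trailing span back to a leading span) splits the paragraph instead of being concatenated into nonsense tokens.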
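
The gap rule above is easy to check by hand. The sketch below assumes a minimal `Span` record with left and right x coordinates in points. `Span`, `column_gap_threshold`, and `crosses_column` are hypothetical names used for illustration, and using the left span's font size for the threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Illustrative text span; coordinates and font size are in points."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    font_size: float


def column_gap_threshold(font_size: float) -> float:
    """Gap above which two same-baseline spans count as separate columns."""
    return max(3.0 * font_size, 30.0)


def crosses_column(left: Span, right: Span) -> bool:
    """True when the horizontal gap between the spans exceeds the threshold."""
    gap = right.x0 - left.x1
    return gap > column_gap_threshold(left.font_size)
```

With 8 pt text the 30 pt floor applies, while 12 pt text needs a gap of more than 36 pt before the spans are treated as belonging to different columns.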

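RTL: bidi reordering is off by default. The earlier unconditional visual-to-logical reordering corrupted PDFs that were already in logical order (the Hebrew בנימין came out reversed). Spurious **bold** markers around Arabic contextual glyphs are removed. If the input is in visual order, callers can invoke text::bidi::reorder_visual_to_logical manually (Rust).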

Inline images are capped at a 200 KB base64 payload (added in v0.3.36). An image over the cap emits an HTML comment stating its original size instead. Use image_output_dir to write images to disk.
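
A rough sketch of how such a cap could behave is shown below. It is not the library's implementation: the function name, the exact wording of the placeholder comment, and the reading of "200 KB" as 200 × 1024 characters of base64 text are all assumptions made for illustration.

```python
import base64

# Assumption: "200 KB" is interpreted as 200 * 1024 characters of base64 text.
INLINE_LIMIT = 200 * 1024


def inline_image_markdown(data: bytes, mime: str = "image/png") -> str:
    """Embed an image as a data URI, or emit a placeholder comment when too large."""
    payload = base64.b64encode(data).decode("ascii")
    if len(payload) > INLINE_LIMIT:
        # Hypothetical placeholder format; the real comment text may differ.
        return f"<!-- image omitted: original size {len(data)} bytes -->"
    return f"![](data:{mime};base64,{payload})"
```

Because base64 expands data by roughly one third, a cap on the encoded payload is reached by raw images of about 150 KB under this reading.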

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    String md = doc.toMarkdown(0);
    System.out.println(md);
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    val md = doc.toMarkdown(0)
    println(md)
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  val md = doc.toMarkdown(0)
  println(md)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (println (pdf/to-markdown doc 0)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
auto md = doc.to_markdown(0);
std::cout << md << std::endl;

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
let md = try doc.toMarkdown(0)
print(md)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
final md = doc.toMarkdown(0);
print(md);

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
md <- pdf_to_markdown(doc, 0)
cat(md)

Julia

using PdfOxide

doc = open_document("paper.pdf")
md = to_markdown(doc, 0)
println(md)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(a, 0);
std.debug.print("{s}\n", .{md});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];
NSLog(@"%@", md);

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

API Reference

`to_markdown(page_index, ...) -> str`

Converts a single page to Markdown.

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Java Signature

String toMarkdown(int pageIndex)

Kotlin Signature

fun toMarkdown(pageIndex: Int): String

Scala Signature

def toMarkdown(pageIndex: Int): String

Clojure Signature

(pdf/to-markdown doc page-index) ; => String

PHP Signature

public function toMarkdown(int $pageIndex): string

Ruby Signature

doc.to_markdown(page_index) # => String

C++ Signature

std::string to_markdown(int page_index) const;

Swift Signature

func toMarkdown(_ pageIndex: Int) throws -> String

Dart Signature

String toMarkdown(int pageIndex)

R Signature

pdf_to_markdown(doc, page_index)  # character

Julia Signature

to_markdown(doc, page_index)::String

Zig Signature

pub fn toMarkdown(self: *Document, allocator: std.mem.Allocator, page_index: usize) ![]u8

Objective-C Signature

- (NSString *)toMarkdown:(NSInteger)pageIndex error:(NSError **)error;

Elixir Signature

PdfOxide.to_markdown(doc, page_index) :: {:ok, String.t()} | {:error, term()}

Parameter	Type	Default	Description
`page_index`	`int` / `usize` / `number`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Preserve the visual layout alignment
`detect_headings`	`bool`	`true`	Detect headings based on font size and weight
`include_images`	`bool`	`true`	Include images in the output
`image_output_dir`	`str` / `None`	`None`	Directory where extracted images are saved (Python/Rust only). Not subject to the 200 KB inline cap.
`embed_images`	`bool`	`true`	Embed images as base64 data URIs (Python/Rust only). Payloads over 200 KB emit a placeholder HTML comment stating the original size (v0.3.36).
`include_form_fields`	`bool`	`true`	Include form field values (Python/JS)

Returns: the Markdown string for the page.

`to_markdown_all(...) -> str`

Converts all pages to Markdown and joins them with horizontal rules (---).
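
The joining step can be pictured with a small standalone helper. This is a sketch, not the library's code: `join_pages` is a hypothetical name, and the blank lines placed around each `---` are an assumption chosen so that the rule is not parsed as a setext heading underline.

```python
def join_pages(pages: list[str]) -> str:
    """Join per-page Markdown strings with a horizontal rule between pages."""
    # Strip each page so the separator spacing is uniform regardless of
    # trailing newlines in the individual page strings.
    return "\n\n---\n\n".join(page.strip() for page in pages)
```

The same result could in principle be obtained by calling the single-page conversion for each page index and passing the list to a helper like this one.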