What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Markdown Conversion

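PDF Oxide converts PDF pages into clean, readable Markdown. The conversion pipeline extracts text spans, clusters them into lines, reads heading and list roles from /StructTreeRoot in tagged PDFs, detects multi-column gaps and reverse-x reading-order wraps, groups lines into paragraphs, and finally emits Markdown syntax.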
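The line-clustering step of such a pipeline can be illustrated with a minimal sketch. This is not PDF Oxide's implementation: the `TextSpan` type, the `cluster_into_lines` helper, and the 2 pt baseline tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextSpan:
    """Hypothetical span record: text plus its left edge and baseline (in points)."""
    text: str
    x: float
    baseline: float


def cluster_into_lines(spans: List[TextSpan], tolerance: float = 2.0) -> List[str]:
    """Group spans whose baselines lie within `tolerance` points, top to bottom."""
    lines: List[List[TextSpan]] = []
    # PDF user space has its origin at the bottom-left, so a larger baseline is higher up.
    for span in sorted(spans, key=lambda s: -s.baseline):
        if lines and abs(lines[-1][0].baseline - span.baseline) <= tolerance:
            lines[-1].append(span)
        else:
            lines.append([span])
    # Within each line, order spans left to right before joining their text.
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x)) for line in lines]


spans = [
    TextSpan("world", 60.0, 700.0),
    TextSpan("Hello", 10.0, 700.5),
    TextSpan("Next", 10.0, 686.0),
]
print(cluster_into_lines(spans))
```

A real converter would also need to handle rotated text and superscripts, which this sketch ignores.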

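Since v0.3.36, for tagged PDFs the converter reads StructRole(Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of re-deriving heading levels from font size. Role information propagates through nested MCRs (H1 → Span → MCR, LI → LBody → Span → MCR). For untagged documents the geometric fallback still applies: bold text with a 5% size increase is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while excluding figure captions and years.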
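The kind of check that an ordered-list marker test performs can be sketched as follows. This is an illustrative approximation only, not the library's is_ordered_list_marker: the regular expression, the three-digit limit used to reject years, and the `split_list_item` helper are assumptions.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern, for illustration only.
_MARKER = re.compile(
    r"""
      \d{1,3}[.)]        # 1.  12.  3)   (four-digit tokens such as years are rejected)
    | [ivx]{1,4}[.)]     # iv.  ix)      (short lowercase roman numerals)
    | [A-Za-z][.)]       # a)  A.
    """,
    re.VERBOSE,
)


def split_list_item(line: str) -> Optional[Tuple[str, str]]:
    """Return (marker, body) if the line starts with an ordered-list marker, else None."""
    head, _, rest = line.strip().partition(" ")
    if _MARKER.fullmatch(head):
        return head, rest.strip()
    return None


for sample in ["12. Install the package", "iv. Results", "Figure 1. Overview", "2024. Annual report"]:
    print(repr(sample), "->", split_list_item(sample))
```

Because only the first whitespace-delimited token is tested, a caption such as "Figure 1. Overview" is rejected without any caption-specific rule.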

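Multi-column handling: Text spans that share a baseline but are separated by a gap greater than max(3 × font_size, 30 pt) are treated as belonging to different columns. A reverse-x reading-order wrap (in column-major order, from the end of one column to the start of the next) splits the paragraph instead of splicing the text into a meaningless string.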
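The gap threshold above works out to 30 pt for any font up to 10 pt and grows linearly beyond that. A minimal sketch of the test, assuming a hypothetical `Span` record and assuming the left span's font size is the one used (the source does not say which span supplies font_size):

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span geometry: left edge, right edge, and font size, all in points."""
    x0: float
    x1: float
    font_size: float


def column_gap_threshold(font_size: float) -> float:
    """max(3 x font_size, 30 pt), as stated in the text above."""
    return max(3.0 * font_size, 30.0)


def is_column_break(left: Span, right: Span) -> bool:
    """True when the horizontal gap between two same-baseline spans exceeds the threshold."""
    gap = right.x0 - left.x1
    # Assumption: the left span's font size drives the threshold.
    return gap > column_gap_threshold(left.font_size)


print(column_gap_threshold(8.0))   # 30.0 (the 30 pt floor applies)
print(column_gap_threshold(12.0))  # 36.0
print(is_column_break(Span(72, 200, 10), Span(240, 300, 10)))  # gap 40 > 30
print(is_column_break(Span(72, 200, 12), Span(230, 300, 12)))  # gap 30 <= 36
```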

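RTL: Bidi reordering is off by default. The previous unconditional visual-order → logical-order reordering corrupted PDFs that were already in logical order (the Hebrew word בנימין came out reversed). Spurious **bold** markers around Arabic contextual glyph forms have been removed. If the input is in visual order, callers can invoke text::bidi::reorder_visual_to_logical (Rust) manually.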

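The base64 payload of an inline image is capped at 200 KB (new in v0.3.36). An image over the cap is replaced by an HTML comment that notes its original size; use image_output_dir to write images to disk instead.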
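The inline cap can be pictured with a short sketch. It is not the library's code: the `inline_image_markdown` helper, the reading of 200 KB as 200 × 1024 bytes, and the exact wording of the placeholder comment are assumptions.

```python
import base64

# Assumption: "200 KB" is interpreted as 200 * 1024 bytes of base64 text.
INLINE_CAP_BYTES = 200 * 1024


def inline_image_markdown(data: bytes, mime: str = "image/png", alt: str = "image") -> str:
    """Embed an image as a data URI, or emit a placeholder comment when it is too large."""
    payload = base64.b64encode(data).decode("ascii")
    if len(payload) > INLINE_CAP_BYTES:
        # Hypothetical placeholder format; the real comment text may differ.
        return f"<!-- image omitted: original size {len(data)} bytes exceeds inline cap -->"
    return f"![{alt}](data:{mime};base64,{payload})"


print(inline_image_markdown(b"abc"))
print(inline_image_markdown(bytes(300 * 1024)))
```

Note that base64 inflates data by about a third, so a cap on the encoded payload corresponds to roughly 150 KB of raw image bytes.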

Quick Examples

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    String md = doc.toMarkdown(0);
    System.out.println(md);
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    val md = doc.toMarkdown(0)
    println(md)
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  val md = doc.toMarkdown(0)
  println(md)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (println (pdf/to-markdown doc 0)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
auto md = doc.to_markdown(0);
std::cout << md << std::endl;

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
let md = try doc.toMarkdown(0)
print(md)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
final md = doc.toMarkdown(0);
print(md);

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
md <- pdf_to_markdown(doc, 0)
cat(md)

Julia

using PdfOxide

doc = open_document("paper.pdf")
md = to_markdown(doc, 0)
println(md)

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(a, 0);
std.debug.print("{s}\n", .{md});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];
NSLog(@"%@", md);

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

API Reference

`to_markdown(page_index, ...) -> str`

Converts a single page to Markdown.

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Java Signature

String toMarkdown(int pageIndex)

Kotlin Signature

fun toMarkdown(pageIndex: Int): String

Scala Signature

def toMarkdown(pageIndex: Int): String

Clojure Signature

(pdf/to-markdown doc page-index) ; => String

PHP Signature

public function toMarkdown(int $pageIndex): string

Ruby Signature

doc.to_markdown(page_index) # => String

C++ Signature

std::string to_markdown(int page_index) const;

Swift Signature

func toMarkdown(_ pageIndex: Int) throws -> String

Dart Signature

String toMarkdown(int pageIndex)

R Signature

pdf_to_markdown(doc, page_index)  # character

Julia Signature

to_markdown(doc, page_index)::String

Zig Signature

pub fn toMarkdown(self: *Document, allocator: std.mem.Allocator, page_index: usize) ![]u8

Objective-C Signature

- (NSString *)toMarkdown:(NSInteger)pageIndex error:(NSError **)error;

Elixir Signature

PdfOxide.to_markdown(doc, page_index) :: {:ok, String.t()} | {:error, term()}

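Parameter	Type	Default	Description
`page_index`	`int` / `usize` / `number`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Preserve visual layout positioning
`detect_headings`	`bool`	`true`	Detect headings from font size and weight
`include_images`	`bool`	`true`	Include images in the output
`image_output_dir`	`str` / `None`	`None`	Directory in which to save extracted images (Python/Rust only). Not subject to the 200 KB inline cap.
`embed_images`	`bool`	`true`	Embed images as base64 data URIs (Python/Rust only). Payloads over 200 KB are replaced by a placeholder HTML comment noting the original size (v0.3.36).
`include_form_fields`	`bool`	`true`	Include form field values (Python/JS)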

Returns: The Markdown string for the page.

`to_markdown_all(...) -> str`

Converts all pages to Markdown, with pages separated by a horizontal rule (---).
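The shape of the combined output can be sketched without the library. The `join_pages` and `split_pages` helpers below are hypothetical, and the blank lines around the rule are an assumption; the source only states that pages are separated by ---.

```python
from typing import Iterable, List

# Assumption: the rule is surrounded by blank lines; the library's exact whitespace may differ.
PAGE_SEPARATOR = "\n\n---\n\n"


def join_pages(pages: Iterable[str]) -> str:
    """Concatenate per-page Markdown with a horizontal rule between pages."""
    return PAGE_SEPARATOR.join(page.strip() for page in pages)


def split_pages(markdown: str) -> List[str]:
    """Naive inverse of join_pages; breaks if a page itself contains the same rule."""
    return markdown.split(PAGE_SEPARATOR)


combined = join_pages(["# Title\n", "Second page text"])
print(combined)
print(split_pages(combined))
```

Splitting on the separator is only safe when no page contains its own horizontal rule, so per-page calls to to_markdown are the more reliable choice when page boundaries matter.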