What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Markdown Conversion

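PDF Oxide converts PDF pages into clean, readable Markdown. The conversion pipeline extracts text spans, clusters them into lines, consults /StructTreeRoot for heading and list roles in tagged PDFs, detects multi-column gaps and reverse-x reading-order breaks, groups lines into paragraphs, and emits Markdown syntax.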

As of v0.3.36, for tagged PDFs the converter reads StructRole (Heading(1..6) | ListItem | ListItemLabel | ListItemBody) directly from /StructTreeRoot instead of re-inferring heading levels from font size. Role information propagates through nested MCRs (H1 → Span → MCR, LI → LBody → Span → MCR). For untagged documents the existing geometric fallback still applies. Bold text with a 5% size increase is promoted to H4, and is_ordered_list_marker recognizes 1. / 12. / a) / iv. / A. while excluding figure captions and years.
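The marker forms listed above can be approximated with a small regular expression. This is a sketch only: `looks_like_ordered_marker` is a hypothetical stand-in, not the library's `is_ordered_list_marker`. Accepting either `.` or `)` after every form, and rejecting years by capping numbers at three digits, are assumptions. Figure-caption filtering is not modeled.

```python
import re

# Hypothetical approximation of an ordered-list marker check.
# Accepted bodies: 1-3 digits, lowercase roman numerals, or a single letter,
# followed by "." or ")". Four-digit numbers (likely years) are rejected.
_MARKER = re.compile(r"(?:\d{1,3}|[ivxlcdm]+|[a-z]|[A-Z])[.)]")


def looks_like_ordered_marker(token: str) -> bool:
    """Return True if the token looks like an ordered-list marker."""
    return _MARKER.fullmatch(token.strip()) is not None


if __name__ == "__main__":
    for token in ["1.", "12.", "a)", "iv.", "A.", "2023.", "Fig."]:
        print(token, looks_like_ordered_marker(token))
```

Capping digits at three is a blunt way to skip years; a real implementation would probably also look at the surrounding text.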

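Multi-column handling: spans on the same baseline that are separated by more than max(3 × font_size, 30 pt) are treated as belonging to different columns. A reverse-x reading-order break (column-major, last span → first span) splits the paragraph instead of merging the spans into a meaningless token.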
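The gap rule is simple arithmetic and can be sketched as follows. The function names and the span representation (plain x coordinates in points) are hypothetical and chosen only for illustration. The strict `>` comparison follows the wording above.

```python
def column_gap_threshold(font_size: float) -> float:
    """Minimum horizontal gap (in points) that separates two columns."""
    return max(3.0 * font_size, 30.0)


def is_column_gap(prev_x_end: float, next_x_start: float, font_size: float) -> bool:
    """True if two same-baseline spans are far enough apart to be in different columns."""
    gap = next_x_start - prev_x_end
    return gap > column_gap_threshold(font_size)


if __name__ == "__main__":
    # 12 pt text: threshold is 36 pt, so a 40 pt gap counts as a column gap.
    print(is_column_gap(200.0, 240.0, 12.0))
    # 8 pt text: 3 x 8 = 24 pt, so the 30 pt floor applies.
    print(column_gap_threshold(8.0))
```

The 30 pt floor keeps very small fonts from producing a threshold so tight that ordinary word spacing would be mistaken for a column gutter.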

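RTL: bidi reordering is off by default. The previous unconditional visual→logical reordering broke logical-order PDFs (for example, Hebrew בנימין came out reversed). Spurious **bold** markers around Arabic contextual glyphs are stripped. If the input is in visual order, callers can invoke text::bidi::reorder_visual_to_logical manually (Rust).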

Inline images are capped at a 200 KB base64 payload (added in v0.3.36). Images over the limit are replaced by an HTML comment stating the original size. To save images to disk instead, use image_output_dir.
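The cap can be illustrated with a small sketch. Everything here is hypothetical: the helper name, the exact comment wording, and the reading of "200 KB" as 200 × 1024 characters of base64 text are assumptions, and the library's actual output may differ.

```python
import base64

# Assumption: the limit applies to the encoded base64 text, and KB means 1024 bytes.
INLINE_LIMIT = 200 * 1024


def inline_image_markdown(data: bytes, mime: str = "image/png", alt: str = "image") -> str:
    """Embed an image as a data URI, or emit an HTML comment if it is too large."""
    payload = base64.b64encode(data).decode("ascii")
    if len(payload) > INLINE_LIMIT:
        # Hypothetical comment format; it records the original (raw) size.
        return f"<!-- image omitted: {len(data)} bytes exceeds inline limit -->"
    return f"![{alt}](data:{mime};base64,{payload})"


if __name__ == "__main__":
    print(inline_image_markdown(b"abc"))
    print(inline_image_markdown(bytes(200 * 1024)))
```

Base64 inflates data by about one third, so under this reading a raw image of roughly 150 KB is already at the limit.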

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("paper.pdf");
const md = doc.toMarkdown(0, { detectHeadings: true });
console.log(md);
doc.close();

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");
var md = doc.ToMarkdown(0);
Console.WriteLine(md);

WASM

const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0, true);
console.log(md);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("paper.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let md = doc.to_markdown(0, &options)?;
println!("{}", md);

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("paper.pdf"))) {
    String md = doc.toMarkdown(0);
    System.out.println(md);
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("paper.pdf")).use { doc ->
    val md = doc.toMarkdown(0)
    println(md)
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("paper.pdf")) { doc =>
  val md = doc.toMarkdown(0)
  println(md)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "paper.pdf")]
  (println (pdf/to-markdown doc 0)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('paper.pdf') do |doc|
  puts doc.to_markdown(0)
end

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("paper.pdf");
auto md = doc.to_markdown(0);
std::cout << md << std::endl;

Swift

import PdfOxide

let doc = try Document.open("paper.pdf")
let md = try doc.toMarkdown(0)
print(md)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('paper.pdf');
final md = doc.toMarkdown(0);
print(md);

R

library(pdfoxide)

doc <- pdf_open("paper.pdf")
md <- pdf_to_markdown(doc, 0)
cat(md)

Julia

using PdfOxide

doc = open_document("paper.pdf")
md = to_markdown(doc, 0)
println(md)

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(a, 0);
std.debug.print("{s}\n", .{md});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSString *md = [doc toMarkdown:0 error:&err];
NSLog(@"%@", md);

Elixir

{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)

API Reference

`to_markdown(page_index, ...) -> str`

Converts a single page to Markdown.

Python Signature

doc.to_markdown(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toMarkdown(pageIndex, detectHeadings?, includeImages?, includeFormFields?) -> string

Rust Signature

pub fn to_markdown(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Java Signature

String toMarkdown(int pageIndex)

Kotlin Signature

fun toMarkdown(pageIndex: Int): String

Scala Signature

def toMarkdown(pageIndex: Int): String

Clojure Signature

(pdf/to-markdown doc page-index) ; => String

PHP Signature

public function toMarkdown(int $pageIndex): string

Ruby Signature

doc.to_markdown(page_index) # => String

C++ Signature

std::string to_markdown(int page_index) const;

Swift Signature

func toMarkdown(_ pageIndex: Int) throws -> String

Dart Signature

String toMarkdown(int pageIndex)

R Signature

pdf_to_markdown(doc, page_index)  # character

Julia Signature

to_markdown(doc, page_index)::String

Zig Signature

pub fn toMarkdown(self: *Document, allocator: std.mem.Allocator, page_index: usize) ![]u8

Objective-C Signature

- (NSString *)toMarkdown:(NSInteger)pageIndex error:(NSError **)error;

Elixir Signature

PdfOxide.to_markdown(doc, page_index) :: {:ok, String.t()} | {:error, term()}

Parameter	Type	Default	Description
`page_index`	`int` / `usize` / `number`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Preserve the visual layout positioning
`detect_headings`	`bool`	`true`	Detect headings from font size and weight
`include_images`	`bool`	`true`	Include images in the output
`image_output_dir`	`str` / `None`	`None`	Directory where extracted images are saved (Python/Rust only). Not subject to the 200 KB inline limit.
`embed_images`	`bool`	`true`	Embed images as base64 data URIs (Python/Rust only). Payloads over 200 KB are replaced by an HTML comment stating the original size (v0.3.36).
`include_form_fields`	`bool`	`true`	Include form field values (Python/JS)

Returns: the Markdown string for the page.

`to_markdown_all(...) -> str`

Converts all pages to Markdown and joins them, separated by horizontal rules (---).
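The joining step can be pictured with a minimal sketch that works on already-converted page strings. `join_pages` is a hypothetical helper, and the blank lines around the rule are an assumption; the library's exact separator whitespace is not specified here.

```python
from typing import Sequence


def join_pages(pages: Sequence[str]) -> str:
    """Join per-page Markdown strings with a horizontal rule between pages."""
    # Blank lines around "---" keep it from being parsed as a setext heading underline.
    return "\n\n---\n\n".join(page.strip() for page in pages)


if __name__ == "__main__":
    print(join_pages(["# Page one", "Body of page two"]))
```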