What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

HTML Conversion

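PDF Oxide converts PDF pages to structured HTML, with options for heading detection, font styling, and CSS-based layout preservation. Use to_html() to convert a single page and to_html_all() to convert the whole document. With preserve_layout enabled, elements are placed at CSS absolute coordinates that match the original PDF layout. With it disabled, the output is semantic HTML in natural document flow.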

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
html = doc.to_html(0, detect_headings=True)
print(html)

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const html = doc.toHtml(0);
console.log(html);
doc.close();

Go

import (
    "fmt"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
html, _ := doc.ToHtml(0)
fmt.Println(html)

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var html = doc.ToHtml(0);
Console.WriteLine(html);

WASM

const doc = new WasmPdfDocument(bytes);
const html = doc.toHtml(0);
console.log(html);

Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("report.pdf")?;
let options = ConversionOptions { detect_headings: true, ..Default::default() };
let html = doc.to_html(0, &options)?;
println!("{}", html);

Java

import fyi.oxide.pdf.PdfDocument;

try (PdfDocument doc = PdfDocument.open(java.nio.file.Path.of("report.pdf"))) {
    String html = doc.toHtml(0);
    System.out.println(html);
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    val html = doc.toHtml(0)
    println(html)
}

Scala

import fyi.oxide.pdf.PdfDocument
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  val html = doc.toHtml(0)
  println(html)
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (println (pdf/to-html doc 0)))

PHP

use PdfOxide\PdfDocument;

$doc = PdfDocument::open('report.pdf');
$html = $doc->toHtml(0);
echo $html;
$doc->close();

Ruby

require 'pdf_oxide'

PdfOxide::PdfDocument.open('report.pdf') do |doc|
  html = doc.to_html(0)
  puts html
end

C++

#include <iostream>
#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
auto html = doc.to_html(0);
std::cout << html << std::endl;

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let html = try doc.toHtml(0)
print(html)

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final html = doc.toHtml(0);
print(html);

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
html <- pdf_to_html(doc, 0)
cat(html)

Julia

using PdfOxide

doc = open_document("report.pdf")
html = to_html(doc, 0)
println(html)

Zig

const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const html = try doc.toHtml(a, 0);
std.debug.print("{s}\n", .{html});

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *html = [doc toHtml:0 error:&err];
NSLog(@"%@", html);

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, html} = PdfOxide.to_html(doc, 0)
IO.puts(html)

API Reference

`to_html(page_index, ...) -> str`

Converts a single page to HTML.

Python Signature

doc.to_html(
    page: int,
    preserve_layout: bool = False,
    detect_headings: bool = True,
    include_images: bool = True,
    image_output_dir: str | None = None,
    embed_images: bool = True,
) -> str

JavaScript Signature

doc.toHtml(pageIndex, preserveLayout?, detectHeadings?, includeFormFields?) -> string

Rust Signature

pub fn to_html(
    &mut self,
    page_index: usize,
    options: &ConversionOptions,
) -> Result<String>

Parameter	Type	Default	Description
`page_index`	`int` / `usize` / `number`	–	Zero-based page index
`preserve_layout`	`bool`	`false`	Use CSS absolute positioning to match the PDF layout
`detect_headings`	`bool`	`true`	Automatically detect heading levels from font size
`include_images`	`bool`	`true`	Include images in the HTML output
`image_output_dir`	`str` / `None`	`None`	Directory in which to save extracted images (Python/Rust only)
`embed_images`	`bool`	`true`	Embed images as base64 data URIs (Python/Rust only)
`include_form_fields`	`bool`	`true`	Include form field values (Python/JS)

Returns: the HTML string for the given page.
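As a rough illustration of how these options might be used in a script, the sketch below writes one HTML file per page. The helper name `export_pages` and the `page_NNNN.html` file naming are assumptions made for this example and are not part of PDF Oxide. The only library call it relies on is `doc.to_html(page, ...)` with the keyword arguments from the Python signature above.

```python
from pathlib import Path


def export_pages(doc, page_indices, out_dir, **options):
    """Write one HTML file per requested page and return the paths written.

    `doc` is any object exposing to_html(page, **options), such as a
    pdf_oxide PdfDocument. `export_pages` itself is a hypothetical helper.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for index in page_indices:
        # Options are forwarded unchanged, e.g. preserve_layout=True.
        html = doc.to_html(index, **options)
        # File names are 1-based for readability; page indices are 0-based.
        path = out / f"page_{index + 1:04d}.html"
        path.write_text(html, encoding="utf-8")
        written.append(path)
    return written
```

With a document opened as in the quick example, a call might look like `export_pages(doc, range(3), "out", preserve_layout=True)`.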

When preserve_layout is true, the output consists of <div> elements that use CSS absolute positioning:

<div style="position: absolute; left: 72.0px; top: 100.0px; font-size: 24px; font-weight: bold;">
  Introduction
</div>
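
If the positioned output needs further processing, the coordinates can be read back from the inline styles. The following is a minimal sketch using only the Python standard library. It assumes the output looks like the sample above (flat, non-nested `<div>` elements with `left` and `top` given in `px`), and the names `PositionedTextParser` and `positioned_text` are made up for this example.

```python
import re
from html.parser import HTMLParser

# Match "left: 72.0px" but not "margin-left: 72.0px".
_LEFT = re.compile(r"(?<![-\w])left:\s*([\d.]+)px")
_TOP = re.compile(r"(?<![-\w])top:\s*([\d.]+)px")


class PositionedTextParser(HTMLParser):
    """Collect (left, top, text) for each absolutely positioned <div>."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._current = None  # [left, top, text parts] while inside a div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        style = dict(attrs).get("style") or ""
        left, top = _LEFT.search(style), _TOP.search(style)
        if left and top:
            self._current = [float(left.group(1)), float(top.group(1)), []]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2].append(data)

    def handle_endtag(self, tag):
        # Assumes positioned divs are not nested inside one another.
        if tag == "div" and self._current is not None:
            left, top, parts = self._current
            self.items.append((left, top, "".join(parts).strip()))
            self._current = None


def positioned_text(html):
    """Return a list of (left_px, top_px, text) tuples found in `html`."""
    parser = PositionedTextParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

Sorting the resulting tuples by `top` and then `left` gives an approximate reading order, although that may not hold for multi-column pages.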

When preserve_layout is false, the output uses semantic elements:

<h1>Introduction</h1>
<p>This report examines the quarterly results...</p>
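
Semantic output of this kind is straightforward to post-process. As one hedged example, the sketch below builds a heading outline from the returned HTML using the standard-library `html.parser` module. `HeadingOutlineParser` and `heading_outline` are hypothetical names, and the code assumes headings are emitted as plain `<h1>` to `<h6>` elements, as in the sample above.

```python
from html.parser import HTMLParser


class HeadingOutlineParser(HTMLParser):
    """Collect (level, text) pairs for <h1>..<h6> elements."""

    def __init__(self):
        super().__init__()
        self.outline = []
        self._level = None  # heading level currently open, if any
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._parts = []

    def handle_data(self, data):
        if self._level is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.outline.append((self._level, "".join(self._parts).strip()))
            self._level = None


def heading_outline(html):
    """Return the headings found in `html` as a list of (level, text)."""
    parser = HeadingOutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline
```

An outline like this can serve as a table of contents or as chunk boundaries when feeding page content to a search index.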

`to_html_all(...) -> str`

Converts all pages to HTML. Each page is wrapped in a <div class="page"> element.
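
Because every page sits in its own wrapper, the combined output can be split back into pages. The sketch below is one possible way to do that with the standard library, returning the plain text of each page. It assumes the wrappers are top-level `<div>` elements whose `class` attribute includes `page`, and that nested `<div>` elements are properly closed. `PageTextParser` and `page_texts` are illustrative names, not part of the library.

```python
from html.parser import HTMLParser


class PageTextParser(HTMLParser):
    """Collect the text content of each <div class="page"> wrapper."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._depth = 0  # div nesting depth inside the current page wrapper
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            # A nested div inside a page: track it so its close tag
            # is not mistaken for the end of the page.
            self._depth += 1
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._parts = []

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace into single spaces.
            self.pages.append(" ".join(" ".join(self._parts).split()))


def page_texts(html):
    """Return one text string per page wrapper found in `html`."""
    parser = PageTextParser()
    parser.feed(html)
    parser.close()
    return parser.pages
```

The length of the returned list also gives a quick check that the number of wrappers matches the expected page count.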