What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Create from HTML

Two entry points are available:

Pdf::from_html(content): basic structural HTML (headings, paragraphs, lists, code, bold/italic). No styling. Available in all bindings.
Pdf::from_html_css(html, css, font_bytes): a full HTML+CSS pipeline in pure Rust, introduced in v0.3.37. It combines a hand-written CSS engine (a subset of L3 + L4 selectors, the cascade, calc() / var(), @page / @media print), block / flex / grid layout built on Taffy, line breaking per UAX #14, RTL shaping via rustybuzz, ::before / ::after, page-break-*, <a href> → link annotation, <img> data URI → /XObject, and a multi-font cascade. Zero MPL-licensed dependencies. Available in all bindings.

Quick example

Python

from pdf_oxide import Pdf

pdf = Pdf.from_html("<h1>Hello</h1><p>World</p>")
pdf.save("out.pdf")

WASM

import { WasmPdf } from "pdf-oxide-wasm";
import { writeFileSync } from "fs";

const pdf = WasmPdf.fromHtml("<h1>Hello</h1><p>World</p>");
writeFileSync("out.pdf", pdf.toBytes());

Rust

use pdf_oxide::api::Pdf;

let pdf = Pdf::from_html("<h1>Hello</h1><p>World</p>")?;
pdf.save("out.pdf")?;

Go

package main

import (
    "log"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    pdf, err := pdfoxide.FromHtml("<h1>Hello</h1><p>World</p>")
    if err != nil { log.Fatal(err) }
    defer pdf.Close()

    if err := pdf.Save("out.pdf"); err != nil { log.Fatal(err) }
}

C#

using PdfOxide;

using var pdf = Pdf.FromHtml("<h1>Hello</h1><p>World</p>");
pdf.Save("out.pdf");

Java

import fyi.oxide.pdf.Pdf;
import java.nio.file.Path;

try (Pdf pdf = Pdf.fromHtml("<h1>Hello</h1><p>World</p>")) {
    pdf.saveTo(Path.of("out.pdf"));
}

PHP

use PdfOxide\Pdf;

$pdf = Pdf::fromHtml('<h1>Hello</h1><p>World</p>');
file_put_contents('out.pdf', $pdf->save());

Ruby

require 'pdf_oxide'

PdfOxide::Pdf.from_html('<h1>Hello</h1><p>World</p>') { |pdf| pdf.save('out.pdf') }

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto pdf = pdf_oxide::Pdf::from_html("<h1>Hello</h1><p>World</p>");
pdf.save("out.pdf");

Swift

import PdfOxide

let pdf = try Pdf.fromHtml("<h1>Hello</h1><p>World</p>")
try pdf.save("out.pdf")

Kotlin

import fyi.oxide.pdf.Pdf

Pdf.fromHtml("<h1>Hello</h1><p>World</p>").use { it.saveTo(java.nio.file.Path.of("out.pdf")) }

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final pdf = Pdf.fromHtml('<h1>Hello</h1><p>World</p>');
pdf.save('out.pdf');

R

library(pdfoxide)

pdf <- pdf_from_html("<h1>Hello</h1><p>World</p>")
pdf_save(pdf, "out.pdf")

Julia

using PdfOxide

pdf = from_html("<h1>Hello</h1><p>World</p>")
save(pdf, "out.pdf")

Zig

const pdf_oxide = @import("pdf_oxide");

var pdf = try pdf_oxide.Pdf.fromHtml("<h1>Hello</h1><p>World</p>");
try pdf.save("out.pdf");

Scala

import fyi.oxide.pdf.Pdf
import scala.util.Using

Using.resource(Pdf.fromHtml("<h1>Hello</h1><p>World</p>"))(_.saveTo(java.nio.file.Path.of("out.pdf")))

Clojure

(require '[pdf-oxide.core :as pdf])

(let [p (pdf/from-html "<h1>Hello</h1><p>World</p>")]
  (.saveTo p (java.nio.file.Path/of "out.pdf" (into-array String []))))

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXPdf *pdf = [POXPdf fromHtml:@"<h1>Hello</h1><p>World</p>" error:&err];
[pdf saveToPath:@"out.pdf" error:&err];

Elixir

{:ok, pdf} = PdfOxide.from_html("<h1>Hello</h1><p>World</p>")
PdfOxide.save(pdf, "out.pdf")

HTML + CSS pipeline (v0.3.37)

Pdf::from_html_css(html, css, font_bytes) takes HTML, a CSS stylesheet, and the bytes of a TTF/OTF font. It returns a paginated PDF. Text read back with extract_text round-trips byte for byte, so the generated PDFs can be exercised by the existing test infrastructure.

Rust:

use pdf_oxide::api::Pdf;

let font = std::fs::read("DejaVuSans.ttf")?;
let pdf = Pdf::from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt } p { line-height: 1.5 }",
    font,
)?;
pdf.save("out.pdf")?;

Python:

from pdf_oxide import Pdf

with open("DejaVuSans.ttf", "rb") as f:
    font = f.read()

pdf = Pdf.from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font,
)
pdf.save("out.pdf")

Node / TypeScript:

import { Pdf } from "pdf-oxide";
import { readFileSync } from "fs";

const font = readFileSync("DejaVuSans.ttf");
const pdf = Pdf.fromHtmlCss(
  "<h1>Hello</h1><p>World</p>",
  "h1 { color: blue; font-size: 24pt }",
  font,
);
pdf.save("out.pdf");

Go:

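font, err := os.ReadFile("DejaVuSans.ttf")
if err != nil { log.Fatal(err) }

pdf, err := pdfoxide.FromHtmlCss(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font,
)
if err != nil { log.Fatal(err) }
defer pdf.Close()

// Report a failed write instead of discarding the error.
if err := pdf.Save("out.pdf"); err != nil { log.Fatal(err) }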

C#:

var font = File.ReadAllBytes("DejaVuSans.ttf");
using var pdf = Pdf.FromHtmlCss(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font);
pdf.Save("out.pdf");

C++:

#include <pdf_oxide/pdf_oxide.hpp>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

std::ifstream in("DejaVuSans.ttf", std::ios::binary);
std::string font((std::istreambuf_iterator<char>(in)), {});
auto pdf = pdf_oxide::Pdf::from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    std::vector<uint8_t>(font.begin(), font.end()));
pdf.save("out.pdf");

Swift:

import PdfOxide
import Foundation

let font = [UInt8](try Data(contentsOf: URL(fileURLWithPath: "DejaVuSans.ttf")))
let pdf = try Pdf.fromHtmlCss(
    html: "<h1>Hello</h1><p>World</p>",
    css: "h1 { color: blue; font-size: 24pt }",
    fontBytes: font)
try pdf.save("out.pdf")

Dart:

import 'dart:io';
import 'package:pdf_oxide/pdf_oxide.dart';

final font = File('DejaVuSans.ttf').readAsBytesSync();
final pdf = Pdf.fromHtmlCss(
    '<h1>Hello</h1><p>World</p>',
    'h1 { color: blue; font-size: 24pt }',
    font);
pdf.save('out.pdf');

R:

library(pdfoxide)

font <- readBin("DejaVuSans.ttf", "raw", file.info("DejaVuSans.ttf")$size)
pdf <- pdf_from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font)
pdf_save(pdf, "out.pdf")

Julia:

using PdfOxide

font = read("DejaVuSans.ttf")
pdf = from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font)
save(pdf, "out.pdf")

Zig:

const pdf_oxide = @import("pdf_oxide");
const std = @import("std");

const font = try std.fs.cwd().readFileAlloc(std.heap.page_allocator, "DejaVuSans.ttf", 1 << 24);
var pdf = try pdf_oxide.Pdf.fromHtmlCss(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font);
try pdf.save("out.pdf");

Objective-C:

#import "POXPdfOxide.h"
NSError *err = nil;

NSData *font = [NSData dataWithContentsOfFile:@"DejaVuSans.ttf"];
POXPdf *pdf = [POXPdf fromHtml:@"<h1>Hello</h1><p>World</p>"
                          css:@"h1 { color: blue; font-size: 24pt }"
                    fontBytes:font
                        error:&err];
[pdf saveToPath:@"out.pdf" error:&err];

Elixir:

font = File.read!("DejaVuSans.ttf")
{:ok, pdf} = PdfOxide.from_html_css(
    "<h1>Hello</h1><p>World</p>",
    "h1 { color: blue; font-size: 24pt }",
    font)
PdfOxide.save(pdf, "out.pdf")

Multi-font cascade

Use Pdf::from_html_css_with_fonts(html, css, fonts) when the document mixes several font families. The CSS font-family property on any element is resolved against the registered families (case-insensitively, quoted or unquoted, including unquoted multi-word names). Unknown families fall back to the first registered font.

from pathlib import Path

from pdf_oxide import Pdf

# Each entry pairs a CSS family name with the raw bytes of its font file.
fonts = [
    ("DejaVu Sans", Path("DejaVuSans.ttf").read_bytes()),
    ("Noto Sans CJK", Path("NotoSansCJKtc-Regular.otf").read_bytes()),
]

pdf = Pdf.from_html_css_with_fonts(
    '<h1 style="font-family: DejaVu Sans">English</h1>'
    '<p style="font-family: \'Noto Sans CJK\'">中文段落</p>',
    "h1 { font-size: 24pt }",
    fonts,
)
pdf.save("multilang.pdf")

CJK content is automatically subset in the output (v0.3.38 #385): a PDF that uses 5 characters from a CJK font of about 17 MB typically comes out under 100 KB.
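The family-matching rule described in this section can be modelled in a few lines of plain Python. This is only an illustrative sketch of the stated behaviour, not the library's implementation: normalize_family and resolve_family are hypothetical helper names, and the handling of comma-separated font-family lists is an assumption that the text above does not confirm.

```python
def normalize_family(name: str) -> str:
    """Strip surrounding quotes, collapse whitespace, and lowercase a family name."""
    stripped = name.strip().strip("'\"")
    return " ".join(stripped.split()).lower()


def resolve_family(declared: str, registered: list[str]) -> str:
    """Pick the registered family that a CSS font-family value refers to.

    Assumption: `declared` may be a comma-separated list and the first match wins.
    Unknown families fall back to the first registered font, as stated above.
    `registered` must not be empty.
    """
    lookup = {normalize_family(name): name for name in registered}
    for candidate in declared.split(","):
        match = lookup.get(normalize_family(candidate))
        if match is not None:
            return match
    return registered[0]


registered = ["DejaVu Sans", "Noto Sans CJK"]
print(resolve_family("dejavu sans", registered))      # unquoted, lowercase
print(resolve_family("'Noto Sans CJK'", registered))  # quoted
print(resolve_family("Comic Sans", registered))       # unknown, falls back
```

A model like this can help when deciding the order in which to register fonts, since the first entry doubles as the fallback.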

Supported CSS surface

Selectors: subset of L3 + L4: :is / :where / :not / :has, structural pseudo-classes, attribute matchers with i / s flags.
Cascade: ordering by origin / specificity / source order, inheritance, inline style merging, custom properties (var() with cycle detection).
Functions: calc(), min(), max(), clamp().
At-rules: @media print (always true), (min/max-width), @page :first / :left / :right / :blank with margin boxes, @font-face, @import, @supports.
Typed values: color (~150 named, hex, rgb/rgba, hsl), length (all CSS Values L4 units), display, font-size / weight / style / family, margin / padding shorthands, line-height.
Counters: counter / counters, counter-reset / -increment / -set, roman / greek / alphabetic numbering.
Pseudo-elements: ::before / ::after with literal strings, attr(name), open-quote / close-quote.
Layout: block, flex, grid (all via Taffy), margin collapsing, multi-column (column-count / column-width / column-gap), tables (auto and fixed column algorithms).
Inline: line breaking per UAX #14, text-align, white-space modes, forced breaks, atomic inline boxes.
Effects: opacity, transform: translate*(), page-break-before: always, page-break-after: always.
HTML: HTML5 tokenizer, extraction of <style> / <link rel="stylesheet"> / inline style="", decoding of <img> data URIs (/XObject), <a href> → /Link annotation with /URI, <ul> / <ol> list markers.
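To make the list above more concrete, here is a small print stylesheet that tries to stay within the features it names (custom properties, calc(), counters, ::before, @page :first, page-break-before, multi-column). It is a sketch under assumptions: PRINT_CSS, REPORT_HTML, and render_report are hypothetical names, and whether every declaration renders exactly as written (for example margin inside @page, or counter() inside content) has not been verified against the engine.

```python
# Stylesheet restricted to features named in the list above (unverified sketch).
PRINT_CSS = """
@page { margin: 54pt; }
@page :first { margin-top: 90pt; }

body { --accent: navy; --indent: 12pt; counter-reset: section; line-height: 1.4; }
h1 { color: var(--accent); font-size: 24pt; }
h2 { counter-increment: section; page-break-before: always; }
h2::before { content: counter(section) ". "; }
p { margin-left: calc(var(--indent) * 2); }
.summary { column-count: 2; column-gap: 18pt; }
"""

REPORT_HTML = (
    "<h1>Quarterly Report</h1>"
    "<h2>Overview</h2><p class='summary'>Two-column summary text.</p>"
    "<h2>Details</h2><p>Indented body text.</p>"
)


def render_report(font_path: str, out_path: str) -> None:
    """Render REPORT_HTML with PRINT_CSS via Pdf.from_html_css, as shown earlier."""
    # Imported lazily so the constants can be inspected without the package installed.
    from pdf_oxide import Pdf

    with open(font_path, "rb") as f:
        font = f.read()
    pdf = Pdf.from_html_css(REPORT_HTML, PRINT_CSS, font)
    pdf.save(out_path)
```

Calling render_report("DejaVuSans.ttf", "report.pdf") would then produce the file, assuming the font is present in the working directory.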

Out of scope

CSS filters, 3D transforms, animations, SVG in HTML (every viable Rust SVG crate is MPL), MathML, hyphens: auto, shape-outside, JavaScript execution, full-matrix transform (scale / rotate), gradients, box-shadow.

License

cargo deny check licenses passes with zero transitive MPL dependencies. Mozilla's CSS stack (cssparser, selectors, html5ever, lightningcss, stylo) is entirely MPL-2.0; v0.3.37 reimplements the equivalents by hand to keep pdf_oxide entirely under MIT/Apache.

Supported HTML elements

Element	Description
`<h1>` to `<h6>`	Headings (mapped to PDF heading sizes)
`<p>`	Paragraphs with automatic spacing
`<b>`, `<strong>`	Bold text
`<i>`, `<em>`	Italic text
`<ul>`, `<ol>`, `<li>`	Unordered and ordered lists
`<pre>`, `<code>`	Preformatted text and inline code
`<blockquote>`	Block quotes
`<br>`	Line breaks
`<hr>`	Horizontal rules

Full API reference

`Pdf::from_html(content)` (Static method)

Creates a PDF from HTML content using the default settings (Letter page, 72pt margins, 12pt Helvetica).

Rust:

use pdf_oxide::api::Pdf;

let html = r#"
<h1>Product Specification</h1>
<p>This document describes the <strong>technical requirements</strong>
for the new product line.</p>
<h2>Requirements</h2>
<ul>
    <li>Operating temperature: -20C to 60C</li>
    <li>Power consumption: &lt;5W</li>
    <li>Weight: &lt;200g</li>
</ul>
"#;

let pdf = Pdf::from_html(html)?;
pdf.save("spec.pdf")?;

JavaScript:

import { WasmPdf } from "pdf-oxide-wasm";
import { writeFileSync } from "fs";

const html = `
<h1>Product Specification</h1>
<p>This document describes the <strong>technical requirements</strong>
for the new product line.</p>
`;

const pdf = WasmPdf.fromHtml(html);
writeFileSync("spec.pdf", pdf.toBytes());

Python:

from pdf_oxide import Pdf

html = """
<h1>Product Specification</h1>
<p>This document describes the <strong>technical requirements</strong>
for the new product line.</p>
"""

pdf = Pdf.from_html(html)
pdf.save("spec.pdf")

Java:

import fyi.oxide.pdf.Pdf;
import java.nio.file.Path;

String html = "<h1>Product Specification</h1>"
            + "<p>This document describes the <strong>technical requirements</strong>.</p>";

try (Pdf pdf = Pdf.fromHtml(html)) {
    pdf.saveTo(Path.of("spec.pdf"));
}

PHP:

use PdfOxide\Pdf;

$html = '<h1>Product Specification</h1>'
      . '<p>This document describes the <strong>technical requirements</strong>.</p>';

$pdf = Pdf::fromHtml($html);
file_put_contents('spec.pdf', $pdf->save());

Ruby:

require 'pdf_oxide'

html = '<h1>Product Specification</h1>' \
       '<p>This document describes the <strong>technical requirements</strong>.</p>'

PdfOxide::Pdf.from_html(html) { |pdf| pdf.save('spec.pdf') }

C++:

#include <pdf_oxide/pdf_oxide.hpp>

std::string html =
    "<h1>Product Specification</h1>"
    "<p>This document describes the <strong>technical requirements</strong>.</p>";

auto pdf = pdf_oxide::Pdf::from_html(html);
pdf.save("spec.pdf");

Swift:

import PdfOxide

let html = """
<h1>Product Specification</h1>
<p>This document describes the <strong>technical requirements</strong>.</p>
"""

let pdf = try Pdf.fromHtml(html)
try pdf.save("spec.pdf")

Kotlin:

import fyi.oxide.pdf.Pdf

val html = """
    <h1>Product Specification</h1>
    <p>This document describes the <strong>technical requirements</strong>.</p>
""".trimIndent()

Pdf.fromHtml(html).use { it.saveTo(java.nio.file.Path.of("spec.pdf")) }

Dart:

import 'package:pdf_oxide/pdf_oxide.dart';

final html = '<h1>Product Specification</h1>'
    '<p>This document describes the <strong>technical requirements</strong>.</p>';

final pdf = Pdf.fromHtml(html);
pdf.save('spec.pdf');

R:

library(pdfoxide)

html <- paste0(
    "<h1>Product Specification</h1>",
    "<p>This document describes the <strong>technical requirements</strong>.</p>")

pdf <- pdf_from_html(html)
pdf_save(pdf, "spec.pdf")

Julia:

using PdfOxide

html = """
<h1>Product Specification</h1>
<p>This document describes the <strong>technical requirements</strong>.</p>
"""

pdf = from_html(html)
save(pdf, "spec.pdf")

Zig:

const pdf_oxide = @import("pdf_oxide");

const html =
    "<h1>Product Specification</h1>" ++
    "<p>This document describes the <strong>technical requirements</strong>.</p>";

var pdf = try pdf_oxide.Pdf.fromHtml(html);
try pdf.save("spec.pdf");

Scala:

import fyi.oxide.pdf.Pdf
import scala.util.Using

val html =
  "<h1>Product Specification</h1>" +
  "<p>This document describes the <strong>technical requirements</strong>.</p>"

Using.resource(Pdf.fromHtml(html))(_.saveTo(java.nio.file.Path.of("spec.pdf")))

Clojure:

(require '[pdf-oxide.core :as pdf])

(let [html (str "<h1>Product Specification</h1>"
                "<p>This document describes the <strong>technical requirements</strong>.</p>")
      p    (pdf/from-html html)]
  (.saveTo p (java.nio.file.Path/of "spec.pdf" (into-array String []))))

Objective-C:

#import "POXPdfOxide.h"
NSError *err = nil;

NSString *html = @"<h1>Product Specification</h1>"
                  "<p>This document describes the <strong>technical requirements</strong>.</p>";

POXPdf *pdf = [POXPdf fromHtml:html error:&err];
[pdf saveToPath:@"spec.pdf" error:&err];

Elixir:

html =
  "<h1>Product Specification</h1>" <>
  "<p>This document describes the <strong>technical requirements</strong>.</p>"

{:ok, pdf} = PdfOxide.from_html(html)
PdfOxide.save(pdf, "spec.pdf")

Python signature:

Pdf.from_html(
    content: str,
    title: str | None = None,
    author: str | None = None
) -> Pdf

`PdfBuilder::new().from_html(content)` (Builder pattern)

Use PdfBuilder to control page size, margins, font size, and document metadata.

Rust:

use pdf_oxide::api::PdfBuilder;
use pdf_oxide::writer::PageSize;

let pdf = PdfBuilder::new()
    .title("Technical Specification")
    .author("Engineering")
    .page_size(PageSize::A4)
    .margin(54.0)
    .font_size(11.0)
    .from_html("<h1>Spec</h1><p>Version 2.0</p>")?;

pdf.save("spec_a4.pdf")?;
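Margins and page sizes are given in PDF points (1 pt = 1/72 inch), so it can help to work out how much room is left for content. The sketch below does that arithmetic in plain Python. It assumes the single margin value applies equally to all four sides and uses the standard Letter and rounded ISO A4 dimensions; PAGE_SIZES_PT, content_box, and mm_to_pt are hypothetical helpers, not part of pdf_oxide.

```python
# Page sizes in PDF points (1 pt = 1/72 inch).
PAGE_SIZES_PT = {
    "Letter": (612.0, 792.0),  # 8.5 x 11 in
    "A4": (595.28, 841.89),    # 210 x 297 mm, rounded to two decimals
}


def mm_to_pt(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm * 72.0 / 25.4


def content_box(page: str, margin_pt: float) -> tuple[float, float]:
    """Width and height left for content, assuming one uniform margin on every side."""
    width, height = PAGE_SIZES_PT[page]
    return (width - 2 * margin_pt, height - 2 * margin_pt)


# Defaults quoted for Pdf::from_html: Letter page with 72pt margins.
print(content_box("Letter", 72.0))
# Builder example above: A4 page with a 54pt (0.75 in) margin.
print(content_box("A4", 54.0))
```

Under those assumptions the default Letter layout leaves a 468 x 648 pt text area, while the A4 builder configuration leaves roughly 487 x 734 pt.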

Advanced examples

Structured report

use pdf_oxide::api::Pdf;

let html = r#"
<h1>Incident Report</h1>
<h2>Summary</h2>
<p>On <em>2025-11-15</em>, a service disruption was detected in the
<strong>payment processing</strong> pipeline.</p>

<h2>Timeline</h2>
<ol>
    <li>14:32 UTC - Alert triggered for elevated error rates</li>
    <li>14:35 UTC - On-call engineer acknowledged</li>
    <li>14:48 UTC - Root cause identified: database connection pool exhaustion</li>
    <li>15:02 UTC - Fix deployed, services recovering</li>
    <li>15:15 UTC - Full recovery confirmed</li>
</ol>

<h2>Root Cause</h2>
<p>A configuration change deployed at 14:00 UTC reduced the maximum
connection pool size from 100 to 10.</p>

<h2>Code Reference</h2>
<pre><code>max_connections: 10  # Should be 100
timeout_seconds: 30
</code></pre>

<h2>Action Items</h2>
<ul>
    <li>Add validation for connection pool configuration</li>
    <li>Implement canary deployment for config changes</li>
    <li>Add alerting for connection pool utilization</li>
</ul>
"#;

let pdf = Pdf::from_html(html)?;
pdf.save("incident_report.pdf")?;

Python with dynamic HTML

from pdf_oxide import Pdf

rows = [
    ("Widget A", "$12.99", 150),
    ("Widget B", "$24.50", 89),
    ("Widget C", "$7.25", 312),
]

html = "<h1>Inventory Report</h1>"
html += "<p>Generated on 2025-11-20</p>"
html += "<h2>Current Stock</h2><ul>"
for name, price, qty in rows:
    html += f"<li><strong>{name}</strong> - {price} ({qty} units)</li>"
html += "</ul>"

pdf = Pdf.from_html(html, title="Inventory Report")
pdf.save("inventory.pdf")
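When the interpolated values come from user input or a database, characters such as <, >, and & would otherwise be read as markup. A minimal sketch of the same report with every dynamic value escaped through the standard library's html.escape follows; inventory_html is a hypothetical helper name, and the sample row is made up for illustration.

```python
from html import escape


def inventory_html(rows):
    """Build the inventory report markup, escaping every dynamic value."""
    items = "".join(
        f"<li><strong>{escape(name)}</strong> - {escape(price)} ({int(qty)} units)</li>"
        for name, price, qty in rows
    )
    return (
        "<h1>Inventory Report</h1>"
        "<h2>Current Stock</h2>"
        f"<ul>{items}</ul>"
    )


# A product name that contains markup-significant characters.
html = inventory_html([("Nuts & Bolts <M6>", "$3.10", 40)])
print(html)
```

The resulting string can be passed to Pdf.from_html(html, title="Inventory Report") exactly as in the example above.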

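Reading HTML from a file

Python:

from pdf_oxide import Pdf

# Read as UTF-8 explicitly so the result does not depend on the platform default.
with open("report.html", encoding="utf-8") as f: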
    html = f.read()

pdf = Pdf.from_html(html, title="Report")
pdf.save("report.pdf")

JavaScript:

import { WasmPdf } from "pdf-oxide-wasm";
import { readFileSync, writeFileSync } from "fs";

const html = readFileSync("report.html", "utf-8");
const pdf = WasmPdf.fromHtml(html);
writeFileSync("report.pdf", pdf.toBytes());

Rust:

use pdf_oxide::api::Pdf;

let html = std::fs::read_to_string("report.html")?;
let pdf = Pdf::from_html(&html)?;
pdf.save("report.pdf")?;

Related pages

Create from Markdown: convert Markdown to PDF
PdfBuilder fluent API: all builder configuration options
DocumentBuilder low-level API: programmatic page construction
