What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

C# / .NET PDF Library: PDF Oxide

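PDF Oxide is the fastest PDF library for .NET: text extraction averages 0.8 ms, which is 5.8× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. A single package covers extraction, generation, and editing. It supports NativeAOT, is trim-safe, and offers idiomatic using, async Task<T>, CancellationToken, and LINQ-friendly collection APIs. Dual-licensed under MIT / Apache-2.0.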

Installation

dotnet add package PdfOxide

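Target frameworks: net8.0 and net10.0. IsAotCompatible=true and IsTrimmable=true are enabled by default.

The NuGet package bundles prebuilt native libraries for Windows, macOS (Intel and Apple Silicon), and Linux (x64 and ARM64). There are no system dependencies, and no Rust toolchain is required.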

Opening a PDF

using PdfOxide.Core;

using var doc = PdfDocument.Open("research-paper.pdf");
Console.WriteLine($"Pages: {doc.PageCount}");
Console.WriteLine($"PDF version: {doc.Version.Major}.{doc.Version.Minor}");

Open from a stream:

using var stream = File.OpenRead("report.pdf");
using var doc = PdfDocument.Open(stream);

With a password:

using var doc = PdfDocument.OpenWithPassword("secure.pdf", "user-password");

PDFs encrypted with AES-256 (V=5, R=6) are fully supported.

Page API

Since v0.3.34, PdfDocument exposes a Pages property (IReadOnlyList<PdfPage>) and an int indexer, so you can iterate with foreach or use LINQ directly.

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");

foreach (var page in doc.Pages)
{
    Console.WriteLine($"--- Page {page.Index + 1} ---");
    Console.WriteLine(page.ExtractText());
}

// Access a page directly by index
PdfPage first = doc[0];
string md = await first.ToMarkdownAsync();

Every PdfPage offers the full synchronous and asynchronous API: ExtractText() / ExtractTextAsync(), ToMarkdown(), ToHtml(), ToPlainText(), ExtractWords(), ExtractTextLines(), ExtractTables(), ExtractChars(), ExtractImages(), and Search().

Text extraction

Single page

using var doc = PdfDocument.Open("report.pdf");

string text = doc.ExtractText(0);
Console.WriteLine(text);

All pages

string allText = doc.ExtractAllText();

Manual iteration

for (int i = 0; i < doc.PageCount; i++)
{
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(doc.ExtractText(i));
}

Async extraction

Every extraction method has an *Async variant that returns Task<T> and optionally accepts a CancellationToken.

using PdfOxide.Core;

using var doc = PdfDocument.Open("large.pdf");

string text = await doc.ExtractTextAsync(0);

// Fan out across pages, with cancellation
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => doc.ExtractTextAsync(i, cts.Token));
string[] pages = await Task.WhenAll(tasks);

See the async guide for the complete patterns.
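The fan-out above starts one task per page at once. For very large documents it can help to cap how many extractions are in flight. The following is a minimal sketch of that idea using only the .NET base class library; BoundedFanOut and its extractPage delegate are hypothetical names and not part of PdfOxide. Whether concurrent calls on a single PdfDocument are safe is a question for the library's concurrency guide, not something this sketch settles.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of PdfOxide): runs a per-page async
// operation for every page while limiting how many run at the same time.
public static class BoundedFanOut
{
    public static async Task<string[]> ExtractAllAsync(
        int pageCount,
        Func<int, CancellationToken, Task<string>> extractPage,
        int maxConcurrency,
        CancellationToken ct = default)
    {
        if (maxConcurrency < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrency));

        // The semaphore admits at most maxConcurrency callers at once.
        using var gate = new SemaphoreSlim(maxConcurrency);

        var tasks = Enumerable.Range(0, pageCount).Select(async i =>
        {
            await gate.WaitAsync(ct);
            try
            {
                return await extractPage(i, ct);
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();

        // Task.WhenAll returns results in the order of the input tasks,
        // so index i of the result corresponds to page i.
        return await Task.WhenAll(tasks);
    }
}
```

Assuming the ExtractTextAsync(int, CancellationToken) overload shown above, a call might look like: string[] pages = await BoundedFanOut.ExtractAllAsync(doc.PageCount, (i, token) => doc.ExtractTextAsync(i, token), maxConcurrency: 4, cts.Token);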

Structured extraction

var words = doc.ExtractWords(0);
foreach (var (text, x, y, w, h) in words)
{
    Console.WriteLine($"\"{text}\" at ({x:F1}, {y:F1})");
}

// By region
string regionText = doc.ExtractTextInRect(0, x: 50, y: 700, width: 200, height: 50);

var tables = doc.ExtractTables(0);
foreach (var (rows, cols) in tables)
{
    Console.WriteLine($"{rows}x{cols} table");
}
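Word boxes such as the (text, x, y, w, h) values above are often regrouped into lines before further processing. The sketch below shows one coarse way to do that by bucketing the y coordinate. It is illustrative only: WordLayout.ToLines is a hypothetical helper, the double coordinate type is an assumption, and so is the bottom-left origin (larger y means higher on the page). Words whose y values straddle a bucket boundary can land on different lines, so treat it as a starting point.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PdfOxide): regroups word boxes into lines.
public static class WordLayout
{
    // Buckets words by y coordinate, then orders each bucket left to right.
    // Assumptions: coordinates are doubles in PDF points, and larger y is
    // higher on the page, so lines are emitted in descending y order.
    public static List<string> ToLines(
        IEnumerable<(string Text, double X, double Y, double W, double H)> words,
        double tolerance = 2.0)
    {
        return words
            .GroupBy(w => Math.Round(w.Y / tolerance))
            .OrderByDescending(g => g.Key)
            .Select(g => string.Join(" ", g.OrderBy(w => w.X).Select(w => w.Text)))
            .ToList();
    }
}
```

If the coordinates returned by ExtractWords use a different numeric type, convert them to double before calling the helper.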

Converting to Markdown

string markdown = doc.ToMarkdown(0);
string allMarkdown = doc.ToMarkdownAll();

Converting to HTML

string html = doc.ToHtml(0);
string allHtml = doc.ToHtmlAll();

Image extraction

using PdfOxide.Core;

using var doc = PdfDocument.Open("brochure.pdf");
var images = doc.ExtractImages(0);

int index = 0;
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height} {img.Format} ({img.Colorspace}, {img.BitsPerComponent} bpc, {img.Data.Length} bytes)");
    // Use a running counter so each image gets a unique file name
    File.WriteAllBytes($"image_{index++}.{img.Format}", img.Data);
}

Indexed-color images are automatically expanded to RGB (base color space RGB, grayscale, or CMYK; bit depths of 1/2/4/8 bpc).

Search

var results = doc.SearchAll("quarterly revenue");
foreach (var (page, text, x, y, w, h) in results)
{
    Console.WriteLine($"Page {page}: \"{text}\" at ({x}, {y})");
}

// Single page, case-sensitive
var pageResults = doc.SearchPage(0, "exact phrase", caseSensitive: true);

The results work naturally with LINQ:

var hitsByPage = doc.SearchAll("keyword")
    .GroupBy(r => r.Page)
    .OrderBy(g => g.Key);

foreach (var group in hitsByPage)
{
    Console.WriteLine($"Page {group.Key}: {group.Count()} hits");
}

Generating PDFs

using PdfOxide.Core;

// From Markdown
using (var pdf = Pdf.FromMarkdown("# Invoice\n\nTotal: **$42.00**"))
{
    pdf.Save("invoice.pdf");
}

// From HTML
using (var pdf = Pdf.FromHtml("<h1>Report</h1><p>Generated on 2026-04-09</p>"))
{
    pdf.Save("report.pdf");
}

// From plain text
using (var pdf = Pdf.FromText("Plain text document.\n\nSecond paragraph."))
{
    pdf.Save("notes.pdf");
}

// From an image
using (var pdf = Pdf.FromImage("scan.jpg"))
{
    pdf.Save("scan.pdf");
}

Editing: metadata and forms

using PdfOxide.Core;

using var editor = DocumentEditor.Open("form.pdf");

// Read metadata
Console.WriteLine($"Title: {editor.Title}");
Console.WriteLine($"Pages: {editor.PageCount}");

// Update metadata
editor.Title = "Quarterly Report";
editor.Author = "Finance Team";
editor.Subject = "Q1 2026 Results";

// Fill in and flatten form fields
editor.SetFormFieldValue("employee.name", "Jane Doe");
editor.SetFormFieldValue("employee.email", "jane@example.com");
editor.FlattenForms();

editor.Save("edited.pdf");
// Or: await editor.SaveAsync("edited.pdf");

To only read form fields:

using var doc = PdfDocument.Open("form.pdf");
foreach (var f in doc.GetFormFields())
{
    Console.WriteLine($"{f.Name} ({f.FieldType}) = \"{f.Value}\"");
}

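Note: The current .NET binding covers opening, reading, converting, and generating documents, image extraction, reading, filling, and flattening form fields, and metadata editing. Page operations, annotations, rendering, and signing are available in the Rust core and in the other bindings; equivalent .NET APIs will be added in later releases.

NativeAOT publishing

The PDF Oxide .NET binding can be published directly with NativeAOT: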

dotnet publish -c Release -r linux-x64 --self-contained -p:PublishAot=true

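All 881 P/Invoke declarations use LibraryImport (source-generated P/Invoke), and the package sets IsAotCompatible=true and IsTrimmable=true. The AOT-compiled binary links only the parts that are actually used, while the Rust native core is statically linked through the platform-specific libraries shipped in the package.

Plugins and extensions

The PdfOxide.Plugins package (released alongside PdfOxide) provides extension points for processors that transform extracted content: classifiers, post-processors, and validators. See the plugin guide for how to write extensions.

Error handling

All methods throw typed exceptions on failure: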

using PdfOxide.Core;

try
{
    using var doc = PdfDocument.Open("document.pdf");
    string text = doc.ExtractText(0);
}
catch (PdfOxideException ex)
{
    Console.Error.WriteLine($"PDF Oxide error: {ex.Message}");
}
catch (FileNotFoundException)
{
    Console.Error.WriteLine("File not found");
}

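Next steps

Python quick start: use PDF Oxide from Python
C# API reference: complete API documentation
Async guide: Task<T> + CancellationToken patterns
Concurrency guide: sharing patterns with ReaderWriterLockSlim
Text extraction: more detailed extraction options
NuGet package: release notes and download statistics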