What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (C# / .NET)

PDF Oxide is the fastest .NET PDF library: 0.8ms mean text extraction, 5.8× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 PDFs. One package covers extracting, creating, and editing PDFs. It is NativeAOT-ready and trim-safe, with idiomatic using disposal, async Task<T> methods, CancellationToken support, and LINQ-friendly collections. MIT / Apache-2.0 licensed.

Installation

dotnet add package PdfOxide

Target frameworks: net8.0 and net10.0. IsAotCompatible=true and IsTrimmable=true are enabled.

Pre-built native libraries ship in the NuGet package for Windows, macOS (Intel + Apple Silicon), and Linux (x64 + ARM64). No system dependencies, no Rust toolchain required.

Opening a PDF

using PdfOxide.Core;

using var doc = PdfDocument.Open("research-paper.pdf");
Console.WriteLine($"Pages: {doc.PageCount}");
Console.WriteLine($"PDF version: {doc.Version.Major}.{doc.Version.Minor}");

From a stream:

using var stream = File.OpenRead("report.pdf");
using var doc = PdfDocument.Open(stream);

With a password:

using var doc = PdfDocument.OpenWithPassword("secure.pdf", "user-password");

AES-256 (V=5, R=6) PDFs are fully supported.

Page API

Since v0.3.34, PdfDocument exposes Pages (an IReadOnlyList<PdfPage>) and an int indexer, so you can iterate with foreach and use LINQ.

using PdfOxide.Core;

using var doc = PdfDocument.Open("paper.pdf");

foreach (var page in doc.Pages)
{
    Console.WriteLine($"--- Page {page.Index + 1} ---");
    Console.WriteLine(page.ExtractText());
}

// Or index directly
PdfPage first = doc[0];
string md = await first.ToMarkdownAsync();

Each PdfPage exposes a full synchronous and asynchronous surface: ExtractText() / ExtractTextAsync(), ToMarkdown(), ToHtml(), ToPlainText(), ExtractWords(), ExtractTextLines(), ExtractTables(), ExtractChars(), ExtractImages(), Search().

Text Extraction

Single Page

using var doc = PdfDocument.Open("report.pdf");

string text = doc.ExtractText(0);
Console.WriteLine(text);

All Pages

string allText = doc.ExtractAllText();

Walk Pages Manually

for (int i = 0; i < doc.PageCount; i++)
{
    Console.WriteLine($"--- Page {i + 1} ---");
    Console.WriteLine(doc.ExtractText(i));
}

Async Extraction

Every extraction method has an *Async counterpart returning Task<T> and accepting an optional CancellationToken.

using PdfOxide.Core;

using var doc = PdfDocument.Open("large.pdf");

string text = await doc.ExtractTextAsync(0);

// Fan-out with cancellation
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => doc.ExtractTextAsync(i, cts.Token));
string[] pages = await Task.WhenAll(tasks);

See the async guide for complete patterns.
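The fan-out above starts every page at once. The following is a minimal sketch of one way to cap how many extractions run concurrently. PageFanOut and ExtractAllAsync are hypothetical names, not part of PDF Oxide; the page extractor is passed in as a delegate, so the helper uses only standard .NET types and assumes nothing about the library beyond the ExtractTextAsync(int, CancellationToken) shape shown above.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of PDF Oxide): bounded fan-out over page indices.
public static class PageFanOut
{
    public static async Task<string[]> ExtractAllAsync(
        int pageCount,
        Func<int, CancellationToken, Task<string>> extractPage,
        int maxParallel,
        CancellationToken ct = default)
    {
        if (maxParallel < 1)
            throw new ArgumentOutOfRangeException(nameof(maxParallel));

        // The semaphore lets at most maxParallel extractions run at a time.
        using var gate = new SemaphoreSlim(maxParallel);

        async Task<string> RunOne(int index)
        {
            // Waiting on the gate also observes cancellation.
            await gate.WaitAsync(ct);
            try
            {
                return await extractPage(index, ct);
            }
            finally
            {
                gate.Release();
            }
        }

        // Task.WhenAll returns results in page order, whatever the completion order.
        return await Task.WhenAll(Enumerable.Range(0, pageCount).Select(RunOne));
    }
}
```

With the document and token from the snippet above, a call might look like PageFanOut.ExtractAllAsync(doc.PageCount, (i, ct) => doc.ExtractTextAsync(i, ct), 4, cts.Token). If the token fires, the awaited call throws OperationCanceledException. Whether concurrent calls on a single PdfDocument are safe is a question for the concurrency guide; this sketch does not settle it.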

Structured Extraction

var words = doc.ExtractWords(0);
foreach (var (text, x, y, w, h) in words)
{
    Console.WriteLine($"\"{text}\" at ({x:F1}, {y:F1})");
}

// Region-based
string regionText = doc.ExtractTextInRect(0, x: 50, y: 700, width: 200, height: 50);

var tables = doc.ExtractTables(0);
foreach (var (rows, cols) in tables)
{
    Console.WriteLine($"{rows}x{cols} table");
}

Markdown Conversion

string markdown = doc.ToMarkdown(0);
string allMarkdown = doc.ToMarkdownAll();

HTML Conversion

string html = doc.ToHtml(0);
string allHtml = doc.ToHtmlAll();

Image Extraction

using PdfOxide.Core;

using var doc = PdfDocument.Open("brochure.pdf");
var images = doc.ExtractImages(0);

int index = 0;
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height} {img.Format} ({img.Colorspace}, {img.BitsPerComponent} bpc, {img.Data.Length} bytes)");
    // Number the files with a running counter instead of searching the collection for each image.
    File.WriteAllBytes($"image_{index++}.{img.Format}", img.Data);
}

Indexed-color images are automatically expanded to RGB. Palettes at 1, 2, 4, or 8 bits per component with RGB, Grayscale, or CMYK base color spaces are supported.
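To illustrate what palette expansion involves, here is a minimal sketch of the general technique. It is not PDF Oxide's implementation. IndexedColor.ExpandToRgb is a hypothetical helper that assumes an RGB base palette (three bytes per entry) and the usual PDF sample layout: indices packed most significant bit first, with each row padded to a whole byte. Grayscale and CMYK base palettes would need an extra conversion step that is not shown.

```csharp
using System;

// Hypothetical helper (not part of PDF Oxide): expands packed palette indices to 8-bit RGB.
public static class IndexedColor
{
    public static byte[] ExpandToRgb(
        byte[] packed, int width, int height, int bitsPerComponent, byte[] rgbPalette)
    {
        if (bitsPerComponent is not (1 or 2 or 4 or 8))
            throw new ArgumentOutOfRangeException(nameof(bitsPerComponent));

        // Each row is padded to a whole number of bytes.
        int rowBytes = (width * bitsPerComponent + 7) / 8;
        if (packed.Length < (long)rowBytes * height)
            throw new ArgumentException("Sample data is shorter than width x height requires.", nameof(packed));

        int mask = (1 << bitsPerComponent) - 1;
        var rgb = new byte[width * height * 3];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                // Locate the sample inside its byte; samples are stored MSB first.
                int bitOffset = x * bitsPerComponent;
                int source = packed[y * rowBytes + bitOffset / 8];
                int shift = 8 - bitsPerComponent - (bitOffset % 8);
                int paletteIndex = (source >> shift) & mask;

                int src = paletteIndex * 3;
                if (src + 2 >= rgbPalette.Length)
                    throw new ArgumentException("Sample refers to an entry outside the palette.", nameof(rgbPalette));

                // Copy the palette entry's R, G, B bytes into the output pixel.
                int dst = (y * width + x) * 3;
                rgb[dst] = rgbPalette[src];
                rgb[dst + 1] = rgbPalette[src + 1];
                rgb[dst + 2] = rgbPalette[src + 2];
            }
        }

        return rgb;
    }
}
```

For example, a 3x1 image at 1 bpc with the single data byte 0b1010_0000 and a two-entry black/white palette expands to white, black, white.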

Search

var results = doc.SearchAll("quarterly revenue");
foreach (var (page, text, x, y, w, h) in results)
{
    Console.WriteLine($"Page {page}: \"{text}\" at ({x}, {y})");
}

// Case-sensitive single-page
var pageResults = doc.SearchPage(0, "exact phrase", caseSensitive: true);

LINQ integrates naturally:

var hitsByPage = doc.SearchAll("keyword")
    .GroupBy(r => r.Page)
    .OrderBy(g => g.Key);

foreach (var group in hitsByPage)
{
    Console.WriteLine($"Page {group.Key}: {group.Count()} hits");
}

PDF Creation

using PdfOxide.Core;

// From Markdown
using (var pdf = Pdf.FromMarkdown("# Invoice\n\nTotal: **$42.00**"))
{
    pdf.Save("invoice.pdf");
}

// From HTML
using (var pdf = Pdf.FromHtml("<h1>Report</h1><p>Generated 2026-04-09</p>"))
{
    pdf.Save("report.pdf");
}

// From plain text
using (var pdf = Pdf.FromText("Plain text document.\n\nSecond paragraph."))
{
    pdf.Save("notes.pdf");
}

// From image
using (var pdf = Pdf.FromImage("scan.jpg"))
{
    pdf.Save("scan.pdf");
}

Editing — Metadata and Forms

using PdfOxide.Core;

using var editor = DocumentEditor.Open("form.pdf");

// Read metadata
Console.WriteLine($"Title: {editor.Title}");
Console.WriteLine($"Pages: {editor.PageCount}");

// Update metadata
editor.Title = "Quarterly Report";
editor.Author = "Finance Team";
editor.Subject = "Q1 2026 Results";

// Fill and flatten form fields
editor.SetFormFieldValue("employee.name", "Jane Doe");
editor.SetFormFieldValue("employee.email", "jane@example.com");
editor.FlattenForms();

editor.Save("edited.pdf");
// or: await editor.SaveAsync("edited.pdf");

Reading form fields without editing:

using var doc = PdfDocument.Open("form.pdf");
foreach (var f in doc.GetFormFields())
{
    Console.WriteLine($"{f.Name} ({f.FieldType}) = \"{f.Value}\"");
}

Note: The .NET binding currently exposes document open / read / convert / create, image extraction, form field read/fill/flatten, and metadata editing. Page operations, annotations, rendering, and signatures are available through the Rust core and other bindings; an equivalent .NET surface will be added in a future release.

NativeAOT Publishing

PDF Oxide’s .NET binding is NativeAOT-publish-ready:

dotnet publish -c Release -r linux-x64 --self-contained -p:PublishAot=true

All 881 P/Invoke declarations use LibraryImport (source-generated P/Invoke), and the package sets IsAotCompatible=true and IsTrimmable=true. Your AOT-compiled binary links only the managed code it uses, and the native Rust core is statically linked into the bundled platform-specific library.

Plugins and Extensions

The PdfOxide.Plugins package (shipped alongside PdfOxide) exposes extension points for processors that transform extracted content — classifiers, post-processors, validators. See the plugin guide for extension authoring.

Error Handling

All methods throw typed exceptions on failure:

using PdfOxide.Core;

try
{
    using var doc = PdfDocument.Open("document.pdf");
    string text = doc.ExtractText(0);
}
catch (PdfOxideException ex)
{
    Console.Error.WriteLine($"PDF Oxide error: {ex.Message}");
}
catch (FileNotFoundException)
{
    Console.Error.WriteLine("File not found");
}

Next Steps

Python Getting Started — using PDF Oxide from Python
C# API Reference — full API documentation
Async Guide — Task<T> + CancellationToken patterns
Concurrency Guide — ReaderWriterLockSlim sharing patterns
Text Extraction — detailed extraction options
NuGet package — release notes and download stats