What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

C# / .NET API Reference

The PdfOxide NuGet package wraps the Rust core via LibraryImport-generated P/Invoke (all 881 declarations). NativeAOT-publish-ready and trim-safe. Target frameworks: net8.0, net10.0.

dotnet add package PdfOxide

using PdfOxide.Core;

For other languages see Python, Node.js, Go, or Rust.

Namespaces

using PdfOxide.Core;        // PdfDocument, Pdf, DocumentEditor
using PdfOxide.Extensions;  // LINQ-style extensions
using PdfOxide.Plugins;     // extension points

All types implement IDisposable where appropriate — use using blocks or using declarations.

PdfDocument

Read-only access.

Factory methods

static PdfDocument Open(string path)
static PdfDocument Open(Stream stream)
static PdfDocument OpenFromBytes(ReadOnlySpan<byte> data)
static PdfDocument OpenWithPassword(string path, string password)

Properties

int PageCount { get; }
PdfVersion Version { get; }  // struct { Major, Minor }
bool HasStructureTree { get; }
IReadOnlyList<PdfPage> Pages { get; }  // v0.3.34
PdfPage this[int pageIndex] { get; }   // v0.3.34

PdfPage (v0.3.34)

Lightweight per-page handle with full sync + async surface. Dispatches to the parent document.

public sealed class PdfPage
{
    public int Index { get; }

    public string ExtractText();
    public Task<string> ExtractTextAsync(CancellationToken ct = default);
    public string ToMarkdown();
    public Task<string> ToMarkdownAsync(CancellationToken ct = default);
    public string ToHtml();
    public string ToPlainText();

    public (string Text, float X, float Y, float W, float H)[] ExtractWords();
    public IReadOnlyList<TextLine> ExtractTextLines();
    public IReadOnlyList<Table> ExtractTables();
    public IReadOnlyList<Char> ExtractChars();
    public IReadOnlyList<ImageInfo> ExtractImages();
    public IReadOnlyList<SearchResult> Search(string query, bool caseSensitive = false);
}

Text extraction

string ExtractText(int pageIndex)
Task<string> ExtractTextAsync(int pageIndex, CancellationToken ct = default)
string ExtractAllText()
Task<string> ExtractAllTextAsync(CancellationToken ct = default)

string ToMarkdown(int pageIndex)
string ToMarkdownAll()
string ToHtml(int pageIndex)
string ToHtmlAll()
string ToPlainText(int pageIndex)
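
The per-page methods above compose naturally into a page loop. The following is a minimal sketch, not part of the library: `DumpPages` is a hypothetical helper, it takes a delegate in place of `PdfDocument.ExtractText(int pageIndex)` so that it runs without the package or a PDF on disk, and it assumes page indices are zero-based.

```csharp
using System;
using System.Text;

// Hypothetical helper: joins per-page text, with one separator line per page.
// extractPage stands in for PdfDocument.ExtractText(int pageIndex);
// page indices are assumed to be zero-based.
static string DumpPages(int pageCount, Func<int, string> extractPage)
{
    var sb = new StringBuilder();
    for (int i = 0; i < pageCount; i++)
    {
        sb.Append("--- page ").Append(i + 1).Append(" ---\n");
        sb.Append(extractPage(i)).Append('\n');
    }
    return sb.ToString();
}

// Stand-in data so the sketch runs without a PDF.
string[] fakePages = { "first page", "second page" };
string dump = DumpPages(fakePages.Length, i => fakePages[i]);
Console.Write(dump);
```

With the real binding the call would presumably look like `DumpPages(doc.PageCount, doc.ExtractText)`, where `doc` comes from `PdfDocument.Open(path)` inside a using declaration.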

Structured

IReadOnlyList<Word> ExtractWords(int pageIndex)
IReadOnlyList<TextLine> ExtractTextLines(int pageIndex)
IReadOnlyList<Char> ExtractChars(int pageIndex)
IReadOnlyList<Span> ExtractSpans(int pageIndex)
IReadOnlyList<Table> ExtractTables(int pageIndex)
IReadOnlyList<Path> ExtractPaths(int pageIndex)

Region-based

string ExtractTextInRect(int pageIndex, float x, float y, float width, float height)
IReadOnlyList<Word> ExtractWordsInRect(int pageIndex, float x, float y, float width, float height)

Images & resources

IReadOnlyList<ImageInfo> ExtractImages(int pageIndex)
IReadOnlyList<FontInfo> GetFonts(int pageIndex)
IReadOnlyList<AnnotationInfo> GetAnnotations(int pageIndex)
IReadOnlyList<FormField> GetFormFields()
PageInfo GetPageInfo(int pageIndex)

Search

IReadOnlyList<SearchResult> SearchPage(int pageIndex, string query, bool caseSensitive = false)
IReadOnlyList<SearchResult> SearchAll(string query, bool caseSensitive = false)

Pdf — creation

static Pdf FromMarkdown(string markdown)
static Pdf FromHtml(string html)
static Pdf FromText(string text)
static Pdf FromImage(string path)
static Pdf FromImageBytes(ReadOnlySpan<byte> data)

void Save(string path)
Task SaveAsync(string path, CancellationToken ct = default)
byte[] ToBytes()

DocumentEditor

static DocumentEditor Open(string path)
static DocumentEditor OpenFromBytes(ReadOnlySpan<byte> data)

// Metadata — properties are get/set
string? Title { get; set; }
string? Author { get; set; }
string? Subject { get; set; }
string? Keywords { get; set; }
int PageCount { get; }

void ApplyMetadata(Metadata metadata)

// Forms
void SetFormFieldValue(string name, string value)
void FlattenForms()

// Save
void Save(string path)
Task SaveAsync(string path, CancellationToken ct = default)
void SaveEncrypted(string path, string userPassword, string ownerPassword)
byte[] ToBytes()

Coverage note: the .NET binding currently exposes document open / read / convert / create, image extraction, form field read/fill/flatten, and metadata editing. Page operations, annotations, rendering, and signatures are available through the Rust core and other bindings; equivalent .NET surface will be added in a future release. The v0.3.38 additions section below already covers part of this: builder-side annotations, signature verification, and page rendering.

Extensions (LINQ support)

Exported from PdfOxide.Extensions:

IEnumerable<SearchResult> WhereOnPage(this IEnumerable<SearchResult> src, int page)
IEnumerable<IGrouping<int, SearchResult>> GroupByPage(this IEnumerable<SearchResult> src)
IEnumerable<Word> WithinRect(this IEnumerable<Word> src, float x, float y, float w, float h)

Use the existing IReadOnlyList<T> results with LINQ directly:

var hitsByPage = doc.SearchAll("keyword")
    .GroupBy(r => r.Page)
    .OrderBy(g => g.Key);

See extensions guide for the full list.
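
As a slightly fuller illustration of plain LINQ over search results, the sketch below counts hits per page. It uses value tuples with made-up values in place of `SearchResult`, so it runs without the package; only the `Page` and `Text` members mirror the record listed under Data types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Tuples stand in for SearchResult (Page, Text, X, Y); the values are made up.
var hits = new List<(int Page, string Text, float X, float Y)>
{
    (2, "keyword", 72f, 700f),
    (0, "keyword", 72f, 650f),
    (2, "Keyword", 300f, 120f),
};

// Count hits per page, ordered by page index.
var countsByPage = hits
    .GroupBy(h => h.Page)
    .OrderBy(g => g.Key)
    .Select(g => (Page: g.Key, Count: g.Count()))
    .ToList();

foreach (var (page, count) in countsByPage)
    Console.WriteLine($"page {page}: {count} hit(s)");
```

The same query shape should apply unchanged to the `IReadOnlyList<SearchResult>` returned by `SearchAll`.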

Plugins

Exposed under PdfOxide.Plugins — inject classifiers, post-processors, or validators into the extraction pipeline. See plugin guide.

Data types

public readonly record struct PdfVersion(int Major, int Minor);

public readonly record struct Char(
    string Text, float X, float Y,
    float FontSize, string FontName, Rect BBox);

public readonly record struct Span(
    string Text, string FontName, float FontSize, Rect BBox);

public readonly record struct Word(
    string Text, float X, float Y, float Width, float Height);

public readonly record struct TextLine(
    string Text, float Y, IReadOnlyList<Span> Spans);

public readonly record struct SearchResult(
    int Page, string Text, float X, float Y, float Width, float Height);

public readonly record struct ImageInfo(
    int Width, int Height, string Format,
    string Colorspace, int BitsPerComponent, byte[] Data);

public readonly record struct FontInfo(
    string Name, string Type, string Encoding,
    bool IsEmbedded, bool IsSubset, float Size);

public readonly record struct AnnotationInfo(
    string Type, string Subtype, string Content,
    float X, float Y, float Width, float Height,
    string? Author, string? LinkUri);

public readonly record struct FormField(
    string Name, string FieldType, string Value, int PageIndex);

public readonly record struct Rect(
    float X, float Y, float Width, float Height);

public readonly record struct PageInfo(
    float Width, float Height, int Rotation,
    Rect MediaBox, Rect CropBox);

public sealed record Metadata(
    string? Title = null,
    string? Author = null,
    string? Subject = null,
    string? Keywords = null);

Exceptions

public class PdfOxideException : Exception
{
    public int Code { get; }
    public string NativeMessage { get; }
}

Thrown on any Rust-side failure. Catch it at system boundaries; interior code should let it propagate.

Standard .NET exceptions are raised for I/O (FileNotFoundException, UnauthorizedAccessException, etc.) and argument validation (ArgumentOutOfRangeException).
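
One way to apply the boundary advice is a small wrapper that converts a single expected exception type into a result value and lets everything else propagate. This is a sketch under assumptions: `AtBoundary` is a hypothetical helper, and `InvalidOperationException` stands in for `PdfOxideException` so the code runs without the package.

```csharp
using System;

// Hypothetical helper: runs op and converts one expected exception type into
// an (Ok, Value, Error) result. Any other exception type keeps propagating.
static (bool Ok, string? Value, string? Error) AtBoundary<TEx>(Func<string> op)
    where TEx : Exception
{
    try
    {
        return (true, op(), null);
    }
    catch (TEx ex)
    {
        return (false, null, ex.Message);
    }
}

// InvalidOperationException stands in for PdfOxideException here.
var good = AtBoundary<InvalidOperationException>(() => "page text");
var bad = AtBoundary<InvalidOperationException>(
    () => throw new InvalidOperationException("corrupt xref"));

Console.WriteLine($"{good.Ok} {good.Value}");
Console.WriteLine($"{bad.Ok} {bad.Error}");
```

In application code the call would presumably be `AtBoundary<PdfOxideException>(() => doc.ExtractAllText())`, with `Code` and `NativeMessage` logged where the sketch only keeps `Message`.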

Thread safety

PdfDocument read-only methods are thread-safe, so a single document can be shared across threads and read concurrently.
DocumentEditor is not thread-safe for writes. Use ReaderWriterLockSlim or serialize to one thread.
Pdf creation instances are not intended to be shared across threads.

See the concurrency guide for patterns.
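
The ReaderWriterLockSlim option for DocumentEditor can be sketched as follows. A plain string stands in for editor state (for example the `Title` property), so the example runs without the package; the locking pattern is the point, not the names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// A plain string stands in for the editor state (for example DocumentEditor.Title).
var gate = new ReaderWriterLockSlim();
string title = "draft";

string ReadTitle()
{
    gate.EnterReadLock();          // many readers may hold this at once
    try { return title; }
    finally { gate.ExitReadLock(); }
}

void WriteTitle(string value)
{
    gate.EnterWriteLock();         // exclusive: blocks readers and other writers
    try { title = value; }
    finally { gate.ExitWriteLock(); }
}

// Eight iterations that may run concurrently, each writing once and reading once.
Parallel.For(0, 8, i =>
{
    WriteTitle($"rev {i}");
    _ = ReadTitle();
});

string finalTitle = ReadTitle();
Console.WriteLine(finalTitle);
gate.Dispose();
```

Which revision wins is not deterministic; the lock only guarantees that no read overlaps a write. If ordering matters, serializing all editor calls onto one thread is the simpler choice.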

Async pattern

Every I/O-bound or CPU-heavy method has an *Async variant accepting CancellationToken. See the async guide.
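
A common way to use the CancellationToken parameter is to impose a deadline. The sketch below is hypothetical: `WithTimeout` is not part of the library, the delegate stands in for a call such as `ct => doc.ExtractAllTextAsync(ct)`, and it assumes cancellation surfaces as `OperationCanceledException`, which is the usual .NET convention but is not confirmed here for PdfOxide.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs an async operation with a deadline.
// op stands in for a call such as ct => doc.ExtractAllTextAsync(ct).
static async Task<string?> WithTimeout(
    Func<CancellationToken, Task<string>> op, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        return await op(cts.Token);
    }
    catch (OperationCanceledException)
    {
        return null; // deadline hit before the operation finished
    }
}

// Fake operations so the sketch runs without a PDF.
string? fast = await WithTimeout(
    async ct => { await Task.Delay(10, ct); return "text"; },
    TimeSpan.FromSeconds(5));
string? slow = await WithTimeout(
    async ct => { await Task.Delay(5000, ct); return "text"; },
    TimeSpan.FromMilliseconds(50));

Console.WriteLine(fast ?? "(timed out)");
Console.WriteLine(slow ?? "(timed out)");
```

Returning null on timeout keeps the sketch short; production code may prefer to rethrow or return a richer result type.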

NativeAOT

Publish with -p:PublishAot=true. No extra configuration — all P/Invoke is source-generated, no reflection, no dynamic code.

v0.3.38 additions

DocumentBuilder / PageBuilder / EmbeddedFont

using PdfOxide;

using var font = EmbeddedFont.FromFile("DejaVuSans.ttf");
// Alt: EmbeddedFont.FromBytes(byte[] data, string? name = null)

var bytes = DocumentBuilder.Create()
    .Title("Report").Author("Me")
    .RegisterEmbeddedFont("DejaVu", font)
    .LetterPage()        // or .A4Page() / .Page(width, height)
        .At(72, 720).Font("DejaVu", 12).Text("Hello")
        .Heading(1, "Title")
        .Paragraph("Body text")
        // Annotations
        .LinkUrl("https://example.com")
        .LinkPage(2)
        .LinkNamed("glossary")
        .Highlight(1.0, 1.0, 0.0)
        .Underline(0.0, 0.0, 1.0)
        .Strikeout(1.0, 0.0, 0.0)
        .Squiggly(1.0, 0.5, 0.0)
        .StickyNote("Review this")
        .StickyNoteAt(300, 720, "Positioned note")
        .Stamp(StampType.Approved)
        .FreeText(100, 500, 200, 50, "Comment")
        .Watermark("DRAFT")
        .WatermarkConfidential()
        .WatermarkDraft()
        // AcroForm widgets
        .TextField("name", 150, 400, 200, 20, defaultValue: "Jane Doe")
        .Checkbox("agree", 72, 380, 15, 15, checkedValue: true)
        .ComboBox("country", 150, 360, 200, 20, new[] { "US", "UK" }, selected: "US")
        .RadioGroup("tier", new[] { ("free", 72f, 340f, 15f, 15f), ("pro", 120f, 340f, 15f, 15f) }, selected: "pro")
        .PushButton("submit", 72, 300, 80, 25, caption: "Submit")
        // Graphics primitives
        .Rect(50, 270, 500, 2)
        .FilledRect(50, 260, 500, 2, 0.9, 0.9, 0.9)
        .Line(50, 250, 550, 250)
    .Done()
    .Build();
// Or:
// .Save("out.pdf");
// .SaveEncrypted("out.pdf", "user-pw", "owner-pw");     // AES-256
// .ToBytesEncrypted("user-pw", "owner-pw");

HTML + CSS pipeline

// Single font:
using var pdf = Pdf.FromHtmlCss(html, css, fontBytes);

// Or several named fonts:
using var pdfWithFonts = Pdf.FromHtmlCssWithFonts(html, css, new[] {
    ("DejaVu Sans", font1),
    ("Noto Sans CJK", font2),
});

Signature verification

byte[] pdfBytes = File.ReadAllBytes("signed.pdf");  // raw file bytes, used by VerifyDetached below
using var doc = PdfDocument.Open("signed.pdf");

foreach (var sig in doc.Signatures)
{
    Console.WriteLine(sig.SignerName);
    Console.WriteLine(sig.Reason);
    Console.WriteLine(sig.Location);
    Console.WriteLine(sig.SigningTime);       // DateTimeOffset?

    SignatureStatus status = sig.Verify();    // Valid / Invalid / Unknown
    bool ok = sig.VerifyDetached(pdfBytes);

    using var cert = sig.GetCertificate();
    Console.WriteLine($"{cert.Subject} / {cert.Issuer} / {cert.Serial}");
    Console.WriteLine($"valid {cert.NotBefore} → {cert.NotAfter} (is_valid={cert.IsValid})");
}

// tstBytes: a byte[] holding a timestamp token obtained elsewhere
var ts = Timestamp.Parse(tstBytes);
Console.WriteLine($"{ts.Time} serial={ts.Serial} tsa={ts.TsaName}");

var client = new TsaClient(
    url: "https://freetsa.org/tsr",
    username: null, password: null,
    timeoutSeconds: 30, hashAlgorithm: 2,
    useNonce: true, certReq: true);
var fresh = client.RequestTimestamp(pdfBytes);

SignatureStatus has three values: Valid, Invalid, and Unknown. Signatures that use RSA-PSS or ECDSA return Unknown; unsupported algorithms throw UnsupportedFeatureException on strict methods.

Rendering

byte[] region = doc.RenderPageRegion(pageIndex: 0, x: 72, y: 200, width: 468, height: 300, format: RenderFormat.Png);
byte[] fitted = doc.RenderPageFit(pageIndex: 0, fitWidth: 1024, fitHeight: 768, format: RenderFormat.Png);