What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Dart / Flutter)

PDF Oxide is the fastest way to read PDFs from Dart and Flutter: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. The pdf_oxide package is an idiomatic dart:ffi wrapper over the Rust core. PDF handles are freed either explicitly by close() or automatically by a NativeFinalizer, C strings and buffers are copied into Dart for you, and C-ABI error codes surface as a PdfOxideError exception.

Installation

Add pdf_oxide to your pubspec.yaml:

dependencies:
  pdf_oxide: ^0.3.69

Then fetch dependencies:

dart pub get

The binding loads the native library (libpdf_oxide.{so,dylib,dll}) at runtime. The resolution order is PDF_OXIDE_LIB_PATH (full path) → PDF_OXIDE_LIB_DIR → ../target/release → target/release → the system loader. For Flutter, ship the platform library with your app and point PDF_OXIDE_LIB_PATH at it.
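
The lookup order above can be sketched as a small pure-Dart function. This is only an illustration of the documented order, not the package's actual loader; candidateLibraryPaths is a hypothetical helper, and joining PDF_OXIDE_LIB_DIR to the file name with a forward slash is an assumption.

```dart
/// Hypothetical helper (not part of pdf_oxide): lists the places the
/// documented lookup order would try, most specific first.
List<String> candidateLibraryPaths(Map<String, String> env, String fileName) {
  final candidates = <String>[];

  // 1. Full path to the library file.
  final fullPath = env['PDF_OXIDE_LIB_PATH'];
  if (fullPath != null && fullPath.isNotEmpty) candidates.add(fullPath);

  // 2. Directory that contains the library (assumed '/' separator).
  final dir = env['PDF_OXIDE_LIB_DIR'];
  if (dir != null && dir.isNotEmpty) candidates.add('$dir/$fileName');

  // 3 and 4. Local Cargo build output directories.
  candidates.add('../target/release/$fileName');
  candidates.add('target/release/$fileName');

  // 5. Bare file name, which leaves resolution to the system loader.
  candidates.add(fileName);
  return candidates;
}
```

Calling candidateLibraryPaths(Platform.environment, 'libpdf_oxide.so') (with dart:io imported) prints a quick checklist when the library fails to load on a given machine.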

Quick Start

Open a PDF and pull text off the first page. Always close() the document when you are done — try/finally keeps that tidy.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('research-paper.pdf');
  try {
    print('Pages: ${doc.pageCount}');
    print('PDF version: ${doc.version}'); // e.g. 1.7
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}
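
If the try/finally pattern repeats across a codebase, it can be wrapped once. The sketch below is a hypothetical helper named using, not something the pdf_oxide package provides; it works with any resource that has a cleanup step.

```dart
/// Hypothetical helper (not part of pdf_oxide): runs [body] with [resource]
/// and always calls [close] afterwards, even when [body] throws.
R using<T, R>(T resource, void Function(T) close, R Function(T) body) {
  try {
    return body(resource);
  } finally {
    close(resource);
  }
}
```

With the API shown above, a call might look like using(PdfDocument.open('research-paper.pdf'), (PdfDocument d) => d.close(), (PdfDocument d) => d.extractText(0)). The explicit parameter types keep type inference straightforward on older Dart SDKs.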

To open a PDF that is already in memory (for example, downloaded over HTTP in a Flutter app), use openFromBytes:

import 'dart:typed_data';
import 'package:pdf_oxide/pdf_oxide.dart';

void printFirstPage(Uint8List bytes) {
  final doc = PdfDocument.openFromBytes(bytes);
  try {
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}

Text, Markdown, and HTML

Each page can be rendered as plain text, Markdown, or HTML. Per-page methods take a 0-based page index; the …All() variants run across the whole document.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  try {
    // Single page (index 0)
    print(doc.extractText(0));   // raw extracted text
    print(doc.toPlainText(0));   // normalized plain text
    print(doc.toMarkdown(0));    // Markdown with headings, lists, tables
    print(doc.toHtml(0));        // HTML

    // Whole document
    print(doc.toMarkdownAll());
    print(doc.toHtmlAll());
    print(doc.toPlainTextAll());
  } finally {
    doc.close();
  }
}

There is also a lightweight Page view if you prefer to work a page at a time:

final doc = PdfDocument.open('report.pdf');
try {
  final page = doc.page(0); // lightweight view over page 0
  print(page.text());
  print(page.markdown());
} finally {
  doc.close(); // runs even if extraction throws
}

Words and Lines with Coordinates

extractWords returns every word on a page with its bounding box, font name, font size, and bold flag; extractTextLines returns whole lines. Coordinates are in PDF user-space points.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('paper.pdf');
  try {
    for (final word in doc.extractWords(0)) {
      final b = word.bbox; // Bbox(x, y, width, height)
      print("'${word.text}' at (${b.x}, ${b.y}) "
          'font=${word.fontName} size=${word.fontSize} bold=${word.bold}');
    }

    for (final line in doc.extractTextLines(0)) {
      print('${line.wordCount} words: ${line.text}');
    }
  } finally {
    doc.close();
  }
}
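
To draw these boxes on screen, the point-based coordinates usually need converting to pixels. The sketch below relies on the PDF convention that one user-space unit is 1/72 inch by default. Everything else is an assumption for illustration: Box is a stand-in for the binding's Bbox so the code runs on its own, (x, y) is assumed to be the lower-left corner with y growing upward, and the page height is supplied by the caller (792 points for US Letter).

```dart
/// Stand-in for the binding's Bbox so this sketch is self-contained.
class Box {
  const Box(this.x, this.y, this.width, this.height);
  final double x, y, width, height;
}

/// Converts a length in points to pixels (1 pt = 1/72 inch).
double pointsToPixels(double points, double dpi) => points * dpi / 72.0;

/// Maps a box from bottom-left-origin points to top-left-origin pixels.
/// Assumes (b.x, b.y) is the lower-left corner; verify against your data.
Box toScreenBox(Box b, double pageHeightPt, double dpi) {
  // Distance from the top of the page to the top edge of the box.
  final topPt = pageHeightPt - (b.y + b.height);
  return Box(
    pointsToPixels(b.x, dpi),
    pointsToPixels(topPt, dpi),
    pointsToPixels(b.width, dpi),
    pointsToPixels(b.height, dpi),
  );
}
```

If the boxes come out vertically mirrored, the y values are probably already measured from the top of the page and the flip should be dropped.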

For glyph-level detail, extractChars returns each Char with its Unicode codepoint, bounding box, font name, and size:

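final doc = PdfDocument.open('paper.pdf');
try {
  for (final ch in doc.extractChars(0)) {
    // ch.character is a Unicode codepoint, so convert it for printing
    print('${String.fromCharCode(ch.character)} @ ${ch.bbox} ${ch.fontSize}pt');
  }
} finally {
  doc.close(); // runs even if extraction throws
}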

Search

search looks within a single page; searchAll scans the whole document. search takes the 0-based page index first; both then take a search term and a caseSensitive flag. Each returns SearchResult records carrying the matched text, its page index, and its bounding box.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('manual.pdf');
  try {
    // Single page (page 0), case-insensitive
    for (final hit in doc.search(0, 'configuration', false)) {
      print("page ${hit.page}: '${hit.text}' at ${hit.bbox}");
    }

    // Across the whole document
    final hits = doc.searchAll('configuration', false);
    print('${hits.length} matches');
  } finally {
    doc.close();
  }
}
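
A common follow-up is to summarize where the matches are. The sketch below is a hypothetical helper, hitsPerPage, that tallies matches by page; it takes plain page indexes so it runs without the package.

```dart
/// Hypothetical helper: counts how many hits fall on each page, given the
/// page index of every hit.
Map<int, int> hitsPerPage(Iterable<int> pageIndexes) {
  final counts = <int, int>{};
  for (final page in pageIndexes) {
    counts[page] = (counts[page] ?? 0) + 1;
  }
  return counts;
}
```

Assuming searchAll returns an iterable of results whose page field is an int, as the example above suggests, the call would be hitsPerPage(doc.searchAll('configuration', false).map((hit) => hit.page)).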

Creating a PDF

The Pdf builder turns Markdown, HTML, or plain text into a PDF. Call toBytes() to get the bytes or save() to write a file, and close() when done.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final pdf = Pdf.fromMarkdown('# Hello World\n\nThis is a **PDF**.\n');
  try {
    pdf.save('output.pdf');
    final bytes = pdf.toBytes();
    print('Wrote ${bytes.length} bytes');
  } finally {
    pdf.close();
  }
}

Pdf.fromHtml('<h1>Invoice</h1><p>Amount: \$42</p>') and Pdf.fromText('Plain text content.') work the same way. Because a built Pdf is just bytes, you can pass pdf.toBytes() straight to PdfDocument.openFromBytes() and read the content back without touching disk.

Error Handling

Every fallible call throws a PdfOxideError (which implements Exception) on failure; the exception carries the underlying C-ABI error code:

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  try {
    final doc = PdfDocument.open('/nonexistent/nope.pdf');
    doc.close();
  } on PdfOxideError catch (e) {
    print('Failed to open PDF: $e');
  }
}
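
When processing many files, it is often better to record failures and keep going than to stop at the first bad PDF. The sketch below is a hypothetical helper, collectFailures, that shows the pattern. It catches Exception so that it runs without the package; since PdfOxideError implements Exception, it would be caught as well, and real code may prefer to narrow the clause to on PdfOxideError.

```dart
/// Hypothetical helper: runs [process] on each path and returns a map of
/// path -> error message for the paths that threw.
Map<String, String> collectFailures(
  Iterable<String> paths,
  void Function(String path) process,
) {
  final failures = <String, String>{};
  for (final path in paths) {
    try {
      process(path);
    } on Exception catch (e) {
      // Record the failure and continue with the next file.
      failures[path] = e.toString();
    }
  }
  return failures;
}
```

The process callback would typically open the document, extract what it needs, and close the document in a finally block, as in the earlier examples.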

Next Steps

Rust Getting Started — the native core that powers this binding
Python Getting Started — using PDF Oxide from Python
Text Extraction — detailed extraction options and recipes
PDF Creation — advanced creation with metadata and encryption