What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with a 0.8ms mean text extraction time: 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with a 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting, which suits LLM and RAG pipelines. No separate package is needed, unlike PyMuPDF, which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Kotlin)

PDF Oxide is the fastest PDF library for the JVM with built-in text extraction — 0.8ms mean, 100% pass rate on 3,830 PDFs. The Kotlin binding is an idiomatic, Android-ready facade over the Java binding: it adds use { } on the closable handles and turns Java Optional<T> returns into nullable T?. One library for extracting, creating, and editing PDFs. MIT licensed, built on a Rust core.

Installation

Add the Kotlin binding to your build.gradle.kts. It transitively pulls in the Java binding that owns the JNI native bridge:

dependencies {
    implementation("fyi.oxide:pdf-oxide-kotlin:0.3.69")
}

Requirements: JDK 17+. On Android, ship the native libpdf_oxide_jni.so in jniLibs/<abi>/; on the desktop JVM the loader finds it automatically (override with -Dfyi.oxide.pdf.lib.path=<path> when needed).
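
If the desktop loader needs the override, one way to pass the system property is from the Gradle build itself. This is a sketch that assumes a standard Gradle Kotlin DSL project with the java or kotlin("jvm") plugin applied. The path is a hypothetical placeholder, and the guide above does not say whether the property expects a file or a directory, so adjust it to what your loader accepts:

```kotlin
// build.gradle.kts (config fragment; the path below is a hypothetical placeholder)
val pdfOxideLibPath = "/path/to/native/libpdf_oxide_jni.so"

// Forward the property to test JVMs.
tasks.withType<Test>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", pdfOxideLibPath)
}

// Forward the property to JVMs started by `run`-style tasks.
tasks.withType<JavaExec>().configureEach {
    systemProperty("fyi.oxide.pdf.lib.path", pdfOxideLibPath)
}
```

This has the same effect as adding -Dfyi.oxide.pdf.lib.path=<path> to the command line for those tasks.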

Quick Start

Build a PDF from Markdown, open it, and read the text back. The Pdf and PdfDocument handles are AutoCloseable, so wrap them in use { }:

import fyi.oxide.pdf.Pdf
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull

Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Kotlin** binding.\n").use { pdf ->
    PdfDocument.open(pdf.save()).use { doc ->
        println("pages:    ${doc.pageCount()}")
        println("producer: ${doc.producerOrNull() ?: "(none)"}")
        println(doc.extractText(0))
    }
}

Pdf.fromMarkdown(String) returns a closable Pdf builder; pdf.save() serializes it to a ByteArray. PdfDocument.open(ByteArray) opens that for reading.
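
What use { } guarantees for these handles can be seen in isolation with a stand-in class. The sketch below involves no PDF Oxide types: FakeHandle is a hypothetical AutoCloseable invented for illustration, and the log only records the order of calls.

```kotlin
// Hypothetical stand-in for a native-backed handle; not part of PDF Oxide.
class FakeHandle(private val log: MutableList<String>) : AutoCloseable {
    fun read(): String {
        log += "read"
        return "hello"
    }

    override fun close() {
        log += "close"
    }
}

fun runDemo(): List<String> {
    val log = mutableListOf<String>()

    // Normal path: close() runs as soon as the block returns.
    FakeHandle(log).use { it.read() }

    // Failure path: close() still runs before the exception propagates.
    try {
        FakeHandle(log).use { error("boom") }
    } catch (e: IllegalStateException) {
        log += "caught"
    }
    return log
}

fun main() {
    println(runDemo()) // [read, close, close, caught]
}
```

Because the handle is closed on both paths, nesting use { } blocks as in the Quick Start releases the inner handle first and the outer one second.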

Opening a PDF

Open an existing document from bytes and inspect its metadata. producerOrNull() and creatorOrNull() are the Kotlin nullable views over the Java Optional getters:

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull
import fyi.oxide.pdf.creatorOrNull

PdfDocument.open(pdfBytes).use { doc ->
    println("open:     ${doc.isOpen}")
    println("pages:    ${doc.pageCount()}")
    println("producer: ${doc.producerOrNull() ?: "(none)"}")
    println("creator:  ${doc.creatorOrNull() ?: "(none)"}")
}
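
How a nullable view over an Optional getter can be written is sketched below. This is not PDF Oxide's source: FakeMetadata is a hypothetical class standing in for a Java type with an Optional-returning getter, and the extension only illustrates the general pattern.

```kotlin
import java.util.Optional

// Hypothetical Java-style class with an Optional getter; not PDF Oxide's real type.
class FakeMetadata(private val producer: String?) {
    fun producer(): Optional<String> = Optional.ofNullable(producer)
}

// Extension that unwraps the Optional into a Kotlin nullable.
fun FakeMetadata.producerOrNull(): String? = producer().orElse(null)

// The nullable composes with the Elvis operator, as in the examples above.
fun describe(meta: FakeMetadata): String =
    "producer: ${meta.producerOrNull() ?: "(none)"}"

fun main() {
    println(describe(FakeMetadata("pdf_oxide"))) // producer: pdf_oxide
    println(describe(FakeMetadata(null)))        // producer: (none)
}
```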

Text Extraction

Extract plain text from any page by its zero-based index, or loop over every page:

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    // a single page
    println(doc.extractText(0))

    // every page
    for (i in 0 until doc.pageCount()) {
        println("--- Page ${i + 1} ---")
        println(doc.extractText(i))
    }
}
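
The examples in this guide assume a pdfBytes value is already in scope. A minimal sketch of one way to obtain it, and to collect the text of every page into a single string, follows. It uses only the calls shown above plus the Kotlin standard library. extractAllText is a hypothetical helper name, and the sketch assumes extractText returns a String:

```kotlin
import fyi.oxide.pdf.PdfDocument
import java.io.File

// Hypothetical helper: read a PDF from disk and join the text of all pages.
fun extractAllText(path: String): String {
    val pdfBytes = File(path).readBytes()
    return PdfDocument.open(pdfBytes).use { doc ->
        // Pages are addressed by zero-based index, as in the loop above.
        (0 until doc.pageCount()).joinToString(separator = "\n\n") { i ->
            doc.extractText(i)
        }
    }
}
```

Returning the string from inside use { } means the document handle is already closed by the time the caller sees the text.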

Page Elements

doc.page(i) returns a PdfPage exposing structured geometry — words, lines, characters, tables, images, and annotations. Each word carries its text and a bounding box:

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val page = doc.page(0)
    println("size: ${page.width()} x ${page.height()}")

    page.words().take(8).forEach { word ->
        println("${word.text()} @ ${word.bbox()}")
    }

    println("lines:       ${page.lines().size}")
    println("chars:       ${page.chars().size}")
    println("tables:      ${page.tables().size}")
    println("images:      ${page.images().size}")
    println("annotations: ${page.annotations().size}")
}

A word’s bbox() is a BBox with helpers like width() and height().

Markdown & HTML Conversion

Convert the document to Markdown with toMarkdown(), or to HTML with toHtml():

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val markdown = doc.toMarkdown()  // all pages
    println(markdown)

    val html = doc.toHtml()
    println(html)
}

Search

Search for text across the document. Each match exposes its text via text():

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(pdfBytes).use { doc ->
    val matches = doc.search("configuration")
    matches.forEach { m ->
        println("match: ${m.text()}")
    }
}

Auto-Extraction

AutoExtractor runs the full extraction pipeline in one call and returns an AutoResult with the text plus optional Markdown/HTML renderings. The markdownOrNull() / htmlOrNull() extensions turn the Java Optional returns into nullable values:

import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.markdownOrNull
import fyi.oxide.pdf.htmlOrNull

PdfDocument.open(pdfBytes).use { doc ->
    val result = AutoExtractor.of(doc).extractDocument()
    println(result.text())
    result.markdownOrNull()?.let { println(it) }
    result.htmlOrNull()?.let { println(it) }
}
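
Because the Markdown rendering is optional, a caller will often want a fallback. The sketch below prefers Markdown and falls back to plain text, using only the calls shown above. bestEffortMarkdown is a hypothetical helper name, and the sketch assumes markdownOrNull() yields a String? and text() a String:

```kotlin
import fyi.oxide.pdf.AutoExtractor
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.markdownOrNull

// Hypothetical helper: Markdown when the pipeline produced it, plain text otherwise.
fun bestEffortMarkdown(pdfBytes: ByteArray): String =
    PdfDocument.open(pdfBytes).use { doc ->
        val result = AutoExtractor.of(doc).extractDocument()
        result.markdownOrNull() ?: result.text()
    }
```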

Editing

DocumentEditor opens a PDF for structural edits — for example, scrubbing metadata before sharing — then serializes the result back to bytes:

import fyi.oxide.pdf.DocumentEditor

DocumentEditor.open(pdfBytes).use { editor ->
    editor.scrubMetadata()
    val cleaned: ByteArray = editor.save()
    println("cleaned: ${cleaned.size} bytes")
}
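
A file-to-file variant of the same edit might look like the sketch below. It combines the calls shown in this guide with java.io.File. scrubFile is a hypothetical helper name, and the final check simply prints whatever producer value the cleaned document reports, without assuming what that value will be:

```kotlin
import fyi.oxide.pdf.DocumentEditor
import fyi.oxide.pdf.PdfDocument
import fyi.oxide.pdf.producerOrNull
import java.io.File

// Hypothetical helper: scrub metadata from input and write the result to output.
fun scrubFile(input: File, output: File) {
    val cleaned: ByteArray = DocumentEditor.open(input.readBytes()).use { editor ->
        editor.scrubMetadata()
        editor.save()
    }
    output.writeBytes(cleaned)

    // Reopen the cleaned bytes to inspect the metadata that remains.
    PdfDocument.open(cleaned).use { doc ->
        println("producer after scrub: ${doc.producerOrNull() ?: "(none)"}")
    }
}
```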

Next Steps

Java Getting Started – the JVM binding the Kotlin facade wraps
Python Getting Started – using PDF Oxide from Python
Text Extraction – detailed extraction options and recipes
PDF Creation – advanced creation with builders, encryption, and metadata
Editing – modifying existing PDFs, annotations, and form fields