What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Elixir)

PDF Oxide is the fastest way to read and write PDFs from Elixir: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. The binding is a NIF over the same Rust core used by the other language bindings. It runs CPU-bound work on dirty CPU schedulers (ERL_NIF_DIRTY_JOB_CPU_BOUND), so long-running calls do not block the regular BEAM schedulers.

Document and Pdf handles are NIF resources freed by the GC. Fallible functions return {:ok, value} or {:error, code}, and page indices are 0-based.

Installation

Add pdf_oxide to your mix.exs dependencies:

def deps do
  [
    {:pdf_oxide, "~> 0.3"}
  ]
end

Then fetch and compile — the NIF is built via elixir_make against the native cdylib:

mix deps.get
mix compile

Quick Start

Build a PDF from Markdown, serialize it to bytes, then open it and extract the text back out.

{:ok, pdf}   = PdfOxide.from_markdown("# Hello pdf_oxide\n\nThis is an **Elixir** binding.\n")
{:ok, bytes} = PdfOxide.to_bytes(pdf)
{:ok, doc}   = PdfOxide.open_from_bytes(bytes)

{:ok, pages} = PdfOxide.page_count(doc)
IO.puts("pages: #{pages}")

%{major: maj, minor: min} = PdfOxide.version(doc)
IO.puts("version: #{maj}.#{min}")

{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)

Opening a PDF

Open from a file path, or directly from in-memory bytes (useful when streaming from S3, HTTP, or a database):

# From a path
{:ok, doc} = PdfOxide.open("report.pdf")

# From bytes already in memory
{:ok, doc} = PdfOxide.open_from_bytes(pdf_bytes)

# Encrypted documents
{:ok, doc} = PdfOxide.open_with_password("confidential.pdf", "secret")

# Inspect
{:ok, count} = PdfOxide.page_count(doc)
encrypted? = PdfOxide.encrypted?(doc)

Close a document explicitly when you’re done (close/1 is idempotent), or let the GC reclaim it:

:ok = PdfOxide.close(doc)
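
When a handle should not outlive a block of work, the explicit close can be wrapped in try/after so it runs even if the work raises. The following is a minimal sketch, not part of the library: PdfScope and with_document/3 are hypothetical names, and the pdf argument (defaulting to PdfOxide) is an illustrative hook so the helper can be exercised with a stand-in module. It relies only on open/1 and close/1 as shown above.

```elixir
defmodule PdfScope do
  # Hypothetical helper, not part of pdf_oxide.
  # `pdf` is the module that provides open/1 and close/1 (PdfOxide by default).
  # Returns whatever `fun` returns, or {:error, code} if the file cannot be opened.
  def with_document(path, fun, pdf \\ PdfOxide) when is_function(fun, 1) do
    with {:ok, doc} <- pdf.open(path) do
      try do
        fun.(doc)
      after
        # Runs whether fun returns or raises; close/1 is documented as idempotent.
        pdf.close(doc)
      end
    end
  end
end
```

A call such as PdfScope.with_document("report.pdf", &PdfOxide.page_count/1) would then return the page count tuple and close the handle before returning.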

Text Extraction

Extract plain text from a single page by its zero-based index, or pull the whole document at once:

{:ok, doc} = PdfOxide.open("book.pdf")

# A single page
{:ok, text} = PdfOxide.extract_text(doc, 0)

# Plain text, one page
{:ok, pt} = PdfOxide.to_plain_text(doc, 0)

# Every page, concatenated
{:ok, all} = PdfOxide.to_plain_text_all(doc)
IO.puts(all)
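
Because extraction runs on dirty CPU schedulers, several documents can be processed concurrently from ordinary BEAM processes. Below is a minimal sketch under stated assumptions: BatchExtract and first_pages/2 are hypothetical names, the pdf argument is an illustrative hook that defaults to PdfOxide, and the 30-second timeout is an arbitrary choice. Each task opens its own handle, so nothing is assumed about sharing handles between processes.

```elixir
defmodule BatchExtract do
  # Hypothetical helper, not part of pdf_oxide.
  # Returns a list of {path, {:ok, text} | {:error, code}} in input order.
  def first_pages(paths, pdf \\ PdfOxide) do
    paths
    |> Task.async_stream(
      fn path -> {path, first_page(pdf, path)} end,
      max_concurrency: System.schedulers_online(),
      # Arbitrary per-task limit; by default a timeout exits the caller.
      timeout: 30_000
    )
    |> Enum.map(fn {:ok, result} -> result end)
  end

  # Opens the file, extracts page 0, and closes the handle again.
  defp first_page(pdf, path) do
    with {:ok, doc} <- pdf.open(path) do
      result = pdf.extract_text(doc, 0)
      pdf.close(doc)
      result
    end
  end
end
```

Task.async_stream/3 keeps results in input order by default, so each tuple lines up with the path that produced it.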

Markdown & HTML Conversion

Convert a page — or the entire document — to Markdown or HTML:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, md}    = PdfOxide.to_markdown(doc, 0)
{:ok, mdall} = PdfOxide.to_markdown_all(doc)

{:ok, html}    = PdfOxide.to_html(doc, 0)
{:ok, htmlall} = PdfOxide.to_html_all(doc)

Words & Lines

extract_words/2 returns structured PdfOxide.Word structs with a bounding box and a bold flag; extract_text_lines/2 groups them into lines.

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, words} = PdfOxide.extract_words(doc, 0)

for w <- Enum.take(words, 10) do
  %PdfOxide.Bbox{x: x, y: y, width: width} = w.bbox
  IO.puts("#{w.text} at (#{x}, #{y}) w=#{width} bold=#{w.bold}")
end

{:ok, lines} = PdfOxide.extract_text_lines(doc, 0)

for line <- lines do
  IO.puts("#{line.word_count} words: #{line.text}")
end
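
Since each word carries a bold flag, the list can be filtered with ordinary Enum functions, for example to pull out emphasized text. This is a small sketch: WordFilters and bold_text/1 are hypothetical names, and the function assumes only the text and bold fields shown above.

```elixir
defmodule WordFilters do
  # Hypothetical helper: joins the text of all bold words with single spaces.
  # Works on any list of structs or maps that have :text and :bold keys.
  def bold_text(words) do
    words
    |> Enum.filter(& &1.bold)
    |> Enum.map_join(" ", & &1.text)
  end
end
```

Given the words list from extract_words/2, WordFilters.bold_text(words) would return the bold words of the page as one string.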

Search

Search a single page, or across the whole document. The final argument is case_sensitive (the fourth for search/4, the third for search_all/3). Each result carries text, page, and a PdfOxide.Bbox.

{:ok, doc} = PdfOxide.open("manual.pdf")

# One page (page index 0), case-insensitive
{:ok, results} = PdfOxide.search(doc, 0, "configuration", false)

for r <- results do
  %PdfOxide.Bbox{x: x, y: y} = r.bbox
  IO.puts("page #{r.page}: '#{r.text}' at (#{x}, #{y})")
end

# All pages
{:ok, all} = PdfOxide.search_all(doc, "configuration", false)
IO.puts("#{length(all)} matches")
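
Because every result records its page, a document-wide search can be summarized per page with the standard library alone. This is a sketch: SearchSummary and hits_per_page/1 are hypothetical names, and the function assumes only the page field described above.

```elixir
defmodule SearchSummary do
  # Hypothetical helper: maps each page index to its number of matches.
  # Works on any list of structs or maps that have a :page key.
  def hits_per_page(results) do
    Enum.frequencies_by(results, & &1.page)
  end
end
```

Passing the list returned by search_all/3 to SearchSummary.hits_per_page/1 yields a map keyed by zero-based page index.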

PDF Creation

The builder factory functions return a Pdf handle that you serialize with to_bytes/1 or write straight to disk with save/2:

{:ok, pdf} = PdfOxide.from_markdown("# Hello World\n\nThis is a PDF.")
:ok = PdfOxide.save(pdf, "output.pdf")

{:ok, pdf} = PdfOxide.from_html("<h1>Invoice</h1><p>Amount: $42</p>")
{:ok, bytes} = PdfOxide.to_bytes(pdf)

{:ok, pdf} = PdfOxide.from_text("Plain text content.")
:ok = PdfOxide.save(pdf, "notes.pdf")

Rendering Pages to Images

With the rendering feature enabled, rasterize a page to a PdfOxide.RenderedImage and save it as a PNG:

{:ok, doc} = PdfOxide.open("paper.pdf")

{:ok, img} = PdfOxide.render_page(doc, 0)
IO.puts("#{img.width}x#{img.height}, #{byte_size(img.data)} bytes")
:ok = PdfOxide.save(img, "page0.png")

# Zoom factor, or a fixed-size thumbnail
{:ok, zoomed} = PdfOxide.render_page_zoom(doc, 0, 2.0)
{:ok, thumb}  = PdfOxide.render_page_thumbnail(doc, 0, 128)

Error Handling

Fallible functions return a tagged tuple — pattern-match for clean control flow:

case PdfOxide.open("/nonexistent/nope.pdf") do
  {:ok, doc} ->
    {:ok, text} = PdfOxide.extract_text(doc, 0)
    IO.puts(text)

  {:error, code} ->
    IO.puts("could not open PDF: #{inspect(code)}")
end
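
When several fallible calls are chained, a with expression stops at the first {:error, code} and returns it unchanged. The sketch below uses only open/1, page_count/1, and extract_text/2 from this guide. PdfSummary and summarize/2 are hypothetical names, and the pdf argument is an illustrative hook that defaults to PdfOxide.

```elixir
defmodule PdfSummary do
  # Hypothetical helper, not part of pdf_oxide.
  # Returns {:ok, summary} or the first {:error, code} encountered.
  def summarize(path, pdf \\ PdfOxide) do
    with {:ok, doc} <- pdf.open(path),
         {:ok, pages} <- pdf.page_count(doc),
         {:ok, text} <- pdf.extract_text(doc, 0) do
      {:ok, %{pages: pages, first_page_chars: String.length(text)}}
    end
  end
end
```

The caller then handles one success shape and one error shape, regardless of which step failed.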

Next Steps

Rust Getting Started — using PDF Oxide from Rust
Python Getting Started — using PDF Oxide from Python
Text Extraction — detailed extraction options and recipes
PDF Creation — advanced creation with metadata and encryption
Editing — modifying existing PDFs, annotations, and form fields