Getting Started with PDF Oxide (Python)

PDF Oxide is the fastest Python PDF library — 0.8ms mean text extraction, 5× faster than PyMuPDF, 100% pass rate on 3,830 PDFs. One library for extracting, creating, and editing PDFs. MIT licensed, built on a Rust core.

Installation

pip install pdf_oxide

Requirements: Python 3.8+. Pre-built wheels are available for Linux, macOS, and Windows on both x86_64 and ARM64 architectures. No compiler or system dependencies needed.

Opening a PDF

Use PdfDocument to open and inspect any PDF file.

from pdf_oxide import PdfDocument

doc = PdfDocument("research-paper.pdf")
print(f"Pages: {doc.page_count()}")
print(f"PDF version: {doc.version()}")

Page API

Since v0.3.34, PdfDocument is iterable and indexable. Iterating or indexing returns PdfPage objects whose properties are computed lazily.

from pdf_oxide import PdfDocument

with PdfDocument("paper.pdf") as doc:
    for page in doc:            # len(doc), doc[i], doc[-1] all work
        text = page.text        # lazy — computed on access
        md = page.markdown(detect_headings=True)
        for table in page.tables:
            for row in table["rows"]:
                print([cell["text"] for cell in row["cells"]])

Page properties (all lazy): text, chars, words, lines, spans, tables, images, paths, annotations, width, height, bbox. Methods: markdown(), plain_text(), html(), render(), search(), region(x, y, w, h).
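
The table structure used in the loop above (a dict with "rows", each row holding "cells", each cell holding "text") maps naturally onto CSV. The following is a minimal sketch that assumes only that shape; table_to_csv is a hypothetical helper, not part of PDF Oxide, and sample_table stands in for one entry of page.tables.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table dict (rows -> cells -> text) to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table["rows"]:
        writer.writerow([cell["text"] for cell in row["cells"]])
    return buffer.getvalue()


# Stand-in data with the same shape as the tables in the example above.
sample_table = {
    "rows": [
        {"cells": [{"text": "Name"}, {"text": "Qty"}]},
        {"cells": [{"text": "Widget, large"}, {"text": "3"}]},
    ]
}

csv_text = table_to_csv(sample_table)
print(csv_text)
```

Using the csv module rather than joining on commas means cells that themselves contain commas or quotes are escaped correctly.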

The editor page class was renamed to EditorPage in v0.3.34 to avoid collision with PdfPage.

Text Extraction

Single Page

Extract plain text from any page by its zero-based index.

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)

All Pages

from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)
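
The same loop can be wrapped in a generator so that callers receive one page at a time instead of printing. This sketch relies only on page_count() and extract_text(i) as shown above; iter_page_text, pages_containing, and FakeDocument are hypothetical names, and FakeDocument exists purely so the snippet runs without a PDF file.

```python
def iter_page_text(doc):
    """Yield (one_based_page_number, text) for every page of doc."""
    for index in range(doc.page_count()):
        yield index + 1, doc.extract_text(index)


def pages_containing(doc, needle):
    """Return the one-based numbers of pages whose text contains needle."""
    needle = needle.lower()
    return [number for number, text in iter_page_text(doc)
            if needle in text.lower()]


class FakeDocument:
    """Stand-in exposing the two methods used above (an assumption for demo purposes)."""

    def __init__(self, pages):
        self._pages = list(pages)

    def page_count(self):
        return len(self._pages)

    def extract_text(self, index):
        return self._pages[index]


fake = FakeDocument(["Intro", "Methods and results", "Results discussion"])
hits = pages_containing(fake, "results")
print(hits)
```

With the real library, passing a PdfDocument in place of FakeDocument should work as long as it exposes the same two methods.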

Character-Level Extraction

extract_chars() returns a list of TextChar objects with precise positioning and font metadata for every character on the page.

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)

for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) "
          f"size={ch.font_size:.1f} font={ch.font_name} "
          f"bbox={ch.bbox}")

Each TextChar has the following fields:

Field      Type                                Description
char       str                                 The Unicode character
x          float                               Horizontal position in points
y          float                               Vertical position in points
font_size  float                               Font size in points
font_name  str                                 PostScript font name
bbox       tuple[float, float, float, float]   Bounding box (x0, y0, x1, y1)
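
One common use of per-character positions is rebuilding text lines. The sketch below clusters characters by their y value and orders each cluster by x. It uses only the char, x, and y fields listed above. group_chars_into_lines and the y_tolerance default are illustrative choices, and the top-to-bottom ordering assumes the usual PDF convention of a bottom-left origin (larger y is higher on the page), which is worth verifying against real output.

```python
from types import SimpleNamespace


def group_chars_into_lines(chars, y_tolerance=2.0):
    """Cluster characters with similar y into lines, each read left to right."""
    lines = []
    # Assumption: larger y means higher on the page, so sort descending.
    for ch in sorted(chars, key=lambda c: c.y, reverse=True):
        if lines and abs(lines[-1]["y"] - ch.y) <= y_tolerance:
            lines[-1]["chars"].append(ch)
        else:
            lines.append({"y": ch.y, "chars": [ch]})
    return [
        "".join(c.char for c in sorted(line["chars"], key=lambda c: c.x))
        for line in lines
    ]


# Stand-in objects carrying the same fields as TextChar.
sample_chars = [
    SimpleNamespace(char="H", x=10.0, y=700.0),
    SimpleNamespace(char="i", x=16.0, y=700.5),
    SimpleNamespace(char="o", x=10.0, y=680.0),
    SimpleNamespace(char="k", x=16.0, y=680.0),
]

text_lines = group_chars_into_lines(sample_chars)
print(text_lines)
```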

Text Spans

extract_spans() groups consecutive characters that share the same font and size into spans, giving you structured text with font metadata.

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)

for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size}")
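
Because each span carries its font size, a rough heading detector can be built by comparing sizes against the typical body size. This is a heuristic sketch, not how the library's own detect_headings option works; heading_candidates and the ratio threshold are hypothetical, and only the text and font_size attributes shown above are used.

```python
from statistics import median
from types import SimpleNamespace


def heading_candidates(spans, ratio=1.2):
    """Return the text of spans noticeably larger than the median font size."""
    visible = [s for s in spans if s.text.strip()]
    if not visible:
        return []
    body_size = median(s.font_size for s in visible)
    return [s.text.strip() for s in visible if s.font_size >= body_size * ratio]


# Stand-in objects with the same attributes as the spans above.
sample_spans = [
    SimpleNamespace(text="Introduction", font_size=18.0),
    SimpleNamespace(text="This paper", font_size=10.0),
    SimpleNamespace(text="describes", font_size=10.0),
    SimpleNamespace(text="a method.", font_size=10.0),
    SimpleNamespace(text="Results", font_size=14.0),
]

headings = heading_candidates(sample_spans)
print(headings)
```

The median is used rather than the mean so that a few very large title spans do not raise the threshold.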

Markdown Conversion

Convert a PDF page to Markdown with optional heading detection.

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)

HTML Conversion

Convert a PDF page to HTML.

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
html = doc.to_html(0)
print(html)

Image Extraction

extract_image_bytes() returns one dict per image embedded on a page, including images referenced in content streams and nested Form XObjects.

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)

for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']} "
          f"({len(img['data'])} bytes)")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])

Each dict returned by extract_image_bytes() has the following keys:

Key     Type   Description
width   int    Image width in pixels
height  int    Image height in pixels
data    bytes  Raw image data
format  str    Image format (e.g. png, jpeg)
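
When saving many images, it can help to write them into a dedicated directory and to sanitize the format string before using it as a file extension. The sketch below assumes only the four keys listed above; save_images and its prefix parameter are hypothetical, and the sample dicts contain placeholder bytes rather than real image data.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir, prefix="image"):
    """Write each image dict's data to out_dir and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        # Keep only alphanumerics so an unexpected format value cannot alter the path.
        ext = "".join(c for c in str(img["format"]).lower() if c.isalnum()) or "bin"
        path = out_dir / f"{prefix}_{index}.{ext}"
        path.write_bytes(img["data"])
        written.append(path)
    return written


# Placeholder dicts shaped like the ones described above (not real images).
sample_images = [
    {"width": 2, "height": 1, "data": b"\x89PNG-fake", "format": "png"},
    {"width": 1, "height": 1, "data": b"\xff\xd8fake", "format": "jpeg"},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_images(sample_images, tmp)
    names = [p.name for p in paths]
    sizes = [p.stat().st_size for p in paths]

print(names, sizes)
```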

Opening from Bytes

Open a PDF from in-memory bytes — useful when downloading from S3, HTTP, or databases:

from pdf_oxide import PdfDocument

doc = PdfDocument.from_bytes(pdf_bytes)
text = doc.extract_text(0)

# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")

For the builder API:

from pdf_oxide import Pdf

pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")
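
Bytes fetched over the network are not always what they claim to be (an HTML error page, for example). A cheap sanity check before parsing is to look for the "%PDF-" header marker, which a PDF file carries at or near its start. looks_like_pdf and the 1024-byte search window are illustrative choices, not part of PDF Oxide.

```python
def looks_like_pdf(data, search_window=1024):
    """Return True if the PDF header marker appears near the start of data."""
    return b"%PDF-" in data[:search_window]


print(looks_like_pdf(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<html><body>403 Forbidden</body></html>"))
```

A passing check does not guarantee the file is valid. It only filters out obvious non-PDF payloads before they are handed to from_bytes.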

Password-Protected PDFs

Pass password= in the constructor to open encrypted documents.

from pdf_oxide import PdfDocument

doc = PdfDocument("confidential.pdf", password="secret")
text = doc.extract_text(0)
print(text)

Alternatively, open the document first and then call doc.authenticate(password).

PDF Creation

The Pdf class provides factory methods to create PDFs from various source formats.

From Markdown

from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
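
Since from_markdown takes a plain string, the Markdown can be assembled programmatically before conversion. The sketch below builds such a string from a title and a list of sections; build_report_markdown is a hypothetical helper, and whether every Markdown construct renders as expected is an assumption to verify against the library.

```python
def build_report_markdown(title, sections):
    """Build a Markdown document from a title and (heading, body) pairs."""
    lines = [f"# {title}", ""]
    for heading, body in sections:
        lines += [f"## {heading}", "", body, ""]
    return "\n".join(lines).rstrip() + "\n"


markdown = build_report_markdown(
    "Weekly Report",
    [("Summary", "All checks passed."), ("Next", "Ship it.")],
)
print(markdown)
```

The resulting string can then be passed to Pdf.from_markdown(markdown) and saved as in the example above.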

From HTML

from pdf_oxide import Pdf

pdf = Pdf.from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>")
pdf.save("invoice.pdf")

From Plain Text

from pdf_oxide import Pdf

pdf = Pdf.from_text("Plain text document.\n\nSecond paragraph.")
pdf.save("notes.pdf")

From Images

from pdf_oxide import Pdf

pdf = Pdf.from_image("scan.jpg")
pdf.save("scan.pdf")

Text Search

Search for text across the entire document or within a single page.

from pdf_oxide import PdfDocument

doc = PdfDocument("manual.pdf")

# Search all pages
results = doc.search("configuration")
for r in results:
    print(f"Page {r.page}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")

# Search a single page
page_results = doc.search_page(0, "configuration")
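
Document-wide results are often easier to read when summarized per page. This sketch counts matches by the page attribute used in the example above; hits_per_page is a hypothetical helper, and the SimpleNamespace objects stand in for real search results.

```python
from collections import Counter
from types import SimpleNamespace


def hits_per_page(results):
    """Return {page: match_count}, ordered by page."""
    return dict(sorted(Counter(r.page for r in results).items()))


# Stand-in results carrying the same attributes as in the example above.
sample_results = [
    SimpleNamespace(page=0, text="configuration", x=72.0, y=640.0),
    SimpleNamespace(page=2, text="configuration", x=90.0, y=512.0),
    SimpleNamespace(page=0, text="configuration", x=72.0, y=300.0),
]

summary = hits_per_page(sample_results)
print(summary)
```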

Error Handling

PDF Oxide raises PdfError for PDF-specific failures and standard Python exceptions for I/O problems.

from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
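
When processing many files, the same try/except pattern can collect failures instead of stopping at the first one. The sketch below is deliberately library-agnostic: extract_many takes the extraction function and the PDF-specific exception types as parameters. Both names are hypothetical, and ValueError is only a stand-in default so the snippet runs without PDF Oxide installed.

```python
def extract_many(paths, extract, pdf_errors=(ValueError,)):
    """Run extract(path) for each path, separating results from failures."""
    texts, failures = {}, {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except FileNotFoundError:
            failures[path] = "file not found"
        except pdf_errors as exc:
            failures[path] = f"PDF error: {exc}"
    return texts, failures


def fake_extract(path):
    """Stand-in extractor that simulates the two failure modes."""
    if path == "missing.pdf":
        raise FileNotFoundError(path)
    if path == "broken.pdf":
        raise ValueError("invalid xref table")
    return f"text of {path}"


texts, failures = extract_many(["a.pdf", "missing.pdf", "broken.pdf"], fake_extract)
print(texts)
print(failures)
```

With the real library, the intended call would pass pdf_errors=(PdfError,) and an extract function that opens the document and calls extract_text(0), as in the example above.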

Next Steps