Getting Started with PDF Oxide (Python)
PDF Oxide is the fastest Python PDF library — 0.8ms mean text extraction, 5× faster than PyMuPDF, 100% pass rate on 3,830 PDFs. One library for extracting, creating, and editing PDFs. MIT licensed, built on a Rust core.
Installation
pip install pdf_oxide
Requirements: Python 3.8+. Pre-built wheels are available for Linux, macOS, and Windows on both x86_64 and ARM64 architectures. No compiler or system dependencies needed.
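To confirm that the wheel landed in the active environment, the standard library can report the installed version. This is a minimal sketch: installed_version is a hypothetical helper, and it assumes the distribution is registered under the same name used in the pip command above.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution="pdf_oxide"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("pdf_oxide is not installed in this environment")
else:
    print(f"pdf_oxide {version} is installed")
```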
Opening a PDF
Use PdfDocument to open and inspect any PDF file.
from pdf_oxide import PdfDocument
doc = PdfDocument("research-paper.pdf")
print(f"Pages: {doc.page_count()}")
print(f"PDF version: {doc.version()}")
Page API
Since v0.3.34, PdfDocument is iterable and indexable. Iterating or indexing it returns PdfPage objects whose properties are computed lazily.
from pdf_oxide import PdfDocument
with PdfDocument("paper.pdf") as doc:
    for page in doc:  # len(doc), doc[i], doc[-1] all work
        text = page.text  # lazy: computed on access
        md = page.markdown(detect_headings=True)
        for table in page.tables:
            for row in table["rows"]:
                print([cell["text"] for cell in row["cells"]])
Page properties (all lazy): text, chars, words, lines, spans, tables, images, paths, annotations, width, height, bbox. Methods: markdown(), plain_text(), html(), render(), search(), region(x, y, w, h).
The editor page class was renamed to EditorPage in v0.3.34 to avoid collision with PdfPage.
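The lazy page properties listed above can be combined into a quick per-page overview. The sketch below uses only the text, width, and height properties named in this section. summarize_pages is a hypothetical helper, and the stand-in pages assume that text is a string and that width and height are numbers.

```python
from types import SimpleNamespace


def summarize_pages(doc):
    """Return one summary dict per page of an iterable document.

    `doc` is expected to behave like the iterable PdfDocument described
    above: iterating it yields pages exposing `text`, `width`, `height`.
    """
    summary = []
    for index, page in enumerate(doc):
        text = page.text  # read the lazy property once per page
        summary.append({
            "page": index + 1,  # 1-based, for display
            "size": (page.width, page.height),
            "chars": len(text),
            "empty": not text.strip(),
        })
    return summary


# Stand-in pages so the sketch runs without a PDF on disk (assumed shapes).
fake_doc = [
    SimpleNamespace(text="Hello world", width=612.0, height=792.0),
    SimpleNamespace(text="   ", width=612.0, height=792.0),
]
for row in summarize_pages(fake_doc):
    print(row)
```

With the real library, the same function would presumably be called as summarize_pages(doc) inside the with block shown earlier.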
Text Extraction
Single Page
Extract plain text from any page by its zero-based index.
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)
print(text)
All Pages
from pdf_oxide import PdfDocument
doc = PdfDocument("book.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(f"--- Page {i + 1} ---")
    print(text)
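When the goal is a single string rather than printed output, the loop above can be wrapped in a small generator. This is a sketch built only on page_count() and extract_text(); iter_page_text, extract_all_text, and FakeDocument are hypothetical names, and FakeDocument exists only so the example runs without a PDF file.

```python
def iter_page_text(doc):
    """Yield (page_number, text) pairs, with page_number starting at 1."""
    for index in range(doc.page_count()):
        yield index + 1, doc.extract_text(index)


def extract_all_text(doc, separator="\n\n"):
    """Join the text of every page into one string."""
    return separator.join(text for _, text in iter_page_text(doc))


class FakeDocument:
    """Stand-in exposing the two methods used above (not the real class)."""

    def __init__(self, pages):
        self._pages = pages

    def page_count(self):
        return len(self._pages)

    def extract_text(self, index):
        return self._pages[index]


doc = FakeDocument(["First page.", "Second page."])
print(extract_all_text(doc))
```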
Character-Level Extraction
extract_chars() returns a list of TextChar objects with precise positioning and font metadata for every character on the page.
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)
for ch in chars[:10]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) "
          f"size={ch.font_size:.1f} font={ch.font_name} "
          f"bbox={ch.bbox}")
Each TextChar has the following fields:
| Field | Type | Description |
|---|---|---|
| char | str | The Unicode character |
| x | float | Horizontal position in points |
| y | float | Vertical position in points |
| font_size | float | Font size in points |
| font_name | str | PostScript font name |
| bbox | tuple[float, float, float, float] | Bounding box (x0, y0, x1, y1) |
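One use for the bbox field is pulling out the characters that fall inside a rectangle, for example a header band or a single column. The sketch below is a hypothetical helper, not part of the library. It assumes each bbox is ordered so that x0 <= x1 and y0 <= y1, and that the region is given in the same coordinate space as the characters.

```python
from types import SimpleNamespace


def chars_in_region(chars, region):
    """Return the characters whose bounding box lies fully inside `region`.

    `region` is an (x0, y0, x1, y1) tuple in the same units as `ch.bbox`.
    """
    rx0, ry0, rx1, ry1 = region
    selected = []
    for ch in chars:
        x0, y0, x1, y1 = ch.bbox
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            selected.append(ch)
    return selected


# Stand-in characters carrying only the fields used here (made-up values).
sample_chars = [
    SimpleNamespace(char="A", bbox=(10.0, 700.0, 16.0, 712.0)),
    SimpleNamespace(char="B", bbox=(16.0, 700.0, 22.0, 712.0)),
    SimpleNamespace(char="C", bbox=(300.0, 100.0, 306.0, 112.0)),
]
inside = chars_in_region(sample_chars, (0.0, 690.0, 100.0, 720.0))
print("".join(ch.char for ch in inside))
```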
Text Spans
extract_spans() groups consecutive characters that share the same font and size into spans, giving you structured text with font metadata.
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)
for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size}")
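Span font metadata lends itself to simple layout heuristics. As an illustration, the sketch below flags spans that are noticeably larger than the dominant body size as heading candidates. find_heading_candidates and the 1.2 ratio are assumptions made for this example, and the heuristic is unrelated to whatever the library's own detect_headings option does.

```python
from collections import Counter
from types import SimpleNamespace


def find_heading_candidates(spans, ratio=1.2):
    """Return the text of spans at least `ratio` times the body font size.

    The body size is taken to be the font size that covers the most
    characters on the page.
    """
    sizes = Counter()
    for span in spans:
        sizes[round(span.font_size, 1)] += len(span.text)
    if not sizes:
        return []
    body_size = sizes.most_common(1)[0][0]
    return [
        span.text.strip()
        for span in spans
        if span.font_size >= body_size * ratio and span.text.strip()
    ]


# Stand-in spans with only the fields used above (made-up values).
sample_spans = [
    SimpleNamespace(text="Introduction", font_size=18.0),
    SimpleNamespace(text="This is body text that is long.", font_size=10.0),
    SimpleNamespace(text="More body.", font_size=10.0),
]
print(find_heading_candidates(sample_spans))
```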
Markdown Conversion
Convert a PDF page to Markdown with optional heading detection.
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
HTML Conversion
Convert a PDF page to HTML.
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
html = doc.to_html(0)
print(html)
Image Extraction
extract_images() returns a list of ImageInfo objects for every image embedded on a page, including images referenced in content streams and nested Form XObjects. To get the image data itself, use extract_image_bytes(), which returns one dict per image:
from pdf_oxide import PdfDocument
doc = PdfDocument("brochure.pdf")
images = doc.extract_image_bytes(0)
for i, img in enumerate(images):
    print(f"Image {i}: {img['width']}x{img['height']} "
          f"({len(img['data'])} bytes)")
    with open(f"image_{i}.{img['format']}", "wb") as f:
        f.write(img["data"])
Each dict returned by extract_image_bytes() has the following keys:
| Key | Type | Description |
|---|---|---|
| width | int | Image width in pixels |
| height | int | Image height in pixels |
| data | bytes | Raw image data |
| format | str | Image format (e.g. png, jpeg) |
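Pages often carry tiny decorative images such as bullets or spacer pixels. The sketch below filters on the width and height keys before writing files. save_images and the min_side threshold are hypothetical, and the sample dicts only mimic the key layout documented above.

```python
import tempfile
from pathlib import Path


def save_images(images, out_dir, min_side=32):
    """Write images at least `min_side` pixels on both sides to `out_dir`.

    Returns the list of paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, img in enumerate(images):
        if min(img["width"], img["height"]) < min_side:
            continue  # skip icons, bullets, and spacer pixels
        path = out_dir / f"image_{index}.{img['format']}"
        path.write_bytes(img["data"])
        written.append(path)
    return written


# Stand-in dicts with the documented keys (the data is not a real image).
sample_images = [
    {"width": 640, "height": 480, "format": "png", "data": b"\x89PNG fake"},
    {"width": 1, "height": 1, "format": "png", "data": b"\x89PNG tiny"},
]
with tempfile.TemporaryDirectory() as tmp:
    for path in save_images(sample_images, tmp):
        print(path.name)
```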
Opening from Bytes
Open a PDF from in-memory bytes — useful when downloading from S3, HTTP, or databases:
from pdf_oxide import PdfDocument
doc = PdfDocument.from_bytes(pdf_bytes)
text = doc.extract_text(0)
# Also supports password:
doc = PdfDocument.from_bytes(pdf_bytes, password="secret")
For the builder API:
from pdf_oxide import Pdf
pdf = Pdf.from_bytes(existing_pdf_bytes)
pdf.save("modified.pdf")
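For the HTTP case mentioned above, the bytes can be fetched with the standard library and sanity-checked before they reach from_bytes. This is a sketch: ensure_pdf and fetch_pdf_bytes are hypothetical helpers, and the check relies only on the fact that PDF files begin with the %PDF- header.

```python
import urllib.request


def ensure_pdf(data):
    """Return `data` unchanged if it starts with the PDF header, else raise."""
    if not data.startswith(b"%PDF-"):
        raise ValueError("payload does not start with the %PDF- header")
    return data


def fetch_pdf_bytes(url, timeout=10.0):
    """Download `url` and return its body after a header sanity check."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return ensure_pdf(response.read())


# The check catches the common case of an HTML error page served as a PDF.
try:
    ensure_pdf(b"<html><body>404</body></html>")
except ValueError as exc:
    print(f"rejected: {exc}")
```

The returned bytes can then be handed to PdfDocument.from_bytes() as shown above.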
Password-Protected PDFs
Pass password= in the constructor to open encrypted documents.
from pdf_oxide import PdfDocument
doc = PdfDocument("confidential.pdf", password="secret")
text = doc.extract_text(0)
print(text)
Alternatively, open the document first and call doc.authenticate(password).
PDF Creation
The Pdf class provides factory methods to create PDFs from various source formats.
From Markdown
from pdf_oxide import Pdf
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
From HTML
from pdf_oxide import Pdf
pdf = Pdf.from_html("<h1>Invoice</h1><p>Amount due: $42.00</p>")
pdf.save("invoice.pdf")
From Plain Text
from pdf_oxide import Pdf
pdf = Pdf.from_text("Plain text document.\n\nSecond paragraph.")
pdf.save("notes.pdf")
From Images
from pdf_oxide import Pdf
pdf = Pdf.from_image("scan.jpg")
pdf.save("scan.pdf")
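In practice the Markdown passed to Pdf.from_markdown() is usually assembled from data rather than written by hand. The sketch below builds such a string from (heading, paragraph) pairs. report_markdown is a hypothetical helper, and it sticks to headings and paragraphs because this page does not state which Markdown features the converter renders.

```python
def report_markdown(title, sections):
    """Build a Markdown document from a title and (heading, body) pairs."""
    parts = [f"# {title}"]
    for heading, body in sections:
        parts.append(f"## {heading}")
        parts.append(body.strip())
    # Blank lines between blocks keep headings and paragraphs separate.
    return "\n\n".join(parts) + "\n"


markdown = report_markdown(
    "Weekly Report",
    [("Summary", "All systems nominal."), ("Next steps", "Ship v2.")],
)
print(markdown)
```

The resulting string would then go through the same two calls shown above: Pdf.from_markdown(markdown) followed by save().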
Search
Search for text across the entire document or within a single page.
from pdf_oxide import PdfDocument
doc = PdfDocument("manual.pdf")
# Search all pages
results = doc.search("configuration")
for r in results:
    print(f"Page {r.page}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")
# Search a single page
page_results = doc.search_page(0, "configuration")
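Document-wide results are often easier to work with when grouped by page. The sketch below relies only on the page, x, and y attributes used in the example above. hits_by_page is a hypothetical helper, and the stand-in results carry made-up values.

```python
from collections import defaultdict
from types import SimpleNamespace


def hits_by_page(results):
    """Group search hits into {page: [(x, y), ...]}, ordered by page."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.page].append((r.x, r.y))
    return dict(sorted(grouped.items()))


# Stand-in results with the attributes used above.
sample_results = [
    SimpleNamespace(page=2, x=10.0, y=20.0),
    SimpleNamespace(page=0, x=5.0, y=5.0),
    SimpleNamespace(page=2, x=30.0, y=40.0),
]
for page, positions in hits_by_page(sample_results).items():
    print(f"Page {page}: {len(positions)} hit(s)")
```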
Error Handling
PDF Oxide raises PdfError for PDF-specific failures and standard Python exceptions for I/O problems.
from pdf_oxide import PdfDocument, PdfError
try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
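When processing many files, it is usually preferable to record a failure and move on rather than stop at the first bad PDF. The sketch below applies the same try/except pattern per file. extract_many is a hypothetical helper that takes the opener and the exception types as parameters, so the stand-in opener can demonstrate it without the library installed.

```python
def extract_many(paths, open_document, recoverable=(Exception,)):
    """Extract first-page text from each path, collecting failures.

    Returns (texts, failures): two dicts keyed by path.
    """
    texts, failures = {}, {}
    for path in paths:
        try:
            doc = open_document(path)
            texts[path] = doc.extract_text(0)
        except recoverable as exc:
            failures[path] = str(exc)  # keep going with the next file
    return texts, failures


class FakeDoc:
    """Stand-in document exposing extract_text() only."""

    def extract_text(self, index):
        return "first page"


def fake_open(path):
    """Stand-in opener that fails for one specific file name."""
    if path.endswith("missing.pdf"):
        raise FileNotFoundError(path)
    return FakeDoc()


texts, failures = extract_many(["a.pdf", "missing.pdf"], fake_open,
                               recoverable=(OSError,))
print(texts)
print(failures)
```

With the real library, the call would presumably pass PdfDocument as the opener and (PdfError, OSError) as the recoverable types; FileNotFoundError is a subclass of OSError, so it is covered.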
Next Steps
- Rust Getting Started – using PDF Oxide from Rust
- Text Extraction – detailed extraction options and recipes
- PDF Creation – advanced creation with PdfBuilder, encryption, and metadata
- Editing – modifying existing PDFs, annotations, and form fields
- API Reference – full API documentation