Migrate from pdfminer.six to PDF Oxide
A complete guide to switching from pdfminer (pdfminer.six) to PDF Oxide, covering every API you use today and how to replace it.
Why Switch from pdfminer?
There are four compelling reasons to migrate:
- ~30x faster — pdfminer is the slowest mainstream Python PDF library. PDF Oxide averages 0.8ms per page while pdfminer takes tens of milliseconds. Batch jobs that took minutes now take seconds.
- Actively maintained — pdfminer.six receives infrequent updates and has a large backlog of open issues. PDF Oxide is actively developed with regular releases.
- All-in-one library — pdfminer only does text extraction. PDF Oxide also creates PDFs, edits them, renders pages to images, extracts images, handles forms, and converts to Markdown/HTML.
- No configuration needed — pdfminer requires manual LAParams tuning (word_margin, line_margin, char_margin) to get decent results. PDF Oxide handles layout detection automatically.
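Speed figures depend heavily on the documents involved, so it can be worth measuring on your own files before and after the switch. The following is a minimal timing sketch, not an official benchmark: `time_per_page` is a hypothetical helper name, and it accepts any callable that extracts a single page by index, so neither library is imported inside it.

```python
import statistics
import time
from typing import Callable, Sequence


def time_per_page(extract_page: Callable[[int], object],
                  page_indices: Sequence[int],
                  repeats: int = 3) -> float:
    """Return the median time per page in milliseconds.

    `extract_page` is any callable that extracts one page by index,
    for example the `extract_text` method of an already opened document.
    """
    if not page_indices:
        raise ValueError("page_indices must not be empty")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for index in page_indices:
            extract_page(index)
        elapsed = time.perf_counter() - start
        # Convert seconds for the whole pass into milliseconds per page.
        samples.append(elapsed * 1000.0 / len(page_indices))
    return statistics.median(samples)
```

With a document opened as `doc = PdfDocument("report.pdf")`, a call such as `time_per_page(doc.extract_text, range(doc.page_count()))` would report a per-page figure to set against the numbers quoted above. The first pass may include one-off parsing costs, which is why the median of several passes is used.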
Step 1: Install
pip install pdf_oxide
pip uninstall pdfminer.six # optional
Step 2: Replace Imports
# Before
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LAParams
# After
from pdf_oxide import PdfDocument
Step 3: API Mapping
| Task | pdfminer | PDF Oxide |
|---|---|---|
| Extract text | extract_text("file.pdf") | PdfDocument("file.pdf").extract_text(0) |
| Extract pages | extract_pages("file.pdf") | Page-by-page with doc.extract_text(i) |
| Layout analysis | LAParams() configuration | Built-in layout detection |
| Character positions | LTChar objects | doc.extract_chars(0) |
| Encrypted PDF | Limited (fails on AES-256) | Full support |
| To Markdown | Not supported | doc.to_markdown(0) |
| Form fields | Not supported | doc.get_form_fields() |
Step 4: Common Pattern Changes
Basic Text Extraction
pdfminer’s extract_text processes the entire document at once. PDF Oxide gives you per-page control:
# pdfminer — entire document at once
from pdfminer.high_level import extract_text
text = extract_text("report.pdf")
print(text)
# PDF Oxide — per-page control
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
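Code that previously relied on a single string for the whole document can keep that behavior with a small wrapper. This is a sketch that uses only the two methods shown above, `page_count()` and `extract_text(i)`. The helper name `extract_all_text` and the form-feed separator are assumptions chosen for illustration (a form feed is a common page-break marker in plain-text output).

```python
def extract_all_text(doc, page_separator: str = "\f") -> str:
    """Concatenate the text of every page in `doc`.

    `doc` only needs `page_count()` and `extract_text(index)`, so any
    object with those two methods works here.
    """
    pages = [doc.extract_text(i) for i in range(doc.page_count())]
    return page_separator.join(pages)
```

Calling `extract_all_text(PdfDocument("report.pdf"))` then takes the place of the old `extract_text("report.pdf")` call. Pass `page_separator="\n"` if downstream code does not expect form feeds.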
Layout Analysis
pdfminer requires manual LAParams configuration. PDF Oxide handles it automatically:
# pdfminer — manual layout configuration
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams
params = LAParams(
    word_margin=0.1,
    line_margin=0.5,
    char_margin=2.0,
    boxes_flow=0.5,
)
text = extract_text("report.pdf", laparams=params)
# PDF Oxide — automatic layout detection
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0) # Layout handled automatically
Character-Level Extraction
pdfminer uses a complex tree of layout objects. PDF Oxide returns a flat list:
# pdfminer — traverse layout tree
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox
for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            for line in element:
                for char in line:
                    if isinstance(char, LTChar):
                        print(f"{char.get_text()} at ({char.x0}, {char.y0})")
# PDF Oxide — flat character list
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for c in doc.extract_chars(0):
    print(f"{c.char} at ({c.x}, {c.y})")
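A flat list drops the line grouping that pdfminer's layout tree provided, so code that walked LTTextBox lines may need to rebuild it. The sketch below groups characters by vertical position. It assumes each item exposes `.char`, `.x`, and `.y` as in the example above, and that `y` increases upward as in the usual PDF coordinate system; both the helper name `chars_to_lines` and the default tolerance are illustrative choices, not part of either library.

```python
from collections import defaultdict


def chars_to_lines(chars, y_tolerance: float = 2.0):
    """Group a flat character list into text lines, top to bottom.

    Assumes each item has `.char`, `.x`, and `.y`, and that larger `y`
    values are higher on the page.
    """
    buckets = defaultdict(list)
    for c in chars:
        # Characters whose y values fall into the same bucket share a line.
        buckets[round(c.y / y_tolerance)].append(c)
    lines = []
    for key in sorted(buckets, reverse=True):  # highest bucket first
        row = sorted(buckets[key], key=lambda c: c.x)  # left to right
        lines.append("".join(c.char for c in row))
    return lines
```

Bucketing by rounded `y` is deliberately simple: two characters on the same baseline can land in neighboring buckets if they sit near a bucket boundary, so documents with superscripts or tight leading may need a different tolerance or a proper clustering step.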
Encrypted PDFs
pdfminer has limited encryption support and fails on AES-256 encrypted files:
# pdfminer — fails on many encrypted PDFs
from pdfminer.high_level import extract_text
text = extract_text("encrypted.pdf", password="password")
# May throw an error on AES-256 encrypted files
# PDF Oxide — full encryption support
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0) # Works with all encryption methods
Markdown Conversion (New Capability)
pdfminer has no Markdown support. PDF Oxide makes it easy to feed PDFs into LLM pipelines:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    print(md)
Form Field Extraction (New Capability)
pdfminer cannot extract form fields. PDF Oxide handles them:
from pdf_oxide import PdfDocument
doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
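When the field values feed another system rather than the console, a serialized form is often more convenient than printed lines. This sketch assumes only the `.name` and `.value` attributes used above; `form_fields_to_json` is a hypothetical helper, and values that JSON cannot represent are converted with `str` as a fallback.

```python
import json


def form_fields_to_json(fields, indent: int = 2) -> str:
    """Serialize form fields as a JSON object keyed by field name.

    If two fields share a name, the later one overwrites the earlier one.
    """
    data = {f.name: f.value for f in fields}
    # default=str keeps serialization from failing on unusual value types.
    return json.dumps(data, indent=indent, default=str, ensure_ascii=False)
```

For example, `print(form_fields_to_json(doc.get_form_fields()))` would emit the whole form as one JSON document.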
Page Rendering (New Capability)
pdfminer has no rendering capability. PDF Oxide can render pages to images:
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
png_bytes = doc.render_page(0, dpi=150)
with open("page.png", "wb") as f:
    f.write(png_bytes)
Key Differences
- No LAParams tuning — PDF Oxide handles layout automatically. No need to configure word_margin, line_margin, etc.
- Speed — pdfminer is the slowest mainstream Python PDF library. PDF Oxide is ~30x faster.
- All-in-one — pdfminer only does extraction. PDF Oxide also creates, edits, and renders PDFs.
Step 5: Testing Your Migration
Run your existing test files through both libraries and compare output:
from pdf_oxide import PdfDocument
doc = PdfDocument("your-test-file.pdf")
# Verify text extraction
text = doc.extract_text(0)
print(text[:500])
# Verify page count
print(f"Pages: {doc.page_count()}")
# Verify form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
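Printing output is enough for a spot check, but comparing the two libraries across many files is easier with a numeric score. The following is one possible approach using the standard library's `difflib`: whitespace is collapsed first, since the two libraries are likely to differ in line breaks and spacing even when the words agree. The function names are illustrative.

```python
import difflib
import re


def normalize(text: str) -> str:
    """Collapse all whitespace runs so layout differences do not count."""
    return re.sub(r"\s+", " ", text).strip()


def similarity(old_text: str, new_text: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(
        None, normalize(old_text), normalize(new_text), autojunk=False
    )
    return matcher.ratio()
```

Pass the string returned by pdfminer's `extract_text("your-test-file.pdf")` as `old_text` and the joined per-page output of `doc.extract_text(i)` as `new_text`. Any pass threshold (say 0.95) is a judgment call for your corpus. `SequenceMatcher` is slow on very long inputs, so comparing page by page may be more practical for large documents.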
Related Pages
- Getting Started with Python — installation guide
- Extract Text from PDF — text extraction guide