Migrate from pdfminer.six to PDF Oxide

A complete guide to switching from pdfminer (pdfminer.six) to PDF Oxide, covering every API you use today and how to replace it.

Why Switch from pdfminer?

There are four compelling reasons to migrate:

  1. ~30x faster — pdfminer is the slowest mainstream Python PDF library. PDF Oxide averages 0.8ms per page while pdfminer takes tens of milliseconds. Batch jobs that took minutes now take seconds.
  2. Actively maintained — pdfminer.six receives infrequent updates and has a large backlog of open issues. PDF Oxide is actively developed with regular releases.
  3. All-in-one library — pdfminer only does text extraction. PDF Oxide also creates PDFs, edits them, renders pages to images, extracts images, handles forms, and converts to Markdown/HTML.
  4. No configuration needed — pdfminer requires manual LAParams tuning (word_margin, line_margin, char_margin) to get decent results. PDF Oxide handles layout detection automatically.

Step 1: Install

pip install pdf_oxide
pip uninstall pdfminer.six  # optional

Step 2: Replace Imports

# Before
from pdfminer.high_level import extract_text, extract_pages
from pdfminer.layout import LAParams

# After
from pdf_oxide import PdfDocument

Step 3: API Mapping

Task | pdfminer | PDF Oxide
--- | --- | ---
Extract text | extract_text("file.pdf") (whole document) | PdfDocument("file.pdf").extract_text(0) (one page at a time)
Extract pages | extract_pages("file.pdf") | Page by page with doc.extract_text(i)
Layout analysis | LAParams() configuration | Built-in layout detection
Character positions | LTChar objects | doc.extract_chars(0)
Encrypted PDF | Limited (can fail on AES-256) | Full support
To Markdown | Not supported | doc.to_markdown(0)
Form fields | Not supported | doc.get_form_fields()

Step 4: Common Pattern Changes

Basic Text Extraction

pdfminer’s extract_text processes the entire document at once. PDF Oxide gives you per-page control:

# pdfminer — entire document at once
from pdfminer.high_level import extract_text
text = extract_text("report.pdf")
print(text)

# PDF Oxide — per-page control
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    print(text)
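
If downstream code still expects the single string that pdfminer's extract_text returned, the per-page loop can be wrapped in a small helper. This is a minimal sketch that relies only on the page_count() and extract_text(i) calls shown in this guide; the helper name extract_all_text and the default page separator are illustrative choices, not part of either library.

```python
def extract_all_text(doc, separator="\n\n"):
    """Join the text of every page into one string.

    `doc` is any object exposing page_count() and extract_text(i),
    such as the PdfDocument used in this guide. The helper name and
    the default separator are illustrative assumptions.
    """
    pages = [doc.extract_text(i) for i in range(doc.page_count())]
    return separator.join(pages)


# Usage (assumes pdf_oxide is installed and report.pdf exists):
# from pdf_oxide import PdfDocument
# text = extract_all_text(PdfDocument("report.pdf"))
```

pdfminer's text output typically ends each page with a form feed character, so passing separator="\f" may bring the result closer to what existing code was written against.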

Layout Analysis

pdfminer requires manual LAParams configuration. PDF Oxide handles it automatically:

# pdfminer — manual layout configuration
from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

params = LAParams(
    word_margin=0.1,
    line_margin=0.5,
    char_margin=2.0,
    boxes_flow=0.5,
)
text = extract_text("report.pdf", laparams=params)

# PDF Oxide — automatic layout detection
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
text = doc.extract_text(0)  # Layout handled automatically

Character-Level Extraction

pdfminer uses a complex tree of layout objects. PDF Oxide returns a flat list:

# pdfminer — traverse layout tree
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextBox

for page_layout in extract_pages("report.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextBox):
            for line in element:
                for char in line:
                    if isinstance(char, LTChar):
                        print(f"{char.get_text()} at ({char.x0}, {char.y0})")

# PDF Oxide — flat character list
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
for c in doc.extract_chars(0):
    print(f"{c.char} at ({c.x}, {c.y})")
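
A flat list drops the line grouping that pdfminer's LTTextBox and LTTextLine objects provided. If your code depended on that grouping, it can be approximated from the coordinates. The sketch below assumes only the .char, .x and .y attributes used above; the function name, the tolerance value, and the assumption that larger y means higher on the page are illustrative and should be checked against real output.

```python
from collections import defaultdict


def group_chars_into_lines(chars, y_tolerance=2.0, y_increases_upward=True):
    """Rebuild text lines from a flat list of positioned characters.

    Each item needs .char, .x and .y attributes. Characters whose y
    values fall into the same bucket of height `y_tolerance` are
    treated as one line. Bucketing is a simplification: characters
    that sit near a bucket edge can end up on separate lines.
    """
    buckets = defaultdict(list)
    for c in chars:
        buckets[round(c.y / y_tolerance)].append(c)

    lines = []
    # Assumption: with PDF-style coordinates, larger y is higher on the
    # page, so buckets are visited in descending order to read top-down.
    for key in sorted(buckets, reverse=y_increases_upward):
        row = sorted(buckets[key], key=lambda c: c.x)
        lines.append("".join(c.char for c in row))
    return lines


# Usage (assumes pdf_oxide is installed and report.pdf exists):
# from pdf_oxide import PdfDocument
# lines = group_chars_into_lines(PdfDocument("report.pdf").extract_chars(0))
```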

Encrypted PDFs

pdfminer has limited encryption support and can fail on AES-256 encrypted files:

# pdfminer — fails on many encrypted PDFs
from pdfminer.high_level import extract_text
text = extract_text("encrypted.pdf", password="password")
# May throw an error on AES-256 encrypted files

# PDF Oxide — full encryption support
from pdf_oxide import PdfDocument
doc = PdfDocument("encrypted.pdf", password="password")
text = doc.extract_text(0)  # Works with all encryption methods

Markdown Conversion (New Capability)

pdfminer has no Markdown support. PDF Oxide makes it easy to feed PDFs into LLM pipelines:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
for i in range(doc.page_count()):
    md = doc.to_markdown(i)
    print(md)
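
For a pipeline, printing is usually replaced by writing one record per page. The following sketch uses only page_count() and to_markdown(i) as shown above; the function names and the JSON Lines record layout are assumptions made for illustration.

```python
import json


def markdown_records(doc):
    """Yield one dict per page with its zero-based index and Markdown."""
    for i in range(doc.page_count()):
        yield {"page": i, "markdown": doc.to_markdown(i)}


def to_jsonl(records):
    """Serialize records as JSON Lines, one page per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Usage (assumes pdf_oxide is installed and report.pdf exists):
# from pdf_oxide import PdfDocument
# with open("report.jsonl", "w", encoding="utf-8") as out:
#     out.write(to_jsonl(markdown_records(PdfDocument("report.pdf"))))
```

Keeping the page index in each record makes it possible to trace a retrieved passage back to its source page.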

Form Field Extraction (New Capability)

pdfminer cannot extract form fields. PDF Oxide handles them:

from pdf_oxide import PdfDocument

doc = PdfDocument("form.pdf")
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
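
Because form extraction is new to a pdfminer codebase, a quick completeness check can help when wiring it in. This sketch assumes only the .name and .value attributes used above; both helper names are hypothetical, and treating None or an empty string as "missing" is an assumption about how unfilled fields are reported.

```python
def form_fields_to_dict(fields):
    """Map field names to values; a later duplicate name wins."""
    return {f.name: f.value for f in fields}


def missing_required(fields, required_names):
    """Return the required names that are absent or have no value.

    Assumption: an unfilled field is reported as None or "".
    """
    values = form_fields_to_dict(fields)
    return [name for name in required_names if values.get(name) in (None, "")]


# Usage (assumes pdf_oxide is installed and form.pdf exists):
# from pdf_oxide import PdfDocument
# fields = PdfDocument("form.pdf").get_form_fields()
# print(missing_required(fields, ["name", "email"]))
```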

Page Rendering (New Capability)

pdfminer has no rendering capability. PDF Oxide can render pages to images:

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
png_bytes = doc.render_page(0, dpi=150)
with open("page.png", "wb") as f:
    f.write(png_bytes)
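
To render a whole document, the single-page call can be looped. This is a sketch that assumes render_page(i, dpi=...) returns PNG bytes for every page, as in the example above; the function name and the file naming scheme are illustrative.

```python
from pathlib import Path


def render_all_pages(doc, out_dir, dpi=150):
    """Write every page to out_dir as page-0001.png, page-0002.png, ...

    Returns the list of written paths in page order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(doc.page_count()):
        path = out / f"page-{i + 1:04d}.png"
        path.write_bytes(doc.render_page(i, dpi=dpi))
        paths.append(path)
    return paths


# Usage (assumes pdf_oxide is installed and report.pdf exists):
# from pdf_oxide import PdfDocument
# render_all_pages(PdfDocument("report.pdf"), "pages", dpi=150)
```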

Key Differences

  1. No LAParams tuning: PDF Oxide handles layout automatically, so there is no need to configure word_margin, line_margin, and similar parameters.
  2. Speed: pdfminer is the slowest mainstream Python PDF library. PDF Oxide is roughly 30x faster.
  3. All-in-one: pdfminer only does extraction. PDF Oxide also creates, edits, and renders PDFs.
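
The speed difference will vary with the documents involved, so it is worth measuring on your own files. Below is a minimal timing sketch using only the standard library; best_time is a hypothetical helper, and the commented usage assumes both libraries are installed.

```python
import time


def best_time(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest run in seconds.

    Taking the minimum reduces noise from other processes.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Usage (assumes both libraries are installed and report.pdf exists):
# from pdfminer.high_level import extract_text
# from pdf_oxide import PdfDocument
# doc = PdfDocument("report.pdf")
# old = best_time(lambda: extract_text("report.pdf"))
# new = best_time(lambda: [doc.extract_text(i) for i in range(doc.page_count())])
# print(f"pdfminer: {old:.3f}s, PDF Oxide: {new:.3f}s")
```

Note that the usage above opens the document once for PDF Oxide outside the timed region; move the PdfDocument call inside the lambda if you want parsing time included on both sides.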

Step 5: Testing Your Migration

Run your existing test files through PDF Oxide and compare the output with what pdfminer produced for the same files:

from pdf_oxide import PdfDocument

doc = PdfDocument("your-test-file.pdf")

# Verify text extraction
text = doc.extract_text(0)
print(text[:500])

# Verify page count
print(f"Pages: {doc.page_count()}")

# Verify form fields (if applicable)
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name}: {f.value}")
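
Exact string equality is unlikely between two extractors, since whitespace and line breaks tend to differ. A word-level similarity score is a more forgiving check. This sketch uses the standard library's difflib; the function name and any pass threshold you pick are assumptions, not something either library prescribes.

```python
import difflib


def similarity(old_text, new_text):
    """Return a 0..1 ratio of how closely two extractions match.

    Texts are compared as word sequences, so differences in spacing
    and line breaks are ignored.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    return matcher.ratio()


# Usage (assumes both libraries are installed and the file exists):
# from pdfminer.high_level import extract_text
# from pdf_oxide import PdfDocument
# old = extract_text("your-test-file.pdf", page_numbers=[0])
# new = PdfDocument("your-test-file.pdf").extract_text(0)
# print(f"Page 0 similarity: {similarity(old, new):.2%}")
```

Pages that score well below the rest are the ones worth inspecting by hand, for example multi-column layouts or tables where reading order may differ.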
