What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Image Extraction

PDF Oxide extracts images from PDF pages by parsing the content stream, resolving XObject references via Do operators, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save them directly to disk as PNG or JPEG files.

Since v0.3.5, image extraction processes the full page content stream rather than only scanning the XObject dictionary. This correctly handles images placed via Do operators, nested Form XObjects with cycle detection, and inline images embedded with BI/ID/EI sequences.
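
The traversal described above can be sketched independently of the library. What follows is a minimal model, not PDF Oxide's implementation: the XObject enum, the single shared resource map, and the collect_images and images_on_page functions are hypothetical names introduced for illustration. It shows how tracking the set of forms currently being expanded stops the recursion when Form XObjects reference each other.

```rust
use std::collections::{HashMap, HashSet};

/// A toy stand-in for a PDF XObject (hypothetical, not a pdf_oxide type).
#[derive(Debug, Clone)]
enum XObject {
    /// An image XObject with its pixel dimensions.
    Image { width: u32, height: u32 },
    /// A Form XObject whose content stream invokes other XObjects by name via `Do`.
    Form { invokes: Vec<String> },
}

/// Collect the dimensions of every image reachable from `names`, in paint order.
/// `active` holds the forms currently being expanded, so a form that directly
/// or indirectly invokes itself is skipped instead of recursing forever.
fn collect_images(
    names: &[String],
    resources: &HashMap<String, XObject>,
    active: &mut HashSet<String>,
    out: &mut Vec<(u32, u32)>,
) {
    for name in names {
        match resources.get(name) {
            Some(XObject::Image { width, height }) => out.push((*width, *height)),
            Some(XObject::Form { invokes }) => {
                // `insert` returns false when the form is already on the stack: a cycle.
                if active.insert(name.clone()) {
                    collect_images(invokes, resources, active, out);
                    active.remove(name);
                }
            }
            None => {} // unresolved name: nothing to paint
        }
    }
}

/// Entry point: `page_invokes` lists the names the page content stream paints with `Do`.
fn images_on_page(
    page_invokes: &[&str],
    resources: &HashMap<String, XObject>,
) -> Vec<(u32, u32)> {
    let names: Vec<String> = page_invokes.iter().map(|s| s.to_string()).collect();
    let mut out = Vec::new();
    collect_images(&names, resources, &mut HashSet::new(), &mut out);
    out
}
```

Real PDFs scope resources per form rather than in one shared map; the shared map here only keeps the sketch short. Because a form is removed from the active set once its expansion finishes, the same form can still be painted several times on a page; only genuine back-references are cut off.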

Color-space support

Images in Device color spaces are delivered in their original color space, with no lossy round-tripping; Indexed, calibrated, and ICC-based color spaces are converted to RGB during decode:

DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
Indexed (1, 2, 4, 8 bits per component): palette resolved via resolve_indexed_palette and expanded to RGB through expand_indexed_to_rgb. Supports Indexed palettes built on RGB, Grayscale, and CMYK base color spaces. Previously, such images failed with Invalid RGB image dimensions errors on many real-world PDFs.
CalRGB / CalGray / ICCBased: converted to RGB during decode.

Palette expansion is hardened against malicious inputs with a checked_mul overflow guard and a 256 MiB allocation cap; truncated streams are rejected cleanly instead of producing garbage pixels.
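
The guards described above can be illustrated with a small standalone sketch. This is not the library's code: MAX_DECODED_BYTES, ExpandError, and expand_indexed are hypothetical names, and the palette is assumed to be already resolved to a flat RGB table with 3 bytes per entry. It shows the overflow-checked size arithmetic, the allocation cap, and the rejection of truncated streams.

```rust
/// Upper bound on the decoded buffer (256 MiB), mirroring the cap described above.
const MAX_DECODED_BYTES: usize = 256 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum ExpandError {
    UnsupportedBitDepth(u8),
    TooLarge,
    TruncatedStream,
    IndexOutOfPalette(u8),
}

/// Expand packed palette indices into 8-bit RGB.
/// Rows are padded to a whole byte, as PDF image rows are.
fn expand_indexed(
    packed: &[u8],
    width: usize,
    height: usize,
    bits: u8,
    palette: &[u8],
) -> Result<Vec<u8>, ExpandError> {
    if !matches!(bits, 1 | 2 | 4 | 8) {
        return Err(ExpandError::UnsupportedBitDepth(bits));
    }
    if width == 0 || height == 0 {
        return Ok(Vec::new());
    }
    // Guard every size computation so hostile dimensions cannot overflow.
    let pixels = width.checked_mul(height).ok_or(ExpandError::TooLarge)?;
    let out_len = pixels.checked_mul(3).ok_or(ExpandError::TooLarge)?;
    if out_len > MAX_DECODED_BYTES {
        return Err(ExpandError::TooLarge);
    }
    let row_bits = width.checked_mul(bits as usize).ok_or(ExpandError::TooLarge)?;
    let row_bytes = row_bits / 8 + usize::from(row_bits % 8 != 0);
    let needed = row_bytes.checked_mul(height).ok_or(ExpandError::TooLarge)?;
    if packed.len() < needed {
        return Err(ExpandError::TruncatedStream);
    }

    let mask = ((1u16 << bits) - 1) as u8;
    let mut out = Vec::with_capacity(out_len);
    for row in 0..height {
        let line = &packed[row * row_bytes..(row + 1) * row_bytes];
        for col in 0..width {
            let bit_pos = col * bits as usize;
            // Samples are packed most-significant bit first.
            let shift = 8 - bits as usize - (bit_pos % 8);
            let index = (line[bit_pos / 8] >> shift) & mask;
            let start = index as usize * 3;
            let rgb = palette
                .get(start..start + 3)
                .ok_or(ExpandError::IndexOutOfPalette(index))?;
            out.extend_from_slice(rgb);
        }
    }
    Ok(out)
}
```

Checking the sizes before allocating means a hostile width or height fails fast with an error instead of triggering a huge allocation or a wrapped length.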

Malformed-image tolerance

Images with missing /ColorSpace entries, zero dimensions, or invalid streams are skipped with a warning — they no longer panic the page render. The same tolerance applies to malformed images nested inside Form XObjects.
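
As an illustration of the skip-with-a-warning behavior, here is a minimal sketch under stated assumptions: RawImage, skip_reason, and filter_usable are hypothetical names, and the checks are a simplified reading of the conditions listed above rather than the library's actual validation.

```rust
/// Minimal description of an image dictionary (hypothetical, not a pdf_oxide type).
#[derive(Debug, Clone, PartialEq)]
struct RawImage {
    name: String,
    width: u32,
    height: u32,
    color_space: Option<String>,
    stream: Vec<u8>,
}

/// Return why an image should be skipped, or `None` if it looks usable.
fn skip_reason(img: &RawImage) -> Option<&'static str> {
    if img.width == 0 || img.height == 0 {
        Some("zero dimensions")
    } else if img.color_space.is_none() {
        Some("missing /ColorSpace")
    } else if img.stream.is_empty() {
        Some("empty stream")
    } else {
        None
    }
}

/// Keep the usable images and collect one warning per skipped image,
/// rather than aborting the whole page on the first bad entry.
fn filter_usable(images: Vec<RawImage>) -> (Vec<RawImage>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for img in images {
        match skip_reason(&img) {
            Some(reason) => warnings.push(format!("skipping image {}: {}", img.name, reason)),
            None => kept.push(img),
        }
    }
    (kept, warnings)
}
```

The design point is that one bad image yields a warning string and the loop continues, so the remaining images on the page are still returned.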

Quick Example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
    fmt.Printf("%dx%d\n", img.Width, img.Height)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}

API Reference

`extract_images(page_index) -> Vec<PdfImage>`

Extract all images from a page. Parses the page content stream to find:

XObject images referenced via Do operators
Form XObjects containing nested images (recursive, with cycle detection)
Inline images embedded with BI/ID/EI sequences

CTM (Current Transformation Matrix) tracking provides bounding boxes for each image.

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: A vector of PdfImage objects.
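
To make the CTM-based bounding box concrete: in PDF, an image is painted into the unit square, and the current transformation matrix [a b c d e f] maps that square into user space. The sketch below, which is not the library's code, concatenates a cm operand onto the CTM and derives the axis-aligned box from the four transformed corners. Matrix, Bbox, pre_concat, and image_bbox are hypothetical names.

```rust
/// A PDF transformation matrix written as the six numbers [a b c d e f].
#[derive(Debug, Clone, Copy, PartialEq)]
struct Matrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

/// Axis-aligned box in user space (hypothetical stand-in for the library's Rect).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bbox {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Map a point: x' = a*x + c*y + e, y' = b*x + d*y + f.
    fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }

    /// CTM after a `cm` operator with operand `m`: new CTM = m x old CTM,
    /// so `m` is applied to points first, then the old CTM.
    fn pre_concat(&self, m: &Matrix) -> Matrix {
        Matrix {
            a: m.a * self.a + m.b * self.c,
            b: m.a * self.b + m.b * self.d,
            c: m.c * self.a + m.d * self.c,
            d: m.c * self.b + m.d * self.d,
            e: m.e * self.a + m.f * self.c + self.e,
            f: m.e * self.b + m.f * self.d + self.f,
        }
    }
}

/// Bounding box of the unit square after transformation by `ctm`.
fn image_bbox(ctm: &Matrix) -> Bbox {
    let corners = [
        ctm.apply(0.0, 0.0),
        ctm.apply(1.0, 0.0),
        ctm.apply(0.0, 1.0),
        ctm.apply(1.0, 1.0),
    ];
    let min_x = corners.iter().map(|p| p.0).fold(f64::INFINITY, f64::min);
    let max_x = corners.iter().map(|p| p.0).fold(f64::NEG_INFINITY, f64::max);
    let min_y = corners.iter().map(|p| p.1).fold(f64::INFINITY, f64::min);
    let max_y = corners.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
    Bbox { x: min_x, y: min_y, width: max_x - min_x, height: max_y - min_y }
}
```

For example, a content stream that issues 200 0 0 100 50 60 cm before painting an image on an identity CTM yields a box at (50, 60) that is 200 wide and 100 high. Taking the minimum and maximum over all four corners keeps the box correct for rotated or mirrored images as well.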

PdfImage Fields and Methods

Method / Field	Type	Description
`width()`	`u32`	Image width in pixels
`height()`	`u32`	Image height in pixels
`color_space()`	`&ColorSpace`	Color space (DeviceRGB, DeviceGray, DeviceCMYK, etc.)
`bits_per_component()`	`u8`	Bits per color component (typically 8)
`data()`	`&ImageData`	Raw image data (JPEG bytes or raw pixels)
`bbox()`	`Option<&Rect>`	Bounding box in PDF user space (if CTM was tracked)
`save_as_png(path)`	`Result<()>`	Save image as PNG file
`save_as_jpeg(path)`	`Result<()>`	Save image as JPEG file
`to_png_bytes()`	`Result<Vec<u8>>`	Encode as PNG bytes in memory
`to_jpeg_bytes()`	`Result<Vec<u8>>`	Encode as JPEG bytes in memory

ColorSpace Variants

Variant	Description
`DeviceRGB`	3-channel RGB
`DeviceGray`	Single-channel grayscale
`DeviceCMYK`	4-channel CMYK
`Indexed`	Palette-based color
`ICCBased`	ICC profile-based color
`CalGray`	Calibrated grayscale
`CalRGB`	Calibrated RGB
`Lab`	CIE L*a*b* color

ImageData Variants

Variant	Description
`Jpeg(Vec<u8>)`	JPEG-compressed data (DCT pass-through)
`Raw { pixels, format }`	Decoded pixel data with `PixelFormat` (RGB, Gray, CMYK, RGBA)

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );

    if let Some(bbox) = image.bbox() {
        println!("  Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }

    image.save_as_png(&format!("output/image_{}.png", i))?;
}

`extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>`

Extract images from a page and save them directly to files. JPEG images are saved in their original format (zero re-encoding loss); other images are saved as PNG.

Parameter	Type	Default	Description
`page_index`	`usize`	–	Zero-based page index
`output_dir`	`impl AsRef<Path>`	–	Directory to save images (created if absent)
`prefix`	`Option<&str>`	`"img"`	Filename prefix
`start_index`	`Option<usize>`	`1`	Starting index for filenames

Returns: A vector of ExtractedImageRef describing saved files.

ExtractedImageRef Fields

Field	Type	Description
`filename`	`String`	Saved filename (e.g., `"img_001.png"`)
`format`	`ImageFormat`	`Png` or `Jpeg`
`width`	`u32`	Image width in pixels
`height`	`u32`	Image height in pixels
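
How the prefix and start_index defaults combine into filenames can be sketched as follows. This is an assumption-laden illustration, not the library's code: the pattern prefix, underscore, three-digit zero-padded index is inferred only from the "img_001.png" example above, the .jpg extension for JPEG is a guess, and Format and planned_filenames are hypothetical names.

```rust
/// Output format of a saved image (hypothetical stand-in for ImageFormat).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Format {
    Png,
    Jpeg,
}

/// Build the filenames a run would produce, applying the documented defaults
/// ("img" prefix, starting index 1) when the options are `None`.
/// ASSUMPTION: the `{prefix}_{index:03}.{ext}` pattern and the `jpg` extension.
fn planned_filenames(
    formats: &[Format],
    prefix: Option<&str>,
    start_index: Option<usize>,
) -> Vec<String> {
    let prefix = prefix.unwrap_or("img");
    let start = start_index.unwrap_or(1);
    formats
        .iter()
        .enumerate()
        .map(|(i, format)| {
            let ext = match format {
                Format::Png => "png",
                Format::Jpeg => "jpg",
            };
            format!("{}_{:03}.{}", prefix, start + i, ext)
        })
        .collect()
}
```

Under these assumptions, carrying the last index forward as the next page's start_index would keep filenames unique when several pages share one output directory and prefix.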

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;

for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}

Advanced Examples

Extract all images from all pages

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;

for page in 0..page_count {
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);

Get image bytes in memory (no disk I/O)

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());

    // Use png_bytes with an HTTP response, database, etc.
}
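
When encoded bytes are handed to an HTTP response, the content type has to match the encoding. The sketch below relies only on the well-known PNG and JPEG file signatures; sniff_mime is a hypothetical helper, not part of PDF Oxide.

```rust
/// Guess the MIME type of encoded image bytes from their leading signature.
/// Returns `None` when the bytes are neither PNG nor JPEG.
fn sniff_mime(bytes: &[u8]) -> Option<&'static str> {
    // PNG files start with this fixed 8-byte signature.
    const PNG: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if bytes.starts_with(&PNG) {
        Some("image/png")
    } else if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        // JPEG streams begin with the SOI marker FF D8, followed by another marker.
        Some("image/jpeg")
    } else {
        None
    }
}
```

Such a check can also serve as a cheap sanity test on encoded output before storing it.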

Filter images by size

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();

println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!("  {}x{} {:?}", img.width(), img.height(), img.color_space());
}

Distinguish JPEG pass-through from re-encoded images

use pdf_oxide::extractors::ImageData;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}

Related Pages

Text Extraction – Extract text alongside images
HTML Conversion – Embed extracted images in HTML output
Markdown Conversion – Include images in Markdown output
