What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

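Image Operations

PDF Oxide provides two levels of image operations: low-level operations through DocumentEditor (repositioning and resizing images by XObject name), and DOM-level access through PdfPage (querying image metadata). Both approaches work on images that are already embedded in the PDF.

Getting Page Images

DocumentEditor: Low-Level Image Information

Get detailed placement information for every image on a page, including its XObject name and transformation matrix.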

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
images = doc.page_images(0)

for img in images:
    print(f"Name: {img['name']}")
    print(f"Position: ({img['x']:.1f}, {img['y']:.1f})")
    print(f"Size: {img['width']:.1f} x {img['height']:.1f}")
    print(f"Matrix: {img['matrix']}")

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const images = doc.pageImages(0);

for (const img of images) {
  console.log(`Name: ${img.name}`);
  console.log(`Position: (${img.x}, ${img.y})`);
  console.log(`Size: ${img.width} x ${img.height}`);
}
doc.free();

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("brochure.pdf")?;
let images = editor.get_page_images(0)?;

for img in &images {
    println!("Name: {}", img.name);
    println!("Bounds: {:?}", img.bounds);  // [x, y, width, height]
    println!("Matrix: {:?}", img.matrix);  // [a, b, c, d, e, f]
}

The returned ImageInfo struct contains:

Field	Type	Description
`name`	`String`	XObject name (e.g., `"Im0"`, `"Image1"`)
`bounds`	`[f32; 4]`	Position and size in points, `[x, y, width, height]`
`matrix`	`[f32; 6]`	Full transformation matrix, `[a, b, c, d, e, f]`
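
The relationship between `matrix` and `bounds` follows from how PDF draws images: an image XObject occupies the unit square, and the transformation matrix `[a, b, c, d, e, f]` maps that square onto the page. The sketch below derives an axis-aligned `[x, y, width, height]` box from a matrix in plain Python. The helper name `bounds_from_matrix` is hypothetical and not part of pdf_oxide, and whether the library computes `bounds` exactly this way for rotated or skewed images is an assumption.

```python
def bounds_from_matrix(matrix):
    """Return [x, y, width, height] of the unit square mapped by a PDF matrix.

    Hypothetical helper for illustration; not part of pdf_oxide.
    """
    a, b, c, d, e, f = matrix
    # PDF maps a point (u, v) to (a*u + c*v + e, b*u + d*v + f).
    corners = [(0, 0), (1, 0), (0, 1), (1, 1)]
    xs = [a * u + c * v + e for u, v in corners]
    ys = [b * u + d * v + f for u, v in corners]
    return [min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)]


# An unrotated 200 x 100 pt image whose lower-left corner is at (72, 500).
print(bounds_from_matrix([200, 0, 0, 100, 72, 500]))
```

For an unrotated image the result is simply `[e, f, a, d]`; the corner-based computation only matters once `b` or `c` is non-zero.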

PdfPage: DOM-Level Image Information

The DOM API provides richer metadata for each image.

from pdf_oxide import PdfDocument

doc = PdfDocument("brochure.pdf")
page = doc.page(0)

for img in page.find_images():
    print(f"BBox: {img.bbox}")
    print(f"Format: {img.format}")
    print(f"Dimensions: {img.dimensions}")

let mut doc = Pdf::open("brochure.pdf")?;
let page = doc.page(0)?;

for img in page.find_images() {
    println!("BBox: {:?}", img.bbox());
    println!("Format: {:?}", img.format());
    println!("Dimensions: {:?}", img.dimensions());
    println!("Aspect ratio: {:.2}", img.aspect_ratio());
    println!("Grayscale: {}", img.is_grayscale());

    if let Some(alt) = img.alt_text() {
        println!("Alt text: {}", alt);
    }
    if let Some((h_dpi, v_dpi)) = img.resolution() {
        println!("Resolution: {:.0} x {:.0} DPI", h_dpi, v_dpi);
    }
}

Repositioning Images

Move an image to a new position on the page without changing its size.

from pdf_oxide import PdfDocument

doc = PdfDocument("input.pdf")
images = doc.page_images(0)

# Move the first image to position (100, 500)
doc.reposition_image(0, images[0]["name"], 100, 500)
doc.save("moved.pdf")

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const images = doc.pageImages(0);

// Move the first image to position (100, 500)
doc.repositionImage(0, images[0].name, 100, 500);
const output = doc.save();
doc.free();

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("input.pdf")?;
let images = editor.get_page_images(0)?;

// Move the first image
editor.reposition_image(0, &images[0].name, 100.0, 500.0)?;
editor.save("moved.pdf")?;

Resizing Images

Change an image's dimensions without moving it.

from pdf_oxide import PdfDocument

doc = PdfDocument("input.pdf")
images = doc.page_images(0)

# Resize the first image to 300x200 points
doc.resize_image(0, images[0]["name"], 300, 200)
doc.save("resized.pdf")

import { WasmPdfDocument } from "pdf-oxide-wasm";

const doc = new WasmPdfDocument(bytes);
const images = doc.pageImages(0);

// Resize the first image to 300x200 points
doc.resizeImage(0, images[0].name, 300, 200);
const output = doc.save();
doc.free();

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("input.pdf")?;
let images = editor.get_page_images(0)?;

editor.resize_image(0, &images[0].name, 300.0, 200.0)?;
editor.save("resized.pdf")?;

Setting Full Image Bounds

Set both position and size in a single operation.

from pdf_oxide import PdfDocument

doc = PdfDocument("input.pdf")
images = doc.page_images(0)

# Set position (72, 600) and size (468, 200)
doc.set_image_bounds(0, images[0]["name"], 72, 600, 468, 200)
doc.save("adjusted.pdf")

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("input.pdf")?;
let images = editor.get_page_images(0)?;

// x, y, width, height
editor.set_image_bounds(0, &images[0].name, 72.0, 600.0, 468.0, 200.0)?;
editor.save("adjusted.pdf")?;
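
For an unrotated image, the three editing operations can be pictured as updates to a single `[x, y, width, height]` box: repositioning replaces the first two values, resizing replaces the last two, and setting bounds replaces all four. The plain-Python sketch below models that view. The function names are hypothetical, the choice of the lower-left corner as the fixed point when resizing is an assumption, and nothing here describes how pdf_oxide rewrites the page content internally.

```python
def reposition(bounds, x, y):
    """Move the box, keeping its size."""
    _, _, width, height = bounds
    return [x, y, width, height]


def resize(bounds, width, height):
    """Change the size, keeping the lower-left corner (assumed anchor)."""
    x, y, _, _ = bounds
    return [x, y, width, height]


def matrix_for_bounds(bounds):
    """Matrix [a, b, c, d, e, f] mapping the unit square onto an unrotated box."""
    x, y, width, height = bounds
    return [width, 0, 0, height, x, y]


original = [50, 400, 150, 100]
moved = reposition(original, 100, 500)
resized = resize(original, 300, 200)
print(moved, resized, matrix_for_bounds(resize(moved, 300, 200)))
```

Seen this way, a single bounds update is equivalent to a reposition followed by a resize, which is why it is convenient when both values change.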

Managing Image Modifications

Clearing Modifications

Discard all pending image modifications on a page before saving.

doc.clear_image_modifications(0)

editor.clear_image_modifications(0);

Checking for Pending Modifications

if doc.has_image_modifications(0):
    print("Page 0 has pending image changes")

if editor.has_image_modifications(0) {
    println!("Page 0 has pending image changes");
}

PdfImage DOM API (Rust)

The DOM-level PdfImage provides rich metadata for each image found on a page.

Method	Returns	Description
`id()`	`ElementId`	Unique element identifier
`bbox()`	`Rect`	Position and size on the page
`format()`	`ImageFormat`	Image format (JPEG, PNG, etc.)
`dimensions()`	`(u32, u32)`	Width and height in pixels
`aspect_ratio()`	`f32`	Width-to-height ratio
`is_grayscale()`	`bool`	`true` if the image is grayscale
`alt_text()`	`Option<&str>`	Accessibility alt text
`set_alt_text(text)`	`()`	Set the accessibility alt text
`resolution()`	`Option<(f32, f32)>`	DPI as (horizontal, vertical)
`horizontal_dpi()`	`Option<f32>`	Horizontal DPI
`vertical_dpi()`	`Option<f32>`	Vertical DPI
`is_high_resolution()`	`bool`	Resolution >= 300 DPI
`is_medium_resolution()`	`bool`	Resolution between 150 and 300 DPI
`is_low_resolution()`	`bool`	Resolution < 150 DPI
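
The DPI values in the table relate an image's pixel dimensions to the size at which it is placed on the page: 72 PDF points make one inch, so the effective DPI is pixels divided by the placed size in inches. The following Python sketch shows that arithmetic together with the three resolution buckets. The function names are hypothetical, and the exact rule pdf_oxide applies (for example, which axis it tests, or how it treats the 150 and 300 boundaries) is an assumption.

```python
def image_dpi(pixel_size, bbox_size_pt):
    """Return (horizontal_dpi, vertical_dpi), or None if the box has no area."""
    px_w, px_h = pixel_size
    pt_w, pt_h = bbox_size_pt
    if pt_w <= 0 or pt_h <= 0:
        return None
    # 72 PDF points make one inch.
    return (px_w * 72.0 / pt_w, px_h * 72.0 / pt_h)


def classify_resolution(dpi):
    """Bucket a DPI value using the thresholds listed in the table."""
    if dpi >= 300:
        return "high"
    if dpi >= 150:
        return "medium"
    return "low"


# A 1200 x 800 px image placed at 288 x 192 pt (4 x 2.67 inches).
dpi = image_dpi((1200, 800), (288.0, 192.0))
print(dpi, classify_resolution(min(dpi)))
```

Classifying on the smaller of the two values is a conservative choice made for this sketch: an image stretched along one axis is only as sharp as its weaker direction.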

Finding Images in a Region

use pdf_oxide::geometry::Rect;

let page = doc.page(0)?;

// Find images in the top half of a US Letter page (612 x 792 pt).
// Arguments are x, y, width, height, with the origin at the bottom-left.
let region = Rect::new(0.0, 396.0, 612.0, 396.0);
let top_images = page.find_images_in_region(region);
println!("Found {} images in top half", top_images.len());
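
A region query comes down to a rectangle test against each image's bounding box. The sketch below shows the two common variants, overlap and full containment, on `(x, y, width, height)` tuples in plain Python. Which of the two `find_images_in_region` actually uses is not stated here, so treat both functions as hypothetical illustrations.

```python
def rects_intersect(a, b):
    """True if two (x, y, width, height) rectangles overlap with non-zero area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def rect_contains(outer, inner):
    """True if inner lies entirely within outer (edges may touch)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


top_half = (0.0, 396.0, 612.0, 396.0)
straddling = (72.0, 350.0, 200.0, 100.0)  # crosses the page midline
print(rects_intersect(top_half, straddling), rect_contains(top_half, straddling))
```

An image that straddles the boundary is the case where the two variants disagree, so it is worth checking against a real document before relying on either behavior.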

Setting Accessibility Alt Text

let mut page = doc.page(0)?;
let images = page.find_images();

for img in &images {
    if img.alt_text().is_none() {
        page.set_image_alt_text(img.id(), "Descriptive alt text")?;
    }
}

doc.save_page(page)?;

Complete API Reference

DocumentEditor Image Methods

Method	Returns	Description
`get_page_images(page)`	`Result<Vec<ImageInfo>>`	List all images on a page
`reposition_image(page, name, x, y)`	`Result<()>`	Move an image to a new position
`resize_image(page, name, w, h)`	`Result<()>`	Change an image's dimensions
`set_image_bounds(page, name, x, y, w, h)`	`Result<()>`	Set position and size
`clear_image_modifications(page)`	`()`	Discard pending changes
`has_image_modifications(page)`	`bool`	Check for pending changes

Advanced Example: Centering All Images

use pdf_oxide::editor::DocumentEditor;

let mut editor = DocumentEditor::open("input.pdf")?;
let count = editor.current_page_count();

for page_idx in 0..count {
    let media_box = editor.get_page_media_box(page_idx)?;
    let page_width = media_box[2] - media_box[0];

    let images = editor.get_page_images(page_idx)?;
    for img in &images {
        let img_width = img.bounds[2];
        // Offset by the media box origin, which is not always (0, 0)
        let centered_x = media_box[0] + (page_width - img_width) / 2.0;

        editor.reposition_image(page_idx, &img.name, centered_x, img.bounds[1])?;
    }
}

editor.save("centered.pdf")?;
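
The centering arithmetic can be checked in isolation before touching a document. This Python sketch computes the lower-left corner that centers an image on both axes, taking the media box origin into account since it is not guaranteed to be `(0, 0)`. The function name `centered_origin` is hypothetical, and the media box is assumed to be `[x0, y0, x1, y1]`, matching how the example above reads it.

```python
def centered_origin(media_box, img_width, img_height):
    """Lower-left (x, y) that centers an image inside media_box = [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = media_box
    x = x0 + ((x1 - x0) - img_width) / 2.0
    y = y0 + ((y1 - y0) - img_height) / 2.0
    return (x, y)


# A 200 x 100 pt image on a US Letter page.
print(centered_origin([0, 0, 612, 792], 200, 100))
```

Passing only the `x` value to a reposition call gives horizontal centering as in the example; using both values centers the image on the page.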

Advanced Example: Scaling to Fit Width

from pdf_oxide import PdfDocument

doc = PdfDocument("photos.pdf")

for page_idx in range(doc.page_count()):
    media_box = doc.page_media_box(page_idx)
    page_width = media_box[2] - media_box[0]
    margin = 72  # 1 inch

    images = doc.page_images(page_idx)
    for img in images:
        target_width = page_width - 2 * margin
        scale = target_width / img["width"]
        new_height = img["height"] * scale

        doc.set_image_bounds(
            page_idx, img["name"],
            media_box[0] + margin, img["y"],
            target_width, new_height
        )

doc.save("fitted.pdf")
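
The scaling step is worth isolating because it has two edge cases the loop above does not guard against: an image reported with zero width, and a page too narrow for the chosen margins. The sketch below is a standalone version of the same calculation in plain Python. The function name `fit_to_width` is hypothetical, and returning `None` for the degenerate cases is a design choice for this sketch, not library behavior.

```python
def fit_to_width(media_box, img_width, img_height, margin=72.0):
    """Return (x, width, height) for an image scaled to span the page between margins.

    Returns None when the image has no width or the margins leave no room.
    """
    x0, _, x1, _ = media_box
    target_width = (x1 - x0) - 2 * margin
    if img_width <= 0 or target_width <= 0:
        return None
    # A single scale factor for both axes preserves the aspect ratio.
    scale = target_width / img_width
    return (x0 + margin, target_width, img_height * scale)


# A 234 x 100 pt image on a US Letter page with 1-inch margins.
print(fit_to_width([0, 0, 612, 792], 234.0, 100.0))
```

A caller can skip any image for which the helper returns `None` and pass the remaining values straight into a bounds update.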