What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

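Image extraction

PDF Oxide extracts images from PDF pages by parsing content streams, resolving XObject references through the Do operator, recursing into nested Form XObjects, and decoding inline images. Use extract_images() to get image objects in memory, or extract_images_to_files() to save them directly as PNG or JPEG files.

As of v0.3.5, image extraction processes the full page content stream instead of only scanning the XObject dictionary. This correctly handles images placed with the Do operator, nested Form XObjects (with cycle detection), and inline images embedded as BI/ID/EI sequences.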
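The recursion with cycle detection can be pictured with a small, self-contained sketch. It does not use PDF Oxide's internals: the `XObject` enum, the numeric ids, and `collect_images` are hypothetical stand-ins, and the only idea taken from the text is that a form already on the current recursion path must not be entered again.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical, simplified stand-in for an XObject resource.
enum XObject {
    /// An image XObject, identified here only by a label.
    Image(&'static str),
    /// A Form XObject whose content stream invokes other XObjects by id via `Do`.
    Form(Vec<u32>),
}

/// Collects image labels reachable from `ids`, descending into nested forms.
/// `active` holds the forms on the current recursion path; meeting one of them
/// again means the file contains a reference cycle, so that branch is skipped.
fn collect_images(
    ids: &[u32],
    objects: &HashMap<u32, XObject>,
    active: &mut HashSet<u32>,
    out: &mut Vec<&'static str>,
) {
    for id in ids {
        match objects.get(id) {
            Some(XObject::Image(label)) => out.push(*label),
            Some(XObject::Form(children)) => {
                if !active.insert(*id) {
                    continue; // already on the current path: cycle detected
                }
                collect_images(children, objects, active, out);
                active.remove(id);
            }
            None => {} // dangling reference: ignored in this sketch
        }
    }
}

/// Builds a tiny object graph with a self-referencing form and walks it.
fn demo_cycle_walk() -> Vec<&'static str> {
    let mut objects = HashMap::new();
    objects.insert(1, XObject::Image("logo"));
    objects.insert(2, XObject::Form(vec![3, 2])); // form 2 invokes itself
    objects.insert(3, XObject::Image("chart"));
    let mut out = Vec::new();
    collect_images(&[1, 2], &objects, &mut HashSet::new(), &mut out);
    out
}
```

Tracking only the active path (rather than every form ever visited) lets the same form be drawn twice on one page while still stopping infinite recursion.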

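Color space support

Extracted images are decoded in their original color space, with no lossy conversion.

DeviceRGB / DeviceGray / DeviceCMYK: returned as-is.
Indexed (1, 2, 4, or 8 bits per component): the palette is resolved with resolve_indexed_palette and expanded with expand_indexed_to_rgb. Indexed palettes built on RGB, grayscale, and CMYK base color spaces are supported. Previously, many real-world PDFs failed here with an Invalid RGB image dimensions error.
CalRGB / CalGray / ICCBased: converted to RGB during decoding.

Palette expansion is hardened against malicious input with checked_mul overflow guards and a 256 MiB allocation cap, and truncated streams are rejected cleanly instead of producing incorrect pixels.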
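As an illustration of those guards, here is a minimal sketch of indexed-to-RGB expansion. It is not the library's expand_indexed_to_rgb: the function name `expand_indexed_sketch`, its signature, and the flat 3-bytes-per-entry palette layout are assumptions made for this example. Only the overflow checks, the 256 MiB cap, and the rejection of truncated input come from the description above.

```rust
/// Upper bound on the expanded buffer, mirroring the 256 MiB cap described above.
const MAX_EXPANDED_BYTES: usize = 256 * 1024 * 1024;

/// Expands packed palette indices into 8-bit RGB (hypothetical helper).
/// `palette` is a flat RGB lookup table (3 bytes per entry) and `bits` is the
/// number of bits per index (1, 2, 4, or 8). Rows are padded to a byte boundary.
fn expand_indexed_sketch(
    indices: &[u8],
    palette: &[u8],
    width: usize,
    height: usize,
    bits: usize,
) -> Result<Vec<u8>, String> {
    if !matches!(bits, 1 | 2 | 4 | 8) {
        return Err(format!("unsupported bit depth: {bits}"));
    }
    // Overflow-checked output size: width * height * 3 bytes.
    let out_len = width
        .checked_mul(height)
        .and_then(|n| n.checked_mul(3))
        .ok_or("image dimensions overflow")?;
    if out_len > MAX_EXPANDED_BYTES {
        return Err("expanded image exceeds allocation cap".into());
    }
    // Each packed row occupies ceil(width * bits / 8) bytes.
    let row_bytes = width
        .checked_mul(bits)
        .map(|b| b / 8 + usize::from(b % 8 != 0))
        .ok_or("row size overflow")?;
    let needed = row_bytes.checked_mul(height).ok_or("stream size overflow")?;
    if indices.len() < needed {
        return Err("truncated index stream".into());
    }
    let mask = (1u16 << bits) - 1;
    let mut out = Vec::with_capacity(out_len);
    for y in 0..height {
        let row = &indices[y * row_bytes..(y + 1) * row_bytes];
        for x in 0..width {
            let bit_pos = x * bits;
            // Samples are packed most-significant bit first.
            let shift = 8 - bits - (bit_pos % 8);
            let idx = (((row[bit_pos / 8] as u16) >> shift) & mask) as usize;
            let entry = palette
                .get(idx * 3..idx * 3 + 3)
                .ok_or("palette index out of range")?;
            out.extend_from_slice(entry);
        }
    }
    Ok(out)
}
```

Every size is computed with checked arithmetic before any allocation, so hostile width and height values fail early instead of wrapping around.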

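Tolerant handling of broken images

Images with a missing /ColorSpace entry, zero dimensions, or a malformed stream are skipped with a warning. Page rendering does not panic, and the same applies to broken images nested inside Form XObjects.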
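The skip-with-warning behavior can be sketched as a validation pass that never aborts. The `RawImage` struct and `partition_images` function below are hypothetical and much simpler than a real decoder, and an empty stream stands in for "malformed" here.

```rust
/// Hypothetical, simplified description of an image found on a page.
struct RawImage {
    width: u32,
    height: u32,
    color_space: Option<&'static str>,
    data: Vec<u8>,
}

/// Returns the indices of usable images plus one warning per skipped image.
/// Nothing here panics: every problem becomes a warning string.
fn partition_images(images: &[RawImage]) -> (Vec<usize>, Vec<String>) {
    let mut kept = Vec::new();
    let mut warnings = Vec::new();
    for (i, img) in images.iter().enumerate() {
        let problem = if img.color_space.is_none() {
            Some("missing /ColorSpace")
        } else if img.width == 0 || img.height == 0 {
            Some("zero-sized image")
        } else if img.data.is_empty() {
            Some("empty stream")
        } else {
            None
        };
        match problem {
            Some(reason) => warnings.push(format!("image {i} skipped: {reason}")),
            None => kept.push(i),
        }
    }
    (kept, warnings)
}
```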

Quick example

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
images = doc.extract_image_bytes(0)
for img in images:
    print(f"{img['width']}x{img['height']}")

Node.js

const { PdfDocument } = require("pdf-oxide");

const doc = new PdfDocument("report.pdf");
const images = doc.getEmbeddedImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Go

import pdfoxide "github.com/yfedoseev/pdf_oxide/go"

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()
images, _ := doc.Images(0)
for _, img := range images {
    fmt.Printf("%dx%d\n", img.Width, img.Height)
}

C#

using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");
var images = doc.ExtractImages(0);
foreach (var img in images)
{
    Console.WriteLine($"{img.Width}x{img.Height}");
}

WASM

const doc = new WasmPdfDocument(bytes);
const images = doc.extractImages(0);
for (const img of images) {
    console.log(`${img.width}x${img.height}`);
}

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;
for img in &images {
    println!("{}x{} {:?}", img.width(), img.height(), img.color_space());
}

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;
import java.util.List;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    List<ExtractedImage> images = doc.page(0).images();
    for (ExtractedImage img : images) {
        System.out.println(img.width() + "x" + img.height());
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()}")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height}")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (str (.width img) "x" (.height img)))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d\n", img.width, img.height);
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
for img in try doc.embeddedImages(0) {
    print("\(img.width)x\(img.height)")
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height}');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d\n", img$width, img$height))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height)")
end

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d}\n", .{ img.width, img.height });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld", (long)img.width, (long)img.height);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height}")
end

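API reference

`extract_images(page_index) -> Vec<PdfImage>`

Extracts all images from a page. It parses the page content stream to find:

XObject images: referenced by the Do operator
Form XObjects: containing nested images (recursive, with cycle detection)
Inline images: embedded as BI/ID/EI sequences

CTM (current transformation matrix) tracking provides a bounding box for each image.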
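In PDF, an image is painted into the unit square of image space and then mapped to the page by the CTM, so a bounding box follows from transforming the four corners. The sketch below shows that calculation in isolation. The `BBox` struct and `image_bbox` function are hypothetical names for this example, not PDF Oxide types.

```rust
/// Axis-aligned bounding box in PDF user space (hypothetical type).
#[derive(Debug, PartialEq)]
struct BBox {
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Maps the unit square through a CTM `[a, b, c, d, e, f]` and returns the
/// axis-aligned box that encloses the four transformed corners.
fn image_bbox(ctm: [f64; 6]) -> BBox {
    let [a, b, c, d, e, f] = ctm;
    let corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)];
    let mut min_x = f64::INFINITY;
    let mut min_y = f64::INFINITY;
    let mut max_x = f64::NEG_INFINITY;
    let mut max_y = f64::NEG_INFINITY;
    for (x, y) in corners {
        // x' = a*x + c*y + e, y' = b*x + d*y + f
        let tx = a * x + c * y + e;
        let ty = b * x + d * y + f;
        min_x = min_x.min(tx);
        min_y = min_y.min(ty);
        max_x = max_x.max(tx);
        max_y = max_y.max(ty);
    }
    BBox {
        x: min_x,
        y: min_y,
        width: max_x - min_x,
        height: max_y - min_y,
    }
}
```

Taking the min and max over all four corners keeps the result correct for flipped or rotated placements, where the transformed origin is not the lower-left corner.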

Parameter	Type	Description
`page_index`	`int` / `usize`	Zero-based page index

Returns: a vector of PdfImage objects.

PdfImage fields and methods

Method / field	Type	Description
`width()`	`u32`	Image width in pixels
`height()`	`u32`	Image height in pixels
`color_space()`	`&ColorSpace`	Color space (DeviceRGB, DeviceGray, DeviceCMYK, and so on)
`bits_per_component()`	`u8`	Bits per color component (typically 8)
`data()`	`&ImageData`	Raw image data (JPEG bytes or raw pixels)
`bbox()`	`Option<&Rect>`	Bounding box in PDF user space (when the CTM was tracked)
`save_as_png(path)`	`Result<()>`	Saves the image as a PNG file
`save_as_jpeg(path)`	`Result<()>`	Saves the image as a JPEG file
`to_png_bytes()`	`Result<Vec<u8>>`	Encodes the image to PNG bytes in memory
`to_jpeg_bytes()`	`Result<Vec<u8>>`	Encodes the image to JPEG bytes in memory
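Width, height, and bits per component together determine how large an unfiltered sample buffer should be, which is a handy sanity check before trusting raw pixels. The helper below is a sketch under the PDF convention that each row of samples is padded to a whole byte. `expected_raw_len` is a hypothetical name, the component count must be supplied by the caller from the color space, and whether PDF Oxide's decoded buffers keep that row padding is not stated in this page.

```rust
/// Expected byte length of an unfiltered sample buffer, or `None` on overflow.
/// Each row is padded to a whole number of bytes.
fn expected_raw_len(
    width: u32,
    height: u32,
    components: u32,
    bits_per_component: u8,
) -> Option<usize> {
    let bits_per_row = (width as usize)
        .checked_mul(components as usize)?
        .checked_mul(bits_per_component as usize)?;
    // Round the row up to the next byte boundary.
    let row_bytes = bits_per_row / 8 + usize::from(bits_per_row % 8 != 0);
    row_bytes.checked_mul(height as usize)
}
```

For example, a 10x2 RGB image at 8 bits per component needs 60 bytes, while a 10x2 one-component image at 1 bit per component needs 4 bytes because each 10-bit row rounds up to 2 bytes.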

ColorSpace variants

Variant	Description
`DeviceRGB`	3-channel RGB
`DeviceGray`	Single-channel grayscale
`DeviceCMYK`	4-channel CMYK
`Indexed`	Palette-based color
`ICCBased`	ICC profile-based color
`CalGray`	Calibrated grayscale
`CalRGB`	Calibrated RGB
`Lab`	CIE L*a*b* color

ImageData variants

Variant	Description
`Jpeg(Vec<u8>)`	JPEG-compressed data (DCT pass-through)
`Raw { pixels, format }`	Decoded pixel data with a `PixelFormat` (RGB, Gray, CMYK, RGBA)

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    println!(
        "Image {}: {}x{} {:?} {}bpc",
        i, image.width(), image.height(),
        image.color_space(), image.bits_per_component(),
    );

    if let Some(bbox) = image.bbox() {
        println!("  Position: ({:.1}, {:.1})", bbox.x, bbox.y);
    }

    image.save_as_png(&format!("output/image_{}.png", i))?;
}

`extract_images_to_files(page_index, output_dir, prefix, start_index) -> Vec<ExtractedImageRef>`

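Extracts images from a page and saves them directly to files. JPEG images are saved in their original format (no re-encoding loss), and all other images are saved as PNG.

Parameter	Type	Default	Description
`page_index`	`usize`	–	Zero-based page index
`output_dir`	`impl AsRef<Path>`	–	Directory to save images in (created if missing)
`prefix`	`Option<&str>`	`"img"`	Filename prefix
`start_index`	`Option<usize>`	`1`	Starting index for filenames

Returns: a vector of ExtractedImageRef values describing the saved files.

ExtractedImageRef fields

Field	Type	Description
`filename`	`String`	Saved filename (for example `"img_001.png"`)
`format`	`ImageFormat`	`Png` or `Jpeg`
`width`	`u32`	Image width in pixels
`height`	`u32`	Image height in pixels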
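To make the interaction of prefix, start_index, and format concrete, here is a sketch of how filenames of the shape shown above could be derived. It is a guess at the naming scheme, not the library's code. The three-digit zero padding is inferred from the `"img_001.png"` example, the `.jpg` extension is an assumption, and `Format` and `planned_filenames` are hypothetical names.

```rust
/// Hypothetical mirror of the `ImageFormat` variants listed above.
enum Format {
    Png,
    Jpeg,
}

/// Derives one filename per image from the prefix and starting index,
/// applying the documented defaults ("img" and 1) when they are `None`.
fn planned_filenames(
    formats: &[Format],
    prefix: Option<&str>,
    start_index: Option<usize>,
) -> Vec<String> {
    let prefix = prefix.unwrap_or("img");
    let start = start_index.unwrap_or(1);
    formats
        .iter()
        .enumerate()
        .map(|(offset, format)| {
            // Assumed extensions: JPEG pass-through keeps a JPEG suffix.
            let ext = match format {
                Format::Png => "png",
                Format::Jpeg => "jpg",
            };
            format!("{}_{:03}.{}", prefix, start + offset, ext)
        })
        .collect()
}
```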

Rust

let mut doc = PdfDocument::open("report.pdf")?;
let refs = doc.extract_images_to_files(0, "output/images", Some("fig"), Some(1))?;

for img_ref in &refs {
    println!("Saved: {} ({}x{}, {:?})", img_ref.filename, img_ref.width, img_ref.height, img_ref.format);
}

Advanced examples

Extract images from every page

use pdf_oxide::PdfDocument;
use std::path::Path;

let mut doc = PdfDocument::open("book.pdf")?;
let page_count = doc.page_count()?;
let mut total = 0;

for page in 0..page_count {
    let refs = doc.extract_images_to_files(
        page,
        "output/images",
        Some(&format!("page{}", page + 1)),
        Some(1),
    )?;
    total += refs.len();
    println!("Page {}: {} images", page + 1, refs.len());
}
println!("Total: {} images extracted", total);

Get image bytes in memory (no disk I/O)

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for image in &images {
    let png_bytes = image.to_png_bytes()?;
    println!("PNG size: {} bytes", png_bytes.len());

    // Use png_bytes with an HTTP response, database, etc.
}

Filter images by size

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

// Only keep images larger than 100x100 pixels
let large_images: Vec<_> = images.iter()
    .filter(|img| img.width() > 100 && img.height() > 100)
    .collect();

println!("{} large images on page 1", large_images.len());
for img in &large_images {
    println!("  {}x{} {:?}", img.width(), img.height(), img.color_space());
}

Distinguish JPEG pass-through from re-encoded images

use pdf_oxide::extractors::ImageData;

let mut doc = PdfDocument::open("report.pdf")?;
let images = doc.extract_images(0)?;

for (i, image) in images.iter().enumerate() {
    match image.data() {
        ImageData::Jpeg(bytes) => {
            // Original JPEG data -- save directly for zero quality loss
            std::fs::write(format!("image_{}.jpg", i), bytes)?;
            println!("Image {}: JPEG pass-through ({} bytes)", i, bytes.len());
        }
        ImageData::Raw { format, .. } => {
            // Raw pixels -- must encode to a file format
            image.save_as_png(&format!("image_{}.png", i))?;
            println!("Image {}: raw {:?} ({}x{})", i, format, image.width(), image.height());
        }
    }
}

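Embedded image accessor (`embedded_images`)

extract_images() is the rich in-memory Rust API. The cross-language bindings expose a lighter embedded image accessor built on the same content stream traversal. It returns each image's pixel dimensions, format, color space, bits per component, and raw decoded bytes. It is implemented by the C ABI function pdf_document_get_embedded_images and the pdf_oxide_image_* accessor family.

How to list embedded images with the bindings

Go

import (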
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

images, _ := doc.Images(0) // []pdfoxide.Image
for _, img := range images {
    fmt.Printf("%dx%d %s/%s %dbpc, %d bytes\n",
        img.Width, img.Height, img.Format, img.Colorspace,
        img.BitsPerComponent, len(img.Data))
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let images = try doc.embeddedImages(0) // [Image]
for img in images {
    print("\(img.width)x\(img.height) \(img.format)/\(img.colorspace) "
        + "\(img.bitsPerComponent)bpc, \(img.data.count) bytes")
}

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiImageList *images = pdf_document_get_embedded_images(doc, /*page=*/0, &err);
int32_t n = pdf_oxide_image_count(images);
for (int32_t i = 0; i < n; i++) {
    int32_t w = pdf_oxide_image_get_width(images, i, &err);
    int32_t h = pdf_oxide_image_get_height(images, i, &err);
    char *fmt = pdf_oxide_image_get_format(images, i, &err);
    char *cs  = pdf_oxide_image_get_colorspace(images, i, &err);
    printf("%dx%d %s/%s\n", w, h, fmt, cs);
    free_string(fmt);
    free_string(cs);
}
pdf_oxide_image_list_free(images);

Java

import fyi.oxide.pdf.PdfDocument;
import fyi.oxide.pdf.image.ExtractedImage;
import java.nio.file.Path;

try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
    for (ExtractedImage img : doc.page(0).images()) {
        System.out.printf("%dx%d %s, %d bytes%n",
            img.width(), img.height(), img.format(), img.bytes().length);
    }
}

Kotlin

import fyi.oxide.pdf.PdfDocument

PdfDocument.open(java.nio.file.Path.of("report.pdf")).use { doc ->
    for (img in doc.page(0).images()) {
        println("${img.width()}x${img.height()} ${img.format()}, ${img.bytes().size} bytes")
    }
}

Scala

import fyi.oxide.pdf.{PdfDocument, imagesSeq}
import scala.util.Using

Using.resource(PdfDocument.open("report.pdf")) { doc =>
  for (img <- doc.page(0).imagesSeq) {
    println(s"${img.width}x${img.height} ${img.format}, ${img.bytes.length} bytes")
  }
}

Clojure

(require '[pdf-oxide.core :as pdf])

(with-open [doc (pdf/open "report.pdf")]
  (doseq [img (pdf/images (pdf/page doc 0))]
    (println (format "%dx%d %s, %d bytes"
                     (.width img) (.height img) (.format img) (count (.bytes img))))))

C++

#include <pdf_oxide/pdf_oxide.hpp>

auto doc = pdf_oxide::Document::open("report.pdf");
for (const auto& img : doc.embedded_images(0)) {
    std::printf("%dx%d %s/%s %dbpc, %zu bytes\n",
        img.width, img.height, img.format.c_str(), img.colorspace.c_str(),
        img.bits_per_component, img.data.size());
}

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
for (final img in doc.embeddedImages(0)) {
    print('${img.width}x${img.height} ${img.format}/${img.colorspace} '
        '${img.bitsPerComponent}bpc, ${img.data.length} bytes');
}

R

library(pdfoxide)

doc <- pdf_open("report.pdf")
for (img in pdf_embedded_images(doc, 0)) {
    cat(sprintf("%dx%d %s/%s %dbpc, %d bytes\n",
        img$width, img$height, img$format, img$colorspace,
        img$bits_per_component, length(img$data)))
}

Julia

using PdfOxide

doc = open_document("report.pdf")
for img in embedded_images(doc, 0)
    println("$(img.width)x$(img.height) $(img.format)/$(img.colorspace) " *
            "$(img.bitsPerComponent)bpc, $(length(img.data)) bytes")
end

Zig

const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;

var doc = try pdf_oxide.Document.open("report.pdf");
const images = try doc.embeddedImages(a, 0);
for (images) |img| {
    std.debug.print("{d}x{d} {s}/{s} {d}bpc, {d} bytes\n", .{
        img.width, img.height, img.format, img.colorspace,
        img.bits_per_component, img.data.len,
    });
}

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
for (POXImage *img in [doc embeddedImages:0 error:&err]) {
    NSLog(@"%ldx%ld %@/%@ %ldbpc, %lu bytes",
        (long)img.width, (long)img.height, img.format, img.colorspace,
        (long)img.bitsPerComponent, (unsigned long)img.data.length);
}

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, images} = PdfOxide.embedded_images(doc, 0)
for img <- images do
  IO.puts("#{img.width}x#{img.height} #{img.format}/#{img.colorspace} " <>
          "#{img.bits_per_component}bpc, #{byte_size(img.data)} bytes")
end

Image accessor fields

Field (Go / Swift)	Type	Description
`Width` / `width`	`int`	Image width in pixels
`Height` / `height`	`int`	Image height in pixels
`Format` / `format`	`string`	Source format string (for example `"jpeg"`, `"raw"`)
`Colorspace` / `colorspace`	`string`	Color space name (for example `"DeviceRGB"`)
`BitsPerComponent` / `bitsPerComponent`	`int`	Bits per color component
`Data` / `data`	`[]byte` / `[UInt8]`	Raw decoded image bytes

Binding coverage. The embedded image accessor is available in Go (doc.Images(page)), Swift (doc.embeddedImages(page)), and the C ABI (pdf_document_get_embedded_images). In Rust, use the richer extract_images() described above. It is not compiled for WASM targets.

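Page element accessor (`page_elements`)

page_elements returns every placed element on a page (text spans with their type, text, and bounding box) as a single list. The bindings marshal the whole list in a single FFI call through pdf_oxide_elements_to_json, which makes this the most efficient way to traverse page layout without re-running text extraction for each region. It is implemented by the C ABI function pdf_page_get_elements and the pdf_oxide_element_* accessor family.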
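Once the whole list is in memory, region queries become plain filtering with no further extraction calls. The sketch below shows that idea with a hypothetical `Element` struct that carries only the text and bounding box described above. It is not a PDF Oxide type, and `text_in_region` is an illustrative helper.

```rust
/// Hypothetical element record with the text and bounding box described above.
struct Element {
    text: String,
    x: f64,
    y: f64,
    width: f64,
    height: f64,
}

/// Returns the text of every element whose box lies fully inside the region
/// `(rx, ry, rw, rh)`, in the order the elements appear in the list.
fn text_in_region(elements: &[Element], rx: f64, ry: f64, rw: f64, rh: f64) -> Vec<&str> {
    elements
        .iter()
        .filter(|el| {
            el.x >= rx
                && el.y >= ry
                && el.x + el.width <= rx + rw
                && el.y + el.height <= ry + rh
        })
        .map(|el| el.text.as_str())
        .collect()
}
```

The same list can be queried for any number of regions, for example a header band and a footer band, without touching the document again.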

How to traverse page layout elements

Go

import (
    "fmt"
    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

doc, _ := pdfoxide.Open("report.pdf")
defer doc.Close()

elements, _ := doc.PageElements(0) // []pdfoxide.Element
for _, el := range elements {
    fmt.Printf("[%s] %q at (%.1f, %.1f) %.1fx%.1f\n",
        el.Type, el.Text, el.X, el.Y, el.Width, el.Height)
}

Swift

import PdfOxide

let doc = try Document.open("report.pdf")
let elements = try doc.pageElements(0) // ElementList
for el in try elements.all() {
    print("[\(el.type)] \(el.text) at "
        + "(\(el.rect.x), \(el.rect.y)) \(el.rect.width)x\(el.rect.height)")
}

// Serialize the whole list to JSON in one call:
let json = try elements.toJson()

C ABI

#include "pdf_oxide.h"

int32_t err = 0;
FfiElementList *els = pdf_page_get_elements(doc, /*page=*/0, &err);

// One-shot JSON serialization (caller frees with free_string):
char *json = pdf_oxide_elements_to_json(els, &err);
printf("%s\n", json);
free_string(json);

pdf_oxide_elements_free(els);

Dart

import 'package:pdf_oxide/pdf_oxide.dart';

final doc = PdfDocument.open('report.pdf');
final elements = doc.pageElements(0); // ElementList
for (final el in elements.toList()) {
    print('[${el.type}] ${el.text} at '
        '(${el.rect.x}, ${el.rect.y}) ${el.rect.width}x${el.rect.height}');
}

// Serialize the whole list to JSON in one call:
final json = elements.toJson();

Objective-C

#import "POXPdfOxide.h"
NSError *err = nil;

POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
POXElementList *els = [doc pageElements:0 error:&err];
for (int32_t i = 0; i < [els count]; i++) {
    NSString *type = [els typeAtIndex:i error:&err];
    NSString *text = [els textAtIndex:i error:&err];
    POXBbox rect = [els rectAtIndex:i error:&err];
    NSLog(@"[%@] %@ at (%.1f, %.1f) %.1fx%.1f",
        type, text, rect.x, rect.y, rect.width, rect.height);
}

// One-shot JSON serialization:
NSString *json = [els toJsonWithError:&err];

Elixir

{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, els} = PdfOxide.page_elements(doc, 0)
for i <- 0..(PdfOxide.element_count(els) - 1) do
  {:ok, type} = PdfOxide.element_type(els, i)
  {:ok, text} = PdfOxide.element_text(els, i)
  {:ok, rect} = PdfOxide.element_rect(els, i)
  IO.puts("[#{type}] #{text} at (#{rect.x}, #{rect.y}) #{rect.width}x#{rect.height}")
end

# Serialize the whole list to JSON in one call:
{:ok, json} = PdfOxide.elements_to_json(els)

Element fields

Field (Go / Swift)	Type	Description
`Type` / `type`	`string`	Element type (for example `"text"`)
`Text` / `text`	`string`	Text content of the element
`X`, `Y` / `rect.x`, `rect.y`	`float`	Bounding box origin in PDF user space
`Width`, `Height` / `rect.width`, `rect.height`	`float`	Bounding box size

Binding coverage. page_elements is available in Go (doc.PageElements(page)), Swift (doc.pageElements(page) → ElementList), and the C ABI (pdf_page_get_elements + pdf_oxide_elements_to_json). It is not compiled for WASM targets.

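Frequently asked questions

What is the difference between extract_images() and the embedded image accessor? extract_images() (Rust) returns rich PdfImage objects with save_as_png, to_jpeg_bytes, CTM bounding boxes, and typed ColorSpace/ImageData enums. The embedded image accessor (doc.Images / doc.embeddedImages / pdf_document_get_embedded_images) is the cross-language path over the same content stream traversal and returns a flat list of dimensions, format, color space, and raw bytes.

Is image extraction fast? Yes. PDF Oxide's extraction core runs at a mean of 0.8 ms and a p99 of 9 ms on the benchmark corpus, with a 100% pass rate, and decodes images in their original color space without loss.

Does the embedded image accessor re-encode JPEGs? No. JPEG-backed images are returned as their original DCT bytes (format == "jpeg"), and only raw pixel data is decoded. The richer extract_images() API makes the same distinction with ImageData::Jpeg and ImageData::Raw.

Why is data empty for some images? Broken images (missing /ColorSpace, zero dimensions, truncated stream) are skipped with a warning rather than panicking the page, so their byte buffer can be empty.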
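The last two answers suggest a simple defensive check on the accessor output: treat an empty buffer as a skipped image, and confirm that a buffer reported as "jpeg" really starts with the JPEG start-of-image marker (bytes FF D8). The sketch below assumes only the format strings "jpeg" and "raw" mentioned above. `Payload` and `classify_payload` are hypothetical names.

```rust
/// What a caller might conclude about one accessor entry (hypothetical type).
#[derive(Debug, PartialEq)]
enum Payload {
    /// Empty buffer: the image was skipped as broken.
    Skipped,
    /// Original JPEG bytes, safe to write to disk unchanged.
    JpegPassThrough,
    /// Decoded pixels that still need encoding to a file format.
    DecodedPixels,
    /// Format string and bytes do not match any expected case.
    Unexpected,
}

/// Classifies one image from its format string and raw bytes.
fn classify_payload(format: &str, data: &[u8]) -> Payload {
    if data.is_empty() {
        return Payload::Skipped;
    }
    match format {
        // JPEG streams begin with the start-of-image marker FF D8.
        "jpeg" if data.starts_with(&[0xFF, 0xD8]) => Payload::JpegPassThrough,
        "raw" => Payload::DecodedPixels,
        _ => Payload::Unexpected,
    }
}
```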
