What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (C++)

PDF Oxide provides idiomatic, header-only C++17 bindings on top of its Rust core: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. Handles are move-only RAII wrappers, native strings and buffers are copied automatically into std::string / std::vector<std::uint8_t>, and C ABI error codes are thrown as pdf_oxide::Error. New in v0.3.69.

Installation

The binding is a single header (cpp/include/pdf_oxide/pdf_oxide.hpp) that links against the native cdylib. Build the library once from the repository root, then point CMake at it:

# 1. build the native library (shipped binding feature set)
cargo build --release --lib \
  --features ocr,rendering,signatures,barcodes,tsa-client,system-fonts

# 2. configure + build with the header-only wrapper
cmake -S cpp -B cpp/build -DCMAKE_BUILD_TYPE=Release \
  -DPDF_OXIDE_LIB_DIR="$PWD/target/release"
cmake --build cpp/build -j

Then include the header in your translation unit:

#include <pdf_oxide/pdf_oxide.hpp>

The C header declares a global Pdf type, so do not write using namespace pdf_oxide;. Either qualify names (pdf_oxide::Pdf, pdf_oxide::Document) or bring in only the names you need with using declarations.
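The collision can be reproduced without the library. The sketch below uses stand-in declarations only: the demo_oxide namespace, its two classes, and describe_types() are hypothetical names chosen for illustration, not the contents of the real header. It shows that a using-declaration for a single name coexists with a global Pdf, whereas a using-directive would make the unqualified name Pdf ambiguous.

```cpp
#include <string>

// Stand-in for the opaque type that a C header declares at global scope.
struct Pdf;

// Stand-in namespace that mimics the wrapper's layout (NOT the real header).
namespace demo_oxide {
class Pdf {
public:
    std::string kind() const { return "wrapper"; }
};
class Document {
public:
    std::string kind() const { return "document"; }
};
}  // namespace demo_oxide

// A using-declaration imports exactly one name, so nothing collides.
using demo_oxide::Document;

// If `using namespace demo_oxide;` were in effect here, the unqualified name
// `Pdf` would refer to both ::Pdf and demo_oxide::Pdf and fail to compile.
// Qualifying the name sidesteps the ambiguity.
inline std::string describe_types() {
    Document doc;         // found through the using-declaration
    demo_oxide::Pdf pdf;  // fully qualified, cannot be confused with ::Pdf
    return doc.kind() + "+" + pdf.kind();
}
```

The same pattern applies to real code: keep pdf_oxide::Pdf qualified and import only names that have no global counterpart.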

Quick Start

Open a PDF and extract text from a page in reading order. Every call that can fail throws pdf_oxide::Error, so wrap the work in try/catch.

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("research-paper.pdf");

        std::cout << "pages: " << doc.page_count() << "\n";

        pdf_oxide::Version v = doc.version();
        std::cout << "version: " << static_cast<int>(v.major) << "."
                  << static_cast<int>(v.minor) << "\n";

        std::string text = doc.extract_text(0);   // 0-based page index
        std::cout << text << "\n";
        return 0;
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
}

To open a PDF that is already in memory, use Document::open_from_bytes:

std::vector<std::uint8_t> bytes = load_pdf_bytes();   // from S3, HTTP, a DB…
auto doc = pdf_oxide::Document::open_from_bytes(bytes);
std::string text = doc.extract_text(0);
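The snippet above leaves load_pdf_bytes() undefined, since the bytes can come from anywhere. As one possible source, here is a minimal sketch that reads a local file into the same std::vector<std::uint8_t> shape using only the standard library. Both read_file_bytes() and looks_like_pdf() are hypothetical helper names, not part of PDF Oxide; the signature check relies only on the fact that PDF files begin with the "%PDF-" header.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read an entire file into a byte buffer.
inline std::vector<std::uint8_t> read_file_bytes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::vector<std::uint8_t> bytes;
    // istreambuf_iterator yields raw bytes and does not skip whitespace.
    bytes.assign(std::istreambuf_iterator<char>(in),
                 std::istreambuf_iterator<char>());
    return bytes;
}

// Hypothetical sanity check: does the buffer start with the "%PDF-" header?
inline bool looks_like_pdf(const std::vector<std::uint8_t>& bytes) {
    static const char magic[] = "%PDF-";
    const std::size_t n = sizeof(magic) - 1;  // exclude the trailing '\0'
    if (bytes.size() < n) return false;
    for (std::size_t i = 0; i < n; ++i) {
        if (bytes[i] != static_cast<std::uint8_t>(magic[i])) return false;
    }
    return true;
}
```

A buffer produced this way could then be handed to Document::open_from_bytes; checking looks_like_pdf() first is optional and only gives an earlier, cheaper failure for obviously wrong input.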

Markdown and HTML Conversion

Convert a single page, or the whole document, to Markdown or HTML.

auto doc = pdf_oxide::Document::open("paper.pdf");

std::string page_md = doc.to_markdown(0);     // one page
std::string all_md  = doc.to_markdown_all();  // every page

std::string page_html = doc.to_html(0);
std::string all_html  = doc.to_html_all();

std::cout << all_md << "\n";

Word-Level Extraction

extract_words(page_index) returns a std::vector<pdf_oxide::Word> holding the text, bounding box, and font metadata for every word on the page.

auto doc   = pdf_oxide::Document::open("paper.pdf");
auto words = doc.extract_words(0);

for (const auto& w : words) {
    std::cout << "'" << w.text << "'"
              << " at (" << w.bbox.x << ", " << w.bbox.y << ")"
              << " size=" << w.font_size
              << " font=" << w.font_name
              << (w.bold ? " [bold]" : "") << "\n";
}

pdf_oxide::Word fields:

Field	Type	Description
`text`	`std::string`	Word text
`bbox`	`Bbox`	Bounding box (`x`, `y`, `width`, `height`)
`font_name`	`std::string`	PostScript font name
`font_size`	`float`	Font size (in points)
`bold`	`bool`	Whether the run is bold

Character-level and line-level extraction follow the same shape: extract_chars(0) returns Char records (Unicode code point + bbox), and extract_text_lines(0) returns TextLine records (text, bbox, word_count).
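Once words carry font metadata, simple layout heuristics become a few lines of code. The sketch below collects large bold words as heading candidates. To stay self-contained it declares stand-in Bbox and Word structs that mirror the fields listed above; the float coordinate type and the emphasized_text() helper are assumptions for illustration, and real code would use the types from the header instead.

```cpp
#include <string>
#include <vector>

// Stand-in types mirroring the documented fields (assumed layout, not the
// definitions from pdf_oxide.hpp).
struct Bbox {
    float x, y, width, height;
};
struct Word {
    std::string text;
    Bbox bbox;
    std::string font_name;
    float font_size;
    bool bold;
};

// Hypothetical helper: join every bold word whose font size is at least
// min_size, in input order, separated by single spaces.
inline std::string emphasized_text(const std::vector<Word>& words,
                                   float min_size) {
    std::string out;
    for (const auto& w : words) {
        if (!w.bold || w.font_size < min_size) continue;
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}
```

The threshold is document-specific; a reasonable starting point is a value slightly above the body text size observed in the same vector.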

Search

Search a single page with search(page_index, term, case_sensitive), or the whole document with search_all(term, case_sensitive). Both return std::vector<pdf_oxide::SearchResult>.

auto doc = pdf_oxide::Document::open("manual.pdf");

// One page
auto hits = doc.search(0, "configuration", /*case_sensitive=*/false);

// Every page
auto all_hits = doc.search_all("configuration", /*case_sensitive=*/false);
for (const auto& r : all_hits) {
    std::cout << "page " << r.page << ": '" << r.text << "'"
              << " at (" << r.bbox.x << ", " << r.bbox.y << ")\n";
}
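Document-wide results are often easier to use once grouped by page, for example to rank pages by hit count. The sketch below does that with a std::map. The SearchHit struct is a stand-in that mirrors the page, text, and bbox members used in the loop above; the int page type, the float coordinates, and the hits_per_page() helper are assumptions, not the library's declarations.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in for a search result (assumed field types, for illustration only).
struct HitBox {
    float x, y, width, height;
};
struct SearchHit {
    int page;
    std::string text;
    HitBox bbox;
};

// Hypothetical helper: count how many hits fall on each page.
// std::map keeps the pages sorted in ascending order.
inline std::map<int, std::size_t> hits_per_page(
        const std::vector<SearchHit>& hits) {
    std::map<int, std::size_t> counts;
    for (const auto& h : hits) {
        ++counts[h.page];  // operator[] value-initializes missing keys to 0
    }
    return counts;
}
```

Pages with no hits simply do not appear in the map, so counts.size() is the number of pages that matched at least once.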

Creating PDFs

The pdf_oxide::Pdf builder creates documents from Markdown, HTML, or plain text. Serialize with to_bytes(), or write straight to disk with save().

// From Markdown
auto pdf = pdf_oxide::Pdf::from_markdown("# Hello World\n\nThis is a PDF.\n");
pdf.save("output.pdf");

// From HTML
auto invoice = pdf_oxide::Pdf::from_html("<h1>Invoice</h1><p>Amount: $42</p>");
invoice.save("invoice.pdf");

// From plain text, or grab the bytes for in-memory use
auto notes = pdf_oxide::Pdf::from_text("Plain text body.");
std::vector<std::uint8_t> bytes = notes.to_bytes();

You can also round-trip a freshly created PDF back into a Document:

auto pdf  = pdf_oxide::Pdf::from_markdown("# Title\n\nbody\n");
auto doc  = pdf_oxide::Document::open_from_bytes(pdf.to_bytes());
std::cout << doc.to_markdown_all() << "\n";

Error Handling

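Every operation that can fail throws pdf_oxide::Error, which carries the native error message (what()) and the original C ABI error code (code()). Handles can be closed explicitly, and closing is idempotent: doc.close() releases the native handle early, and using a handle after it has been closed throws.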

#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>

int main() {
    try {
        auto doc = pdf_oxide::Document::open("missing.pdf");
        std::cout << doc.extract_text(0) << "\n";
        doc.close();   // optional — happens automatically at scope exit
    } catch (const pdf_oxide::Error& e) {
        std::cerr << "pdf error (" << e.code() << "): " << e.what() << "\n";
        return 1;
    }
}
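To make the handle semantics concrete, here is a minimal sketch of the general pattern: a move-only RAII wrapper over a C-style handle, with an idempotent close() and an exception type that carries the numeric code. It is not PDF Oxide's implementation. RawHandle, raw_open(), raw_free(), live_handles, DemoError, DemoDocument, and the error code values are all hypothetical stand-ins, stubbed so the example compiles on its own.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical C-style API, stubbed for the sketch.
struct RawHandle { int pages; };
static int live_handles = 0;  // counts handles not yet freed

static int raw_open(const char* path, RawHandle** out) {
    if (std::string(path).empty()) return 2;  // pretend 2 means "not found"
    *out = new RawHandle{3};
    ++live_handles;
    return 0;
}
static void raw_free(RawHandle* h) { delete h; --live_handles; }

// Exception that keeps the numeric code next to the message.
class DemoError : public std::runtime_error {
public:
    DemoError(int code, const std::string& msg)
        : std::runtime_error(msg), code_(code) {}
    int code() const noexcept { return code_; }
private:
    int code_;
};

class DemoDocument {
public:
    static DemoDocument open(const std::string& path) {
        RawHandle* h = nullptr;
        const int rc = raw_open(path.c_str(), &h);
        if (rc != 0) throw DemoError(rc, "open failed: " + path);
        return DemoDocument(h);
    }
    // Move-only: moving transfers ownership and empties the source.
    DemoDocument(DemoDocument&& o) noexcept : h_(std::exchange(o.h_, nullptr)) {}
    DemoDocument& operator=(DemoDocument&& o) noexcept {
        if (this != &o) { close(); h_ = std::exchange(o.h_, nullptr); }
        return *this;
    }
    DemoDocument(const DemoDocument&) = delete;
    DemoDocument& operator=(const DemoDocument&) = delete;
    ~DemoDocument() { close(); }

    // Idempotent: a second call finds a null handle and does nothing.
    void close() noexcept {
        if (h_) { raw_free(h_); h_ = nullptr; }
    }
    int page_count() const {
        if (!h_) throw DemoError(1, "document is closed");  // use after close
        return h_->pages;
    }
private:
    explicit DemoDocument(RawHandle* h) : h_(h) {}
    RawHandle* h_ = nullptr;
};
```

Because the destructor calls close(), an explicit close() is only worth writing when the native resource should be released before the end of the enclosing scope.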

Next Steps

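Getting Started with Rust – Using PDF Oxide from Rust
Getting Started with Python – Using PDF Oxide from Python
Text Extraction – Detailed extraction options and recipes
PDF Creation – Advanced creation, including metadata and encryption
Editing – Modifying existing PDFs, annotations, and form fields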