What is the fastest Python PDF library?

PDF Oxide is the fastest Python PDF library, with 0.8ms mean text extraction time — 5.8× faster than PyMuPDF (4.6ms) and 15× faster than pypdf (12.1ms). Benchmarked on 3,830 real-world PDFs with 100% pass rate.

Is PDF Oxide free for commercial use?

Yes. PDF Oxide is MIT licensed — free for all uses including commercial products, SaaS, and proprietary software. No license fees, no sales calls, no AGPL restrictions.

Can PDF Oxide handle scanned PDFs with OCR?

Yes. PDF Oxide includes built-in OCR via PaddleOCR and ONNX Runtime. No Tesseract installation needed — just pip install pdf_oxide and use extract_text_ocr(). Supports PP-OCRv3, v4, and v5 models.

Does PDF Oxide support XFA forms?

Yes. PDF Oxide is the only Python PDF library that can detect, analyze, and extract data from XFA forms (XML Forms Architecture). PyMuPDF, pypdf, pdfplumber, and pdfminer cannot read XFA form data.

How does PDF Oxide compare to PyMuPDF?

PDF Oxide is 5.8× faster than PyMuPDF (0.8ms vs 4.6ms mean), has a 100% pass rate vs 99.3%, and is MIT licensed vs PyMuPDF's AGPL-3.0. PDF Oxide also has built-in Markdown/HTML output and XFA form support that PyMuPDF lacks.

Can PDF Oxide convert PDF to Markdown?

Yes. PDF Oxide has built-in PDF to Markdown conversion with heading detection, table preservation, and list formatting — ideal for LLM and RAG pipelines. No separate package needed, unlike PyMuPDF which requires pymupdf4llm (69× slower).

Getting Started with PDF Oxide (Dart / Flutter)

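PDF Oxide is the fastest way to read PDFs from Dart and Flutter: 0.8ms mean text extraction and a 100% pass rate on 3,830 PDFs. The pdf_oxide package is an idiomatic dart:ffi wrapper around the Rust core. PDF handles are released automatically by a NativeFinalizer (and by an explicit close()), C strings and buffers are copied into Dart for you, and C-ABI error codes surface as PdfOxideError exceptions.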

Installation

Add pdf_oxide to your pubspec.yaml:

dependencies:
  pdf_oxide: ^0.3.69

Then fetch the dependencies:

dart pub get

The bindings load the native library (libpdf_oxide.{so,dylib,dll}) at runtime. The lookup order is PDF_OXIDE_LIB_PATH (full path) → PDF_OXIDE_LIB_DIR → ../target/release → target/release → the system loader. In Flutter, ship the platform library with your app and set PDF_OXIDE_LIB_PATH to point at it.
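The lookup order can be pictured as a short resolver. This is only a sketch of the documented order, not the bindings' actual code: resolveLibrary and nativeLibraryName are hypothetical helpers, and the file names simply follow the libpdf_oxide.{so,dylib,dll} pattern.

```dart
import 'dart:io';

/// File name for the current platform, following the documented
/// libpdf_oxide.{so,dylib,dll} pattern (an assumption, not read from the package).
String nativeLibraryName() {
  if (Platform.isWindows) return 'libpdf_oxide.dll';
  if (Platform.isMacOS) return 'libpdf_oxide.dylib';
  return 'libpdf_oxide.so';
}

/// Hypothetical helper mirroring the documented lookup order.
/// Returns the first candidate that exists, or null when only the
/// system loader is left to try.
String? resolveLibrary(
  String fileName, {
  Map<String, String>? env,
  bool Function(String path)? exists,
}) {
  final vars = env ?? Platform.environment;
  final fileExists = exists ?? ((String p) => File(p).existsSync());
  final libPath = vars['PDF_OXIDE_LIB_PATH'];
  final libDir = vars['PDF_OXIDE_LIB_DIR'];

  final candidates = <String>[
    if (libPath != null) libPath, // full path to the library file
    if (libDir != null) '$libDir/$fileName', // directory that contains it
    '../target/release/$fileName',
    'target/release/$fileName',
  ];
  for (final candidate in candidates) {
    if (fileExists(candidate)) return candidate;
  }
  return null; // nothing found locally: defer to the system loader
}

void main() {
  final found = resolveLibrary(nativeLibraryName());
  print(found ?? 'No local copy found; the system loader would be tried next.');
}
```

Running this next to your app is a quick way to see which candidate would win before the bindings attempt to load anything.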

Quick Start

Open a PDF and get the text of its first page. Always close() the document when you are done; try/finally keeps the cleanup tidy.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('research-paper.pdf');
  try {
    print('Pages: ${doc.pageCount}');
    print('PDF version: ${doc.version}'); // e.g. 1.7
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}
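If the open / try / finally / close pattern starts to repeat, it can be folded into a small helper. The sketch below is plain Dart and not part of pdf_oxide: withResource is a hypothetical name, and a list stands in for the document so the block runs without a PDF file.

```dart
/// Runs [body] with a freshly opened resource and always calls [close],
/// even when [body] throws. Hypothetical helper, not part of pdf_oxide.
T withResource<R, T>(
  R Function() open,
  void Function(R resource) close,
  T Function(R resource) body,
) {
  final resource = open();
  try {
    return body(resource);
  } finally {
    close(resource);
  }
}

void main() {
  // Stand-in resource so the sketch runs on its own. With pdf_oxide the
  // call would presumably look like:
  //   withResource(() => PdfDocument.open(path), (d) => d.close(),
  //       (d) => d.extractText(0));
  final log = <String>[];
  final length = withResource<List<String>, int>(
    () => log..add('open'),
    (r) => r.add('close'),
    (r) => r.length, // evaluated before close runs
  );
  print('$length $log'); // prints: 1 [open, close]
}
```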

To open a PDF that is already in memory (for example, one downloaded over HTTP in a Flutter app), use openFromBytes:

import 'dart:typed_data';
import 'package:pdf_oxide/pdf_oxide.dart';

void render(Uint8List bytes) {
  final doc = PdfDocument.openFromBytes(bytes);
  try {
    print(doc.extractText(0));
  } finally {
    doc.close();
  }
}

Text, Markdown, and HTML

Each page can be rendered as plain text, Markdown, or HTML. The per-page methods take a zero-based page index, and the …All() variants operate on the whole document.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('report.pdf');
  try {
    // Single page (index 0)
    print(doc.extractText(0));   // raw extracted text
    print(doc.toPlainText(0));   // normalized plain text
    print(doc.toMarkdown(0));    // Markdown with headings, lists, tables
    print(doc.toHtml(0));        // HTML

    // Whole document
    print(doc.toMarkdownAll());
    print(doc.toHtmlAll());
    print(doc.toPlainTextAll());
  } finally {
    doc.close();
  }
}
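When you want page-by-page output rather than whole-document output, the zero-based index convention means valid indices run from 0 to pageCount - 1. A minimal sketch of that loop follows; eachPage is a hypothetical helper, and a list of strings stands in for the document so the block runs on its own. It assumes the per-page methods return a String.

```dart
/// Lazily yields the rendering of every page, in order.
/// [extractPage] is any per-page function that takes a zero-based index.
Iterable<String> eachPage(
  int pageCount,
  String Function(int index) extractPage,
) sync* {
  for (var i = 0; i < pageCount; i++) {
    yield extractPage(i);
  }
}

void main() {
  // Stand-in for a document. With pdf_oxide the call would presumably be
  //   eachPage(doc.pageCount, doc.toMarkdown)
  final fakePages = ['first page', 'second page'];
  var number = 1;
  for (final text in eachPage(fakePages.length, (i) => fakePages[i])) {
    print('--- page ${number++} ---');
    print(text);
  }
}
```

Because the generator is lazy, each page is rendered only when the loop asks for it, which keeps memory flat on long documents.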

If you prefer to work one page at a time, there is also a lightweight Page view:

final doc = PdfDocument.open('report.pdf');
try {
  final page = doc.page(0); // lightweight view of page 0
  print(page.text());
  print(page.markdown());
} finally {
  doc.close(); // runs even if extraction throws
}

Words and Lines with Coordinates

extractWords returns every word with its bounding box, font, and bold flag, and extractTextLines returns whole lines. Coordinates are in PDF user-space points.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('paper.pdf');
  try {
    for (final word in doc.extractWords(0)) {
      final b = word.bbox; // Bbox(x, y, width, height)
      print("'${word.text}' at (${b.x}, ${b.y}) "
          'font=${word.fontName} size=${word.fontSize} bold=${word.bold}');
    }

    for (final line in doc.extractTextLines(0)) {
      print('${line.wordCount} words: ${line.text}');
    }
  } finally {
    doc.close();
  }
}
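Because the coordinates are in points, drawing a box on screen or measuring it on paper needs a unit conversion. The helpers below are a sketch in plain Dart with hypothetical names. The 72 points per inch ratio is the PDF default; the y-flip assumes that the box's y is its lower edge measured from the bottom of the page, which is an assumption about the bindings and worth checking against real output.

```dart
/// Default PDF user space: 72 points per inch.
const double pointsPerInch = 72.0;

/// Points to device pixels at the given resolution.
double pointsToPixels(double points, {double dpi = 96.0}) =>
    points * dpi / pointsPerInch;

/// Points to millimetres (25.4 mm per inch).
double pointsToMillimetres(double points) => points / pointsPerInch * 25.4;

/// Converts a box's y from a bottom-left origin (PDF default) to a
/// top-left origin (typical for screens). Assumes [y] is the lower edge.
double toTopLeftY(double y, double height, double pageHeight) =>
    pageHeight - (y + height);

void main() {
  // Hypothetical word box on a US Letter page (612 x 792 points).
  const x = 72.0, y = 700.0, height = 12.0;
  print('x = ${pointsToPixels(x)} px at 96 dpi');
  print('x = ${pointsToMillimetres(x).toStringAsFixed(1)} mm');
  print('top = ${toTopLeftY(y, height, 792.0)} pt from the top edge');
}
```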

When you need glyph-level detail, extractChars returns each Char with its Unicode code point, bounding box, font name, and size:

final doc = PdfDocument.open('paper.pdf');
try {
  for (final ch in doc.extractChars(0)) {
    // ch.character is a Unicode code point; convert it to a String to print it
    print('${String.fromCharCode(ch.character)} @ ${ch.bbox} ${ch.fontSize}pt');
  }
} finally {
  doc.close();
}

Search

search looks within a single page, and searchAll scans the whole document. Both take a search term and a caseSensitive flag, and return SearchResult records containing the matched text, its page index, and its bounding box.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final doc = PdfDocument.open('manual.pdf');
  try {
    // Single page (page 0), case-insensitive
    for (final hit in doc.search(0, 'configuration', false)) {
      print("page ${hit.page}: '${hit.text}' at ${hit.bbox}");
    }

    // Across the whole document
    final hits = doc.searchAll('configuration', false);
    print('${hits.length} matches');
  } finally {
    doc.close();
  }
}
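A common follow-up is summarizing where the matches fall. The sketch below tallies hits per page from a sequence of page indices. countByPage is a hypothetical helper in plain Dart, and it assumes the page field of a search result is an int.

```dart
/// Counts how many hits fall on each page index.
Map<int, int> countByPage(Iterable<int> pageIndices) {
  final counts = <int, int>{};
  for (final page in pageIndices) {
    counts.update(page, (n) => n + 1, ifAbsent: () => 1);
  }
  return counts;
}

void main() {
  // Stand-in page indices; with pdf_oxide these would presumably come from
  //   doc.searchAll('configuration', false).map((hit) => hit.page)
  final counts = countByPage([0, 0, 2, 5, 2, 0]);
  counts.forEach((page, n) => print('page $page: $n match(es)'));
}
```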

Creating PDFs

The Pdf builder converts Markdown, HTML, or plain text into a PDF. Call toBytes() to get the bytes or save() to write a file, and call close() when you are done.

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  final pdf = Pdf.fromMarkdown('# Hello World\n\nThis is a **PDF**.\n');
  try {
    pdf.save('output.pdf');
    final bytes = pdf.toBytes();
    print('Wrote ${bytes.length} bytes');
  } finally {
    pdf.close();
  }
}

Pdf.fromHtml('<h1>Invoice</h1><p>Amount: \$42</p>') and Pdf.fromText('Plain text content.') work the same way. A built Pdf is ultimately just a blob of bytes, so you can feed it straight into PdfDocument.openFromBytes(pdf.toBytes()) and extract from it again without touching disk.
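Put together, the in-memory round trip might look like the sketch below. It uses only calls that appear elsewhere on this page and assumes that toBytes() returns a value openFromBytes accepts directly; it needs the pdf_oxide package and its native library to run.

```dart
import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  // Build a PDF in memory from Markdown.
  final pdf = Pdf.fromMarkdown('# Round Trip\n\nBuilt and read back in memory.\n');
  try {
    // Reopen the same bytes as a document, without writing a file.
    final doc = PdfDocument.openFromBytes(pdf.toBytes());
    try {
      print('Pages: ${doc.pageCount}');
      print(doc.extractText(0));
    } finally {
      doc.close();
    }
  } finally {
    pdf.close();
  }
}
```

Nesting the two try/finally blocks keeps both native handles released in reverse order of creation, whichever step fails.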

Error Handling

Every call that can fail throws a PdfOxideError (which implements Exception) carrying the underlying C-ABI error code:

import 'package:pdf_oxide/pdf_oxide.dart';

void main() {
  try {
    final doc = PdfDocument.open('/nonexistent/nope.pdf');
    doc.close();
  } on PdfOxideError catch (e) {
    print('Failed to open PDF: $e');
  }
}

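Next Steps

Getting Started with Rust: the native core that powers these bindings
Getting Started with Python: using PDF Oxide from Python
Text Extraction: detailed extraction options and recipes
PDF Creation: advanced creation with metadata and encryption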