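Extract Text from PDF in Python
PDF text extraction is one of the most common tasks in document-processing pipelines, from building search indexes and feeding data to RAG systems to data mining and compliance review. This guide covers how to extract PDF text with PDF Oxide in Python, JavaScript (WASM), and Rust, with equivalent snippets for the other language bindings. It covers plain-text extraction, character-level coordinates, styled text spans, OCR for scanned documents, encrypted files, and performance tuning for batch pipelines.
Extracting text from any PDF takes only a few lines of code: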
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("document.pdf")
text = doc.extract_text(0) # page 0
print(text)
WASM
import { WasmPdfDocument } from "pdf-oxide-wasm";
const bytes = new Uint8Array(buffer);
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0); // page 0
console.log(text);
doc.free();
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("document.pdf")?;
let text = doc.extract_text(0)?;
println!("{}", text);
Go
package main
import (
"fmt"
"log"
pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)
func main() {
doc, err := pdfoxide.Open("document.pdf")
if err != nil { log.Fatal(err) }
defer doc.Close()
text, err := doc.ExtractText(0) // page 0
if err != nil { log.Fatal(err) }
fmt.Println(text)
}
C#
using PdfOxide;
using var doc = PdfDocument.Open("document.pdf");
var text = doc.ExtractText(0); // page 0
Console.WriteLine(text);
Java
import fyi.oxide.pdf.PdfDocument;
import java.nio.file.Path;
try (PdfDocument doc = PdfDocument.open(Path.of("document.pdf"))) {
String text = doc.extractText(0); // page 0
System.out.println(text);
}
Kotlin
import fyi.oxide.pdf.PdfDocument
import java.nio.file.Path
PdfDocument.open(Path.of("document.pdf")).use { doc ->
val text = doc.extractText(0) // page 0
println(text)
}
Scala
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open("document.pdf")) { doc =>
val text = doc.extractText(0) // page 0
println(text)
}
Clojure
(require '[pdf-oxide.core :as pdf])
(with-open [doc (pdf/open "document.pdf")]
(println (pdf/extract-text doc 0))) ; page 0
PHP
use PdfOxide\PdfDocument;
$doc = PdfDocument::open('document.pdf');
$text = $doc->extractText(0); // page 0
echo $text;
$doc->close();
Ruby
require 'pdf_oxide'
PdfOxide::PdfDocument.open('document.pdf') do |doc|
text = doc.extract_text(0) # page 0
puts text
end
C++
#include <pdf_oxide/pdf_oxide.hpp>
#include <iostream>
auto doc = pdf_oxide::Document::open("document.pdf");
auto text = doc.extract_text(0); // page 0
std::cout << text << '\n';
Swift
import PdfOxide
let doc = try Document.open("document.pdf")
let text = try doc.extractText(0) // page 0
print(text)
Dart
import 'package:pdf_oxide/pdf_oxide.dart';
final doc = PdfDocument.open('document.pdf');
final text = doc.extractText(0); // page 0
print(text);
doc.close();
R
library(pdfoxide)
doc <- pdf_open("document.pdf")
text <- pdf_extract_text(doc, 0) # page 0
cat(text)
Julia
using PdfOxide
doc = open_document("document.pdf")
text = extract_text(doc, 0) # page 0
println(text)
Zig
const std = @import("std");
const pdf_oxide = @import("pdf_oxide");
const a = std.heap.page_allocator;
var doc = try pdf_oxide.Document.open("document.pdf");
const text = try doc.extractText(a, 0); // page 0
std.debug.print("{s}\n", .{text});
Objective-C
#import "POXPdfOxide.h"
NSError *err = nil;
POXDocument *doc = [POXDocument openPath:@"document.pdf" error:&err];
NSString *text = [doc extractText:0 error:&err]; // page 0
NSLog(@"%@", text);
Elixir
{:ok, doc} = PdfOxide.open("document.pdf")
{:ok, text} = PdfOxide.extract_text(doc, 0) # page 0
IO.puts(text)
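PDF Oxide averages 0.8 ms per page, roughly 5× faster than PyMuPDF and 15× faster than pypdf, with a 100% pass rate on 3,830 test PDFs.
Why PDF Text Extraction Is Hard
PDF is a visual format, not a text format. Unlike HTML or Markdown, a PDF file does not store "paragraphs" or "sentences"; it stores individual characters at specific coordinates on the page. Producing readable text requires all of the following: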
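- Font decoding: PDF fonts map character codes to glyphs through encoding tables (WinAnsi, MacRoman, Unicode CMap), and the mechanism differs by font type (Type 1, TrueType, CIDFont). The same character code 0x41 may mean "A" in one font and "α" in another.
- Text stream parsing: The text operators Tj, TJ, ', and " place characters on the page. Kerning adjustments inside a TJ array shift characters horizontally (the values are expressed in thousandths of a text-space unit). Missing spaces must be inferred from the gaps between character positions.
- Layout reconstruction: Characters on a page have no explicit reading order. Two-column layouts, headers, footers, tables, and sidebars must go through spatial analysis before a linear text flow can be recovered.
- Encoding edge cases: CJK text (Chinese, Japanese, Korean) uses CIDFont/CMap encodings with thousands of glyphs. Arabic and Hebrew need right-to-left reordering. Ligatures (fi, fl, ffi) must be split.
- Embedded subsets: Many PDFs embed only the glyphs actually used, with a custom encoding vector. A font may map glyph index 1 to "T", 2 to "h", and 3 to "e" without following any standard encoding.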
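This is why different PDF libraries produce different text for the same file, and why some libraries fail outright on complex documents. PDF Oxide handles all of these cases with a Rust-based parser, tested on 3,830 real-world PDFs with a 100% pass rate.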
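The font decoding and embedded-subset points can be illustrated with a small sketch. The font names and tables below are hypothetical and made up for illustration; a real parser builds such tables from each font's encoding and ToUnicode data, and this is not how PDF Oxide is implemented internally.

```python
# Hypothetical per-font code-to-Unicode tables, for illustration only.
FONT_TABLES = {
    "F1": {0x41: "A", 0x42: "B"},    # Latin font with a standard-like encoding
    "F2": {0x41: "α", 0x42: "β"},    # same codes, but the font draws Greek glyphs
    "F3": {1: "T", 2: "h", 3: "e"},  # subset font with a custom encoding vector
}

REPLACEMENT = "\ufffd"  # emitted when a code has no known mapping


def decode(font_name: str, codes: bytes) -> str:
    """Map raw character codes to Unicode using the table of the active font."""
    table = FONT_TABLES[font_name]
    # Iterating over bytes yields integers, which are the table keys.
    return "".join(table.get(code, REPLACEMENT) for code in codes)


print(decode("F1", b"\x41\x42"))
print(decode("F2", b"\x41\x42"))
print(decode("F3", b"\x01\x02\x03"))
```

The same byte sequence decodes differently depending on the active font, which is why a decoder that ignores per-font tables produces garbled output.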
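Installation
Python (PyPI):
pip install pdf_oxide
Prebuilt wheels are available for Linux (x86_64, aarch64), macOS (Intel and Apple Silicon), and Windows (x86_64). Requires Python 3.8+. There are no system dependencies: the Rust core is compiled into the wheel, so Poppler, MuPDF, and other C libraries are not needed.
JavaScript (npm):
npm install pdf-oxide-wasm
Supports Node.js 18+ and modern browsers. The WASM binary ships inside the npm package.
Rust (Cargo):
cargo add pdf_oxide
Requires Rust 1.70+. No system dependencies beyond the standard Rust toolchain.
Extracting All Pages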
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")
full_text = []
for i in range(doc.page_count()):
    text = doc.extract_text(i)
    full_text.append(text)
print("\n".join(full_text))
WASM
const doc = new WasmPdfDocument(bytes);
const fullText = doc.extractAllText();
console.log(fullText);
doc.free();
Rust
let mut doc = PdfDocument::open("report.pdf")?;
let mut full_text = Vec::new();
for i in 0..doc.page_count()? {
full_text.push(doc.extract_text(i)?);
}
println!("{}", full_text.join("\n"));
Go
doc, err := pdfoxide.Open("report.pdf")
if err != nil { log.Fatal(err) }
defer doc.Close()
full, err := doc.ExtractAllText()
if err != nil { log.Fatal(err) }
fmt.Println(full)
C#
using var doc = PdfDocument.Open("report.pdf");
var parts = new List<string>();
for (int i = 0; i < doc.PageCount; i++)
parts.Add(doc.ExtractText(i));
Console.WriteLine(string.Join("\n", parts));
Java
try (PdfDocument doc = PdfDocument.open(Path.of("report.pdf"))) {
StringBuilder all = new StringBuilder();
for (int i = 0; i < doc.pageCount(); i++)
all.append(doc.extractText(i));
System.out.println(all);
}
Kotlin
PdfDocument.open(Path.of("report.pdf")).use { doc ->
val all = (0 until doc.pageCount()).joinToString("") { doc.extractText(it) }
println(all)
}
Scala
Using.resource(PdfDocument.open("report.pdf")) { doc =>
val all = (0 until doc.pageCount()).map(doc.extractText).mkString
println(all)
}
Clojure
(with-open [doc (pdf/open "report.pdf")]
(println (apply str (map #(pdf/extract-text doc %)
(range (pdf/page-count doc))))))
PHP
$doc = PdfDocument::open('report.pdf');
$all = '';
for ($i = 0; $i < $doc->pageCount(); $i++) { $all .= $doc->extractText($i); }
echo $all;
$doc->close();
Ruby
PdfOxide::PdfDocument.open('report.pdf') do |doc|
all = (0...doc.page_count).map { |i| doc.extract_text(i) }.join
puts all
end
C++
auto doc = pdf_oxide::Document::open("report.pdf");
auto all = doc.extract_all_text();
std::cout << all << '\n';
Swift
let doc = try Document.open("report.pdf")
let all = try doc.extractAllText()
print(all)
Dart
final doc = PdfDocument.open('report.pdf');
final all = doc.extractAllText();
print(all);
doc.close();
R
doc <- pdf_open("report.pdf")
all <- pdf_extract_all_text(doc)
cat(all)
Julia
doc = open_document("report.pdf")
all = extract_all_text(doc)
println(all)
Zig
var doc = try pdf_oxide.Document.open("report.pdf");
const all = try doc.extractAllText(a);
std.debug.print("{s}\n", .{all});
Objective-C
POXDocument *doc = [POXDocument openPath:@"report.pdf" error:&err];
NSString *all = [doc extractAllTextWithError:&err];
NSLog(@"%@", all);
Elixir
{:ok, doc} = PdfOxide.open("report.pdf")
{:ok, n} = PdfOxide.page_count(doc)
all = 0..(n - 1)
|> Enum.map(fn i -> {:ok, t} = PdfOxide.extract_text(doc, i); t end)
|> Enum.join()
IO.puts(all)
Extracting Text with Character Coordinates
Get the exact coordinates, font name, and font size of every character:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
chars = doc.extract_chars(0)
for ch in chars[:20]:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f}) "
          f"font={ch.font_name} size={ch.font_size:.1f}")
WASM
const doc = new WasmPdfDocument(bytes);
const chars = doc.extractChars(0);
for (const ch of chars.slice(0, 20)) {
console.log(`'${ch.char}' at (${ch.x.toFixed(1)}, ${ch.y.toFixed(1)}) font=${ch.fontName} size=${ch.fontSize.toFixed(1)}`);
}
doc.free();
Rust
let mut doc = PdfDocument::open("paper.pdf")?;
let chars = doc.extract_chars(0)?;
for ch in chars.iter().take(20) {
println!("'{}' at ({:.1}, {:.1}) font={} size={:.1}",
ch.char, ch.x, ch.y, ch.font_name, ch.font_size);
}
Go
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
chars, _ := doc.ExtractChars(0)
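// Clamp the slice so pages with fewer than 20 characters do not panic (builtin min needs Go 1.21+).
for _, ch := range chars[:min(20, len(chars))] {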
fmt.Printf("%q at (%.1f, %.1f) font=%s size=%.1f\n",
ch.Char, ch.X, ch.Y, ch.FontName, ch.FontSize)
}
C#
using var doc = PdfDocument.Open("paper.pdf");
var chars = doc.ExtractChars(0);
foreach (var ch in chars.Take(20))
Console.WriteLine($"'{ch.Char}' at ({ch.X:F1}, {ch.Y:F1}) font={ch.FontName} size={ch.FontSize:F1}");
Java
import fyi.oxide.pdf.text.TextChar;
try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
var chars = doc.page(0).chars();
for (TextChar ch : chars.subList(0, Math.min(20, chars.size()))) {
System.out.printf("'%s' at (%.1f, %.1f)%n",
ch.asString(), ch.bbox().x0(), ch.bbox().y0());
}
}
Kotlin
PdfDocument.open(Path.of("paper.pdf")).use { doc ->
doc.page(0).chars().take(20).forEach { ch ->
println("'${ch.asString()}' at (${ch.bbox().x0()}, ${ch.bbox().y0()})")
}
}
Scala
import fyi.oxide.pdf.charsSeq
Using.resource(PdfDocument.open("paper.pdf")) { doc =>
doc.page(0).charsSeq.take(20).foreach { ch =>
println(f"'${ch.asString}' at (${ch.bbox.x0}%.1f, ${ch.bbox.y0}%.1f)")
}
}
Clojure
(with-open [doc (pdf/open "paper.pdf")]
(doseq [ch (take 20 (pdf/chars (pdf/page doc 0)))]
(let [b (.bbox ch)]
(println (format "'%s' at (%.1f, %.1f)"
(.asString ch) (.x0 b) (.y0 b))))))
C++
auto doc = pdf_oxide::Document::open("paper.pdf");
auto chars = doc.extract_chars(0);
int shown = 0;
for (const auto& ch : chars) {
if (shown++ >= 20) break;
std::printf("U+%04X at (%.1f, %.1f) font=%s size=%.1f\n",
ch.character, ch.bbox.x, ch.bbox.y,
ch.font_name.c_str(), ch.font_size);
}
Swift
let doc = try Document.open("paper.pdf")
let chars = try doc.extractChars(0)
for ch in chars.prefix(20) {
let s = String(UnicodeScalar(ch.character) ?? " ")
print("'\(s)' at (\(ch.bbox.x), \(ch.bbox.y)) font=\(ch.fontName) size=\(ch.fontSize)")
}
Dart
final doc = PdfDocument.open('paper.pdf');
final chars = doc.extractChars(0);
for (final ch in chars.take(20)) {
final s = String.fromCharCode(ch.character);
print("'$s' at (${ch.bbox.x}, ${ch.bbox.y}) "
"font=${ch.fontName} size=${ch.fontSize}");
}
doc.close();
R
doc <- pdf_open("paper.pdf")
chars <- pdf_extract_chars(doc, 0)
for (ch in head(chars, 20)) {
cat(sprintf("'%s' at (%.1f, %.1f) font=%s size=%.1f\n",
intToUtf8(ch$character), ch$bbox$x, ch$bbox$y,
ch$font_name, ch$font_size))
}
Julia
doc = open_document("paper.pdf")
chars = extract_chars(doc, 0)
for ch in chars[1:min(20, end)]
println("'$(Char(ch.character))' at ($(ch.bbox.x), $(ch.bbox.y)) ",
"font=$(ch.font_name) size=$(ch.font_size)")
end
Zig
var doc = try pdf_oxide.Document.open("paper.pdf");
const chars = try doc.extractChars(a, 0);
defer pdf_oxide.Document.freeChars(a, chars);
for (chars[0..@min(20, chars.len)]) |ch| {
std.debug.print("U+{X:0>4} at ({d:.1}, {d:.1}) font={s} size={d:.1}\n",
.{ ch.character, ch.bbox.x, ch.bbox.y, ch.fontName, ch.fontSize });
}
Objective-C
POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSArray<POXChar*> *chars = [doc extractChars:0 error:&err];
for (POXChar *ch in [chars subarrayWithRange:NSMakeRange(0, MIN(20, chars.count))]) {
NSLog(@"U+%04X at (%.1f, %.1f) font=%@ size=%.1f",
ch.character, ch.bbox.x, ch.bbox.y, ch.fontName, ch.fontSize);
}
Elixir
{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, chars} = PdfOxide.extract_chars(doc, 0)
chars
|> Enum.take(20)
|> Enum.each(fn ch ->
IO.puts("'#{<<ch.character::utf8>>}' at (#{ch.bbox.x}, #{ch.bbox.y}) " <>
"font=#{ch.font_name} size=#{ch.font_size}")
end)
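Each character exposes the following fields (names and types as in the Python binding):
| Field | Type | Description |
|---|---|---|
| char | str | Unicode character |
| x, y | float | Position (in points) |
| font_size | float | Font size (in points) |
| font_name | str | PostScript font name |
| bbox | tuple | Bounding box (x0, y0, x1, y1) |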
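Character-level extraction is useful for reconstructing tables, detecting headings by font size, or generating bounding boxes for text regions. For example, characters can be grouped into lines by their y coordinate, and column boundaries can then be detected from gaps in the x coordinates.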
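The line-grouping idea can be sketched as follows. The Char dataclass is a hypothetical stand-in for the objects returned by extract_chars(), and the tolerance and gap thresholds are illustrative guesses, not values taken from PDF Oxide. The sketch also assumes the usual PDF convention that y grows upward, so larger y means higher on the page.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for one extract_chars() result."""
    char: str
    x: float
    y: float


def group_into_lines(chars, y_tolerance=2.0):
    """Cluster characters whose y lies within y_tolerance of the line's first character."""
    lines = []
    # Top of the page first (assumes y grows upward), then left to right.
    for ch in sorted(chars, key=lambda c: (-c.y, c.x)):
        if lines and abs(lines[-1][0].y - ch.y) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    for line in lines:
        line.sort(key=lambda c: c.x)  # restore reading order inside each line
    return lines


def column_gaps(line, min_gap=20.0):
    """Return the midpoints of horizontal gaps wider than min_gap between neighbours."""
    return [
        (a.x + b.x) / 2
        for a, b in zip(line, line[1:])
        if b.x - a.x > min_gap
    ]
```

This version compares character origins only. A more careful variant would use the bounding-box widths, since a wide glyph followed by a narrow gap can otherwise look like a column break.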
Extracting Styled Text Spans
Group consecutive characters that share a font and size:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
spans = doc.extract_spans(0)
for span in spans:
    print(f"'{span.text}' font={span.font_name} size={span.font_size:.1f}")
WASM
const doc = new WasmPdfDocument(bytes);
const spans = doc.extractSpans(0);
for (const span of spans) {
console.log(`'${span.text}' font=${span.fontName} size=${span.fontSize.toFixed(1)}`);
}
doc.free();
Rust
let mut doc = PdfDocument::open("paper.pdf")?;
let spans = doc.extract_spans(0)?;
for span in &spans {
println!("'{}' font={} size={:.1}", span.text, span.font_name, span.font_size);
}
Useful for detecting headings and bold text, or for producing structured output.
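One possible heading heuristic is sketched below. The Span dataclass is a hypothetical stand-in that mirrors the text, font_name, and font_size attributes used above, and the 1.2 ratio is an arbitrary illustrative threshold. This is not PDF Oxide's own heading detection.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one extract_spans() result."""
    text: str
    font_name: str
    font_size: float


def find_headings(spans, ratio=1.2):
    """Return spans whose font size is at least `ratio` times the body size.

    The body size is estimated as the font size that covers the most characters.
    """
    if not spans:
        return []
    weight = Counter()
    for s in spans:
        weight[round(s.font_size, 1)] += len(s.text)
    body_size = weight.most_common(1)[0][0]
    return [s for s in spans if s.font_size >= body_size * ratio]
```

Bold text could be flagged in a similar way by checking whether font_name contains "Bold", although font naming is not standardized and subset fonts often carry prefixed names.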
Batch Processing
Process hundreds or thousands of PDFs in one run:
from pdf_oxide import PdfDocument, PdfError
from pathlib import Path
pdf_dir = Path("documents/")
for pdf_path in pdf_dir.glob("*.pdf"):
    try:
        doc = PdfDocument(str(pdf_path))
        for i in range(doc.page_count()):
            text = doc.extract_text(i)
            # Process text...
    except PdfError as e:
        print(f"Skipped {pdf_path.name}: {e}")
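At a mean of 0.8 ms per page, the 3,830-PDF test corpus takes about 3.1 seconds. For parallel processing in production pipelines, see the batch processing guide, which includes multiprocessing and async I/O examples.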
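As a rough sketch of what a concurrent version of the loop above might look like, the helper below fans files out over a thread pool. extract_many and extract_file are hypothetical names, not part of PDF Oxide. Whether threads actually give a speedup depends on whether the binding releases the GIL during extraction, which is not confirmed here; if it does not, concurrent.futures.ProcessPoolExecutor is the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_file(path):
    """Extract all pages of one PDF using the same calls as the loop above."""
    from pdf_oxide import PdfDocument  # imported lazily so extract_many stays testable
    doc = PdfDocument(str(path))
    return "\n".join(doc.extract_text(i) for i in range(doc.page_count()))


def extract_many(paths, extract=extract_file, workers=4):
    """Run `extract` over paths concurrently; return (results, errors) keyed by path."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {path: pool.submit(extract, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                errors[path] = exc
    return results, errors
```

A call such as extract_many(sorted(Path("documents/").glob("*.pdf"))) would return the extracted text and the per-file failures separately, so skipped files can be logged or retried later.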
Handling Scanned PDFs (OCR)
If a PDF contains scanned images rather than text, extract_text() returns empty or near-empty output. In that case, use PDF Oxide's built-in OCR:
from pdf_oxide import PdfDocument
doc = PdfDocument("scanned.pdf")
text = doc.extract_text(0)
if not text.strip():
    # Page is likely scanned — use OCR
    text = doc.extract_text_ocr(0)
print(text)
PDF Oxide runs PaddleOCR through ONNX Runtime, so Tesseract does not need to be installed. See the OCR guide for model selection and configuration.
Handling Encrypted PDFs
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("protected.pdf", password="secret")
text = doc.extract_text(0)
print(text)
WASM
const doc = new WasmPdfDocument(bytes);
doc.authenticate("secret");
const text = doc.extractText(0);
console.log(text);
doc.free();
Rust
let mut doc = PdfDocument::open_with_password("protected.pdf", "secret")?;
let text = doc.extract_text(0)?;
println!("{}", text);
Go
doc, _ := pdfoxide.Open("protected.pdf")
defer doc.Close()
if _, err := doc.Authenticate("secret"); err != nil { log.Fatal(err) }
text, _ := doc.ExtractText(0)
fmt.Println(text)
C#
using var doc = PdfDocument.OpenWithPassword("protected.pdf", "secret");
Console.WriteLine(doc.ExtractText(0));
Java
try (PdfDocument doc = PdfDocument.open("protected.pdf", "secret")) {
System.out.println(doc.extractText(0));
}
Kotlin
PdfDocument.open("protected.pdf", "secret").use { doc ->
println(doc.extractText(0))
}
Scala
Using.resource(PdfDocument.open("protected.pdf", "secret")) { doc =>
println(doc.extractText(0))
}
Clojure
(with-open [doc (pdf/open "protected.pdf" "secret")]
(println (pdf/extract-text doc 0)))
Ruby
PdfOxide::PdfDocument.open('protected.pdf', password: 'secret') do |doc|
puts doc.extract_text(0)
end
C++
auto doc = pdf_oxide::Document::open_with_password("protected.pdf", "secret");
std::cout << doc.extract_text(0) << '\n';
Swift
let doc = try Document.openWithPassword("protected.pdf", password: "secret")
print(try doc.extractText(0))
Dart
final doc = PdfDocument.openWithPassword('protected.pdf', 'secret');
print(doc.extractText(0));
doc.close();
R
doc <- pdf_open_with_password("protected.pdf", "secret")
cat(pdf_extract_text(doc, 0))
Julia
doc = open_with_password("protected.pdf", "secret")
println(extract_text(doc, 0))
Zig
var doc = try pdf_oxide.Document.openWithPassword("protected.pdf", "secret");
const text = try doc.extractText(a, 0);
std.debug.print("{s}\n", .{text});
Objective-C
POXDocument *doc = [POXDocument openWithPassword:@"protected.pdf"
password:@"secret" error:&err];
NSLog(@"%@", [doc extractText:0 error:&err]);
Elixir
{:ok, doc} = PdfOxide.open_with_password("protected.pdf", "secret")
{:ok, text} = PdfOxide.extract_text(doc, 0)
IO.puts(text)
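PDFs encrypted with AES-256, AES-128, and RC4 are supported. pdfplumber cannot open encrypted files at all and pdfminer raises errors on AES-256, whereas PDF Oxide handles every standard PDF encryption scheme transparently.
Output as Markdown
Get structured output with headings and formatting: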
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")
md = doc.to_markdown(0, detect_headings=True)
print(md)
WASM
const doc = new WasmPdfDocument(bytes);
const md = doc.toMarkdown(0);
console.log(md);
doc.free();
Rust
let mut doc = PdfDocument::open("paper.pdf")?;
let md = doc.to_markdown(0, true)?;
println!("{}", md);
Go
doc, _ := pdfoxide.Open("paper.pdf")
defer doc.Close()
md, _ := doc.ToMarkdown(0)
fmt.Println(md)
C#
using var doc = PdfDocument.Open("paper.pdf");
Console.WriteLine(doc.ToMarkdown(0));
Java
try (PdfDocument doc = PdfDocument.open(Path.of("paper.pdf"))) {
System.out.println(doc.toMarkdown(0));
}
Kotlin
PdfDocument.open(Path.of("paper.pdf")).use { doc ->
println(doc.toMarkdown(0))
}
Scala
Using.resource(PdfDocument.open("paper.pdf")) { doc =>
println(doc.toMarkdown(0))
}
Clojure
(with-open [doc (pdf/open "paper.pdf")]
(println (pdf/to-markdown doc 0)))
PHP
$doc = PdfDocument::open('paper.pdf');
echo $doc->toMarkdown(0);
$doc->close();
Ruby
PdfOxide::PdfDocument.open('paper.pdf') do |doc|
puts doc.to_markdown(0)
end
C++
auto doc = pdf_oxide::Document::open("paper.pdf");
std::cout << doc.to_markdown(0) << '\n';
Swift
let doc = try Document.open("paper.pdf")
print(try doc.toMarkdown(0))
Dart
final doc = PdfDocument.open('paper.pdf');
print(doc.toMarkdown(0));
doc.close();
R
doc <- pdf_open("paper.pdf")
cat(pdf_to_markdown(doc, 0))
Julia
doc = open_document("paper.pdf")
println(to_markdown(doc, 0))
Zig
var doc = try pdf_oxide.Document.open("paper.pdf");
const md = try doc.toMarkdown(a, 0);
std.debug.print("{s}\n", .{md});
Objective-C
POXDocument *doc = [POXDocument openPath:@"paper.pdf" error:&err];
NSLog(@"%@", [doc toMarkdown:0 error:&err]);
Elixir
{:ok, doc} = PdfOxide.open("paper.pdf")
{:ok, md} = PdfOxide.to_markdown(doc, 0)
IO.puts(md)
For RAG and LLM integration, see the PDF to Markdown guide.
Searching Text in a PDF
Search for text across all pages, with position information:
Python
from pdf_oxide import PdfDocument
doc = PdfDocument("manual.pdf")
results = doc.search("configuration")
for r in results:
    print(f"Page {r.page}: '{r.text}' at ({r.x:.0f}, {r.y:.0f})")
WASM
const doc = new WasmPdfDocument(bytes);
const results = doc.search("configuration", false);
for (const r of results) {
console.log(`Page ${r.page}: '${r.text}' at (${r.x.toFixed(0)}, ${r.y.toFixed(0)})`);
}
doc.free();
Rust
let mut pdf = Pdf::open("manual.pdf")?;
let results = pdf.search("configuration")?;
for r in &results {
println!("Page {}: '{}' at ({:.0}, {:.0})", r.page, r.text, r.bbox.x, r.bbox.y);
}
Go
doc, _ := pdfoxide.Open("manual.pdf")
defer doc.Close()
results, _ := doc.SearchAll("configuration", false)
for _, r := range results {
fmt.Printf("Page %d: %q at (%.0f, %.0f)\n", r.PageIndex, r.Text, r.X, r.Y)
}
C#
using var doc = PdfDocument.Open("manual.pdf");
foreach (var r in doc.SearchAll("configuration", caseSensitive: false))
Console.WriteLine($"Page {r.PageIndex}: '{r.Text}' at ({r.X:F0}, {r.Y:F0})");
Java
import fyi.oxide.pdf.search.SearchMatch;
try (PdfDocument doc = PdfDocument.open(Path.of("manual.pdf"))) {
for (SearchMatch m : doc.search("configuration")) {
System.out.printf("Page %d: '%s' at (%.0f, %.0f)%n",
m.pageIndex(), m.text(), m.bbox().x0(), m.bbox().y0());
}
}
Kotlin
PdfDocument.open(Path.of("manual.pdf")).use { doc ->
for (m in doc.search("configuration")) {
println("Page ${m.pageIndex()}: '${m.text()}' at (${m.bbox().x0()}, ${m.bbox().y0()})")
}
}
Scala
import fyi.oxide.pdf.searchSeq
Using.resource(PdfDocument.open("manual.pdf")) { doc =>
for (m <- doc.searchSeq("configuration"))
println(f"Page ${m.pageIndex}: '${m.text}' at (${m.bbox.x0}%.0f, ${m.bbox.y0}%.0f)")
}
Clojure
(with-open [doc (pdf/open "manual.pdf")]
(doseq [m (pdf/search doc "configuration")]
(let [b (.bbox m)]
(println (format "Page %d: '%s' at (%.0f, %.0f)"
(.pageIndex m) (.text m) (.x0 b) (.y0 b))))))
Ruby
PdfOxide::PdfDocument.open('manual.pdf') do |doc|
doc.search('configuration').each do |r|
puts "Page #{r[:page]}: '#{r[:text]}' at (#{r[:bbox][:x].round}, #{r[:bbox][:y].round})"
end
end
C++
auto doc = pdf_oxide::Document::open("manual.pdf");
for (const auto& r : doc.search_all("configuration", /*case_sensitive=*/false)) {
std::printf("Page %d: '%s' at (%.0f, %.0f)\n",
r.page, r.text.c_str(), r.bbox.x, r.bbox.y);
}
Swift
let doc = try Document.open("manual.pdf")
for r in try doc.searchAll("configuration", false) {
print("Page \(r.page): '\(r.text)' at (\(r.bbox.x), \(r.bbox.y))")
}
Dart
final doc = PdfDocument.open('manual.pdf');
for (final r in doc.searchAll('configuration', false)) {
print("Page ${r.page}: '${r.text}' at (${r.bbox.x}, ${r.bbox.y})");
}
doc.close();
R
doc <- pdf_open("manual.pdf")
for (r in pdf_search_all(doc, "configuration", case_sensitive = FALSE)) {
cat(sprintf("Page %d: '%s' at (%.0f, %.0f)\n",
r$page, r$text, r$bbox$x, r$bbox$y))
}
Julia
doc = open_document("manual.pdf")
for r in search_all(doc, "configuration", false)
println("Page $(r.page): '$(r.text)' at ($(r.bbox.x), $(r.bbox.y))")
end
Zig
var doc = try pdf_oxide.Document.open("manual.pdf");
const hits = try doc.searchAll(a, "configuration", false);
defer pdf_oxide.Document.freeSearchResults(a, hits);
for (hits) |r| {
std.debug.print("Page {d}: '{s}' at ({d:.0}, {d:.0})\n",
.{ r.page, r.text, r.bbox.x, r.bbox.y });
}
Objective-C
POXDocument *doc = [POXDocument openPath:@"manual.pdf" error:&err];
for (POXSearchResult *r in [doc searchAll:@"configuration" caseSensitive:NO error:&err]) {
NSLog(@"Page %ld: '%@' at (%.0f, %.0f)",
(long)r.page, r.text, r.bbox.x, r.bbox.y);
}
Elixir
{:ok, doc} = PdfOxide.open("manual.pdf")
{:ok, hits} = PdfOxide.search_all(doc, "configuration", false)
Enum.each(hits, fn r ->
IO.puts("Page #{r.page}: '#{r.text}' at (#{r.bbox.x}, #{r.bbox.y})")
end)
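Comparison with Other Python PDF Libraries
Several Python libraries can extract text from PDFs. Here is how they compare:
- pypdf: Pure Python with no C dependencies. Easy to install but slow (12 ms per page), and its limited font and encoding support causes about 1.6% of PDFs to fail. No character position data. Suited to simple PDFs where speed is not critical.
- pdfplumber: Built on pdfminer, with fine-grained character and table extraction. Very slow (23 ms per page) and cannot open encrypted PDFs. Suited to table extraction that needs cell-level data and has no performance requirements.
- PyMuPDF (fitz): Python bindings for the MuPDF C library. Fast (4.6 ms per page) and reliable (99.3% pass rate). Depends on the MuPDF C library and is AGPL-licensed. A solid choice if the license fits.
- pypdfium2: Python bindings for Google's PDFium engine. Fast (4.1 ms per page), but p99 latency on complex documents is high (42 ms). Its API coverage is narrower than PyMuPDF's.
- pdfminer.six: Pure Python with detailed layout analysis. Extremely slow, and the project is barely maintained. Fails on AES-256-encrypted PDFs. Largely superseded by pdfplumber for most use cases.
- PDF Oxide: Rust core with Python bindings via PyO3. Fastest of the group (0.8 ms per page), with a 100% pass rate, support for all encryption schemes, and built-in OCR. MIT-licensed, with no system dependencies.
PDF Oxide is designed to fill the gaps left by existing libraries: the speed bottleneck of pure-Python parsers, MuPDF's licensing constraints, and the reliability problems that existing libraries have with unusual fonts, corrupted cross-reference tables, and non-standard encodings.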
Performance: How Fast Is PDF Oxide?
Benchmarked on 3,830 PDFs drawn from three independent public test suites:
| Library | Mean | p99 | Pass rate |
|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% |
| PyMuPDF | 4.6ms | 28ms | 99.3% |
| pypdfium2 | 4.1ms | 42ms | 99.2% |
| pypdf | 12.1ms | 97ms | 98.4% |
| pdfplumber | 23.2ms | 189ms | 98.8% |
Estimated pipeline time for 10,000 PDFs:
- PDF Oxide: 8 seconds
- PyMuPDF: 46 seconds
- pypdf: 2 minutes
- pdfplumber: 3.9 minutes
See the full benchmark report for methodology and reproduction steps.
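The pipeline estimates above can be reproduced from the mean column of the table. The short calculation below treats the mean as the cost of one extraction per PDF, which is an assumption made here to match the listed figures. Multi-page documents would scale the totals accordingly.

```python
# Mean extraction times in milliseconds, copied from the benchmark table above.
MEAN_MS = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def pipeline_seconds(mean_ms: float, documents: int = 10_000) -> float:
    """Total time in seconds, assuming one mean-cost extraction per document."""
    return mean_ms * documents / 1000.0


for name, mean in MEAN_MS.items():
    seconds = pipeline_seconds(mean)
    print(f"{name}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

The results are single-threaded estimates. They say nothing about I/O time or parallel speedup.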
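Common Issues and Troubleshooting
Empty text output
If extract_text() returns an empty string, the page most likely contains scanned images rather than text. Use extract_text_ocr() instead. See OCR for Scanned PDFs for configuration.
Garbled or wrong characters
This usually means the font uses a non-standard encoding vector or lacks a ToUnicode CMap. PDF Oxide handles the vast majority of encoding edge cases, but some deliberately obfuscated PDFs (DRM-protected content) may still produce incorrect output.
Missing spaces or run-together words
PDF text operators place glyphs individually, and space inference relies on the ratio of the inter-character gap to the font's space width. If words run together, try extract_chars() and implement your own spacing logic from the character coordinates.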
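A minimal sketch of such custom spacing logic follows. The Char dataclass is a hypothetical stand-in that uses the bbox layout (x0, y0, x1, y1) from the field table. Because the font's real space width is not available in this sketch, it uses a fraction of the font size as a proxy, and the 0.25 threshold is an illustrative guess that would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for one extract_chars() result; bbox is (x0, y0, x1, y1)."""
    char: str
    bbox: tuple
    font_size: float


def join_with_spaces(line, gap_ratio=0.25):
    """Join the characters of one line, inserting a space wherever the horizontal
    gap to the previous character exceeds gap_ratio * font_size."""
    out = []
    prev = None
    for ch in sorted(line, key=lambda c: c.bbox[0]):
        if prev is not None:
            gap = ch.bbox[0] - prev.bbox[2]  # left edge minus previous right edge
            if gap > gap_ratio * prev.font_size and prev.char != " " and ch.char != " ":
                out.append(" ")
        out.append(ch.char)
        prev = ch
    return "".join(out)
```

The function expects the characters of a single line, so the input should first be grouped by y coordinate. It also assumes left-to-right text.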
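Output differs from other libraries
Libraries use different heuristics for space inference, line breaks, and reading order. Across the 3,830 PDFs, PDF Oxide's text agrees with PyMuPDF's 99.5% of the time; the remaining 0.5% comes mainly from whitespace normalization and ligature handling.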
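Real-World Use Cases
Search indexing: Extract text from every page of every PDF in a document library and feed it into Elasticsearch, Typesense, or a vector database to build full-text search. PDF Oxide's speed makes on-demand reindexing of thousands of documents practical.
RAG pipelines (retrieval-augmented generation): Extract and chunk PDF text, then generate embeddings with OpenAI, Cohere, or open-source models. Use extract_spans() to preserve the heading hierarchy so that chunks align with document sections. For LLM-optimized output, see the PDF to Markdown guide.
Compliance and auditing: Scan contracts, invoices, and regulatory filings for specific clauses or keywords. Use doc.search() to locate the exact position of each term across the document, or extract the full text for NLP-based clause identification.
Data extraction: Pull structured data from invoices, receipts, bank statements, and forms. Combine the position data from extract_chars() with business-specific rules to locate fields such as "Total" or "Invoice Date" and extract the adjacent values.
Academic research: Process large collections of research papers for literature reviews, citation extraction, or meta-analysis. PDF Oxide covers the PDF generators common in academic publishing (LaTeX, Word, InDesign, Quark) and their font encodings.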
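For the RAG scenario above, one way to align chunks with document sections is to split the Markdown output at heading lines. The sketch below assumes the Markdown uses ATX-style headings (lines starting with #), which is not confirmed for to_markdown() here, and chunk_by_heading is a hypothetical helper name.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str):
    """Split Markdown into (heading, body) chunks at ATX heading lines.

    Text that appears before the first heading is returned with heading None.
    """
    chunks = []
    title, body = None, []

    def flush():
        # Emit the current chunk unless it is an empty, untitled preamble.
        if title is not None or any(line.strip() for line in body):
            chunks.append((title, "\n".join(body).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded together with its heading as context. Very long sections would still need a secondary size-based split, and heading-like lines inside fenced code would be treated as headings by this simple version.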
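Related Pages
- Text Extraction API: complete API reference
- PDF to Markdown: structured conversion
- Batch Processing: parallel processing approaches
- OCR for Scanned PDFs: OCR configuration and usage
- Performance Benchmarks: methodology and results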