Concurrency — Thread-Safe PDF Reads
PdfDocument has been Send + Sync on the Rust side since v0.3.22. A single document can be shared across OS threads, goroutines, worker threads, or asyncio tasks for parallel page extraction. Write operations still need serialisation — that’s what DocumentEditor is for.
What changed in v0.3.22
All 16 RefCell<T> wrappers inside PdfDocument were replaced with Mutex<T>, and Cell<usize> became AtomicUsize. The language bindings dropped the unsendable marker on Python classes (PdfDocument, PdfPage, FormField), which previously raised RuntimeError the moment they crossed a thread boundary.
Net effect: thread pools, async runtimes, and free-threaded Python all now just work.
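The guarantee behind this change can be checked at compile time: a value can be shared across threads only if its type is Send + Sync, which Mutex&lt;T&gt; and AtomicUsize are, while RefCell&lt;T&gt; and Cell&lt;usize&gt; are not. A minimal sketch of the idea — the struct and field names below are illustrative, not the crate's actual internals:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative stand-in for the post-v0.3.22 layout: every interior-mutable
// field uses a thread-safe wrapper, so the struct is automatically Send + Sync.
struct SharedDoc {
    page_cache: Mutex<Vec<String>>, // was RefCell<Vec<String>>
    hits: AtomicUsize,              // was Cell<usize>
}

// Compile-time proof: this function only accepts Send + Sync types.
// With RefCell/Cell fields, the call below would fail to compile.
fn assert_shareable<T: Send + Sync>(_: &T) {}

fn main() {
    let doc = Arc::new(SharedDoc {
        page_cache: Mutex::new(vec!["page 0 text".into()]),
        hits: AtomicUsize::new(0),
    });
    assert_shareable(&*doc);

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let doc = Arc::clone(&doc);
            thread::spawn(move || {
                doc.hits.fetch_add(1, Ordering::Relaxed);
                doc.page_cache.lock().unwrap().len()
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(doc.hits.load(Ordering::Relaxed), 4);
}
```

The same mechanism is why the Python `unsendable` marker could be dropped: once the Rust side is Send + Sync, the binding layer no longer has to forbid cross-thread access.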
Rust
```rust
use pdf_oxide::PdfDocument;
use std::sync::Arc;
use std::thread;

let doc = Arc::new(PdfDocument::open("report.pdf")?);
let page_count = doc.page_count();

let handles: Vec<_> = (0..page_count)
    .map(|i| {
        let doc = Arc::clone(&doc);
        thread::spawn(move || doc.extract_text(i))
    })
    .collect();

for h in handles {
    let text = h.join().unwrap()?;
    println!("{}", text);
}
```
With tokio:
```rust
use std::sync::Arc;
use tokio::task;

let doc = Arc::new(pdf_oxide::PdfDocument::open("report.pdf")?);

let tasks: Vec<_> = (0..doc.page_count())
    .map(|i| {
        let doc = Arc::clone(&doc);
        task::spawn_blocking(move || doc.extract_text(i))
    })
    .collect();

for t in tasks {
    let text = t.await??;
}
```
Python
```python
from concurrent.futures import ThreadPoolExecutor
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

with ThreadPoolExecutor(max_workers=8) as pool:
    pages = list(pool.map(doc.extract_text, range(doc.page_count())))
```
Under stock CPython the GIL still serialises Python-level work, but the extraction itself releases the GIL during Rust execution — so this is genuinely parallel on the Rust side. Under cp314t (free-threaded Python 3.14+), the GIL is optional and the bindings declare gil_used = false so there is no implicit serialisation at all.
With asyncio:
```python
import asyncio
from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")

async def main():
    pages = await asyncio.gather(
        *[asyncio.to_thread(doc.extract_text, i) for i in range(doc.page_count())]
    )
    return pages

asyncio.run(main())
```
Or use the ready-made AsyncPdfDocument from the async guide.
Go
Reads on *PdfDocument are protected by an internal sync.RWMutex — goroutine-safe by construction.
```go
package main

import (
    "sync"

    pdfoxide "github.com/yfedoseev/pdf_oxide/go"
)

func main() {
    doc, _ := pdfoxide.Open("report.pdf")
    defer doc.Close()

    count, _ := doc.PageCount()
    results := make([]string, count)

    var wg sync.WaitGroup
    for i := 0; i < count; i++ {
        wg.Add(1)
        go func(page int) {
            defer wg.Done()
            text, _ := doc.ExtractText(page)
            results[page] = text
        }(i)
    }
    wg.Wait()
}
```
*DocumentEditor serialises writes internally, but do not pipeline independent edits from multiple goroutines — collect mutations on one goroutine.
C#
```csharp
using PdfOxide.Core;

using var doc = PdfDocument.Open("report.pdf");

var tasks = Enumerable.Range(0, doc.PageCount)
    .Select(i => Task.Run(() => doc.ExtractText(i)));

string[] pages = await Task.WhenAll(tasks);
```
If you need fine-grained reader/writer semantics around a DocumentEditor, guard reads with a ReaderWriterLockSlim read lock (and take the write lock around edits):
```csharp
var locker = new ReaderWriterLockSlim();

locker.EnterReadLock();
try
{
    string text = doc.ExtractText(0);
}
finally
{
    locker.ExitReadLock();
}
```
Node.js
A PdfDocument can be passed to worker threads by transferring the backing handle. The simpler pattern is to let the *Async methods do the dispatching:
```javascript
const { PdfDocument } = require("pdf-oxide");

// Top-level await is not available in CommonJS, so wrap the work
// in an async function.
async function main() {
  const doc = new PdfDocument("report.pdf");
  try {
    const pageCount = doc.getPageCount();
    const pages = await Promise.all(
      Array.from({ length: pageCount }, (_, i) => doc.extractTextAsync(i))
    );
    return pages;
  } finally {
    doc.close();
  }
}

main();
```
Each *Async call runs on the libuv thread pool.
Writer serialisation
Writes (DocumentEditor, Pdf, PdfCreator) are not lock-free. If multiple threads need to modify the same document, funnel mutations through one writer goroutine / task and fan out the reads.
A common pattern:
- 1 reader: a PdfDocument shared across N reader threads.
- 1 writer: a DocumentEditor owned by a single coordinator task that collects edits from a channel or queue.
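In Rust this single-writer pattern falls out of std::sync::mpsc: producer threads send edit requests over a channel, and the one thread that owns the editor drains the queue. A sketch with a stand-in Editor type (pdf_oxide's actual DocumentEditor API is not reproduced here):

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for a write handle such as DocumentEditor; owned by one thread only.
struct Editor {
    applied: Vec<String>,
}

impl Editor {
    fn apply(&mut self, edit: String) {
        self.applied.push(edit);
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<String>();

    // The coordinator thread exclusively owns the editor and drains the queue.
    let writer = thread::spawn(move || {
        let mut editor = Editor { applied: Vec::new() };
        for edit in rx {
            editor.apply(edit);
        }
        editor.applied.len()
    });

    // N producer threads fan edits into the channel instead of touching
    // the editor directly.
    let producers: Vec<_> = (0..4)
        .map(|i| {
            let tx = tx.clone();
            thread::spawn(move || tx.send(format!("edit from thread {i}")).unwrap())
        })
        .collect();
    for p in producers {
        p.join().unwrap();
    }
    drop(tx); // close the channel so the writer loop ends

    assert_eq!(writer.join().unwrap(), 4);
}
```

Reads stay lock-free on the shared PdfDocument; only mutations pay the cost of the queue hop.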
Related
- Async Processing — awaitable wrappers and CancellationToken plumbing.
- Batch Processing — processing many files concurrently.
- Node.js Getting Started — worker-thread patterns.