Getting Started with PDF Oxide (Scala)
PDF Oxide is the fastest PDF library for the JVM with built-in text extraction: 0.8 ms mean, 100% pass rate on 3,830 PDFs. The Scala 3 binding is a thin, idiomatic facade over the mature Java binding. It adds no native code of its own; instead it layers Scala extension methods that turn java.util.Optional[T] into Option[T] and java.util.List[T] into Seq[T]. Because the handles are AutoCloseable, they work directly with scala.util.Using.
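To make the Optional-to-Option and List-to-Seq conversions concrete, here is a minimal sketch of how facade-style extension methods can be written with the standard scala.jdk converters. JavaStyleDoc and its methods are hypothetical stand-ins, not PDF Oxide classes, and the real facade may be implemented differently.

```scala
import java.util.{List as JList, Optional}
import scala.jdk.CollectionConverters.*
import scala.jdk.OptionConverters.*

// Hypothetical stand-in for a Java-style handle; NOT a PDF Oxide class.
final class JavaStyleDoc:
  def producer(): Optional[String] = Optional.of("example-producer")
  def creator(): Optional[String] = Optional.empty()
  def words(): JList[String] = JList.of("Hello", "pdf_oxide")

// Facade-style extensions: same data, exposed through Scala types.
extension (doc: JavaStyleDoc)
  def producerOption: Option[String] = doc.producer().toScala
  def creatorOption: Option[String] = doc.creator().toScala
  def wordsSeq: Seq[String] = doc.words().asScala.toSeq

@main def facadeSketch(): Unit =
  val doc = JavaStyleDoc()
  println(doc.producerOption.getOrElse("(none)"))
  println(doc.creatorOption.getOrElse("(none)"))
  println(doc.wordsSeq.mkString(", "))
```

Because the extensions only wrap existing calls, a pattern like this adds no state and no native code, which matches how the facade is described above.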
Installation
Add the dependency to your build.sbt:
libraryDependencies += "fyi.oxide" % "pdf-oxide" % "0.3.69"
The Scala facade depends on the fyi.oxide:pdf-oxide Java binding, which owns the single JNI native bridge. Scala 3.3+ is required.
Quick Start
Build a PDF from Markdown, then open it and extract text back out. Using.resource closes each handle for you.
import fyi.oxide.pdf.{Pdf, PdfDocument, producerOption}
import scala.util.Using
Using.resource(Pdf.fromMarkdown("# Hello pdf_oxide\n\nThis is a **Scala** binding.\n")): pdf =>
  Using.resource(PdfDocument.open(pdf.save())): doc =>
    println(s"pages: ${doc.pageCount()}")
    println(s"producer: ${doc.producerOption.getOrElse("(none)")}")
    println(doc.extractText(0))
Pdf.fromMarkdown returns a Pdf handle; pdf.save() serializes it to an Array[Byte]. PdfDocument.open accepts those bytes and exposes the document API.
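When several handles are open at once, nesting Using.resource calls can get deep. The standard library's Using.Manager is one alternative: it registers each resource, closes them in reverse order, and returns a Try instead of throwing. The sketch below demonstrates that behavior with a hypothetical Handle class standing in for the library's AutoCloseable handles; it does not call PDF Oxide itself.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.{Try, Using}

// Hypothetical stand-in for a native handle; it records when it is closed.
final class Handle(name: String, log: ListBuffer[String]) extends AutoCloseable:
  def describe: String = s"handle($name)"
  override def close(): Unit = log += s"closed $name"

// Using.Manager closes registered resources in reverse order of registration
// and wraps the result (or any exception) in a Try.
def useBoth(log: ListBuffer[String]): Try[String] =
  Using.Manager { use =>
    val pdf = use(Handle("pdf", log)) // registered first, closed last
    val doc = use(Handle("doc", log)) // registered second, closed first
    s"${pdf.describe} -> ${doc.describe}"
  }

@main def managerSketch(): Unit =
  val log = ListBuffer.empty[String]
  println(useBoth(log))
  println(log.mkString(", "))
```

Using.resource, as used in the Quick Start, rethrows exceptions directly; Using.Manager is the better fit when you would rather inspect a Try.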
Text Extraction
Plain Text
Extract plain text from any page by its zero-based index.
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  assert(doc.isOpen)
  val text = doc.extractText(0)
  println(text)
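The snippets in this guide assume a pdfBytes value of type Array[Byte] is already in scope. One way to obtain it, sketched here with only the JDK's java.nio.file API, is to read a file from disk. The helper name loadBytes and the path input.pdf are illustrative placeholders, not part of the library.

```scala
import java.nio.file.{Files, Path}

// Reads the whole file into memory. The resulting array is what the
// examples in this guide call pdfBytes.
def loadBytes(path: Path): Array[Byte] =
  Files.readAllBytes(path)

// Usage (input.pdf is a placeholder path):
//   val pdfBytes = loadBytes(Path.of("input.pdf"))
```

Files.readAllBytes loads the entire file at once, which is simple but holds the whole document in memory.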
Markdown and HTML
Convert the whole document to Markdown or HTML in a single call.
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.toMarkdown())
  println(doc.toHtml())
Page Elements
doc.page(i) returns a PdfPage. The facade exposes each element extractor as a Scala Seq via the *Seq extension methods: wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, and annotationsSeq. Each TextWord carries its text and a bbox.
import fyi.oxide.pdf.{PdfDocument, wordsSeq, linesSeq, charsSeq, tablesSeq, imagesSeq, annotationsSeq}
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val page = doc.page(0)
  println(s"size: ${page.width()} x ${page.height()}")
  page.wordsSeq.take(8).foreach { w =>
    println(s" ${w.text} @ ${w.bbox} (w=${w.bbox.width})")
  }
  println(s"lines: ${page.linesSeq.size}")
  println(s"chars: ${page.charsSeq.size}")
  println(s"tables: ${page.tablesSeq.size}")
  println(s"images: ${page.imagesSeq.size}")
  println(s"annotations: ${page.annotationsSeq.size}")
You can also iterate every page as a Seq with doc.pagesSeq (its size matches doc.pageCount()).
import fyi.oxide.pdf.{PdfDocument, pagesSeq, wordsSeq}
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  doc.pagesSeq.zipWithIndex.foreach { (page, i) =>
    println(s"page $i: ${page.wordsSeq.size} words")
  }
Search
doc.searchSeq(query) returns a Seq[SearchMatch]. Each match exposes its text.
import fyi.oxide.pdf.{PdfDocument, searchSeq}
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val matches = doc.searchSeq("Hello")
  println(s"${matches.size} match(es)")
  matches.foreach(m => println(s" ${m.text}"))
Metadata as Option
Nullable document metadata surfaces as Option[String] through producerOption and creatorOption, so you handle absent values the Scala way.
import fyi.oxide.pdf.{PdfDocument, producerOption, creatorOption, formFieldsSeq}
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  println(doc.producerOption.getOrElse("(unknown producer)"))
  println(doc.creatorOption.getOrElse("(unknown creator)"))
  // Form fields come back as a Seq too:
  println(s"form fields: ${doc.formFieldsSeq.size}")
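Once metadata arrives as Option[String], it composes with ordinary pattern matching. The sketch below is plain Scala with no library calls: the describe function is a hypothetical helper, and its two parameters stand in for the values you would get from doc.producerOption and doc.creatorOption.

```scala
// Hypothetical helper: combines two optional metadata values into one line.
// In real code the arguments would come from doc.producerOption and
// doc.creatorOption.
def describe(producer: Option[String], creator: Option[String]): String =
  (producer, creator) match
    case (Some(p), Some(c)) => s"created with $c, produced by $p"
    case (Some(p), None)    => s"produced by $p"
    case (None, Some(c))    => s"created with $c"
    case (None, None)       => "no tool metadata"

@main def metadataSketch(): Unit =
  println(describe(Some("example-producer"), None))
  println(describe(None, None))
```

The match is exhaustive over both options, so the compiler can flag a missing case if the function is later extended.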
Rendering
doc.render(i) rasterizes a page and returns the encoded image bytes.
import fyi.oxide.pdf.PdfDocument
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val png = doc.render(0)
  java.nio.file.Files.write(java.nio.file.Path.of("page-0.png"), png)
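The text above says only that render returns encoded image bytes; the example's file name suggests PNG, but that is an assumption. If you want to confirm the format before choosing a file extension, one option is to check for the eight-byte PNG signature. The helper looksLikePng below is hypothetical and uses no library calls.

```scala
// The eight-byte signature that begins every PNG file.
val pngSignature: Array[Byte] =
  Array(0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A).map(_.toByte)

// Hypothetical helper: true when the bytes start with the PNG signature.
def looksLikePng(bytes: Array[Byte]): Boolean =
  bytes.length >= pngSignature.length &&
    java.util.Arrays.equals(bytes.take(pngSignature.length), pngSignature)

// Usage, assuming png holds the bytes returned by doc.render(0):
//   val extension = if looksLikePng(png) then "png" else "bin"
```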
Auto Extraction
AutoExtractor.of(doc).extractDocument() returns an AutoResult containing the extracted text, optional Markdown and HTML renderings, and the list of pages that still need OCR. The facade exposes the optional and list-valued fields idiomatically as markdownOption, htmlOption, and pagesNeedingOcrSeq.
import fyi.oxide.pdf.{PdfDocument, AutoExtractor, markdownOption, htmlOption, pagesNeedingOcrSeq}
import scala.util.Using
Using.resource(PdfDocument.open(pdfBytes)): doc =>
  val result = AutoExtractor.of(doc).extractDocument()
  println(result.text)
  result.markdownOption.foreach(println)
  result.htmlOption.foreach(println)
  println(s"pages needing OCR: ${result.pagesNeedingOcrSeq}")
Editing
DocumentEditor.open opens an existing PDF for structural edits. Here we scrub metadata and serialize the result back to bytes.
import fyi.oxide.pdf.DocumentEditor
import scala.util.Using
Using.resource(DocumentEditor.open(pdfBytes)): editor =>
  assert(editor.isOpen)
  editor.scrubMetadata()
  val cleaned: Array[Byte] = editor.save()
  java.nio.file.Files.write(java.nio.file.Path.of("scrubbed.pdf"), cleaned)
Next Steps
- Rust Getting Started – using PDF Oxide from Rust
- Python Getting Started – using PDF Oxide from Python
- Text Extraction – detailed extraction options and recipes
- PDF Creation – advanced creation, encryption, and metadata
- Editing – modifying existing PDFs, annotations, and form fields