English · 简体中文 · 繁體中文 · Español
Auditable LLM extraction for Java. Parse documents, extract structured data, and attach field-level citations, confidence, and provenance.
DocTruth is for teams that need to answer one question reliably:
Where did this extracted value come from?
It is not an agent framework, chain framework, vector database wrapper, or UI. It is a small Java library for the extraction boundary: source document in, validated structured output plus evidence trail out.
Requires Java 25+. Verify Maven Central availability:
mvn dependency:get -Dartifact=ai.doctruth:doctruth-java:0.2.0-alphaUse in a Maven project:
<dependency>
<groupId>ai.doctruth</groupId>
<artifactId>doctruth-java</artifactId>
<version>0.2.0-alpha</version>
</dependency>Gradle uses the same coordinate: ai.doctruth:doctruth-java:0.2.0-alpha.
Upgrade to the latest release:
mvn versions:use-latest-releases -Dincludes=ai.doctruth:doctruth-java -DgenerateBackupPoms=falseimport ai.doctruth.DocTruth;
import ai.doctruth.OpenAiProvider;
import ai.doctruth.PdfDocumentParser;
import java.math.BigDecimal;
import java.nio.file.Path;
import java.time.LocalDate;
record Contract(String partyA, String partyB, LocalDate effectiveDate, BigDecimal totalValue) {}
var doc = PdfDocumentParser.parse(Path.of("contract.pdf"));
var result = DocTruth.from(new OpenAiProvider(System.getenv("OPENAI_API_KEY")))
.extract("Extract the contract terms", Contract.class)
.withProvenance()
.withConfidence()
.withBitemporal()
.run(doc);
Contract contract = result.value();
var partyACitation = result.citations().get("partyA");See examples/quickstart for a runnable example.
- Parses PDF, DOCX, XLSX, and CSV into sections with source locations.
- Extracts Java records or JSON Schema-bound objects through LLM providers.
- Validates structured output locally and retries repairable failures.
- Matches extracted fields back to exact source quotes.
- Returns per-field
Citation,Confidence, andProvenance. - Exports W3C PROV-O JSON-LD audit files with
toAuditJson(...).
Java records and simple POJOs are the native path. DocTruth turns the target Java type into the same JSON Schema contract it sends to providers and validates locally before deserializing the response.
Supported Java-native schema shapes include nested records/classes, List<T>,
Map<String, T>, enums, String, booleans, integer and decimal numbers,
BigDecimal, LocalDate, and Jackson property annotations such as
@JsonProperty and @JsonIgnore. Optional<T> is treated as an optional field:
it is omitted from required, while the wrapped value type is still reflected in
the generated schema. Raw Object and unbounded shapes fail fast instead of
becoming unauditable catch-all objects.
JSON Schema remains the interoperability path for external schema producers such as Pydantic.
var schema = JsonSchema.from(Path.of("contract.schema.json"));
var result = DocTruth.from(provider)
.extractJson("Extract contract terms", schema)
.requireCitation("partyA")
.requireCitation("totalValue")
.withMaxRetries(2)
.runJson(doc);DocTruth supports common Pydantic v2 JSON Schema exports, including local $defs / $ref, nullable unions, nested objects, arrays, enums, required fields, scalar constraints, and additionalProperties=false.
Build-time helper:
java -jar target/doctruth-java-0.2.0-alpha.jar \
migrate pydantic myapp.schemas:ResumeExtraction \
--out schemas/resume.schema.json \
--checkProduction Java extraction only needs the exported schema file and the DocTruth jar.
OpenAI-compatible chat completions are the primary path because many hosted, gateway, and local models expose that API shape.
| Provider | Structured output mode |
|---|---|
| OpenAI / OpenAI-compatible | response_format: json_schema |
| Anthropic | tool-use forcing |
| Gemini | responseMimeType + responseSchema |
| DeepSeek | OpenAI-compatible JSON mode plus local validation |
Provider clients use JDK java.net.http.HttpClient; no vendor SDKs are on the classpath.
java -jar target/doctruth-java-0.2.0-alpha.jar parse contract.pdf
java -jar target/doctruth-java-0.2.0-alpha.jar migrate pydantic myapp.schemas:Model --out schema.json --check- Quickstart example
- Pydantic interop example
- Architecture
- Error handling
- Release process
- Contributing
- Changelog
0.2.0-alpha is an early public alpha. The API is usable, tested, and published for feedback, but may still change before 1.0.
Current verification baseline: 645 unit tests and 16 integration tests passing, with 2 external smoke tests skipped, coverage gates at 90% line / 80% branch, single jar about 205 KB.
Code is licensed under Apache License 2.0.
DocTruth, doctruth.ai, and the DocTruth logo are trademarks of doctruthhq. See NOTICE.

