Updated: 2026-04-08
This app has two distinct layers:
- semantic interpretation
- deterministic PDF writing and release gating
That split is deliberate. Gemini is used where meaning is hard. The PDF writer stays deterministic.
The visible product model is also split cleanly:
- release-ready output
- manual remediation when trustworthiness is not high enough
- optional advanced review only for human-legible visible output
flowchart TD
A["Input PDF"] --> B["Classify"]
B --> C["OCR when needed"]
C --> D["Structure extraction"]
D --> E["Canonical document model"]
E --> F["Semantic unit builder"]
F --> G["Grounding evidence"]
G --> G1["page image"]
G --> G2["crop image"]
G --> G3["native text"]
G --> G4["OCR text"]
G --> G5["nearby context"]
F --> H["Direct Gemini structured outputs\nFiles API + context cache"]
H --> I["Resolved semantic decisions"]
I --> J["Pretag rationalization\n(widgets, figures, structure)"]
J --> K["Deterministic tagger/remediator"]
K --> L["veraPDF + fidelity gate"]
L --> M{"Release-ready"}
M -->|Yes| N["Release-ready PDF"]
M -->|No| O["Manual remediation"]
N --> P["Optional visible review surface"]
The product uses anonymous browser sessions instead of user accounts.
- FastAPI middleware assigns an HTTP-only cookie to each browser
- every job row is owned by a hash of that session token
- list, detail, download, preview, SSE progress, and review routes are all scoped to the current browser session
- jobs and their files are ephemeral and expire after JOB_TTL_HOURS, which defaults to 12
This keeps the app login-free while preventing one browser session from seeing another session's PDFs through the product API. It does not change the fact that semantic adjudication still uses the configured external LLM provider.
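A minimal sketch of the ownership and expiry rules above, using only the standard library. The names `new_session_token`, `owner_hash`, `Job`, and `visible_jobs` are hypothetical and not taken from the codebase; the choice of SHA-256 is also an assumption, since the hash function is not specified here.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

JOB_TTL_HOURS = 12  # default stated in this document


def new_session_token() -> str:
    """Random token that would be stored in the HTTP-only cookie."""
    return secrets.token_urlsafe(32)


def owner_hash(session_token: str) -> str:
    """Hash stored on the job row, so the raw token is never persisted.

    SHA-256 is an assumption made for this sketch.
    """
    return hashlib.sha256(session_token.encode("utf-8")).hexdigest()


@dataclass
class Job:  # hypothetical stand-in for a job row
    job_id: str
    owner: str  # owner_hash of the creating session
    created_at: datetime


def visible_jobs(
    jobs: List[Job], session_token: str, now: Optional[datetime] = None
) -> List[Job]:
    """Return only unexpired jobs owned by the current browser session."""
    now = now or datetime.now(timezone.utc)
    owner = owner_hash(session_token)
    cutoff = now - timedelta(hours=JOB_TTL_HOURS)
    return [j for j in jobs if j.owner == owner and j.created_at > cutoff]
```

Every session-scoped route would apply the same owner filter, so a job ID leaked to another browser would still not resolve through the product API.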
The semantic layer no longer treats text, tables, forms, figures, and TOC candidates as unrelated flows. It normalizes them into local regions with shared evidence:
- page number
- bounding box
- kind candidate
- native text candidate
- OCR text candidate
- image crop
- nearby structure context
- confidence and provenance
Current semantic-unit families:
- suspicious text blocks
- reading-order pages
- tables
- forms
- figures
- TOC groups
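One way to picture the shared evidence record is a single dataclass reused by every unit family. This is an illustrative sketch only: the class names, field names, and enum values below are assumptions, not the actual model in the codebase.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class UnitFamily(Enum):  # hypothetical names for the families listed above
    SUSPICIOUS_TEXT = "suspicious_text"
    READING_ORDER = "reading_order"
    TABLE = "table"
    FORM = "form"
    FIGURE = "figure"
    TOC_GROUP = "toc_group"


@dataclass
class SemanticUnit:
    """A local region plus the evidence shared across all families."""

    family: UnitFamily
    page_number: int
    bbox: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1)
    kind_candidate: str
    native_text: Optional[str] = None
    ocr_text: Optional[str] = None
    crop_image_path: Optional[str] = None
    nearby_context: List[str] = field(default_factory=list)
    confidence: float = 0.0
    provenance: str = "unknown"


unit = SemanticUnit(
    family=UnitFamily.FIGURE,
    page_number=3,
    bbox=(72.0, 144.0, 540.0, 360.0),
    kind_candidate="figure",
    ocr_text="Quarterly revenue by region",
    confidence=0.42,
    provenance="ocr",
)
```

Because tables, forms, figures, and TOC groups share one shape, a figure candidate that turns out to be a table can be reclassified by changing `family` rather than by moving data between unrelated flows.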
Gemini is the primary semantic judge for hard units.
It decides things like:
- what assistive tech should hear for a garbled block
- which table rows are headers
- what a form field should be labeled
- whether a figure candidate is actually a figure, a table, or a form region
- whether a page region is a TOC group
Gemini is not allowed to write PDF objects directly.
The deterministic layer is responsible for:
- pretag rationalization of suspicious widgets and under-described visual figures
- PDF/UA tag tree construction
- /ActualText
- form /TU
- artifacts
- bookmarks and TOC structure
- font remediation
- metadata
- final validation and fidelity gating
Main implementation files:
- backend/app/pipeline/orchestrator.py
- backend/app/pipeline/tagger.py
- backend/app/pipeline/validator.py
- backend/app/pipeline/fidelity.py
- backend/app/services/intelligence_gemini_pages.py
- backend/app/services/intelligence_gemini_tables.py
- backend/app/services/intelligence_gemini_forms.py
- backend/app/services/intelligence_gemini_figures.py
- backend/app/services/intelligence_gemini_toc.py
- backend/app/services/llm_client.py
- backend/app/services/intelligence_llm_utils.py
- backend/app/services/gemini_direct.py
The target transport is direct Gemini for PDF-understanding lanes.
The decision rule is Docling-first:
- trust Docling-native title, language, hyperlink/widget metadata, and native TOC when present
- escalate to Gemini only when the extracted document evidence is missing, weak, or semantically ambiguous
- build Docling-derived ambiguity plans first so Gemini sees only unresolved units, not whole lanes
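The Docling-first rule can be sketched as a small filter that decides which units are escalated. Everything named here is hypothetical (`Evidence`, `needs_gemini`, `build_ambiguity_plan`), and the `WEAK_THRESHOLD` value is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

WEAK_THRESHOLD = 0.6  # assumed cut-off for "weak" evidence


@dataclass
class Evidence:  # hypothetical per-unit summary of extracted evidence
    unit_id: str
    native_value: Optional[str]  # e.g. extracted title or native TOC entry
    confidence: float
    ambiguous: bool = False


def needs_gemini(e: Evidence) -> bool:
    """Escalate only when evidence is missing, weak, or ambiguous."""
    if e.native_value is None or not e.native_value.strip():
        return True  # missing
    if e.confidence < WEAK_THRESHOLD:
        return True  # weak
    return e.ambiguous  # semantically ambiguous


def build_ambiguity_plan(evidence: List[Evidence]) -> List[str]:
    """Return the IDs of unresolved units, not whole lanes."""
    return [e.unit_id for e in evidence if needs_gemini(e)]


plan = build_ambiguity_plan(
    [
        Evidence("title", "Annual Report 2025", 0.95),
        Evidence("toc-1", None, 0.0),
        Evidence("table-4", "Region Q1 Q2", 0.4),
        Evidence("field-7", "Name", 0.9, ambiguous=True),
    ]
)
```

Here only the three unresolved units would reach the model, while the trusted native title is used as-is.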
Important properties:
- Gemini Files API / cached PDF context for reusable document slices
- native response_json_schema structured output
- candidate-ID adjudication for bookmark and navigation decisions
- retry and timeout bounds
- audit-grade token and cost tracking
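Candidate-ID adjudication can be read as: the model may only pick from IDs the pipeline offered, and anything else is rejected. The sketch below assumes a JSON response with a `candidate_id` key; that key name and the function `resolve_candidate` are hypothetical, and no Gemini SDK call is shown.

```python
import json
from typing import Optional, Set


def resolve_candidate(raw_response: str, candidate_ids: Set[str]) -> Optional[str]:
    """Accept the model's choice only if it names an offered candidate.

    Returns None for malformed JSON, an unexpected shape, or an
    ID that was never offered, so the caller can fall back or retry.
    """
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    chosen = payload.get("candidate_id")  # assumed key name
    if isinstance(chosen, str) and chosen in candidate_ids:
        return chosen
    return None


offered = {"bm-1", "bm-2", "bm-3"}
accepted = resolve_candidate('{"candidate_id": "bm-2"}', offered)
rejected = resolve_candidate('{"candidate_id": "bm-9"}', offered)
```

Constraining the answer to known IDs means a hallucinated bookmark target can never reach the deterministic writer.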
The intended semantic transport is Gemini directly. Where the chat-completions compatibility endpoint is still used, it should point at Google rather than a proxy.
A document is release-ready only when all three are true:
- veraPDF says compliant
- fidelity says faithful enough
- the run ends complete, not manual_remediation
Optional visible review items do not block release. Hidden structural blockers still do.
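The release rule above reduces to a three-way conjunction. The following sketch uses hypothetical names (`RunResult`, `is_release_ready`) and assumes that hidden structural blockers surface as a `manual_remediation` final status rather than as a separate flag.

```python
from dataclasses import dataclass


@dataclass
class RunResult:  # hypothetical summary of one pipeline run
    verapdf_compliant: bool
    fidelity_ok: bool
    final_status: str  # "complete" or "manual_remediation"
    optional_review_items: int = 0  # visible items; never blocks release


def is_release_ready(result: RunResult) -> bool:
    """All three conditions must hold; optional review items are ignored."""
    return (
        result.verapdf_compliant
        and result.fidelity_ok
        and result.final_status == "complete"
    )


ready = is_release_ready(
    RunResult(True, True, "complete", optional_review_items=4)
)
blocked = is_release_ready(RunResult(True, True, "manual_remediation"))
```

Note that `optional_review_items` is deliberately absent from the predicate, which mirrors the statement that visible review items do not block release.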
- complex tables still require stronger extraction or manual remediation in some cases
- visual WCAG issues such as contrast are not yet a first-class audit layer
- math support is conservative formula tagging plus speakable formula text, not rich equation semantics
- rich media remains partial
- semantic adjudication still depends on good local page/crop evidence