DocStruct reconstructs structured document outputs from PDF inputs with a dual-source strategy:
- Parser source: PDF-native text extraction (`pdftotext`)
- OCR source: rendered-page recognition (`tesseract` via a Python bridge)
Primary goals:
- maximize text/block recovery on mixed-content PDFs
- retain provenance and confidence for downstream debugging
- support multilingual workflows, with Korean-specific normalization paths
Operational constraints:
- external runtime dependencies (`poppler-utils`, `tesseract`, Python)
- different error characteristics between the parser and OCR tracks
- per-page processing with deterministic export formats
- `pipeline::build_document` opens the input PDF and iterates over its pages.
- Per page:
  - render the page image (`ocr::renderer::PageRenderer`)
  - run the parser track (`parser::layout_builder::ParserLayoutBuilder`)
  - run the OCR track (`ocr::layout_builder::OcrLayoutBuilder`)
  - run fusion (`fusion::SimpleFusionEngine`)
- Attach debug hypotheses (`parser_blocks`, `ocr_blocks`) to the final page model.
- Export the final document to JSON/Markdown/Text/Debug HTML.
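The per-page loop above can be sketched as follows. The types and function signatures here are simplified stand-ins, not the real `pipeline` API; only the module flow mirrors the steps listed.

```rust
// Minimal sketch of the per-page pipeline loop. Types are simplified
// stand-ins; the real signatures in pipeline/ may differ.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    text: String,
    source: &'static str, // provenance: "parser" or "ocr"
}

struct PageFinal {
    blocks: Vec<Block>,        // fused result
    parser_blocks: Vec<Block>, // debug hypothesis
    ocr_blocks: Vec<Block>,    // debug hypothesis
}

fn parser_track(page: usize) -> Vec<Block> {
    vec![Block { text: format!("parser p{page}"), source: "parser" }]
}

fn ocr_track(page: usize) -> Vec<Block> {
    vec![Block { text: format!("ocr p{page}"), source: "ocr" }]
}

fn fuse(parser: &[Block], ocr: &[Block]) -> Vec<Block> {
    // Trivial placeholder fusion: prefer parser output, fall back to OCR.
    if !parser.is_empty() { parser.to_vec() } else { ocr.to_vec() }
}

fn build_document(page_count: usize) -> Vec<PageFinal> {
    (0..page_count)
        .map(|p| {
            let parser_blocks = parser_track(p);
            let ocr_blocks = ocr_track(p);
            let blocks = fuse(&parser_blocks, &ocr_blocks);
            // Debug hypotheses are attached alongside the fused result.
            PageFinal { blocks, parser_blocks, ocr_blocks }
        })
        .collect()
}
```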
Responsibilities:
- geometry primitives (`BBox`): intersection/union/area/center distance
- confidence scoring (`confidence.rs`)
- domain model (`model.rs`): `DocumentFinal`, `PageFinal`, `PageHypothesis`
- `Block` variants (`TextBlock`, `TableBlock`, `FigureBlock`, `MathBlock`)
- provenance enum (`Parser`, `Ocr`, `Fused`)
- page-level classification heuristics (`page_classifier.rs`)
Design note:
- model types are serialization-ready (`serde`) and shared by all layers.
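The geometry primitives can be sketched as below. Field names and exact method set are illustrative assumptions; the real `BBox` in `core/` may differ.

```rust
// Sketch of the BBox geometry primitives: area, intersection, IoU, and
// center distance. Illustrative field names, not the real core/ types.
#[derive(Clone, Copy, Debug)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    fn intersection(&self, o: &BBox) -> BBox {
        BBox {
            x0: self.x0.max(o.x0),
            y0: self.y0.max(o.y0),
            x1: self.x1.min(o.x1),
            y1: self.y1.min(o.y1),
        }
    }

    /// Intersection-over-union, used for block alignment.
    fn iou(&self, o: &BBox) -> f64 {
        let inter = self.intersection(o).area();
        let union = self.area() + o.area() - inter;
        if union > 0.0 { inter / union } else { 0.0 }
    }

    /// Euclidean distance between box centers.
    fn center_distance(&self, o: &BBox) -> f64 {
        let (cx1, cy1) = ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0);
        let (cx2, cy2) = ((o.x0 + o.x1) / 2.0, (o.y0 + o.y1) / 2.0);
        ((cx1 - cx2).powi(2) + (cy1 - cy2).powi(2)).sqrt()
    }
}
```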
Responsibilities:
- read PDF metadata and page count (`pdf_reader.rs`)
- extract parser text (`text_extractor.rs`)
- Korean normalization (`hangul.rs`): combine decomposed jamo into syllables
- parser hypothesis construction (`layout_builder.rs`)
Current behavior:
- parser text is coarse-grained at block level in many cases
- quality gates suppress severely degraded Korean parser outputs
Failure profile:
- parser can miss rendered-only text/figures
- parser may emit decomposed/noisy Unicode depending on PDF internals
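The jamo-to-syllable composition mentioned for `hangul.rs` follows the standard Unicode Hangul arithmetic: a leading consonant (U+1100..U+1112), a vowel (U+1161..U+1175), and an optional trailing consonant (U+11A8..U+11C2) map to one precomposed syllable in the U+AC00 block. A minimal sketch of that mapping (the real normalization path likely handles longer runs and mixed input):

```rust
// Sketch of decomposed-jamo composition, the core idea behind hangul.rs.
// Returns None for characters outside the conjoining-jamo ranges.
fn compose_jamo(l: char, v: char, t: Option<char>) -> Option<char> {
    let l_idx = (l as u32).checked_sub(0x1100)?; // leading consonant index
    let v_idx = (v as u32).checked_sub(0x1161)?; // vowel index
    if l_idx > 18 || v_idx > 20 {
        return None;
    }
    let t_idx = match t {
        None => 0, // no trailing consonant
        Some(c) => {
            let i = (c as u32).checked_sub(0x11A7)?; // trailing index, 1..=27
            if i == 0 || i > 27 {
                return None;
            }
            i
        }
    };
    // Standard Unicode composition formula for Hangul syllables.
    char::from_u32(0xAC00 + (l_idx * 21 + v_idx) * 28 + t_idx)
}
```

For example, U+1112 + U+1161 + U+11AB composes to the single syllable 한 (U+D55C).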
Responsibilities:
- render PDF page images (`renderer.rs`)
- call the Python OCR bridge (`bridge.rs`)
- map OCR tokens into the Rust block model (`layout_builder.rs`)
Python bridge (`ocr/bridge/ocr_bridge.py`) pipeline:
- image preprocessing and block detection (OpenCV morphology)
- block OCR (
pytesseract) with language config - block type classification (
text/table/figure/math) - post-processing:
- Hangul normalization
- CJK/Hanja noise suppression
- token deduplication
- adjacent text-block merge
- short Korean split-ending fixes
- optional fallback full-page OCR if recall is too low
Failure profile:
- OCR can hallucinate symbols/characters under dense math or low contrast
- segmentation may fragment lines or over-merge neighboring content
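The token-to-block mapping in `layout_builder.rs` can be sketched as below. The field names of the token struct are illustrative assumptions about the bridge's JSON contract, and the real code would deserialize them with `serde` rather than constructing them by hand; the confidence threshold is also an assumption.

```rust
// Sketch of mapping bridge tokens into the Rust block model.
// Field names mirror a plausible JSON token contract, not the real one.
struct OcrToken {
    text: String,
    conf: f32,    // OCR confidence in [0, 1]
    kind: String, // "text", "table", "figure", "math"
}

#[derive(Debug, PartialEq)]
enum Block {
    Text { text: String, conf: f32 },
    Figure { conf: f32 },
}

fn map_tokens(tokens: Vec<OcrToken>, min_conf: f32) -> Vec<Block> {
    tokens
        .into_iter()
        .filter(|t| t.conf >= min_conf) // drop low-confidence noise early
        .map(|t| match t.kind.as_str() {
            "figure" => Block::Figure { conf: t.conf },
            _ => Block::Text { text: t.text, conf: t.conf },
        })
        .collect()
}
```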
Submodules:
- `align.rs`: geometric matching between parser and OCR blocks
- `compare.rs`: text similarity scoring
- `resolve.rs`: conflict resolution and filtering
- `finalize.rs`: page class decision (digital/scanned/hybrid)
Fusion process:
- Align parser/OCR blocks by IoU and center distance.
- Resolve matched pairs:
  - compare text similarity
  - choose parser/OCR/fused lines based on page class and quality heuristics
- Promote unmatched blocks with source-aware confidence.
- Apply filters:
  - remove degraded parser Korean when OCR is clearly better
  - remove redundant OCR text on parser-dominant pages
  - remove low-quality OCR noise
  - suppress Korean OCR text when parser Korean is reliable (accuracy-first)
Design intent:
- parser is generally trusted for clean digital text
- OCR is used for coverage gaps and scanned content
- provenance is preserved for auditability
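The compare/resolve step for a matched pair can be sketched with a character-bigram Jaccard similarity. The metric choice and the 0.8 threshold are illustrative assumptions, not the actual logic in `compare.rs`/`resolve.rs`:

```rust
// Sketch of the compare/resolve step: score text similarity between a
// matched parser/OCR pair, then apply a simple selection rule.
use std::collections::HashSet;

fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Jaccard similarity over character bigrams, in [0, 1].
fn similarity(a: &str, b: &str) -> f64 {
    let (x, y) = (bigrams(a), bigrams(b));
    if x.is_empty() && y.is_empty() {
        return 1.0;
    }
    let inter = x.intersection(&y).count() as f64;
    let union = x.union(&y).count() as f64;
    inter / union
}

fn resolve(parser: &str, ocr: &str) -> &'static str {
    // Near-identical text: trust the parser (clean digital text).
    // Otherwise fall through to a fused line. Threshold is illustrative.
    if similarity(parser, ocr) >= 0.8 { "parser" } else { "fused" }
}
```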
Exporters:
- `json_export.rs`
- `markdown_export.rs`
- `text_export.rs`
- `html_debug_export.rs`
Debug HTML includes per-block metadata:
- block type
- provenance
- confidence
- parser/ocr/final text and similarity (when available)
CLI entry points (src/main.rs):
- `convert`: one PDF
- `batch`: multiple PDFs
- `info`: PDF metadata only
Primary runtime path:
`convert` -> `pipeline::build_document` -> `pipeline::export_document`
Inputs:
- parser glyph count
- OCR glyph count
- OCR density and coverage proxy
Output:
`Digital`, `Scanned`, or `Hybrid`
This class controls fusion aggressiveness and source preference.
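A decision over those inputs can be sketched as below. The thresholds and the exact rule order are illustrative assumptions, not the values in `page_classifier.rs`:

```rust
// Sketch of the page-class decision from parser/OCR glyph counts and an
// OCR coverage proxy. Thresholds are illustrative assumptions.
#[derive(Debug, PartialEq)]
enum PageClass {
    Digital,
    Scanned,
    Hybrid,
}

fn classify(parser_glyphs: usize, ocr_glyphs: usize, ocr_coverage: f64) -> PageClass {
    if parser_glyphs == 0 && ocr_glyphs > 0 {
        PageClass::Scanned // no extractable text layer at all
    } else if ocr_coverage > 0.3 && ocr_glyphs > parser_glyphs {
        PageClass::Hybrid // substantial rendered-only content
    } else {
        PageClass::Digital // parser text dominates
    }
}
```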
Implemented across parser/OCR/fusion:
- Hangul composition normalization
- decomposed-jamo degradation scoring
- Hanja/CJK removal in OCR normalization path
- strict OCR Korean suppression when parser Korean is reliable
Tradeoff:
- this can reduce OCR-only Korean recall in ambiguous regions
- but significantly improves character-level precision on noisy pages
When parser text is strong and broad:
- duplicated OCR snippets are removed by similarity + overlap checks
Goal:
- prevent repeated content in final exports
- keep non-overlapping OCR additions when they contain distinct content
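The overlap-plus-similarity check can be sketched as below. The 0.7 overlap threshold and the use of simple substring containment as the text check are illustrative assumptions; the real filter uses the similarity scoring described in the fusion section.

```rust
// Sketch of the OCR-deduplication filter: drop an OCR snippet when a
// parser block covers most of its area and already contains its text.
#[derive(Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

/// Fraction of the OCR box's area covered by the parser box.
fn overlap_fraction(ocr: BBox, parser: BBox) -> f64 {
    let w = (ocr.x1.min(parser.x1) - ocr.x0.max(parser.x0)).max(0.0);
    let h = (ocr.y1.min(parser.y1) - ocr.y0.max(parser.y0)).max(0.0);
    let area = (ocr.x1 - ocr.x0) * (ocr.y1 - ocr.y0);
    if area > 0.0 { (w * h) / area } else { 0.0 }
}

fn is_redundant(ocr_text: &str, ocr_box: BBox, parser_text: &str, parser_box: BBox) -> bool {
    // Threshold and containment check are illustrative, not the real rule.
    overlap_fraction(ocr_box, parser_box) > 0.7 && parser_text.contains(ocr_text)
}
```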
Current tests cover:
- geometry and similarity primitives
- Hangul normalization behaviors
- fusion filtering/selection cases
- parser/OCR integration smoke tests with fixtures
Recommended additions:
- fixture-based regression tests for per-language precision/recall
- page-class regression checks (digital/scanned/hybrid)
- OCR bridge snapshot tests for post-processing outputs
- Formula-heavy regions can still produce OCR symbol noise.
- Coverage metric is heuristic, not semantic segmentation.
- Parser track may collapse layout details for some PDFs.
- OCR fallback can still introduce short low-confidence fragments on edge cases.
Common extension paths:
- replace OCR bridge model stack while keeping JSON token contract
- add stronger table/math-specific structural modeling
- add confidence calibration per block type/source
- add benchmark harness for fixture-level metric tracking
```
src/
  core/
  parser/
  ocr/
  fusion/
  export/
ocr/bridge/
  ocr_bridge.py
tests/
  fixtures/
docs/
  ARCHITECTURE.md
```