DocStruct Architecture

1. Goals and Constraints

DocStruct reconstructs structured document outputs from PDF inputs with a dual-source strategy:

Parser source: PDF-native text extraction (pdftotext)
OCR source: rendered-page recognition (tesseract via Python bridge)

Primary goals:

maximize text/block recovery on mixed-content PDFs
retain provenance and confidence for downstream debugging
support multilingual workflows, with Korean-specific normalization paths

Operational constraints:

external runtime dependencies (poppler-utils, tesseract, Python)
different error characteristics between parser and OCR tracks
per-page processing with deterministic export formats

2. Top-Level Data Flow

pipeline::build_document opens the input PDF and iterates over its pages.
Per page:
- render page image (ocr::renderer::PageRenderer)
- run parser track (parser::layout_builder::ParserLayoutBuilder)
- run OCR track (ocr::layout_builder::OcrLayoutBuilder)
- run fusion (fusion::SimpleFusionEngine)
Attach debug hypotheses (parser_blocks, ocr_blocks) to the final page model.
Export the final document to JSON, Markdown, plain text, or debug HTML.
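
The per-page flow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DocStruct code: `Track`, `FixedTrack`, `Source`, `Block`, `Page`, and `fuse` are hypothetical stand-ins for the layout builders and the fusion engine, rendering is omitted, and the fusion rule is a deliberately naive placeholder.

```rust
/// Which track produced a block (hypothetical; the real provenance
/// enum also has a fused variant).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Source { Parser, Ocr }

#[derive(Debug, Clone, PartialEq)]
struct Block { text: String, source: Source }

/// Final page plus the per-track hypotheses kept for debugging.
#[derive(Debug)]
struct Page {
    index: usize,
    blocks: Vec<Block>,
    parser_blocks: Vec<Block>,
    ocr_blocks: Vec<Block>,
}

/// Hypothetical abstraction over the parser and OCR tracks.
trait Track {
    fn blocks_for_page(&self, page_index: usize) -> Vec<Block>;
}

/// Test double that returns the same blocks for every page.
struct FixedTrack(Vec<Block>);

impl Track for FixedTrack {
    fn blocks_for_page(&self, _page_index: usize) -> Vec<Block> {
        self.0.clone()
    }
}

/// Placeholder fusion: keep parser blocks, then add OCR blocks whose
/// text the parser did not already produce.
fn fuse(parser: &[Block], ocr: &[Block]) -> Vec<Block> {
    let mut out = parser.to_vec();
    for b in ocr {
        if !parser.iter().any(|p| p.text == b.text) {
            out.push(b.clone());
        }
    }
    out
}

/// Run both tracks per page, fuse them, and keep the hypotheses.
fn build_document(page_count: usize, parser: &dyn Track, ocr: &dyn Track) -> Vec<Page> {
    (0..page_count)
        .map(|index| {
            let parser_blocks = parser.blocks_for_page(index);
            let ocr_blocks = ocr.blocks_for_page(index);
            let blocks = fuse(&parser_blocks, &ocr_blocks);
            Page { index, blocks, parser_blocks, ocr_blocks }
        })
        .collect()
}
```

Keeping both hypotheses on the page is what lets the debug exporter show parser, OCR, and final text side by side.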

3. Module Breakdown

3.1 `src/core`

Responsibilities:

geometry primitives (BBox): intersection/union/area/center distance
confidence scoring (confidence.rs)
domain model (model.rs):
- DocumentFinal, PageFinal, PageHypothesis
- Block variants (TextBlock, TableBlock, FigureBlock, MathBlock)
- provenance enum (Parser, Ocr, Fused)
page-level classification heuristics (page_classifier.rs)
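
The geometry primitives listed above could look roughly like this. The field names, the `f32` coordinate type, and the `iou` helper are assumptions for illustration; only the operation names come from the list.

```rust
/// Axis-aligned bounding box in page coordinates (assumed layout).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox { x0: f32, y0: f32, x1: f32, y1: f32 }

impl BBox {
    /// Area, clamped to zero for degenerate boxes.
    fn area(&self) -> f32 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    /// Overlapping region, or None when the boxes do not overlap.
    fn intersection(&self, other: &BBox) -> Option<BBox> {
        let b = BBox {
            x0: self.x0.max(other.x0),
            y0: self.y0.max(other.y0),
            x1: self.x1.min(other.x1),
            y1: self.y1.min(other.y1),
        };
        if b.x0 < b.x1 && b.y0 < b.y1 { Some(b) } else { None }
    }

    /// Smallest box enclosing both inputs.
    fn union(&self, other: &BBox) -> BBox {
        BBox {
            x0: self.x0.min(other.x0),
            y0: self.y0.min(other.y0),
            x1: self.x1.max(other.x1),
            y1: self.y1.max(other.y1),
        }
    }

    fn center(&self) -> (f32, f32) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }

    /// Euclidean distance between box centers.
    fn center_distance(&self, other: &BBox) -> f32 {
        let (ax, ay) = self.center();
        let (bx, by) = other.center();
        ((ax - bx).powi(2) + (ay - by).powi(2)).sqrt()
    }

    /// Intersection area divided by the area covered by either box.
    fn iou(&self, other: &BBox) -> f32 {
        let inter = self.intersection(other).map_or(0.0, |b| b.area());
        let covered = self.area() + other.area() - inter;
        if covered > 0.0 { inter / covered } else { 0.0 }
    }
}
```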

Design note:

model types are serialization-ready (serde) and used by all layers.
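
As a rough illustration of how the model types named above might fit together, the sketch below uses the variant and enum names from the list, but every field is a guess. The serde derives are left out so the snippet builds without external crates; the real types presumably derive `Serialize` and `Deserialize`.

```rust
/// Where the final content of a block came from.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provenance { Parser, Ocr, Fused }

/// Block variants named in the text; the fields are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum Block {
    TextBlock { text: String, confidence: f32, provenance: Provenance },
    TableBlock { rows: Vec<Vec<String>>, confidence: f32, provenance: Provenance },
    FigureBlock { caption: Option<String>, confidence: f32, provenance: Provenance },
    MathBlock { source_text: String, confidence: f32, provenance: Provenance },
}

impl Block {
    /// Provenance is available regardless of the block variant.
    fn provenance(&self) -> Provenance {
        match self {
            Block::TextBlock { provenance, .. }
            | Block::TableBlock { provenance, .. }
            | Block::FigureBlock { provenance, .. }
            | Block::MathBlock { provenance, .. } => *provenance,
        }
    }
}

/// Final page model (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
struct PageFinal { blocks: Vec<Block> }

/// Count blocks by source, e.g. for an audit summary.
fn count_by_provenance(page: &PageFinal, wanted: Provenance) -> usize {
    page.blocks.iter().filter(|b| b.provenance() == wanted).count()
}
```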

3.2 `src/parser`

Responsibilities:

read PDF metadata/page count (pdf_reader.rs)
extract parser text (text_extractor.rs)
Korean normalization (hangul.rs): combine decomposed jamo into syllables
parser hypothesis construction (layout_builder.rs)
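
Combining decomposed jamo into syllables follows the standard Unicode Hangul composition arithmetic. The sketch below is an assumption about what hangul.rs does, not a copy of it: `compose_jamo` is a hypothetical name, and it handles only conjoining jamo (U+1100 block) in leading + vowel (+ trailing) order, not compatibility jamo or a precomposed syllable followed by a trailing consonant.

```rust
const S_BASE: u32 = 0xAC00; // first precomposed syllable
const L_BASE: u32 = 0x1100; // first leading consonant
const V_BASE: u32 = 0x1161; // first vowel
const T_BASE: u32 = 0x11A7; // one below the first trailing consonant
const L_COUNT: u32 = 19;
const V_COUNT: u32 = 21;
const T_COUNT: u32 = 28; // includes index 0 = "no trailing consonant"

/// Compose L+V and L+V+T conjoining jamo runs into Hangul syllables.
/// Everything else is copied through unchanged.
fn compose_jamo(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        let l = (c as u32).wrapping_sub(L_BASE);
        if l < L_COUNT {
            if let Some(&v_char) = chars.peek() {
                let v = (v_char as u32).wrapping_sub(V_BASE);
                if v < V_COUNT {
                    chars.next(); // consume the vowel
                    let mut t = 0;
                    if let Some(&t_char) = chars.peek() {
                        let ti = (t_char as u32).wrapping_sub(T_BASE);
                        if ti > 0 && ti < T_COUNT {
                            t = ti;
                            chars.next(); // consume the trailing consonant
                        }
                    }
                    let s = S_BASE + (l * V_COUNT + v) * T_COUNT + t;
                    out.push(char::from_u32(s).unwrap_or(c));
                    continue;
                }
            }
        }
        out.push(c);
    }
    out
}
```

A leading consonant with no following vowel is left as-is, so isolated jamo remain visible to any later degradation scoring.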

Current behavior:

parser text is coarse-grained at block level in many cases
quality gates suppress severely degraded Korean parser outputs

Failure profile:

parser can miss rendered-only text/figures
parser may emit decomposed/noisy Unicode depending on PDF internals

3.3 `src/ocr` + `ocr/bridge`

Responsibilities:

render PDF page images (renderer.rs)
call Python OCR bridge (bridge.rs)
map OCR tokens to Rust block model (layout_builder.rs)

Python bridge (ocr/bridge/ocr_bridge.py) pipeline:

image preprocessing and block detection (OpenCV morphology)
block OCR (pytesseract) with language config
block type classification (text/table/figure/math)
post-processing:
- Hangul normalization
- CJK/Hanja noise suppression
- token deduplication
- adjacent text-block merge
- short Korean split-ending fixes
optional full-page OCR fallback when block-level recall is too low
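
To make the token deduplication step concrete, here is one way it could work. The bridge itself is Python; this sketch is written in Rust only for consistency with the rest of the codebase. The `Token` fields, the overlap measure, and the `min_overlap` threshold are all assumptions.

```rust
/// OCR token with a pixel-space box and a recognizer confidence.
#[derive(Debug, Clone, PartialEq)]
struct Token { text: String, x: f32, y: f32, w: f32, h: f32, conf: f32 }

/// Intersection area divided by the smaller token's area.
fn overlap_ratio(a: &Token, b: &Token) -> f32 {
    let ix = (a.x + a.w).min(b.x + b.w) - a.x.max(b.x);
    let iy = (a.y + a.h).min(b.y + b.h) - a.y.max(b.y);
    if ix <= 0.0 || iy <= 0.0 {
        return 0.0;
    }
    let smaller = (a.w * a.h).min(b.w * b.h);
    if smaller > 0.0 { (ix * iy) / smaller } else { 0.0 }
}

/// Drop tokens that repeat the same text in (nearly) the same place,
/// keeping the higher-confidence copy.
fn dedup_tokens(tokens: &[Token], min_overlap: f32) -> Vec<Token> {
    let mut kept: Vec<Token> = Vec::new();
    for t in tokens {
        let dup = kept
            .iter()
            .position(|k| k.text == t.text && overlap_ratio(k, t) >= min_overlap);
        match dup {
            Some(i) => {
                if t.conf > kept[i].conf {
                    kept[i] = t.clone();
                }
            }
            None => kept.push(t.clone()),
        }
    }
    kept
}
```

Requiring both equal text and spatial overlap keeps legitimately repeated words, such as the same term on two different lines.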

Failure profile:

OCR can hallucinate symbols/characters under dense math or low contrast
segmentation may fragment lines or over-merge neighboring content

3.4 `src/fusion`

Submodules:

align.rs: geometric matching between parser and OCR blocks
compare.rs: text similarity scoring
resolve.rs: conflict resolution and filtering
finalize.rs: page class decision (digital/scanned/hybrid)

Fusion process:

Align parser/OCR blocks by IoU/center distance.
Resolve matched pairs:
- compare text similarity
- choose parser, OCR, or fused lines based on page class and quality heuristics
Promote unmatched blocks with source-aware confidence.
Apply filters:
- remove degraded parser Korean when OCR is clearly better
- remove redundant OCR text under parser-dominant pages
- remove low-quality OCR noise
- suppress Korean OCR text when parser Korean is reliable (accuracy-first)
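
The alignment step can be illustrated with a greedy best-IoU matcher. This is a sketch under assumptions: align.rs also uses center distance, which is omitted here, and the `Alignment` struct, the `min_iou` threshold, and the greedy strategy are illustrative choices rather than the documented algorithm.

```rust
#[derive(Debug, Clone, Copy)]
struct BBox { x0: f32, y0: f32, x1: f32, y1: f32 }

/// Intersection over union of two boxes, 0.0 when they do not overlap.
fn iou(a: &BBox, b: &BBox) -> f32 {
    let w = a.x1.min(b.x1) - a.x0.max(b.x0);
    let h = a.y1.min(b.y1) - a.y0.max(b.y0);
    if w <= 0.0 || h <= 0.0 {
        return 0.0;
    }
    let inter = w * h;
    let area = |r: &BBox| (r.x1 - r.x0) * (r.y1 - r.y0);
    inter / (area(a) + area(b) - inter)
}

/// Index pairs that matched, plus the leftovers from each track.
#[derive(Debug)]
struct Alignment {
    matched: Vec<(usize, usize)>, // (parser index, OCR index)
    parser_only: Vec<usize>,
    ocr_only: Vec<usize>,
}

/// Greedy one-to-one matching: take the highest-IoU pair first.
fn align(parser: &[BBox], ocr: &[BBox], min_iou: f32) -> Alignment {
    let mut candidates: Vec<(f32, usize, usize)> = Vec::new();
    for (i, p) in parser.iter().enumerate() {
        for (j, o) in ocr.iter().enumerate() {
            let score = iou(p, o);
            if score >= min_iou {
                candidates.push((score, i, j));
            }
        }
    }
    candidates.sort_by(|a, b| b.0.total_cmp(&a.0)); // best score first

    let mut used_p = vec![false; parser.len()];
    let mut used_o = vec![false; ocr.len()];
    let mut matched = Vec::new();
    for (_, i, j) in candidates {
        if !used_p[i] && !used_o[j] {
            used_p[i] = true;
            used_o[j] = true;
            matched.push((i, j));
        }
    }
    let parser_only: Vec<usize> = (0..parser.len()).filter(|&i| !used_p[i]).collect();
    let ocr_only: Vec<usize> = (0..ocr.len()).filter(|&j| !used_o[j]).collect();
    Alignment { matched, parser_only, ocr_only }
}
```

The `matched` pairs would feed the resolve step, while `parser_only` and `ocr_only` correspond to the unmatched blocks that get promoted with source-aware confidence.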

Design intent:

parser is generally trusted for clean digital text
OCR is used for coverage gaps and scanned content
provenance is preserved for auditability

3.5 `src/export`

Exporters:

json_export.rs
markdown_export.rs
text_export.rs
html_debug_export.rs

Debug HTML includes per-block metadata:

block type
provenance
confidence
parser/ocr/final text and similarity (when available)
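
A per-block debug row could be rendered along these lines. The `BlockDebug` struct, the table layout, and the "n/a" placeholder are assumptions for illustration; the point is that every text field is HTML-escaped and optional fields degrade gracefully.

```rust
/// Escape the characters that are unsafe inside HTML text and attributes.
fn escape_html(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Metadata shown for one block (hypothetical shape).
struct BlockDebug<'a> {
    block_type: &'a str,
    provenance: &'a str,
    confidence: f32,
    parser_text: Option<&'a str>,
    ocr_text: Option<&'a str>,
    final_text: &'a str,
    similarity: Option<f32>,
}

/// Escaped text, or a placeholder when a track produced nothing.
fn text_or_na(t: Option<&str>) -> String {
    t.map(escape_html).unwrap_or_else(|| "n/a".to_string())
}

/// Render one table row of the debug view.
fn render_row(b: &BlockDebug<'_>) -> String {
    let similarity = b
        .similarity
        .map(|s| format!("{s:.2}"))
        .unwrap_or_else(|| "n/a".to_string());
    format!(
        "<tr><td>{}</td><td>{}</td><td>{:.2}</td><td>{}</td><td>{}</td><td>{}</td><td>{}</td></tr>",
        escape_html(b.block_type),
        escape_html(b.provenance),
        b.confidence,
        text_or_na(b.parser_text),
        text_or_na(b.ocr_text),
        escape_html(b.final_text),
        similarity,
    )
}
```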

4. Runtime Entry Points

CLI entry points (src/main.rs):

convert: one PDF
batch: multiple PDFs
info: PDF metadata only

Primary runtime path:

convert -> pipeline::build_document -> pipeline::export_document
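
A dependency-free sketch of the subcommand dispatch is shown below. Only the three subcommand names come from the list above; the argument shapes (one path for convert and info, one or more for batch) and the absence of flags are assumptions, and the real main.rs may well use an argument-parsing crate instead.

```rust
/// Parsed CLI command (hypothetical shape).
#[derive(Debug, PartialEq)]
enum Command {
    Convert { input: String },
    Batch { inputs: Vec<String> },
    Info { input: String },
}

/// Parse the arguments that follow the program name.
fn parse_command(args: &[&str]) -> Result<Command, String> {
    match args {
        ["convert", input] => Ok(Command::Convert { input: input.to_string() }),
        ["batch", inputs @ ..] if !inputs.is_empty() => Ok(Command::Batch {
            inputs: inputs.iter().map(|s| s.to_string()).collect(),
        }),
        ["info", input] => Ok(Command::Info { input: input.to_string() }),
        [] => Err("missing subcommand".to_string()),
        [other, ..] => Err(format!("unknown or malformed subcommand: {other}")),
    }
}
```

In a real binary the slice would come from `std::env::args().skip(1)`, collected and borrowed as `&str` values.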

5. Key Heuristics

5.1 Page Classification

Inputs:

parser glyph count
OCR glyph count
OCR density and coverage proxy

Output:

Digital, Scanned, or Hybrid

This class controls fusion aggressiveness and source preference.
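
One plausible shape for such a classifier is sketched below. The `PageSignals` struct, the thresholds, and the decision order are all hypothetical; page_classifier.rs may weigh these inputs differently.

```rust
#[derive(Debug, PartialEq)]
enum PageClass { Digital, Scanned, Hybrid }

/// Inputs named in the text, with assumed types.
struct PageSignals {
    parser_glyphs: usize,
    ocr_glyphs: usize,
    ocr_coverage: f32, // fraction of the page area covered by OCR blocks
}

fn classify(s: &PageSignals) -> PageClass {
    // Illustrative thresholds, not the project's tuned values.
    const MIN_GLYPHS: usize = 20;
    const MIN_COVERAGE: f32 = 0.05;
    const PARSER_DOMINANCE: f32 = 0.8;

    let parser_ok = s.parser_glyphs >= MIN_GLYPHS;
    let ocr_ok = s.ocr_glyphs >= MIN_GLYPHS && s.ocr_coverage >= MIN_COVERAGE;

    match (parser_ok, ocr_ok) {
        // Parser sees text and OCR adds little: a clean digital page.
        (true, false) => PageClass::Digital,
        // Parser sees (almost) nothing: rely on OCR. Near-empty pages
        // also land here, which is an arbitrary default.
        (false, _) => PageClass::Scanned,
        // Both tracks see text: digital if the parser recovers most of
        // what OCR sees, otherwise part of the page is image-only.
        (true, true) => {
            let ratio = s.parser_glyphs as f32 / s.ocr_glyphs as f32;
            if ratio >= PARSER_DOMINANCE { PageClass::Digital } else { PageClass::Hybrid }
        }
    }
}
```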

5.2 Korean Accuracy Controls

Implemented across parser/OCR/fusion:

Hangul composition normalization
decomposed-jamo degradation scoring
Hanja/CJK removal in OCR normalization path
strict OCR Korean suppression when parser Korean is reliable

Tradeoff:

this can reduce OCR-only Korean recall in ambiguous regions
but significantly improves character-level precision on noisy pages
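
Decomposed-jamo degradation scoring can be approximated as the share of Hangul characters that are still bare jamo after normalization. The function names and the `max_degradation` threshold below are assumptions; the actual scoring may use additional signals.

```rust
/// Conjoining jamo (U+1100..U+11FF) or compatibility jamo (U+3130..U+318F).
fn is_jamo(c: char) -> bool {
    matches!(c as u32, 0x1100..=0x11FF | 0x3130..=0x318F)
}

/// Precomposed Hangul syllables (U+AC00..U+D7A3).
fn is_syllable(c: char) -> bool {
    matches!(c as u32, 0xAC00..=0xD7A3)
}

/// Fraction of Hangul characters that are uncomposed jamo, in [0, 1].
/// Returns 0.0 when the text contains no Hangul at all.
fn jamo_degradation(text: &str) -> f32 {
    let mut jamo = 0usize;
    let mut syllables = 0usize;
    for c in text.chars() {
        if is_jamo(c) {
            jamo += 1;
        } else if is_syllable(c) {
            syllables += 1;
        }
    }
    let total = jamo + syllables;
    if total == 0 { 0.0 } else { jamo as f32 / total as f32 }
}

/// Gate used to decide whether parser Korean can be trusted.
fn parser_korean_is_reliable(text: &str, max_degradation: f32) -> bool {
    jamo_degradation(text) <= max_degradation
}
```

A gate like this is what allows the fusion filters to suppress OCR Korean only when the parser text is clean, and to drop parser Korean when it is not.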

5.3 Redundancy Control

When parser text is high quality and covers most of the page:

duplicated OCR snippets are removed by similarity + overlap checks

Goal:

prevent repeated content in final exports
keep non-overlapping OCR additions when they contain distinct content
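
The similarity + overlap check can be sketched as below. The similarity measure used here (Dice coefficient over sets of character bigrams) is a stand-in chosen for brevity, not necessarily what compare.rs implements, and the `Block` fields and both thresholds are hypothetical.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
struct Block { text: String, x0: f32, y0: f32, x1: f32, y1: f32 }

/// Set of adjacent character pairs, ignoring whitespace.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().filter(|c| !c.is_whitespace()).collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Dice coefficient over character-bigram sets, in [0, 1].
fn similarity(a: &str, b: &str) -> f32 {
    let (x, y) = (bigrams(a), bigrams(b));
    if x.is_empty() || y.is_empty() {
        // Too short for bigrams: fall back to exact comparison.
        return if a.trim() == b.trim() { 1.0 } else { 0.0 };
    }
    let shared = x.intersection(&y).count();
    2.0 * shared as f32 / (x.len() + y.len()) as f32
}

/// Fraction of `inner`'s area that lies inside `outer`.
fn covered_fraction(inner: &Block, outer: &Block) -> f32 {
    let w = inner.x1.min(outer.x1) - inner.x0.max(outer.x0);
    let h = inner.y1.min(outer.y1) - inner.y0.max(outer.y0);
    let area = (inner.x1 - inner.x0) * (inner.y1 - inner.y0);
    if w <= 0.0 || h <= 0.0 || area <= 0.0 { 0.0 } else { w * h / area }
}

/// Drop OCR blocks that both overlap a parser block and repeat its text.
fn drop_redundant_ocr(
    parser: &[Block],
    ocr: Vec<Block>,
    min_similarity: f32,
    min_overlap: f32,
) -> Vec<Block> {
    ocr.into_iter()
        .filter(|o| {
            !parser.iter().any(|p| {
                covered_fraction(o, p) >= min_overlap
                    && similarity(&o.text, &p.text) >= min_similarity
            })
        })
        .collect()
}
```

Requiring both conditions is what keeps non-overlapping OCR additions: an OCR block with the same text elsewhere on the page, or different text in the same place, survives.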

6. Testing Strategy

Current tests cover:

geometry and similarity primitives
Hangul normalization behaviors
fusion filtering/selection cases
parser/OCR integration smoke tests with fixtures

Recommended additions:

fixture-based regression tests for per-language precision/recall
page-class regression checks (digital/scanned/hybrid)
OCR bridge snapshot tests for post-processing outputs
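
For the recommended precision/recall regression tests, a simple character-level metric might be enough to start with. The sketch below compares bags of characters and ignores order, so it is a coarse proxy rather than an edit-distance-based measure; the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Count non-whitespace characters.
fn char_counts(s: &str) -> HashMap<char, usize> {
    let mut counts = HashMap::new();
    for c in s.chars().filter(|c| !c.is_whitespace()) {
        *counts.entry(c).or_insert(0) += 1;
    }
    counts
}

/// Bag-of-characters (precision, recall) of `actual` against `expected`.
/// Empty sides are scored as 1.0 so that empty fixtures do not fail.
fn char_precision_recall(expected: &str, actual: &str) -> (f32, f32) {
    let e = char_counts(expected);
    let a = char_counts(actual);
    // A character matches as many times as it occurs on both sides.
    let matched: usize = a
        .iter()
        .map(|(c, n)| (*n).min(e.get(c).copied().unwrap_or(0)))
        .sum();
    let actual_total: usize = a.values().sum();
    let expected_total: usize = e.values().sum();
    let precision = if actual_total == 0 { 1.0 } else { matched as f32 / actual_total as f32 };
    let recall = if expected_total == 0 { 1.0 } else { matched as f32 / expected_total as f32 };
    (precision, recall)
}
```

A fixture test could then assert that both values stay above a per-language floor recorded alongside the fixture.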

7. Known Limitations

Formula-heavy regions can still produce OCR symbol noise.
The coverage metric is a heuristic proxy, not the result of semantic segmentation.
Parser track may collapse layout details for some PDFs.
OCR fallback can still introduce short low-confidence fragments on edge cases.

8. Extension Points

Common extension paths:

replace OCR bridge model stack while keeping JSON token contract
add stronger table/math-specific structural modeling
add confidence calibration per block type/source
add benchmark harness for fixture-level metric tracking

9. File Map

src/
  core/
  parser/
  ocr/
  fusion/
  export/
ocr/bridge/
  ocr_bridge.py
tests/
  fixtures/
docs/
  ARCHITECTURE.md
