Data Models

Every DTO is a Pydantic v2 model. Fully JSON-serialisable, validated on construction, shipping with py.typed so your editor gives you autocomplete and type errors.

For the canonical machine-readable spec, see docs/WORKBOOK_GRAPH_SPEC.md.

High-level map

Model	Role
`ParseResult`	Entry-point wrapper: workbook + chunks + serializer + verification.
`WorkbookDTO`	Root: sheets, tables, charts, named ranges, dependency graph, errors.
`SheetDTO`	Cells, merged regions, conditional formatting, data validation.
`CellDTO`	Value, formula, style, coordinates, annotations.
`TableDTO`	Excel ListObject: name, columns, range, style.
`ChartDTO`	Chart metadata, series, axis labels, chart type.
`BlockDTO`	Logical block (`HEADER` / `DATA` / `TABLE` / …) with bounding box + hash.
`ChunkDTO`	LLM chunk: HTML + text rendering, token count, source URI, content hash.
`DependencyGraph`	Directed graph of formula dependencies with traversal helpers.
`DependencyEdge`	One edge in the graph.
`CellCoord`	`(row, col)` pair with `.to_a1()` and `.from_a1()` helpers.
`CellRange`	Inclusive top-left → bottom-right pair.
`TableStructure`	Assembled table with header / data regions (post-segmentation).
`TreeNode`	Hierarchical node from tree building.
`TemplateNode`	Template node with degree-of-freedom annotations.
`ParseError`	Structured error / warning attached to a workbook, sheet, or cell.

`ParseResult`

class ParseResult:
    workbook: WorkbookDTO
    chunks: list[ChunkDTO]
    serializer: WorkbookSerializer

    def to_json(self) -> dict[str, Any]: ...

to_json() returns a nested dict, fully JSON-serialisable via json.dumps(result.to_json(), default=str).

`WorkbookDTO`

Key fields:

Field	Type	Description
`workbook_id`	`str`	Deterministic ID derived from `file_path` + `workbook_hash`.
`filename`	`str`	Display name.
`file_path`	`str \| None`	Original path if `parse_workbook(path=…)` was used.
`workbook_hash`	`str`	xxhash64 of the file bytes.
`sheets`	`list[SheetDTO]`	All sheets, in Excel tab order.
`tables`	`list[TableDTO]`	All Excel ListObjects across sheets.
`charts`	`list[ChartDTO]`	All charts across sheets.
`named_ranges`	`list[NamedRangeDTO]`	Workbook- and sheet-scoped.
`dependency_graph`	`DependencyGraph`	Global cell-level dep graph.
`kpi_catalog`	`list[KpiDTO]`	Workbook-level KPIs surfaced during annotation.
`errors`	`list[ParseError]`	Non-fatal warnings / errors.
`total_sheets` / `total_cells` / `total_formulas`	`int`	Aggregates.
`parse_duration_ms`	`float`	Wall time for the parse pipeline.

`CellDTO`

class CellDTO(BaseModel):
    address: str          # "A1"
    row: int
    col: int
    value: Any            # str | int | float | bool | datetime | None
    formula: str | None   # "=SUM(A1:A5)"
    data_type: str        # "n", "s", "f", "d", "b", "e"
    number_format: str | None
    font: FontDTO | None
    fill: FillDTO | None
    border: BorderDTO | None
    alignment: AlignmentDTO | None
    hyperlink: str | None
    comment: str | None
    is_merged_master: bool
    is_merged_slave: bool
    merge_master: CellCoord | None
    merge_extent: int | None       # row span
    merge_col_extent: int | None   # col span
    is_empty: bool

`ChunkDTO`

The object you'll spend most of your LLM-integration time with:

class ChunkDTO(BaseModel):
    chunk_id: str                  # stable, deterministic across runs
    chunk_index: int
    sheet_name: str
    block_type: str                # "HEADER", "DATA", "TABLE", "CHART_ANCHOR", ...
    top_left_cell: str             # "A1"
    bottom_right_cell: str         # "F18"
    cell_range: CellRange | None
    key_cells: list[str]           # highlighted refs the block is "about"
    named_ranges: list[str]        # named ranges touching the block
    dependency_summary: DependencySummary
    render_html: str               # proper colspan/rowspan
    render_text: str               # pipe-delimited LLM-friendly
    token_count: int               # via tiktoken
    source_uri: str                # "file.xlsx#Sheet!A1:F18"
    content_hash: str              # xxhash64 of render_text
    prev_chunk_id: str | None
    next_chunk_id: str | None
    metadata: dict[str, Any]

`DependencyGraph`

Relevant methods:

graph.get_upstream(sheet: str, coord: CellCoord, max_depth: int = 2) -> list[DependencyEdge]
graph.get_downstream(sheet: str, coord: CellCoord, max_depth: int = 2) -> list[DependencyEdge]
graph.detect_circular_refs() -> set[str]     # {"Sheet!A1", "Sheet!B3", ...}
graph.edges_out_of(sheet: str, coord: CellCoord) -> list[DependencyEdge]
graph.edges_into(sheet: str, coord: CellCoord) -> list[DependencyEdge]

Each DependencyEdge carries source_sheet, source_coord, target_sheet, target_coord, target_ref_string, and an edge_type enum (WITHIN_SHEET, CROSS_SHEET, TABLE_REF, EXTERNAL, NAMED_RANGE).

`ParseError`

Non-fatal diagnostics live on workbook.errors (workbook-level) and sheet.errors (sheet-level):

class ParseError(BaseModel):
    severity: Severity           # INFO | WARNING | ERROR
    stage: str                   # "load", "parse", "annotate", "segment", ...
    message: str
    sheet_name: str | None
    cell_address: str | None

Anything that would traditionally raise an exception becomes a structured record here, so a single pathological file can't crash a batch processing pipeline.

JSON shape

Full shape is recursive but roughly:

{
  "workbook": {
    "workbook_id": "...",
    "filename": "workbook.xlsx",
    "workbook_hash": "...",
    "sheets": [
      {
        "sheet_name": "Sheet1",
        "cells": { "A1": { /* CellDTO */ }, ... },
        "merged_regions": [ ... ],
        "conditional_format_rules": [ ... ],
        "data_validations": [ ... ],
        "properties": { ... }
      }
    ],
    "tables": [ ... ],
    "charts": [ ... ],
    "named_ranges": [ ... ],
    "dependency_edges": [ ... ],
    "kpi_catalog": [ ... ],
    "errors": [ ... ],
    "total_sheets": 3,
    "total_cells": 1240,
    "total_formulas": 87
  },
  "chunks": [ /* ChunkDTO[] */ ]
}

For the exact field-by-field breakdown including every optional field, see docs/WORKBOOK_GRAPH_SPEC.md.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Models

High-level map

`ParseResult`

`WorkbookDTO`

`CellDTO`

`ChunkDTO`

`DependencyGraph`

`ParseError`

JSON shape

FilesExpand file tree

Data-Models.md

Latest commit

History

Data-Models.md

File metadata and controls

Data Models

High-level map

ParseResult

WorkbookDTO

CellDTO

ChunkDTO

DependencyGraph

ParseError

JSON shape

`ParseResult`

`WorkbookDTO`

`CellDTO`

`ChunkDTO`

`DependencyGraph`

`ParseError`