| 1 | +# Token Metrics Design Note |
| 2 | + |
| 3 | +## Problem Statement |
| 4 | + |
| 5 | +AgentsView now surfaces token usage in both the session UI and the |
| 6 | +analytics summary, but legacy stored data did not preserve an important |
| 7 | +distinction: "the provider reported a token count of `0`" versus "the |
| 8 | +provider never reported this metric at all." |
| 9 | + |
| 10 | +Without explicit coverage flags, the product cannot tell whether a zero |
| 11 | +value should participate in token analytics, whether a session badge |
| 12 | +should show a real zero, or whether the UI should show a missing-data |
| 13 | +placeholder instead. This especially affects `Output Tokens`, because |
| 14 | +treating non-reporting sessions as zero-token sessions understates the |
| 15 | +uncertainty and overstates the completeness of analytics. |
| 16 | + |
| 17 | +## Goals |
| 18 | + |
| 19 | +- Persist token coverage explicitly at both message and session level. |
| 20 | +- Preserve the meaning of reported zero values. |
| 21 | +- Exclude non-reporting sessions from token aggregates instead of |
| 22 | + silently treating them as zero. |
| 23 | +- Repair legacy rows when existing stored signal is sufficient. |
| 24 | +- Force a full resync when parser behavior changed in ways that make |
| 25 | + existing stored token semantics stale. |
| 26 | +- Keep rollout non-destructive: preserve existing session history, |
| 27 | + orphaned sessions, and excluded-session metadata. |
| 28 | + |
| 29 | +## Non-Goals |
| 30 | + |
| 31 | +- Reconstruct token metrics that were never present in source data. |
| 32 | +- Infer missing token coverage from zeros alone. |
| 33 | +- Normalize all providers to emit identical token payload shapes. |
| 34 | +- Retroactively make old unrecoverable rows distinguishable. |
| 35 | +- Change unrelated analytics semantics outside token reporting. |
| 36 | +- Provide a per-row PostgreSQL provenance model or a first-class |
| 37 | + convergence status surface in v1. |
| 38 | + |
| 39 | +## User-Facing Semantics |
| 40 | + |
| 41 | +### Output Tokens |
| 42 | + |
| 43 | +`Output Tokens` in analytics is the sum of `sessions.total_output_tokens` |
| 44 | +only for sessions where `sessions.has_total_output_tokens = true`. |
| 45 | +Sessions that did not report output tokens do not contribute `0`; they |
| 46 | +are excluded from the sum entirely. |
| 47 | + |
| 48 | +This same exclusion rule applies to all `output_tokens` analytics: |
| 49 | + |
| 50 | +- the summary total only includes reporting sessions |
| 51 | +- the heatmap only accumulates output tokens from reporting sessions for |
| 52 | + each day |
| 53 | +- `Top Sessions` with `metric=output_tokens` ranks only reporting |
| 54 | + sessions, ordered by `total_output_tokens DESC, id ASC` |
| 55 | +- an empty state means "no reporting sessions in range", not "all |
| 56 | + sessions reported zero" |
| 57 | +- newer clients should not show `output_tokens` heatmap/top-session |
| 58 | + controls unless the server exposes token-summary capability |
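The exclusion rule above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the `Session` dataclass and the `output_token_summary` / `top_sessions_by_output` helpers are hypothetical names, and only the field names `total_output_tokens`, `has_total_output_tokens`, and `token_reporting_sessions` come from this note.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical in-memory stand-in for one row of the sessions table.
    id: str
    total_output_tokens: int
    has_total_output_tokens: bool


def output_token_summary(sessions):
    """Sum output tokens over reporting sessions only."""
    reporting = [s for s in sessions if s.has_total_output_tokens]
    return {
        "total_output_tokens": sum(s.total_output_tokens for s in reporting),
        "token_reporting_sessions": len(reporting),
    }


def top_sessions_by_output(sessions, limit=10):
    """Rank reporting sessions by total_output_tokens DESC, id ASC."""
    reporting = [s for s in sessions if s.has_total_output_tokens]
    return sorted(reporting, key=lambda s: (-s.total_output_tokens, s.id))[:limit]


sessions = [
    Session("a", 120, True),
    Session("b", 0, True),   # reported zero: counted as a reporting session
    Session("c", 0, False),  # never reported: excluded, not treated as zero
    Session("d", 120, True),
]
summary = output_token_summary(sessions)
top = top_sessions_by_output(sessions)
```

Session `b` contributes a real zero and still counts toward the reporting total, while session `c` is left out of both the sum and the ranking.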
| 59 | + |
| 60 | +### Reporting Sessions |
| 61 | + |
| 62 | +`Reporting Sessions` is the count of sessions included in token-output |
| 63 | +analytics. In practice, this is the number of sessions where output |
| 64 | +token coverage is known, not the number of sessions in the filtered |
| 65 | +result set overall. |
| 66 | + |
| 67 | +On newer clients talking to older servers, `Output Tokens` and |
| 68 | +`Reporting Sessions` may be absent entirely. In that case the UI should |
| 69 | +render `—` rather than silently falling back to `0`, because the server |
| 70 | +capability is unknown. |
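A client-side sketch of that fallback, assuming the summary arrives as a decoded JSON object and that an older server simply omits the key. The helper name `render_summary_value` is hypothetical.

```python
MISSING = "\u2014"  # the placeholder glyph used throughout this note


def render_summary_value(summary, field):
    """Return display text for a summary field, or the placeholder if absent."""
    value = summary.get(field)
    if value is None:
        # Field omitted: server capability is unknown, so do not show 0.
        return MISSING
    return str(value)


old_server = {"total_sessions": 12}
new_server = {"total_sessions": 12, "total_output_tokens": 0}
old_text = render_summary_value(old_server, "total_output_tokens")
new_text = render_summary_value(new_server, "total_output_tokens")
```

The check is against `None` rather than truthiness, so a reported `0` from a newer server still renders as `0`.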
| 71 | + |
| 72 | +### `—` Placeholders |
| 73 | + |
| 74 | +Session-level token badges use explicit placeholders when one side of |
| 75 | +the pair is missing: |
| 76 | + |
| 77 | +- `— ctx / 180 out` means output tokens were reported, context tokens |
| 78 | + were not. |
| 79 | +- `2.4k ctx / — out` means context tokens were reported, output tokens |
| 80 | + were not. |
| 81 | +- No badge is rendered when neither metric was reported. |
| 82 | + |
| 83 | +The placeholder means "metric not reported," not zero. |
| 84 | + |
| 85 | +### Visibility Rules |
| 86 | + |
| 87 | +- Render a token badge only when at least one relevant metric is |
| 88 | + reportable. |
| 89 | +- Use `—` only for a single missing side of an otherwise reportable |
| 90 | + token summary. |
| 91 | +- Hide the badge entirely when neither side is reportable. |
| 92 | +- Treat message/session badges, subagent badges, summary cards, heatmap, |
| 93 | + top sessions, and CSV export as different views over the same coverage |
| 94 | + semantics, not separate interpretations. |
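The badge and visibility rules can be expressed as one small function. This is a sketch under stated assumptions: `token_badge` and `format_count` are hypothetical names, and the compact `2.4k` formatting is inferred from the example badges in this note rather than taken from the real UI code.

```python
MISSING = "\u2014"  # placeholder for "metric not reported"


def format_count(n):
    """Hypothetical compact formatting: 180 -> '180', 2400 -> '2.4k'."""
    if n >= 1000:
        return f"{n / 1000:.1f}k"
    return str(n)


def token_badge(context_tokens, has_context, output_tokens, has_output):
    """Return badge text, or None when neither metric is reportable."""
    if not has_context and not has_output:
        return None  # hide the badge entirely
    ctx = format_count(context_tokens) if has_context else MISSING
    out = format_count(output_tokens) if has_output else MISSING
    return f"{ctx} ctx / {out} out"


only_output = token_badge(0, False, 180, True)
only_context = token_badge(2400, True, 0, False)
hidden = token_badge(0, False, 0, False)
reported_zeros = token_badge(0, True, 0, True)
```

The coverage flags, not the numeric values, decide what is shown, so two reported zeros render as a real `0 ctx / 0 out` badge.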
| 95 | + |
| 96 | +## Data Model Semantics |
| 97 | + |
| 98 | +Coverage flags are presence markers, not non-zero markers: |
| 99 | + |
| 100 | +- `messages.has_context_tokens` and `messages.has_output_tokens` mean |
| 101 | + the provider payload reported those message-level metrics, even when |
| 102 | + the numeric value was `0`. |
| 103 | +- `sessions.has_peak_context_tokens` and |
| 104 | + `sessions.has_total_output_tokens` mean the session has authoritative |
| 105 | + evidence that those aggregates are reportable, either from parser-owned |
| 106 | + session aggregates or from recoverable message/session coverage during |
| 107 | + legacy repair. |
| 108 | + |
| 109 | +Parser-owned message flags are authoritative. For older rows that |
| 110 | +predate those flags, fallback inspection of `token_usage` keys is |
| 111 | +best-effort only. |
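A minimal sketch of presence-based flags on the parser side. It assumes the provider payload has already been decoded into a dict and uses `context_tokens` and `output_tokens` as illustrative key names; real providers may use different keys, and `message_token_coverage` is a hypothetical helper.

```python
def message_token_coverage(usage):
    """Derive coverage flags from key presence, never from the value."""
    return {
        "has_context_tokens": "context_tokens" in usage,
        "context_tokens": int(usage.get("context_tokens", 0)),
        "has_output_tokens": "output_tokens" in usage,
        "output_tokens": int(usage.get("output_tokens", 0)),
    }


reported_zero = message_token_coverage({"output_tokens": 0})
not_reported = message_token_coverage({})
```

Both results store `output_tokens == 0`; only the flag distinguishes a reported zero from a metric that was never reported.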
| 112 | + |
| 113 | +## Migration And Rollout Semantics |
| 114 | + |
| 115 | +### Full Resync |
| 116 | + |
| 117 | +SQLite `dataVersion` changes force a full resync when parser behavior |
| 118 | +changed in a way that invalidates previously stored token semantics. |
| 119 | +That path rebuilds a fresh database, reprocesses source files, copies |
| 120 | +orphaned sessions and other preserved metadata, and atomically swaps the |
| 121 | +rebuilt database into place. |
| 122 | + |
| 123 | +Use full resync when: |
| 124 | + |
| 125 | +- parser extraction rules changed |
| 126 | +- aggregate token semantics changed |
| 127 | +- previously stored rows cannot be trusted without reparsing source data |
| 128 | + |
| 129 | +The token-metrics rollout deliberately uses both paths, full resync and one-time repair: |
| 130 | + |
| 131 | +- parser/data-version changes force a full SQLite resync before the new |
| 132 | + parser semantics can be trusted for existing local archives |
| 133 | +- current-schema databases that only lack explicit `has_*` flags can use |
| 134 | + one-time repair as a non-destructive bridge until or unless a full |
| 135 | + resync is required |
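The rebuild-and-swap shape of a full resync might look like the sketch below. Everything here is an assumption for illustration: the note does not say where `dataVersion` is stored (SQLite's `PRAGMA user_version` is used as a stand-in), and `needs_resync`, `full_resync`, and the `populate` callback are hypothetical names. The callback represents reparsing source files and copying orphaned sessions and other preserved metadata.

```python
import os
import sqlite3
import tempfile

CURRENT_DATA_VERSION = 2  # hypothetical parser data version


def needs_resync(db_path):
    """True when the stored data version predates the current parser."""
    conn = sqlite3.connect(db_path)
    try:
        (stored,) = conn.execute("PRAGMA user_version").fetchone()
    finally:
        conn.close()
    return stored < CURRENT_DATA_VERSION


def full_resync(db_path, populate):
    """Build a fresh database next to the old one, then swap it into place."""
    tmp_path = db_path + ".resync"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # discard leftovers from an interrupted resync
    fresh = sqlite3.connect(tmp_path)
    try:
        populate(fresh)  # reparse sources, copy preserved metadata
        fresh.execute(f"PRAGMA user_version = {CURRENT_DATA_VERSION}")
        fresh.commit()
    finally:
        fresh.close()
    os.replace(tmp_path, db_path)  # atomic rename on the same filesystem


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sessions.db")
    old = sqlite3.connect(path)
    old.execute("PRAGMA user_version = 1")  # simulate a stale database
    old.close()
    stale_before = needs_resync(path)
    full_resync(path, lambda db: db.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)"))
    stale_after = needs_resync(path)
```

Because the old file stays untouched until the final rename, an interrupted rebuild leaves the existing database in place.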
| 136 | + |
| 137 | +### One-Time Repair |
| 138 | + |
| 139 | +The one-time token coverage repair is for current-schema databases whose |
| 140 | +stored rows are still usable but missing explicit token coverage flags. |
| 141 | + |
| 142 | +- SQLite runs the repair on open until a persisted repair marker is |
| 143 | + stored, so a completed repair runs once per database. |
| 144 | +- PostgreSQL runs the repair when token coverage columns are first added, |
| 145 | + or when the repair marker is absent and the schema already contains |
| 146 | + sessions that may need backfill. |
| 147 | + |
| 148 | +The repair only backfills flags; it does not invent new token counts. |
| 149 | +It is not a substitute for full resync when parser semantics change. |
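A marker-gated repair for SQLite could be sketched as follows. The `meta` table, the marker key, and the function name are hypothetical; only the `sessions` column names come from this note. The sketch backfills a single flag from non-zero totals, which is one of the recoverable signals described in this note.

```python
import sqlite3

REPAIR_MARKER = "token_coverage_repaired"  # hypothetical marker key


def repair_token_coverage_once(conn):
    """Backfill flags once; return rows updated, or None if already done."""
    conn.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    done = conn.execute("SELECT 1 FROM meta WHERE key = ?", (REPAIR_MARKER,)).fetchone()
    if done:
        return None
    with conn:  # one transaction: flags and marker commit together
        cur = conn.execute(
            "UPDATE sessions SET has_total_output_tokens = 1 "
            "WHERE has_total_output_tokens = 0 AND total_output_tokens > 0"
        )
        conn.execute("INSERT INTO meta (key, value) VALUES (?, '1')", (REPAIR_MARKER,))
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    "id TEXT PRIMARY KEY, "
    "total_output_tokens INTEGER NOT NULL DEFAULT 0, "
    "has_total_output_tokens INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO sessions (id, total_output_tokens) VALUES (?, ?)",
    [("a", 50), ("b", 0)],
)
first_run = repair_token_coverage_once(conn)
second_run = repair_token_coverage_once(conn)
flags = dict(conn.execute("SELECT id, has_total_output_tokens FROM sessions"))
```

Writing the marker in the same transaction as the flag updates means an interrupted repair leaves no marker behind, so the next open reruns it. The token counts themselves are never modified.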
| 150 | + |
| 151 | +### PostgreSQL Convergence |
| 152 | + |
| 153 | +PostgreSQL does not reparse source files itself. Its convergence model is: |
| 154 | + |
| 155 | +- schema upgrade adds token-coverage columns and can perform one-time |
| 156 | + repair on already-synced PG rows when enough stored signal exists |
| 157 | +- if parser semantics changed, SQLite becomes the source of truth after |
| 158 | + local full resync |
| 159 | +- PG only becomes equally trustworthy after a repaired/resynced local |
| 160 | + database pushes fresh session/message rows again |
| 161 | +- until that push happens, PG-backed views may remain best-effort for |
| 162 | + older rows even though the schema is upgraded |
| 163 | + |
| 164 | +Operationally, PG-backed deployments should assume: |
| 165 | + |
| 166 | +- `pg serve` is fully trustworthy for newly pushed or newly repaired PG |
| 167 | + rows |
| 168 | +- historical PG rows produced under older parser semantics may lag until |
| 169 | + a local client republishes them |
| 170 | +- the recovery path is `pg push --full` from a machine whose SQLite DB |
| 171 | + has already completed the required repair or full resync |
| 172 | +- v1 does not provide a measurable per-row convergence/provenance marker; |
| 173 | + operators must treat PG token analytics as operationally best-effort |
| 174 | + until the relevant repaired/resynced SQLite sources have republished |
| 175 | + them |
| 176 | + |
| 177 | +### Best-Effort Only |
| 178 | + |
| 179 | +Legacy repair is intentionally limited to rows with enough surviving |
| 180 | +signal to infer coverage: |
| 181 | + |
| 182 | +- non-zero `context_tokens` or `output_tokens` |
| 183 | +- `token_usage` payloads that still contain token keys, including |
| 184 | + explicit zero-valued keys such as `{"output_tokens":0}` |
| 185 | +- session aggregates that already stored non-zero totals or peaks |
| 186 | + |
| 187 | +If a legacy row has empty `token_usage`, zero numeric counts, and no |
| 188 | +stored coverage flags, there is no reliable way to distinguish "not |
| 189 | +reported" from "reported zero." Those rows remain unrecoverable without |
| 190 | +reparsing source files, and even a full resync only helps if the source |
| 191 | +files still preserve the original token signal. |
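The recoverability test for a single legacy message row can be sketched like this, shown for output tokens only. The function name is hypothetical, and the sketch assumes `token_usage` is stored as a JSON text blob that may be empty or malformed.

```python
import json


def can_infer_output_coverage(output_tokens, token_usage_json):
    """True when a legacy row still carries enough signal to set the flag."""
    if output_tokens != 0:
        return True  # a non-zero count proves the metric was reported
    try:
        usage = json.loads(token_usage_json) if token_usage_json else {}
    except json.JSONDecodeError:
        return False  # unreadable payload: no reliable signal
    # Key presence counts even when the value is an explicit zero.
    return isinstance(usage, dict) and "output_tokens" in usage


nonzero_count = can_infer_output_coverage(12, "")
explicit_zero = can_infer_output_coverage(0, '{"output_tokens":0}')
empty_payload = can_infer_output_coverage(0, "")
no_token_keys = can_infer_output_coverage(0, '{"model":"x"}')
```

A `False` result does not mean "not reported"; it means the row is ambiguous and stays excluded unless a reparse of the source files recovers the signal.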
| 192 | + |
| 193 | +## Provider Differences |
| 194 | + |
| 195 | +Providers do not report token usage uniformly: |
| 196 | + |
| 197 | +- Some providers emit explicit per-message token keys, including keys |
| 198 | + whose value is `0`. These are recoverable because presence is visible. |
| 199 | +- Some providers only contribute session-level aggregates or uneven |
| 200 | + message/session coverage, so session flags may be known even when one |
| 201 | + message field is absent. |
| 202 | +- Older ingested rows may predate parser-owned coverage flags entirely, |
| 203 | + leaving only raw numeric values or partial `token_usage` blobs. |
| 204 | + |
| 205 | +The system therefore treats token coverage as provider-specific metadata |
| 206 | +that must be preserved, not re-derived from a single universal rule. |
| 207 | + |
| 208 | +## Acceptance Criteria |
| 209 | + |
| 210 | +- Reported zero token values survive end-to-end without being mistaken |
| 211 | + for missing data. |
| 212 | +- Analytics `Output Tokens` excludes sessions with unknown output-token |
| 213 | + coverage. |
| 214 | +- Analytics `Reporting Sessions` matches the count of sessions included |
| 215 | + in token-output totals. |
| 216 | +- Session token badges show `—` only for genuinely missing metrics. |
| 217 | +- SQLite and PostgreSQL both perform one-time coverage repair when |
| 218 | + appropriate, and skip it once the persisted marker is present. |
| 219 | +- Parser data-version bumps require a full resync instead of silently |
| 220 | + trusting stale token semantics. |
| 221 | +- Unrecoverable legacy rows remain excluded rather than being |
| 222 | + misclassified. |
| 223 | +- SQLite instances with stale parser semantics clearly enter |
| 224 | + `NeedsResync()` / resync-required state. |
| 225 | +- One-time repair emits enough logging/state for operators to tell when |
| 226 | + it ran and whether it updated anything. |
| 227 | +- PG-backed deployments have an explicit documented expectation that |
| 228 | + repaired/resynced local data must be pushed before PG analytics are |
| 229 | + fully converged. |
| 230 | + |
| 231 | +## Compatibility Expectations |
| 232 | + |
| 233 | +### SQLite |
| 234 | + |
| 235 | +- Existing SQLite databases may open in one of two modes: |
| 236 | + - repair-only, when stored token data is still semantically valid but |
| 237 | + explicit `has_*` flags are missing |
| 238 | + - full-resync-required, when `dataVersion` indicates parser semantics |
| 239 | + changed and existing stored token meaning may be stale |
| 240 | +- During the repair-only path, historical rows may be best-effort until |
| 241 | + the one-time repair completes. |
| 242 | + |
| 243 | +### PostgreSQL |
| 244 | + |
| 245 | +- Existing PostgreSQL schemas accept additive `has_*` columns and a |
| 246 | + one-time repair step without dropping stored session history. |
| 247 | +- `pg serve` and SQLite-backed `serve` are expected to return the same |
| 248 | + token fields and analytics metrics for the same repaired data set. |
| 249 | +- If PG rows were originally pushed from a stale local parser build, |
| 250 | + schema repair alone may not fully converge them; a subsequent push from |
| 251 | + a repaired/resynced SQLite source is the compatibility bridge. |
| 252 | + |
| 253 | +### API And CSV |
| 254 | + |
| 255 | +- New API fields and metric enums are additive. |
| 256 | +- Older clients may ignore the new fields safely. |
| 257 | +- Older servers may omit `total_output_tokens` and |
| 258 | + `token_reporting_sessions` entirely. |
| 259 | +- Newer clients must assume mixed-version data can exist briefly during |
| 260 | + rollout and therefore treat unknown coverage as unknown, not zero. |
| 261 | +- UI surfaces should render `—` for missing summary capability rather |
| 262 | + than `0`. |
| 263 | +- Heatmap and top-session `output_tokens` controls should remain hidden |
| 264 | + when the server does not expose the token-summary capability needed to |
| 265 | + back them. |
| 266 | +- CSV export mirrors the new summary semantics: `Output Tokens` only |
| 267 | + counts reporting sessions, and `Reporting Sessions` makes that |
| 268 | + coverage explicit. |
| 269 | + |
| 270 | +## Rollout Order |
| 271 | + |
| 272 | +1. Define the token coverage model and compatibility rules. |
| 273 | + Gate: written semantics for `has_*`, `Output Tokens`, `Reporting Sessions`, |
| 274 | + placeholders, and unrecoverable legacy rows. |
| 275 | +2. Land SQLite schema changes plus one-time repair markers. |
| 276 | + Gate: legacy SQLite DBs can open, repair once, and stop rescanning. |
| 277 | +3. Land PostgreSQL schema changes plus one-time repair markers. |
| 278 | + Gate: upgraded PG schema can repair once and preserve token fields. |
| 279 | +4. Land parser-owned token presence for supported providers. |
| 280 | + Gate: explicit zero-valued token keys keep `Has*Tokens=true`. |
| 281 | +5. Require full SQLite resync when parser semantics change. |
| 282 | + Gate: stale local DBs surface resync-required state before user-facing |
| 283 | + analytics are trusted or stale token semantics can propagate. |
| 284 | +6. Land sync/push/upload propagation. |
| 285 | + Gate: full sync, incremental append, upload, and PG push preserve |
| 286 | + token coverage truth, and local databases in resync-required state |
| 287 | + are not treated as trusted token sources. |
| 288 | +7. Land analytics endpoints/store/API semantics. |
| 289 | + Gate: SQLite and PG analytics agree for repaired/resynced data, and |
| 290 | + `output_tokens` excludes non-reporting sessions. |
| 291 | +8. Land UI surfacing. |
| 292 | + Gate: badges, controls, placeholders, and summary cards all match the |
| 293 | + documented semantics on desktop and mobile. |
| 294 | +9. Run final integrated verification and history cleanup. |
| 295 | + |
| 296 | +This order matters because user-visible analytics and UI must not be the |
| 297 | +first consumer of stale token semantics. |
| 298 | + |
| 299 | +## Verification Matrix |
| 300 | + |
| 301 | +- Legacy SQLite DB with missing `has_*` flags but recoverable token |
| 302 | + signal |
| 303 | +- Legacy current-schema SQLite DB that needs one-time repair exactly |
| 304 | + once |
| 305 | +- SQLite DB that requires full resync because parser semantics changed |
| 306 | +- PostgreSQL schema upgrade with one-time repair marker |
| 307 | +- repaired/resynced SQLite -> `pg push --full` -> PG parity |
| 308 | +- Incremental append sync after repair and after resync-required state |
| 309 | +- Upload path for sessions with explicit zero-valued token keys |
| 310 | +- Analytics summary, heatmap, top-sessions, and CSV export for reporting |
| 311 | + vs non-reporting sessions |
| 312 | +- Frontend badges and placeholders for reported, partially reported, and |
| 313 | + missing token data |
| 314 | + |
| 315 | +## Operational Notes |
| 316 | + |
| 317 | +- One-time repair is expected to scan only candidate rows and stop once |
| 318 | + the repair marker is recorded. |
| 319 | +- If repair markers are missing or a repair is interrupted, reopen will |
| 320 | + rerun repair until the marker is persisted. |
| 321 | +- If repair output is incorrect because parser semantics changed, the |
| 322 | + recovery path is a full SQLite resync, not repeated best-effort |
| 323 | + repair. |
| 324 | +- Expected repair cost is proportional to candidate legacy rows. On a |
| 325 | + large archive this may be noticeable once, but it should not recur on |
| 326 | + every startup after the marker is set. |
| 327 | +- Operators should expect log evidence for: |
| 328 | + - whether one-time repair ran |
| 329 | + - how many message/session rows were updated |
| 330 | + - whether the local SQLite DB is in resync-required state |
| 331 | + - whether PG still needs a fresh push from a repaired/resynced source |
| 332 | + |
| 333 | +## Rollback And Containment |
| 334 | + |
| 335 | +- If PG analytics remain misleading after schema upgrade, contain the |
| 336 | + issue by running `pg push --full` from a repaired/resynced local DB |
| 337 | + before relying on PG-backed dashboards. |
| 338 | +- v1 does not expose a first-class PG convergence status surface. The |
| 339 | + supported operational stance is therefore conservative: treat PG token |
| 340 | + analytics as best-effort until the relevant repaired/resynced local |
| 341 | + sources have republished them, and do not promise a stronger guarantee |
| 342 | + in UI or API status output. |
| 343 | +- If SQLite repair produced insufficient coverage and the source |
| 344 | + files are still available, prefer full resync over repeated repair. |
| 345 | +- If source files are gone and legacy rows are unrecoverable, the system |
| 346 | + should continue to exclude them from reporting-based analytics rather |
| 347 | + than inventing token coverage. |