Commit bb286fe

tpnwesm authored and committed
feat: persist token coverage metadata
- feat: add token analytics metrics
- feat: surface token usage in the UI
- docs: add token metrics design note
1 parent 496f9a8 commit bb286fe

47 files changed

Lines changed: 4024 additions & 373 deletions


docs/token-metrics.md

Lines changed: 347 additions & 0 deletions
# Token Metrics Design Note

## Problem Statement

AgentsView now surfaces token usage in both the session UI and the
analytics summary, but legacy stored data did not preserve an important
distinction: "the provider reported a token count of `0`" versus "the
provider never reported this metric at all."

Without explicit coverage flags, the product cannot tell whether a zero
value should participate in token analytics, whether a session badge
should show a real zero, or whether the UI should show a missing-data
placeholder instead. This especially affects `Output Tokens`, because
treating non-reporting sessions as zero-token sessions understates the
uncertainty and overstates the completeness of analytics.

## Goals

- Persist token coverage explicitly at both message and session level.
- Preserve the meaning of reported zero values.
- Exclude non-reporting sessions from token aggregates instead of
  silently treating them as zero.
- Repair legacy rows when the existing stored signal is sufficient.
- Force a full resync when parser behavior changed in ways that make
  existing stored token semantics stale.
- Keep rollout non-destructive: preserve existing session history,
  orphaned sessions, and excluded-session metadata.

## Non-Goals

- Reconstruct token metrics that were never present in source data.
- Infer missing token coverage from zeros alone.
- Normalize all providers to emit identical token payload shapes.
- Retroactively make old unrecoverable rows distinguishable.
- Change unrelated analytics semantics outside token reporting.
- Provide a per-row PostgreSQL provenance model or a first-class
  convergence status surface in v1.

## User-Facing Semantics

### Output Tokens

`Output Tokens` in analytics is the sum of `sessions.total_output_tokens`
only for sessions where `sessions.has_total_output_tokens = true`.
Sessions that did not report output tokens do not contribute `0`; they
are excluded from the sum entirely.

This same exclusion rule applies to all `output_tokens` analytics:

- the summary total only includes reporting sessions
- the heatmap only accumulates output tokens from reporting sessions for
  each day
- `Top Sessions` with `metric=output_tokens` ranks only reporting
  sessions, ordered by `total_output_tokens DESC, id ASC`
- an empty state means "no reporting sessions in range", not "all
  sessions reported zero"
- newer clients should not show `output_tokens` heatmap/top-session
  controls unless the server exposes token-summary capability

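Under these rules, the summary total reduces to a single filtered sum.
A minimal runnable sketch, assuming a hypothetical two-column subset of
the `sessions` schema (the real table has more columns):

```python
import sqlite3

# Illustrative subset of the sessions schema; column names follow the
# design note, everything else here is a sketch.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE sessions (
        id INTEGER PRIMARY KEY,
        total_output_tokens INTEGER NOT NULL DEFAULT 0,
        has_total_output_tokens INTEGER NOT NULL DEFAULT 0
    )
""")
db.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        (1, 180, 1),  # reported 180 output tokens
        (2, 0, 1),    # reported a real zero: included, adds 0
        (3, 0, 0),    # never reported: excluded entirely
    ],
)

# Only reporting sessions participate in the sum and the count.
total, reporting = db.execute("""
    SELECT COALESCE(SUM(total_output_tokens), 0), COUNT(*)
    FROM sessions
    WHERE has_total_output_tokens = 1
""").fetchone()

print(total, reporting)  # 180 2
```

Note that session 2 lowers nothing but still counts as a reporting
session, while session 3 affects neither number.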
### Reporting Sessions

`Reporting Sessions` is the count of sessions included in token-output
analytics. In practice, this is the number of sessions where output
token coverage is known, not the number of sessions in the filtered
result set overall.

On newer clients talking to older servers, `Output Tokens` and
`Reporting Sessions` may be absent entirely. In that case the UI should
render `—` rather than silently falling back to `0`, because the server
capability is unknown.

### `—` Placeholders

Session-level token badges use explicit placeholders when one side of
the pair is missing:

- `— ctx / 180 out` means output tokens were reported, context tokens
  were not.
- `2.4k ctx / — out` means context tokens were reported, output tokens
  were not.
- No badge is rendered when neither metric was reported.

The placeholder means "metric not reported," not zero.

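The placeholder rules can be sketched as a small formatting helper.
This is an illustrative sketch, not the actual UI code: the helper
names are hypothetical, and `None` stands in for a false coverage flag
while `0` is a real reported zero.

```python
from typing import Optional

def format_tokens(n: int) -> str:
    # Hypothetical compact formatter: 2400 -> "2.4k", 180 -> "180".
    return f"{n / 1000:.1f}k" if n >= 1000 else str(n)

def token_badge(ctx: Optional[int], out: Optional[int]) -> Optional[str]:
    """Render a session token badge per the placeholder rules.

    None means "metric not reported" (coverage flag false); 0 is a
    reported zero. Returns None when no badge should be rendered.
    """
    if ctx is None and out is None:
        return None  # neither side reportable: hide the badge
    ctx_part = format_tokens(ctx) if ctx is not None else "—"
    out_part = format_tokens(out) if out is not None else "—"
    return f"{ctx_part} ctx / {out_part} out"

print(token_badge(None, 180))   # — ctx / 180 out
print(token_badge(2400, None))  # 2.4k ctx / — out
print(token_badge(None, None))  # None
```

Under this shape, `token_badge(0, 0)` renders `0 ctx / 0 out`: a
reported zero is shown, never replaced by the placeholder.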
### Visibility Rules

- Render a token badge only when at least one relevant metric is
  reportable.
- Use `—` only for a single missing side of an otherwise reportable
  token summary.
- Hide the badge entirely when neither side is reportable.
- Treat message/session badges, subagent badges, summary cards, heatmap,
  top sessions, and CSV export as different views over the same coverage
  semantics, not separate interpretations.

## Data Model Semantics

Coverage flags are presence markers, not non-zero markers:

- `messages.has_context_tokens` and `messages.has_output_tokens` mean
  the provider payload reported those message-level metrics, even when
  the numeric value was `0`.
- `sessions.has_peak_context_tokens` and
  `sessions.has_total_output_tokens` mean the session has authoritative
  evidence that those aggregates are reportable, either from parser-owned
  session aggregates or from recoverable message/session coverage during
  legacy repair.

Parser-owned message flags are authoritative. For older rows that
predate those flags, fallback inspection of `token_usage` keys is
best-effort only.

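A minimal sketch of the presence-marker rule, assuming a message-level
`token_usage` payload is a plain key/value map. `message_coverage` is a
hypothetical helper; real parser code is provider-specific.

```python
def message_coverage(token_usage: dict) -> tuple:
    """Presence flags: a flag is true when the provider payload
    contains the key at all, even when its value is 0. Key names are
    illustrative and vary by provider."""
    return (
        "context_tokens" in token_usage,
        "output_tokens" in token_usage,
    )

# A reported zero still sets the flag; absence does not.
print(message_coverage({"output_tokens": 0}))  # (False, True)
print(message_coverage({}))                    # (False, False)
```

The important property is that `{"output_tokens": 0}` and `{}` produce
different flags, which is exactly the distinction legacy rows lost.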
## Migration And Rollout Semantics

### Full Resync

SQLite `dataVersion` changes force a full resync when parser behavior
changed in a way that invalidates previously stored token semantics.
That path rebuilds a fresh database, reprocesses source files, copies
orphaned sessions and other preserved metadata, and atomically swaps the
rebuilt database into place.

Use full resync when:

- parser extraction rules changed
- aggregate token semantics changed
- previously stored rows cannot be trusted without reparsing source data

The token-metrics rollout deliberately uses both paths:

- parser/data-version changes force a full SQLite resync before the new
  parser semantics can be trusted for existing local archives
- current-schema databases that only lack explicit `has_*` flags can use
  one-time repair as a non-destructive bridge until or unless a full
  resync is required

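The resync gate itself can be sketched as a plain version comparison.
Both the constant value and the function name below are hypothetical;
the note only specifies that a `dataVersion` mismatch must force a
rebuild rather than trusting stored rows.

```python
# Illustrative only: the real constant lives with the parser, and the
# real check is the note's NeedsResync() / resync-required state.
CURRENT_DATA_VERSION = 7

def needs_full_resync(stored_data_version: int) -> bool:
    """Any stored dataVersion that differs from the current parser's
    version means stored token semantics may be stale, so the database
    must be rebuilt from source files rather than repaired in place."""
    return stored_data_version != CURRENT_DATA_VERSION

print(needs_full_resync(6))  # True: stale parser semantics, full resync
print(needs_full_resync(7))  # False: current schema, repair path suffices
```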
### One-Time Repair

The one-time token coverage repair is for current-schema databases whose
stored rows are still usable but missing explicit token coverage flags.

- SQLite runs the repair once per database, stopping once a persisted
  repair marker is stored.
- PostgreSQL runs the repair when token coverage columns are first added,
  or when the repair marker is absent and the schema already contains
  sessions that may need backfill.

The repair only backfills flags; it does not invent new token counts.
It is not a substitute for a full resync when parser semantics change.

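A runnable sketch of the marker-guarded repair, using Python's stdlib
`sqlite3`. The `meta` table, marker key, and single-column backfill are
assumptions for illustration; the real repair also inspects
`token_usage` payloads and message rows.

```python
import sqlite3

def run_one_time_repair(db: sqlite3.Connection) -> int:
    """Backfill coverage flags at most once per database, guarded by a
    persisted marker. Returns the number of rows updated."""
    db.execute("CREATE TABLE IF NOT EXISTS meta (key TEXT PRIMARY KEY, value TEXT)")
    done = db.execute(
        "SELECT 1 FROM meta WHERE key = 'token_coverage_repair'"
    ).fetchone()
    if done:
        return 0  # marker present: never rescan
    # Backfill only rows with surviving signal; never invent counts.
    cur = db.execute("""
        UPDATE sessions SET has_total_output_tokens = 1
        WHERE has_total_output_tokens = 0 AND total_output_tokens > 0
    """)
    db.execute("INSERT INTO meta VALUES ('token_coverage_repair', 'done')")
    db.commit()
    return cur.rowcount

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE sessions (
    id INTEGER PRIMARY KEY,
    total_output_tokens INTEGER NOT NULL DEFAULT 0,
    has_total_output_tokens INTEGER NOT NULL DEFAULT 0)""")
db.executemany("INSERT INTO sessions VALUES (?, ?, ?)",
               [(1, 180, 0), (2, 0, 0)])
print(run_one_time_repair(db))  # 1: session 1 backfilled
print(run_one_time_repair(db))  # 0: marker present, repair skipped
```

Session 2 is deliberately left alone: a zero total with no other signal
is exactly the unrecoverable case the repair must not guess about.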
### PostgreSQL Convergence

PostgreSQL does not reparse source files itself. Its convergence model is:

- schema upgrade adds token-coverage columns and can perform one-time
  repair on already-synced PG rows when enough stored signal exists
- if parser semantics changed, SQLite becomes the source of truth after
  local full resync
- PG only becomes equally trustworthy after a repaired/resynced local
  database pushes fresh session/message rows again
- until that push happens, PG-backed views may remain best-effort for
  older rows even though the schema is upgraded

Operationally, PG-backed deployments should assume:

- `pg serve` is fully trustworthy for newly pushed or newly repaired PG
  rows
- historical PG rows produced under older parser semantics may lag until
  a local client republishes them
- the recovery path is `pg push --full` from a machine whose SQLite DB
  has already completed the required repair or full resync
- v1 does not provide a measurable per-row convergence/provenance marker;
  operators must treat PG token analytics as operationally best-effort
  until the relevant repaired/resynced SQLite sources have republished
  them

### Best-Effort Only

Legacy repair is intentionally limited to rows with enough surviving
signal to infer coverage:

- non-zero `context_tokens` or `output_tokens`
- `token_usage` payloads that still contain token keys, including
  explicit zero-valued keys such as `{"output_tokens": 0}`
- session aggregates that already stored non-zero totals or peaks

If a legacy row has empty `token_usage`, zero numeric counts, and no
stored coverage flags, there is no reliable way to distinguish "not
reported" from "reported zero." Those rows remain unrecoverable without
reparsing source files, and even a full resync only helps if the source
files still preserve the original token signal.

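The recoverability rule can be sketched as a predicate over a legacy
row's surviving fields. `coverage_recoverable` is a hypothetical helper
and only covers the message-level signals listed above; the real check
also consults session aggregates.

```python
import json

def coverage_recoverable(context_tokens: int, output_tokens: int,
                         token_usage_json: str) -> bool:
    """True when a legacy row carries enough surviving signal for the
    best-effort repair to infer token coverage."""
    if context_tokens > 0 or output_tokens > 0:
        return True  # non-zero counts prove the metric was reported
    usage = json.loads(token_usage_json) if token_usage_json else {}
    # Explicit zero-valued keys like {"output_tokens": 0} still prove
    # presence; an empty payload proves nothing either way.
    return any(k in usage for k in ("context_tokens", "output_tokens"))

print(coverage_recoverable(0, 0, '{"output_tokens": 0}'))  # True
print(coverage_recoverable(0, 0, ""))                      # False
```

The `False` case is the unrecoverable row: zero counts, empty payload,
no flags, hence excluded rather than guessed.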
## Provider Differences

Providers do not report token usage uniformly:

- Some providers emit explicit per-message token keys, including keys
  whose value is `0`. These are recoverable because presence is visible.
- Some providers only contribute session-level aggregates or uneven
  message/session coverage, so session flags may be known even when one
  message field is absent.
- Older ingested rows may predate parser-owned coverage flags entirely,
  leaving only raw numeric values or partial `token_usage` blobs.

The system therefore treats token coverage as provider-specific metadata
that must be preserved, not re-derived from a single universal rule.

## Acceptance Criteria

- Reported zero token values survive end-to-end without being mistaken
  for missing data.
- Analytics `Output Tokens` excludes sessions with unknown output-token
  coverage.
- Analytics `Reporting Sessions` matches the count of sessions included
  in token-output totals.
- Session token badges show `—` only for genuinely missing metrics.
- SQLite and PostgreSQL both perform one-time coverage repair when
  appropriate, and skip it once the persisted marker is present.
- Parser data-version bumps require a full resync instead of silently
  trusting stale token semantics.
- Unrecoverable legacy rows remain excluded rather than being
  misclassified.
- SQLite instances with stale parser semantics clearly enter the
  `NeedsResync()` / resync-required state.
- One-time repair emits enough logging/state for operators to tell when
  it ran and whether it updated anything.
- PG-backed deployments have an explicit documented expectation that
  repaired/resynced local data must be pushed before PG analytics are
  fully converged.

## Compatibility Expectations

### SQLite

- Existing SQLite databases may open in one of two modes:
  - repair-only, when stored token data is still semantically valid but
    explicit `has_*` flags are missing
  - full-resync-required, when `dataVersion` indicates parser semantics
    changed and existing stored token meaning may be stale
- During the repair-only path, historical rows may be best-effort until
  the one-time repair completes.

### PostgreSQL

- Existing PostgreSQL schemas accept additive `has_*` columns and a
  one-time repair step without dropping stored session history.
- `pg serve` and SQLite-backed `serve` are expected to return the same
  token fields and analytics metrics for the same repaired data set.
- If PG rows were originally pushed from a stale local parser build,
  schema repair alone may not fully converge them; a subsequent push from
  a repaired/resynced SQLite source is the compatibility bridge.

### API And CSV

- New API fields and metric enums are additive.
- Older clients may ignore the new fields safely.
- Older servers may omit `total_output_tokens` and
  `token_reporting_sessions` entirely.
- Newer clients must assume mixed-version data can exist briefly during
  rollout and therefore treat unknown coverage as unknown, not zero.
- UI surfaces should render `—` for missing summary capability rather
  than `0`.
- Heatmap and top-session `output_tokens` controls should remain hidden
  when the server does not expose the token-summary capability needed to
  back them.
- CSV export mirrors the new summary semantics: `Output Tokens` only
  counts reporting sessions, and `Reporting Sessions` makes that
  coverage explicit.

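The newer-client rule for optional summary fields can be sketched as a
small mapping step. The payload keys come from this note; the helper
name and output dictionary shape are illustrative, not the real client
code.

```python
def summary_display(payload: dict) -> dict:
    """Map an analytics summary payload to display values. An older
    server may omit the token fields entirely; the client must treat
    that as unknown capability, never as zero."""
    has_summary = ("total_output_tokens" in payload
                   and "token_reporting_sessions" in payload)
    return {
        # Render the em-dash placeholder, never a fake 0, when absent.
        "output_tokens": payload["total_output_tokens"] if has_summary else "—",
        "reporting_sessions": payload["token_reporting_sessions"] if has_summary else "—",
        # Hide output_tokens heatmap/top-session controls without capability.
        "show_output_token_controls": has_summary,
    }

print(summary_display({}))  # placeholders shown, controls hidden
print(summary_display({"total_output_tokens": 180,
                       "token_reporting_sessions": 2}))
```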
## Rollout Order

1. Define the token coverage model and compatibility rules.
   Gate: written semantics for `has_*`, `Output Tokens`,
   `Reporting Sessions`, placeholders, and unrecoverable legacy rows.
2. Land SQLite schema changes plus one-time repair markers.
   Gate: legacy SQLite DBs can open, repair once, and stop rescanning.
3. Land PostgreSQL schema changes plus one-time repair markers.
   Gate: upgraded PG schema can repair once and preserve token fields.
4. Land parser-owned token presence for supported providers.
   Gate: explicit zero-valued token keys keep `Has*Tokens=true`.
5. Require full SQLite resync when parser semantics change.
   Gate: stale local DBs surface resync-required state before user-facing
   analytics are trusted or stale token semantics can propagate.
6. Land sync/push/upload propagation.
   Gate: full sync, incremental append, upload, and PG push preserve
   token coverage truth, and local databases in resync-required state
   are not treated as trusted token sources.
7. Land analytics endpoints/store/API semantics.
   Gate: SQLite and PG analytics agree for repaired/resynced data, and
   `output_tokens` excludes non-reporting sessions.
8. Land UI surfacing.
   Gate: badges, controls, placeholders, and summary cards all match the
   documented semantics on desktop and mobile.
9. Run final integrated verification and history cleanup.

This order matters because user-visible analytics and UI must not be the
first consumer of stale token semantics.

## Verification Matrix

- Legacy SQLite DB with missing `has_*` flags but recoverable token
  signal
- Legacy current-schema SQLite DB that needs one-time repair exactly
  once
- SQLite DB that requires full resync because parser semantics changed
- PostgreSQL schema upgrade with one-time repair marker
- Repaired/resynced SQLite -> `pg push --full` -> PG parity
- Incremental append sync after repair and after resync-required state
- Upload path for sessions with explicit zero-valued token keys
- Analytics summary, heatmap, top sessions, and CSV export for reporting
  vs non-reporting sessions
- Frontend badges and placeholders for reported, partially reported, and
  missing token data

## Operational Notes

- One-time repair is expected to scan only candidate rows and stop once
  the repair marker is recorded.
- If repair markers are missing or a repair is interrupted, reopen will
  rerun repair until the marker is persisted.
- If repair output is incorrect because parser semantics changed, the
  recovery path is a full SQLite resync, not repeated best-effort
  repair.
- Expected repair cost is proportional to the number of candidate legacy
  rows. On a large archive this may be noticeable once, but it should
  not recur on every startup after the marker is set.
- Operators should expect log evidence for:
  - whether one-time repair ran
  - how many message/session rows were updated
  - whether the local SQLite DB is in resync-required state
  - whether PG still needs a fresh push from a repaired/resynced source

## Rollback And Containment

- If PG analytics remain misleading after schema upgrade, contain the
  issue by running `pg push --full` from a repaired/resynced local DB
  before relying on PG-backed dashboards.
- v1 does not expose a first-class PG convergence status surface. The
  supported operational stance is therefore conservative: treat PG token
  analytics as best-effort until the relevant repaired/resynced local
  sources have republished them, and do not promise a stronger guarantee
  in UI or API status output.
- If SQLite repair produced insufficient coverage and the source files
  are still available, prefer a full resync over repeated repair.
- If source files are gone and legacy rows are unrecoverable, the system
  should continue to exclude them from reporting-based analytics rather
  than inventing token coverage.
