feat: flag token-count inconsistencies — cc.billing.reconciliation + feedback score by jverre · Pull Request #24 · comet-ml/opik-claude-code-plugin

jverre · 2026-06-12T09:28:57Z

Details

Makes token-count inconsistencies monitorable per trace instead of latent:

cc.billing.reconciliation (metadata, every trace): {input_delta, cache_read_delta, cache_creation_delta, output_delta, consistent}, where each delta is Σ lanes minus API usage for that tier column. Healthy attribution reconciles to zero (the unattributed lane absorbs undershoot); a non-zero delta means the replay disagrees with what the API billed, usually because of a truncation mechanism we don't detect yet.
token_count_consistent feedback score (every trace with billing): 1 when all deltas are zero, 0 otherwise, with the per-column deltas in the score reason. Filter traces by this score in the UI, or alert when the average drops below 1.
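
For readers skimming the PR, the reconciliation and score described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's code: the `Usage` and `Reconciliation` types, the `reconcile` and `score` functions, and the reason string format are all assumptions made for the example.

```go
package main

import "fmt"

// Usage holds one value per tier column. Hypothetical type for this sketch.
type Usage struct {
	Input, CacheRead, CacheCreation, Output int64
}

// Reconciliation mirrors the fields listed for cc.billing.reconciliation.
// The Go field names are assumptions.
type Reconciliation struct {
	InputDelta, CacheReadDelta, CacheCreationDelta, OutputDelta int64
	Consistent                                                  bool
}

// reconcile computes, per column, the sum over lanes minus API usage.
func reconcile(lanes []Usage, api Usage) Reconciliation {
	var sum Usage
	for _, l := range lanes {
		sum.Input += l.Input
		sum.CacheRead += l.CacheRead
		sum.CacheCreation += l.CacheCreation
		sum.Output += l.Output
	}
	r := Reconciliation{
		InputDelta:         sum.Input - api.Input,
		CacheReadDelta:     sum.CacheRead - api.CacheRead,
		CacheCreationDelta: sum.CacheCreation - api.CacheCreation,
		OutputDelta:        sum.Output - api.Output,
	}
	r.Consistent = r.InputDelta == 0 && r.CacheReadDelta == 0 &&
		r.CacheCreationDelta == 0 && r.OutputDelta == 0
	return r
}

// score returns 1 when consistent, else 0 with the per-column deltas as the reason.
func score(r Reconciliation) (float64, string) {
	if r.Consistent {
		return 1, ""
	}
	return 0, fmt.Sprintf(
		"input_delta=%d cache_read_delta=%d cache_creation_delta=%d output_delta=%d",
		r.InputDelta, r.CacheReadDelta, r.CacheCreationDelta, r.OutputDelta)
}

func main() {
	// Illustrative numbers only: lanes overshoot the billed input by 20.
	api := Usage{Input: 100, Output: 5}
	lanes := []Usage{{Input: 90, Output: 5}, {Input: 30}}
	value, reason := score(reconcile(lanes, api))
	fmt.Println(value, reason)
}
```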

This PR also re-applies the exact-overshoot clamp removal that was decided on #22 but missed its merge (commit f41d31c was pushed to the branch after the merge picked up the earlier head). With the clamp still in main, usage-derived pieces were silently rescaled to force Σ lanes == usage, which would make this flag a tautology. The design is: usage-derived pieces are never rescaled, unattributed is the only reconciliation mechanism, and the discrepancy IS the signal - it's how the synthetic zero-usage entries (#23) were found.
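
A small sketch of why the clamp would make the flag a tautology, for one tier column. The PR does not show how the removed clamp rescaled pieces, so proportional scaling is assumed here; `attribute` and `clampToBilled` are hypothetical names for illustration only.

```go
package main

import "fmt"

// attribute applies the no-clamp design to one column: pieces are left
// untouched, the unattributed lane absorbs undershoot, and any overshoot
// surfaces as a positive delta.
func attribute(pieces []int64, billed int64) (unattributed, delta int64) {
	var sum int64
	for _, p := range pieces {
		sum += p
	}
	if sum < billed {
		unattributed = billed - sum
	}
	delta = sum + unattributed - billed
	return unattributed, delta
}

// clampToBilled sketches the removed behaviour under an ASSUMED proportional
// rescale: on overshoot, pieces are scaled so their sum equals billed.
func clampToBilled(pieces []int64, billed int64) []int64 {
	var sum int64
	for _, p := range pieces {
		sum += p
	}
	out := make([]int64, len(pieces))
	copy(out, pieces)
	if sum <= billed || sum == 0 {
		return out
	}
	var acc int64
	for i, p := range pieces {
		out[i] = p * billed / sum
		acc += out[i]
	}
	out[len(out)-1] += billed - acc // give the rounding remainder to the last piece
	return out
}

func main() {
	pieces := []int64{80, 40} // illustrative overshoot against 100 billed
	_, noClamp := attribute(pieces, 100)
	_, clamped := attribute(clampToBilled(pieces, 100), 100)
	fmt.Println("delta without clamp:", noClamp)
	fmt.Println("delta with clamp:", clamped)
}
```

With the clamp in place the delta is zero by construction, so a consistency flag computed afterwards could never fire; without it, the overshoot stays visible.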

Unified count derivation (fixes "tokens without calls" in the AI Spend breakdowns): item counts were computed by a parallel scanner (countNewEvents) that derived keys and sizing independently of billing. Prompts were bucketed by the chars-ratio estimate while billing bucketed by measured tokens, so a prompt could be counted in large but billed in xlarge (observed in production data as large: count=1, tokens=0 next to xlarge: count=0, tokens=4M). Output items were keyed {output, <lane>/<entity>} while counts used input-side keys, so output items could never carry a count. Skill loads used a different detector than billing's SHA matching. Counts are now derived from the same conversationPieces layout billing re-bills every turn (input side) and bumped by attributeOutput under the same keys it books tokens to (output side). The parallel scanner is deleted; a key or bucket drift between counts and tokens is now structurally impossible.
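
The structural guarantee can be illustrated with a sketch in which tokens and counts are booked through a single call under a single key, with the size bucket chosen once from measured tokens. Everything here is hypothetical: the `Key`, `Entry`, and `Ledger` types, the `book` method, and the bucket thresholds are invented for the example and are not taken from the plugin.

```go
package main

import "fmt"

// Key identifies a usage item, e.g. {output, <lane>/<entity>}. Hypothetical shape.
type Key struct{ Side, Item string }

// Entry carries both numbers for one key, so they cannot be keyed differently.
type Entry struct{ Tokens, Count int64 }

// Ledger is the single place where tokens and counts are recorded.
type Ledger map[Key]*Entry

// book records tokens and event count under the same key in one call.
func (l Ledger) book(k Key, tokens, events int64) {
	e := l[k]
	if e == nil {
		e = &Entry{}
		l[k] = e
	}
	e.Tokens += tokens
	e.Count += events
}

// bucket sizes a prompt from measured tokens. Thresholds are made up.
func bucket(tokens int64) string {
	switch {
	case tokens < 1_000:
		return "small"
	case tokens < 10_000:
		return "medium"
	case tokens < 100_000:
		return "large"
	default:
		return "xlarge"
	}
}

func main() {
	l := Ledger{}
	tokens := int64(4_000_000) // illustrative size only
	// The bucket is computed once and used for both the tokens and the count.
	l.book(Key{Side: "input", Item: "prompt/" + bucket(tokens)}, tokens, 1)
	for k, e := range l {
		fmt.Printf("%s %s: count=%d tokens=%d\n", k.Side, k.Item, e.Count, e.Tokens)
	}
}
```

Because there is no second scanner choosing its own key or bucket, a row with a count but no tokens (or the reverse) cannot arise from key drift in this layout.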

Testing

TestBillingReconciliationFlag: healthy turn reconciles (consistent: true, all deltas 0); a usage-derived overshoot flips consistent: false with a positive input_delta
Dry-run on the real compacted session: reports consistent: false, input_delta: 666408, output_delta: 1004, which is exactly the synthetic-entry mass that #23 (fix: skip synthetic zero-usage assistant entries in billing) removes; once #23 lands, the same session reconciles to zero
Count derivation: full suite covers the skill-load count (counted once on the skills usage item) and the skill-body vs user-prompt split; output items now carry true per-block event counts read by the FE breakdown as calls
Full suite, gofmt, go vet clean; binaries rebuilt

🤖 Generated with Claude Code

…onsistent score Adds the monitoring half of the no-clamp design: every trace now carries cc.billing.reconciliation (Σ lanes minus API usage per tier column + consistent flag) and a token_count_consistent feedback score (1/0, with the per-column deltas in the reason when 0), so traces with token-count inconsistencies are filterable in the UI instead of latent. Also re-applies the exact-overshoot clamp removal: it was decided on PR #22 (commit f41d31c) but pushed after the merge picked up the branch head, so main still scaled usage-derived pieces to force Σ lanes == usage — which would have made the consistent flag a tautology. Usage-derived pieces are never rescaled; the discrepancy is the signal (it found the synthetic zero-usage entries fixed in #23: this flag reports input_delta=666408 on that session until #23 lands). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

jverre and others added 2 commits June 12, 2026 10:28

Merge remote-tracking branch 'origin/main' into jacques/OPIK-6873-rec…

09a8ba8

…onciliation-flag # Conflicts: # bin/opik-logger-darwin-amd64 # bin/opik-logger-darwin-arm64 # bin/opik-logger-linux-amd64 # bin/opik-logger-windows-amd64.exe

jverre merged commit 7436e92 into main Jun 12, 2026
1 check passed

jverre deleted the jacques/OPIK-6873-reconciliation-flag branch June 12, 2026 21:15
