
feat: flag token-count inconsistencies — cc.billing.reconciliation + feedback score#24

Merged
jverre merged 2 commits into main from jacques/OPIK-6873-reconciliation-flag
Jun 12, 2026

@jverre jverre commented Jun 12, 2026

Collaborator

Details

Makes token-count inconsistencies monitorable per trace instead of latent:

  • cc.billing.reconciliation (metadata, every trace): {input_delta, cache_read_delta, cache_creation_delta, output_delta, consistent} where each delta is Σ lanes minus API usage for that tier column. Healthy attribution reconciles to zero (the unattributed lane absorbs undershoot); a non-zero delta means the replay disagrees with what the API billed - usually a truncation mechanism we don't detect yet.
  • token_count_consistent feedback score (every trace with billing): 1 when all deltas are zero, 0 otherwise, with the per-column deltas in the score reason. Filter traces by this score in the UI, or alert when the average drops below 1.
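As a rough illustration of the two signals above, here is a minimal sketch of the per-column delta and the derived score. The type names, the function names `reconcile` and `score`, the lane names, and the reason format are all assumptions made for this example; only the metadata field names come from the description.

```go
package main

import "fmt"

// Tiers holds token counts for the four tier columns named in the PR.
type Tiers struct {
	Input, CacheRead, CacheCreation, Output int64
}

// Reconciliation mirrors the fields of cc.billing.reconciliation.
type Reconciliation struct {
	InputDelta         int64 `json:"input_delta"`
	CacheReadDelta     int64 `json:"cache_read_delta"`
	CacheCreationDelta int64 `json:"cache_creation_delta"`
	OutputDelta        int64 `json:"output_delta"`
	Consistent         bool  `json:"consistent"`
}

// reconcile (hypothetical name) computes, per column, the sum over all
// lanes minus what the API reported as usage.
func reconcile(lanes map[string]Tiers, usage Tiers) Reconciliation {
	var sum Tiers
	for _, l := range lanes {
		sum.Input += l.Input
		sum.CacheRead += l.CacheRead
		sum.CacheCreation += l.CacheCreation
		sum.Output += l.Output
	}
	r := Reconciliation{
		InputDelta:         sum.Input - usage.Input,
		CacheReadDelta:     sum.CacheRead - usage.CacheRead,
		CacheCreationDelta: sum.CacheCreation - usage.CacheCreation,
		OutputDelta:        sum.Output - usage.Output,
	}
	r.Consistent = r.InputDelta == 0 && r.CacheReadDelta == 0 &&
		r.CacheCreationDelta == 0 && r.OutputDelta == 0
	return r
}

// score (hypothetical name) maps a reconciliation to the 1/0 feedback
// score; the reason carries the deltas only when the score is 0.
func score(r Reconciliation) (float64, string) {
	if r.Consistent {
		return 1, ""
	}
	return 0, fmt.Sprintf(
		"input_delta=%d cache_read_delta=%d cache_creation_delta=%d output_delta=%d",
		r.InputDelta, r.CacheReadDelta, r.CacheCreationDelta, r.OutputDelta)
}

func main() {
	// Lane names are illustrative, not taken from the codebase.
	lanes := map[string]Tiers{
		"user_prompt":  {Input: 900},
		"tools":        {Input: 300, Output: 40},
		"unattributed": {Output: 10},
	}
	usage := Tiers{Input: 1000, Output: 50}

	r := reconcile(lanes, usage)
	value, reason := score(r)
	fmt.Println(r.Consistent, value, reason)
}
```

In this example the input column overshoots API usage, so the trace would be flagged while the output column still reconciles.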

This PR also re-applies the exact-overshoot clamp removal that was decided on #22 but never reached main: commit f41d31c was pushed to the branch after the merge had already picked up the earlier head. With the clamp still in main, usage-derived pieces were silently rescaled to force Σ lanes == usage, which would make this flag a tautology. The design is: usage-derived pieces are never rescaled, unattributed is the only reconciliation mechanism, and the discrepancy IS the signal. That discrepancy is how the synthetic zero-usage entries (#23) were found.
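The no-clamp rule can be sketched for a single tier column as follows. This is a simplified illustration, not the code in this PR: the function name `attribute` and the piece names are hypothetical, and the real attribution works on more structure than a flat map.

```go
package main

import "fmt"

// attribute (hypothetical name) books usage-derived pieces for one tier
// column against the API-billed total. Pieces are copied as-is and never
// rescaled. An undershoot is absorbed by the "unattributed" lane; an
// overshoot is left alone and returned as a non-zero delta.
func attribute(pieces map[string]int64, billed int64) (map[string]int64, int64) {
	lanes := make(map[string]int64, len(pieces)+1)
	var sum int64
	for name, tokens := range pieces {
		lanes[name] = tokens // no clamp, no rescale
		sum += tokens
	}
	if sum < billed {
		// Undershoot: the only reconciliation mechanism.
		lanes["unattributed"] = billed - sum
		sum = billed
	}
	return lanes, sum - billed
}

func main() {
	// Undershoot: unattributed absorbs the gap, delta is zero.
	lanes, delta := attribute(map[string]int64{"prompt": 700, "tools": 200}, 1000)
	fmt.Println(lanes, delta)

	// Overshoot: pieces keep their measured size, delta is the signal.
	lanes, delta = attribute(map[string]int64{"prompt": 900, "tools": 300}, 1000)
	fmt.Println(lanes, delta)
}
```

With a clamp in place, the second call would shrink the pieces until the delta was zero, which is why the flag would have carried no information.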

  • Unified count derivation (fixes "tokens without calls" in the AI Spend breakdowns): item counts were computed by a parallel scanner (countNewEvents) that derived keys and sizing independently of billing. Prompts were bucketed by the chars-ratio estimate while billing bucketed by measured tokens, so a prompt could be counted in large but billed in xlarge (observed in production data as large: count=1, tokens=0 next to xlarge: count=0, tokens=4M). Output items were keyed {output, <lane>/<entity>} while counts used input-side keys, so output items could never carry a count. Skill loads used a different detector than billing's SHA matching. Counts are now derived from the same conversationPieces layout billing re-bills every turn (input side) and bumped by attributeOutput under the same keys it books tokens to (output side). The parallel scanner is deleted; a key or bucket drift between counts and tokens is now structurally impossible.
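The idea of booking counts and tokens under one key can be sketched as below. Everything here is an assumption made for illustration: the `Key`, `Item`, and `Ledger` types, the `book` and `bucket` helpers, and the bucket thresholds are hypothetical, and the real code works on the conversationPieces layout and attributeOutput rather than a flat ledger.

```go
package main

import "fmt"

// Key identifies a usage item. Field names are illustrative.
type Key struct {
	Side   string // "input" or "output"
	Item   string // e.g. "prompt", "skills"
	Bucket string // size bucket derived from measured tokens
}

// Item carries tokens and the event count together.
type Item struct {
	Tokens int64
	Count  int
}

// Ledger maps each key to its single combined entry.
type Ledger map[Key]*Item

// book records tokens and events under the same key in one step, so a
// count can never land under a different key or bucket than its tokens.
// Passing events separately is an assumption of this sketch; it lets a
// caller add tokens without bumping the count.
func (l Ledger) book(k Key, tokens int64, events int) {
	it := l[k]
	if it == nil {
		it = &Item{}
		l[k] = it
	}
	it.Tokens += tokens
	it.Count += events
}

// bucket picks a size bucket from measured tokens only.
// The thresholds are made up for this example.
func bucket(tokens int64) string {
	switch {
	case tokens >= 1_000_000:
		return "xlarge"
	case tokens >= 100_000:
		return "large"
	default:
		return "small"
	}
}

func main() {
	l := Ledger{}
	const measured = 4_000_000
	k := Key{Side: "input", Item: "prompt", Bucket: bucket(measured)}
	l.book(k, measured, 1)
	fmt.Printf("%s: count=%d tokens=%d\n", k.Bucket, l[k].Count, l[k].Tokens)
}
```

Because the bucket is computed once from the measured size and shared by both fields, the split seen in production (a count in large next to tokens in xlarge) has no way to occur in this shape.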

Testing

  • TestBillingReconciliationFlag: healthy turn reconciles (consistent: true, all deltas 0); a usage-derived overshoot flips consistent: false with a positive input_delta
  • Dry-run on the real compacted session: reports consistent: false, input_delta: 666408, output_delta: 1004, which is exactly the synthetic-entry mass that #23 (fix: skip synthetic zero-usage assistant entries in billing) removes; once #23 lands, the same session reconciles to zero
  • Count derivation: full suite covers the skill-load count (counted once on the skills usage item) and the skill-body vs user-prompt split; output items now carry true per-block event counts read by the FE breakdown as calls
  • Full suite, gofmt, go vet clean; binaries rebuilt

🤖 Generated with Claude Code

jverre and others added 2 commits June 12, 2026 10:28
…onsistent score

Adds the monitoring half of the no-clamp design: every trace now carries
cc.billing.reconciliation (Σ lanes minus API usage per tier column +
consistent flag) and a token_count_consistent feedback score (1/0, with
the per-column deltas in the reason when 0), so traces with token-count
inconsistencies are filterable in the UI instead of latent.

Also re-applies the exact-overshoot clamp removal: it was decided on
PR #22 (commit f41d31c) but pushed after the merge picked up the
branch head, so main still scaled usage-derived pieces to force
Σ lanes == usage — which would have made the consistent flag a
tautology. Usage-derived pieces are never rescaled; the discrepancy is
the signal (it found the synthetic zero-usage entries fixed in #23:
this flag reports input_delta=666408 on that session until #23 lands).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…onciliation-flag

# Conflicts:
#	bin/opik-logger-darwin-amd64
#	bin/opik-logger-darwin-arm64
#	bin/opik-logger-linux-amd64
#	bin/opik-logger-windows-amd64.exe
@jverre jverre merged commit 7436e92 into main Jun 12, 2026
1 check passed
@jverre jverre deleted the jacques/OPIK-6873-reconciliation-flag branch June 12, 2026 21:15