feat: flag token-count inconsistencies — cc.billing.reconciliation + feedback score #24
Merged
…onsistent score Adds the monitoring half of the no-clamp design: every trace now carries cc.billing.reconciliation (Σ lanes minus API usage per tier column + consistent flag) and a token_count_consistent feedback score (1/0, with the per-column deltas in the reason when 0), so traces with token-count inconsistencies are filterable in the UI instead of latent. Also re-applies the exact-overshoot clamp removal: it was decided on PR #22 (commit f41d31c) but pushed after the merge picked up the branch head, so main still scaled usage-derived pieces to force Σ lanes == usage — which would have made the consistent flag a tautology. Usage-derived pieces are never rescaled; the discrepancy is the signal (it found the synthetic zero-usage entries fixed in #23: this flag reports input_delta=666408 on that session until #23 lands). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…onciliation-flag # Conflicts: # bin/opik-logger-darwin-amd64 # bin/opik-logger-darwin-arm64 # bin/opik-logger-linux-amd64 # bin/opik-logger-windows-amd64.exe
Details
Makes token-count inconsistencies monitorable per trace instead of latent:
- cc.billing.reconciliation (metadata, every trace): {input_delta, cache_read_delta, cache_creation_delta, output_delta, consistent}, where each delta is Σ lanes minus API usage for that tier column. Healthy attribution reconciles to zero (the unattributed lane absorbs undershoot); a non-zero delta means the replay disagrees with what the API billed - usually a truncation mechanism we don't detect yet.
- token_count_consistent feedback score (every trace with billing): 1 when all deltas are zero, 0 otherwise, with the per-column deltas in the score reason. Filter traces by this score in the UI / alert on the average dropping below 1.

This PR also re-applies the exact-overshoot clamp removal that was decided on #22 but missed its merge (commit f41d31c was pushed to the branch after the merge picked up the earlier head). With the clamp still in main, usage-derived pieces were silently rescaled to force Σ lanes == usage, which would make this flag a tautology. The design is: usage-derived pieces are never rescaled, unattributed is the only reconciliation mechanism, and the discrepancy IS the signal - it's how the synthetic zero-usage entries (#23) were found.

… countNewEvents) that derived keys and sizing independently of billing. Prompts were bucketed by the chars-ratio estimate while billing bucketed by measured tokens, so a prompt could be counted in large but billed in xlarge (observed in production data as large: count=1, tokens=0 next to xlarge: count=0, tokens=4M). Output items were keyed {output, <lane>/<entity>} while counts used input-side keys, so output items could never carry a count. Skill loads used a different detector than billing's SHA matching. Counts are now derived from the same conversationPieces layout billing re-bills every turn (input side) and bumped by attributeOutput under the same keys it books tokens to (output side). The parallel scanner is deleted; a key or bucket drift between counts and tokens is now structurally impossible.

Testing
- TestBillingReconciliationFlag: healthy turn reconciles (consistent: true, all deltas 0); a usage-derived overshoot flips consistent: false with a positive input_delta
- … consistent: false, input_delta: 666408, output_delta: 1004 - exactly the synthetic-entry mass that #23 (fix: skip synthetic zero-usage assistant entries in billing) removes; once #23 lands the same session reconciles to zero
- gofmt, go vet clean; binaries rebuilt

🤖 Generated with Claude Code
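
For readers who want to see the arithmetic behind the reconciliation described above, here is a minimal Go sketch. It is not the code from this PR: the `Usage` and `Reconciliation` types, the `reconcile` and `score` functions, and the lane names are all hypothetical, chosen only to mirror the field names in the description (`input_delta`, `cache_read_delta`, `cache_creation_delta`, `output_delta`, `consistent`). It assumes each lane carries a token count per tier column and that the flag is simply "all four deltas are zero".

```go
package main

import "fmt"

// Usage holds token counts per tier column.
// Hypothetical type; the real struct in the repository may differ.
type Usage struct {
	Input, CacheRead, CacheCreation, Output int64
}

// Reconciliation mirrors the metadata shape named in the description.
type Reconciliation struct {
	InputDelta         int64 `json:"input_delta"`
	CacheReadDelta     int64 `json:"cache_read_delta"`
	CacheCreationDelta int64 `json:"cache_creation_delta"`
	OutputDelta        int64 `json:"output_delta"`
	Consistent         bool  `json:"consistent"`
}

// reconcile computes sum-of-lanes minus API usage for each column.
// Lanes are never rescaled here: a non-zero delta is reported, not hidden.
func reconcile(lanes map[string]Usage, api Usage) Reconciliation {
	var sum Usage
	for _, l := range lanes {
		sum.Input += l.Input
		sum.CacheRead += l.CacheRead
		sum.CacheCreation += l.CacheCreation
		sum.Output += l.Output
	}
	r := Reconciliation{
		InputDelta:         sum.Input - api.Input,
		CacheReadDelta:     sum.CacheRead - api.CacheRead,
		CacheCreationDelta: sum.CacheCreation - api.CacheCreation,
		OutputDelta:        sum.Output - api.Output,
	}
	r.Consistent = r.InputDelta == 0 && r.CacheReadDelta == 0 &&
		r.CacheCreationDelta == 0 && r.OutputDelta == 0
	return r
}

// score maps a reconciliation to a 1/0 feedback value; when the value is 0
// the reason carries the per-column deltas.
func score(r Reconciliation) (float64, string) {
	if r.Consistent {
		return 1, ""
	}
	return 0, fmt.Sprintf(
		"input_delta=%d cache_read_delta=%d cache_creation_delta=%d output_delta=%d",
		r.InputDelta, r.CacheReadDelta, r.CacheCreationDelta, r.OutputDelta)
}

func main() {
	// Illustrative lanes: attribution overshoots the API's input count by 20.
	lanes := map[string]Usage{
		"prompt":      {Input: 100},
		"tool_result": {Input: 40, Output: 5},
	}
	api := Usage{Input: 120, Output: 5}

	r := reconcile(lanes, api)
	value, reason := score(r)
	fmt.Println(r.Consistent, value, reason)
}
```

In this sketch an undershoot would need an explicit `unattributed` lane entry to bring the sum back to the API figure, while an overshoot has nothing to absorb it and so surfaces as a positive delta, which matches the "discrepancy is the signal" intent stated in the description.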