Skip to content

Add provider-calibrated token accounting#1802

Open
Niko4417 wants to merge 2 commits into
devfrom
issue/1766-provider-calibrated-token-accounting
Open

Add provider-calibrated token accounting#1802
Niko4417 wants to merge 2 commits into
devfrom
issue/1766-provider-calibrated-token-accounting

Conversation

@Niko4417

@Niko4417 Niko4417 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds governed per-model calibrated token accounting for context budgeting and prompt assembly, while preserving the existing conservative fallback estimator and reporting the active accounting source in diagnostics.

Resolves #1766

Scope

  • In scope: model capability metadata for calibrated token accounting, context profile derivation, allocator/prompt/compaction token counting, evidence redaction, and context-quality coverage.
  • Out of scope: exact tokenizer dependency integration, model-written structured summaries, retrieval/indexing changes, and conversation compaction semantics beyond replacing the token-count source.

Reuse And No-Duplication

  • Existing Keiko functionality was inspected before implementation.
  • This PR reuses, extends, generalizes, or consolidates existing functionality where practical.
  • Any new implementation is limited to a documented capability gap in the linked issue.
  • This PR does not introduce a parallel workspace, graph, relationship, policy, evidence, memory, connector, workflow, or UI subsystem where an existing subsystem can be extended.
  • Refactoring or consolidation was considered when existing functionality was close but not shaped for this change.

Delivery Board

  • Linked issue is in the public Keiko Product Delivery project.
  • Linked issue has a valid Parent Epic: #1720 unless this PR updates an epic container directly.
  • Linked issue is attached as a GitHub sub-issue of its parent epic so it appears in the correct board swimlane.
  • Parent epic is on the board with Classification: Epic, Status: Open Epics, and priority/order set for top-to-bottom implementation sequencing.
  • Linked task/card issue is on the board with Classification: Task, Status: In Progress while active, inherited or explicit Priority, and Human Review Required: Yes.
  • Project Status is In Progress while work is active, or Done only after merge and closure evidence.
  • Project Workflow State is PR Open or Ready for Human Review.
  • Owner / Agent, Branch, Pull Request, and Human Review Required are filled.
  • Issue label reflects the current state: status: in progress, status: ready for human review, or status: done after merge.
  • Autonomous agents did not merge into dev, enable auto-merge, close the issue, or bypass human review unless explicitly authorized by the human maintainer.

Product Impact

  • UI or user workflow
  • CLI or developer workflow
  • Core generation engine
  • Evidence, audit, or compliance artifact
  • Security or supply chain
  • Packaging, release, or npm publication
  • Documentation or repository hygiene
  • No user-facing behavior change

Update Impact

  • Not release-impacting.
  • Release-impact catalog or issue metadata updated.
  • Release-note category: improvements.
  • Priority: high.
  • User-visible change: Prompt-budget behavior becomes more accurate for configured models while retaining safe fallback accounting.
  • Release-note bullet: Improves model-context budgeting with provider-calibrated token accounting and explicit fallback diagnostics.
  • Supported-from versions: next release after merge to dev.
  • Affected state stores: none.
  • User action required and remediation: none.
  • Internal-only rationale, if applicable: not internal-only.
  • Release-owner review evidence: issue Repo Context Audit: add provider-calibrated token accounting #1766 update-impact metadata.

Verification

Required:

  • Required GitHub checks pass before merge.
  • Local verification commands or rationale are listed below.
  • Reuse/extension/generalization evidence or gap rationale is listed below.

Local verification:

PATH="/Users/nikolaos.vasilopoulos/.cache/codex-runtimes/codex-primary-runtime/dependencies/node/bin:$PATH" .keiko-scripts/verify-receipt.sh 1766
- typecheck: pass
- check:version-consistency: pass
- lint: pass
- arch:check: pass
- arch:check:negative: pass
- check:qi-supply-chain: pass
- test: pass (913 files, 15,867 passed, 1 skipped)
- verify receipt written for 507f31fb

.keiko-scripts/audit-receipt.sh 1766 --findings 0 --user-facing false
- audit receipt written for 507f31fb

npm run check:context-quality
- calibrated-token-accounting=true

Reuse / gap rationale:

The implementation extends the existing ModelCapability, ContextProfile, context-budget allocator, chat prompt budgeting, conversation compaction, diagnostics, and evidence-redaction paths. It does not add a parallel tokenizer subsystem. Exact tokenizers remain out of scope until governed dependencies are approved; unsupported exact accounting is rejected rather than surfaced as evidence.

Select only what applies:

  • UI behavior manually verified or covered by tests.
  • CLI behavior verified with command output or tests.
  • Core logic covered by unit, integration, property, or fixture tests.
  • Security-sensitive change reviewed for trust boundaries, secrets, external calls, and generated artifacts.
  • Supply-chain or package-surface change verified with package, license, lockfile, SBOM, or npm dry-run checks.
  • Documentation or Markdown change verified by the repository link check or a targeted local equivalent.
  • Release-impacting change verified with npm run check:release-impact or an explicit rationale.
  • Not applicable items are explained below.

Not applicable rationale:

Review And Closure

  • The PR implements only the linked issue scope.
  • Actionable review findings are fixed or explicitly dispositioned.
  • Unresolved review threads are resolved before merge.
  • Checks are repeated after the latest pushed fix.
  • Issue acceptance criteria and closure evidence are updated only where evidence exists.
  • The linked issue records reuse, extension, generalization, or new-gap rationale before closure.
  • Delivery board status is updated before requesting final maintainer review.
  • Use Resolves #1766 only when this PR should close the issue.

Risk Notes

  • Exact tokenizer support is intentionally not exposed because no governed tokenizer dependency was added; configured support is calibrated-only plus explicit fallback.
  • tokenAccounting.counterId is content-free metadata, is trimmed/validated at config boundaries, and is redacted from persisted evidence.
  • The local full receipt had to be run with bundled Node v24.14.0 because the machine default Node v25.9.0 is rejected by dependency-cruiser; no repo code was changed for this environment issue.

@Niko4417 Niko4417 requested a review from oscharko as a code owner July 2, 2026 14:49
@Niko4417 Niko4417 force-pushed the issue/1766-provider-calibrated-token-accounting branch from 276b6e8 to bb83465 Compare July 2, 2026 14:55
@Niko4417 Niko4417 force-pushed the issue/1766-provider-calibrated-token-accounting branch from bb83465 to 507f31f Compare July 2, 2026 15:01
@Niko4417

Niko4417 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Issue-audit findings — PR #1802 (Resolves #1766) · HEAD 507f31fb

Read-first audit wave: pr-reviewer (correctness/regression) + security-auditor (config boundary, evidence redaction, supply chain). Findings verified against source before posting. CI is fully green at this commit (ci, ui, CodeQL, SBOM/smoke, cross-platform) and verify + audit receipts exist for 507f31fb.

Verdict: no merge blocker. One substantive diagnostics-attribution finding and a matching test-coverage gap worth addressing; the rest are minor. Feature is correct, deterministic, bounded, and well-tested against the acceptance criteria.

Acceptance-criteria check

  • ✅ Configured models can use calibrated token accounting instead of the bytes/3.5 fallback (governed ModelCapability.tokenAccounting, allow-listed strict parsing).
  • ✅ Conservative estimator remains and is explicitly reported (source: "fallback-estimated" default; diagnostics carry the source).
  • ✅ Prompt assembly stays deterministic and bounded across chat/document/memory/tool-observation inputs; per-segment ceil+offset calibration preserves the ADR-0052 gate-3 single-currency identity end-to-end (allocator + gateway paths verified).
  • ✅ No new dependency/tokenizer added — exact source is intentionally rejected until a governed tokenizer lands; supply-chain surface unchanged (no lockfile/manifest in the diff).

Note: the AC text mentions "exact or calibrated". Exact is explicitly deferred (documented in the PR + issue engineering notes). Calibrated + fallback satisfy the deliverable; flag it so the AC checkbox wording isn't read as "exact shipped".

Security — PASS (0 findings)

Every boundary this PR touches is closed: config parsing rejects unknown keys (own-enumerable Object.keys, __proto__ rejected), restricts source to calibrated, and enforces scaleMilli positive-int / offsetTokens non-negative-int / counterId non-empty-trimmed. counterId is content-free (traced to a label field only — never reaches readFile/import/fetch/new URL) and is redacted on both evidence paths (construction-time field redaction + whole-object deepRedactStrings). ConfigInvalidError scrubs echoed key names via the GatewayError redactor. No runtime dependency added.

Correctness / regression

Major — 1. Compacted-path history-lane token misattribution · packages/keiko-server/src/chat-prompt-budget-diagnostics.ts:114-127
estimateHistoryLaneTokens sums all history-message tokens and subtracts only the bare CONVERSATION_SYSTEM_PROMPT cost when messages[0].role === "system". On the compacted path messages[0].content is buildSystemScopedCompactionContent(systemMessage.content, summaryContent) (conversation-compaction.ts:141) — system prompt plus the compaction-summary block. So the summary block's tokens are attributed to the history-summary lane while system-contract stays pinned at the bare-prompt figure. The aggregate identity (Σ lanes === totalEstimatedTokens) is correct and budget enforcement is unaffected — this is a diagnostics-only per-lane attribution question. It is also a deliberate choice per the PR's own notes ("attribute system-scoped compaction summary tokens to the history-summary lane"), so it's a semantics debate rather than a defect: arguably the summary belongs to system-contract since it is physically embedded in the system message. Recommend deriving the system-lane cost from the actual embedded messages[0].content and history = total − thatActualSystemCost, which fixes attribution on both paths while keeping the aggregate identity. Not a blocker.

Major — 2. Test does not pin the compacted-path lane split · packages/keiko-server/src/chat-gateway-assembly.test.ts:263-293
The one compacted-path lane-diagnostics test asserts history > 0 and Σ lanes === total but never asserts the history-summary vs system-contract split — a single-line mutation swapping the attribution would pass. Add an assertion tying system-contract tokens to the real embedded messages[0].content. (Pairs with #1.)

Minor

  • chat-prompt-budget-diagnostics.ts is now 434 LOC, crossing the 400-LOC cap (was 387; this PR tips it over). Suggest extracting the token-summary helpers into a sibling file (mirrors the memory-validation.ts split precedent). context-engineering.ts (583) was already over-cap pre-PR and grew further.
  • context-engineering-validation.ts collectTokenAccountingNumbers accepts a source: "fallback-estimated" object that also carries scaleMilli/offsetTokens; countContextTokens silently ignores them. No current call site produces this, but a self-contradictory-yet-"valid" object is constructible — reject or document.
  • calibratedTokenCount (context-engineering.ts:487-491) Math.max(0, …) clamp is unreachable from validated input; add a one-line WHY (defends direct object-literal callers that bypass the validator).
  • redactTokenAccounting (new surface in context-assembly-redaction.ts) has no direct unit test (pre-existing zero-coverage file). A small test asserting counterId redacted / numeric fields pass through / undefined round-trips would close it.

Positives

Per-segment-then-sum calibration is the correct gate-3 design (avoids sum-then-scale divergence, verified end-to-end); the new evaluateCalibratedTokenAccounting context-quality gate independently recomputes per-lane totals and asserts calibrated ≠ fallback (mutation-robust); config helpers reuse existing primitives and the exactOptionalPropertyTypes conditional-spread pattern; assertWorkflowEligibleForKind de-dups two identical inline guards in scope.

Audit roles: pr-reviewer, security-auditor (a11y/design-system axis N/A — non-user-facing core-engine change). No confirmed merge blockers; findings above are follow-ups.

…counting (#1766)

- Split chat-prompt-budget-diagnostics.ts under the 400-LOC cap by extracting
  the token-summary helpers into chat-prompt-budget-token-summary.ts.
- Document the deliberate lane attribution: the compaction summary's tokens are
  charged to the history-summary lane (which owns the +1 summary item), while
  system-contract reports only the bare canonical system prompt.
- Pin the compacted-path lane split with a mutation-robust test.
- Reject stray scaleMilli/offsetTokens under source "fallback-estimated".
- Document the defensive clamp in calibratedTokenCount; add redactTokenAccounting
  and validator-rejection test coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@Niko4417

Niko4417 commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

Audit findings fixed — pushed 507f31fb..67510043

Fix wave from the audit above. Two disjoint implementor slices, then a clean full-verify at the new HEAD.

Correctness / #1 + #2 (history-lane attribution + test) — On tracing the full lane-construction path, the original attribution is correct by design, not a defect: the history-summary lane deliberately owns the compaction summary — it counts the summary as its +1 included item (buildHistorySummaryLane) and charges its tokens there, so each lane's item-count stays consistent with its token-count and Σ lanes == totalEstimatedTokens. Flipping the tokens to system-contract (my initial suggestion) would have broken that item/token consistency. So the fix is to lock the design in, not change it:

  • Added a WHY comment documenting the deliberate attribution (summary → history-summary; system-contract reports only the bare canonical system prompt).
  • Added a mutation-robust compacted-path test pinning the split: system-contract == countContextTokens(CONVERSATION_SYSTEM_PROMPT, accounting) (bare prompt only), history-summary tokens include the summary block with includedItems == retained + 1, and the lane sum equals the total.

Minor #3 (400-LOC cap) — Extracted the token-summary helpers into a new sibling chat-prompt-budget-token-summary.ts. chat-prompt-budget-diagnostics.ts is now 302 LOC (was 434), new module 151 LOC — both under cap. Mechanical, behavior-identical extraction.

Minor #4 (validator)context-engineering-validation.ts now rejects a source: "fallback-estimated" object carrying scaleMilli/offsetTokens (only meaningful for calibrated), reasons tokenAccounting.scaleMilli unexpected / tokenAccounting.offsetTokens unexpected. Test coverage added.

Minor #5 (WHY comment) — Documented the defensive Math.max(0, …) clamp in calibratedTokenCount (guards callers that build ContextTokenAccounting literals without the validator).

Minor #6 (redaction test) — Added a redactTokenAccounting test (via the persistConnectedContextEvidence entry point): counterId redacted, source/scaleMilli/offsetTokens pass through, tokenAccounting: undefined round-trips.

Verification at HEAD 67510043

  • .keiko-scripts/verify-receipt.sh 1766GREEN (913 files, 15,871 passed, 1 skipped) — verify receipt written.
  • .keiko-scripts/audit-receipt.sh 1766 --findings 0 --user-facing false → written (all confirmed findings resolved).
  • Security posture unchanged (validator only tightened; no new dependency). No behavior change to prompt assembly, budgeting, or determinism.

Files: chat-prompt-budget-diagnostics.ts, chat-prompt-budget-token-summary.ts (new), chat-gateway-assembly.test.ts, context-engineering-validation.ts, context-engineering.ts, context-engineering.test.ts, connected-context-evidence.test.ts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Repo Context Audit: add provider-calibrated token accounting

1 participant