Problem
RockBot currently tracks token usage after LLM requests complete (via `LlmUsage` in `LlmResponse`), but has no way to estimate token consumption before sending a request. This means the system can't proactively manage the context window — it only discovers it exceeded the budget when the LLM returns a `"length"` finish reason, at which point information has already been lost.
Tools like Claude Code and GitHub Copilot solve this by using client-side tokenizers to estimate consumption before each request, enabling smart context management strategies.
Proposed Solution
1. ITokenEstimator Abstraction
A provider-agnostic interface for estimating token counts before sending requests:
```csharp
public interface ITokenEstimator
{
    /// <summary>Estimate token count for a single message.</summary>
    int EstimateTokens(LlmChatMessage message);

    /// <summary>Estimate token count for a full conversation + tools.</summary>
    TokenEstimate EstimateRequest(
        IReadOnlyList<LlmChatMessage> messages,
        IReadOnlyList<LlmToolDefinition>? tools = null);
}

public sealed record TokenEstimate(
    int SystemPromptTokens,
    int ConversationTokens,
    int ToolDefinitionTokens,
    int TotalTokens);
```
Implementations:

- `TiktokenEstimator` — uses `Microsoft.ML.Tokenizers` (covers OpenAI models; a good general-purpose BPE approximation)
- `AnthropicCountTokensEstimator` — calls Anthropic's `/v1/messages/count_tokens` API for exact counts
- `CharacterHeuristicEstimator` — simple chars ÷ 4 fallback for unknown models
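The fallback estimator is simple enough to sketch. A minimal version, assuming a trimmed-down `LlmChatMessage` shape and an illustrative per-message overhead constant (both are assumptions, not part of the existing codebase):

```csharp
using System.Collections.Generic;

// Minimal stand-ins for the domain types referenced above (assumed shapes).
public sealed record LlmChatMessage(string Role, string Content);

public sealed record TokenEstimate(
    int SystemPromptTokens,
    int ConversationTokens,
    int ToolDefinitionTokens,
    int TotalTokens);

/// <summary>Chars ÷ 4 fallback for models without a known tokenizer.</summary>
public sealed class CharacterHeuristicEstimator
{
    private const int CharsPerToken = 4;      // rough BPE average for English text
    private const int PerMessageOverhead = 4; // assumed allowance for role/formatting tokens

    public int EstimateTokens(LlmChatMessage message) =>
        (message.Content.Length + CharsPerToken - 1) / CharsPerToken + PerMessageOverhead;

    public TokenEstimate EstimateRequest(IReadOnlyList<LlmChatMessage> messages)
    {
        int system = 0, conversation = 0;
        foreach (var m in messages)
        {
            if (m.Role == "system") system += EstimateTokens(m);
            else conversation += EstimateTokens(m);
        }
        return new TokenEstimate(system, conversation, 0, system + conversation);
    }
}
```

The ceiling division errs on the side of overestimating, which is the right bias when the estimate feeds a budget check.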
2. Model Capability Registry
A registry of context window sizes and model metadata so the orchestrator knows its budget:
```csharp
public interface IModelCapabilityRegistry
{
    ModelCapabilities GetCapabilities(string modelId);
}

public sealed record ModelCapabilities(
    string ModelId,
    int ContextWindowTokens,  // e.g., 200_000 for Claude Sonnet
    int MaxOutputTokens,      // e.g., 8_192
    int EffectiveInputBudget); // ContextWindow - MaxOutput - safety margin
```
Could be populated from configuration, or from provider APIs that expose model metadata.
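A configuration-backed implementation could be as small as a dictionary keyed by model ID. This sketch derives `EffectiveInputBudget` from the other two fields using an assumed fixed safety margin; the class name and margin value are illustrative:

```csharp
using System;
using System.Collections.Generic;

public sealed record ModelCapabilities(
    string ModelId,
    int ContextWindowTokens,
    int MaxOutputTokens,
    int EffectiveInputBudget);

/// <summary>Dictionary-backed registry; entries would come from configuration in practice.</summary>
public sealed class ConfiguredModelCapabilityRegistry
{
    private const int SafetyMarginTokens = 2_000; // assumed margin, tune per deployment
    private readonly Dictionary<string, ModelCapabilities> _models = new();

    public void Register(string modelId, int contextWindowTokens, int maxOutputTokens) =>
        _models[modelId] = new ModelCapabilities(
            modelId,
            contextWindowTokens,
            maxOutputTokens,
            contextWindowTokens - maxOutputTokens - SafetyMarginTokens);

    public ModelCapabilities GetCapabilities(string modelId) =>
        _models.TryGetValue(modelId, out var caps)
            ? caps
            : throw new KeyNotFoundException($"No capabilities registered for '{modelId}'.");
}
```

Throwing on unknown models surfaces misconfiguration early; an alternative is returning a conservative default and logging a warning.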
3. Context Budget Tracker
Middleware or service that tracks running token estimates through the conversation lifecycle:
```csharp
public interface IContextBudgetTracker
{
    /// <summary>Current estimated usage vs. budget.</summary>
    ContextBudgetStatus GetStatus(string sessionId);

    /// <summary>Event raised when usage crosses a threshold.</summary>
    event Action<ContextBudgetAlert> OnBudgetAlert;
}

public sealed record ContextBudgetStatus(
    int EstimatedTokensUsed,
    int BudgetTokens,
    double UtilizationPercent);

public sealed record ContextBudgetAlert(
    string SessionId,
    double UtilizationPercent, // e.g., 0.70, 0.90
    AlertLevel Level);         // Warning, Critical
```
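A minimal in-memory sketch of this contract, assuming a per-tracker budget and the 0.70/0.90 thresholds mentioned above (both are illustrative choices, not fixed design points):

```csharp
using System;
using System.Collections.Generic;

public enum AlertLevel { Warning, Critical }

public sealed record ContextBudgetStatus(int EstimatedTokensUsed, int BudgetTokens, double UtilizationPercent);
public sealed record ContextBudgetAlert(string SessionId, double UtilizationPercent, AlertLevel Level);

/// <summary>In-memory tracker; thresholds here are illustrative defaults.</summary>
public sealed class InMemoryContextBudgetTracker
{
    private readonly int _budgetTokens;
    private readonly Dictionary<string, int> _usage = new();

    public event Action<ContextBudgetAlert>? OnBudgetAlert;

    public InMemoryContextBudgetTracker(int budgetTokens) => _budgetTokens = budgetTokens;

    /// <summary>Record an updated estimate for a session; raise an alert if a threshold is crossed.</summary>
    public void Record(string sessionId, int estimatedTokens)
    {
        _usage[sessionId] = estimatedTokens;
        double utilization = (double)estimatedTokens / _budgetTokens;
        if (utilization >= 0.90)
            OnBudgetAlert?.Invoke(new ContextBudgetAlert(sessionId, utilization, AlertLevel.Critical));
        else if (utilization >= 0.70)
            OnBudgetAlert?.Invoke(new ContextBudgetAlert(sessionId, utilization, AlertLevel.Warning));
    }

    public ContextBudgetStatus GetStatus(string sessionId)
    {
        int used = _usage.TryGetValue(sessionId, out var tokens) ? tokens : 0;
        return new ContextBudgetStatus(used, _budgetTokens, (double)used / _budgetTokens);
    }
}
```

A production version would also need to debounce repeated alerts at the same level so a session hovering at 71% doesn't fire a warning on every turn.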
4. Proactive Summarization Trigger
When the budget tracker hits a configurable threshold (e.g., 70%), automatically summarize older conversation turns before the next LLM call, rather than waiting for a `"length"` finish reason, by which point context has already been truncated.
This builds on the existing "sliding window with summarization" design decision from open-questions.md, making the summarization trigger proactive rather than reactive.
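The trigger logic sits just before request construction. This sketch assumes the estimator from section 1; the threshold, the "keep the most recent N turns" policy, and the placeholder summary (a real implementation would produce it via an LLM call) are all illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record LlmChatMessage(string Role, string Content);

/// <summary>
/// Before each LLM call: if the estimated conversation size crosses the threshold,
/// fold the oldest turns into a summary message. Policy values are assumptions.
/// </summary>
public static class ProactiveSummarizer
{
    public static List<LlmChatMessage> EnsureWithinBudget(
        List<LlmChatMessage> history,
        Func<LlmChatMessage, int> estimateTokens,
        int budgetTokens,
        double threshold = 0.70,
        int keepRecentTurns = 4)
    {
        int estimated = history.Sum(estimateTokens);
        if (estimated < budgetTokens * threshold || history.Count <= keepRecentTurns)
            return history; // under budget: send as-is

        // Summarize everything except the most recent turns. The summary text is a
        // placeholder; a real implementation would ask the LLM to produce it.
        var older = history.Take(history.Count - keepRecentTurns).ToList();
        var summary = new LlmChatMessage("system", $"[Summary of {older.Count} earlier messages]");

        var compacted = new List<LlmChatMessage> { summary };
        compacted.AddRange(history.Skip(history.Count - keepRecentTurns));
        return compacted;
    }
}
```

Because the check runs before the request is built, the full history is still available when the summary is produced — which is exactly the information a reactive (post-truncation) approach has already lost.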
How Token Counting Works (Background)
Three mechanisms, typically used in combination:
| Mechanism | Timing | Accuracy | Cost |
|---|---|---|---|
| API response `usage` | After request | Exact | Free (included in response) |
| Client-side tokenizer (tiktoken, `Microsoft.ML.Tokenizers`) | Before request | ~95–99% accurate | Free (local computation) |
| Count-tokens API (Anthropic, Google) | Before request | Exact | Small API cost, no inference |
Integration Points
- `IConversationMemory` — budget tracker reads conversation history to estimate current usage
- `UserMessageHandler` — checks budget before building the LLM request; triggers summarization if needed
- `ChunkingAIFunction` — already manages large tool results; the token estimator could make the 16K character threshold token-aware
- `ILlmClient` — could expose model capabilities alongside chat completions
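For the `ChunkingAIFunction` point: at the chars ÷ 4 heuristic, the current 16K character threshold corresponds to roughly 4K tokens, so a token-aware check could look like this (names and the default limit are illustrative, not existing code):

```csharp
using System;

/// <summary>Sketch of a token-aware replacement for the 16K character threshold.</summary>
public static class ToolResultChunking
{
    public static bool NeedsChunking(
        string toolResult,
        Func<string, int> countTokens,   // any estimator: tokenizer-backed or heuristic
        int maxResultTokens = 4_000)     // ~16K chars at 4 chars/token (assumed default)
        => countTokens(toolResult) > maxResultTokens;
}
```

The benefit over a character count is that dense content (code, JSON, non-English text) tokenizes at very different ratios, so a token-based limit tracks the actual context cost.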
Design Considerations
- Token estimation should be optional — the system should degrade gracefully if no estimator is configured (fall back to character heuristics or skip proactive management)
- Different providers have different tokenizers — the estimator should be per-model or per-provider, selected based on the model being used
- Estimation is inherently approximate for most approaches — design for budgets with safety margins, not exact counts
- `Microsoft.ML.Tokenizers` is the canonical .NET tokenizer library and supports OpenAI models; evaluate whether it covers enough models or whether provider-specific APIs are needed