
Add proactive context window / token budget management #164

@rockfordlhotka

Description

Problem

RockBot currently tracks token usage after LLM requests complete (via LlmUsage in LlmResponse), but has no way to estimate token consumption before sending a request. This means the system can't proactively manage the context window — it only discovers it exceeded the budget when the LLM returns a "length" finish reason, at which point information has already been lost.

Tools like Claude Code and GitHub Copilot solve this by using client-side tokenizers to estimate consumption before each request, enabling smart context management strategies.

Proposed Solution

1. ITokenEstimator Abstraction

A provider-agnostic interface for estimating token counts before sending requests:

public interface ITokenEstimator
{
    /// <summary>Estimate token count for a single message.</summary>
    int EstimateTokens(LlmChatMessage message);

    /// <summary>Estimate token count for a full conversation + tools.</summary>
    TokenEstimate EstimateRequest(IReadOnlyList<LlmChatMessage> messages, IReadOnlyList<LlmToolDefinition>? tools = null);
}

public sealed record TokenEstimate(
    int SystemPromptTokens,
    int ConversationTokens,
    int ToolDefinitionTokens,
    int TotalTokens);

Implementations:

  • TiktokenEstimator — Uses Microsoft.ML.Tokenizers (covers OpenAI models, good general-purpose BPE approximation)
  • AnthropicCountTokensEstimator — Calls Anthropic's /v1/messages/count_tokens API for exact counts
  • CharacterHeuristicEstimator — Simple chars÷4 fallback for unknown models
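
A minimal sketch of the chars÷4 fallback estimator is below. The LlmChatMessage and LlmToolDefinition shapes are placeholders for the real RockBot types (assumed here to expose role/content and name/description/schema text), and the per-message overhead constant is an assumption:

using System;
using System.Collections.Generic;
using System.Linq;

// Placeholder shapes for illustration; the real types live in RockBot.
public sealed record LlmChatMessage(string Role, string Content);
public sealed record LlmToolDefinition(string Name, string Description, string ParametersJson);

public sealed class CharacterHeuristicEstimator : ITokenEstimator
{
    private const int CharsPerToken = 4;    // rough average for English BPE tokenizers
    private const int MessageOverhead = 4;  // role markers and separators (assumption)

    public int EstimateTokens(LlmChatMessage message) =>
        (message.Content.Length / CharsPerToken) + MessageOverhead;

    public TokenEstimate EstimateRequest(
        IReadOnlyList<LlmChatMessage> messages,
        IReadOnlyList<LlmToolDefinition>? tools = null)
    {
        int system = 0, conversation = 0;
        foreach (var message in messages)
        {
            var tokens = EstimateTokens(message);
            if (message.Role == "system") system += tokens;
            else conversation += tokens;
        }

        // Tool schemas are serialized into the prompt, so they count against the budget too.
        int toolTokens = tools?.Sum(t =>
            (t.Name.Length + t.Description.Length + t.ParametersJson.Length) / CharsPerToken) ?? 0;

        return new TokenEstimate(system, conversation, toolTokens,
            system + conversation + toolTokens);
    }
}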

2. Model Capability Registry

A registry of context window sizes and model metadata so the orchestrator knows its budget:

public interface IModelCapabilityRegistry
{
    ModelCapabilities GetCapabilities(string modelId);
}

public sealed record ModelCapabilities(
    string ModelId,
    int ContextWindowTokens,      // e.g., 200_000 for Claude Sonnet
    int MaxOutputTokens,          // e.g., 8_192
    int EffectiveInputBudget);    // ContextWindow - MaxOutput - safety margin

Could be populated from configuration, or from provider APIs that expose model metadata.
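
A configuration-backed sketch (the class name and fallback numbers are illustrative, not part of the proposal):

using System;
using System.Collections.Generic;
using System.Linq;

public sealed class ConfiguredModelCapabilityRegistry : IModelCapabilityRegistry
{
    // Conservative fallback for models missing from configuration.
    private static readonly ModelCapabilities Unknown =
        new("unknown", ContextWindowTokens: 32_000, MaxOutputTokens: 4_096,
            EffectiveInputBudget: 32_000 - 4_096 - 2_000);

    private readonly IReadOnlyDictionary<string, ModelCapabilities> _models;

    public ConfiguredModelCapabilityRegistry(IEnumerable<ModelCapabilities> configured) =>
        _models = configured.ToDictionary(m => m.ModelId, StringComparer.OrdinalIgnoreCase);

    public ModelCapabilities GetCapabilities(string modelId) =>
        _models.TryGetValue(modelId, out var caps) ? caps : Unknown with { ModelId = modelId };
}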

3. Context Budget Tracker

Middleware or service that tracks running token estimates through the conversation lifecycle:

public interface IContextBudgetTracker
{
    /// <summary>Current estimated usage vs. budget.</summary>
    ContextBudgetStatus GetStatus(string sessionId);
    
    /// <summary>Event raised when usage crosses a threshold.</summary>
    event Action<ContextBudgetAlert> OnBudgetAlert;
}

public sealed record ContextBudgetStatus(
    int EstimatedTokensUsed,
    int BudgetTokens,
    double UtilizationPercent);

public sealed record ContextBudgetAlert(
    string SessionId,
    double UtilizationPercent,    // e.g., 0.70, 0.90
    AlertLevel Level);            // Warning, Critical
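
A sketch of an in-memory implementation. Record is a hypothetical mutator (not part of the interface above) that the orchestrator would call after estimating each turn; the 70%/90% thresholds mirror the example values in ContextBudgetAlert:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

public sealed class InMemoryContextBudgetTracker : IContextBudgetTracker
{
    private readonly ConcurrentDictionary<string, int> _estimated = new();
    private readonly int _budgetTokens;

    public InMemoryContextBudgetTracker(int budgetTokens) => _budgetTokens = budgetTokens;

    public event Action<ContextBudgetAlert>? OnBudgetAlert;

    public ContextBudgetStatus GetStatus(string sessionId)
    {
        var used = _estimated.GetValueOrDefault(sessionId);
        return new ContextBudgetStatus(used, _budgetTokens, (double)used / _budgetTokens);
    }

    // Hypothetical: called by the orchestrator after each turn is estimated.
    public void Record(string sessionId, int estimatedTokens)
    {
        var before = (double)_estimated.GetValueOrDefault(sessionId) / _budgetTokens;
        var used = _estimated.AddOrUpdate(sessionId, estimatedTokens, (_, prev) => prev + estimatedTokens);
        var after = (double)used / _budgetTokens;

        // Alert only when a threshold is first crossed, not on every turn above it.
        if (before < 0.90 && after >= 0.90)
            OnBudgetAlert?.Invoke(new ContextBudgetAlert(sessionId, after, AlertLevel.Critical));
        else if (before < 0.70 && after >= 0.70)
            OnBudgetAlert?.Invoke(new ContextBudgetAlert(sessionId, after, AlertLevel.Warning));
    }
}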

4. Proactive Summarization Trigger

When the budget tracker hits a configurable threshold (e.g., 70%), automatically summarize older conversation turns before the next LLM call, rather than waiting for a "length" finish reason, by which point context has already been truncated.

This builds on the existing "sliding window with summarization" design decision from open-questions.md, making the summarization trigger proactive rather than reactive.
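
A sketch of where the check sits in the request path; PrepareAsync and the summarizeOldestTurns delegate are hypothetical hooks, not existing RockBot members:

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ProactiveContextGuard
{
    // Illustrative threshold; in practice this would come from configuration.
    private const double SummarizeThreshold = 0.70;

    public static async Task<IReadOnlyList<LlmChatMessage>> PrepareAsync(
        IReadOnlyList<LlmChatMessage> messages,
        ITokenEstimator estimator,
        ModelCapabilities capabilities,
        Func<IReadOnlyList<LlmChatMessage>, CancellationToken,
            Task<IReadOnlyList<LlmChatMessage>>> summarizeOldestTurns,
        CancellationToken ct)
    {
        var estimate = estimator.EstimateRequest(messages);
        var utilization = (double)estimate.TotalTokens / capabilities.EffectiveInputBudget;

        // Compress older turns *before* the request goes out, so the provider
        // never truncates context silently.
        if (utilization >= SummarizeThreshold)
            messages = await summarizeOldestTurns(messages, ct);

        return messages;
    }
}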

How Token Counting Works (Background)

Three mechanisms, typically used in combination:

| Mechanism | Timing | Accuracy | Cost |
|-----------|--------|----------|------|
| API response usage | After request | Exact | Free (included in response) |
| Client-side tokenizer (tiktoken, Microsoft.ML.Tokenizers) | Before request | ~95-99% accurate | Free (local computation) |
| Count-tokens API (Anthropic, Google) | Before request | Exact | Small API cost, no inference |
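
For example, the client-side mechanism with Microsoft.ML.Tokenizers looks roughly like this (API shape per the package's 1.0 release; worth verifying against the version in use):

using System;
using Microsoft.ML.Tokenizers;

// Local BPE count before the request goes out; no API call involved.
var tokenizer = TiktokenTokenizer.CreateForModel("gpt-4o");
int count = tokenizer.CountTokens("How many tokens will this prompt use?");
Console.WriteLine($"Estimated prompt tokens: {count}");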

Integration Points

  • IConversationMemory — Budget tracker reads conversation history to estimate current usage
  • UserMessageHandler — Checks budget before building LLM request; triggers summarization if needed
  • ChunkingAIFunction — Already manages large tool results; a token estimator could make its fixed 16K-character threshold token-aware (see the sketch after this list)
  • ILlmClient — Could expose model capabilities alongside chat completions
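
For the chunking case, the change could be as small as this (the helper name and the 4,000-token cap are hypothetical):

// Token-aware replacement for a fixed character threshold.
public static class ToolResultBudget
{
    public static bool NeedsChunking(
        string toolResult, ITokenEstimator estimator, int maxResultTokens = 4_000) =>
        estimator.EstimateTokens(new LlmChatMessage("tool", toolResult)) > maxResultTokens;
}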

Design Considerations

  • Token estimation should be optional — the system should degrade gracefully if no estimator is configured (fall back to character heuristics or skip proactive management)
  • Different providers have different tokenizers — the estimator should be selected per model or per provider, based on the model in use (see the selector sketch after this list)
  • Estimation is inherently approximate for most approaches — design for budgets with safety margins, not exact counts
  • Microsoft.ML.Tokenizers is the canonical .NET tokenizer library and supports OpenAI models; evaluate whether it covers enough models or if provider-specific APIs are needed
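
A sketch of per-model selection (prefix matching on model ids is a simplification; a real implementation might key off provider metadata instead):

using System;

public sealed class TokenEstimatorSelector
{
    private readonly ITokenEstimator _tiktoken;
    private readonly ITokenEstimator _anthropic;
    private readonly ITokenEstimator _fallback = new CharacterHeuristicEstimator();

    public TokenEstimatorSelector(ITokenEstimator tiktoken, ITokenEstimator anthropic)
        => (_tiktoken, _anthropic) = (tiktoken, anthropic);

    public ITokenEstimator For(string modelId) => modelId switch
    {
        var m when m.StartsWith("gpt-", StringComparison.OrdinalIgnoreCase) => _tiktoken,
        var m when m.StartsWith("claude-", StringComparison.OrdinalIgnoreCase) => _anthropic,
        _ => _fallback
    };
}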

Labels: enhancement (New feature or request), wontfix (This will not be worked on)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions