Spec 24: context-window replay — reconstruct what the model 'saw' at time T #96

@0bserver07

Goal

For any moment in a session, reconstruct what was actually in the model's context window: system prompt, tool definitions, prior messages, file contents the model was looking at. The "blackbox flight recorder" angle — see where the model lost the thread.

Why now

Playback v2 (v0.7.3) reconstructs the filesystem. The next layer up is the agent's mental state. Together they let you answer "the model lost the original requirement at turn 7 because it never saw it after the file got truncated."

Schema

None. Pure read-side over messages.raw_json + a context-window estimator.

User-visible surface

  • API: GET /api/playback/{session_id}/context?at=<iso>&model=<id> returns:
    {
      "session_id": "...", "snapshot_ts": "...",
      "model_id": "claude-opus-4-7",
      "system_prompt": "...", "system_prompt_tokens": 1234,
      "tools_definition_tokens": 567,
      "prior_messages": [
        {"role": "user", "ts": "...", "content_excerpt": "...", "tokens_estimate": 234}
      ],
      "context_used_tokens": 123456,
      "context_max_tokens": 200000,
      "context_used_pct": 61.7,
      "warnings": ["truncation_likely_at_msg_42"]
    }
  • Meta-agent tool: get_context_at(session_id, at) — feed the meta-agent so it can answer "did Claude see the original spec when it made the change at minute 30?".
  • UI: extend PlaybackTab with a "Context" panel (toggle alongside the FS panel) showing tokens-used / tokens-max bar + collapsible message list at that point.
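
A minimal client-side sketch of the surface above: it builds the request URL for the context endpoint and renders the tokens-used / tokens-max figure the Context panel would show. The base URL and both helper names (`context_url`, `usage_summary`) are assumptions for illustration; only the path, the query parameters, and the payload keys come from the spec.

```python
from urllib.parse import quote, urlencode


def context_url(base: str, session_id: str, at: str, model: str) -> str:
    """Build the GET URL for the context endpoint (base is hypothetical)."""
    # urlencode percent-escapes the colons in the ISO timestamp.
    query = urlencode({"at": at, "model": model})
    return f"{base}/api/playback/{quote(session_id, safe='')}/context?{query}"


def usage_summary(payload: dict) -> str:
    """Format the used/max bar label from a response payload."""
    used = payload["context_used_tokens"]
    limit = payload["context_max_tokens"]
    pct = round(used / limit * 100, 1)
    return f"{used:,} / {limit:,} tokens ({pct}%)"


if __name__ == "__main__":
    print(context_url("http://localhost:8000", "abc",
                      "2025-01-01T00:00:00Z", "claude-opus-4-7"))
    print(usage_summary({"context_used_tokens": 123456,
                         "context_max_tokens": 200000}))
```

Recomputing the percentage on the client keeps the bar consistent with the two token counts even if `context_used_pct` is rounded differently server-side.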

Implementation plan

  1. New service stackunderflow/services/playback_context.py:
    • reconstruct_context_at(conn, session_id, *, at, model_id) -> dict.
    • Walks messages for the session in seq order up to at.
    • Estimates tokens per message (cl100k_base or Anthropic's tokenizer if available; fall back to chars/4).
    • Looks up the model's context-max from infra/providers/<provider>.py config.
    • Detects a "truncation event": once cumulative tokens exceed the model's context max, the earlier history was almost certainly compacted. Mark the message immediately before the next user turn after the breach.
  2. New route in routes/playback.py.
  3. Frontend Context panel.
  4. Meta-agent tool entry.
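
Step 1 of the plan might look roughly like the sketch below. Several things here are assumptions rather than facts about the codebase: `conn` is taken to be a `sqlite3` connection, the `messages` table is assumed to have `session_id`, `seq`, `ts`, `role` and `raw_json` columns, `raw_json` is assumed to carry a `content` field, timestamps are assumed to be ISO-8601 strings in one format (so string comparison matches time order), and `CONTEXT_MAX` stands in for the provider lookup. It uses only the chars/4 fallback, omits system prompt and tool-definition tokens, and reports only the first breach.

```python
import json
import sqlite3
from math import ceil

# Hypothetical stand-in for the context_max lookup the spec places in
# infra/providers/<provider>.py.
CONTEXT_MAX = {"claude-opus-4-7": 200_000}


def estimate_tokens(text: str) -> int:
    """chars/4 fallback from the spec, rounded up."""
    return ceil(len(text) / 4)


def reconstruct_context_at(conn: sqlite3.Connection, session_id: str,
                           *, at: str, model_id: str) -> dict:
    # Assumed schema: messages(session_id, seq, ts, role, raw_json).
    rows = conn.execute(
        "SELECT seq, ts, role, raw_json FROM messages "
        "WHERE session_id = ? AND ts <= ? ORDER BY seq",
        (session_id, at),
    ).fetchall()
    context_max = CONTEXT_MAX[model_id]
    prior, warnings = [], []
    used = 0
    breach_pending = breach_reported = False
    prev_seq = None
    for seq, ts, role, raw_json in rows:
        # Mark the message before the next user turn after the breach.
        if breach_pending and role == "user" and prev_seq is not None:
            warnings.append(f"truncation_likely_at_msg_{prev_seq}")
            breach_pending, breach_reported = False, True
        content = json.loads(raw_json).get("content", "")
        if not isinstance(content, str):
            content = json.dumps(content)  # e.g. a list of content blocks
        tokens = estimate_tokens(content)
        used += tokens
        prior.append({"role": role, "ts": ts,
                      "content_excerpt": content[:200],
                      "tokens_estimate": tokens})
        if used > context_max and not breach_reported:
            breach_pending = True
        prev_seq = seq
    return {
        "session_id": session_id, "snapshot_ts": at, "model_id": model_id,
        "prior_messages": prior,
        "context_used_tokens": used,
        "context_max_tokens": context_max,
        "context_used_pct": round(used / context_max * 100, 1),
        "warnings": warnings,
    }
```

The sketch does not model how much history survives a compaction, so `context_used_tokens` keeps growing past the limit; a real implementation would need to decide whether to reset the running total after a detected breach.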

Tests

  • 5-message synthetic session → assert per-message token estimates are sane (not zero, not 100x off).
  • A session with a known compaction (look for "context-truncated" or model-side compaction signals in raw_json) → assert the warning fires.
  • 404 on unknown session.
  • at cutoff is honoured (messages after at not included).
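
One possible reading of "sane (not zero, not 100x off)" from the first test, written as a small predicate a test could assert on. The helper name `is_sane_estimate` and the choice of chars/4 as the baseline are assumptions, not part of the spec.

```python
def is_sane_estimate(text: str, estimate: int, factor: float = 100.0) -> bool:
    """True when estimate is non-zero and within `factor`x of chars/4."""
    baseline = max(1, len(text) // 4)
    return estimate > 0 and baseline / factor <= estimate <= baseline * factor


if __name__ == "__main__":
    sample = "a" * 400                       # chars/4 baseline is 100
    print(is_sane_estimate(sample, 100))     # plausible estimate
    print(is_sane_estimate(sample, 0))       # zero is rejected
    print(is_sane_estimate(sample, 100_000)) # 1000x high is rejected
```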

Hard parts

  • Token-estimation is the load-bearing piece. Use tiktoken with cl100k_base for OpenAI-family, fall back to chars/4 for everyone else (acceptable v1; refine later). Add as an optional dep [tokenizer].
  • Model context limits vary (e.g., Opus 4.7 1M-context vs 200K). Look up via the existing infra/providers/<provider>.py:rates_for(model) family — extend to return context_max if not already there.
  • "What the model saw" includes tool definitions. Estimate ~500-1500 tokens per tool block (depends on schema verbosity). Pull from the meta-agent tool catalogue + the user's installed MCP tool list (see context_budget route — reuse).
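
The tokenizer fallback described in the first bullet could be sketched as follows. `tiktoken.get_encoding("cl100k_base")` and `Encoding.encode` are real tiktoken calls; the `OPENAI_PREFIXES` check for "OpenAI-family" and both function names are illustrative assumptions, and the real routing would presumably go through the provider config.

```python
import math

try:
    import tiktoken  # optional dependency, e.g. via the [tokenizer] extra
except ImportError:
    tiktoken = None

# Hypothetical way to recognise OpenAI-family model ids.
OPENAI_PREFIXES = ("gpt-",)


def estimate_tokens_fallback(text: str) -> int:
    """chars/4 estimate, rounded up; used for every non-OpenAI model."""
    return math.ceil(len(text) / 4)


def estimate_tokens(text: str, model_id: str) -> int:
    """Use cl100k_base when tiktoken is usable, otherwise chars/4."""
    if tiktoken is not None and model_id.startswith(OPENAI_PREFIXES):
        try:
            enc = tiktoken.get_encoding("cl100k_base")
            # disallowed_special=() treats special-token text as plain text
            # instead of raising.
            return len(enc.encode(text, disallowed_special=()))
        except Exception:
            # Encoding data may be unavailable (for example, offline on
            # first use); fall through to the cheap estimate.
            pass
    return estimate_tokens_fallback(text)


if __name__ == "__main__":
    print(estimate_tokens("hello world", "claude-opus-4-7"))
```

Keeping the import inside a `try` means the service still works when the optional extra is not installed, which matches the "acceptable v1; refine later" framing.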

Out of scope

  • Exact tokenizer parity with closed-source models (Anthropic's tokenizer isn't public).
  • Reconstructing the file contents the model saw inline (Playback v2's /fs already covers this; the Context panel just links).

Dependencies

  • Builds on Playback v2 (shipped).
  • Consumed by Spec 25 (fork mode — needs to recreate the context to re-prompt).

Estimated effort

Size L — single agent, ~2 hr.

Hard rules

  • DO NOT touch versions / CHANGELOG headings.
  • No schema migration.
  • New optional dep [tokenizer] (with tiktoken) is allowed.
  • Branch: feat/context-replay off main.

Labels

  • size-l: ~2 hr agent run
  • spec: Spec/feature for an agent to implement
  • wave-4: Wave 4: replay + active surfacing