
feat(agents): add prompt-compaction middleware for McpClient #2055

Open
Mgczacki wants to merge 7 commits into dimensionalOS:main from Mgczacki:feat/agent-prompt-compaction

Conversation


@Mgczacki Mgczacki commented May 12, 2026

Summary

Closes #1899

Caps the prompt the dimos agent sends to its LLM so the conversation history
never grows unbounded. Implemented as a langchain AgentMiddleware plugged into
create_agent(middleware=...). Because the hook (before_model) fires before
every model invocation, the input-size bound becomes an invariant of the agent
loop — including intra-turn re-invocations (model → tool → tool result → model).

On long sessions the middleware quietly summarizes older turns once it detects
an oversized prompt. Behavior is unchanged for short sessions.

Concepts

dimos_turn

A new integer tag attached to each message's additional_kwargs dict.
Incremented once per McpClient._process_message call — that is, once per
user-facing turn (a human input from agent-send, or a tool-stream
notification that wakes the agent). Every message that flows through during
that turn — the input HumanMessage, intermediate AIMessages with
tool_calls, the resulting ToolMessages, the final AIMessage — all get
stamped with the same turn number.

This is what lets compaction:

  1. Group messages by turn so tool_call/tool_response pairs always travel
    together (compaction selects entire turns, never partial ones — no orphan
    tool_call_id references).
  2. Identify the current turn (the latest tag value plus any trailing
    untagged in-flight messages from the agent loop) and preserve it untouched
    regardless of threshold.
  3. Score / inspect the history per-turn for future heuristics (e.g.,
    keep-N-most-recent strategies).

dimos_turn is metadata only — it lives in additional_kwargs, which
providers ignore but langchain serialization preserves. The compaction
summary itself is tagged with the max turn it covers (plus
dimos_compacted: True), so re-compaction folds the prior summary into the
next one cleanly.
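The tagging and grouping mechanics described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `Msg` is a stand-in for a langchain message (only the `additional_kwargs` field matters here), and the helper names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Msg:
    """Minimal stand-in for a langchain message carrying additional_kwargs."""
    content: str
    additional_kwargs: dict = field(default_factory=dict)


def tag_turn(msg: Msg, turn: int) -> Msg:
    """Stamp the turn number into provider-ignored metadata."""
    msg.additional_kwargs["dimos_turn"] = turn
    return msg


def group_by_turn(messages: list[Msg]) -> dict:
    """Group messages by turn so tool_call/tool_response pairs from the
    same turn always travel together through compaction."""
    groups: dict = {}
    for m in messages:
        groups.setdefault(m.additional_kwargs.get("dimos_turn"), []).append(m)
    return groups
```

Because compaction always selects whole groups, a tool call and its response can never end up on opposite sides of the summarize/keep boundary.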

Current turn is sacred

_current_turn_start walks from the end of the message list to find the
boundary of the latest turn. Everything from that boundary forward is never
compacted — no image strip, no summary touch. This protects:

  • The user's current query
  • In-progress tool calls and their pending ToolMessage responses
  • Fresh images from perception that the user might be asking about right now
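A sketch of the backward walk, using plain dicts in place of langchain messages. The untagged-history return value of 0 is a simplification; per the review discussion below, the PR instead anchors on the latest HumanMessage in that case.

```python
def current_turn_start(messages: list[dict]) -> int:
    """Index where the protected current turn begins (sketch).

    Walk backward from the end of the list: trailing untagged messages are
    in-flight agent-loop output and belong to the current turn; the boundary
    is just after the last message tagged with an older dimos_turn value.
    """
    max_turn = None
    for m in reversed(messages):
        max_turn = m.get("additional_kwargs", {}).get("dimos_turn")
        if max_turn is not None:
            break
    if max_turn is None:
        # No tags at all: protect everything (the real code anchors on the
        # latest HumanMessage instead -- see the review thread below).
        return 0
    for i in range(len(messages) - 1, -1, -1):
        turn = messages[i].get("additional_kwargs", {}).get("dimos_turn")
        if turn is not None and turn != max_turn:
            return i + 1
    return 0
```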

How it works

Two-stage compaction inside before_model:

  1. Strip images in messages older than the current turn. Image content
    blocks are replaced with a small text placeholder. If this alone gets us
    below target_tokens, we stop here.

    Caveat: this is an incomplete solution. Dropping the image with only
    a "[image removed]" placeholder is destructive as the model can no
    longer refer back to that perception. A more principled design would
    follow progressive disclosure: keep the image addressable in a content
    store and replace the inline block with a reference (e.g.,
    [image: ref://…]) plus a tool the agent can call to re-fetch it on
    demand. I am deferring this decision as it needs a broader agent-harness
    conversation about content addressability.

Why strip images at all: LLMs' visual reasoning is currently noticeably weaker
than their text reasoning. Additionally, the way the agent loop is set up right
now, the model sees each image at the beginning of a new turn and tends to emit
a description of what's in it. That description is detailed enough for later
reasoning about the image's content, but it has a secondary effect: the model
anchors its perception to the description it gave at the time, even when the
image is still available in chat history. Keeping already-observed images around
is therefore a waste of tokens we can reclaim, given that the compaction process
is going to bust the prompt cache anyway.

  2. Summarize older messages into a single SystemMessage while keeping
    the most recent turns verbatim. The summarizer LLM is configurable;
    defaults to reusing the agent's own model. Output is hard-capped via
    summarizer.bind(max_tokens=summary_size_tokens).
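The two stages can be sketched as below. This is an illustration of the control flow only: the helper signatures (`tokens`, `boundary`, `summarize`) are assumptions, messages are plain dicts, and the real middleware returns a langgraph state update rather than a list.

```python
PLACEHOLDER = "[image removed]"


def strip_images(msg: dict) -> dict:
    """Stage 1: replace image content blocks with a small text placeholder."""
    content = msg["content"]
    if not isinstance(content, list):
        return msg  # plain-text message, nothing to strip
    stripped = [
        {"type": "text", "text": PLACEHOLDER} if b.get("type") == "image" else b
        for b in content
    ]
    return {**msg, "content": stripped}


def compact(messages, tokens, threshold, target, boundary, summarize):
    """Two-stage before_model flow (sketch): no-op below threshold,
    image-strip older messages, then summarize if still over target.
    messages[boundary:] is the sacred current turn, never touched."""
    if tokens(messages) <= threshold:
        return None  # short sessions: behavior unchanged
    older = [strip_images(m) for m in messages[:boundary]]
    current = messages[boundary:]
    if tokens(older + current) <= target:
        return older + current  # stage 1 sufficed
    return [summarize(older), *current]  # stage 2: fold older turns away
```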

See it in action

A public Langfuse trace captured with deliberately small defaults so
compaction fires after a handful of turns:

https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/887630cfbf49bb97f1c5b4d2cc980ad1?observation=b73fcf77cb4f2dc5&timestamp=2026-05-12T07:54:34.311Z

Use the trace timeline to see the prompt that hits the LLM at each
agent-turn-N span — older turns get folded into a single summary
SystemMessage and the agent continues with a shrunk prompt.

Configuration

All on by default via McpClientConfig, env-driven:

Env var                        Field                          Default
AGENT_COMPACTION_THRESHOLD     agent_compaction_threshold     40000
AGENT_COMPACTION_TARGET        agent_compaction_target        3000
AGENT_COMPACTION_SUMMARY_SIZE  agent_compaction_summary_size  1000
AGENT_COMPACTION_MODEL         agent_compaction_model         None (reuses agent's model)

Why a middleware

Two reasons, both documented in compaction_middleware.py's module docstring:

  1. Middleware vs preprocessing. External preprocessing on _history
    would only fire once per user turn, leaving every intra-turn re-invocation
    unprotected. Middleware fires before each model call.
  2. before_model vs after_model / wrap_model_call. before_model is
    the minimal-intervention hook. after_model is too late (the model
    already errored on overflow); wrap_model_call conflates compaction with
    the model-call concerns (retries, error shaping, tool dispatch).

Changes

New files

  • dimos/agents/compaction_middleware.py — DimosCompactionMiddleware
    class (subclass of langchain.agents.middleware.AgentMiddleware),
    placeholder token counter (3 chars/token, 1000 tokens/image; memoized in
    additional_kwargs["dimos_tokens"] for O(new-only) recompute), static
    token cache for system_prompt + tool schemas, and the algorithm helpers
    (_strip_images, _split, _current_turn_start, _summarize).
  • dimos/agents/test_compaction_middleware.py — 15 pytest cases,
    hermetic (no API key needed). Coverage includes:
    • Token counter unit tests (text, image, memoization, static cache)
    • before_model no-op below threshold
    • Stage 1 alone suffices (image strip only)
    • Stage 2 summarization with FakeListChatModel summarizer
    • Protected SystemMessage prefix preserved
    • Mid-list untagged messages get summarized (not protected)
    • Prior summary re-folded into the next summary (no stacking)
    • Most-recent turns kept verbatim
    • Tool-call/tool-response pairs never split across summarize/keep boundary
    • Summarizer failure propagates after retries
    • Two integration tests that drive a real create_agent loop with a
      RecordingFakeAgent and assert: (a) the agent node receives a compacted
      prompt (proves langgraph's add_messages reducer interprets the
      RemoveMessage(REMOVE_ALL_MESSAGES) sentinel correctly), and
      (b) compaction can fire mid-turn between a tool result and the next
      model call.
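The placeholder token counter described above can be sketched as follows. The constants and the memoization field name come from the PR description; the dict-based message shape is a stand-in, and the real counter is designed to be swapped for an actual tokenizer later.

```python
CHARS_PER_TOKEN = 3   # pessimistic placeholder ratio, per the PR description
IMAGE_TOKENS = 1000   # flat cost per image content block


def message_tokens(msg: dict) -> int:
    """Count tokens for one message, memoizing in additional_kwargs so
    re-counting a long history only pays for the new messages."""
    kw = msg.setdefault("additional_kwargs", {})
    if "dimos_tokens" in kw:
        return kw["dimos_tokens"]  # O(new-only): old messages hit the cache
    content = msg["content"]
    if isinstance(content, str):
        n = len(content) // CHARS_PER_TOKEN
    else:
        n = sum(
            IMAGE_TOKENS if block.get("type") == "image"
            else len(block.get("text", "")) // CHARS_PER_TOKEN
            for block in content
        )
    kw["dimos_tokens"] = n
    return n
```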

Modified: dimos/agents/mcp/mcp_client.py

  • Config: four new fields on McpClientConfig reading the env vars in
    the table above. _env_int / _env_str helpers loaded via pydantic
    Field(default_factory=...).
  • Turn tagging: new _turn: int counter on McpClient (incremented at
    the top of _process_message), and a new module-level
    _tag_turn(message, turn) helper that stamps
    additional_kwargs["dimos_turn"]. Every message flowing through a turn
    gets stamped — the incoming HumanMessage first, then every message
    emitted by the state graph.
  • History sync: new _apply_messages_update method that mirrors
    langgraph's add_messages reducer semantics locally — honors
    RemoveMessage(id=REMOVE_ALL_MESSAGES) as "wipe history, use what
    follows" and specific-id RemoveMessage as targeted removal. This keeps
    McpClient._history in sync with the graph's internal state even when the
    middleware replaces the entire message list.
  • Middleware wiring: in on_system_modules, construct the summarizer
    (either via init_chat_model(agent_compaction_model), or
    init_chat_model(model) if the agent's model is a string, or reuse the
    agent's model object), build the middleware with the system prompt and
    tool JSON schemas (t.args_schema.model_json_schema()), and pass it as
    create_agent(..., middleware=middleware).
  • Robustness in the stream loop: the worker thread now guards against
    middleware no-op updates that yield {node: None} instead of
    {node: {"messages": [...]}}, which would previously crash with
    'NoneType' object has no attribute 'get'.
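The reducer-mirroring semantics of _apply_messages_update can be sketched like this. The sentinel string and dict-shaped messages are stand-ins (langgraph exposes its own REMOVE_ALL_MESSAGES constant and message classes); only the replay logic is illustrated.

```python
REMOVE_ALL = "__remove_all__"  # stand-in for langgraph's REMOVE_ALL_MESSAGES


def apply_messages_update(history: list, update: list) -> list:
    """Mirror add_messages reducer semantics on a local history (sketch).

    A remove-message carrying the REMOVE_ALL sentinel wipes the history and
    lets the rest of the update rebuild it; a specific-id remove deletes one
    entry; everything else upserts by id (replace on match, else append).
    """
    for msg in update:
        if msg.get("type") == "remove":
            if msg.get("id") == REMOVE_ALL:
                history = []  # full wipe: middleware replaced the message list
            else:
                history = [h for h in history if h.get("id") != msg.get("id")]
            continue
        for i, h in enumerate(history):
            if h.get("id") is not None and h.get("id") == msg.get("id"):
                history[i] = msg  # same id: replace in place
                break
        else:
            history.append(msg)
    return history
```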

Modified: .gitignore

Adds MUJOCO_LOG.TXT (MuJoCo runtime artifact written to the repo root on
every sim run; should never be committed).

Test plan

  • uv run pytest dimos/agents/test_compaction_middleware.py -v — 15/15
    pass.
  • uv run mypy dimos/agents/compaction_middleware.py dimos/agents/test_compaction_middleware.py dimos/agents/mcp/mcp_client.py — clean.
  • Live verification: dimos --simulation run unitree-go2-agentic with
    AGENT_COMPACTION_THRESHOLD=2000, drive the agent until the threshold
    is crossed, confirm a Compaction fired (summarize) log line appears
    and the next prompt sent to the LLM contains the summary
    SystemMessage instead of the older turns.

Known limitations

Documented in the module docstring as "Known limitations":

  1. Image stripping is destructive — see caveat under stage 1 above.
    Progressive disclosure with a content store is the right long-term answer.
  2. Summarizer transcript size is unbounded — a first-ever compaction on a
    very long session could exceed the summarizer model's own context window.
    Mitigation deferred to a follow-up (chunked summarization).
  3. @retry(on_exception=Exception) is intentionally broad because the
    summarizer is duck-typed; permanent errors cost up to 3 attempts + 1s of
    sleeps before propagating.

Caps the prompt the agent sends to its LLM so the conversation history
never grows unbounded. Runs as a langchain AgentMiddleware via
create_agent(middleware=...), so the size bound becomes an invariant of
the agent loop — `before_model` fires before every model call, including
intra-turn re-invocations (model -> tool -> tool result -> model).

Two-stage compaction:
  1. Strip image content blocks from older messages (replace with a small
     text placeholder).
  2. If still over target, summarize older messages into a single
     SystemMessage and keep the most recent turns verbatim.

The current turn (latest dimos_turn group + any trailing untagged
messages, i.e. in-flight tool calls) is preserved untouched — never
compacted, never image-stripped.

Configuration via McpClientConfig fields, env-driven by default:
  AGENT_COMPACTION_THRESHOLD     trigger size           (default 40000)
  AGENT_COMPACTION_TARGET        size after compaction  (default 3000)
  AGENT_COMPACTION_SUMMARY_SIZE  generated summary size (default 1000)
  AGENT_COMPACTION_MODEL         optional separate summarizer model

Also includes:

- Per-message turn tagging via additional_kwargs["dimos_turn"], stamped
  in McpClient._process_message so compaction can group/score by turn.
- McpClient._history mirror updated to honor langgraph's add_messages
  reducer semantics (RemoveMessage(id=REMOVE_ALL_MESSAGES) sentinel) so
  the local history doesn't accrete pre-compaction state.
- Token counter is a pessimistic placeholder (3 chars/token,
  1000/image), memoized on each message for O(new-only) recompute cost.
  Designed to be swapped for a real tokenizer later without touching
  callers.
- 15 pytest cases (hermetic, no API key needed), including two
  integration tests that drive a real create_agent loop and prove
  compaction can fire mid-turn between a tool result and the next
  model call.

Defaults are intentionally conservative so the feature is on by default
without changing behavior for short sessions.
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds DimosCompactionMiddleware, a before_model hook that keeps the agent's prompt within a configurable token budget via two-stage compaction: image-block stripping followed by LLM-based summarisation of older turns. Turn tagging (dimos_turn), a local history-sync method (_apply_messages_update), and env-driven config fields wire the middleware into McpClient.

  • Compaction algorithm — _current_turn_start correctly protects the in-flight turn and its tool-call/response pairs from being split; _split aligns the summarise/keep boundary to dimos_turn groups and guarantees at least one message survives in the kept tail.
  • Local history mirroring — _apply_messages_update mirrors langgraph's add_messages reducer, handles RemoveMessage(REMOVE_ALL_MESSAGES) for full-wipe replays, and suppresses duplicate publishes to downstream subscribers using Python-identity checks.
  • Test coverage — 15 hermetic pytest cases (unit + two full create_agent integration tests) confirm compaction fires at the right moments, including mid-turn between a tool result and the next model call.

Confidence Score: 5/5

Safe to merge. The compaction algorithm, turn-boundary protection, tool-call coherence, and duplicate-publish suppression are all well-handled, and the two full-loop integration tests confirm the middleware integrates correctly with langgraph's add_messages reducer.

The core logic is sound: the backward-scan current-turn detection, the _split boundary alignment, and the _apply_messages_update replay logic all hold up under close inspection. The previous review concerns have been addressed. Remaining notes are minor style/edge-case issues that do not affect correctness.

No files require special attention. compaction_middleware.py contains one dead method worth removing; mcp_client.py has a minor zero-coercion edge case in the env-var defaults.

Important Files Changed

Filename — Overview
dimos/agents/compaction_middleware.py — New middleware implementing two-stage compaction (image-strip then summarise); algorithm is sound with correct turn-boundary alignment and memoised token counting. Contains one dead method _total_tokens that is never called.
dimos/agents/mcp/mcp_client.py — Adds turn-tagging, _apply_messages_update (mirrors langgraph's add_messages reducer), env-driven config fields, and wires the middleware into create_agent. Previous review concerns about duplicate publishes and env-var error handling have been addressed. The or zero-coercion for integer env vars is a minor residual issue.
dimos/agents/test_compaction_middleware.py — 15 hermetic pytest cases covering token counting, all compaction paths, tool-call coherence, re-compaction folding, failure propagation, and two full-loop integration tests with FakeMessagesListChatModel. Coverage is thorough.
.gitignore — Adds MUJOCO_LOG.TXT to prevent MuJoCo runtime artefacts from being committed.

Sequence Diagram

sequenceDiagram
    participant U as "User / tool-stream"
    participant MP as "McpClient._process_message"
    participant SG as "LangGraph state_graph"
    participant BM as "before_model hook"
    participant LLM as "Agent LLM"
    participant AU as "_apply_messages_update"

    U->>MP: HumanMessage
    MP->>MP: "increment turn, tag message"
    MP->>MP: "append to history, publish"
    MP->>SG: "stream(messages: history)"

    loop "agent node execution"
        SG->>BM: "before_model(state)"
        alt "total <= threshold"
            BM-->>SG: "None (no-op)"
        else "Stage 1: image strip"
            BM-->>SG: "RemoveMessage(ALL) + stripped + current_turn"
        else "Stage 2: summarise"
            BM->>LLM: "summarise older turns"
            LLM-->>BM: "summary_text"
            BM-->>SG: "RemoveMessage(ALL) + protected + summary + keep + current_turn"
        end
        SG->>LLM: "invoke(compacted messages)"
        LLM-->>SG: "AIMessage"
        SG-->>MP: "stream update {node: messages}"
        MP->>AU: "_apply_messages_update(node_messages, turn)"
        AU->>AU: "wipe history on RemoveMessage(ALL)"
        AU->>AU: "skip-publish replayed objects"
        AU->>AU: "append and publish new objects"
    end

Reviews (6): Last reviewed commit: "fix: small efficiency rewrite"

Comment thread dimos/agents/mcp/mcp_client.py
Comment thread dimos/agents/mcp/mcp_client.py Outdated
Comment on lines +49 to +51
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None
Contributor


P2 _env_int calls int(v) without a try/except, so a non-numeric value like AGENT_COMPACTION_THRESHOLD=abc raises a bare ValueError deep inside pydantic's default_factory during config construction, producing an unhelpful traceback with no mention of which env var is at fault.

Suggested change
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    if not v:
        return None
    try:
        return int(v)
    except ValueError:
        raise ValueError(f"Environment variable {name!r} must be an integer, got {v!r}") from None

Comment thread dimos/agents/compaction_middleware.py
Mario Garrido and others added 3 commits May 12, 2026 05:15
- McpClient._apply_messages_update: dedupe publish on compaction replay.
  When the middleware emits [RemoveMessage, protected..., summary,
  keep..., current_turn...], the protected/keep/current messages are the
  same Python objects that were already published when they first arrived.
  Skip publish+print for any iter_msg whose id() was in the pre-wipe
  history; only the genuinely-new summary (and later AIMessages from the
  agent node in subsequent stream updates) get republished. Identified by
  Greptile P1.

- McpClient._env_int: re-raise a labeled ValueError when the env var
  value isn't a valid integer, so misconfiguration surfaces with the
  offending name instead of a bare pydantic traceback. Identified by
  Greptile P2.

- DimosCompactionMiddleware._static_tokens: drop the per-call hash
  computation. Inputs (system_prompt, tool_schemas) are bound at
  __init__ and never mutate, so a simple None-check on the cache is
  sufficient. Identified by Greptile P2.
@codecov

codecov Bot commented May 12, 2026

❌ 1 Tests Failed:

Tests completed: 1774   Failed: 1   Passed: 1773   Skipped: 29
View the top 1 failed test(s) by shortest run time
dimos.project.test_no_sections::test_no_section_markers
Stack Traces | 0.815s run time
def test_no_section_markers():
        """
        Fail if any file contains section-style comment markers.
    
        If a file is too complicated to be understood without sections, then the
        sections should be files. We don't need "subfiles".
        """
        violations = find_section_markers()
        if violations:
            report_lines = [
                f"Found {len(violations)} section marker(s). "
                "If a file is too complicated to be understood without sections, "
                'then the sections should be files. We don\'t need "subfiles".',
                "",
            ]
            for path, lineno, text in violations:
                report_lines.append(f"  {path}:{lineno}: {text.strip()}")
>           raise AssertionError("\n".join(report_lines))
E           AssertionError: Found 14 section marker(s). If a file is too complicated to be understood without sections, then the sections should be files. We don't need "subfiles".
E           
E             dimos/agents/test_compaction_middleware.py:47: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:49: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:118: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:120: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:157: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:159: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:447: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:454: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:123: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:125: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:186: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:188: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:514: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:516: # ---------------------------------------------------------------------------

lineno     = 516
path       = 'dimos/agents/compaction_middleware.py'
report_lines = ['Found 14 section marker(s). If a file is too complicated to be understood without sections, then the sections should...test_compaction_middleware.py:120: # ---------------------------------------------------------------------------', ...]
text       = '# ---------------------------------------------------------------------------'
violations = [('dimos/agents/test_compaction_middleware.py', 47, '# ---------------------------------------------------------------..._compaction_middleware.py', 159, '# ---------------------------------------------------------------------------'), ...]

dimos/project/test_no_sections.py:145: AssertionError


Comment on lines +384 to +385
if max_turn is None:
    return len(messages)
Contributor


P1 _current_turn_start returns len(messages) when no messages carry a dimos_turn tag, but this places every message into compactable and nothing into current_turn. The downstream split in before_model then treats the latest user query itself as eligible for summarization, silently folding the active request into the summary. The docstring says "the caller will no-op" in this case, which only holds if the function returns 0 (making compactable = []). Any deployment that feeds the middleware a history without turn tags — e.g., a session started before the tagging feature landed, or a standalone use outside McpClient — would have its current-turn messages summarized away on the first threshold crossing.

Suggested change
if max_turn is None:
    return len(messages)
if max_turn is None:
    return 0

Author


This turns the middleware off and keeps the problem intact. I have added an alternative split for any agents that don't have turn-tagged messages where the compressible section starts just before the latest human message. It's a good compromise for that edge case.

Mario Garrido added 2 commits May 12, 2026 20:09
…Message

When `_current_turn_start` encounters a history with no `dimos_turn` tags
at all (a caller wired the middleware in without going through McpClient),
it now walks back to find the latest `HumanMessage` and uses that as the
boundary. Older messages compactable, latest user input + any in-flight
assistant/tool messages after it protected.

The previous behavior returned `len(messages)` — making every message
compactable — which would silently summarize the active user query the
first time compaction crossed threshold. (Returning `0` would protect
everything and instead let an oversized prompt reach the LLM, where it
would raise on context overflow — worse than the silent path.) Greptile
P4.

Also: clarify in the module docstring that tool-call coherence is the
harness's responsibility. The middleware never introduces orphan
tool_calls (same-turn messages always travel together via `_split`'s
boundary alignment) but doesn't fix orphans it inherits — those flow
through to the LLM in the current-turn case, where the malformed
conversation will surface as a provider error.

New regression test: test_untagged_history_anchors_current_turn_on_latest_human.

Development

Successfully merging this pull request may close these issues.

Agent Compaction
