
feat(agents): add prompt-compaction middleware for McpClient #2055

Open
Mgczacki wants to merge 7 commits into dimensionalOS:main from Mgczacki:feat/agent-prompt-compaction

Conversation


@Mgczacki Mgczacki commented May 12, 2026

Summary

Closes #1899

Caps the prompt the dimos agent sends to its LLM so the conversation history
never grows unbounded. Implemented as a langchain AgentMiddleware plugged into
create_agent(middleware=...). Because the hook (before_model) fires before
every model invocation, the input-size bound becomes an invariant of the agent
loop — including intra-turn re-invocations (model → tool → tool result → model).

On long sessions the middleware quietly summarizes older turns once it detects
an oversized prompt. Behavior is unchanged for short sessions.

Concepts

dimos_turn

A new integer tag attached to each message's additional_kwargs dict.
Incremented once per McpClient._process_message call — that is, once per
user-facing turn (a human input from agent-send, or a tool-stream
notification that wakes the agent). Every message that flows through during
that turn — the input HumanMessage, intermediate AIMessages with
tool_calls, the resulting ToolMessages, the final AIMessage — all get
stamped with the same turn number.

This is what lets compaction:

  1. Group messages by turn so tool_call/tool_response pairs always travel
    together (compaction selects entire turns, never partial ones — no orphan
    tool_call_id references).
  2. Identify the current turn (the latest tag value plus any trailing
    untagged in-flight messages from the agent loop) and preserve it untouched
    regardless of threshold.
  3. Score / inspect the history per-turn for future heuristics (e.g.,
    keep-N-most-recent strategies).

dimos_turn is metadata only — it lives in additional_kwargs, which
providers ignore but langchain serialization preserves. The compaction
summary itself is tagged with the max turn it covers (plus
dimos_compacted: True), so re-compaction folds the prior summary into the
next one cleanly.
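The tagging and grouping mechanics described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: `Msg` is a stand-in for a langchain message (only the `additional_kwargs` field matters here), and the helper names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Msg:
    """Minimal stand-in for a langchain message carrying additional_kwargs."""
    content: str
    additional_kwargs: dict = field(default_factory=dict)


def tag_turn(msg: Msg, turn: int) -> Msg:
    """Stamp the turn number into provider-ignored metadata."""
    msg.additional_kwargs["dimos_turn"] = turn
    return msg


def group_by_turn(messages: list[Msg]) -> dict:
    """Group messages by turn so tool_call/tool_response pairs from the
    same turn always travel together through compaction."""
    groups: dict = {}
    for m in messages:
        groups.setdefault(m.additional_kwargs.get("dimos_turn"), []).append(m)
    return groups
```

Because compaction always selects whole groups, a tool call and its response can never end up on opposite sides of the summarize/keep boundary.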

Current turn is sacred

_current_turn_start walks from the end of the message list to find the
boundary of the latest turn. Everything from that boundary forward is never
compacted — no image strip, no summary touch. This protects:

  • The user's current query
  • In-progress tool calls and their pending ToolMessage responses
  • Fresh images from perception that the user might be asking about right now
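A sketch of the backward walk, using plain dicts in place of langchain messages. The untagged-history return value of 0 is a simplification; per the review discussion below, the PR instead anchors on the latest HumanMessage in that case.

```python
def current_turn_start(messages: list[dict]) -> int:
    """Index where the protected current turn begins (sketch).

    Walk backward from the end of the list: trailing untagged messages are
    in-flight agent-loop output and belong to the current turn; the boundary
    is just after the last message tagged with an older dimos_turn value.
    """
    max_turn = None
    for m in reversed(messages):
        max_turn = m.get("additional_kwargs", {}).get("dimos_turn")
        if max_turn is not None:
            break
    if max_turn is None:
        # No tags at all: protect everything (the real code anchors on the
        # latest HumanMessage instead -- see the review thread below).
        return 0
    for i in range(len(messages) - 1, -1, -1):
        turn = messages[i].get("additional_kwargs", {}).get("dimos_turn")
        if turn is not None and turn != max_turn:
            return i + 1
    return 0
```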

How it works

Two-stage compaction inside before_model:

  1. Strip images in messages older than the current turn. Image content
    blocks are replaced with a small text placeholder. If this alone gets us
    below target_tokens, we stop here.

    Caveat: this is an incomplete solution. Dropping the image with only
    a "[image removed]" placeholder is destructive as the model can no
    longer refer back to that perception. A more principled design would
    follow progressive disclosure: keep the image addressable in a content
    store and replace the inline block with a reference (e.g.,
    [image: ref://…]) plus a tool the agent can call to re-fetch it on
    demand. I am deferring this decision as it needs a broader agent-harness
    conversation about content addressability.

Why strip images at all: LLMs' visual reasoning is currently noticeably weaker
than their text reasoning. Additionally, the way the agent loop is set up right
now, the model sees each image at the beginning of a new turn and tends to emit
a description of what's in it. That description is detailed enough for later
reasoning about the image's content, but it has a secondary effect: the model
anchors its perception to the description it gave at the time, even when the
image is still available in chat history. Keeping already-observed images around
is therefore a waste of tokens we can reclaim, given that the compaction process
is going to bust the prompt cache anyway.

  2. Summarize older messages into a single SystemMessage while keeping
    the most recent turns verbatim. The summarizer LLM is configurable;
    defaults to reusing the agent's own model. Output is hard-capped via
    summarizer.bind(max_tokens=summary_size_tokens).
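The two stages can be sketched as below. This is an illustration of the control flow only: the helper signatures (`tokens`, `boundary`, `summarize`) are assumptions, messages are plain dicts, and the real middleware returns a langgraph state update rather than a list.

```python
PLACEHOLDER = "[image removed]"


def strip_images(msg: dict) -> dict:
    """Stage 1: replace image content blocks with a small text placeholder."""
    content = msg["content"]
    if not isinstance(content, list):
        return msg  # plain-text message, nothing to strip
    stripped = [
        {"type": "text", "text": PLACEHOLDER} if b.get("type") == "image" else b
        for b in content
    ]
    return {**msg, "content": stripped}


def compact(messages, tokens, threshold, target, boundary, summarize):
    """Two-stage before_model flow (sketch): no-op below threshold,
    image-strip older messages, then summarize if still over target.
    messages[boundary:] is the sacred current turn, never touched."""
    if tokens(messages) <= threshold:
        return None  # short sessions: behavior unchanged
    older = [strip_images(m) for m in messages[:boundary]]
    current = messages[boundary:]
    if tokens(older + current) <= target:
        return older + current  # stage 1 sufficed
    return [summarize(older), *current]  # stage 2: fold older turns away
```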

See it in action

A public Langfuse trace captured with deliberately small defaults so
compaction fires after a handful of turns:

https://us.cloud.langfuse.com/project/cmp23t80n09ooad08jnw1lksy/traces/887630cfbf49bb97f1c5b4d2cc980ad1?observation=b73fcf77cb4f2dc5&timestamp=2026-05-12T07:54:34.311Z

Use the trace timeline to see the prompt that hits the LLM at each
agent-turn-N span — older turns get folded into a single summary
SystemMessage and the agent continues with a shrunk prompt.

Configuration

All on by default via McpClientConfig, env-driven:

Env var                        Field                          Default
AGENT_COMPACTION_THRESHOLD     agent_compaction_threshold     40000
AGENT_COMPACTION_TARGET        agent_compaction_target        3000
AGENT_COMPACTION_SUMMARY_SIZE  agent_compaction_summary_size  1000
AGENT_COMPACTION_MODEL         agent_compaction_model         None (reuses agent's model)

Why a middleware

Two reasons, both documented in compaction_middleware.py's module docstring:

  1. Middleware vs preprocessing. External preprocessing on _history
    would only fire once per user turn, leaving every intra-turn re-invocation
    unprotected. Middleware fires before each model call.
  2. before_model vs after_model / wrap_model_call. before_model is
    the minimal-intervention hook. after_model is too late (the model
    already errored on overflow); wrap_model_call conflates compaction with
    the model-call concerns (retries, error shaping, tool dispatch).

Changes

New files

  • dimos/agents/compaction_middleware.py — DimosCompactionMiddleware
    class (subclass of langchain.agents.middleware.AgentMiddleware),
    placeholder token counter (3 chars/token, 1000 tokens/image; memoized in
    additional_kwargs["dimos_tokens"] for O(new-only) recompute), static
    token cache for system_prompt + tool schemas, and the algorithm helpers
    (_strip_images, _split, _current_turn_start, _summarize).
  • dimos/agents/test_compaction_middleware.py — 15 pytest cases,
    hermetic (no API key needed). Coverage includes:
    • Token counter unit tests (text, image, memoization, static cache)
    • before_model no-op below threshold
    • Stage 1 alone suffices (image strip only)
    • Stage 2 summarization with FakeListChatModel summarizer
    • Protected SystemMessage prefix preserved
    • Mid-list untagged messages get summarized (not protected)
    • Prior summary re-folded into the next summary (no stacking)
    • Most-recent turns kept verbatim
    • Tool-call/tool-response pairs never split across summarize/keep boundary
    • Summarizer failure propagates after retries
    • Two integration tests that drive a real create_agent loop with a
      RecordingFakeAgent and assert: (a) the agent node receives a compacted
      prompt (proves langgraph's add_messages reducer interprets the
      RemoveMessage(REMOVE_ALL_MESSAGES) sentinel correctly), and
      (b) compaction can fire mid-turn between a tool result and the next
      model call.
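The placeholder token counter described above can be sketched as follows. The constants and the memoization field name come from the PR description; the dict-based message shape is a stand-in, and the real counter is designed to be swapped for an actual tokenizer later.

```python
CHARS_PER_TOKEN = 3   # pessimistic placeholder ratio, per the PR description
IMAGE_TOKENS = 1000   # flat cost per image content block


def message_tokens(msg: dict) -> int:
    """Count tokens for one message, memoizing in additional_kwargs so
    re-counting a long history only pays for the new messages."""
    kw = msg.setdefault("additional_kwargs", {})
    if "dimos_tokens" in kw:
        return kw["dimos_tokens"]  # O(new-only): old messages hit the cache
    content = msg["content"]
    if isinstance(content, str):
        n = len(content) // CHARS_PER_TOKEN
    else:
        n = sum(
            IMAGE_TOKENS if block.get("type") == "image"
            else len(block.get("text", "")) // CHARS_PER_TOKEN
            for block in content
        )
    kw["dimos_tokens"] = n
    return n
```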

Modified: dimos/agents/mcp/mcp_client.py

  • Config: four new fields on McpClientConfig reading the env vars in
    the table above. _env_int / _env_str helpers loaded via pydantic
    Field(default_factory=...).
  • Turn tagging: new _turn: int counter on McpClient (incremented at
    the top of _process_message), and a new module-level
    _tag_turn(message, turn) helper that stamps
    additional_kwargs["dimos_turn"]. Every message flowing through a turn
    gets stamped — the incoming HumanMessage first, then every message
    emitted by the state graph.
  • History sync: new _apply_messages_update method that mirrors
    langgraph's add_messages reducer semantics locally — honors
    RemoveMessage(id=REMOVE_ALL_MESSAGES) as "wipe history, use what
    follows" and specific-id RemoveMessage as targeted removal. This keeps
    McpClient._history in sync with the graph's internal state even when the
    middleware replaces the entire message list.
  • Middleware wiring: in on_system_modules, construct the summarizer
    (either via init_chat_model(agent_compaction_model), or
    init_chat_model(model) if the agent's model is a string, or reuse the
    agent's model object), build the middleware with the system prompt and
    tool JSON schemas (t.args_schema.model_json_schema()), and pass it as
    create_agent(..., middleware=middleware).
  • Robustness in the stream loop: the worker thread now guards against
    middleware no-op updates that yield {node: None} instead of
    {node: {"messages": [...]}}, which would previously crash with
    'NoneType' object has no attribute 'get'.
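The reducer-mirroring semantics of _apply_messages_update can be sketched like this. The sentinel string and dict-shaped messages are stand-ins (langgraph exposes its own REMOVE_ALL_MESSAGES constant and message classes); only the replay logic is illustrated.

```python
REMOVE_ALL = "__remove_all__"  # stand-in for langgraph's REMOVE_ALL_MESSAGES


def apply_messages_update(history: list, update: list) -> list:
    """Mirror add_messages reducer semantics on a local history (sketch).

    A remove-message carrying the REMOVE_ALL sentinel wipes the history and
    lets the rest of the update rebuild it; a specific-id remove deletes one
    entry; everything else upserts by id (replace on match, else append).
    """
    for msg in update:
        if msg.get("type") == "remove":
            if msg.get("id") == REMOVE_ALL:
                history = []  # full wipe: middleware replaced the message list
            else:
                history = [h for h in history if h.get("id") != msg.get("id")]
            continue
        for i, h in enumerate(history):
            if h.get("id") is not None and h.get("id") == msg.get("id"):
                history[i] = msg  # same id: replace in place
                break
        else:
            history.append(msg)
    return history
```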

Modified: .gitignore

Adds MUJOCO_LOG.TXT (MuJoCo runtime artifact written to the repo root on
every sim run; should never be committed).

Test plan

  • uv run pytest dimos/agents/test_compaction_middleware.py -v — 15/15
    pass.
  • uv run mypy dimos/agents/compaction_middleware.py dimos/agents/test_compaction_middleware.py dimos/agents/mcp/mcp_client.py — clean.
  • Live verification: dimos --simulation run unitree-go2-agentic with
    AGENT_COMPACTION_THRESHOLD=2000, drive the agent until the threshold
    is crossed, confirm a Compaction fired (summarize) log line appears
    and the next prompt sent to the LLM contains the summary
    SystemMessage instead of the older turns.

Known limitations

Documented in the module docstring as "Known limitations":

  1. Image stripping is destructive — see caveat under stage 1 above.
    Progressive disclosure with a content store is the right long-term answer.
  2. Summarizer transcript size is unbounded — a first-ever compaction on a
    very long session could exceed the summarizer model's own context window.
    Mitigation deferred to a follow-up (chunked summarization).
  3. @retry(on_exception=Exception) is intentionally broad because the
    summarizer is duck-typed; permanent errors cost up to 3 attempts + 1s of
    sleeps before propagating.

Caps the prompt the agent sends to its LLM so the conversation history
never grows unbounded. Runs as a langchain AgentMiddleware via
create_agent(middleware=...), so the size bound becomes an invariant of
the agent loop — `before_model` fires before every model call, including
intra-turn re-invocations (model -> tool -> tool result -> model).

Two-stage compaction:
  1. Strip image content blocks from older messages (replace with a small
     text placeholder).
  2. If still over target, summarize older messages into a single
     SystemMessage and keep the most recent turns verbatim.

The current turn (latest dimos_turn group + any trailing untagged
messages, i.e. in-flight tool calls) is preserved untouched — never
compacted, never image-stripped.

Configuration via McpClientConfig fields, env-driven by default:
  AGENT_COMPACTION_THRESHOLD     trigger size           (default 40000)
  AGENT_COMPACTION_TARGET        size after compaction  (default 3000)
  AGENT_COMPACTION_SUMMARY_SIZE  generated summary size (default 1000)
  AGENT_COMPACTION_MODEL         optional separate summarizer model

Also includes:

- Per-message turn tagging via additional_kwargs["dimos_turn"], stamped
  in McpClient._process_message so compaction can group/score by turn.
- McpClient._history mirror updated to honor langgraph's add_messages
  reducer semantics (RemoveMessage(id=REMOVE_ALL_MESSAGES) sentinel) so
  the local history doesn't accrete pre-compaction state.
- Token counter is a pessimistic placeholder (3 chars/token,
  1000/image), memoized on each message for O(new-only) recompute cost.
  Designed to be swapped for a real tokenizer later without touching
  callers.
- 15 pytest cases (hermetic, no API key needed), including two
  integration tests that drive a real create_agent loop and prove
  compaction can fire mid-turn between a tool result and the next
  model call.

Defaults are intentionally conservative so the feature is on by default
without changing behavior for short sessions.
Contributor

greptile-apps Bot commented May 12, 2026

Greptile Summary

This PR adds DimosCompactionMiddleware, a before_model hook that keeps the agent's prompt within a configurable token budget via two-stage compaction: image-block stripping followed by LLM-based summarisation of older turns. Turn tagging (dimos_turn), a local history-sync method (_apply_messages_update), and env-driven config fields wire the middleware into McpClient.

  • Compaction algorithm — _current_turn_start correctly protects the in-flight turn and its tool-call/response pairs from being split; _split aligns the summarise/keep boundary to dimos_turn groups and guarantees at least one message survives in the kept tail.
  • Local history mirroring — _apply_messages_update mirrors langgraph's add_messages reducer, handles RemoveMessage(REMOVE_ALL_MESSAGES) for full-wipe replays, and suppresses duplicate publishes to downstream subscribers using Python-identity checks.
  • Test coverage — 15 hermetic pytest cases (unit + two full create_agent integration tests) confirm compaction fires at the right moments, including mid-turn between a tool result and the next model call.

Confidence Score: 5/5

Safe to merge. The compaction algorithm, turn-boundary protection, tool-call coherence, and duplicate-publish suppression are all well-handled, and the two full-loop integration tests confirm the middleware integrates correctly with langgraph's add_messages reducer.

The core logic is sound: the backward-scan current-turn detection, the _split boundary alignment, and the _apply_messages_update replay logic all hold up under close inspection. The previous review concerns have been addressed. Remaining notes are minor style/edge-case issues that do not affect correctness.

No files require special attention. compaction_middleware.py contains one dead method worth removing; mcp_client.py has a minor zero-coercion edge case in the env-var defaults.

Important Files Changed

Filename — Overview
dimos/agents/compaction_middleware.py — New middleware implementing two-stage compaction (image-strip then summarise); algorithm is sound with correct turn-boundary alignment and memoised token counting. Contains one dead method _total_tokens that is never called.
dimos/agents/mcp/mcp_client.py — Adds turn-tagging, _apply_messages_update (mirrors langgraph's add_messages reducer), env-driven config fields, and wires the middleware into create_agent. Previous review concerns about duplicate publishes and env-var error handling have been addressed. The or zero-coercion for integer env vars is a minor residual issue.
dimos/agents/test_compaction_middleware.py — 15 hermetic pytest cases covering token counting, all compaction paths, tool-call coherence, re-compaction folding, failure propagation, and two full-loop integration tests with FakeMessagesListChatModel. Coverage is thorough.
.gitignore — Adds MUJOCO_LOG.TXT to prevent MuJoCo runtime artefacts from being committed.

Sequence Diagram

sequenceDiagram
    participant U as "User / tool-stream"
    participant MP as "McpClient._process_message"
    participant SG as "LangGraph state_graph"
    participant BM as "before_model hook"
    participant LLM as "Agent LLM"
    participant AU as "_apply_messages_update"

    U->>MP: HumanMessage
    MP->>MP: "increment turn, tag message"
    MP->>MP: "append to history, publish"
    MP->>SG: "stream(messages: history)"

    loop "agent node execution"
        SG->>BM: "before_model(state)"
        alt "total <= threshold"
            BM-->>SG: "None (no-op)"
        else "Stage 1: image strip"
            BM-->>SG: "RemoveMessage(ALL) + stripped + current_turn"
        else "Stage 2: summarise"
            BM->>LLM: "summarise older turns"
            LLM-->>BM: "summary_text"
            BM-->>SG: "RemoveMessage(ALL) + protected + summary + keep + current_turn"
        end
        SG->>LLM: "invoke(compacted messages)"
        LLM-->>SG: "AIMessage"
        SG-->>MP: "stream update {node: messages}"
        MP->>AU: "_apply_messages_update(node_messages, turn)"
        AU->>AU: "wipe history on RemoveMessage(ALL)"
        AU->>AU: "skip-publish replayed objects"
        AU->>AU: "append and publish new objects"
    end

Reviews (6): Last reviewed commit: "fix: small efficiency rewrite"

Comment thread dimos/agents/mcp/mcp_client.py
Comment thread dimos/agents/mcp/mcp_client.py Outdated
Comment on lines +49 to +51
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None
Contributor


P2 _env_int calls int(v) without a try/except, so a non-numeric value like AGENT_COMPACTION_THRESHOLD=abc raises a bare ValueError deep inside pydantic's default_factory during config construction, producing an unhelpful traceback with no mention of which env var is at fault.

Suggested change
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    return int(v) if v else None
def _env_int(name: str) -> int | None:
    v = os.environ.get(name)
    if not v:
        return None
    try:
        return int(v)
    except ValueError:
        raise ValueError(f"Environment variable {name!r} must be an integer, got {v!r}") from None

Comment thread dimos/agents/compaction_middleware.py
Mario Garrido and others added 3 commits May 12, 2026 05:15
- McpClient._apply_messages_update: dedupe publish on compaction replay.
  When the middleware emits [RemoveMessage, protected..., summary,
  keep..., current_turn...], the protected/keep/current messages are the
  same Python objects that were already published when they first arrived.
  Skip publish+print for any iter_msg whose id() was in the pre-wipe
  history; only the genuinely-new summary (and later AIMessages from the
  agent node in subsequent stream updates) get republished. Identified by
  Greptile P1.

- McpClient._env_int: re-raise a labeled ValueError when the env var
  value isn't a valid integer, so misconfiguration surfaces with the
  offending name instead of a bare pydantic traceback. Identified by
  Greptile P2.

- DimosCompactionMiddleware._static_tokens: drop the per-call hash
  computation. Inputs (system_prompt, tool_schemas) are bound at
  __init__ and never mutate, so a simple None-check on the cache is
  sufficient. Identified by Greptile P2.
@codecov

codecov Bot commented May 12, 2026

❌ 1 Tests Failed:

Tests completed: 1774   Failed: 1   Passed: 1773   Skipped: 29
View the top 1 failed test(s) by shortest run time
dimos.project.test_no_sections::test_no_section_markers
Stack Traces | 0.815s run time
def test_no_section_markers():
        """
        Fail if any file contains section-style comment markers.
    
        If a file is too complicated to be understood without sections, then the
        sections should be files. We don't need "subfiles".
        """
        violations = find_section_markers()
        if violations:
            report_lines = [
                f"Found {len(violations)} section marker(s). "
                "If a file is too complicated to be understood without sections, "
                'then the sections should be files. We don\'t need "subfiles".',
                "",
            ]
            for path, lineno, text in violations:
                report_lines.append(f"  {path}:{lineno}: {text.strip()}")
>           raise AssertionError("\n".join(report_lines))
E           AssertionError: Found 14 section marker(s). If a file is too complicated to be understood without sections, then the sections should be files. We don't need "subfiles".
E           
E             dimos/agents/test_compaction_middleware.py:47: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:49: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:118: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:120: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:157: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:159: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:447: # ---------------------------------------------------------------------------
E             dimos/agents/test_compaction_middleware.py:454: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:123: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:125: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:186: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:188: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:514: # ---------------------------------------------------------------------------
E             dimos/agents/compaction_middleware.py:516: # ---------------------------------------------------------------------------

lineno     = 516
path       = 'dimos/agents/compaction_middleware.py'
report_lines = ['Found 14 section marker(s). If a file is too complicated to be understood without sections, then the sections should...test_compaction_middleware.py:120: # ---------------------------------------------------------------------------', ...]
text       = '# ---------------------------------------------------------------------------'
violations = [('dimos/agents/test_compaction_middleware.py', 47, '# ---------------------------------------------------------------..._compaction_middleware.py', 159, '# ---------------------------------------------------------------------------'), ...]

dimos/project/test_no_sections.py:145: AssertionError


Comment on lines +384 to +385
if max_turn is None:
    return len(messages)
Contributor


P1 _current_turn_start returns len(messages) when no messages carry a dimos_turn tag, but this places every message into compactable and nothing into current_turn. The downstream split in before_model then treats the latest user query itself as eligible for summarization, silently folding the active request into the summary. The docstring says "the caller will no-op" in this case, which only holds if the function returns 0 (making compactable = []). Any deployment that feeds the middleware a history without turn tags — e.g., a session started before the tagging feature landed, or a standalone use outside McpClient — would have its current-turn messages summarized away on the first threshold crossing.

Suggested change
if max_turn is None:
    return len(messages)
if max_turn is None:
    return 0

Author


This turns the middleware off and keeps the problem intact. I have added an alternative split for any agents that don't have turn-tagged messages where the compressible section starts just before the latest human message. It's a good compromise for that edge case.

Mario Garrido added 2 commits May 12, 2026 20:09
…Message

When `_current_turn_start` encounters a history with no `dimos_turn` tags
at all (a caller wired the middleware in without going through McpClient),
it now walks back to find the latest `HumanMessage` and uses that as the
boundary. Older messages compactable, latest user input + any in-flight
assistant/tool messages after it protected.

The previous behavior returned `len(messages)` — making every message
compactable — which would silently summarize the active user query the
first time compaction crossed threshold. (Returning `0` would protect
everything and instead let an oversized prompt reach the LLM, where it
would raise on context overflow — worse than the silent path.) Greptile
P4.

Also: clarify in the module docstring that tool-call coherence is the
harness's responsibility. The middleware never introduces orphan
tool_calls (same-turn messages always travel together via `_split`'s
boundary alignment) but doesn't fix orphans it inherits — those flow
through to the LLM in the current-turn case, where the malformed
conversation will surface as a provider error.

New regression test: test_untagged_history_anchors_current_turn_on_latest_human.

Development

Successfully merging this pull request may close these issues.

Agent Compaction
