Conversation


@csmith49 csmith49 commented Dec 16, 2025

Certain LLM APIs place restrictions on how messages must be structured, especially with respect to tool calls. This isn't normally a problem, but the condensers can sometimes violate these properties in ways that are difficult to recover from.

This PR makes the primary condenser, the LLMSummarizingCondenser, aware of these restrictions. To do so, we introduce the concept of manipulation indices: positions where the conversation history can be changed without violating the API's restrictions. The condenser ensures that summaries are only inserted at these indices, and that events are forgotten only in spans running from one manipulation index to another.
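
To make this concrete, here is a minimal sketch using a simplified, hypothetical event model (not the SDK's actual types): a manipulation index is any position that does not split a tool call from its observations.

from dataclasses import dataclass

@dataclass
class Event:
    # Simplified stand-in for an SDK event: tool_call_id ties an action
    # to its observation(s); None marks a plain message event.
    tool_call_id: str | None = None

def manipulation_indices(events: list[Event]) -> list[int]:
    """Return the positions where the history can be cut without
    separating a tool call from its matching observation(s)."""
    indices = [0]
    i = 0
    while i < len(events):
        j = i + 1
        if events[i].tool_call_id is not None:
            # Consume the whole atomic batch sharing this tool_call_id.
            while j < len(events) and events[j].tool_call_id == events[i].tool_call_id:
                j += 1
        indices.append(j)
        i = j
    return indices

For events = [Event(), Event("a"), Event("a"), Event()] this yields [0, 1, 3, 4]: position 2, which falls between the call and its observation, is not a legal cut point.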

Design Choices

We could make the View object itself guarantee that these properties are never violated. Unfortunately, that would let a condenser produce a Condensation that violates a property, leaving the View responsible for deciding how to fix it. You could then no longer tell what a condensation does purely by reading the event; you would also need to know how the views are processed.

So instead the View informs the condensers of where changes can be made. This keeps the condensation events literal, and it means we can enforce more or fewer constraints on the conversation history simply by modifying the code that generates the manipulation indices.

The API restrictions are currently codified in a few functions in the View class: _enforce_batch_atomicity and filter_unmatched_tool_calls. These behave exactly as before, but have been extended to emit warnings if their property is violated by a condenser. We can remove them and simplify the View at a later date.
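
The warn-on-repair pattern is simple; here is a minimal sketch of it (reusing the Event stand-in above, not the actual View code):

import logging
from collections import Counter

logger = logging.getLogger("view")

def filter_unmatched_tool_calls(events: list[Event]) -> list[Event]:
    """Drop events whose tool_call_id has no partner (a call missing its
    observation, or vice versa), and warn: a well-behaved condenser
    should never hand the View such a history."""
    counts = Counter(e.tool_call_id for e in events if e.tool_call_id)
    kept = [e for e in events if e.tool_call_id is None or counts[e.tool_call_id] > 1]
    if len(kept) != len(events):
        logger.warning(
            "condenser violated tool-call matching; dropped %d event(s)",
            len(events) - len(kept),
        )
    return kept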

Tradeoffs

This does mean there is some slack in the condenser's solutions. The condenser determines forgetting ranges based on the resource usage (events, tokens) of individual events, and these computed ranges are then "projected" into the manipulation index space. As a result, some condensations may not match their intuitive semantics exactly, but we gain the guarantee that the resulting conversation history is well-formed.
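
Concretely, the projection can be as simple as snapping each endpoint of a raw forgetting range outward to the nearest legal cut point (again an illustrative sketch, not the SDK's code):

import bisect

def project_range(start: int, end: int, indices: list[int]) -> tuple[int, int]:
    """Widen the raw range [start, end) to the enclosing manipulation
    indices, so the forgotten span never splits an atomic batch."""
    lo = indices[bisect.bisect_right(indices, start) - 1]  # largest index <= start
    hi = indices[bisect.bisect_left(indices, end)]         # smallest index >= end
    return lo, hi

Snapping outward forgets slightly more than the raw range asked for; snapping inward would forget less. Either policy is defensible, and that choice is exactly the slack described above.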

Summary of Changes

  • A View.get_manipulation_indices function to tell condensers where changes can be made (see the usage sketch after this list).
  • Modifications to View methods to emit a warning when they "fix" the conversation history.
  • Modifications to the LLMSummarizingCondenser to use the manipulation indices. No other current condenser needs to be updated.
  • Unit tests for all of the above.
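
Assuming behavior like the sketches above, a condenser's use of the new method is roughly the following (hypothetical glue code building on project_range above, not the PR's exact implementation):

def condense_range(view, raw_start: int, raw_end: int) -> tuple[int, int]:
    # Snap the desired forgetting range onto the View's legal cut points
    # before emitting a Condensation event.
    indices = view.get_manipulation_indices()
    return project_range(raw_start, raw_end, indices)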

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant   Architectures   Base Image                                    Docs / Tags
java      amd64, arm64    eclipse-temurin:17-jdk                        Link
python    amd64, arm64    nikolaik/python-nodejs:python3.12-nodejs22    Link
golang    amd64, arm64    golang:1.21-bookworm                          Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:24c1423-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-24c1423-python \
  ghcr.io/openhands/agent-server:24c1423-python

All tags pushed for this build

ghcr.io/openhands/agent-server:24c1423-golang-amd64
ghcr.io/openhands/agent-server:24c1423-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:24c1423-golang-arm64
ghcr.io/openhands/agent-server:24c1423-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:24c1423-java-amd64
ghcr.io/openhands/agent-server:24c1423-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:24c1423-java-arm64
ghcr.io/openhands/agent-server:24c1423-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:24c1423-python-amd64
ghcr.io/openhands/agent-server:24c1423-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:24c1423-python-arm64
ghcr.io/openhands/agent-server:24c1423-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:24c1423-golang
ghcr.io/openhands/agent-server:24c1423-java
ghcr.io/openhands/agent-server:24c1423-python

About Multi-Architecture Support

  • Each variant tag (e.g., 24c1423-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 24c1423-python-amd64) are also available if needed


github-actions bot commented Dec 16, 2025

Coverage

Coverage Report

File                                                                          Stmts   Miss   Cover
openhands-sdk/openhands/sdk/context/view.py                                     194     77     60%
  Missing: 87, 92, 97–98, 103–104, 109–113, 137–138, 141–147, 150–152, 156, 160–163, 166–168, 172–174, 176, 180–182, 185, 188, 190–191, 193, 209–213, 215, 247–248, 279, 290–291, 299, 302, 358–361, 363–365, 376–377, 379, 381, 403–406, 409, 411–412, 419, 421–422
openhands-sdk/openhands/sdk/context/condenser/llm_summarizing_condenser.py      90     55     38%
  Missing: 47, 54, 68, 72–73, 76–79, 82–83, 85, 88–89, 97, 99–103, 105, 125, 127, 134, 138, 142–146, 148, 171–172, 174, 176–178, 180–182, 184, 188–189, 191–192, 194, 204, 207, 210, 215, 218, 221, 229–230, 234
TOTAL                                                                         13578   6213     54%

Calvin Smith added 2 commits December 18, 2025 10:30
  • blocks are preserved
Base automatically changed from csmith49/token-aware-condensation to main December 23, 2025 21:34

openhands-ai bot commented Dec 23, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1412 at branch `csmith49/tool-call-aware-condensation`

Feel free to include any additional details that might help me get this PR into a better state.

When thinking is enabled, the Claude API requires that the final assistant
message starts with a thinking block. After condensation, some ActionEvents
may have thinking blocks while others don't, causing API rejection.

This fix:
1. Extends manipulation_indices to track thinking blocks in batches and
   merge all batches from the last thinking batch to the end as a single
   atomic unit, preventing partial removal that would leave inconsistent
   thinking block state.

2. Adds _enforce_thinking_block_consistency method to ensure that when a
   batch with thinking blocks is removed, all subsequent batches without
   thinking blocks are also removed.

3. Updates existing tests to include thinking_blocks attribute on mock
   ActionEvent objects.

4. Adds comprehensive tests for thinking block consistency scenarios.

Fixes #1438

Co-authored-by: openhands <openhands@all-hands.dev>
@csmith49 csmith49 added the integration-test Runs the integration tests and comments the results label Dec 23, 2025
@github-actions

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Instead of merging all batches from the last thinking batch to the end,
we now simply remove cut points after batches without thinking blocks.
This ensures any valid cut leaves a final batch with thinking blocks,
while giving the condenser more manipulation points.

The key insight: cut points are only allowed after batches WITH thinking.
This way, the final batch after any cut will always have thinking blocks.

Co-authored-by: openhands <openhands@all-hands.dev>
The previous implementation only removed cut points immediately after
non-thinking batches. But if non-batch events (like Condensation or
ConversationErrorEvent) follow a non-thinking batch, those cut points
were incorrectly kept.

The fix uses a whitelist approach: only allow cut points that are either
before the first batch (no batches kept) or immediately after a batch
WITH thinking blocks.

For the trajectory in issue #1438 (188 events, 90 batches, 3 with thinking):
- Before: 12 manipulation indices (including invalid ones like 187, 188)
- After: 6 manipulation indices (all valid: 0, 1, 2, 4, 61, 126)

Co-authored-by: openhands <openhands@all-hands.dev>
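
The whitelist rule from these two commits fits in a few lines (an illustrative sketch; each batch is summarized by a boolean for "carries thinking blocks"):

def thinking_safe_cuts(batch_bounds: list[int], has_thinking: list[bool]) -> list[int]:
    """batch_bounds[k] is the event index where batch k starts, with a
    trailing sentinel for the end of history; has_thinking[k] says
    whether batch k carries thinking blocks."""
    cuts = [batch_bounds[0]]  # before the first batch: no batches kept
    for k, thinking in enumerate(has_thinking):
        if thinking:
            # Only immediately after a batch WITH thinking blocks: the
            # last batch kept by such a cut starts with a thinking block,
            # as the Claude API requires.
            cuts.append(batch_bounds[k + 1])
    return cuts
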
@github-actions

🧪 Integration Tests Results

Overall Success Rate: 94.1%
Total Cost: $2.26
Models Tested: 6
Timestamp: 2025-12-23 23:18:06 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model                                         Overall  Integration (Required)  Behavior (Optional)  Tests Passed  Skipped  Total  Cost   Tokens
litellm_proxy_vertex_ai_gemini_3_pro_preview  100.0%   100.0%                  N/A                  9/9           0        9      $0.53  448,075
litellm_proxy_gpt_5.1_codex_max               77.8%    77.8%                   N/A                  7/9           0        9      $0.16  216,463
litellm_proxy_claude_sonnet_4_5_20250929      100.0%   100.0%                  N/A                  9/9           0        9      $0.53  260,798
litellm_proxy_mistral_devstral_2512           87.5%    87.5%                   N/A                  7/8           1        9      $0.16  393,687
litellm_proxy_deepseek_deepseek_chat          100.0%   100.0%                  N/A                  8/8           1        9      $0.04  406,096
litellm_proxy_moonshot_kimi_k2_thinking       100.0%   100.0%                  N/A                  8/8           1        9      $0.84  1,361,148

📋 Detailed Results

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.53
  • Token Usage: prompt: 432,840, completion: 15,235, cache_read: 290,173, reasoning: 10,602
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_e624161_gemini_3_pro_run_N9_20251223_230626

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 77.8% (7/9)
  • Integration Tests (Required): 77.8% (7/9)
  • Total Cost: $0.16
  • Token Usage: prompt: 212,710, completion: 3,753, cache_read: 130,176, reasoning: 1,920
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_e624161_gpt51_codex_run_N9_20251223_230625

Failed Tests:

  • t09_token_condenser ⚠️ REQUIRED: Condensation not triggered. Token counting may not work. (Cost: $0.0014)
  • t08_image_file_viewing ⚠️ REQUIRED: Agent did not identify yellow color in the logo. Response: i’m sorry—i don’t actually see the image contents. could you re-upload the logo.png here so i can check its colors? (Cost: $0.0071)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.53
  • Token Usage: prompt: 247,414, completion: 13,384, cache_read: 172,954, cache_write: 73,438, reasoning: 2,282
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e624161_sonnet_run_N9_20251223_230626

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.16
  • Token Usage: prompt: 389,856, completion: 3,831
  • Run Suffix: litellm_proxy_mistral_devstral_2512_e624161_devstral_2512_run_N9_20251223_230617
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0084)

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.04
  • Token Usage: prompt: 395,193, completion: 10,903, cache_read: 370,560
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_e624161_deepseek_run_N9_20251223_230629
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.84
  • Token Usage: prompt: 1,350,037, completion: 11,111, cache_read: 1,246,613
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e624161_kimi_k2_run_N9_20251223_230629
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

@csmith49 csmith49 added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Dec 23, 2025
@github-actions

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@github-actions

🧪 Integration Tests Results

Overall Success Rate: 96.1%
Total Cost: $1.65
Models Tested: 6
Timestamp: 2025-12-23 23:32:07 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

Model                                         Overall  Integration (Required)  Behavior (Optional)  Tests Passed  Skipped  Total  Cost   Tokens
litellm_proxy_gpt_5.1_codex_max               88.9%    88.9%                   N/A                  8/9           0        9      $0.42  472,752
litellm_proxy_claude_sonnet_4_5_20250929      100.0%   100.0%                  N/A                  9/9           0        9      $0.46  275,856
litellm_proxy_deepseek_deepseek_chat          100.0%   100.0%                  N/A                  8/8           1        9      $0.04  395,014
litellm_proxy_mistral_devstral_2512           87.5%    87.5%                   N/A                  7/8           1        9      $0.15  363,650
litellm_proxy_moonshot_kimi_k2_thinking       100.0%   100.0%                  N/A                  8/8           1        9      $0.17  256,547
litellm_proxy_vertex_ai_gemini_3_pro_preview  100.0%   100.0%                  N/A                  9/9           0        9      $0.40  258,147

📋 Detailed Results

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 88.9% (8/9)
  • Integration Tests (Required): 88.9% (8/9)
  • Total Cost: $0.42
  • Token Usage: prompt: 452,881, completion: 19,871, cache_read: 307,840, reasoning: 17,344
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_e43a697_gpt51_codex_run_N9_20251223_232521

Failed Tests:

  • t09_token_condenser ⚠️ REQUIRED: Condensation not triggered. Token counting may not work. (Cost: $0.22)

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.46
  • Token Usage: prompt: 267,212, completion: 8,644, cache_read: 194,003, cache_write: 72,725, reasoning: 2,502
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e43a697_sonnet_run_N9_20251223_232520

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.04
  • Token Usage: prompt: 386,606, completion: 8,408, cache_read: 352,320
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_e43a697_deepseek_run_N9_20251223_232533
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 87.5% (7/8)
  • Integration Tests (Required): 87.5% (7/9)
  • Total Cost: $0.15
  • Token Usage: prompt: 359,891, completion: 3,759
  • Run Suffix: litellm_proxy_mistral_devstral_2512_e43a697_devstral_2512_run_N9_20251223_232519
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0085)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.17
  • Token Usage: prompt: 247,309, completion: 9,238, cache_read: 200,341
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e43a697_kimi_k2_run_N9_20251223_232524
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.40
  • Token Usage: prompt: 245,020, completion: 13,127, cache_read: 134,900, reasoning: 9,186
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_e43a697_gemini_3_pro_run_N9_20251223_232521

@csmith49 csmith49 merged commit 014c6d4 into main Dec 23, 2025
35 checks passed
@csmith49 csmith49 deleted the csmith49/tool-call-aware-condensation branch December 23, 2025 23:44