Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Hi! I started running the behavior tests on your PR. You will receive a comment with the results shortly.

Python API breakage checks — ✅ PASSED
REST API breakage checks (OpenAPI) — ✅ PASSED
all-hands-bot
left a comment
🟢 Good taste - Clean release version bump
All packages consistently updated from 1.14.0 → 1.14.1, lock file synced, and eval workflow default updated. No issues found. Ready to merge once checklist items are completed. 🚀
🧪 Integration Tests Results
Overall Success Rate: 76.7%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner
Skipped Tests:
litellm_proxy_gemini_3_pro_preview
litellm_proxy_anthropic_claude_sonnet_4_6
litellm_proxy_moonshot_kimi_k2_thinking
Skipped Tests:
Failed Tests:
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 25.3s | $0.02 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 20.9s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 13.3s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 31.6s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 14.4s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 54.3s | $0.05 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 33.9s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 11.4s | $0.00 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 32.2s | $0.02 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 2m 36s | $0.18 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 17.3s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 23.8s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 15.9s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 17.4s | $0.02 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 10.3s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 17.4s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 12s | $0.01 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 4m 29s | $0.34 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 1m 16s | $0.08 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 20.5s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ❌ FAIL (Exit code 1) | 12.3s | -- |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 48.1s | $0.04 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 21.0s | $0.02 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 3m 31s | $0.24 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 14.0s | $0.01 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 2m 49s | $0.23 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 10.9s | $0.00 |
| 01_standalone_sdk/37_llm_profile_store/main.py | ✅ PASS | 9.2s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 27.7s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 10.1s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 26.9s | $0.10 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 28.0s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 56.6s | $0.06 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 7.2s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 8.7s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 2m 16s | $0.17 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 37.6s | $0.02 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 28s | $0.02 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 1m 3s | $0.05 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 15s | $0.04 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 38.1s | $0.04 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 40s | $0.02 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 1m 0s | $0.05 |
| 02_remote_agent_server/10_cloud_workspace_share_credentials.py | ❌ FAIL (Exit code 1) | 6.8s | -- |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 39.6s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 54.1s | $0.09 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 19.1s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 27.5s | $0.03 |
❌ Some tests failed
Total: 48 | Passed: 46 | Failed: 2 | Total Cost: $2.28
Failed examples:
- examples/01_standalone_sdk/28_ask_agent_example.py: Exit code 1
- examples/02_remote_agent_server/10_cloud_workspace_share_credentials.py: Exit code 1
🧪 Integration Tests Results
Overall Success Rate: 60.0%

📁 Detailed Logs & Artifacts
Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.
📊 Summary
📋 Detailed Results

litellm_proxy_deepseek_deepseek_reasoner
Failed Tests:
Problems:
Positive Aspects:
Expected Behavior:
The unauthorized modification of LLM defaults and over-verification of non-terminal tests represents a deviation from the evaluation criteria. (confidence=0.70) (Cost: $0.09)

litellm_proxy_gemini_3_pro_preview
Failed Tests:
Issues with the Approach:
What Was Done Correctly:
Assessment: While the end result is technically correct, the execution pattern violates explicit instructions (use

litellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
However, the agent violated the explicit evaluation criteria by creating an unrequested file:
AGENTS.md is:
While AGENTS.md represents a good-faith effort to document repository patterns for future agents, it falls outside the scope of what was requested. The user asked only for a training script following the format of the existing example - nothing more. (confidence=0.88) (Cost: $1.77)

litellm_proxy_moonshot_kimi_k2_thinking
Failed Tests:
@OpenHands Read this report #2548 (comment) and focus on Kimi-K2 thinking. Follow the execution path and understand where we set reasoning_effort for Kimi K2 thinking. Then fix it. Try to use clean code, without an overly ugly special case; or, if we need this as a special case, list the possible places in the code and choose the one where we get info from litellm on reasoning. Unless this was enabled by code somewhere else, in which case remove that code. This PR is a release PR; we must not work here. New branch, open a new PR. Babysit it to green CI, make it ready for review, and monitor for some 5 mins until an AI reviewer posts comments and inline reviews; read them and address them if necessary. Tag enyst, and only enyst, in your PR description.

I'm on it! enyst can track my progress at all-hands.dev
Here’s the status based on the work so far:

Summary
I have not completed the full request yet. I investigated the Kimi-K2 thinking failure path and made a partial code change locally, but I stopped before validation / PR creation because of the follow-up design concern about depending more on LiteLLM.

...

Re-stating the answer to your question about dependency on LiteLLM
Your concern is valid. The clean proposed fix makes us rely more directly on LiteLLM’s parameter support metadata instead of its broader reasoning-model metadata. I think that is still the better signal here, because the failure is specifically about whether LiteLLM will accept the
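The gating idea discussed here (only send `reasoning_effort` when the target provider actually accepts it) can be sketched as a small pure helper. This is a hypothetical illustration, not the SDK's actual code: in practice the `supported` set could come from LiteLLM's parameter-support metadata rather than being passed in by hand, and the helper name is made up.

```python
def filter_unsupported_params(call_kwargs: dict, supported: set) -> dict:
    """Drop optional reasoning params the target provider will not accept.

    `supported` stands in for LiteLLM's parameter-support metadata; it is
    passed in explicitly here so the sketch stays self-contained.
    """
    # Params that are safe to silently drop when unsupported.
    OPTIONAL_PARAMS = {"reasoning_effort", "thinking"}
    return {
        key: value
        for key, value in call_kwargs.items()
        if key not in OPTIONAL_PARAMS or key in supported
    }


# A provider that rejects reasoning_effort simply never sees it:
kwargs = {"temperature": 0.2, "reasoning_effort": "high"}
print(filter_unsupported_params(kwargs, supported=set()))
```

This keeps the model-specific knowledge in one place instead of scattering per-model special cases through the call path.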
Status update for the Kimi K2 thinking fix:
CI status on PR #2549 right now:
Note: this comment was created by an AI assistant (OpenHands) on behalf of the user.
HUMAN: cc @xingyaoww The agent fixed Kimi K2 here:
Edited to add: technically it's the same for the other LLMs in tests. But it's the kind of thing that would make me re-run the other 2 labels (behavior and examples) 🤔
Or rather, I wouldn't re-run behavior, but I would re-run
🔄 Running Examples with
| Example | Status | Duration | Cost |
|---|---|---|---|
| 01_standalone_sdk/02_custom_tools.py | ✅ PASS | 23.7s | $0.02 |
| 01_standalone_sdk/03_activate_skill.py | ✅ PASS | 16.4s | $0.02 |
| 01_standalone_sdk/05_use_llm_registry.py | ✅ PASS | 14.3s | $0.01 |
| 01_standalone_sdk/07_mcp_integration.py | ✅ PASS | 28.6s | $0.02 |
| 01_standalone_sdk/09_pause_example.py | ✅ PASS | 14.8s | $0.01 |
| 01_standalone_sdk/10_persistence.py | ✅ PASS | 36.2s | $0.02 |
| 01_standalone_sdk/11_async.py | ✅ PASS | 30.5s | $0.04 |
| 01_standalone_sdk/12_custom_secrets.py | ✅ PASS | 11.3s | $0.00 |
| 01_standalone_sdk/13_get_llm_metrics.py | ✅ PASS | 42.6s | $0.03 |
| 01_standalone_sdk/14_context_condenser.py | ✅ PASS | 4m 15s | $0.30 |
| 01_standalone_sdk/17_image_input.py | ✅ PASS | 16.5s | $0.01 |
| 01_standalone_sdk/18_send_message_while_processing.py | ✅ PASS | 24.4s | $0.02 |
| 01_standalone_sdk/19_llm_routing.py | ✅ PASS | 16.0s | $0.02 |
| 01_standalone_sdk/20_stuck_detector.py | ✅ PASS | 18.5s | $0.03 |
| 01_standalone_sdk/21_generate_extraneous_conversation_costs.py | ✅ PASS | 13.1s | $0.00 |
| 01_standalone_sdk/22_anthropic_thinking.py | ✅ PASS | 24.8s | $0.01 |
| 01_standalone_sdk/23_responses_reasoning.py | ✅ PASS | 1m 21s | $0.02 |
| 01_standalone_sdk/24_planning_agent_workflow.py | ✅ PASS | 4m 0s | $0.33 |
| 01_standalone_sdk/25_agent_delegation.py | ✅ PASS | 54.3s | $0.07 |
| 01_standalone_sdk/26_custom_visualizer.py | ✅ PASS | 17.0s | $0.03 |
| 01_standalone_sdk/28_ask_agent_example.py | ✅ PASS | 29.2s | $0.02 |
| 01_standalone_sdk/29_llm_streaming.py | ✅ PASS | 45.6s | $0.03 |
| 01_standalone_sdk/30_tom_agent.py | ✅ PASS | 9.5s | $0.01 |
| 01_standalone_sdk/31_iterative_refinement.py | ✅ PASS | 4m 26s | $0.34 |
| 01_standalone_sdk/32_configurable_security_policy.py | ✅ PASS | 21.1s | $0.02 |
| 01_standalone_sdk/34_critic_example.py | ✅ PASS | 2m 13s | $0.17 |
| 01_standalone_sdk/36_event_json_to_openai_messages.py | ✅ PASS | 12.1s | $0.01 |
| 01_standalone_sdk/37_llm_profile_store/main.py | ✅ PASS | 7.0s | $0.00 |
| 01_standalone_sdk/38_browser_session_recording.py | ✅ PASS | 42.1s | $0.03 |
| 01_standalone_sdk/39_llm_fallback.py | ✅ PASS | 10.1s | $0.01 |
| 01_standalone_sdk/40_acp_agent_example.py | ✅ PASS | 30.0s | $0.10 |
| 01_standalone_sdk/41_task_tool_set.py | ✅ PASS | 29.3s | $0.03 |
| 01_standalone_sdk/42_file_based_subagents.py | ✅ PASS | 42.4s | $0.05 |
| 01_standalone_sdk/43_mixed_marketplace_skills/main.py | ✅ PASS | 5.3s | $0.00 |
| 01_standalone_sdk/44_model_switching_in_convo.py | ✅ PASS | 8.2s | $0.01 |
| 01_standalone_sdk/45_parallel_tool_execution.py | ✅ PASS | 3m 9s | $0.41 |
| 02_remote_agent_server/01_convo_with_local_agent_server.py | ✅ PASS | 40.9s | $0.03 |
| 02_remote_agent_server/02_convo_with_docker_sandboxed_server.py | ✅ PASS | 1m 39s | $0.05 |
| 02_remote_agent_server/03_browser_use_with_docker_sandboxed_server.py | ✅ PASS | 1m 6s | $0.07 |
| 02_remote_agent_server/04_convo_with_api_sandboxed_server.py | ✅ PASS | 1m 3s | $0.03 |
| 02_remote_agent_server/07_convo_with_cloud_workspace.py | ✅ PASS | 33.6s | $0.03 |
| 02_remote_agent_server/08_convo_with_apptainer_sandboxed_server.py | ✅ PASS | 3m 53s | $0.03 |
| 02_remote_agent_server/09_acp_agent_with_remote_runtime.py | ✅ PASS | 1m 2s | $0.03 |
| 02_remote_agent_server/10_cloud_workspace_share_credentials.py | ❌ FAIL (Exit code 1) | 6.8s | -- |
| 04_llm_specific_tools/01_gpt5_apply_patch_preset.py | ✅ PASS | 30.9s | $0.03 |
| 04_llm_specific_tools/02_gemini_file_tools.py | ✅ PASS | 1m 42s | $0.09 |
| 05_skills_and_plugins/01_loading_agentskills/main.py | ✅ PASS | 19.6s | $0.01 |
| 05_skills_and_plugins/02_loading_plugins/main.py | ✅ PASS | 23.4s | $0.03 |
❌ Some tests failed
Total: 48 | Passed: 47 | Failed: 1 | Total Cost: $2.68
Failed examples:
- examples/02_remote_agent_server/10_cloud_workspace_share_credentials.py: Exit code 1
Is the cloud example supposed to work right now? I guess maybe it uses the version currently deployed on prod.
@OpenHands Look at this #2548 (comment) and #2548 (comment). Find the PR that added this test. Clone yourself the OpenHands repo from the same org, and look up the linked issue and the counterpart PR or PRs in that repo. I suspect maybe we haven't made a release on OH with those included? Verify all I said. You have time; investigate deeply, and feel free to actually search the repos for the right timeline and status here.
I'm on it! enyst can track my progress at all-hands.dev |
Correction to my earlier wording: I re-checked the actual value returned from cloud for

Using:

I verified safely that the returned

I also tried a cheap completion against the returned base URL using the cloud-returned key and got

So the more accurate prod status is:

So for SDK credential inheritance on prod right now, both halves are effectively broken:
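The "cheap completion" probe described above can be sketched with the standard library alone. Everything here is illustrative: the function name, base URL, model name, and payload shape are assumptions (an OpenAI-compatible `/chat/completions` endpoint), not the actual script that was run against prod.

```python
import json
import urllib.request


def build_probe_request(base_url: str, api_key: str, model: str) -> urllib.request.Request:
    """Build a minimal one-token chat completion request to sanity-check a key."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,  # keep the probe as cheap as possible
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_probe_request("https://llm-proxy.example.invalid/v1", "sk-example", "some-model")
print(req.full_url)
```

Actually sending the request with `urllib.request.urlopen(req)` raises `HTTPError` on a 401, which is how an invalid cloud-returned key would show up.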
cc: @xingyaoww It seems my initial guess on the Cloud example may have been right: the OH enterprise side is not deployed. Shall we remove it from the PR for now, or do we want to wait for a patch / deployment on Cloud? |
Yeah, I think it is probably OK to ignore that for now in this release, since the get-secret functionality should be up in the next few days - as long as we revert OpenHands/docs#401 first, and then re-introduce its content when it is deployed. @OpenHands can you create two PRs for me: the first reverts docs PR 401, and the second re-introduces it (merge AFTER release).
I'm on it! xingyaoww can track my progress at all-hands.dev
Summary of Work
I created two PRs in the OpenHands/docs repository as requested.

Checklist
Conciseness
The changes are minimal and focused — each PR touches only the single file (
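At the git level, the revert-then-reintroduce flow described here is just two reverts stacked on the original commit: the first backs the docs change out before the release, and the second (merged after the release) reverts the revert, restoring the content. A minimal sketch in a throwaway repo, driven from Python; the file and commit names are made up, and the real change is whatever OpenHands/docs#401 touched:

```python
import subprocess
import tempfile
from pathlib import Path


def git(*args: str, repo: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ("git", *args), cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


repo = tempfile.mkdtemp()
git("init", "-q", repo=repo)
git("config", "user.email", "dev@example.com", repo=repo)
git("config", "user.name", "dev", repo=repo)

# Original docs change (stand-in for PR 401).
Path(repo, "cloud-secrets.md").write_text("secrets docs\n")
git("add", "cloud-secrets.md", repo=repo)
git("commit", "-qm", "docs: add cloud secrets page", repo=repo)

# PR 1: revert before cutting the release (the file disappears).
git("revert", "--no-edit", "HEAD", repo=repo)

# PR 2, merged AFTER the release: revert the revert, restoring the content.
git("revert", "--no-edit", "HEAD", repo=repo)
```

In the real workflow each revert would of course land on its own branch and go through a PR, but the resulting history is the same: three commits, with the file present again at HEAD.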
Since there have been more changes in the past few days, I'll close this and just cut a v1.15.0.
Release v1.14.1
This PR prepares the release for version 1.14.1.
Release Checklist
- `integration-test`
- `behavior-test`
- `test-examples`
- `v1.14.1`
- `rel-1.14.1`

Next Steps
Once the release is published on GitHub, the PyPI packages will be automatically published via the `pypi-release.yml` workflow.

Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.13-nodejs22`
- `golang:1.21-bookworm`

Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:e77cdd1-python

Run
All tags pushed for this build
About Multi-Architecture Support
- `e77cdd1-python` is a multi-arch manifest supporting both amd64 and arm64
- Platform-specific tags (e.g. `e77cdd1-python-amd64`) are also available if needed