Summary
When context cache is near full, Late can emit an oversized spawn_subagent tool call. The tool argument JSON appears to be truncated mid-string, and the backend (llama.cpp OpenAI-compatible server) returns HTTP 500 with a parse error. The subagent spawn fails instead of recovering gracefully.
Environment
- OS: Ubuntu 24.04 LTS
- Client: Late (planning flow with subagent spawning)
- Backend: llama.cpp / llama-swap router
- Endpoint:
http://127.0.0.1/v1/chat/completions
- Context state at failure: near max (
n_tokens = 262143, truncated = 1)
Steps to Reproduce
- Run a long planning session until context cache is nearly full.
- Trigger a
spawn_subagent call with a large multi-paragraph goal payload.
- Let the model stream tool-call arguments.
- Observe backend response and Late behavior.
Expected Behavior
- Late should avoid sending oversized tool-call arguments when context is near limit.
- If tool-call JSON is truncated or invalid, Late should recover gracefully (retry with compact args or ask for a smaller payload).
- Error surfaced to user should be actionable.
Actual Behavior
- Backend returns HTTP 500 due to malformed tool-call arguments JSON.
- Subagent spawn fails.
- Delegation step does not complete.
Error Excerpt
LLAMA-CPP ERROR:
prompt eval time = 221.03 ms / 22 tokens
eval time = 88800.20 ms / 1535 tokens
total time = 89021.23 ms / 1557 tokens
slot release: n_tokens = 262143, truncated = 1
got exception: Failed to parse tool call arguments as JSON: parse error ... invalid string: missing closing quote
POST /v1/chat/completions ... 500
Impact
- Long-running sessions become unreliable right when delegation is needed most.
- High-context workflows can fail depending on tool-call argument size.
- Users lose time recovering from failed subagent spawns.
Suggested Fix Direction
- Add a preflight context-budget check before requests with tools.
- Enforce a maximum size for
spawn_subagent goal payloads.
- Add automatic retry on malformed tool-call argument errors with a compact re-prompt.
- Surface backend error body in Late so the exact failure reason is visible.
Temporary Workarounds
- Start a fresh session before spawning subagents when context is large.
- Keep subagent goal text short and atomic.
- Put long implementation details in a file and reference it instead of embedding large payloads.
Summary
When context cache is near full, Late can emit an oversized
spawn_subagenttool call. The tool argument JSON appears to be truncated mid-string, and the backend (llama.cpp OpenAI-compatible server) returns HTTP 500 with a parse error. The subagent spawn fails instead of recovering gracefully.Environment
http://127.0.0.1/v1/chat/completionsn_tokens = 262143,truncated = 1)Steps to Reproduce
spawn_subagentcall with a large multi-paragraph goal payload.Expected Behavior
Actual Behavior
Error Excerpt
LLAMA-CPP ERROR:
Impact
Suggested Fix Direction
spawn_subagentgoal payloads.Temporary Workarounds