fix(mcp): make interactive tool calls transport-drop-safe (supersede, reap, keep-alive)#51
Merged
Merged
Conversation
The local MCP transport can drop an in-flight call (the model sees "transport dropped mid-call; response for tool <name> was lost"). For tools that block on a user grant or answer (ask_user, approval-gated bash_exec / fs_request_grant) the outcome is then unknown and a stale approval/question card can linger in the UI, so the model must re-issue the same interactive call rather than assume an answer or proceed. Add a scoped 'Interactive Tool Reliability' section to the system prompt, emitted only when such a tool is present, plus tests for presence/absence.
…, keep-alive A CLI transport drop used to orphan in-flight interactive waits (ask_user, bash approvals, path grants): the rmcp stateful session worker keeps the tool future alive past the dropped connection, so the abandoned-wait guards never fired, pending registry entries (and their UI cards) lingered until the 55-min interactive timeout, and a re-asked command stacked a duplicate card next to the stale one. Three root fixes: - Supersede on re-ask: registering a bash approval (same run + command), path grant (same run + path + access), or ask_user (same session) takes the stale pending entry, emits resolved/attention so the old card is replaced in place, and wakes the orphaned wait with a channel-closed error that it now treats as superseded (disarm, never cancel the live run). - Run-end reaping: BindingGuard owns a run-scope CancellationToken, cancelled on drop; execute_bound_tool races it, so orphaned tool futures are dropped at run end and their RAII cleanup guards finally fire as documented. - Keep rmcp's default sse_keep_alive (15s) instead of disabling it; idle unpinged response streams during long human waits are exactly what rotted into transport drops. The None came from the pre-1.7 struct literal, not a decision. Workspace deletion now cancels the runs whose approvals it purges (purge_workspace returns run ids), since the awaiting side no longer treats a closed channel as cancel-the-run. The Interactive Tool Reliability prompt section drops the 'user can dismiss the stale card' caveat: re-asking now replaces the card automatically, so the model just re-issues the same call.
39e62ab to
e8ec4d3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When the Claude Code CLI drops the local MCP transport mid-call (
transport dropped mid-call; response for tool <name> was lost) during an interactive tool call —ask_user, an approval-gatedbash_exec, orfs_request_grant— the in-flight wait was orphaned:call_toolin a detached session worker (spawn_session_worker) that survives the dropped connection, so the abandoned-wait RAII guards (which fire on future drop) never ran, despite their doc comments claiming they handled transport drops.CLI_INTERACTIVE_WAIT_TIMEOUT); the run wasn't stopped.ask_userentry also keptsession_has_pending_asktrue, suppressing mid-run input delivery.Fix (three layers, root-cause first)
resolved/attention so the stale card is replaced in place the moment the fresh one appears, and wakes the orphaned wait with a channel-closed error that is now treated as superseded (guard disarmed — never cancels the live run that's waiting on the new request).BindingGuardnow owns a run-scopeCancellationToken, cancelled on drop;execute_bound_toolraces it. Any tool future still in flight when the run ends (only possible for orphans — a live CLI waits for its tool results) is dropped, which finally fires the existing cleanup guards exactly as they were documented to.sse_keep_aliveis left at rmcp's defaultSome(15s)instead ofNone. TheNonewas carried over from the pre-1.7 struct literal, not a decision; an idle unpinged response stream during a long human wait is precisely what rots into a transport drop.Consequential fix:
workspace_deletenow explicitly cancels the runs whose pending approvals it purges (purge_workspacereturns the awaiting run ids), since a closed approval channel no longer implies run cancellation.Prompt change (first commit, amended by the second)
The scoped Interactive Tool Reliability section (emitted only when an interactive tool is present) still tells the model a dropped interactive call has an UNKNOWN outcome and to re-issue the same call once — but the "tell the user they can dismiss the stale card" caveat is gone: superseding replaces the card automatically.
Tests
BindingGuarddrop cancels the run scope + unbinds without touching the run's cancel token; the reap arm drops the racing future (RAII guards fire).take_supersededmatches only same run + command / run + path + access; channel-close as supersede signal; counts maintained.ask_user::take_for_sessionscoping + supersededsubmit_answerrejection.cargo test --lib: 611 passed.cargo fmt --check,cargo clippy --lib --tests,tsc --noEmitclean.