fix: Prevent exec_shell timeout deadlock on Windows #2573
Thanks @idling11. This is a good catch and the root-cause shape matches the Windows durable-task deadlock report in #2571. I harvested the bounded reader-drain idea into the v0.8.50 triage branch as #2504 commit
Verification on the release branch:
I’m keeping this PR visible while #2504 finishes CI, but the fix path for v0.8.50 is now the harvested release branch rather than merging this PR as-is. Thank you for moving fast on this one.
  if let Some(status) = child.wait_timeout(timeout)? {
-     let stdout = stdout_thread.join().unwrap_or_default();
-     let stderr = stderr_thread.join().unwrap_or_default();
+     let stdout = stdout_rx
+         .recv_timeout(READER_JOIN_TIMEOUT)
+         .unwrap_or_default();
+     let stderr = stderr_rx
+         .recv_timeout(READER_JOIN_TIMEOUT)
+         .unwrap_or_default();
Silent output loss on the success path
When a process exits cleanly but a grandchild holds onto the pipe write-handle (common in shell scripts that launch background jobs), read_to_end in the reader thread never returns, recv_timeout expires after 5 s, and unwrap_or_default() silently substitutes an empty Vec<u8>. The caller receives ShellStatus::Completed with empty stdout/stderr — indistinguishable from a command that genuinely produced no output. Unlike the timeout branch (which sets ShellStatus::TimedOut), there is no field in ShellResult to signal that output was discarded, so the data loss is invisible to both the tool layer and the end user.
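One way to make the loss visible, sketched here under assumptions: the `Drained` enum and `drain_with_flag` function are hypothetical names for illustration, not part of this PR or of the real ShellResult. The idea is to keep the timeout outcome as a distinct value instead of collapsing it into an empty Vec<u8>, so the caller can decide how to report it.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of collecting one output stream.
#[derive(Debug, PartialEq)]
enum Drained {
    /// The reader thread finished and delivered everything it read.
    Complete(Vec<u8>),
    /// The reader did not deliver in time (or went away); output was discarded.
    Abandoned,
}

/// Wait at most `limit` for the reader thread, keeping the failure visible
/// rather than substituting an empty buffer.
fn drain_with_flag(rx: &Receiver<Vec<u8>>, limit: Duration) -> Drained {
    match rx.recv_timeout(limit) {
        Ok(bytes) => Drained::Complete(bytes),
        // Timeout: the reader is still blocked, e.g. a grandchild holds the pipe.
        // Disconnected: the sender was dropped without sending anything.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            Drained::Abandoned
        }
    }
}
```

A caller could map `Drained::Abandoned` onto a boolean field or a log line; which of those fits the existing result type is left open here.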
Summary
Replace thread.join() with bounded channel receive in the shell timeout path so a hung read_to_end on a killed-process pipe does not deadlock the global tool_exec_lock.
Closes: #2571
Problem
execute_sync_sandboxed spawned reader threads that called read_to_end on stdout/stderr pipes. When the child process timed out and was killed on Windows, the pipe handles were not always closed promptly, causing read_to_end to hang indefinitely. thread.join() waited forever, the function never returned, and the tool_exec_lock write-guard was never dropped, deadlocking all subsequent tool calls across the entire session.
Fix
Reader threads now send results through std::sync::mpsc::channel instead of returning from thread::spawn. In the normal path we use recv() (equivalent to the old join()). In the timeout path we use recv_timeout(READER_JOIN_TIMEOUT) (5 seconds): if the reader doesn't finish within 5s after the child is killed, we proceed with empty output and the orphaned thread eventually completes on its own.
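The pattern described above can be sketched in isolation. This is a minimal illustration, not the code in crates/tui/src/tools/shell.rs: `spawn_reader`, `drain_bounded`, and `StuckReader` are hypothetical names, and `StuckReader` only simulates a pipe whose write end is never closed.

```rust
use std::io::Read;
use std::sync::mpsc::{self, Receiver};
use std::thread;
use std::time::Duration;

/// Spawn a detached thread that drains `reader` and sends the bytes over a channel.
fn spawn_reader<R: Read + Send + 'static>(mut reader: R) -> Receiver<Vec<u8>> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut buf = Vec::new();
        // Keep whatever was read even if read_to_end fails part-way.
        let _ = reader.read_to_end(&mut buf);
        // The receiver may already have given up; ignore the send error.
        let _ = tx.send(buf);
    });
    rx
}

/// Wait at most `limit` for the reader thread; fall back to empty output.
fn drain_bounded(rx: &Receiver<Vec<u8>>, limit: Duration) -> Vec<u8> {
    rx.recv_timeout(limit).unwrap_or_default()
}

/// Hypothetical stand-in for a pipe that never reaches EOF.
struct StuckReader;

impl Read for StuckReader {
    fn read(&mut self, _buf: &mut [u8]) -> std::io::Result<usize> {
        // Block forever; park() can wake spuriously, hence the loop.
        loop {
            thread::park();
        }
    }
}
```

Because the reader thread is never joined, a stuck read costs one parked thread rather than the caller's lock. With a well-behaved reader such as `std::io::Cursor`, `drain_bounded` returns the full contents; with `StuckReader` it returns an empty Vec once `limit` elapses.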
Files Changed
crates/tui/src/tools/shell.rs
Tests
Full suite: 3815 passed, 5 pre-existing failures
Greptile Summary
This PR replaces thread.join() with mpsc::channel + recv_timeout in execute_sync_sandboxed to prevent a Windows deadlock where a killed process's pipe handles weren't closed promptly, leaving read_to_end (and therefore the tool_exec_lock) hung indefinitely.
Uses recv_timeout(5 s) for each reader thread, with unwrap_or_default() returning empty output if the timeout is hit; detached threads continue in the background.
When recv_timeout expires (e.g. a grandchild process still holds the pipe), the caller receives ShellStatus::Completed with empty stdout/stderr and no indicator in ShellResult that the output was silently discarded. This case is distinguishable from a command that genuinely produced no output only by adding an explicit flag or log.
Confidence Score: 4/5
The deadlock fix is correct and meaningfully improves Windows reliability, but the success path silently discards stdout/stderr when recv_timeout expires, which could confuse callers that trust a Completed status to mean they have full output.
The core deadlock fix works as described. The remaining concern is in the normal (non-timeout) branch: if recv_timeout fires there, the function returns Completed with empty output and no way for callers to detect the data was dropped. This is a real data-loss scenario for users running commands that spawn background children holding the pipe open.
crates/tui/src/tools/shell.rs — specifically the success-exit branch where recv_timeout can silently drop output without setting any flag in ShellResult.
Important Files Changed
Sequence Diagram
sequenceDiagram
    participant Main as Main Thread
    participant Child as Child Process
    participant SOut as stdout_thread
    participant SErr as stderr_thread
    Main->>Child: spawn()
    Main->>SOut: thread::spawn (read_to_end → stdout_tx)
    Main->>SErr: thread::spawn (read_to_end → stderr_tx)
    Main->>Child: wait_timeout(timeout)
    alt Process exits in time
        Child-->>Main: Some(status)
        Main->>Main: recv_timeout(5s) on stdout_rx
        Main->>Main: recv_timeout(5s) on stderr_rx
        Note over Main: unwrap_or_default() on timeout<br/>silent empty output if pipe hangs
        Main-->>Main: return ShellStatus::Completed
    else Timeout
        Child-->>Main: None
        Main->>Child: kill()
        Child-->>Main: wait()
        Main->>Main: recv_timeout(5s) on stdout_rx
        Main->>Main: recv_timeout(5s) on stderr_rx
        Note over Main: unwrap_or_default() on timeout<br/>orphan threads continue in background
        Main-->>Main: return ShellStatus::TimedOut
    end
Reviews (2): Last reviewed commit: "fix: apply recv_timeout to success path ..."