Skip to content

fix: Prevent exec_shell timeout deadlock on Windows#2573

Open
idling11 wants to merge 2 commits into
Hmbown:mainfrom
idling11:fix/exec-shell-timeout-deadlock
Open

fix: Prevent exec_shell timeout deadlock on Windows#2573
idling11 wants to merge 2 commits into
Hmbown:mainfrom
idling11:fix/exec-shell-timeout-deadlock

Conversation

@idling11
Copy link
Copy Markdown
Contributor

@idling11 idling11 commented Jun 2, 2026

Summary

Replace thread.join() with bounded channel receive in the shell
timeout path so a hung read_to_end on a killed-process pipe does
not deadlock the global tool_exec_lock.

Closes: #2571

Problem

execute_sync_sandboxed spawned reader threads that called
read_to_end on stdout/stderr pipes. When the child process timed
out and was killed on Windows, the pipe handles were not always
closed promptly, causing read_to_end to hang indefinitely.
thread.join() waited forever, the function never returned, and
the tool_exec_lock write-guard was never dropped — deadlocking
all subsequent tool calls across the entire session.

Fix

Reader threads now send results through std::sync::mpsc::channel
instead of returning from thread::spawn. In the normal path we
use recv() (equivalent to the old join()). In the timeout path
we use recv_timeout(READER_JOIN_TIMEOUT) (5 seconds) — if the
reader doesn't finish within 5s after the child is killed, we
proceed with empty output and the orphaned thread eventually
completes on its own.

Files Changed

File Change
crates/tui/src/tools/shell.rs +22/-10

Tests

Full suite: 3815 passed, 5 pre-existing failures

Greptile Summary

This PR replaces thread.join() with mpsc::channel + recv_timeout in execute_sync_sandboxed to prevent a Windows deadlock where a killed process's pipe handles weren't closed promptly, leaving read_to_end — and therefore the tool_exec_lock — hung indefinitely.

  • Both the timeout branch and the success branch now use recv_timeout(5 s) for each reader thread, with unwrap_or_default() returning empty output if the timeout is hit; detached threads continue in the background.
  • In the success branch, if recv_timeout expires (e.g. a grandchild process still holds the pipe), the caller receives ShellStatus::Completed with empty stdout/stderr and no indicator in ShellResult that the output was silently discarded — distinguishable from a command that genuinely produced no output only by adding an explicit flag or log.

Confidence Score: 4/5

The deadlock fix is correct and meaningfully improves Windows reliability, but the success path silently discards stdout/stderr when recv_timeout expires, which could confuse callers that trust a Completed status to mean they have full output.

The core deadlock fix works as described. The remaining concern is in the normal (non-timeout) branch: if recv_timeout fires there, the function returns Completed with empty output and no way for callers to detect the data was dropped. This is a real data-loss scenario for users running commands that spawn background children holding the pipe open.

crates/tui/src/tools/shell.rs — specifically the success-exit branch where recv_timeout can silently drop output without setting any flag in ShellResult.

Important Files Changed

Filename Overview
crates/tui/src/tools/shell.rs Replaces thread join with mpsc channels and recv_timeout in both the success and timeout branches; fixes the Windows pipe-hang deadlock but introduces silent output loss when recv_timeout expires in the success-exit path.

Sequence Diagram

sequenceDiagram
    participant Main as Main Thread
    participant Child as Child Process
    participant SOut as stdout_thread
    participant SErr as stderr_thread

    Main->>Child: spawn()
    Main->>SOut: thread::spawn (read_to_end → stdout_tx)
    Main->>SErr: thread::spawn (read_to_end → stderr_tx)
    Main->>Child: wait_timeout(timeout)

    alt Process exits in time
        Child-->>Main: Some(status)
        Main->>Main: recv_timeout(5s) on stdout_rx
        Main->>Main: recv_timeout(5s) on stderr_rx
        Note over Main: unwrap_or_default() on timeout<br/>silent empty output if pipe hangs
        Main-->>Main: return ShellStatus::Completed
    else Timeout
        Child-->>Main: None
        Main->>Child: kill()
        Child-->>Main: wait()
        Main->>Main: recv_timeout(5s) on stdout_rx
        Main->>Main: recv_timeout(5s) on stderr_rx
        Note over Main: unwrap_or_default() on timeout<br/>orphan threads continue in background
        Main-->>Main: return ShellStatus::TimedOut
    end
Loading

Fix All in Codex Fix All in Claude Code Fix All in Cursor

Reviews (2): Last reviewed commit: "fix: apply recv_timeout to success path ..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Comment thread crates/tui/src/tools/shell.rs Outdated
Comment thread crates/tui/src/tools/shell.rs
Comment thread crates/tui/src/tools/shell.rs Outdated
Hmbown added a commit that referenced this pull request Jun 2, 2026
Refs #2571

Harvests the core idea from PR #2573 by @idling11, with local cleanup for normal-exit inherited pipe handles and a foreground orphan-pipe regression.

Co-authored-by: Hanmiao Li <894876246@qq.com>
@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented Jun 2, 2026

Thanks @idling11. This is a good catch and the root-cause shape matches the Windows durable-task deadlock report in #2571.

I harvested the bounded reader-drain idea into the v0.8.50 triage branch as #2504 commit 4a0919743, with you credited as co-author. I tightened the local version in two ways before landing it there:

  • both foreground timeout and normal-exit output drains use bounded reader receives, so the tool executor cannot wedge while holding the global tool lock;
  • the Unix normal-exit path also cleans the process group before draining output, covering foreground commands that exit after leaving a descendant holding inherited stdout/stderr pipes open.

Verification on the release branch:

  • cargo test -p codewhale-tui --all-features --locked foreground_shell_does_not_block_on_orphaned_subprocess_pipe -- --nocapture
  • cargo test -p codewhale-tui --all-features --locked test_orphaned_subprocess_does_not_block_collect_output -- --nocapture
  • cargo test -p codewhale-tui --all-features --locked test_timeout -- --nocapture
  • cargo test -p codewhale-tui --all-features --locked test_exec_shell_foreground_timeout_guides_background_rerun -- --nocapture
  • cargo clippy -p codewhale-tui --all-targets --all-features --locked -- -D warnings
  • cargo fmt --all -- --check
  • ./scripts/release/check-versions.sh
  • git diff --check

I’m keeping this PR visible while #2504 finishes CI, but the fix path for v0.8.50 is now the harvested release branch rather than merging this PR as-is. Thank you for moving fast on this one.

Comment on lines 912 to +918
if let Some(status) = child.wait_timeout(timeout)? {
let stdout = stdout_thread.join().unwrap_or_default();
let stderr = stderr_thread.join().unwrap_or_default();
let stdout = stdout_rx
.recv_timeout(READER_JOIN_TIMEOUT)
.unwrap_or_default();
let stderr = stderr_rx
.recv_timeout(READER_JOIN_TIMEOUT)
.unwrap_or_default();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent output loss on the success path

When a process exits cleanly but a grandchild holds onto the pipe write-handle (common in shell scripts that launch background jobs), read_to_end in the reader thread never returns, recv_timeout expires after 5 s, and unwrap_or_default() silently substitutes an empty Vec<u8>. The caller receives ShellStatus::Completed with empty stdout/stderr — indistinguishable from a command that genuinely produced no output. Unlike the timeout branch (which sets ShellStatus::TimedOut), there is no field in ShellResult to signal that output was discarded, so the data loss is invisible to both the tool layer and the end user.

Fix in Codex Fix in Claude Code Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug Report: Tool Executor Deadlock After Long-Running exec_shell in Durable Task

2 participants