Skip to content

fix(tui): contain Windows shell process trees#2498

Open
aboimpinto wants to merge 5 commits into
Hmbown:mainfrom
aboimpinto:fix/windows-shell-job-object
Open

fix(tui): contain Windows shell process trees#2498
aboimpinto wants to merge 5 commits into
Hmbown:mainfrom
aboimpinto:fix/windows-shell-job-object

Conversation

@aboimpinto
Copy link
Copy Markdown
Contributor

@aboimpinto aboimpinto commented Jun 1, 2026

Summary

Partial #1812.

This PR contains Windows shell process trees with a Job Object so exec_shell output collection cannot block forever when a descendant process outlives the immediate shell and keeps inherited stdout/stderr pipe handles open.

Why

I captured a live Windows freeze where CodeWhale was stuck after starting this foreground shell command:

findstr /s /m "019e65ab-22c2-7cc0-ab61-bd8411e3e7c4" C:\myWork\AboimPintoConsulting\*

The immediate shell process had already exited, but an orphaned findstr.exe descendant was still alive and holding inherited pipe handles. BackgroundShell::collect_output() then joined the stdout/stderr reader threads, but those readers never saw EOF until the descendant was killed. Killing only findstr.exe immediately released the frozen TUI and the log emitted the delayed tool.exec.end after about 18.8 minutes.

This is a separate freeze mode from the crossterm input-poll freeze discussed in #1812. It is still in the same Windows/shell family, but the blocking point here is shell output collection after process-tree leakage.

What changed

  • Adds a Windows-only WindowsJob wrapper around Job Objects.
  • Sets JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE.
  • Assigns spawned non-PTY shell children to the job immediately after spawn.
  • Terminates the job before joining shell output reader threads.
  • Uses the same job termination path for normal completion, timeout, cancellation, and drop cleanup.
  • Adds a Windows regression test that starts a detached descendant with inherited pipes and verifies get_output returns quickly.

Scope

This should fix the observed exec_shell hang where a descendant process keeps stdout/stderr pipes open after the shell exits.

It does not claim to solve every Windows TUI freeze. In particular, it does not replace the crossterm event::poll(timeout) investigation/mitigation, and it does not address context-pressure or model-streaming stalls.

Validation

cargo fmt --check
git diff --check
cargo check -p codewhale-tui
cargo test -p codewhale-tui background_collection_does_not_block_on_detached_descendant_pipe -- --nocapture

Local Windows note: the GNU toolchain needed C:\msys64\mingw64\bin on PATH for dlltool.exe. The test command was also run with CARGO_INCREMENTAL=0 and RUSTFLAGS="-C debuginfo=0" after the local disk filled during test binary linking.

Paulo Aboim Pinto

Greptile Summary

This PR introduces a WindowsJob RAII wrapper around Windows Job Objects to contain shell process trees during non-PTY exec_shell runs. The immediate child is assigned to a job with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE set so that descendants holding inherited pipe handles are reliably killed before stdout/stderr reader threads are joined, eliminating the hang observed with orphaned findstr.exe and similar detached descendants.

  • Adds WindowsJob wrapper with attach_to_child, terminate, and Drop (via CloseHandle) covering all three execution paths: execute_sync_sandboxed, execute_interactive_sandboxed, and start_background / collect_output.
  • Introduces layered cleanup: explicit TerminateJobObject first, JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE as an automatic fallback when the handle is closed, and a per-child kill() fallback when termination access is denied.
  • Adds four Windows-only regression tests covering the detached-descendant hang, the TerminateJobObject-denied fallback, kill-on-close as a secondary defence, and the end-to-end get_output non-block scenario.

Confidence Score: 5/5

Safe to merge — the Windows-only job-object path is well-isolated behind cfg(windows) guards, all three execution paths drop the job handle before joining reader threads, and the KILL_ON_JOB_CLOSE fallback fires correctly even when explicit TerminateJobObject is denied.

All three shell execution paths (sync sandboxed, interactive sandboxed, background) now drop the WindowsJob handle before joining stdout/stderr reader threads, making KILL_ON_JOB_CLOSE an effective second-line defence. Attach failures are surfaced via tracing::warn rather than silently swallowed. The drop ordering fix from a prior review round has been applied consistently. Four targeted regression tests, including a deny-terminate scenario via DuplicateHandle, give good confidence the fix holds under access-constrained environments.

No files require special attention.

Important Files Changed

Filename Overview
crates/tui/src/tools/shell.rs Core implementation of WindowsJob wrapper and all three cleanup paths; job handle is correctly dropped before thread joins in every code path, and attach failures produce a tracing::warn rather than silently degrading.
crates/tui/src/tools/shell/tests.rs Adds four Windows-only tests including a duplicate-handle helper to simulate TerminateJobObject access denial; timing assertions use a 6 s bound against a 3 s get_output timeout providing adequate slack.
crates/tui/Cargo.toml Adds Win32_Security and Win32_System_JobObjects feature flags to the windows crate dependency; no version bump required.

Sequence Diagram

sequenceDiagram
    participant SM as ShellManager
    participant CH as Child Process
    participant WJ as WindowsJob (KILL_ON_JOB_CLOSE)
    participant RT as Reader Threads
    participant DC as Descendant Processes

    SM->>CH: spawn()
    SM->>WJ: CreateJobObjectW + SetInformationJobObject
    SM->>WJ: AssignProcessToJobObject(child)
    CH-->>DC: spawns descendants (inherit pipes)
    CH->>CH: exits (immediate shell)
    DC-->>RT: hold pipe write-ends open

    Note over SM: poll() detects shell exit

    SM->>WJ: TerminateJobObject (explicit)
    WJ->>DC: kill all job processes
    SM->>WJ: drop handle (CloseHandle)
    Note over WJ: KILL_ON_JOB_CLOSE fires as fallback

    WJ-->>DC: kill remaining descendants
    DC-->>RT: pipe write-ends closed, EOF
    SM->>RT: join stdout_thread
    SM->>RT: join stderr_thread
    RT-->>SM: output collected (no hang)
Loading

Reviews (7): Last reviewed commit: "fix(tui): close Windows job before foreg..." | Re-trigger Greptile

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Windows Job Objects to manage and reliably terminate child processes and their descendants on Windows, preventing orphaned processes from leaking or blocking output collection. The changes include updating the windows crate features in Cargo.toml, implementing a WindowsJob wrapper, integrating it into BackgroundShell and ShellManager, and adding a test to verify the behavior. Feedback suggests improving the robustness of terminate_windows_job by falling back to killing the immediate child process if terminating the job object fails.

Comment thread crates/tui/src/tools/shell.rs
Comment thread crates/tui/src/tools/shell.rs Outdated
Comment thread crates/tui/src/tools/shell/tests.rs
@aboimpinto
Copy link
Copy Markdown
Contributor Author

Addressed the review feedback in commit 3ab06d92.

What changed:

  • terminate_windows_job now falls back to child.kill() if TerminateJobObject fails, so the immediate child still gets terminated defensively.
  • Windows job attach failures are no longer silent. The shell code now emits a tracing::warn! with the command and error when job assignment fails, making degraded cleanup diagnosable.
  • The Windows regression test now uses a shorter detached ping descendant and a wider elapsed-time assertion guard, so slow CI machines do not fail the test just because they were close to the get_output timeout boundary.

Validation:

cargo fmt --check
git diff --check
cargo check -p codewhale-tui
cargo test -p codewhale-tui background_collection_does_not_block_on_detached_descendant_pipe -- --nocapture

Paulo Aboim Pinto

@aboimpinto
Copy link
Copy Markdown
Contributor Author

aboimpinto commented Jun 1, 2026

Correction: I removed the temporary turn-dispatch diagnostics from this PR branch after confirming they are debug-only instrumentation.

The PR now contains only the Windows process-tree containment fix. The dispatch diagnostics were moved to the local DEV worktree only, in local commit ecfa5705, so we can reproduce and inspect the next dispatch-timeout instance without proposing that logging upstream.

Paulo Aboim Pinto

Comment thread crates/tui/src/tools/shell.rs Outdated
@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented Jun 1, 2026

Harvested both Windows job-object commits into codex/v0.8.50-triage while leaving this PR open because it is still marked draft.

The scope is Windows-only and the PR CI was green when reviewed; on the triage branch I also ran the orphaned subprocess regression and codewhale-tui clippy. Thank you for tightening shell cleanup here.

@aboimpinto
Copy link
Copy Markdown
Contributor Author

Added regression coverage to address the confidence gap around Windows job cleanup fallback behavior.

What changed:

  • Added a Windows test helper that duplicates a job handle without JOB_OBJECT_TERMINATE access, so tests can force TerminateJobObject to fail deterministically.
  • Added a regression test proving erminate_windows_job falls back to child.kill() when job termination is denied.
  • Added a regression test proving collect_output() still unblocks pipe-reader joins through JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE when explicit job termination is denied.

Validation run locally on Windows:

  • cargo fmt --check
  • cargo check -p codewhale-tui
  • cargo test -p codewhale-tui windows_job -- --nocapture
  • cargo test -p codewhale-tui background_collection_does_not_block_on_detached_descendant_pipe -- --nocapture
  • git diff --check

Paulo Aboim Pinto

Comment thread crates/tui/src/tools/shell.rs
@aboimpinto aboimpinto marked this pull request as ready for review June 1, 2026 13:33
@Hmbown
Copy link
Copy Markdown
Owner

Hmbown commented Jun 1, 2026

Follow-up harvest complete: the v0.8.50 triage branch #2504 now includes the final #2498 head through 96adffb2, including the close-before-output/foreground-join changes and Windows fallback tests.

Harvest commits on #2504:

  • 14ea0721a from 4db07a451682
  • 54a93994f from f46eb7e6644
  • 2c256d7b3 from 96adffb2438

Thanks for catching the gap before #2504 moved further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants