fix(tui): count unreconciled subagents against the cap#2505
Conversation
There was a problem hiding this comment.
Code Review
This pull request modifies the subagent manager to keep counting recently finished task handles until their terminal status has reconciled, preventing premature cap refills during fanout bursts. Feedback on these changes suggests simplifying the task handle check in mod.rs to directly return agent.task_handle.is_some(), and replacing the busy-yield loop in the test file with a more robust synchronization mechanism, such as a oneshot channel, to avoid potential flakiness under heavy CI load.
| if agent.task_handle.is_none() { | ||
| return false; | ||
| }; | ||
| // Exclude agents whose task has finished (status will be updated to Completed shortly) | ||
| !handle.is_finished() | ||
| } | ||
| // Keep recently finished handles counted until the terminal | ||
| // status update has reconciled. Otherwise a fanout burst can | ||
| // refill the cap before the UI/state catches up (#2211). | ||
| true |
There was a problem hiding this comment.
The check for agent.task_handle.is_none() followed by returning false and then returning true can be simplified to directly returning agent.task_handle.is_some(). This is more idiomatic and concise.
// Keep recently finished handles counted until the terminal
// status update has reconciled. Otherwise a fanout burst can
// refill the cap before the UI/state catches up (#2211).
agent.task_handle.is_some()| let finished_handle = tokio::spawn(async {}); | ||
| while !finished_handle.is_finished() { | ||
| tokio::task::yield_now().await; | ||
| } | ||
| agent.task_handle = Some(finished_handle); |
There was a problem hiding this comment.
Using a busy-yield loop (while !finished_handle.is_finished() { tokio::task::yield_now().await; }) to wait for a spawned task to finish can be CPU-intensive and potentially flaky under heavy CI load. A more robust approach is to synchronize using a oneshot channel to signal completion before checking is_finished().
let (tx, rx) = tokio::sync::oneshot::channel();
let finished_handle = tokio::spawn(async move {
let _ = tx.send(());
});
let _ = rx.await;
while !finished_handle.is_finished() {
tokio::task::yield_now().await;
}
agent.task_handle = Some(finished_handle);|
Thanks @cyq1017. I harvested the narrow cap-reconciliation fix into the v0.8.50 triage PR as #2504 commit Local verification on the harvest branch:
I am treating this as a partial #2211 fix only: it covers the finished-handle/status-reconciliation cap race, but the broader queue/status-panel/worktree-pressure criteria remain open. |
Problem
Running. That lets the parent refill the cap early during fanout bursts, which is one of the remaining bug(tui): sub-agent fanout plus hidden worktrees can saturate the TUI during release work #2211 pressure points.Change
Runningsub-agents with task handles counted until status reconciliation catches up.Verification
cargo test -p codewhale-tui running_count --all-features --locked -- --nocapturecargo check -p codewhale-tui --all-features --lockedcargo fmt --all -- --checkgit diff --check origin/main..HEADRefs #2211
Greptile Summary
This PR tightens the sub-agent concurrency cap by removing the
!handle.is_finished()exemption fromrunning_count(). Previously, once a task'sJoinHandleindicated completion, the agent was immediately excluded from the cap even if itsRunningstatus had not yet been reconciled to a terminal state — allowing fanout bursts to overfill the pool (#2211). Now an agent withRunningstatus counts against the cap untilupdate_from_result(orupdate_failed) atomically sets a terminal status and clearstask_handle.mod.rs:running_count()now counts anyRunningagent with a non-Nonetask_handle, regardless of whether the task has already finished; the slot is released only when status reconciliation runs.tests.rs: The renamed testtest_running_count_counts_running_agents_until_status_reconcilesflips its assertion from0to1for a finished-but-unreconciled handle, and two new companion tests cover the reconciled and no-handle cases.Confidence Score: 4/5
Safe to merge; the race-condition fix is correct for all normal execution paths and the one edge case (panic inside the subagent task) is low probability given the codebase's error-handling style.
The logic change closes the fanout race window correctly: cap slots are now held until
update_from_resultruns, which atomically updates both the status and clearstask_handle. The only gap is thatspawn_supervisedcatches panics without callingupdate_from_result, so a panic insiderun_subagent_taskbefore that final write would leave an agent permanently occupying a slot with no automatic recovery path within the session. Nounwrap/expect/panic!calls were found in the file, making this scenario unlikely in practice, but the supervision wrapper creates a silent path where the invariant can break.The
spawn_supervisedcall site inmod.rs— specifically whether the panic handler should also callupdate_failed— warrants a second look alongside this change.Important Files Changed
!handle.is_finished()fromrunning_count()so that agents whose task has ended but whose status hasn't been reconciled yet still occupy a cap slot. The normal happy-path is correct; a panic edge case (task panics beforeupdate_from_resultruns) leaves the slot permanently occupied.test_running_count_counts_only_agents_with_live_task_handlesname now understates the new semantics (finished handles also count), but the assertion itself is still correct.Sequence Diagram
sequenceDiagram participant P as Parent Agent participant M as SubAgentManager participant T as run_subagent_task (tokio) P->>M: "spawn_background() — running_count() < max" M->>M: "agent.status = Running, task_handle = Some(handle)" M-->>P: Ok(snapshot) note over T: Task executes… T->>T: "task finishes (handle.is_finished() = true)" note over M,T: Race window — OLD code excluded finished handles here,<br/>freeing the cap slot before reconciliation T->>M: write lock → update_from_result() M->>M: "agent.status = Completed/Cancelled/Failed" M->>M: "agent.task_handle = None" note over M: running_count() now excludes agent (status ≠ Running) P->>M: "spawn_background() — running_count() < max ✓"Comments Outside Diff (2)
crates/tui/src/tools/subagent/mod.rs, line 1392-1396 (link)update_from_resultpermanently occupies a cap slotspawn_supervisedwraps the future incatch_unwind, so ifrun_subagent_taskpanics before reaching theupdate_from_result/update_failedcall at the end, the JoinHandle resolves to()but neither the status nortask_handleare updated. With the old!handle.is_finished()guard, a finished handle (including a panicked one) was excluded fromrunning_count(), freeing the slot automatically. Under the new logic, the agent stays inRunningwith aSome(finished handle)and is counted indefinitely — blocking that cap slot until the user manually cancels the agent or the application restarts. Thecleanup()path never evictsRunningagents, so there is no automatic recovery within a session. This scenario requires a panic insiderun_subagent_task(none found today), butspawn_supervisedis precisely the place where that path exists silently.crates/tui/src/tools/subagent/tests.rs, line 888 (link)test_running_count_counts_only_agents_with_live_task_handlesnow slightly misrepresents the semantics: after this PR, finished-but-unreconciled handles are also counted. Renaming it avoids confusion for the next reader.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Reviews (1): Last reviewed commit: "fix(tui): hold subagent cap until status..." | Re-trigger Greptile