
fix(tui): count unreconciled subagents against the cap#2505

Draft
cyq1017 wants to merge 1 commit into Hmbown:main from cyq1017:codex/2211-subagent-cap

Conversation

Contributor

@cyq1017 cyq1017 commented Jun 1, 2026

Problem

Change

  • Keep Running sub-agents with task handles counted until status reconciliation catches up.
  • Continue ignoring malformed/running records that never received a task handle.
  • Add focused coverage for the reconciled, finished-handle, and missing-handle cases.

Verification

  • cargo test -p codewhale-tui running_count --all-features --locked -- --nocapture
  • cargo check -p codewhale-tui --all-features --locked
  • cargo fmt --all -- --check
  • git diff --check origin/main..HEAD

Refs #2211

Greptile Summary

This PR tightens the sub-agent concurrency cap by removing the !handle.is_finished() exemption from running_count(). Previously, once a task's JoinHandle indicated completion, the agent was immediately excluded from the cap even if its Running status had not yet been reconciled to a terminal state — allowing fanout bursts to overfill the pool (#2211). Now an agent with Running status counts against the cap until update_from_result (or update_failed) atomically sets a terminal status and clears task_handle.

  • mod.rs: running_count() now counts any Running agent with a non-None task_handle, regardless of whether the task has already finished; the slot is released only when status reconciliation runs.
  • tests.rs: The renamed test test_running_count_counts_running_agents_until_status_reconciles flips its assertion from 0 to 1 for a finished-but-unreconciled handle, and two new companion tests cover the reconciled and no-handle cases.
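The counting rule described above can be sketched in isolation. This is a minimal model, not the code from mod.rs: `Status`, `FakeHandle`, `Agent`, `running_count_old`, `running_count_new`, and `sample_agents` are hypothetical stand-ins, and `FakeHandle` only imitates the `is_finished()` call of a real task handle.

```rust
// Hypothetical stand-ins; the real types live in
// crates/tui/src/tools/subagent/mod.rs and are not reproduced here.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Status {
    Running,
    Completed,
}

/// Stand-in for a task handle; it only models `is_finished()`.
pub struct FakeHandle {
    pub finished: bool,
}

impl FakeHandle {
    pub fn is_finished(&self) -> bool {
        self.finished
    }
}

pub struct Agent {
    pub status: Status,
    pub task_handle: Option<FakeHandle>,
}

/// Old rule: a Running agent stops counting as soon as its handle finishes.
pub fn running_count_old(agents: &[Agent]) -> usize {
    agents
        .iter()
        .filter(|a| a.status == Status::Running)
        .filter(|a| a.task_handle.as_ref().is_some_and(|h| !h.is_finished()))
        .count()
}

/// New rule: a Running agent counts for as long as it still holds a handle.
pub fn running_count_new(agents: &[Agent]) -> usize {
    agents
        .iter()
        .filter(|a| a.status == Status::Running && a.task_handle.is_some())
        .count()
}

/// One agent per case covered by the PR's tests.
pub fn sample_agents() -> Vec<Agent> {
    vec![
        // Finished handle, status not yet reconciled.
        Agent { status: Status::Running, task_handle: Some(FakeHandle { finished: true }) },
        // Reconciled: terminal status, handle cleared.
        Agent { status: Status::Completed, task_handle: None },
        // Running record that never received a handle.
        Agent { status: Status::Running, task_handle: None },
    ]
}
```

In this model only the first agent differs between the two rules: the old rule ignores it, the new rule keeps it counted until its status is reconciled.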

Confidence Score: 4/5

Safe to merge; the race-condition fix is correct for all normal execution paths and the one edge case (panic inside the subagent task) is low probability given the codebase's error-handling style.

The logic change closes the fanout race window correctly: cap slots are now held until update_from_result runs, which atomically updates both the status and clears task_handle. The only gap is that spawn_supervised catches panics without calling update_from_result, so a panic inside run_subagent_task before that final write would leave an agent permanently occupying a slot with no automatic recovery path within the session. No unwrap/expect/panic! calls were found in the file, making this scenario unlikely in practice, but the supervision wrapper creates a silent path where the invariant can break.

The spawn_supervised call site in mod.rs — specifically whether the panic handler should also call update_failed — warrants a second look alongside this change.

Important Files Changed

| Filename | Overview |
| --- | --- |
| crates/tui/src/tools/subagent/mod.rs | Core logic change: removes !handle.is_finished() from running_count() so that agents whose task has ended but whose status hasn't been reconciled yet still occupy a cap slot. The normal happy-path is correct; a panic edge case (task panics before update_from_result runs) leaves the slot permanently occupied. |
| crates/tui/src/tools/subagent/tests.rs | Test renamed and assertion inverted to match new behavior; existing no-handle and live-handle tests are unmodified and still pass. Minor: test_running_count_counts_only_agents_with_live_task_handles name now understates the new semantics (finished handles also count), but the assertion itself is still correct. |

Sequence Diagram

sequenceDiagram
    participant P as Parent Agent
    participant M as SubAgentManager
    participant T as run_subagent_task (tokio)

    P->>M: "spawn_background() — running_count() < max"
    M->>M: "agent.status = Running, task_handle = Some(handle)"
    M-->>P: Ok(snapshot)

    note over T: Task executes…

    T->>T: "task finishes (handle.is_finished() = true)"

    note over M,T: Race window — OLD code excluded finished handles here,<br/>freeing the cap slot before reconciliation

    T->>M: write lock → update_from_result()
    M->>M: "agent.status = Completed/Cancelled/Failed"
    M->>M: "agent.task_handle = None"

    note over M: running_count() now excludes agent (status ≠ Running)

    P->>M: "spawn_background() — running_count() < max ✓"
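The sequence in the diagram can be replayed with a small synchronous model. This is a sketch under stated assumptions: `Manager`, `mark_task_finished`, and the `Option<bool>` handle are hypothetical simplifications (the bool imitates `is_finished()`), and the real manager is async and lock-protected.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Status {
    Running,
    Completed,
}

pub struct Agent {
    pub status: Status,
    /// `Some(finished)` stands in for `Some(JoinHandle)`;
    /// the bool imitates `is_finished()`.
    pub task_handle: Option<bool>,
}

/// Hypothetical, synchronous stand-in for the sub-agent manager.
pub struct Manager {
    pub max: usize,
    pub agents: HashMap<u32, Agent>,
    next_id: u32,
}

impl Manager {
    pub fn new(max: usize) -> Self {
        Manager { max, agents: HashMap::new(), next_id: 0 }
    }

    /// New rule: Running agents count while they still hold a handle.
    pub fn running_count(&self) -> usize {
        self.agents
            .values()
            .filter(|a| a.status == Status::Running && a.task_handle.is_some())
            .count()
    }

    /// Refuses to spawn once the cap is reached.
    pub fn spawn_background(&mut self) -> Result<u32, String> {
        if self.running_count() >= self.max {
            return Err(format!("sub-agent cap reached ({})", self.max));
        }
        let id = self.next_id;
        self.next_id += 1;
        self.agents
            .insert(id, Agent { status: Status::Running, task_handle: Some(false) });
        Ok(id)
    }

    /// The task body has ended, but reconciliation has not run yet.
    pub fn mark_task_finished(&mut self, id: u32) {
        if let Some(agent) = self.agents.get_mut(&id) {
            if let Some(finished) = agent.task_handle.as_mut() {
                *finished = true;
            }
        }
    }

    /// Reconciliation: set a terminal status and clear the handle together.
    pub fn update_from_result(&mut self, id: u32) {
        if let Some(agent) = self.agents.get_mut(&id) {
            agent.status = Status::Completed;
            agent.task_handle = None;
        }
    }
}
```

With `max = 1`, a second `spawn_background()` is rejected inside the race window (after `mark_task_finished`, before `update_from_result`) and accepted once reconciliation has run.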

Comments Outside Diff (2)

  1. crates/tui/src/tools/subagent/mod.rs, line 1392-1396 (link)

    P2 Panic before update_from_result permanently occupies a cap slot

    spawn_supervised wraps the future in catch_unwind, so if run_subagent_task panics before reaching the update_from_result / update_failed call at the end, the JoinHandle resolves to () but neither the status nor task_handle is updated. With the old !handle.is_finished() guard, a finished handle (including a panicked one) was excluded from running_count(), which freed the slot automatically. Under the new logic, the agent stays Running with a finished handle in task_handle and is counted indefinitely, blocking that cap slot until the user manually cancels the agent or the application restarts. The cleanup() path never evicts Running agents, so there is no automatic recovery within a session. This scenario requires a panic inside run_subagent_task (none found today), but spawn_supervised is precisely where that path can occur silently.


  2. crates/tui/src/tools/subagent/tests.rs, line 888 (link)

    P2 The test name test_running_count_counts_only_agents_with_live_task_handles now slightly misrepresents the semantics: after this PR, finished-but-unreconciled handles are also counted. Renaming it avoids confusion for the next reader.




Reviews (1): Last reviewed commit: "fix(tui): hold subagent cap until status..." | Re-trigger Greptile
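One way to address the panic edge case raised in the first comment is to have the supervision wrapper reconcile the status itself when the task body panics. The sketch below is a synchronous, std-only illustration of that idea, not the project's spawn_supervised: `run_supervised`, `Shared`, `has_handle`, and the helper functions are hypothetical names, and the real code path is async.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::sync::{Arc, Mutex, MutexGuard};

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum Status {
    Running,
    Completed,
    Failed,
}

pub struct Agent {
    pub status: Status,
    /// Stands in for `task_handle.is_some()`.
    pub has_handle: bool,
}

pub type Shared = Arc<Mutex<Agent>>;

/// Lock helper that tolerates a poisoned mutex.
fn lock(agent: &Shared) -> MutexGuard<'_, Agent> {
    agent.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

pub fn new_running_agent() -> Shared {
    Arc::new(Mutex::new(Agent { status: Status::Running, has_handle: true }))
}

pub fn status_of(agent: &Shared) -> Status {
    lock(agent).status
}

pub fn counts_against_cap(agent: &Shared) -> bool {
    let a = lock(agent);
    a.status == Status::Running && a.has_handle
}

/// Normal reconciliation: terminal status and handle cleared together.
pub fn update_from_result(agent: &Shared) {
    let mut a = lock(agent);
    a.status = Status::Completed;
    a.has_handle = false;
}

/// Failure reconciliation, also used as the panic fallback below.
pub fn update_failed(agent: &Shared) {
    let mut a = lock(agent);
    a.status = Status::Failed;
    a.has_handle = false;
}

/// Hypothetical supervisor: if `body` panics before it reconciles,
/// mark the agent Failed so its cap slot is released.
pub fn run_supervised<F>(agent: &Shared, body: F)
where
    F: FnOnce(&Shared),
{
    let outcome = catch_unwind(AssertUnwindSafe(|| body(agent)));
    if outcome.is_err() {
        update_failed(agent);
    }
}
```

For example, `run_supervised(&agent, |_| panic!("simulated"))` leaves the agent in `Failed` with no handle, so it no longer counts against the cap. Whether the real panic handler should do this is the open question the review raises.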

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request modifies the subagent manager to keep counting recently finished task handles until their terminal status has reconciled, preventing premature cap refills during fanout bursts. Feedback on these changes suggests simplifying the task handle check in mod.rs to directly return agent.task_handle.is_some(), and replacing the busy-yield loop in the test file with a more robust synchronization mechanism, such as a oneshot channel, to avoid potential flakiness under heavy CI load.

Comment on lines +1252 to +1258
    + if agent.task_handle.is_none() {
          return false;
    - };
    - // Exclude agents whose task has finished (status will be updated to Completed shortly)
    - !handle.is_finished()
    + }
    + // Keep recently finished handles counted until the terminal
    + // status update has reconciled. Otherwise a fanout burst can
    + // refill the cap before the UI/state catches up (#2211).
    + true

medium

The check for agent.task_handle.is_none() followed by returning false and then returning true can be simplified to directly returning agent.task_handle.is_some(). This is more idiomatic and concise.

                // Keep recently finished handles counted until the terminal
                // status update has reconciled. Otherwise a fanout burst can
                // refill the cap before the UI/state catches up (#2211).
                agent.task_handle.is_some()

Comment on lines +956 to +960
    let finished_handle = tokio::spawn(async {});
    while !finished_handle.is_finished() {
        tokio::task::yield_now().await;
    }
    agent.task_handle = Some(finished_handle);

medium

Using a busy-yield loop (while !finished_handle.is_finished() { tokio::task::yield_now().await; }) to wait for a spawned task to finish can be CPU-intensive and potentially flaky under heavy CI load. A more robust approach is to synchronize using a oneshot channel to signal completion before checking is_finished().

    let (tx, rx) = tokio::sync::oneshot::channel();
    let finished_handle = tokio::spawn(async move {
        let _ = tx.send(());
    });
    let _ = rx.await;
    while !finished_handle.is_finished() {
        tokio::task::yield_now().await;
    }
    agent.task_handle = Some(finished_handle);
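The same synchronization pattern can be tried without an async runtime. The sketch below swaps in std threads and `std::sync::mpsc` for tokio tasks and the oneshot channel, purely as an analogy; `finished_handle` is a hypothetical helper. It also shows why the loop stays in the suggestion: the signal is sent from inside the task body, so receiving it does not by itself guarantee that `is_finished()` already returns true.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

/// Returns a handle whose thread has observably finished.
pub fn finished_handle() -> JoinHandle<()> {
    let (tx, rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        // Signal that the body ran; ignore the error if the receiver is gone.
        let _ = tx.send(());
    });

    // Block until the body has signalled, instead of spinning from the start.
    let _ = rx.recv();

    // The signal is sent before the thread exits, so a short wait can
    // still be needed before the handle reports completion.
    while !handle.is_finished() {
        thread::yield_now();
    }
    handle
}
```

The channel shortens the spin to the gap between the send and the thread's exit; it does not remove the need to check `is_finished()`.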

@Hmbown
Owner

Hmbown commented Jun 1, 2026

Thanks @cyq1017. I harvested the narrow cap-reconciliation fix into the v0.8.50 triage PR as #2504 commit bc34cd13e (fix(tui): hold subagent cap until status reconciles).

Local verification on the harvest branch:

  • cargo test -p codewhale-tui --all-features --locked test_running_count_counts_running_agents_until_status_reconciles
  • cargo test -p codewhale-tui --all-features --locked subagent
  • cargo fmt --all -- --check

I am treating this as a partial #2211 fix only: it covers the finished-handle/status-reconciliation cap race, but the broader queue/status-panel/worktree-pressure criteria remain open.
