Investigate SDK premature session.idle — upstream fix or permanent workaround #389

@PureWeen

Context

The Copilot SDK/CLI emits premature session.idle events during long tool-executing turns, causing multi-agent workers to be collected with truncated responses. PR #375 added an elaborate workaround (RecoverFromPrematureIdleIfNeededAsync) using:

  1. ManualResetEventSlim signal set by re-arm events after premature idle
  2. events.jsonl mtime freshness detection (< 15s = CLI still writing)
  3. Multi-round recovery loop with bestResponse accumulation
  4. 120s recovery timeout
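
To make the moving parts concrete, here is a minimal Python sketch of the general shape of such a workaround. It is not the code from PR #375: the function names, the `rearm` event (standing in for the ManualResetEventSlim), the `read_response` callback and the polling interval are all assumptions made for illustration. Only the 15s freshness window and the 120s timeout come from the list above.

```python
import os
import threading
import time

FRESHNESS_WINDOW_S = 15.0   # from the issue: mtime younger than 15s = CLI still writing
RECOVERY_TIMEOUT_S = 120.0  # from the issue: overall recovery budget


def log_is_fresh(path, now=None, window=FRESHNESS_WINDOW_S):
    """Return True if the file at `path` was modified within `window` seconds."""
    now = time.time() if now is None else now
    try:
        return (now - os.path.getmtime(path)) < window
    except OSError:
        return False  # assumption: a missing log counts as "not being written"


def recover_from_premature_idle(rearm, read_response, log_path,
                                timeout=RECOVERY_TIMEOUT_S, poll=1.0):
    """Hypothetical multi-round loop that keeps the longest response seen."""
    best = read_response()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        remaining = max(deadline - time.monotonic(), 0.0)
        if rearm.wait(min(poll, remaining)):
            rearm.clear()  # a re-arm arrived: the turn resumed, start another round
            candidate = read_response()
            if len(candidate) > len(best):
                best = candidate
            continue
        if not log_is_fresh(log_path):
            break  # no re-arm and the log went quiet: treat the idle as genuine
    candidate = read_response()
    return candidate if len(candidate) > len(best) else best
```

Even in this reduced form, the loop has to juggle a signal, a timeout, and a filesystem check, which is the complexity the next section objects to.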

Why this is a hack

  • Relies on a filesystem side channel (file mtime) to infer SDK state
  • Normal completions can stall ~15s while freshness detection runs
  • The multi-round loop adds complexity and edge cases (OperationCanceledException handling, bestResponse scoping)
  • The root cause is the SDK emitting session.idle before the turn is actually complete

Additional finding: backgroundTasks field is inconsistently populated (Mar 2026)

The SessionIdleEvent.Data.BackgroundTasks field (agents/shells arrays) is not reliably populated when sub-agents are running. Within the same worker session, some session.idle events arrive with empty backgroundTasks while others correctly list running agents. This causes:

  1. session.idle arrives with empty backgroundTasks → IDLE-DEFER logic does NOT trigger → CompleteResponse fires → IsProcessing=false (premature)
  2. Sub-agent finishes → new TurnStartEvent arrives → EVT-REARM re-sets IsProcessing=true
  3. Cycle repeats multiple times per worker turn (observed 3x in a single turn)
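
The cycle above can be modelled as a small state machine. The following Python sketch is only an illustration of the described behavior, not the application's handler code: `WorkerState`, `handle`, and the event names are hypothetical, and the logic is reduced to the three rules listed.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerState:
    is_processing: bool = False
    completed_once: bool = False   # True after CompleteResponse has fired
    rearms: int = 0                # number of EVT-REARM recoveries
    log: list = field(default_factory=list)


def handle(state, event, background_tasks=()):
    """Apply one event to the worker state (hypothetical reduction of the logic)."""
    if event == "turn_start":
        if state.completed_once and not state.is_processing:
            # A turn started after we already completed: the idle was premature.
            state.rearms += 1
            state.log.append("EVT-REARM")
        state.is_processing = True
    elif event == "session_idle":
        if background_tasks:
            # Field populated: defer completion, keep processing.
            state.log.append("IDLE-DEFER")
        else:
            # Field empty: completion fires even if sub-agents are still running.
            state.is_processing = False
            state.completed_once = True
            state.log.append("COMPLETE")
    return state


# Replay of the pattern described above: three empty idles, then a populated one.
state = WorkerState()
for event, tasks in [
    ("turn_start", ()), ("session_idle", ()),
    ("turn_start", ()), ("session_idle", ()),
    ("turn_start", ()), ("session_idle", ()),
    ("turn_start", ()), ("session_idle", ("agent-1",)),
]:
    handle(state, event, tasks)
```

In this replay the state ends with `is_processing` still true and three re-arms recorded, matching the 3x repetition reported for a single turn.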

Evidence from PR Review Squad worker-5 (2026-03-25):

13:26:02 [EVT] worker-5 TurnStart (IsProcessing=True)     ← sub-agents launched
13:29:08 [EVT] worker-5 TurnEnd (IsProcessing=False)      ← premature idle fired, no backgroundTasks
13:29:08 [EVT-REARM] worker-5 re-arming IsProcessing      ← sub-agent finished, new turn
13:31:18 [EVT] worker-5 TurnEnd (IsProcessing=False)      ← premature idle AGAIN
13:31:18 [EVT-REARM] worker-5 re-arming IsProcessing
13:34:23 [EVT] worker-5 TurnEnd (IsProcessing=False)      ← premature idle AGAIN
13:34:23 [EVT-REARM] worker-5 re-arming IsProcessing
13:35:08 [IDLE-DEFER] worker-5 session.idle with active    ← THIS time backgroundTasks IS populated
         background tasks — deferring completion

The EVT-REARM mechanism recovers correctly every time, so no data is lost — but the UI spinner flickers (see #395) and diagnostics are confusing.
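
Because the pattern is hard to spot by eye, a small script can count premature idles in a diagnostics log. This Python sketch assumes the log line layout shown in the excerpt above (timestamp, bracketed tag, worker name, message) and treats a TurnEnd followed by an EVT-REARM for the same worker at the same timestamp as one premature idle. That heuristic is an assumption for illustration, not an existing tool.

```python
import re

# Lines copied from the excerpt above, without the trailing annotations.
SAMPLE_LOG = """\
13:26:02 [EVT] worker-5 TurnStart (IsProcessing=True)
13:29:08 [EVT] worker-5 TurnEnd (IsProcessing=False)
13:29:08 [EVT-REARM] worker-5 re-arming IsProcessing
13:31:18 [EVT] worker-5 TurnEnd (IsProcessing=False)
13:31:18 [EVT-REARM] worker-5 re-arming IsProcessing
13:34:23 [EVT] worker-5 TurnEnd (IsProcessing=False)
13:34:23 [EVT-REARM] worker-5 re-arming IsProcessing
13:35:08 [IDLE-DEFER] worker-5 session.idle with active
"""

LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}) \[([A-Z-]+)\] (\S+) (.*)$")


def count_premature_idles(text):
    """Count TurnEnd lines immediately followed by a same-second EVT-REARM."""
    count = 0
    prev = None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if not match:
            continue  # skip wrapped continuation lines and blanks
        ts, tag, worker, message = match.groups()
        if (tag == "EVT-REARM" and prev is not None
                and prev[1] == "EVT" and prev[3].startswith("TurnEnd")
                and prev[0] == ts and prev[2] == worker):
            count += 1
        prev = (ts, tag, worker, message)
    return count


premature = count_premature_idles(SAMPLE_LOG)
```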

Proposed SDK fixes (either would resolve this)

  1. Don't emit session.idle while background tasks are active. The SDK clearly tracks them (it populates the field sometimes). Hold the idle event until agents/shells are truly empty.
  2. Add a turn ID / correlation token to session.idle. Consumers could match idle events to the prompt that triggered them, distinguishing real completion from stale/premature idle.

Option 1 eliminates the entire class of premature idle bugs. Option 2 is more general and helps with other edge cases too.
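
As a rough illustration of Option 1, the Python sketch below gates the idle event on the set of active background tasks. `IdleGate` and its methods are hypothetical and say nothing about how the SDK is actually structured internally. It only shows the intended ordering: an idle requested while tasks are running is held until the last one finishes.

```python
class IdleGate:
    """Hypothetical emitter-side gate: hold session.idle while tasks are active."""

    def __init__(self, emit):
        self._emit = emit          # callback that delivers the event to consumers
        self._active = set()       # ids of running agents/shells
        self._idle_pending = False

    def task_started(self, task_id):
        self._active.add(task_id)

    def task_finished(self, task_id):
        self._active.discard(task_id)
        self._flush()

    def request_idle(self):
        # The turn loop believes it is done; emit only if nothing is running.
        self._idle_pending = True
        self._flush()

    def _flush(self):
        if self._idle_pending and not self._active:
            self._idle_pending = False
            self._emit("session.idle")


emitted = []
gate = IdleGate(emitted.append)
gate.task_started("agent-1")
gate.request_idle()            # held: agent-1 is still running
held = list(emitted)
gate.task_finished("agent-1")  # last task done: the held idle is released
```

A real fix would also need to decide what happens when a finished sub-agent triggers a follow-up turn, as described in the cycle above. This sketch does not model that case.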

Proposed investigation

  1. File upstream issue with Copilot SDK team documenting the premature idle behavior
  2. Request turn-scoped terminal events or turn IDs so consumers can distinguish real vs premature idle
  3. If upstream fix is not forthcoming, evaluate whether the workaround can be simplified (e.g., just use turn ID matching)
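
If turn IDs became available, the consumer-side check could be quite small. The Python sketch below assumes a `turn_id` field on the idle event, which does not exist in the current payload as far as this issue describes. `IdleEvent` and `TurnTracker` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IdleEvent:
    turn_id: Optional[str]  # hypothetical field requested from the SDK


class TurnTracker:
    """Remember the turn we are waiting on and accept only its idle event."""

    def __init__(self):
        self._counter = 0
        self.current_turn_id = None

    def begin_turn(self):
        # Assumption: the consumer can learn or assign an id when a turn starts.
        self._counter += 1
        self.current_turn_id = f"turn-{self._counter}"
        return self.current_turn_id

    def is_real_completion(self, event):
        # Idle events without an id, or for an older turn, are ignored.
        return event.turn_id is not None and event.turn_id == self.current_turn_id


tracker = TurnTracker()
first = tracker.begin_turn()
second = tracker.begin_turn()
stale = tracker.is_real_completion(IdleEvent(turn_id=first))
real = tracker.is_real_completion(IdleEvent(turn_id=second))
```

With a check of this kind, the mtime freshness detection and the multi-round loop might no longer be needed, though that would have to be confirmed against the actual upstream design.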

Priority

Medium — the workaround works but adds resilience debt in the most critical orchestration path.

Labels

external: Upstream bug or dependency issue
