[External] Copilot CLI: session.idle should be persisted to events.jsonl #538

@PureWeen

Summary

When PolyPilot resumes or reconnects after missing the live event stream, there is no durable, authoritative way to determine whether the previous turn completed successfully.

session.idle appears to be the canonical live "turn is done / session is now idle" signal, but it is ephemeral and not written to events.jsonl. That leaves restart/reconnect logic relying on inference instead of a persisted completion marker.

Why this is a real API/CLI gap

A resumed client that missed the live stream can currently only combine coarse session metadata with persisted artifacts on disk. The relevant signals are:

| Event / signal | Persisted? | Reliably means the turn is done? |
| --- | --- | --- |
| session.idle | ❌ No | ✅ Yes |
| assistant.turn_end | ✅ Yes | ❌ No |
| assistant.message | ✅ Yes | ❌ No |
| session.error | ✅ Yes | ✅ Yes, but only for failures |
| session.shutdown | ✅ Yes | ✅ Yes, but only for shutdown |
| SessionMetadata.ModifiedTime | ✅ Yes | ❌ No, only a coarse hint |

The key issue is that the authoritative success-path completion signal is not durably available after reconnect.

assistant.turn_end is not a substitute: it can occur between tool rounds and before subsequent reasoning/tool activity. ModifiedTime is useful as a hint, but it does not answer whether the turn is actually finished.
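
To make the gap concrete, here is a minimal sketch of what a resumed client can conclude from the persisted events alone. It assumes each line of events.jsonl is a JSON object with a string "type" field; that schema and the function name are assumptions for illustration, not the CLI's documented format. The point is that every success-path tail falls into "unknown".

```python
import json
from typing import Iterable, Optional

def classify_persisted_tail(lines: Iterable[str]) -> str:
    """Return "failed", "shutdown", or "unknown" from persisted events only.

    Assumption: each line is a JSON object carrying a "type" field.
    """
    last_type: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if isinstance(event, dict) and "type" in event:
            last_type = event["type"]

    if last_type == "session.error":
        return "failed"
    if last_type == "session.shutdown":
        return "shutdown"
    # assistant.turn_end and assistant.message land here: neither proves
    # the turn is finished, because more tool rounds may still follow.
    return "unknown"
```

A log ending in assistant.turn_end yields "unknown" whether the turn finished cleanly or the client simply missed the rest of the stream, which is exactly the ambiguity described above.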

Why session.idle is special among ephemeral events

Most ephemeral events are understandable as live-only because they have a persisted counterpart:

  • assistant.message_delta → persisted assistant.message
  • assistant.reasoning_delta → persisted assistant.reasoning
  • tool.execution_partial_result / tool.execution_progress → persisted tool.execution_complete

session.idle is different: it has no persisted counterpart at all. That is what makes it problematic for restart/reconnect flows.
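
The same relationship can be written as a lookup table. This is only a sketch built from the event names listed above; the dictionary and helper names are hypothetical.

```python
from typing import Dict, List, Optional

# Ephemeral event -> persisted counterpart, as listed in this issue.
PERSISTED_COUNTERPART: Dict[str, Optional[str]] = {
    "assistant.message_delta": "assistant.message",
    "assistant.reasoning_delta": "assistant.reasoning",
    "tool.execution_partial_result": "tool.execution_complete",
    "tool.execution_progress": "tool.execution_complete",
    "session.idle": None,  # nothing on disk carries this information
}

def unrecoverable_after_reconnect(mapping: Dict[str, Optional[str]]) -> List[str]:
    """Ephemeral events whose information is lost if the live stream is missed."""
    return sorted(name for name, persisted in mapping.items() if persisted is None)
```

With the table above, the helper returns only session.idle.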

What PolyPilot has had to do to work around this

Some past stuck-session bugs were genuinely in PolyPilot and have been fixed. This issue tracks the remaining external gap that still forces a large workaround stack even after those local fixes.

Today PolyPilot has to do all of the following just to approximate "did this turn finish?":

  1. Tail-scan events.jsonl and analyze sub-turn structure. IsSessionStillProcessing() cannot trust assistant.turn_end, so it scans the event tail and walks backward within the current sub-turn (IsCleanNoToolSubturn) to decide whether more tool rounds are likely coming.
  2. Run a multi-tier processing watchdog. RunProcessingWatchdogAsync() uses different timeout regimes for resumed sessions, normal inactivity, tool-heavy turns, and deferred-idle/background-task cases. This exists largely to compensate for the missing durable completion marker.
  3. Handle session.idle with active backgroundTasks as a special deferral state. PolyPilot has to defer completion when idle arrives with active agents/shells, track carryover/zombie background tasks, and sometimes re-arm IsProcessing if a later idle arrives after state was already cleared.
  4. Use file-growth and mtime heuristics. For multi-agent sessions, PolyPilot checks whether events.jsonl is still growing. If mtime stays fresh but file size stops increasing for multiple checks, it assumes the connection is dead and moves to recovery.
  5. Force-complete sessions to prevent infinite spinners. When the heuristics say the event stream is dead, PolyPilot flushes any partial response, adds a system warning, and force-completes the session rather than leaving the UI stuck in "Thinking…" forever.
  6. Scan external session directories and lock files. A background ExternalSessionScanner polls session-state folders and lock PIDs to infer whether orphaned sessions are probably still active after restart.
  7. Maintain extensive regression coverage. A nontrivial amount of test coverage now exists purely to keep these resume/watchdog/idle heuristics from regressing.
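
As an illustration of item 4, the file-growth check might look roughly like the sketch below. It is written in Python for brevity and is not PolyPilot's actual code; the class name, the threshold of three flat checks, and the 60-second freshness window are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GrowthTracker:
    """Flags a log whose mtime stays fresh while its size stops increasing."""
    max_flat_checks: int = 3       # hypothetical threshold
    fresh_window_s: float = 60.0   # hypothetical freshness window
    _last_size: int = field(default=-1, init=False)
    _flat_checks: int = field(default=0, init=False)

    def observe(self, size: int, mtime: float, now: float) -> bool:
        """Record one poll of events.jsonl; return True if the stream looks dead."""
        grew = size > self._last_size
        self._last_size = size
        if grew:
            self._flat_checks = 0
            return False
        # Only count flat polls while mtime is still fresh; a stale mtime is
        # left to the ordinary inactivity timeouts.
        if now - mtime <= self.fresh_window_s:
            self._flat_checks += 1
        return self._flat_checks >= self.max_flat_checks
```

A caller would poll the file's size and mtime on a timer, feed them to observe(), and move to recovery once it returns True. None of this would be needed if the log itself recorded that the turn had completed.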

This is the downstream cost of the missing persisted completion signal: a basic resume question becomes a mix of log parsing, timeout tuning, file-system probing, and recovery heuristics.

Proposed fix

The simplest fix is:

  • Persist session.idle to events.jsonl.
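
A rough sketch of what this would enable, assuming the event is appended as a JSON line with a "type" field (an assumed schema; the helper names are hypothetical and this is not the CLI's implementation). The writer appends and syncs the marker, and a resumed client only needs to look at the last persisted event.

```python
import json
import os
from typing import Optional

def append_event(path: str, event: dict) -> None:
    """Append one event as a JSON line and force it to disk."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())

def last_event_type(path: str) -> Optional[str]:
    """Return the "type" of the last parseable event, or None if there is none."""
    last: Optional[str] = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip blank or partially written lines
            if isinstance(event, dict) and "type" in event:
                last = event["type"]
    return last

def previous_turn_completed(path: str) -> bool:
    """With a persisted marker, the resume question becomes a single check."""
    return last_event_type(path) == "session.idle"
```

A real client would still need to account for session.idle arriving with active backgroundTasks, but it would be starting from an authoritative marker rather than from inference.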

Acceptable alternative fixes

If there is a strong product reason to keep session.idle ephemeral, then an equivalent persisted signal is still needed. Any of these would address the underlying problem:

  1. add a persisted session.turn_complete / session.ready event,
  2. include explicit completion state in session.resume / resume metadata,
  3. persist a final session-status snapshot that can be read after reconnect.

The important point is not the event name; it is that there must be some persisted, authoritative completion marker for resumed clients.

Why this seems worth addressing upstream

There is already prior evidence that this lifecycle edge is fragile: SDK workarounds have had to synthesize session.idle when the CLI omits it after assistant.turn_end in some flows. Persisting a definitive completion marker would remove a whole class of resume/reconnect heuristics from downstream clients.

Scope

  • Affects the CLI event log written under ~/.copilot/session-state/<session-id>/events.jsonl
  • Affects any SDK/app that restores sessions across restart, reconnect, crash recovery, or transport recreation

Upstream tracking issue: github/copilot-cli#2596
