Summary
When PolyPilot resumes or reconnects after missing the live event stream, there is no durable, authoritative way to determine whether the previous turn completed successfully.
session.idle appears to be the canonical live "turn is done / session is now idle" signal, but it is ephemeral and not written to events.jsonl. That leaves restart/reconnect logic relying on inference instead of a persisted completion marker.
Why this is a real API/CLI gap
A resumed client that missed the live stream can currently only combine coarse session metadata with persisted artifacts on disk. The relevant signals are:
| Event / signal | Persisted? | Reliably means the turn is done? |
| --- | --- | --- |
| session.idle | ❌ No | ✅ Yes |
| assistant.turn_end | ✅ Yes | ❌ No |
| assistant.message | ✅ Yes | ❌ No |
| session.error | ✅ Yes | ✅ Yes, but only for failures |
| session.shutdown | ✅ Yes | ✅ Yes, but only for shutdown |
| SessionMetadata.ModifiedTime | ✅ Yes | ❌ No, only a coarse hint |
The key issue is that the authoritative success-path completion signal is not durably available after reconnect.
assistant.turn_end is not a substitute: it can occur between tool rounds and before subsequent reasoning/tool activity. ModifiedTime is useful as a hint, but it does not answer whether the turn is actually finished.
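To make the gap concrete, here is a minimal Python sketch of what a resumed client can conclude from persisted events alone, following the signals listed above. It assumes each line of events.jsonl is a JSON object with a `type` field; that layout and the helper name `classify_from_persisted` are assumptions for illustration, not taken from the CLI or from PolyPilot's code.

```python
import json
from typing import Iterable, Optional

# Persisted event types that are authoritative, but only for non-success outcomes.
TERMINAL = {"session.error": "failed", "session.shutdown": "shutdown"}


def classify_from_persisted(lines: Iterable[str]) -> Optional[str]:
    """Return 'failed', 'shutdown', or None when the log cannot say.

    Assumes one JSON object per line with a "type" field (hypothetical layout).
    """
    last_type = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if isinstance(event, dict):
            last_type = event.get("type", last_type)
    # assistant.turn_end and assistant.message fall through to None:
    # they are persisted but do not prove the turn is finished.
    return TERMINAL.get(last_type)
```

On the success path the function can only return `None`, which is exactly the case the rest of this issue is about.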
Why session.idle is special among ephemeral events
Most ephemeral events are understandable as live-only because they have a persisted counterpart:
assistant.message_delta → persisted assistant.message
assistant.reasoning_delta → persisted assistant.reasoning
tool.execution_partial_result / tool.execution_progress → persisted tool.execution_complete
session.idle is different: it has no persisted counterpart at all. That is what makes it problematic for restart/reconnect flows.
What PolyPilot has had to do to work around this
Some past stuck-session bugs were genuinely in PolyPilot and have been fixed. This issue tracks the remaining external gap that still forces a large workaround stack even after those local fixes.
Today PolyPilot has to do all of the following just to approximate "did this turn finish?":
- Tail-scan events.jsonl and analyze sub-turn structure. IsSessionStillProcessing() cannot trust assistant.turn_end, so it scans the event tail and walks backward within the current sub-turn (IsCleanNoToolSubturn) to decide whether more tool rounds are likely coming.
- Run a multi-tier processing watchdog. RunProcessingWatchdogAsync() uses different timeout regimes for resumed sessions, normal inactivity, tool-heavy turns, and deferred-idle/background-task cases. This exists largely to compensate for the missing durable completion marker.
- Handle session.idle with active backgroundTasks as a special deferral state. PolyPilot has to defer completion when idle arrives with active agents/shells, track carryover/zombie background tasks, and sometimes re-arm IsProcessing if a later idle arrives after state was already cleared.
- Use file-growth and mtime heuristics. For multi-agent sessions, PolyPilot checks whether events.jsonl is still growing. If mtime stays fresh but file size stops increasing for multiple checks, it assumes the connection is dead and moves to recovery.
- Force-complete sessions to prevent infinite spinners. When the heuristics say the event stream is dead, PolyPilot flushes any partial response, adds a system warning, and force-completes the session rather than leaving the UI stuck in "Thinking…" forever.
- Scan external session directories and lock files. A background ExternalSessionScanner polls session-state folders and lock PIDs to infer whether orphaned sessions are probably still active after restart.
- Maintain extensive regression coverage. A nontrivial amount of test coverage now exists purely to keep these resume/watchdog/idle heuristics from regressing.
This is the downstream cost of the missing persisted completion signal: a basic resume question becomes a mix of log parsing, timeout tuning, file-system probing, and recovery heuristics.
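As an illustration of how indirect these heuristics are, the file-growth check described above can be sketched roughly as follows. This is a Python approximation, not PolyPilot's implementation: the class name `GrowthWatch`, the threshold of three stalled checks, and the 60-second freshness window are all assumed values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class GrowthWatch:
    """Tracks events.jsonl size across polls (hypothetical helper)."""

    max_stalled_checks: int = 3   # assumed threshold, not PolyPilot's actual value
    fresh_window_s: float = 60.0  # assumed definition of a "fresh" mtime
    last_size: int = -1
    stalled: int = 0

    def observe(self, size: int, mtime: float, now: float) -> bool:
        """Record one poll; return True when the stream looks dead."""
        fresh = (now - mtime) <= self.fresh_window_s
        if size > self.last_size:
            self.stalled = 0      # file grew: the session is still making progress
        elif fresh:
            self.stalled += 1     # mtime is fresh but the file did not grow
        self.last_size = size
        return self.stalled >= self.max_stalled_checks


# Typical polling use (values would come from os.stat and time.time):
#   st = os.stat(path)
#   dead = watch.observe(st.st_size, st.st_mtime, time.time())
```

Even in this simplified form the check depends on tuned constants and repeated polling, and it can only ever report "probably dead", never "completed".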
Proposed fix
The simplest fix is:
- Persist session.idle to events.jsonl.
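For comparison, this is roughly what the resume check could reduce to if session.idle were persisted. It is a sketch under the same assumption as before, namely that each line of events.jsonl is a JSON object with a `type` field; the function name `turn_completed` is hypothetical.

```python
import json
from typing import Iterable


def turn_completed(lines: Iterable[str]) -> bool:
    """True when the last parseable persisted event is session.idle.

    Assumes one JSON object per line with a "type" field (hypothetical layout).
    """
    last_type = None
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or partially written lines
        if isinstance(event, dict) and "type" in event:
            last_type = event["type"]
    return last_type == "session.idle"
```

A real client would probably still need to inspect any backgroundTasks reported with the idle event before treating the turn as fully finished, but the check would no longer rest on timeouts or file-system probing.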
Acceptable alternative fixes
If there is a strong product reason to keep session.idle ephemeral, then an equivalent persisted signal is still needed. Any of these would address the underlying problem:
- add a persisted session.turn_complete / session.ready event,
- include explicit completion state in session.resume / resume metadata,
- persist a final session-status snapshot that can be read after reconnect.
The important point is not the event name; it is that there must be some persisted, authoritative completion marker for resumed clients.
Why this seems worth addressing upstream
There is already prior evidence that this lifecycle edge is fragile: SDK workarounds have had to synthesize session.idle when the CLI omits it after assistant.turn_end in some flows. Persisting a definitive completion marker would remove a whole class of resume/reconnect heuristics from downstream clients.
Scope
- Affects the CLI event log written under ~/.copilot/session-state/<session-id>/events.jsonl
- Affects any SDK/app that restores sessions across restart, reconnect, crash recovery, or transport recreation
Upstream tracking issue: github/copilot-cli#2596