Skip to content

Commit 4f2c7da

Browse files
PureWeenCopilot
andauthored
fix: smart turn-completion detection — eliminates 600s delay for clean turns (#475)
## Problem After PolyPilot restarts, sessions that completed a clean turn (no tools) while the app was down appear stuck for up to 600s because `IsSessionStillProcessing` treated `assistant.turn_end` as "still processing." ## Root cause `session.idle` is ephemeral (`ephemeral: true` in the SDK schema) — intentionally never written to events.jsonl. So there was no disk-based signal for "turn complete." The poller and watchdog handled this with a 600s timeout. ## Fix Enhanced `IsSessionStillProcessing` with smart completion detection: when the last event is `assistant.turn_end` or `assistant.message`, check the last N events for pending `tool.execution_start`. If no tools are pending, the turn completed cleanly — treat as idle. New `GetTailEventTypes` helper reads the last 8KB of events.jsonl and returns event types in reverse order for efficient tail analysis. ### What this changes: - **Clean turns (no tools)**: Detected as idle immediately — no 600s wait - **Tool-using turns**: Still correctly get 600s watchdog timeout (tools may still be running) - **Zero CLI changes needed** — purely PolyPilot-side improvement ## Tests 3 new tests: - `TurnEndWithoutPendingTools_ReturnsFalse` — clean turn detected as idle - `TurnEndWithPendingTools_ReturnsTrue` — tools pending stays as processing - `MessageWithContentNoTools_ReturnsFalse` — final message without tools is idle ## Related - PR #472 (merged) — poll-then-resume pattern, IDLE-DEFER-REARM - Issue #403 (fixed in #472) — IDLE-DEFER re-arm - This eliminates the need for a CLI feature request for disk-based completion markers --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent d34ae90 commit 4f2c7da

4 files changed

Lines changed: 273 additions & 11 deletions

File tree

.claude/skills/multi-agent-orchestration/SKILL.md

Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -338,6 +338,7 @@ When modifying orchestration, verify:
338338
- [ ] **INV-O7**: Worker timeouts use 10-minute default (600s for resumed sessions)
339339
- [ ] **INV-O8**: Cancellation tokens propagated to all async operations
340340
- [ ] **INV-O15**: IDLE-DEFER flushes CurrentResponse before breaking (content preservation)
341+
- [ ] **INV-O17**: IDLE-DEFER-REARM re-arms IsProcessing when idle arrives with backgroundTasks but IsProcessing is already false (PR #472)
341342

342343
---
343344

@@ -763,6 +764,59 @@ These layers remain as fallbacks for edge cases where IDLE-DEFER doesn't fire:
763764
| `RecoverFromPrematureIdleIfNeededAsync` | Poll signal + events.jsonl freshness | Collect full response after partial TCS |
764765
| `IsEventsFileActive` | events.jsonl mtime < 15s | Detect ongoing CLI activity |
765766

767+
### IDLE-DEFER-REARM (PR #472)
768+
769+
When multiple `session.idle` events arrive for a session with active `backgroundTasks`,
770+
the first IDLE-DEFER correctly defers completion. But if `CompleteResponse` runs between
771+
the first and second idle (e.g., triggered by a watchdog or another code path), then
772+
`IsProcessing` is already `false` when the second idle arrives. Without re-arming,
773+
background agents would finish their work but no completion event would fire — the
774+
classic "zero-idle" symptom where the session appears stuck.
775+
776+
**IDLE-DEFER-REARM** detects this scenario and re-arms processing:
777+
778+
```csharp
779+
case SessionIdleEvent idle:
780+
CancelTurnEndFallback(state);
781+
782+
if (HasActiveBackgroundTasks(idle))
783+
{
784+
// ... existing IDLE-DEFER logic (flush, break) ...
785+
786+
// REARM: If IsProcessing is already false, re-arm it so the
787+
// next idle (without background tasks) triggers CompleteResponse.
788+
if (!state.Info.IsProcessing)
789+
{
790+
state.Info.IsProcessing = true;
791+
state.Info.HasUsedToolsThisTurn = true; // 600s tool timeout
792+
RestartProcessingWatchdog(state);
793+
Debug($"[IDLE-DEFER-REARM] ...");
794+
}
795+
break;
796+
}
797+
798+
CompleteResponse(state, idleGeneration);
799+
```
800+
801+
**What the re-arm sets:**
802+
- `IsProcessing = true` — so the next non-deferred idle fires `CompleteResponse`
803+
- `HasUsedToolsThisTurn = true` — enables the 600s tool timeout instead of 120s,
804+
since background agents may run for several minutes
805+
- Restarts the processing watchdog — provides a safety net if no further idle arrives
806+
- Logs `[IDLE-DEFER-REARM]` to `event-diagnostics.log`
807+
808+
### `session.idle` Is Ephemeral — Never Persisted
809+
810+
> **⚠️ `session.idle` is NEVER written to `events.jsonl` by design.** It is a
811+
> transient signal from the CLI, not a persisted event. Any restore or recovery
812+
> logic that depends on reading `session.idle` from disk is fundamentally flawed.
813+
> The poller in `PollEventsAndResumeWhenIdleAsync` watches for `session.shutdown`
814+
> instead, which IS persisted.
815+
816+
This matters for IDLE-DEFER because after an app restart, there will be no
817+
`session.idle` replay — the watchdog and resume logic handle completion detection
818+
for interrupted sessions, not idle event replay.
819+
766820
---
767821

768822
## "Fix with Copilot" — Multi-Agent Awareness
@@ -959,6 +1013,7 @@ Reflect mode runs multiple iterations. Expect this pattern:
9591013
| Orchestration hangs on reconnect | Check for missing TrySetCanceled | TCS not canceled; see INV-O9 |
9601014
| Many IDLE-DEFER entries | `grep "IDLE-DEFER" diagnostics.log` | Normal — worker has active sub-agents; wait for completion |
9611015
| IDLE-DEFER but worker never completes | Check if background tasks are leaking | Sub-agent/shell not terminating; check CLI logs |
1016+
| IDLE-DEFER-REARM entries | `grep "IDLE-DEFER-REARM" diagnostics.log` | Normal — IsProcessing was false when deferred idle arrived; re-armed for next completion |
9621017
| Worker stuck 30+ min, events.jsonl fresh | Check file size across watchdog cycles | Dead connection — file-size-growth check should catch in ~360s (INV-O16) |
9631018

9641019
---

.github/copilot-instructions.md

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -254,6 +254,8 @@ For multi-agent sessions, Case B also checks **file-size-growth**: if events.jso
254254

255255
Note: `session.idle` is an ephemeral event (`ephemeral: true` in the SDK schema) — it is delivered over the live event stream but intentionally NOT written to `events.jsonl`. When `session.idle` includes active `backgroundTasks` (sub-agents, shells), the IDLE-DEFER logic defers completion until a subsequent idle arrives with empty/null backgroundTasks. In rare cases where `IsProcessing` was already cleared (by watchdog timeout or reconnect) before the deferred idle arrives, the session may appear stuck until the watchdog fires again — see issue #403.
256256

257+
When `session.idle` with active `backgroundTasks` arrives but `IsProcessing` is already `false`, the handler re-arms: sets `IsProcessing = true`, `HasUsedToolsThisTurn = true` (600s timeout), restarts watchdog, logs `[IDLE-DEFER-REARM]`. This prevents zero-idle symptoms where IDLE-DEFER fires once but subsequent idles are no-ops.
258+
257259
When the watchdog fires, it marshals state mutations to the UI thread via `InvokeOnUI()` and adds a system warning message.
258260

259261
### Diagnostic Log Tags
@@ -262,6 +264,7 @@ The event diagnostics log (`~/.polypilot/event-diagnostics.log`) uses these tags
262264
- `[EVT]` — SDK event received (only SessionIdleEvent, AssistantTurnEndEvent, SessionErrorEvent)
263265
- `[IDLE]` — SessionIdleEvent dispatched to CompleteResponse
264266
- `[IDLE-DEFER]` — SessionIdleEvent deferred due to active background tasks (agents/shells)
267+
- `[IDLE-DEFER-REARM]` — SessionIdleEvent re-armed IsProcessing after it was already cleared
265268
- `[COMPLETE]` — CompleteResponse executed or skipped
266269
- `[RECONNECT]` — session replaced after disconnect
267270
- `[ERROR]` — SessionErrorEvent or SendAsync/reconnect failure cleared IsProcessing
@@ -288,6 +291,8 @@ When a user changes the model via the UI dropdown:
288291
- The SDK continues using the original model from session creation
289292
- To truly switch models, the session must be destroyed and recreated
290293

294+
`GetSessionModel` prioritizes: (1) user's explicit choice (`session.Model`), (2) backend-reported model from usage info, (3) `DefaultModel` fallback. `ShouldAcceptObservedModel()` in `ModelHelper.cs` prevents `SessionUsageInfoEvent` and `AssistantUsageEvent` from overwriting an explicit user model selection — the observed model is only accepted if no explicit choice was made or if the observed model matches the explicit choice.
295+
291296
### SDK Data Types
292297
- `AssistantUsageData` properties (`InputTokens`, `OutputTokens`, etc.) are `Double?` not `int?`
293298
- Use `Convert.ToInt32(value)` for conversion, not `value as int?`
@@ -302,6 +307,9 @@ Avoid `@bind:event="oninput"` — causes round-trip lag per keystroke. Use plain
302307

303308
For detailed stuck-session debugging knowledge (8 invariants from 7 PRs of fix cycles), see `.claude/skills/processing-state-safety/SKILL.md`.
304309

310+
### Smart Completion Scan
311+
**Smart completion scan:** `assistant.turn_end` and `assistant.message` are not unconditionally terminal — they appear between every tool round. `IsSessionStillProcessing` uses `IsCleanNoToolSubturn()` to scan backward from the last event within the current sub-turn (bounded by `assistant.turn_start`/`session.resume`/`session.start`). If any `tool.execution_*` event is found in the current sub-turn, the session is considered still processing. Clean no-tool turns are detected immediately, eliminating the 600s watchdog delay for simple conversations.
312+
305313
### Session Persistence
306314
- Active sessions: `~/.polypilot/active-sessions.json` (includes `LastPrompt` — last user message if session was processing during save)
307315
- Session state: `~/.copilot/session-state/<guid>/events.jsonl` (SDK-managed, stays in ~/.copilot)

PolyPilot.Tests/StuckSessionRecoveryTests.cs

Lines changed: 128 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -303,8 +303,6 @@ public void StalenessThreshold_UsesWatchdogToolExecutionTimeout()
303303
[InlineData("assistant.reasoning")]
304304
[InlineData("assistant.reasoning_delta")]
305305
[InlineData("assistant.intent")]
306-
[InlineData("assistant.turn_end")] // between tool rounds — still processing
307-
[InlineData("assistant.message")] // mid-turn text — still processing
308306
[InlineData("tool.execution_end")] // tool just completed — still processing
309307
public void IsSessionStillProcessing_AllActiveEventTypes_ReturnTrue(string eventType)
310308
{
@@ -331,7 +329,7 @@ public void IsSessionStillProcessing_AllActiveEventTypes_ReturnTrue(string event
331329
[InlineData("session.idle")]
332330
[InlineData("session.error")]
333331
[InlineData("session.shutdown")]
334-
[InlineData("session.start")] // created but never used — not actively processing
332+
[InlineData("session.start")] // created but never used — not actively processing
335333
public void IsSessionStillProcessing_TerminalEventTypes_ReturnFalse(string eventType)
336334
{
337335
var svc = CreateService();
@@ -353,6 +351,133 @@ public void IsSessionStillProcessing_TerminalEventTypes_ReturnFalse(string event
353351
}
354352
}
355353

354+
// --- Smart completion: assistant.turn_end/message without pending tools ---
355+
356+
[Fact]
357+
public void IsSessionStillProcessing_TurnEndWithoutPendingTools_ReturnsFalse()
358+
{
359+
// Turn completed cleanly — no tool.execution_start after the last turn_end.
360+
var svc = CreateService();
361+
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
362+
var sessionId = Guid.NewGuid().ToString();
363+
var sessionDir = Path.Combine(tmpDir, sessionId);
364+
Directory.CreateDirectory(sessionDir);
365+
var eventsFile = Path.Combine(sessionDir, "events.jsonl");
366+
367+
try
368+
{
369+
File.WriteAllText(eventsFile,
370+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
371+
"""{"type":"assistant.message","data":{"content":"hello"}}""" + "\n" +
372+
"""{"type":"assistant.turn_end","data":{}}""");
373+
var result = svc.IsSessionStillProcessing(sessionId, tmpDir);
374+
Assert.False(result, "Clean turn_end without pending tools should not report still processing");
375+
}
376+
finally { Directory.Delete(tmpDir, true); }
377+
}
378+
379+
[Fact]
380+
public void IsSessionStillProcessing_TurnEndWithPendingTools_ReturnsTrue()
381+
{
382+
// Turn ended after a tool round. This can still be between rounds, so restore
383+
// must stay conservative and keep the session active until the next events arrive.
384+
var svc = CreateService();
385+
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
386+
var sessionId = Guid.NewGuid().ToString();
387+
var sessionDir = Path.Combine(tmpDir, sessionId);
388+
Directory.CreateDirectory(sessionDir);
389+
var eventsFile = Path.Combine(sessionDir, "events.jsonl");
390+
391+
try
392+
{
393+
File.WriteAllText(eventsFile,
394+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
395+
"""{"type":"assistant.message","data":{}}""" + "\n" +
396+
"""{"type":"tool.execution_start","data":{}}""" + "\n" +
397+
"""{"type":"tool.execution_complete","data":{}}""" + "\n" +
398+
"""{"type":"assistant.turn_end","data":{}}""");
399+
var result = svc.IsSessionStillProcessing(sessionId, tmpDir);
400+
Assert.True(result, "turn_end after tool activity in the same sub-turn should report still processing");
401+
}
402+
finally { Directory.Delete(tmpDir, true); }
403+
}
404+
405+
[Fact]
406+
public void IsSessionStillProcessing_TurnEndNoCurrentToolsAfterEarlierToolRound_ReturnsFalse()
407+
{
408+
// Older tool rounds must not leak across assistant.turn_start boundaries and keep a
409+
// later clean no-tool turn alive.
410+
var svc = CreateService();
411+
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
412+
var sessionId = Guid.NewGuid().ToString();
413+
var sessionDir = Path.Combine(tmpDir, sessionId);
414+
Directory.CreateDirectory(sessionDir);
415+
var eventsFile = Path.Combine(sessionDir, "events.jsonl");
416+
417+
try
418+
{
419+
File.WriteAllText(eventsFile,
420+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
421+
"""{"type":"assistant.message","data":{}}""" + "\n" +
422+
"""{"type":"tool.execution_start","data":{}}""" + "\n" +
423+
"""{"type":"tool.execution_complete","data":{}}""" + "\n" +
424+
"""{"type":"assistant.turn_end","data":{}}""" + "\n" +
425+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
426+
"""{"type":"assistant.message","data":{"content":"final answer"}}""" + "\n" +
427+
"""{"type":"assistant.turn_end","data":{}}""");
428+
var result = svc.IsSessionStillProcessing(sessionId, tmpDir);
429+
Assert.False(result, "A clean no-tool turn_end should not be kept active by tools from an earlier round");
430+
}
431+
finally { Directory.Delete(tmpDir, true); }
432+
}
433+
434+
[Fact]
435+
public void IsSessionStillProcessing_MessageWithContentNoTools_ReturnsFalse()
436+
{
437+
// Final assistant.message with content, no pending tools — session done.
438+
var svc = CreateService();
439+
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
440+
var sessionId = Guid.NewGuid().ToString();
441+
var sessionDir = Path.Combine(tmpDir, sessionId);
442+
Directory.CreateDirectory(sessionDir);
443+
var eventsFile = Path.Combine(sessionDir, "events.jsonl");
444+
445+
try
446+
{
447+
File.WriteAllText(eventsFile,
448+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
449+
"""{"type":"assistant.message","data":{"content":"Here is your answer."}}""");
450+
var result = svc.IsSessionStillProcessing(sessionId, tmpDir);
451+
Assert.False(result, "assistant.message without pending tools should not report still processing");
452+
}
453+
finally { Directory.Delete(tmpDir, true); }
454+
}
455+
456+
[Fact]
457+
public void IsSessionStillProcessing_MessageWithPendingToolInSameTurn_ReturnsTrue()
458+
{
459+
// assistant.message as last event, but tool.execution_start earlier in same turn
460+
// → smart scan should detect the pending tool and report still processing.
461+
var svc = CreateService();
462+
var tmpDir = Path.Combine(Path.GetTempPath(), "polypilot-test-" + Guid.NewGuid().ToString("N"));
463+
var sessionId = Guid.NewGuid().ToString();
464+
var sessionDir = Path.Combine(tmpDir, sessionId);
465+
Directory.CreateDirectory(sessionDir);
466+
var eventsFile = Path.Combine(sessionDir, "events.jsonl");
467+
468+
try
469+
{
470+
File.WriteAllText(eventsFile,
471+
"""{"type":"assistant.turn_start","data":{}}""" + "\n" +
472+
"""{"type":"assistant.message","data":{}}""" + "\n" +
473+
"""{"type":"tool.execution_start","data":{}}""" + "\n" +
474+
"""{"type":"assistant.message","data":{"content":"Running the tool now..."}}""");
475+
var result = svc.IsSessionStillProcessing(sessionId, tmpDir);
476+
Assert.True(result, "assistant.message with pending tool.execution_start should report still processing");
477+
}
478+
finally { Directory.Delete(tmpDir, true); }
479+
}
480+
356481
// --- Resume branch selection: RESUME-ACTIVE vs RESUME-QUIESCE ---
357482

358483
[Fact]

0 commit comments

Comments
 (0)