Live-scenario coverage for the non-happy feedback-rejection paths#302
Merged
Conversation
clkao
added a commit
that referenced
this pull request
Jun 5, 2026
clkao
added a commit
that referenced
this pull request
Jun 5, 2026
…303 merging; hwk/ev3e in flight; xa+3c0b roadmapped
…ycle Cover the two non-happy feedback-rejection guarantees no standing test exercised: - feedback-3-cycle-escalation (new shared scenario): on the 3rd consecutive REJECTED validation the FO escalates to the human instead of auto-bouncing a 4th time. Host-neutral assertThirdCycleEscalation grades durable entity-body state alone — >=3 `### Feedback Cycles` entries, the fixture-instructed escalation marker, and no post-cycle-3 implementation report. Registered with both Claude and Codex runners; offline negatives (4th-auto-bounce, stalled-at- cycle-2) red the assertion. - rejection-flow (evolved): drive the full 2-cycle trajectory the Go port dropped — route back, re-implement, re-validate a second cycle reusing the kept-alive reviewer. assertRejectionFlow now grades two implementation reports + two recorded cycles; the host runners grade the reviewer-reuse signal (Claude SendMessage / Codex send_input). Offline negative reds a single-cycle end-state. Doc-lock: bind the `<!-- seed-scenarios -->` block in scenario-testing- principles.md to sharedRuntimeScenarios() (TestSeedScenariosDocLock reds on drift in either direction); name the new scenario in the dev README list. Live legs (AC-3 both hosts, AC-4 live cycle) are cited at validation/terminal per the live-AC policy; the secret-free `go test ./...` lane is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction-scope Cycle-2 rework from the detached adversarial audit of 7e03d5ea. Material: - A: add an isolating negative for assertThirdCycleEscalation's no-post-cycle-3- report check — marker present + cycle-3 entry + a stray post-cycle-3 implementation report, no other defect. Mutation control: changing the check to `reports > 99999` now reds exactly this case (before, every report-carrying negative also lacked the marker and red on the marker check first, leaving the report-count clause unguarded). - B: assertThirdCycleEscalation now requires the entity stayed NON-terminal — escalation parks for the human, it does not self-resolve to status: done. Add the isolating negative (marker + cycle-3 + status: done, no other defect); mutation control: deleting the check reds exactly this case. Polish: - Section-scope the `- Cycle N:` and escalation-marker matches to the `### Feedback Cycles` section (was matched anywhere in the body); add the out-of-section isolating case + mutation control. - Move assertClaude/CodexReviewerReuse to a default-tag file and add an offline table test (wrong-recipient / unrelated-tool / loose-narration red, real tool-call passes) so they no longer depend only on the live runners. - Doc/comment nits: drop the stale "three" count in scenario-testing-principles.md (the seed block is the source of truth); realign stale AC-number labels in test comments to this entity's AC scheme. Offline `go test ./...` green (1124/15 pkgs); live lane compiles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-correlated reuse The rejection-flow fixture started already-rejected (a pre-written cycle-1 validation report), so no cycle-1 reviewer was ever spawned in-session and the cycle-2 reviewer-reuse signal was unreachable — the FO correctly fresh-dispatched and the reuse assertions failed across all three CI lanes (PR #302). Fixture: start at status: implementation, before cycle-1 validation. The FO now drives a real cycle-1 implementation (omit the fix marker) -> cycle-1 validation REJECTS (spawning a reviewer kept alive entering the feedback flow) -> route back -> cycle-2 rework applies the marker -> cycle-2 re-validation REUSES the kept-alive reviewer. The README's implementation stage is per-round (omit on first, apply on rework); the durable 2-cycle end-state (>=2 impl reports, >=2 cycle entries, marker) is unchanged. Reuse assertions: a local live run of BOTH hosts revealed the reuse matchers were ALSO wrong — they expected a "validation" name, but a kept-alive reviewer is re-engaged by its opaque handle. assertClaudeReviewerReuse now correlates the Agent/Task spawn whose description names the validation stage -> the agentId its tool_result returns -> a SendMessage to that agentId (or a name). assertCodexReviewerReuse parses the real collab_tool_call shape and correlates the validation spawn_agent's receiver thread -> a send_input to that thread, distinguishing reviewer reuse from the feedback-to-implementation send_input. Both proven on the real recorded transcripts and mutation-controlled offline. Offline negatives RED on the shipped no-reviewer shape (a fixture pre-containing a validation report; a run with no reuse call). Timeout sized against a MEASURED local sonnet run: 4 live phases = 13.65 min end-to-end (~3.1m ensign work + ~10.5m teams-mode polling), Codex 5.5 min. Per-scenario rejection-flow timeout 8m -> 22m; CI/local go test -timeout 40m (the suite exceeds Go's default 10m binary timeout, the opus panic source). Docs + workflow guards updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The captain ruling bans long basket/binary timeouts. A live shared scenario shares ONE spacedock host process (reviewer-reuse needs the cycle-1 reviewer alive across the route-back, so it cannot be split into per-stage launches), and a measured sonnet rejection-flow is ~6 min of genuine sequential model work — but every UNIT (each FO turn, each ensign stage) completes well under 60s and the host stream emits frequent liveness events. streamWithStallWatchdog reads the host stdout incrementally and resets a 60s (stageStallTimeout) timer on every line; it kills the process and fails fast only when the stream goes silent that long — catching a true hang precisely while never false-killing slow-but-live thinking. Both the Claude and Codex runners now stream through it instead of context.WithTimeout(scenario.timeout). Offline unit tests drive synthetic streams (no model): a normal-cadence stream passes; a >budget-silent stream is killed + returns a stallError. Mutation-controlled — disabling the timer reset false-kills a normal stream; removing the kill/error path hangs a stalled stream (caught by the test -timeout). Offline go test ./... green (1146/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… table The captain ruling bans long basket timeouts; the stall-watchdog is now the guard. Leaving the 22m (and 8m/2m/90s) values as a `timeout` field in the very shared table this work fixes is dead basket data that drifts back, so remove the field. - sharedRuntimeScenario drops `timeout`; the four scenarios carry only their host-neutral facts (name, provenance, intent). - Both out-of-scope live adapters (claudeRunnerAdapter via livescenario_adapter, auto_continue) reach the host through claudeLiveRunner.run, which already streams through the per-stage stall-watchdog — so they inherit the watchdog mechanically. Removed their now-dead per-call `timeout` (2m / 15m); no explicit per-call deadline needed, no logic restructured. - TestSharedRuntimeScenarioDefinitions drops the `timeout > 0` check and the rejection-flow > gate-guardrail ordering (they policed a removed field); the host-neutrality field-set + no-host-named-field assertions stay, now banning a timeout field too. Offline go test ./... green (1146/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stop Captain-approved final timeout numbers, grounded in the team-lead's instrumented measurement (rejection-flow sonnet 6.16m / opus 8.98m; max FO-stream-silence gap sonnet 28.3s / opus 59.1s; full 4-scenario serial suite ~27m opus). - stageStallTimeout 60s -> 120s: ~2x margin over the measured 59.1s opus max-gap (a sub-agent dispatch blocks the FO top-level stream); 60s left only ~1s margin -> CI-flaky. Still a precise hang-detector. The runner/watchdog comments carry the measured justification. - AUDIT the exception rather than evade it: there is a standing AC-1 60s-cap guard (live_budget_test.go's AST scan of streamwatch_test.go + live_test.go, plus the pre-existing streamWatcher/quietBudgetDefault) that my 120s would silently slip past because the AST guard does not scan stall_watchdog_test.go. Added TestStageStallTimeoutIsCaptainApprovedException pinning stageStallTimeout == 120s so the single sanctioned >60s exception is documented + drift-guarded (a new value reds the pin and needs captain re-approval). - go test -timeout 40m stays a LOOSE BACKSTOP (above the ~27m full-suite wall-time), not a hang-timer — the 120s watchdog is the real guard. CI commands, README locals, doc-contract + release/journey guard comments reframed to "backstop only" and mutation-confirmed (drifting either command string reds its guard). Offline go test ./... green (1147/15); live lane vet+build clean. The pre-existing 60s streamWatcher on the TestLiveEnsignCycle path is untouched (out of scope) but carries the same ~1s opus margin — flagged to team-lead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cycle-4 120s watchdog is reversed per captain ruling: (1) a standing codified guard (TestNoTimeoutLiteralExceeds60s) already encodes the <=60s discipline, so a 120s const in a file it didn't scan was a silent end-run; (2) the 59.1s "max gap" was measured on timestamped events only, but the watcher resets on EVERY drained line (incl. the untimestamped assistant lines between them), so true inter-line silence is <=59.1s and likely much less — 60s was never at ~1s margin. UNIFY (X) at 60s — one mechanism, DRY: - Removed the duplicate streamWithStallWatchdog (+ its two files). Added streamWatcher.drainToExit(budget, label): the predicate-free sibling of expect that runs the process to exit accumulating the full transcript, resets the no-progress deadline on every drained line, and kills + trips stepTimeout on quietBudgetDefault (60s) silence. Both shared-scenario runners now wire io.Pipe + newCmdPoller + newStreamWatcher + drainToExit — the SAME path TestLiveEnsignCycle uses. Wiring reuse was clean; the runner files carry zero duration literals. - KEPT quietBudgetDefault = 60s; TestLiveEnsignCycle byte-unchanged; the 120s exception pin removed. - STRENGTHENED the AC-1 guard: liveBudgetSources now also scans the shared-runner files (the unguarded gap that let the old basket exist). Mutation-confirmed a 90s literal in the codex runner reds TestNoTimeoutLiteralExceeds60s there. - drainToExit offline unit tests (full-transcript / resets-on-activity / kills-stalled), mutation-controlled (reset-disabled false-kills a progressing stream; kill-removed hangs a stalled one). - Go -timeout 40m command-line backstop stays (a flag, outside the source-literal guard); CI + README + doc-contract comments reframed to the 60s quiet budget. Offline go test ./... green (1147/15); live lane vet+build clean. Cycle-3 reuse work untouched. The 60s-vs-opus question is left to the authoritative opus live drive (the watcher resets per-line, testing TRUE silence, not the timestamped gap). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…amWatcher" This reverts commit 5171982e6b291629dcede89c33bcbf0b5f33c744.
…60s streamWatcher"" This reverts commit aa162a53715ea7ee63baa3ee953288f192118141.
…-passes A detached cycle-8 audit found both reuse assertions FALSE-PASS the exact contract violation they guard: a run where the FO FRESH-dispatches the cycle-2 validator (forbidden by the #141 keepalive contract) and messages it by validation name/thread greened the assertion. The old name/prompt match could not tell a kept-alive cycle-1 reviewer from a freshly-spawned cycle-2 one, and the recorded production transcripts (testdata/*.jsonl) address members by SPAWN NAME, not agentId — so the name path (the one that fires) was operation-blind. Fix (both hosts): genuine reuse now requires BOTH a message to the cycle-1 reviewer's handle AND the ABSENCE of a fresh cycle-2 validation spawn. The discriminator: the rejection-flow drives ONE cycle-1 validation spawn; a reuse run re-engages it (one spawn), a fresh-dispatch emits a SECOND validation spawn. So >1 validation Agent/Task (Claude) or spawn_agent (Codex) is an immediate FAIL regardless of any reuse-shaped message. The docstrings are corrected to the name-addressed production shape. Mutation-controlled (offline table tests): a planted fresh-dispatch + name/thread message REDs each host path; genuine kept-alive reuse (one spawn + reuse message) PASSES. Verified the strengthened assertions still pass the REAL recorded transcripts (both had exactly one validation spawn); neutering each >1 guard re-opens the false-pass. AC-1 guard polish: durationOf now folds the `time.Duration(N)` integer conversion (`time.Duration(120)*time.Second` = 120s), the BasicLit-only scalar evasion the audit named; mutation-confirmed it reds in a guarded runner file. Un-foldable forms (float conversion, runtime scalar) stay deferred to the wired-const value-guard, documented. Offline go test ./... green (1150/15); live lane vet+build clean. The verified-sound prior work (fixture, timeout unify at 60s, AC-2/AC-5) untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…integration) Rebasing onto origin/next pulled in z8 #305, which added a Pi parity requirement: TestSharedScenarioRunnerCoverage now requires every shared scenario to carry a piSharedScenarioCoverageMap() entry. z8 covered the three scenarios that existed when it branched; this cycle's feedback-3-cycle-escalation (added here) had none, so the parity guard reds. Add its entry as a `gap` (Pi has no live-safe shared FO escalation runner yet), matching z8's honest-reason pattern for the others. Offline go test ./... green (1166/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
97a63bc to
f21a126
Compare
clkao
added a commit
that referenced
this pull request
Jun 5, 2026
clkao
added a commit
that referenced
this pull request
Jun 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Guard the feedback-rejection runaway-loop: add live + offline coverage for the 3-cycle escalation and 2-cycle reviewer-reuse paths that no standing test exercised.
What changed
feedback-3-cycle-escalationshared runtime scenario with Claude and Codex runners.assertThirdCycleEscalationgrading durable state (cycle count + escalation marker + non-terminal + no 4th-bounce report).rejection-flowto drive the 2-cycle trajectory with reviewer-reuse.Evidence
go test ./...1124/15 packages passed; cycle-1 audit findings closed, mutation-controlled.TestLiveClaudeSharedScenarios/feedback-3-cycle-escalationPASS (71.3s) — a real FO escalates on cycle 3 instead of a 4th bounce.gq