Live-scenario coverage for the non-happy feedback-rejection paths by clkao · Pull Request #302 · spacedock-dev/spacedock

clkao · 2026-06-05T05:35:07Z

Guard the feedback-rejection runaway-loop: add live + offline coverage for the 3-cycle escalation and 2-cycle reviewer-reuse paths that no standing test exercised.

What changed

Add a feedback-3-cycle-escalation shared runtime scenario with Claude and Codex runners.
Add assertThirdCycleEscalation grading durable state (cycle count + escalation marker + non-terminal + no 4th-bounce report).
Evolve rejection-flow to drive the 2-cycle trajectory with reviewer-reuse.
Add offline negatives + an offline reviewer-reuse table test; bind the seed-scenarios doc block to the code table.

Evidence

Offline go test ./... 1124/15 packages passed; cycle-1 audit findings closed, mutation-controlled.
Live TestLiveClaudeSharedScenarios/feedback-3-cycle-escalation PASS (71.3s) — a real FO escalates on cycle 3 instead of a 4th bounce.

gq

…303 merging; hwk/ev3e in flight; xa+3c0b roadmapped

…ycle Cover the two non-happy feedback-rejection guarantees no standing test exercised: - feedback-3-cycle-escalation (new shared scenario): on the 3rd consecutive REJECTED validation the FO escalates to the human instead of auto-bouncing a 4th time. Host-neutral assertThirdCycleEscalation grades durable entity-body state alone — >=3 `### Feedback Cycles` entries, the fixture-instructed escalation marker, and no post-cycle-3 implementation report. Registered with both Claude and Codex runners; offline negatives (4th-auto-bounce, stalled-at- cycle-2) red the assertion. - rejection-flow (evolved): drive the full 2-cycle trajectory the Go port dropped — route back, re-implement, re-validate a second cycle reusing the kept-alive reviewer. assertRejectionFlow now grades two implementation reports + two recorded cycles; the host runners grade the reviewer-reuse signal (Claude SendMessage / Codex send_input). Offline negative reds a single-cycle end-state. Doc-lock: bind the `` block in scenario-testing- principles.md to sharedRuntimeScenarios() (TestSeedScenariosDocLock reds on drift in either direction); name the new scenario in the dev README list. Live legs (AC-3 both hosts, AC-4 live cycle) are cited at validation/terminal per the live-AC policy; the secret-free `go test ./...` lane is green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ction-scope Cycle-2 rework from the detached adversarial audit of 7e03d5ea. Material: - A: add an isolating negative for assertThirdCycleEscalation's no-post-cycle-3- report check — marker present + cycle-3 entry + a stray post-cycle-3 implementation report, no other defect. Mutation control: changing the check to `reports > 99999` now reds exactly this case (before, every report-carrying negative also lacked the marker and red on the marker check first, leaving the report-count clause unguarded). - B: assertThirdCycleEscalation now requires the entity stayed NON-terminal — escalation parks for the human, it does not self-resolve to status: done. Add the isolating negative (marker + cycle-3 + status: done, no other defect); mutation control: deleting the check reds exactly this case. Polish: - Section-scope the `- Cycle N:` and escalation-marker matches to the `### Feedback Cycles` section (was matched anywhere in the body); add the out-of-section isolating case + mutation control. - Move assertClaude/CodexReviewerReuse to a default-tag file and add an offline table test (wrong-recipient / unrelated-tool / loose-narration red, real tool-call passes) so they no longer depend only on the live runners. - Doc/comment nits: drop the stale "three" count in scenario-testing-principles.md (the seed block is the source of truth); realign stale AC-number labels in test comments to this entity's AC scheme. Offline `go test ./...` green (1124/15 pkgs); live lane compiles. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d-correlated reuse The rejection-flow fixture started already-rejected (a pre-written cycle-1 validation report), so no cycle-1 reviewer was ever spawned in-session and the cycle-2 reviewer-reuse signal was unreachable — the FO correctly fresh-dispatched and the reuse assertions failed across all three CI lanes (PR #302). Fixture: start at status: implementation, before cycle-1 validation. The FO now drives a real cycle-1 implementation (omit the fix marker) -> cycle-1 validation REJECTS (spawning a reviewer kept alive entering the feedback flow) -> route back -> cycle-2 rework applies the marker -> cycle-2 re-validation REUSES the kept-alive reviewer. The README's implementation stage is per-round (omit on first, apply on rework); the durable 2-cycle end-state (>=2 impl reports, >=2 cycle entries, marker) is unchanged. Reuse assertions: a local live run of BOTH hosts revealed the reuse matchers were ALSO wrong — they expected a "validation" name, but a kept-alive reviewer is re-engaged by its opaque handle. assertClaudeReviewerReuse now correlates the Agent/Task spawn whose description names the validation stage -> the agentId its tool_result returns -> a SendMessage to that agentId (or a name). assertCodexReviewerReuse parses the real collab_tool_call shape and correlates the validation spawn_agent's receiver thread -> a send_input to that thread, distinguishing reviewer reuse from the feedback-to-implementation send_input. Both proven on the real recorded transcripts and mutation-controlled offline. Offline negatives RED on the shipped no-reviewer shape (a fixture pre-containing a validation report; a run with no reuse call). Timeout sized against a MEASURED local sonnet run: 4 live phases = 13.65 min end-to-end (~3.1m ensign work + ~10.5m teams-mode polling), Codex 5.5 min. Per-scenario rejection-flow timeout 8m -> 22m; CI/local go test -timeout 40m (the suite exceeds Go's default 10m binary timeout, the opus panic source). Docs + workflow guards updated to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The captain ruling bans long basket/binary timeouts. A live shared scenario shares ONE spacedock host process (reviewer-reuse needs the cycle-1 reviewer alive across the route-back, so it cannot be split into per-stage launches), and a measured sonnet rejection-flow is ~6 min of genuine sequential model work — but every UNIT (each FO turn, each ensign stage) completes well under 60s and the host stream emits frequent liveness events. streamWithStallWatchdog reads the host stdout incrementally and resets a 60s (stageStallTimeout) timer on every line; it kills the process and fails fast only when the stream goes silent that long — catching a true hang precisely while never false-killing slow-but-live thinking. Both the Claude and Codex runners now stream through it instead of context.WithTimeout(scenario.timeout). Offline unit tests drive synthetic streams (no model): a normal-cadence stream passes; a >budget-silent stream is killed + returns a stallError. Mutation-controlled — disabling the timer reset false-kills a normal stream; removing the kill/error path hangs a stalled stream (caught by the test -timeout). Offline go test ./... green (1146/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… table The captain ruling bans long basket timeouts; the stall-watchdog is now the guard. Leaving the 22m (and 8m/2m/90s) values as a `timeout` field in the very shared table this work fixes is dead basket data that drifts back, so remove the field. - sharedRuntimeScenario drops `timeout`; the four scenarios carry only their host-neutral facts (name, provenance, intent). - Both out-of-scope live adapters (claudeRunnerAdapter via livescenario_adapter, auto_continue) reach the host through claudeLiveRunner.run, which already streams through the per-stage stall-watchdog — so they inherit the watchdog mechanically. Removed their now-dead per-call `timeout` (2m / 15m); no explicit per-call deadline needed, no logic restructured. - TestSharedRuntimeScenarioDefinitions drops the `timeout > 0` check and the rejection-flow > gate-guardrail ordering (they policed a removed field); the host-neutrality field-set + no-host-named-field assertions stay, now banning a timeout field too. Offline go test ./... green (1146/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…stop Captain-approved final timeout numbers, grounded in the team-lead's instrumented measurement (rejection-flow sonnet 6.16m / opus 8.98m; max FO-stream-silence gap sonnet 28.3s / opus 59.1s; full 4-scenario serial suite ~27m opus). - stageStallTimeout 60s -> 120s: ~2x margin over the measured 59.1s opus max-gap (a sub-agent dispatch blocks the FO top-level stream); 60s left only ~1s margin -> CI-flaky. Still a precise hang-detector. The runner/watchdog comments carry the measured justification. - AUDIT the exception rather than evade it: there is a standing AC-1 60s-cap guard (live_budget_test.go's AST scan of streamwatch_test.go + live_test.go, plus the pre-existing streamWatcher/quietBudgetDefault) that my 120s would silently slip past because the AST guard does not scan stall_watchdog_test.go. Added TestStageStallTimeoutIsCaptainApprovedException pinning stageStallTimeout == 120s so the single sanctioned >60s exception is documented + drift-guarded (a new value reds the pin and needs captain re-approval). - go test -timeout 40m stays a LOOSE BACKSTOP (above the ~27m full-suite wall-time), not a hang-timer — the 120s watchdog is the real guard. CI commands, README locals, doc-contract + release/journey guard comments reframed to "backstop only" and mutation-confirmed (drifting either command string reds its guard). Offline go test ./... green (1147/15); live lane vet+build clean. The pre-existing 60s streamWatcher on the TestLiveEnsignCycle path is untouched (out of scope) but carries the same ~1s opus margin — flagged to team-lead. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The cycle-4 120s watchdog is reversed per captain ruling: (1) a standing codified guard (TestNoTimeoutLiteralExceeds60s) already encodes the <=60s discipline, so a 120s const in a file it didn't scan was a silent end-run; (2) the 59.1s "max gap" was measured on timestamped events only, but the watcher resets on EVERY drained line (incl. the untimestamped assistant lines between them), so true inter-line silence is <=59.1s and likely much less — 60s was never at ~1s margin. UNIFY (X) at 60s — one mechanism, DRY: - Removed the duplicate streamWithStallWatchdog (+ its two files). Added streamWatcher.drainToExit(budget, label): the predicate-free sibling of expect that runs the process to exit accumulating the full transcript, resets the no-progress deadline on every drained line, and kills + trips stepTimeout on quietBudgetDefault (60s) silence. Both shared-scenario runners now wire io.Pipe + newCmdPoller + newStreamWatcher + drainToExit — the SAME path TestLiveEnsignCycle uses. Wiring reuse was clean; the runner files carry zero duration literals. - KEPT quietBudgetDefault = 60s; TestLiveEnsignCycle byte-unchanged; the 120s exception pin removed. - STRENGTHENED the AC-1 guard: liveBudgetSources now also scans the shared-runner files (the unguarded gap that let the old basket exist). Mutation-confirmed a 90s literal in the codex runner reds TestNoTimeoutLiteralExceeds60s there. - drainToExit offline unit tests (full-transcript / resets-on-activity / kills-stalled), mutation-controlled (reset-disabled false-kills a progressing stream; kill-removed hangs a stalled one). - Go -timeout 40m command-line backstop stays (a flag, outside the source-literal guard); CI + README + doc-contract comments reframed to the 60s quiet budget. Offline go test ./... green (1147/15); live lane vet+build clean. Cycle-3 reuse work untouched. The 60s-vs-opus question is left to the authoritative opus live drive (the watcher resets per-line, testing TRUE silence, not the timestamped gap). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…amWatcher" This reverts commit 5171982e6b291629dcede89c33bcbf0b5f33c744.

…60s streamWatcher"" This reverts commit aa162a53715ea7ee63baa3ee953288f192118141.

…-passes A detached cycle-8 audit found both reuse assertions FALSE-PASS the exact contract violation they guard: a run where the FO FRESH-dispatches the cycle-2 validator (forbidden by the #141 keepalive contract) and messages it by validation name/thread greened the assertion. The old name/prompt match could not tell a kept-alive cycle-1 reviewer from a freshly-spawned cycle-2 one, and the recorded production transcripts (testdata/*.jsonl) address members by SPAWN NAME, not agentId — so the name path (the one that fires) was operation-blind. Fix (both hosts): genuine reuse now requires BOTH a message to the cycle-1 reviewer's handle AND the ABSENCE of a fresh cycle-2 validation spawn. The discriminator: the rejection-flow drives ONE cycle-1 validation spawn; a reuse run re-engages it (one spawn), a fresh-dispatch emits a SECOND validation spawn. So >1 validation Agent/Task (Claude) or spawn_agent (Codex) is an immediate FAIL regardless of any reuse-shaped message. The docstrings are corrected to the name-addressed production shape. Mutation-controlled (offline table tests): a planted fresh-dispatch + name/thread message REDs each host path; genuine kept-alive reuse (one spawn + reuse message) PASSES. Verified the strengthened assertions still pass the REAL recorded transcripts (both had exactly one validation spawn); neutering each >1 guard re-opens the false-pass. AC-1 guard polish: durationOf now folds the `time.Duration(N)` integer conversion (`time.Duration(120)*time.Second` = 120s), the BasicLit-only scalar evasion the audit named; mutation-confirmed it reds in a guarded runner file. Un-foldable forms (float conversion, runtime scalar) stay deferred to the wired-const value-guard, documented. Offline go test ./... green (1150/15); live lane vet+build clean. The verified-sound prior work (fixture, timeout unify at 60s, AC-2/AC-5) untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…integration) Rebasing onto origin/next pulled in z8 #305, which added a Pi parity requirement: TestSharedScenarioRunnerCoverage now requires every shared scenario to carry a piSharedScenarioCoverageMap() entry. z8 covered the three scenarios that existed when it branched; this cycle's feedback-3-cycle-escalation (added here) had none, so the parity guard reds. Add its entry as a `gap` (Pi has no live-safe shared FO escalation runner yet), matching z8's honest-reason pattern for the others. Offline go test ./... green (1166/15); live lane vet+build clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

clkao had a problem deploying to CI-E2E June 5, 2026 05:35 — with GitHub Actions Failure

clkao had a problem deploying to CI-E2E-OPUS June 5, 2026 05:35 — with GitHub Actions Failure

clkao had a problem deploying to CI-E2E-CODEX June 5, 2026 05:35 — with GitHub Actions Failure

clkao added a commit that referenced this pull request Jun 5, 2026

pr: feedback-nonhappy-live-coverage #302 pending

35f23c0

clkao added a commit that referenced this pull request Jun 5, 2026

debrief: session 2026-06-05 #2 — proof-policy reckoning; gq #302 + wm #…

848862e

…303 merging; hwk/ev3e in flight; xa+3c0b roadmapped

clkao and others added 11 commits June 5, 2026 08:54

Revert "Reverse 120s; unify shared runners onto the existing 60s stre…

e6cf83d

…amWatcher" This reverts commit 5171982e6b291629dcede89c33bcbf0b5f33c744.

Revert "Revert "Reverse 120s; unify shared runners onto the existing …

0389648

…60s streamWatcher"" This reverts commit aa162a53715ea7ee63baa3ee953288f192118141.

clkao force-pushed the spacedock-ensign/feedback-nonhappy-live-coverage branch from 97a63bc to f21a126 Compare June 5, 2026 19:26

clkao had a problem deploying to CI-E2E June 5, 2026 19:26 — with GitHub Actions Failure

clkao temporarily deployed to CI-E2E-CODEX June 5, 2026 19:26 — with GitHub Actions Inactive

clkao temporarily deployed to CI-E2E-OPUS June 5, 2026 19:26 — with GitHub Actions Inactive

clkao temporarily deployed to CI-E2E-PI June 5, 2026 19:26 — with GitHub Actions Inactive

clkao temporarily deployed to CI-E2E June 5, 2026 19:50 — with GitHub Actions Inactive

clkao merged commit 9cd3b69 into next Jun 5, 2026
9 of 10 checks passed

clkao added a commit that referenced this pull request Jun 5, 2026

complete: gq done (PASSED; #302 squash-merged 9cd3b69)

10952a7

clkao added a commit that referenced this pull request Jun 5, 2026

debrief: session 2026-06-05 #4 - Codex 4n merge and PR #302 triage

8d2c646

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Live-scenario coverage for the non-happy feedback-rejection paths#302

Live-scenario coverage for the non-happy feedback-rejection paths#302
clkao merged 11 commits into
nextfrom
spacedock-ensign/feedback-nonhappy-live-coverage

clkao commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

clkao commented Jun 5, 2026

What changed

Evidence

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant