Skip to content

Live-scenario coverage for the non-happy feedback-rejection paths#302

Merged
clkao merged 11 commits into
nextfrom
spacedock-ensign/feedback-nonhappy-live-coverage
Jun 5, 2026
Merged

Live-scenario coverage for the non-happy feedback-rejection paths#302
clkao merged 11 commits into
nextfrom
spacedock-ensign/feedback-nonhappy-live-coverage

Conversation

@clkao
Copy link
Copy Markdown
Collaborator

@clkao clkao commented Jun 5, 2026

Guard the feedback-rejection runaway-loop: add live + offline coverage for the 3-cycle escalation and 2-cycle reviewer-reuse paths that no standing test exercised.

What changed

  • Add a feedback-3-cycle-escalation shared runtime scenario with Claude and Codex runners.
  • Add assertThirdCycleEscalation grading durable state (cycle count + escalation marker + non-terminal + no 4th-bounce report).
  • Evolve rejection-flow to drive the 2-cycle trajectory with reviewer-reuse.
  • Add offline negatives + an offline reviewer-reuse table test; bind the seed-scenarios doc block to the code table.

Evidence

  • Offline go test ./... 1124/15 packages passed; cycle-1 audit findings closed, mutation-controlled.
  • Live TestLiveClaudeSharedScenarios/feedback-3-cycle-escalation PASS (71.3s) — a real FO escalates on cycle 3 instead of a 4th bounce.

gq

clkao added a commit that referenced this pull request Jun 5, 2026
clkao added a commit that referenced this pull request Jun 5, 2026
…303 merging; hwk/ev3e in flight; xa+3c0b roadmapped
clkao and others added 11 commits June 5, 2026 08:54
…ycle

Cover the two non-happy feedback-rejection guarantees no standing test
exercised:

- feedback-3-cycle-escalation (new shared scenario): on the 3rd consecutive
  REJECTED validation the FO escalates to the human instead of auto-bouncing a
  4th time. Host-neutral assertThirdCycleEscalation grades durable entity-body
  state alone — >=3 `### Feedback Cycles` entries, the fixture-instructed
  escalation marker, and no post-cycle-3 implementation report. Registered with
  both Claude and Codex runners; offline negatives (4th-auto-bounce, stalled-at-
  cycle-2) red the assertion.

- rejection-flow (evolved): drive the full 2-cycle trajectory the Go port
  dropped — route back, re-implement, re-validate a second cycle reusing the
  kept-alive reviewer. assertRejectionFlow now grades two implementation reports
  + two recorded cycles; the host runners grade the reviewer-reuse signal
  (Claude SendMessage / Codex send_input). Offline negative reds a single-cycle
  end-state.

Doc-lock: bind the `<!-- seed-scenarios -->` block in scenario-testing-
principles.md to sharedRuntimeScenarios() (TestSeedScenariosDocLock reds on
drift in either direction); name the new scenario in the dev README list.

Live legs (AC-3 both hosts, AC-4 live cycle) are cited at validation/terminal
per the live-AC policy; the secret-free `go test ./...` lane is green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ction-scope

Cycle-2 rework from the detached adversarial audit of 7e03d5ea.

Material:
- A: add an isolating negative for assertThirdCycleEscalation's no-post-cycle-3-
  report check — marker present + cycle-3 entry + a stray post-cycle-3
  implementation report, no other defect. Mutation control: changing the check to
  `reports > 99999` now reds exactly this case (before, every report-carrying
  negative also lacked the marker and red on the marker check first, leaving the
  report-count clause unguarded).
- B: assertThirdCycleEscalation now requires the entity stayed NON-terminal —
  escalation parks for the human, it does not self-resolve to status: done. Add
  the isolating negative (marker + cycle-3 + status: done, no other defect);
  mutation control: deleting the check reds exactly this case.

Polish:
- Section-scope the `- Cycle N:` and escalation-marker matches to the
  `### Feedback Cycles` section (was matched anywhere in the body); add the
  out-of-section isolating case + mutation control.
- Move assertClaude/CodexReviewerReuse to a default-tag file and add an offline
  table test (wrong-recipient / unrelated-tool / loose-narration red, real
  tool-call passes) so they no longer depend only on the live runners.
- Doc/comment nits: drop the stale "three" count in scenario-testing-principles.md
  (the seed block is the source of truth); realign stale AC-number labels in test
  comments to this entity's AC scheme.

Offline `go test ./...` green (1124/15 pkgs); live lane compiles.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d-correlated reuse

The rejection-flow fixture started already-rejected (a pre-written cycle-1
validation report), so no cycle-1 reviewer was ever spawned in-session and the
cycle-2 reviewer-reuse signal was unreachable — the FO correctly fresh-dispatched
and the reuse assertions failed across all three CI lanes (PR #302).

Fixture: start at status: implementation, before cycle-1 validation. The FO now
drives a real cycle-1 implementation (omit the fix marker) -> cycle-1 validation
REJECTS (spawning a reviewer kept alive entering the feedback flow) -> route back
-> cycle-2 rework applies the marker -> cycle-2 re-validation REUSES the kept-alive
reviewer. The README's implementation stage is per-round (omit on first, apply on
rework); the durable 2-cycle end-state (>=2 impl reports, >=2 cycle entries, marker)
is unchanged.

Reuse assertions: a local live run of BOTH hosts revealed the reuse matchers were
ALSO wrong — they expected a "validation" name, but a kept-alive reviewer is
re-engaged by its opaque handle. assertClaudeReviewerReuse now correlates the
Agent/Task spawn whose description names the validation stage -> the agentId its
tool_result returns -> a SendMessage to that agentId (or a name). assertCodexReviewerReuse
parses the real collab_tool_call shape and correlates the validation spawn_agent's
receiver thread -> a send_input to that thread, distinguishing reviewer reuse from
the feedback-to-implementation send_input. Both proven on the real recorded
transcripts and mutation-controlled offline.

Offline negatives RED on the shipped no-reviewer shape (a fixture pre-containing a
validation report; a run with no reuse call). Timeout sized against a MEASURED
local sonnet run: 4 live phases = 13.65 min end-to-end (~3.1m ensign work + ~10.5m
teams-mode polling), Codex 5.5 min. Per-scenario rejection-flow timeout 8m -> 22m;
CI/local go test -timeout 40m (the suite exceeds Go's default 10m binary timeout,
the opus panic source). Docs + workflow guards updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The captain ruling bans long basket/binary timeouts. A live shared scenario shares
ONE spacedock host process (reviewer-reuse needs the cycle-1 reviewer alive across
the route-back, so it cannot be split into per-stage launches), and a measured
sonnet rejection-flow is ~6 min of genuine sequential model work — but every UNIT
(each FO turn, each ensign stage) completes well under 60s and the host stream
emits frequent liveness events.

streamWithStallWatchdog reads the host stdout incrementally and resets a 60s
(stageStallTimeout) timer on every line; it kills the process and fails fast only
when the stream goes silent that long — catching a true hang precisely while never
false-killing slow-but-live thinking. Both the Claude and Codex runners now stream
through it instead of context.WithTimeout(scenario.timeout).

Offline unit tests drive synthetic streams (no model): a normal-cadence stream
passes; a >budget-silent stream is killed + returns a stallError. Mutation-controlled
— disabling the timer reset false-kills a normal stream; removing the kill/error
path hangs a stalled stream (caught by the test -timeout). Offline go test ./...
green (1146/15); live lane vet+build clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… table

The captain ruling bans long basket timeouts; the stall-watchdog is now the guard.
Leaving the 22m (and 8m/2m/90s) values as a `timeout` field in the very shared
table this work fixes is dead basket data that drifts back, so remove the field.

- sharedRuntimeScenario drops `timeout`; the four scenarios carry only their
  host-neutral facts (name, provenance, intent).
- Both out-of-scope live adapters (claudeRunnerAdapter via livescenario_adapter,
  auto_continue) reach the host through claudeLiveRunner.run, which already streams
  through the per-stage stall-watchdog — so they inherit the watchdog mechanically.
  Removed their now-dead per-call `timeout` (2m / 15m); no explicit per-call
  deadline needed, no logic restructured.
- TestSharedRuntimeScenarioDefinitions drops the `timeout > 0` check and the
  rejection-flow > gate-guardrail ordering (they policed a removed field); the
  host-neutrality field-set + no-host-named-field assertions stay, now banning a
  timeout field too.

Offline go test ./... green (1146/15); live lane vet+build clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…stop

Captain-approved final timeout numbers, grounded in the team-lead's instrumented
measurement (rejection-flow sonnet 6.16m / opus 8.98m; max FO-stream-silence gap
sonnet 28.3s / opus 59.1s; full 4-scenario serial suite ~27m opus).

- stageStallTimeout 60s -> 120s: ~2x margin over the measured 59.1s opus max-gap
  (a sub-agent dispatch blocks the FO top-level stream); 60s left only ~1s margin
  -> CI-flaky. Still a precise hang-detector. The runner/watchdog comments carry
  the measured justification.
- AUDIT the exception rather than evade it: there is a standing AC-1 60s-cap guard
  (live_budget_test.go's AST scan of streamwatch_test.go + live_test.go, plus the
  pre-existing streamWatcher/quietBudgetDefault) that my 120s would silently slip
  past because the AST guard does not scan stall_watchdog_test.go. Added
  TestStageStallTimeoutIsCaptainApprovedException pinning stageStallTimeout == 120s
  so the single sanctioned >60s exception is documented + drift-guarded (a new
  value reds the pin and needs captain re-approval).
- go test -timeout 40m stays a LOOSE BACKSTOP (above the ~27m full-suite wall-time),
  not a hang-timer — the 120s watchdog is the real guard. CI commands, README
  locals, doc-contract + release/journey guard comments reframed to "backstop only"
  and mutation-confirmed (drifting either command string reds its guard).

Offline go test ./... green (1147/15); live lane vet+build clean. The pre-existing
60s streamWatcher on the TestLiveEnsignCycle path is untouched (out of scope) but
carries the same ~1s opus margin — flagged to team-lead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cycle-4 120s watchdog is reversed per captain ruling: (1) a standing codified
guard (TestNoTimeoutLiteralExceeds60s) already encodes the <=60s discipline, so a
120s const in a file it didn't scan was a silent end-run; (2) the 59.1s "max gap"
was measured on timestamped events only, but the watcher resets on EVERY drained
line (incl. the untimestamped assistant lines between them), so true inter-line
silence is <=59.1s and likely much less — 60s was never at ~1s margin.

UNIFY (X) at 60s — one mechanism, DRY:
- Removed the duplicate streamWithStallWatchdog (+ its two files). Added
  streamWatcher.drainToExit(budget, label): the predicate-free sibling of expect
  that runs the process to exit accumulating the full transcript, resets the
  no-progress deadline on every drained line, and kills + trips stepTimeout on
  quietBudgetDefault (60s) silence. Both shared-scenario runners now wire io.Pipe +
  newCmdPoller + newStreamWatcher + drainToExit — the SAME path TestLiveEnsignCycle
  uses. Wiring reuse was clean; the runner files carry zero duration literals.
- KEPT quietBudgetDefault = 60s; TestLiveEnsignCycle byte-unchanged; the 120s
  exception pin removed.
- STRENGTHENED the AC-1 guard: liveBudgetSources now also scans the shared-runner
  files (the unguarded gap that let the old basket exist). Mutation-confirmed a 90s
  literal in the codex runner reds TestNoTimeoutLiteralExceeds60s there.
- drainToExit offline unit tests (full-transcript / resets-on-activity /
  kills-stalled), mutation-controlled (reset-disabled false-kills a progressing
  stream; kill-removed hangs a stalled one).
- Go -timeout 40m command-line backstop stays (a flag, outside the source-literal
  guard); CI + README + doc-contract comments reframed to the 60s quiet budget.

Offline go test ./... green (1147/15); live lane vet+build clean. Cycle-3 reuse
work untouched. The 60s-vs-opus question is left to the authoritative opus live
drive (the watcher resets per-line, testing TRUE silence, not the timestamped gap).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…amWatcher"

This reverts commit 5171982e6b291629dcede89c33bcbf0b5f33c744.
…60s streamWatcher""

This reverts commit aa162a53715ea7ee63baa3ee953288f192118141.
…-passes

A detached cycle-8 audit found both reuse assertions FALSE-PASS the exact contract
violation they guard: a run where the FO FRESH-dispatches the cycle-2 validator
(forbidden by the #141 keepalive contract) and messages it by validation
name/thread greened the assertion. The old name/prompt match could not tell a
kept-alive cycle-1 reviewer from a freshly-spawned cycle-2 one, and the recorded
production transcripts (testdata/*.jsonl) address members by SPAWN NAME, not
agentId — so the name path (the one that fires) was operation-blind.

Fix (both hosts): genuine reuse now requires BOTH a message to the cycle-1
reviewer's handle AND the ABSENCE of a fresh cycle-2 validation spawn. The
discriminator: the rejection-flow drives ONE cycle-1 validation spawn; a reuse run
re-engages it (one spawn), a fresh-dispatch emits a SECOND validation spawn. So
>1 validation Agent/Task (Claude) or spawn_agent (Codex) is an immediate FAIL
regardless of any reuse-shaped message. The docstrings are corrected to the
name-addressed production shape.

Mutation-controlled (offline table tests): a planted fresh-dispatch + name/thread
message REDs each host path; genuine kept-alive reuse (one spawn + reuse message)
PASSES. Verified the strengthened assertions still pass the REAL recorded
transcripts (both had exactly one validation spawn); neutering each >1 guard
re-opens the false-pass.

AC-1 guard polish: durationOf now folds the `time.Duration(N)` integer conversion
(`time.Duration(120)*time.Second` = 120s), the BasicLit-only scalar evasion the
audit named; mutation-confirmed it reds in a guarded runner file. Un-foldable
forms (float conversion, runtime scalar) stay deferred to the wired-const
value-guard, documented.

Offline go test ./... green (1150/15); live lane vet+build clean. The
verified-sound prior work (fixture, timeout unify at 60s, AC-2/AC-5) untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…integration)

Rebasing onto origin/next pulled in z8 #305, which added a Pi parity requirement:
TestSharedScenarioRunnerCoverage now requires every shared scenario to carry a
piSharedScenarioCoverageMap() entry. z8 covered the three scenarios that existed
when it branched; this cycle's feedback-3-cycle-escalation (added here) had none,
so the parity guard reds. Add its entry as a `gap` (Pi has no live-safe shared FO
escalation runner yet), matching z8's honest-reason pattern for the others.

Offline go test ./... green (1166/15); live lane vet+build clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clkao clkao force-pushed the spacedock-ensign/feedback-nonhappy-live-coverage branch from 97a63bc to f21a126 Compare June 5, 2026 19:26
@clkao clkao temporarily deployed to CI-E2E-CODEX June 5, 2026 19:26 — with GitHub Actions Inactive
@clkao clkao temporarily deployed to CI-E2E-OPUS June 5, 2026 19:26 — with GitHub Actions Inactive
@clkao clkao merged commit 9cd3b69 into next Jun 5, 2026
9 of 10 checks passed
clkao added a commit that referenced this pull request Jun 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant