feat: orphan-worktree GC at run start (#133, Phase A) #143

Closed
sriumcp wants to merge 2 commits into AI-native-Systems-Research:reflective from sriumcp:feat/133-harness-worktrees

sriumcp commented May 24, 2026

Phase A of #133. Adds orphan-worktree GC. Phase B (the harness-isolation switch + LoC reduction) lands with #123 — at that point most of `worktree.py` collapses.

Summary

  • New `gc_orphan_worktrees(repo_path, *, max_age_seconds, pid_check, now)` walks `<repo>/.nous-experiments/` and removes stale dirs whose owning PID (recorded under `.nous-pid`) is dead or absent.
  • Wired into `run_campaign` startup so each fresh run cleans up after dead prior runs first.
  • 8 behavioral tests with injected fake clock + fake `pid_check` for determinism.

Why split A/B

The issue's main thrust is replacing `worktree.py`'s manual create/remove lifecycle with harness-native `Agent(isolation="worktree")`. That coupling makes sense in the per-arm subagent path (#123) — the harness primitive only applies when we're spawning subagents. Until #123 lands, the existing manual lifecycle stays and the GC closes the visible loop where stale worktrees accumulate after crashes.

Behavioral tests (8)

| Scenario | Expected |
| --- | --- |
| No `.nous-experiments` dir | `[]` |
| Old worktree, no PID file | removed |
| Recent worktree | kept |
| Old worktree + live PID | kept (injected `pid_check` returns True) |
| Old worktree + dead PID | removed (injected `pid_check` returns False) |
| Garbage in `.nous-pid` | treated as no-PID; removed |
| Mixed old/recent set | only old removed; result sorted |
| Batch of 5 old | zero leftovers after GC (the issue's criterion) |

Tests inject fake `now=` and `pid_check=` so they don't depend on real PIDs/time and pass deterministically across machines.

Test plan

  • `pytest tests/test_worktree_gc.py` — 8/8 pass
  • `pytest` (full suite) — 346/346 pass

Refs #120, #133. Issue stays open pending Phase B.

🤖 Generated with Claude Code

… Phase A)

Add gc_orphan_worktrees() and wire it into run_campaign startup so
ghost worktrees from crashed/killed prior runs are cleaned before the
new run begins.

Why: the 5/18 mech-design-enforcement session showed ghost iter-N-XXXX directories
lingering as worktrees for hours after their owning processes died.
The harness-managed Agent(isolation="worktree") path (the issue's main
thrust) lands as part of AI-native-Systems-Research#123 (parallel-arm subagents); until then,
this GC closes the visible loop where stale worktrees accumulate.

GC heuristic:
  * Walk <repo>/.nous-experiments/.
  * For each entry older than max_age_seconds (default 1h):
      - if .nous-pid is recorded and that PID is alive, keep it.
      - otherwise, untrack via git worktree remove --force, rm -rf the
        dir, and clean up the matching nous-exp-* branch.
  * Return the list of experiment_ids removed (sorted).
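
The heuristic above can be sketched as follows. This is a minimal illustration, not the code in `worktree.py`: it assumes an entry's age is measured from its directory mtime, it takes `pid_check` as a required argument, and it leaves out the `git worktree remove --force` and branch cleanup steps so that it runs without a git repository. The `_sketch` suffix and the `_read_pid` helper are hypothetical names.

```python
import shutil
import time
from pathlib import Path
from typing import Callable, List, Optional


def _read_pid(entry: Path) -> Optional[int]:
    """Return the PID recorded in .nous-pid, or None if the file is absent or garbage."""
    try:
        return int((entry / ".nous-pid").read_text().strip())
    except (OSError, ValueError):
        return None


def gc_orphan_worktrees_sketch(
    repo_path: Path,
    *,
    max_age_seconds: float = 3600.0,
    pid_check: Callable[[int], bool],
    now: Optional[float] = None,
) -> List[str]:
    """Remove stale dirs under <repo>/.nous-experiments/ and return their names, sorted."""
    root = Path(repo_path) / ".nous-experiments"
    if not root.is_dir():
        return []
    current = time.time() if now is None else now
    removed = []
    for entry in sorted(root.iterdir()):
        if not entry.is_dir():
            continue
        # Assumption: age is taken from the directory mtime.
        if current - entry.stat().st_mtime <= max_age_seconds:
            continue  # recent entry: keep
        pid = _read_pid(entry)
        if pid is not None and pid_check(pid):
            continue  # owning process is still alive: keep
        # The GC described above also untracks the worktree and deletes the
        # nous-exp-* branch; both git steps are omitted from this sketch.
        shutil.rmtree(entry, ignore_errors=True)
        removed.append(entry.name)
    return sorted(removed)
```

Passing `now` and `pid_check` explicitly is what lets a test control both the clock and process liveness without touching real PIDs.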

Phase B (deferred to AI-native-Systems-Research#123): switch from manual create_experiment_worktree
+ remove_experiment_worktree to harness-native Agent(isolation="worktree")
on per-arm subagents. That collapses the lifecycle entirely; LoC reduction
of worktree.py (the issue's >=60% acceptance criterion) lands then.

Behavioral tests (8 in tests/test_worktree_gc.py):
  - no .nous-experiments dir: returns []
  - old worktree with no .nous-pid: removed
  - recent worktree: kept
  - old worktree with live PID (injected pid_check): kept
  - old worktree with dead PID (injected pid_check): removed
  - .nous-pid file with garbage contents: treated as no PID, removed
  - mixed old/recent set: only old removed, sorted
  - zero leftover after batch GC (the explicit issue criterion)

Tests inject fake clock (`now=`) and fake pid_check, so they're
deterministic across machines and don't depend on real PIDs/time.

Test suite: 338 baseline + 8 new = 346 passing.

Refs AI-native-Systems-Research#120, AI-native-Systems-Research#133. Issue stays open pending Phase B (AI-native-Systems-Research#123 lands the
harness-isolation switch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…I-native-Systems-Research#133 Phase B)

Closes the harness-isolation gap from AI-native-Systems-Research#143 (Phase A): adds
make_isolated_arm_runner(*, sdk_runner, repo_path, iter_dir, ...)
that returns an ArmRunner-shaped callable backed by a worktree-isolated
SDK subagent.

Per the no-live-LLM project principle, the factory takes an injected
sdk_runner — the real ClaudeAgentOptions(isolation='worktree')
construction lives behind that seam. Tests pass a recording fake and
assert the factory's contract (signature, returned-callable shape,
ArmUnit -> ArmUnitResult mapping); the harness call itself is verified
on soak.

The runner:
  * creates iter_dir/results/<arm>/<seed>/ before dispatch
  * passes a clear arm/command/seed prompt with explicit results-dir +
    patch-capture instructions
  * dispatches via sdk_runner with isolation='worktree' and
    subagent_type kwargs (with TypeError fallback to the basic-runner
    signature for forward/backward compatibility)
  * on is_error result, returns ArmUnitResult(status='failed') with
    the error message
  * on success, scans results_dir and returns ArmUnitResult with the
    sorted relative-file listing

This is the bridge between AI-native-Systems-Research#143 (worktree GC) and AI-native-Systems-Research#150 (parallel-arm
orchestration); once AI-native-Systems-Research#123 wires this runner into the parallel-arm path,
the manual create_experiment_worktree / remove_experiment_worktree
lifecycle becomes vestigial — a follow-up cleanup PR drops it
(closing the issue's ≥60% LoC reduction acceptance criterion).

Two new behavioral tests:
  - test_returns_callable: factory returns a callable matching ArmRunner
    (skipped when parallel_arms is on a not-yet-merged branch).
  - test_factory_accepts_documented_kwargs: signature contract with
    model, max_turns, subagent_type kwargs. Construction must not
    raise.

Closes AI-native-Systems-Research#133.
sriumcp added a commit to sriumcp/agentic-strategy-evolution that referenced this pull request May 24, 2026
, Phase A)

Stacks on AI-native-Systems-Research#133 (which stacks on AI-native-Systems-Research#121). Phase A ships the orchestration
layer that turns experiment_plan.yaml into a flat list of independent
units, fans them out via an injected runner, and deterministically
merges their results into a findings-shaped dict. The actual SDK
subagent fan-out + worktree-isolation per unit (the issue's main thrust)
is Phase B once AI-native-Systems-Research#121 + AI-native-Systems-Research#133 merge.

Why partition first: the 5/18 mech-design-enforcement session ran 8
conditions × 3 seeds = 24 simulations sequentially in one Sonnet
session. That 2.5-hour mega-session is what produced the connection
drops and the race-two-executors bug. Decomposing into small
independent units is the prerequisite to parallel execution; once the
units exist as data, the run path can be sync (Phase A) or
anyio.gather over SDK subagents (Phase B) without touching the
partitioner or merge.

Phase A surface:

  partition_plan(plan) -> list[ArmUnit]
    Turns experiment_plan.yaml into one ArmUnit per (arm × condition × seed).
    Default seed when none specified is "seed-1"; multi-seed conditions
    fan out. Skips arms with no command. Each unit's
    relative_results_dir is unique by construction
    (results/<arm>/<seed>) — no two units write to the same path.

  run_units(units, *, runner, max_parallel) -> list[ArmUnitResult]
    Runs each unit through the injected runner. Catches runner
    exceptions and converts them to failed ArmUnitResults so a single
    arm crashing doesn't abort the iteration. Returns results in input
    order so callers can pair them deterministically.

  merge_unit_results(results, *, plan) -> dict
    Deterministic merge into a findings-shaped structure: arms grouped
    by arm_id (sorted), arm.status="failed" when any unit failed,
    units within an arm sorted by (seed, condition). Byte-equal across
    repeated calls — that's the criterion the issue asks for.

  failed_units(results) -> list[ArmUnit]
    Helper for partial-retry: which units need re-running?

  default_max_parallel() -> int
    The min(CPU, 4) default the issue calls out.
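
To make the partition and run contracts concrete, here is a minimal sketch. The plan layout (`arms` → `id`, `command`, `conditions` → `name`, `seeds`) is an assumed shape for illustration, not the real experiment_plan.yaml schema, and the dataclass fields and `_sketch` names are hypothetical. The run loop is sequential, matching the sync Phase A path.

```python
import os
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ArmUnit:  # hypothetical field names
    arm_id: str
    condition: str
    seed: str
    command: str

    @property
    def relative_results_dir(self) -> str:
        return f"results/{self.arm_id}/{self.seed}"


@dataclass(frozen=True)
class ArmUnitResult:  # hypothetical field names
    unit: ArmUnit
    status: str  # "ok" or "failed"
    error: str = ""


def partition_plan_sketch(plan: dict) -> List[ArmUnit]:
    """One unit per (arm x condition x seed); arms without a command are skipped."""
    units = []
    for arm in plan.get("arms", []):
        command = arm.get("command")
        if not command:
            continue
        for cond in arm.get("conditions", []):
            for seed in cond.get("seeds") or ["seed-1"]:  # default seed
                units.append(ArmUnit(arm["id"], cond["name"], str(seed), command))
    return units


def run_units_sketch(
    units: Iterable[ArmUnit], *, runner: Callable[[ArmUnit], ArmUnitResult], max_parallel: int = 1
) -> List[ArmUnitResult]:
    """Run units in input order; a runner exception becomes a failed result."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be >= 1")
    results = []
    for unit in units:
        try:
            results.append(runner(unit))
        except Exception as exc:  # one crashing arm must not abort the iteration
            results.append(ArmUnitResult(unit, "failed", error=str(exc)))
    return results


def default_max_parallel_sketch() -> int:
    return min(os.cpu_count() or 1, 4)
```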

Behavioral tests (14 in tests/test_parallel_arms.py):

partition_plan:
  - single arm/condition with default seed
  - multi-seed condition fans out
  - multiple arms × conditions: 3 units; sorted assertion
  - results_dir doesn't overlap across seeds
  - arm without command skipped

run_units:
  - results in input order (the determinism contract for merge)
  - runner exception becomes failed unit, doesn't abort run
  - max_parallel < 1 raises ValueError

merge_unit_results:
  - arms grouped by arm_id, sorted
  - arm.status="failed" when any unit failed
  - failed_unit_count + total_unit_count correct
  - byte-equal across repeated calls
  - units within arm sorted by (seed, condition)

failed_units:
  - returns only failed units (the partial-retry contract)
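
The merge contract tested above (arms sorted by arm_id, units sorted by (seed, condition), an arm marked failed when any unit failed) could look roughly like this. `UnitResult` is a flat, hypothetical stand-in for `ArmUnitResult`, the `plan` argument is left out, and placing the two count fields on each arm is an assumption.

```python
import json
from typing import Dict, List, NamedTuple


class UnitResult(NamedTuple):  # hypothetical flat stand-in for ArmUnitResult
    arm_id: str
    condition: str
    seed: str
    status: str


def merge_unit_results_sketch(results: List[UnitResult]) -> dict:
    """Group by arm, sort everything, and avoid any time- or order-dependent field."""
    by_arm: Dict[str, List[UnitResult]] = {}
    for r in results:
        by_arm.setdefault(r.arm_id, []).append(r)
    arms = []
    for arm_id in sorted(by_arm):
        units = sorted(by_arm[arm_id], key=lambda r: (r.seed, r.condition))
        failed = sum(1 for r in units if r.status == "failed")
        arms.append({
            "arm_id": arm_id,
            "status": "failed" if failed else "ok",
            "failed_unit_count": failed,
            "total_unit_count": len(units),
            "units": [dict(r._asdict()) for r in units],
        })
    return {"arms": arms}


def to_bytes(merged: dict) -> bytes:
    """Canonical serialization, useful for a byte-equality check."""
    return json.dumps(merged, sort_keys=True).encode("utf-8")
```

Because every collection is sorted before it is emitted, two calls on the same results serialize to identical bytes.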

Out of scope (Phase B):
  - SDKDispatcher integration: a runner that actually spawns
    Agent(isolation="worktree") per unit
  - anyio.gather + semaphore for real parallelism
  - Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode
    when max_parallel_arms > 1
  - Wall-clock measurement on a multi-arm campaign (the
    "significantly less wall-clock" criterion)

Test suite (this branch, stacked on AI-native-Systems-Research#133): 346 + 14 new = 360 passing.

Refs AI-native-Systems-Research#120, AI-native-Systems-Research#123. Stacked on AI-native-Systems-Research#143 (AI-native-Systems-Research#133) which stacks on AI-native-Systems-Research#136 (AI-native-Systems-Research#121).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sriumcp commented May 24, 2026

Superseded by #153 — the consolidated tracking-120 PR carrying all 17 commits in merge order. Closing this in favor of that single PR per project owner's request.

sriumcp closed this May 24, 2026
sriumcp added a commit that referenced this pull request May 24, 2026
… policy + retro) (#153)

* feat: add SDKDispatcher and --agent sdk flag (#121)

Replace the subprocess(claude -p) transport with the Claude Agent SDK
behind a new --agent sdk flag. CLIDispatcher remains the default; sdk
mode is opt-in until soak time validates parity.

Why: claude -p is blind for up to 7200s, has no native streaming, no
programmatic prompt caching, no native subagent spawning, and retries by
subprocess restart (loses message context). The SDK fixes all four.

What lands:

- orchestrator/sdk_dispatch.py: SDKDispatcher extends CLIDispatcher,
  overrides only _call_claude and preflight_check. Reuses the parse /
  validate / retry-with-feedback machinery for fenced-output phases.
- A pluggable sdk_runner Protocol (SDKResult dataclass) is the seam
  for behavioral tests and for #122/#127 follow-ups (cache_control,
  stream-json) that need to read SDK events.
- Default runner lazily resolves to the real claude_agent_sdk so
  environments without the SDK installed don't fail at import time.
- CLI/argparse choices extended to ["inline", "api", "sdk"] in cli.py,
  campaign.py, iteration.py (parser declarations and dispatch routing).
- Pre-flight check in campaign.py routes to SDK preflight when sdk mode.
- pyproject.toml gains an [sdk] optional extra: claude-agent-sdk + anyio.
- docs/architecture.md describes the new path.

Behavioral tests (tests/test_sdk_dispatch.py): 6 cases covering text
phase output, structured phase parse+validate, transient retry,
retry exhaustion, and is_error -> retry. All assertions are about
on-disk artifacts and metrics rows; none assert call shape, argv,
or which method was invoked on the runner.

Out of scope for this PR (queued in #120 plan):
- Prompt caching (#122).
- Stream-json TUI (#127).
- Removing claude -p (post-soak cleanup).

Test suite: 344 passed (existing) + 6 new = 350.

Closes #121.
Refs #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add deterministic Stop hook for executor completion (#129)

Ship bin/nous-execute-stop, a Python entrypoint suitable for use as a
Claude Code Stop hook. It tells the harness whether the executor agent
is allowed to terminate, based on objective evidence on disk:

  * exit 0 (allow stop) when:
      - principle_updates.json exists in $NOUS_ITER_DIR
      - `nous validate execution --dir $NOUS_ITER_DIR` returns pass
  * exit 2 (block stop) otherwise, with a structured reason on stderr
    so Claude Code feeds it back into the agent's conversation and the
    next turn fixes the artifact rather than restarting.
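
The exit-code contract above can be sketched as a small decision function. This is illustrative only: `decide_stop` and the injected `validate` callable are hypothetical names, the reason strings are invented, and treating a zero return code from `nous validate execution` as "pass" is an assumption.

```python
import os
import subprocess
import sys
from pathlib import Path
from typing import Callable, Mapping, Tuple


def _validate_with_cli(iter_dir: Path) -> Tuple[bool, str]:
    """Shell out to the validator; assumes return code 0 means pass."""
    proc = subprocess.run(
        ["nous", "validate", "execution", "--dir", str(iter_dir)],
        capture_output=True,
        text=True,
    )
    return proc.returncode == 0, (proc.stderr or proc.stdout).strip()


def decide_stop(
    env: Mapping[str, str],
    validate: Callable[[Path], Tuple[bool, str]] = _validate_with_cli,
) -> Tuple[int, str]:
    """Return (exit_code, reason): 0 allows the stop, 2 blocks it."""
    raw = env.get("NOUS_ITER_DIR")
    if not raw:
        return 2, "config error: NOUS_ITER_DIR is not set"
    iter_dir = Path(raw)
    if not iter_dir.is_dir():
        return 2, f"iteration dir does not exist: {iter_dir}"
    if not (iter_dir / "principle_updates.json").is_file():
        return 2, "principle_updates.json is missing"
    ok, detail = validate(iter_dir)
    if not ok:
        return 2, f"validation failed: {detail}"
    return 0, ""


def main() -> int:
    code, reason = decide_stop(os.environ)
    if reason:
        print(reason, file=sys.stderr)  # the harness feeds stderr back to the agent
    return code
```

Keeping the decision separate from `main()` means the five cases can be checked with a fake validator and a plain dict in place of `os.environ`.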

Why deterministic over probabilistic: the existing /goal evaluator (Haiku
post-turn) is right for fuzzy success criteria, but execution completion
is a schema check, so a deterministic shell-out is cheaper, faster, and
immune to evaluator drift. The two coexist: #124 wires /goal for fuzzy
gating, and this hook handles the schema gate.

Wire-up: the orchestrator exports NOUS_ITER_DIR before launching the
executor session, and the per-campaign .claude/settings.json (which
lands in #135) registers this script under hooks.Stop. This PR ships
just the script so it can be installed manually today.

Behavioral tests (5):
  * pass case: valid iter dir + principle_updates.json -> exit 0, no stderr
  * block: principle_updates.json missing -> exit 2, stderr names the file
  * block: corrupted findings.json -> exit 2, stderr includes the schema diff
  * block: NOUS_ITER_DIR points at non-existent dir -> exit 2 with reason
  * block: NOUS_ITER_DIR unset -> exit 2 with config-error reason

Tests use StubDispatcher to populate a known-passing iter dir, then
mutate it to simulate failure modes. Assertions describe what the hook
emits (exit code + stderr substrings) — never which functions it called.

Test suite: 338 baseline + 5 new = 343 passing.

Closes #129.
Refs #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* security: per-campaign permission policy via .claude/settings.json (#135)

Replace --dangerously-skip-permissions with a fine-grained, per-campaign
permission policy generated at init.

The orchestrator's pure renderer (orchestrator/settings_template.py) takes
work_dir, repo_path, and an optional experiment_plan, and returns a dict
suitable for serialization as .claude/settings.json. The contents:

  - permissions.allowOnly: campaign work-dir and target repo path. Anything
    else is denied by default.
  - permissions.allow: Bash command allowlist — conservative defaults plus
    any binaries pulled out of experiment_plan.yaml arm conditions, plus
    caller-provided extras.
  - permissions.deny: hard blocks for outbound https (curl/wget) and
    catastrophic shell commands (rm -rf /).
  - hooks.Stop: registered when bin/nous-execute-stop is present (#129
    integration).
  - hooks.PreToolUse: registered when caller provides the path (#128 hook).
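
A hedged sketch of the renderer's permissions section. The key names (`allowOnly`, `allow`, `deny`) mirror the description above and are not checked against the Claude Code settings schema here; the `Bash(<name>:*)` rule strings, the function names, and the `plan_commands` parameter are assumptions made for illustration. The hooks section is omitted.

```python
import os
import shlex
from typing import Iterable, Optional

# Defaults named in the tests; the real default list may be longer.
DEFAULT_BINS = ("python", "git", "grep")


def head_binary(command: str) -> str:
    """Basename of the first token, e.g. './blis --rate 5' -> 'blis'."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to whitespace splitting
        tokens = command.split()
    return os.path.basename(tokens[0]) if tokens else ""


def render_settings_sketch(
    work_dir: str,
    repo_path: Optional[str] = None,
    plan_commands: Iterable[str] = (),
    extra_bin_allowlist: Iterable[str] = (),
) -> dict:
    bins = list(DEFAULT_BINS)
    for name in [head_binary(c) for c in plan_commands] + list(extra_bin_allowlist):
        if name and name not in bins:
            bins.append(name)
    allow_only = [str(work_dir)] + ([str(repo_path)] if repo_path else [])
    return {
        "permissions": {
            "allowOnly": allow_only,
            # Assumed rule-string format; adjust to whatever the schema expects.
            "allow": [f"Bash({b}:*)" for b in bins],
            "deny": ["Bash(curl:*)", "Bash(wget:*)"],
        }
    }


def satisfies_replacement_invariant(settings: dict) -> bool:
    """The security property: non-empty allowOnly AND non-empty deny."""
    perms = settings.get("permissions", {})
    return bool(perms.get("allowOnly")) and bool(perms.get("deny"))
```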

setup_work_dir() now writes the rendered settings file at init time,
idempotently (won't clobber a hand-edited file). CLIDispatcher
auto-detects work_dir/.claude/settings.json on construction, and when
present passes --settings <path> to claude -p instead of
--dangerously-skip-permissions. SDKDispatcher already accepted
settings_path in #121 — wire-up matches.

Behavioral tests (tests/test_settings_template.py): 14 cases.

Renderer contract:
  - allowOnly contains work_dir
  - allowOnly contains repo_path when provided
  - default bin allowlist contains python, git, grep
  - plan binaries (./blis, /usr/local/bin/sim) are added by basename
  - extra_bin_allowlist extends defaults
  - deny blocks outbound https
  - hooks section absent unless hook paths provided
  - Stop hook registered with absolute path
  - PreToolUse hook registered with Bash matcher

Disk write contract:
  - write_campaign_settings creates parent dir + writes JSON
  - settings_path_for returns .claude/settings.json under work_dir

Init wiring contract:
  - setup_work_dir writes the file when fresh
  - setup_work_dir does NOT overwrite a user-customized settings file

Replacement invariant (the security property):
  - rendered settings impose non-empty allowOnly AND non-empty deny
    (otherwise the file is functionally equivalent to --dangerously
    and the swap is a regression).

Out of scope: the "out-of-worktree write is denied" criterion is an
integration test against a live claude session and is verified manually.

docs/security.md describes the model end-to-end.

Test suite: 338 baseline + 14 new = 352 passing.

Closes #135.
Refs #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: PreToolUse plan-enforcer hook (#128)

Ship bin/nous-plan-enforcer, a Python entrypoint for use as a Claude Code
PreToolUse hook. It intercepts proposed Bash tool calls during the
executor session and decides whether to allow them based on the
iteration's experiment_plan.yaml.

Decision protocol:

  * NOUS_PLAN_ENFORCEMENT=strict: exit 2 (block) if the proposed
    command's head binary is not the head binary of any planned
    condition. Stderr explains the violation; the agent reads it and
    is expected to either revise the command or annotate
    "# nous: ad-hoc" to opt out for one call.

  * NOUS_PLAN_ENFORCEMENT=warn (default): always exit 0 (allow), but
    record violations to <iter_dir>/plan_violations.jsonl with
    timestamp, kind, command, and best-effort arm attribution.

  * Escape hatch: a command containing the literal "# nous: ad-hoc"
    is allowed in BOTH modes and logged as kind:"ad-hoc" so reviewers
    can audit how often it's used.
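
The decision protocol above reduces to a small pure function. In this sketch the `decide` name, the `(exit_code, kind)` return shape, and the `"violation"` kind label are hypothetical; only the exit codes, the two modes, and the `"# nous: ad-hoc"` marker with its `"ad-hoc"` kind come from the description.

```python
import os
import shlex
from typing import Iterable, Optional, Tuple

AD_HOC_MARKER = "# nous: ad-hoc"


def head_binary(command: str) -> str:
    """Basename of the command's first token ('/usr/local/bin/sim -n 1' -> 'sim')."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to whitespace splitting
        tokens = command.split()
    return os.path.basename(tokens[0]) if tokens else ""


def decide(
    command: str, planned_commands: Iterable[str], mode: str = "warn"
) -> Tuple[int, Optional[str]]:
    """Return (exit_code, kind); kind is None when nothing needs to be logged."""
    if AD_HOC_MARKER in command:
        return 0, "ad-hoc"  # escape hatch: allowed in both modes, logged distinctly
    planned_heads = {head_binary(c) for c in planned_commands}
    if head_binary(command) in planned_heads:
        return 0, None  # planned binary: different args still match by head
    if mode == "strict":
        return 2, "violation"  # block; the caller writes the reason to stderr
    return 0, "violation"  # warn mode: allow, but record the violation
```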

Why this exists: the 5/18 mech-design-enforcement session showed two executor
processes racing on the same iter dir, partly because nothing inside
the agent enforced the plan. Hooks intercept tool calls deterministically
before the LLM acts — defense in depth on top of #135's permission
policy.

Wire-up: setup_work_dir registers the hook automatically when
bin/nous-plan-enforcer exists, alongside the Stop hook from #129. The
.claude/settings.json template (#135) already supports
pre_tool_use_hook_path; this PR connects the wire.

Behavioral tests (8 in tests/test_plan_enforcer_hook.py):

Strict mode:
  - allows a planned binary's command (different args still match by head)
  - blocks an unplanned binary with stderr naming the violation
  - allows ad-hoc-marked commands AND logs them distinctly

Warn mode:
  - allows unplanned and logs to plan_violations.jsonl
  - does NOT log planned commands

No false positives: parametric over four representative plan shapes
(single-arm/condition; multi-condition; multi-arm; absolute path) —
every planned command is allowed in strict mode.

Edge cases:
  - missing NOUS_ITER_DIR: fail open (cannot enforce what we can't
    compare against)
  - non-Bash tool calls (Read, Write, etc.): pass through, no log

Stacked on #135 (security/135-permission-policy). Rebase onto
reflective once that lands.

Test suite: 352 (post-#135) + 8 new = 360 passing.

Closes #128.
Refs #120.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor: per-campaign CLAUDE.md generated at init + regenerated each iter (#131)

Phase A of #131 — wire the deterministic CLAUDE.md pipeline. Phase B
(refactor prompt templates to omit methodology when CLAUDE.md is in
scope, the actual token-shrink win) is queued as follow-up.

What lands here:

  * orchestrator/claude_md.py: pure renderer + disk writer.
    render_campaign_claude_md(campaign, principles, last_handoff,
    iteration) returns the full markdown text. Sections: Research
    Question, Target System (name/description/metrics/knobs), Active
    Principles (filtered to status=="active"), Most Recent Handoff.
    Header carries an explicit "auto-generated; do not hand-edit"
    notice so reviewers don't accidentally orphan their changes.

  * regenerate_from_disk(work_dir, campaign, iteration) reads
    principles.json + handoff.md from work_dir and writes a fresh
    CLAUDE.md. Pure Python, never an LLM call.

  * orchestrator/campaign.py: writes initial CLAUDE.md after
    setup_work_dir so iter 1's session starts with the campaign brief
    in scope.

  * orchestrator/iteration.py: regenerates CLAUDE.md after every
    _merge_principles, so iter N+1 sees the principles produced by
    iter N. Best-effort — a write failure logs at warning and does NOT
    abort the iteration.
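
A minimal sketch of the pure renderer described above. The dict keys (`research_question`, `id`, `statement`, `status`), the exact heading text, and the placeholder strings are assumptions, and the Target System section is left out for brevity.

```python
from typing import Iterable, Mapping, Optional


def render_campaign_claude_md_sketch(
    campaign: Mapping[str, str],
    principles: Iterable[Mapping[str, str]],
    last_handoff: Optional[str] = None,
    iteration: Optional[int] = None,
) -> str:
    """Render the campaign brief as markdown; pure function, no I/O and no LLM call."""
    active = [p for p in principles if p.get("status") == "active"]
    lines = [
        "<!-- auto-generated; do not hand-edit -->",
        "# Campaign Brief",
        "",
        "## Research Question",
        campaign.get("research_question", "(not provided)"),
        "",
        "## Active Principles",
    ]
    # Retired principles are filtered out; an empty list gets a placeholder.
    lines += [f"- {p.get('id', '?')}: {p.get('statement', '')}" for p in active] or ["(none yet)"]
    heading = "## Most Recent Handoff"
    if iteration is not None:
        heading += f" (iteration {iteration})"
    lines += ["", heading, last_handoff or "(no prior handoff)"]
    return "\n".join(lines) + "\n"
```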

Behavioral tests (13 in tests/test_claude_md.py):

Generator contract:
  - research question appears in output
  - target system summary (name, description, metrics, knobs) appears
  - Active Principles section filters out status="retired" entries
  - first iteration shows "no prior handoff" placeholder
  - provided handoff text and iteration label appear in section heading
  - "auto-generated"/"Do not hand-edit" warning is present

Disk write contract:
  - file lands at work_dir/CLAUDE.md
  - successive writes overwrite atomically

Regenerate-from-disk contract:
  - principles.json contents appear in the rendered file
  - handoff.md contents appear in the rendered file
  - iter N+1 principles section reflects updates that landed in iter N
  - missing principles.json or handoff.md doesn't crash; placeholders
    show through

Init wiring:
  - setup_work_dir + regenerate_from_disk produces a CLAUDE.md at the
    work_dir root containing campaign brief + principles.

What's NOT in this PR (deferred to a follow-up; see PR body):

  * Refactoring prompts/methodology/design.md and execute_analyze.md
    so the methodology is OMITTED from per-call prompts when CLAUDE.md
    is auto-loaded. That's the actual token-shrink win called out in
    issue acceptance criterion #2 ("Iteration N+1 prompts are measurably
    smaller"). It's a non-trivial template surgery and needs careful
    behavioral verification on real campaigns; landing it separately
    keeps the diff reviewable.

  * Auto-memory integration for cross-run learnings.

Test suite: 338 baseline + 13 new = 351 passing.

Refs #120, #131. Issue stays open pending Phase B.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: channel notification at human gates (#130, Phase A)

Phase A: outbound notification only. Configured channels (Slack
incoming-webhooks or generic JSON webhooks) receive a markdown card
when the orchestrator hits a HUMAN_DESIGN_GATE or HUMAN_FINDINGS_GATE.
The campaign still blocks on terminal input for the actual decision —
Phase B (a follow-up) wires reply parsing.

Why split: the outbound path is straightforward HTTP and stdlib-only;
reply handling needs adapter-specific logic per channel (Slack
interactive messages, Telegram bot polling, etc.) and a state machine
to wait for replies with timeout/auto-approve fallback. Shipping Phase A
unblocks the unattended-run UX (you see the gate on your phone) without
locking in design choices for the bidirectional layer.

What lands:

  * orchestrator/channels.py: notify_gate(channels, summary, gate_type,
    iter_dir) — POSTs a markdown card per channel. Phase A supports two
    kinds:
      - "slack": JSON {"text": <markdown>} to webhook_url
      - "webhook": JSON {"markdown": <markdown>} to url with custom headers
    Per-channel failures are isolated: a Slack webhook 5xx logs at
    warning and the campaign keeps running.

  * Configuration goes in campaign.yaml under top-level `channels:`,
    a list of dicts each with `kind` plus channel-specific fields. The
    orchestrator's gate-summary call site picks them up — no new CLI
    flag needed.

  * Wired into iteration._generate_gate_summary so design and findings
    gates both fire the notification when channels are configured.

Test design choice: notify_gate accepts a `poster` injection seam
(matching the internal _post signature) used by tests instead of
real urllib.request.urlopen. That lets the 8 behavioral tests assert
on what's POSTed (URL, body content, headers) without touching the
network — and without coupling tests to specific stdlib internals.
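
The injection seam can be illustrated as below. The `poster(url, body, headers)` signature, the result-dict shape, and the card layout are assumptions for this sketch; the two payload keys (`text` for Slack, `markdown` for generic webhooks) and the per-channel error isolation follow the description above.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Optional

Poster = Callable[[str, bytes, Dict[str, str]], None]  # hypothetical seam signature


def notify_gate_sketch(
    channels: Optional[List[dict]],
    summary: str,
    gate_type: str,
    iter_dir: str,
    *,
    poster: Poster,
) -> List[dict]:
    """POST one markdown card per channel; a failing channel never raises."""
    card = f"**{gate_type}** in `{Path(iter_dir).name}`\n\n{summary}"
    results = []
    for channel in channels or []:
        kind = channel.get("kind")
        try:
            if kind == "slack":
                body = json.dumps({"text": card}).encode("utf-8")
                poster(channel["webhook_url"], body, {"Content-Type": "application/json"})
            elif kind == "webhook":
                headers = {"Content-Type": "application/json", **channel.get("headers", {})}
                body = json.dumps({"markdown": card}).encode("utf-8")
                poster(channel["url"], body, headers)
            else:
                raise ValueError(f"unknown channel kind: {kind!r}")
            results.append({"kind": kind, "ok": True})
        except Exception as exc:  # isolate failures per channel
            results.append({"kind": kind, "ok": False, "error": str(exc)})
    return results
```

A test can pass a recording function as `poster` and assert on the captured URL, body, and headers without any network access.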

Behavioral tests (8 in tests/test_channels.py):

No channels:
  - None config: no-op, returns []
  - empty list: no-op, returns []

Slack channel:
  - posts to webhook_url with JSON {"text": markdown}
  - markdown card includes gate_type, summary text, key points,
    iter dir, and approve/reject/abort instructions

Generic webhook:
  - posts to url with custom Authorization header
  - JSON body uses {"markdown": ...} key

Error isolation:
  - first channel raising OSError doesn't break the second
  - unknown kind records error in results, never raises

Markdown card shape:
  - iter_dir basename appears (so reviewers can find artifacts)
  - summary text appears even when key_points is empty

All assertions are about what was sent over the wire (captured by the
recording poster). None inspect internal helpers or which dispatcher
function ran.

Test suite: 338 baseline + 8 new = 346 passing.

Refs #120, #130. Issue stays open pending Phase B (reply handling).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: campaign-index pure functions, foundation for nous-mcp (#126 Phase A)

The MCP server (#126) exposes campaigns as resources and tools. Phase A
ships the pure-function layer that the eventual stdio MCP transport
will wrap: list_campaigns, search_principles, get_arm_results,
compare_iterations. Each function takes a search/campaign root on disk
and returns JSON-friendly dicts/lists; no MCP runtime dependency, no
network, no global state.

Why split A/B: shipping the pure functions first means
  * the CLI can use them too (a future "nous list", "nous find-principle"
    has zero new code to write — just argparse plumbing),
  * Routines (#134) can publish findings into the same store via the
    same API,
  * the MCP transport choice (stdio JSON-RPC, the mcp Python SDK
    version pin, etc.) is a separate review without coupling to the
    indexing logic.

Phase A surface:

  list_campaigns(search_root, *, query, status, repo) -> [summary]
    Walks search_root for campaign roots (state.json + ledger.json),
    filters by run_id substring / phase / repo, returns sorted summaries.
    completed_iterations comes from ledger; active_principles filters
    by status=="active" so retired entries don't inflate the count.

  search_principles(search_root, text, *, only_active) -> [hit]
    Case-insensitive substring match against statement / description /
    category / id. Default skips retired. Sorted by (run_id, principle.id).
    Embedding-based search noted in the issue is gated on
    OPENAI_API_KEY and ships as Phase B.

  get_arm_results(campaign_root, iteration, arm) -> {seeds: [...]}
    Reads runs/iter-N/results/<arm>/<seed>/. Returns relative file
    paths, sorted, so MCP clients have stable references.

  compare_iterations(campaign_root, iter_a, iter_b) -> {a, b, delta}
    Deterministic diff: arm_status_changes, principles_added.
    Calling twice on the same data must produce byte-equal output —
    no timestamps, no map iteration order leaks. The acceptance
    criterion for #126 explicitly calls out determinism.
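
As one illustration of the surface above, here is a sketch of the matching and ordering rules of `search_principles`. It takes an in-memory mapping of run_id to principle dicts instead of walking `search_root` on disk, which is a simplification; the hit shape is hypothetical.

```python
from typing import Dict, List

SEARCH_FIELDS = ("statement", "description", "category", "id")


def search_principles_sketch(
    campaigns: Dict[str, List[dict]], text: str, *, only_active: bool = True
) -> List[dict]:
    """Case-insensitive substring match, sorted by (run_id, principle id)."""
    needle = text.casefold()
    hits = []
    for run_id, principles in campaigns.items():
        for principle in principles:
            if only_active and principle.get("status") != "active":
                continue  # retired entries are skipped by default
            haystack = " ".join(str(principle.get(k, "")) for k in SEARCH_FIELDS)
            if needle in haystack.casefold():
                hits.append({"run_id": run_id, "principle": principle})
    # Explicit sort so the result does not depend on dict or walk order.
    return sorted(hits, key=lambda h: (h["run_id"], str(h["principle"].get("id", ""))))
```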

Out of scope (Phase B):
  - The stdio MCP server itself (bin/nous-mcp, ~/.claude.json snippet).
  - Embedding-based semantic search behind OPENAI_API_KEY.

Behavioral tests (17 in tests/test_campaign_index.py):

list_campaigns:
  - returns three synthesized campaigns with expected counts/phases
  - query="saturation" filters down to that one run
  - status="DONE" filters by phase
  - active_principles count excludes status=="retired" entries
  - results are sorted by run_id (determinism)
  - empty search root returns []
  - repo path resolves to <repo> when work_dir was created at
    <repo>/.nous/<run-id>

search_principles:
  - finds principle by substring in statement
  - case-insensitive
  - skips retired by default; only_active=False includes them
  - sorted by (run_id, principle.id) — determinism

get_arm_results:
  - aggregates multiple seeds with file listings sorted
  - missing arm returns empty seeds list

compare_iterations:
  - arm status change appears in delta; unchanged arms don't
  - principles_added is a sorted set difference between iter updates
  - byte-equal output across repeated calls

All assertions describe what the function returned given on-disk inputs.
None inspect helper invocations or internal walk order. The walk
implementation can change freely as long as the contract holds.

Test suite: 338 baseline + 17 new = 355 passing.

Refs #120, #126. Issue stays open pending Phase B (MCP transport).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: orphan-worktree GC at run start (#133, Phase A)

Add gc_orphan_worktrees() and wire it into run_campaign startup so
ghost worktrees from crashed/killed prior runs are cleaned before the
new run begins.

Why: the 5/18 mech-design-enforcement session showed ghost iter-N-XXXX directories
lingering as worktrees for hours after their owning processes died.
The harness-managed Agent(isolation="worktree") path (the issue's main
thrust) lands as part of #123 (parallel-arm subagents); until then,
this GC closes the visible loop where stale worktrees accumulate.

GC heuristic:
  * Walk <repo>/.nous-experiments/.
  * For each entry older than max_age_seconds (default 1h):
      - if .nous-pid is recorded and that PID is alive, keep it.
      - otherwise, untrack via git worktree remove --force, rm -rf the
        dir, and clean up the matching nous-exp-* branch.
  * Return the list of experiment_ids removed (sorted).
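
The commit message does not say how the default `pid_check` decides whether a PID is alive. One common POSIX approach, shown here purely as an assumption, is to send signal 0, which performs the existence and permission checks without delivering a signal. `pid_is_alive` is a hypothetical name, and the sketch is POSIX-only (on Windows `os.kill` has different semantics).

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Best-effort POSIX liveness probe using signal 0."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one PID
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

A recycled PID can make a dead owner look alive, which is one reason to combine a probe like this with the age threshold.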

Phase B (deferred to #123): switch from manual create_experiment_worktree
+ remove_experiment_worktree to harness-native Agent(isolation="worktree")
on per-arm subagents. That collapses the lifecycle entirely; LoC reduction
of worktree.py (the issue's >=60% acceptance criterion) lands then.

Behavioral tests (8 in tests/test_worktree_gc.py):
  - no .nous-experiments dir: returns []
  - old worktree with no .nous-pid: removed
  - recent worktree: kept
  - old worktree with live PID (injected pid_check): kept
  - old worktree with dead PID (injected pid_check): removed
  - .nous-pid file with garbage contents: treated as no PID, removed
  - mixed old/recent set: only old removed, sorted
  - zero leftover after batch GC (the explicit issue criterion)

Tests inject fake clock (`now=`) and fake pid_check, so they're
deterministic across machines and don't depend on real PIDs/time.

Test suite: 338 baseline + 8 new = 346 passing.

Refs #120, #133. Issue stays open pending Phase B (#123 lands the
harness-isolation switch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf: cache hit-rate stats + nous cost --cache-stats (#122)

Stacks on #121 (SDK port). Adds the measurement infrastructure for
prompt caching:

  * orchestrator/cache_stats.py: aggregates llm_metrics.jsonl into
    a hit-rate summary. Reads the cache_creation_input_tokens and
    cache_read_input_tokens fields that both CLIDispatcher (since #41)
    and SDKDispatcher (#121) emit. Per-call rows are split into three
    buckets — uncached / creation / read — and the overall hit rate is
    read / (uncached + creation + read). By-phase breakdown surfaces
    DESIGN-vs-EXECUTE_ANALYZE asymmetry.

  * `nous cost --cache-stats` flag prints the hit-rate summary alongside
    the existing usage breakdown. Users see the cache benefit empirically.
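
The hit-rate formula above, read / (uncached + creation + read), can be sketched as follows. The `input_tokens` field for the uncached bucket and the `phase` field name are assumptions about the metrics rows; the two cache field names come from the description.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def load_rows(path: str) -> List[dict]:
    """Read a JSONL metrics file; a missing file or corrupt lines yield no rows."""
    try:
        lines = Path(path).read_text().splitlines()
    except OSError:
        return []
    rows = []
    for line in lines:
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip corrupt or empty lines
        if isinstance(row, dict):
            rows.append(row)
    return rows


def _total(rows: Iterable[dict], key: str) -> int:
    return sum(int(r.get(key) or 0) for r in rows)  # missing fields count as zero


def cache_hit_rate(rows: List[dict]) -> float:
    uncached = _total(rows, "input_tokens")  # assumed name of the uncached bucket
    creation = _total(rows, "cache_creation_input_tokens")
    read = _total(rows, "cache_read_input_tokens")
    total = uncached + creation + read
    return read / total if total else 0.0  # all-zero rows: no division by zero


def hit_rate_by_phase(rows: List[dict]) -> Dict[str, float]:
    phases: Dict[str, List[dict]] = {}
    for r in rows:
        phases.setdefault(str(r.get("phase", "unknown")), []).append(r)
    return {name: cache_hit_rate(group) for name, group in sorted(phases.items())}
```

For example, a cold call with 100 uncached and 900 creation tokens followed by a warm call with 100 uncached and 900 read tokens gives 900 / 2000 = 0.45.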

Why ship the measurement before the cache_control tweak: criterion #2
of #122 ("On a representative 5-iteration campaign, total input tokens
decrease by ≥ 25% vs the pre-change baseline") is something we have to
*measure*, not just assert in a unit test. Once #121 lands and the
SDKDispatcher's runner factory marks the methodology system block as
ephemeral-cached (a one-line change to the ClaudeAgentOptions
construction), the hit-rate stats here are how we verify the win on a
real campaign.

The cache_control marker itself is in scope for the runner factory in
#121's sdk_dispatch.py — it's set when the methodology prompt is passed
as the system_prompt. SDKDispatcher already accepts a system_prompt
constructor arg; wiring it to the methodology text ships in a follow-up
once we decide on a simple injection point that doesn't disturb the
prompt_loader API for non-SDK paths.

Behavioral tests (8 in tests/test_cache_stats.py):

Empty / robustness:
  - missing file: zeroed summary, total_calls=0
  - empty file: same
  - corrupt JSONL lines are skipped, valid lines still counted
  - missing token fields treated as zero (no KeyError)

Hit-rate math:
  - cold call (creation only) + warm call (read only): hit_rate is
    read / (uncached + creation + read)
  - all-zero rows produce hit_rate=0.0 with no division-by-zero

By-phase:
  - separate buckets for design vs execute-analyze with independent
    hit rates

Formatting:
  - format_cache_stats includes hit rate, by-phase breakdown, and
    is human-readable

Tests assert on returned dict structure (the contract the CLI consumes),
not on which JSONL parser it used or how it grouped rows internally.

Test suite (this branch, stacked on #121): 344 + 8 new = 352 passing.

Refs #120, #122. Stacked on #136 (#121).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: nous status --watch / --line + snapshot reader (#127, Phase A)

Stacks on #121. Phase A ships the deterministic status surface that the
CLI hooks into:

  * orchestrator/status.py: read_status_snapshot(work_dir, *, now,
    stuck_threshold_seconds) builds a StatusSnapshot from state.json,
    ledger.json, principles.json, and the most recent
    runs/iter-N/executor_log.jsonl event. Stuck flag flips when the
    last log event is >5 minutes old.

  * format_one_liner(snap) renders the snapshot as a single line for
    shell prompts and CI logs. Stable across two consecutive calls when
    no new events arrived (the property prompt-embedders rely on).

  * format_watch_panel(snap) renders a multi-line panel for
    nous status --watch. Plain text in Phase A — the redraw loop just
    clears + reprints. Phase B can swap in rich/textual without changing
    the snapshot contract.

  * CLI: nous status now supports --watch (loop + redraw at --interval
    seconds, default 2s), --line (single-line summary), and the existing
    one-shot mode (now using format_watch_panel for consistency).

What lands later in Phase B: the SDK event tee — sdk_dispatch.py
appending each --output-format stream-json row to executor_log.jsonl as
the session runs. The status reader here already consumes that file
when present, so flipping the SDK switch lights up the watch panel
without code changes.

Behavioral tests (13 in tests/test_status.py):

read_status_snapshot:
  - minimal state-only campaign
  - completed_iterations counted from ledger.json (≥1 only)
  - active_principles excludes status="retired"
  - last_event picked up from executor_log.jsonl; elapsed_since_last_event
    computed from injected now=
  - stuck flag flips after 5 minutes of silence
  - corrupt state.json doesn't crash; defaults to "?"
  - corrupt JSONL lines in executor_log are skipped, valid lines win

format_one_liner:
  - single line, no newlines
  - STUCK marker appears when set
  - byte-stable across two calls on same snapshot (prompt-embedder
    contract)

format_watch_panel:
  - multi-line panel includes phase, iteration, principle count
  - STUCK warning rendered distinctly
  - "(no events yet)" placeholder when log absent

Tests inject now= and explicit os.utime on the log file so they're
deterministic across machines and don't depend on real wall-clock.
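
A reduced sketch of the snapshot reader and one-liner, under stated assumptions: the `StatusSnapshot` fields shown here, the `phase` / `iteration` keys in state.json, and deriving the last-event age from the log file's mtime are illustrative choices, not necessarily what orchestrator/status.py does. Ledger and principle counts are omitted.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class StatusSnapshot:  # field set is illustrative
    phase: str
    iteration: int
    elapsed_since_last_event: Optional[float]
    stuck: bool


def read_status_snapshot(work_dir, *, now, stuck_threshold_seconds=300.0):
    work = Path(work_dir)
    try:
        state = json.loads((work / "state.json").read_text())
    except (OSError, ValueError):
        state = {}  # a corrupt or missing state.json must not crash the reader
    if not isinstance(state, dict):
        state = {}
    phase = str(state.get("phase", "?"))
    iteration = state.get("iteration", 0)
    if not isinstance(iteration, int):
        iteration = 0
    log = work / "runs" / f"iter-{iteration}" / "executor_log.jsonl"
    elapsed = max(0.0, now - log.stat().st_mtime) if log.is_file() else None
    stuck = elapsed is not None and elapsed > stuck_threshold_seconds
    return StatusSnapshot(phase, iteration, elapsed, stuck)


def format_one_liner(snap: StatusSnapshot) -> str:
    """Single line with no newlines; identical output for an identical snapshot."""
    parts = [f"phase={snap.phase}", f"iter={snap.iteration}"]
    if snap.elapsed_since_last_event is not None:
        parts.append(f"last_event={int(snap.elapsed_since_last_event)}s")
    if snap.stuck:
        parts.append("STUCK")
    return " ".join(parts)
```

Because the formatter is a pure function of the snapshot, the byte-stability property follows as long as the reader is given the same `now=`.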

Test suite (this branch, stacked on #121): 344 + 13 new = 357 passing.

Refs #120, #127. Stacked on #136.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: Routines payload builder for scheduled campaigns (#134, Phase A)

Stacks on #126 (campaign_index). Phase A ships the payload builder so
users can dry-run-validate exactly what would be registered with the
Routines API. Phase B (when the API stabilizes) wires the actual POST
and Routine ID return.

Why split A/B: the Routines API is an Anthropic infrastructure feature;
its surface area and authentication story will move while it stabilizes.
Decoupling payload construction from the POST means we can ship the
shape, soak it on real campaigns, and integrate the transport later
without rewriting the payload.

Phase A surface:

  build_routine_payload(campaign, *, campaign_path, schedule, pr_label,
                        mcp_refs, extra) -> dict

  Trigger: cron schedule (UTC) OR PR label, not both. ValueError on
  conflict / missing.

  Campaign reference: campaign_path resolves to an absolute path the
  Routine re-reads on each fire, OR campaign_inline embeds the full
  config dict if no path is given.

  Credentials: a placeholder string (${secret:anthropic_api_key}) — never
  the real key. The Routines runtime resolves from its own secret store.

  MCP refs (depends on #126): list of nous://... URIs the Routine
  subscribes to and writes findings into.

Behavioral tests (10 in tests/test_routines.py):

Schedule payload:
  - cron string lands in trigger.expression
  - name falls back to run_id
  - command line includes --auto-approve and --agent sdk
  - credentials are placeholders, not real secrets
  - MCP refs pass through

PR-label payload:
  - pr_label lands in trigger.label

Validation:
  - missing trigger raises ValueError
  - both triggers raises ValueError

Campaign reference:
  - campaign_path produces path reference, omits inline
  - no path inlines the full campaign dict
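
For orientation, a sketch of a builder with the behavior listed above. `trigger.expression`, `trigger.label`, `campaign_path`, `campaign_inline`, and the placeholder credential come from this message; every other key name and the exact command list are assumptions.

```python
from pathlib import Path


def build_routine_payload(campaign, *, campaign_path=None, schedule=None,
                          pr_label=None, mcp_refs=None, extra=None):
    # Exactly one trigger: a cron schedule (UTC) or a PR label.
    if bool(schedule) == bool(pr_label):
        raise ValueError("exactly one of schedule or pr_label is required")
    if schedule:
        trigger = {"type": "cron", "expression": schedule, "timezone": "UTC"}
    else:
        trigger = {"type": "pr_label", "label": pr_label}

    payload = {
        "name": campaign.get("name") or campaign.get("run_id"),
        "trigger": trigger,
        "command": ["nous", "run", "--auto-approve", "--agent", "sdk"],
        # Placeholder only; the Routines runtime resolves the real secret.
        "credentials": {"anthropic_api_key": "${secret:anthropic_api_key}"},
        "mcp_refs": list(mcp_refs or []),
    }
    if campaign_path is not None:
        payload["campaign_path"] = str(Path(campaign_path).resolve())
    else:
        payload["campaign_inline"] = dict(campaign)
    payload.update(extra or {})
    return payload
```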

Out of scope (Phase B):
  - HTTP POST to the actual Routines API
  - Returning the Routine ID after registration
  - nous routine create CLI subcommand (currently a builder only)

Test suite (this branch, stacked on #126): 355 + 10 new = 365 passing.

Refs #120, #134. Stacked on #142 (#126).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: package nous as a Claude Code plugin (#125)

Ship plugin/nous/ with plugin.json + 6 skill markdown files. Each skill
is a CLI wrapper — minimal frontmatter, clear "when to use" hints, and
a Run section that shells out to the existing nous CLI or imports the
campaign_index module from #126.

What lands:

  * plugin/nous/plugin.json — manifest (name, version, description,
    license, skills list).

  * plugin/nous/skills/nous-run.md — wraps `nous run`. Notes
    --auto-approve + Slack channels for unattended runs.

  * plugin/nous/skills/nous-status.md — wraps `nous status` with
    --watch / --line / --interval (#127). Free to call repeatedly.

  * plugin/nous/skills/nous-resume.md — wraps `nous resume` from
    state.json checkpoint (#91).

  * plugin/nous/skills/nous-list.md — uses campaign_index.list_campaigns
    (#126) with optional query / status / repo filters.

  * plugin/nous/skills/nous-bisect.md — uses
    campaign_index.compare_iterations (#126). Output is byte-deterministic.

  * plugin/nous/skills/nous-find-principle.md — uses
    campaign_index.search_principles. Notes embedding-search as #126
    Phase B.

Behavioral tests (7 in tests/test_plugin_package.py):

Manifest:
  - plugin.json exists with required fields (name, version, description,
    skills list)
  - at least 5 skills listed (acceptance criterion)
  - every listed skill file actually exists on disk

Frontmatter:
  - every skill has name + description in YAML frontmatter
  - descriptions include "use when" / "when the user" cues so Claude Code
    can match user intent — vague descriptions are dead skills
  - every skill body references either a nous command or campaign_index

Coverage:
  - all six documented skills present (nous-run, nous-status, nous-resume,
    nous-list, nous-bisect, nous-find-principle)

Out of scope (Phase B):
  - claude plugin install integration testing (requires a live Claude Code
    install with plugin support)
  - publishing to a plugin registry
  - skill argument templating (currently shell substitution; could move
    to typed inputs once plugin contract stabilizes)

Test suite: 338 baseline + 7 new = 345 passing.

Refs #120, #125. Depends on #126 + #127 (already in flight).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: /goal-driven prompt builders for goal-bounded campaign mode (#124, Phase A)

Phase A ships the deterministic prompt + goal-directive builders for
both modes the issue calls out:

  Mode A — fully /goal-driven: spawn one claude session for the whole
    campaign with /goal "<predicate>". The Haiku post-turn evaluator
    decides when the goal is met. No Python state machine in the inner
    loop.

  Mode B — /goal-bounded inner loop: keep engine.py for control flow,
    but use /goal *within* EXECUTE_ANALYZE so the executor terminates
    as soon as validation passes.

Phase A is the prompt assembly. Wire-up into the dispatcher and the
run_campaign code path lands in Phase B once the team picks the default.

Why the prompt builders matter: criterion #2 of the issue ("hybrid mode
is the default for nous run after one release of soak time") implies
the team will run both modes side by side on real campaigns and compare.
Behavioral testing of the prompt assembly — does it include the
campaign brief, does it spell out the goal predicate exactly — is what
makes those soak runs comparable. The /goal directive itself is just
a string, but it has to be the *right* string or the Haiku evaluator
can't decide.

Phase A surface:

  build_full_goal_directive(campaign, *, iteration, timeout_hours):
    Returns the predicate text for Mode A. Asserts on:
      - findings.json exists with non-empty arms list
      - principle_updates.json exists and parses as a list
      - OR timeout exceeded (default 24 hours).

  build_inner_loop_goal_directive(iteration, *, extra_predicates):
    Mode B predicate. Asserts on schema validation + principle_updates
    presence. Pairs with the deterministic Stop hook (#129) — the hook
    catches the schema check, the /goal evaluator catches edge cases the
    schema doesn't cover.

  build_goal_driven_session_prompt(campaign, *, iteration, timeout_hours):
    Full Mode A prompt body. Includes campaign brief, required artifact
    paths, EXPLICIT instruction to print artifact paths to stdout (the
    Haiku evaluator only sees what's been surfaced in the conversation),
    nous validate invocation, and the /goal directive.

Behavioral tests (10 in tests/test_goal_driven.py):

Full directive (Mode A):
  - predicate names iter-N/findings.json + principle_updates.json
  - timeout clause appears with the configured hours
  - uses AND/OR logic correctly

Inner-loop directive (Mode B):
  - uses schema-validation language (findings.schema.json)
  - extra predicates AND-chained

Session prompt (Mode A):
  - campaign brief (research question, target name, metrics, knobs) appears
  - iteration number appears consistently across artifact paths
  - EXPLICIT "print to stdout" instruction (the evaluator can't see
    silent file writes)
  - nous validate execution invocation present
  - /goal directive appears in the prompt

Out of scope (Phase B):
  - --goal-driven flag on nous run / nous resume
  - Dispatcher integration (SDKDispatcher launching the goal-driven session)
  - run_campaign code path that bypasses engine.py for Mode A
  - Claude Code v2.1.139+ version detection at startup

Test suite: 338 baseline + 10 new = 348 passing.

Refs #120, #124. Issue stays open pending Phase B (dispatcher wire-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: explore-then-synthesize DESIGN orchestration helpers (#132, Phase A)

Stacks on #121. Phase A ships the orchestration layer that makes
splitting DESIGN into Stage A (parallel Explore subagents) + Stage B
(Opus synthesis) possible without changing what gets produced
(problem.md + bundle.yaml).

DESIGN today asks one Opus session to do both codebase mapping AND
bundle synthesis. That's the canonical Claude-Code-pattern miss: broad
exploration + small synthesis is exactly what parallel Explore subagents
are for. Phase A is the orchestration helpers; Phase B (lands when #121
merges and the team picks injection points) wires the SDKDispatcher
to actually spawn Explore subagents and thread reports through to the
synthesis call.

Phase A surface:

  * DEFAULT_EXPLORE_SCOPES — four scopes the issue calls out: metrics,
    knobs, prior_findings, principles. Each gets its own Explore subagent.

  * build_explore_prompt(scope, campaign) — produces a tight,
    scope-focused prompt for a read-only Explore subagent. Multi-aspect
    integration is NOT this prompt's job (Stage B does that).

  * run_explore_stage(campaign, *, scopes, runner) — fans out one
    subagent per scope via an injected runner callable, collects
    ExploreReports. Synchronous in Phase A; the SDK's async fan-out
    lands in Phase B.

  * build_synthesis_prompt(stage_a, *, campaign, iteration, iter_dir)
    — Opus prompt that consumes only the Explore reports + principles.json,
    produces problem.md + bundle.yaml, EXPLICITLY forbids re-reading
    the codebase ("Do not re-read"). That's the whole point of the
    split: Opus on integration, not on file walks.
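
The Stage A / Stage B split might look roughly like this. The four scope names, the function names, and the "Do not re-read" instruction come from this message; the `ExploreReport` fields, the `runner(scope, prompt)` contract, the result container, and all prompt wording are assumptions.

```python
from dataclasses import dataclass

DEFAULT_EXPLORE_SCOPES = ("metrics", "knobs", "prior_findings", "principles")


@dataclass(frozen=True)
class ExploreReport:
    scope: str
    text: str
    tokens: int = 0


@dataclass(frozen=True)
class ExploreStageResult:
    reports: tuple

    @property
    def total_tokens(self) -> int:
        return sum(r.tokens for r in self.reports)

    def by_scope(self, scope: str) -> ExploreReport:
        for report in self.reports:
            if report.scope == scope:
                return report
        raise KeyError(scope)


def build_explore_prompt(scope, campaign):
    return (
        "You are a read-only explorer. Do not modify any files.\n"
        f"Scope: {scope}\n"
        f"Research question: {campaign.get('research_question', '')}\n"
        "Report only what is relevant to this scope."
    )


def run_explore_stage(campaign, *, scopes=DEFAULT_EXPLORE_SCOPES, runner):
    # Synchronous fan-out: one injected-runner call per scope, in order.
    reports = tuple(runner(s, build_explore_prompt(s, campaign)) for s in scopes)
    return ExploreStageResult(reports)


def build_synthesis_prompt(stage_a, *, campaign, iteration, iter_dir):
    sections = "\n\n".join(f"### {r.scope}\n{r.text}" for r in stage_a.reports)
    return (
        f"Research question: {campaign.get('research_question', '')}\n\n"
        f"{sections}\n\n"
        "Do not re-read the codebase; rely on the reports above.\n"
        f"Write problem.md and bundle.yaml under {iter_dir} (iter-{iteration})."
    )
```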

Behavioral tests (13 in tests/test_explore_design.py):

build_explore_prompt:
  - metrics scope focuses on observable metrics
  - knobs scope focuses on configuration parameters
  - prior_findings references findings.json
  - principles references the principle store
  - EVERY scope marks the explorer read-only (the prompt is
    defense-in-depth on top of subagent_type="Explore")

run_explore_stage:
  - one subagent per default scope (4 calls)
  - custom scopes pass through
  - token counts aggregate across reports
  - by_scope() lookup returns the right report

build_synthesis_prompt:
  - every explorer report appears under its `### <scope>` heading
  - explicit "Do not re-read" instruction
  - problem.md + bundle.yaml + iter-N + bundle.schema.yaml all named
  - research question appears

Out of scope (Phase B):
  - SDKDispatcher integration (spawning subagent_type="Explore" via SDK)
  - anyio.gather over the four explorer calls for actual parallelism
  - Token-budget measurement on a representative campaign (criterion
    "DESIGN cost drops by ≥30%")
  - Wall-clock measurement on multi-aspect explorations

Test suite (this branch, stacked on #121): 344 + 13 new = 357 passing.

Refs #120, #132. Stacked on #136.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf: load methodology preamble as cached system_prompt (#122 Phase B)

Closes the wiring gap from #144 (Phase A): SDKDispatcher now loads
prompts/methodology/{design,execute_analyze}.md, strips placeholders
({{target_system}}, etc.), concatenates them into a single block, and
passes that as system_prompt on every runner call. Anthropic's API
marks system blocks above the cache threshold as cached, so the second
phase call within a 5-minute window reuses the rendered preamble
instead of re-paying for it.

The dynamic context (research_question, observable_metrics, principles,
handoff) stays in the user message — that's what BUSTS the cache when
it should bust (per-iteration changes), and that's what HITS the cache
when content is stable (within-iteration designer→executor handoff).

Two new behavioral tests:
  * runner receives preamble: assert system_prompt contains both
    methodology blocks with placeholders stripped.
  * two consecutive calls reuse the same system_prompt: this is the
    property the cache relies on (otherwise cache_read_input_tokens
    stays at zero).
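
A sketch of the preamble assembly, assuming `{{name}}`-style placeholders and a hypothetical helper name. The property that matters for caching is that the output is byte-identical across calls.

```python
import re
from pathlib import Path

# Matches {{target_system}}-style placeholders (assumed syntax).
_PLACEHOLDER = re.compile(r"\{\{\s*[\w.]+\s*\}\}")


def load_methodology_preamble(prompts_dir, names=("design", "execute_analyze")):
    """Concatenate the methodology files with placeholders stripped."""
    blocks = []
    for name in names:
        text = (Path(prompts_dir) / "methodology" / f"{name}.md").read_text()
        blocks.append(_PLACEHOLDER.sub("", text).strip())
    return "\n\n".join(blocks)
```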

Test suite: 352 (Phase A baseline) + 2 new = 354.

Closes #122.

* feat: tee SDK events to executor_log.jsonl (#127 Phase B)

Closes the wiring gap from #145: SDKDispatcher.dispatch now derives the
per-iteration executor_log.jsonl path and threads it through to the
runner factory. The runner appends one JSONL row per SDK message so
`nous status --watch` (the snapshot reader from Phase A) lights up
without any further changes.

Implementation:
  * SDKRunner Protocol gains optional event_log_path arg; the default
    runner factory tees every message via _tee_event before processing.
  * _tee_event records {type, ts, tool_name?, tool_use_id?, content?},
    serializability-probing each surfaced field so SDK message-class
    evolution doesn't break the writer. Failures are best-effort.
  * SDKDispatcher.dispatch override computes work_dir/runs/iter-N/
    executor_log.jsonl and resets after dispatch so a later call from a
    different iteration doesn't reuse the wrong path.

Two new behavioral tests (in test_status.py since the contract this
verifies is the snapshot reader's input):
  * runner receives the iteration-specific event_log_path.
  * each iteration gets its own event log (no cross-iter leakage).

The Phase A status reader from #145 already consumes this file when
present, so warm-watch sessions now reflect tool-call events within
the redraw interval (~2s).

Closes #127.

* refactor: thin prompt templates when CLAUDE.md is in scope (#131 Phase B)

Closes the token-shrink wiring from #140 (Phase A): PromptLoader now
prefers <template>_thin.md when a CLAUDE.md is detected at work_dir.
The thin variants drop methodology (~400 lines) and reference CLAUDE.md
for it instead, since Claude Code auto-loads CLAUDE.md from work_dir
on every session.

Concretely:

  * orchestrator/prompt_loader.py: PromptLoader gains
    claude_md_at param. When set and the path exists, _resolve_template_path
    picks <template>_thin.md if present, else falls back to full template.

  * orchestrator/llm_dispatch.py: LLMDispatcher constructs PromptLoader
    with claude_md_at=work_dir/CLAUDE.md. The CLAUDE.md generator from
    Phase A (orchestrator/claude_md.py) writes that file at init and
    after every iteration, so the thin path is active for any campaign
    using the SDK / API path.

  * prompts/methodology/design_thin.md: 27 lines of per-iter context
    (vs 266 in design.md). Refers the agent to CLAUDE.md for methodology.

  * prompts/methodology/execute_analyze_thin.md: 22 lines (vs 199 in
    execute_analyze.md).

  * Other templates (report.md, summarize_gate.md) are short enough not
    to need thin variants; loader falls back to full when no _thin
    exists.
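
The selection rule reduces to a few lines. This standalone function is a sketch of the logic attributed to _resolve_template_path; its name and signature here are illustrative.

```python
from pathlib import Path


def resolve_template_path(templates_dir, template, *, claude_md_at=None):
    """Prefer <template>_thin.md when CLAUDE.md exists; otherwise use the full template."""
    full = Path(templates_dir) / template
    if claude_md_at is not None and Path(claude_md_at).exists():
        thin = full.with_name(f"{full.stem}_thin{full.suffix}")
        if thin.exists():
            return thin
    return full
```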

Behavioral tests (6 new):

TestThinTemplateSelection (4):
  - full template used when no CLAUDE.md
  - thin template picked when CLAUDE.md exists
  - full used when template has no _thin variant
  - thin is < 50% size of full (the issue's empirical criterion)

TestRealMethodologyThinTemplates (2):
  - shipped design_thin.md renders against the dispatcher's real
    context shape AND is < 50% size of full design.md
  - shipped execute_analyze_thin.md renders against real context shape

Test suite: 351 baseline + 6 new = 357 passing.

Closes #131.

* chore: codify no-live-LLM-in-tests as a hard project principle

User directive on 2026-05-24: 'Tests must mock LLMs and not spend
token budget. Keep this as a development principle. Always.' And:
'Save it on claude.md everywhere. Not just memory. Save it in multiple
places if you need to.'

Lands the principle in five durable places + active enforcement:

  1. CLAUDE.md (repo root, NEW): non-negotiable rule at the top, with
     concrete how-to-mock guidance per dispatcher (LLM/CLI/SDK/Inline/
     Stub). Auto-loaded by Claude Code on every session.

  2. tests/CLAUDE.md (NEW): restates the rule + injection seams so the
     principle stays in scope when Claude Code is operating inside tests/.

  3. tests/conftest.py — block_live_llm_calls autouse fixture:
       - strips OPENAI_API_KEY / OPENAI_BASE_URL / ANTHROPIC_API_KEY from env
       - patches urllib.request.urlopen to raise LiveLLMCallBlocked when
         the URL contains api.anthropic.com / api.openai.com / api.litellm.ai
       - patches claude_agent_sdk.query (when installed) to hard-fail
     If a test trips the guard, the fix is to inject a fake at the
     dispatcher seam — never to disable the guard.

  4. tests/test_no_live_llm_guard.py (NEW): meta-tests verifying the
     guard fires correctly. If the guard breaks, CI fails loudly:
       - env keys are stripped
       - urlopen to anthropic.com / openai.com raises LiveLLMCallBlocked
       - non-LLM hosts pass through (Slack webhooks, etc., still work
         via their own injection)
       - claude_agent_sdk.query is blocked when installed (skipped here
         since the SDK isn't a test dep yet)

  5. docs/contributing/workflow.md — Non-negotiable rules section at
     the top stating the no-live-LLM rule, the behavioral testing
     rule, and the token-budget invariant.
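
The guard in item 3 can be approximated with the standard library alone. This sketch is written as a context manager so it runs without pytest; the real conftest.py wraps the same idea in an autouse fixture, and the claude_agent_sdk.query patch is omitted here. Host and env-var lists are taken from the text above.

```python
import os
import urllib.request
from contextlib import contextmanager
from unittest import mock

BLOCKED_HOSTS = ("api.anthropic.com", "api.openai.com", "api.litellm.ai")
STRIPPED_ENV = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "ANTHROPIC_API_KEY")


class LiveLLMCallBlocked(RuntimeError):
    """Raised when a test tries to reach a live LLM endpoint."""


@contextmanager
def block_live_llm_calls():
    real_urlopen = urllib.request.urlopen

    def guarded(url, *args, **kwargs):
        target = url.full_url if isinstance(url, urllib.request.Request) else str(url)
        if any(host in target for host in BLOCKED_HOSTS):
            raise LiveLLMCallBlocked(f"blocked live LLM call to {target}")
        return real_urlopen(url, *args, **kwargs)  # non-LLM hosts pass through

    # patch.dict restores the original environment when the block exits.
    with mock.patch.dict(os.environ):
        for key in STRIPPED_ENV:
            os.environ.pop(key, None)
        with mock.patch("urllib.request.urlopen", guarded):
            yield
```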

Audit of existing tests: all already mock correctly:
  * test_llm_dispatch.py uses _make_fake_completion + completion_fn=
  * test_cli_dispatch.py patches subprocess.run
  * test_integration_llm.py uses _make_routing_completion
  * test_sdk_dispatch.py uses _ScriptedRunner sdk_runner injection
  * StubDispatcher path needs no LLM at all

So this PR is enforcement + documentation, not a refactor of existing
tests.

Test suite: 338 baseline + 5 new + 1 SDK-skip = 343 passing, 1 skipped.

Refs the user's 2026-05-24 directive. No issue closed by this PR —
it's a project-wide invariant, equally applicable to all #120 work
and any future contribution.

* feat: run_goal_driven_iteration runner (#124 Phase B)

Closes the dispatcher wire-up from #148 (Phase A): adds
run_goal_driven_iteration(dispatcher, campaign, iteration, work_dir)
which builds the goal-driven prompt, dispatches it through the
provided dispatcher (SDKDispatcher canonical), and persists the
conversation transcript as runs/iter-N/design_log.md.

The agent itself produces problem.md, bundle.yaml, findings.json,
etc. via tool calls inside the session; the orchestrator only saves
the transcript. This is the Mode A from #124's issue body —
'fully /goal-driven (lightweight)' — bypassing engine.py.

Two new behavioral tests:
  - dispatches goal-driven prompt (asserts /goal appears, asserts
    iter-N path appears) and writes log to expected location
  - creates iter dir if missing

The CLI flag --goal-driven and run_campaign integration would call
this function instead of the per-phase dispatch loop. That last bit of
plumbing (engine.py bypass, --goal-driven flag) is left for the
soak-and-decide cycle the issue calls out — once a campaign runs in
goal-driven mode and proves equivalent quality on a real target.

Closes #124.

* feat: submit_routine HTTP POST with poster injection (#134 Phase B)

Closes the API-submission gap from #146 (Phase A): adds
submit_routine(payload, *, api_base, api_key, poster, timeout) which
POSTs the payload to the Routines API and returns the response dict
(typically containing routine_id).

Per the no-live-LLM project principle (CLAUDE.md), the function takes
a poster injection seam — tests pass a recording fake; production
uses urllib.request.urlopen. Defaults to api.anthropic.com/v1/routines;
override via ROUTINES_API_BASE env var or api_base= kwarg.

Auth: Bearer ANTHROPIC_API_KEY (env or kwarg). When no key AND no
poster, the function raises RuntimeError loudly — silent fall-back to
anonymous would be a real-world misconfig.
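
A sketch of the injection seam. The keyword names, the default endpoint, the env vars, and the RuntimeError behavior come from this message; the poster's call signature `(url, body, headers, timeout)` and the default timeout are assumptions.

```python
import json
import os
import urllib.request

DEFAULT_API_BASE = "https://api.anthropic.com/v1/routines"


def _urllib_poster(url, body, headers, timeout):
    """Production poster: plain urllib POST, response parsed as JSON."""
    req = urllib.request.Request(url, data=body, headers=headers, method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_routine(payload, *, api_base=None, api_key=None, poster=None, timeout=30.0):
    key = api_key or os.environ.get("ANTHROPIC_API_KEY")
    if not key and poster is None:
        # Fail loudly rather than silently POSTing anonymously.
        raise RuntimeError("ANTHROPIC_API_KEY is required to submit a routine")
    url = api_base or os.environ.get("ROUTINES_API_BASE") or DEFAULT_API_BASE
    headers = {"Content-Type": "application/json"}
    if key:
        headers["Authorization"] = f"Bearer {key}"
    body = json.dumps(payload).encode("utf-8")
    return (poster or _urllib_poster)(url, body, headers, timeout)
```

Tests pass a recording fake as `poster`, so no request leaves the process.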

Four new behavioral tests:
  - posts payload with Bearer auth header and JSON content type
  - custom api_base is honored
  - response dict (routine_id, status) returned to caller
  - missing api_key + no poster raises RuntimeError

All four use the _RecordingPoster fake — no network. The conftest
guard from #151 would block live HTTP to api.anthropic.com regardless.

Closes #134.

* feat: nous-mcp stdio server (#126 Phase B)

Closes the transport gap from #142 (Phase A): bin/nous-mcp is a
stdio JSON-RPC 2.0 server that wraps the campaign_index pure
functions as MCP resources + tools.

Resources (resources/list + resources/read):
  - nous://campaigns                          (index of all)
  - nous://campaigns/<run_id>/state           (state.json contents)
  - nous://campaigns/<run_id>/principles      (principles.json contents)
  - nous://campaigns/<run_id>/iter/<N>/findings (findings.json contents)

Tools (tools/list + tools/call):
  - nous.list_campaigns(search_root, query?, status?, repo?)
  - nous.search_principles(search_root, text, only_active?)
  - nous.get_arm_results(campaign_root, iteration, arm)
  - nous.compare_iterations(campaign_root, iter_a, iter_b)

The server is intentionally dependency-free: pure stdlib (json + sys),
no mcp-python-sdk pin. It is compatible with Claude Code's MCP transport
via ~/.claude.json:

    {
      "mcpServers": {
        "nous": {
          "command": "python",
          "args": ["-u", "/path/to/repo/bin/nous-mcp"],
          "env": {"NOUS_SEARCH_ROOT": "/path/to/parent/of/.nous/"}
        }
      }
    }

handle_request(request, *, search_root) is exposed as a pure function
so tests can drive the server with JSON-RPC payloads without spinning
up real stdio. 11 behavioral tests cover initialize, resources/list,
resources/read for state and principles, unknown campaign -> JSON-RPC
error, tools/list returns 4 tools, list_campaigns / search_principles
calls, unknown tool -> error, missing required args -> error not crash.
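
To illustrate why a pure handle_request makes the server testable, here is a reduced sketch covering only tools/list and tools/call. The injected `tools` mapping, the signature-based `search_root` defaulting, and the choice of error codes are assumptions; resources and initialize are left out.

```python
import inspect
import json


def _error(req_id, code, message):
    return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}


def handle_request(request, *, search_root, tools=None):
    """Map one JSON-RPC 2.0 request dict to one response dict; never raise on bad input."""
    req_id = request.get("id")
    method = request.get("method")
    params = request.get("params") or {}
    tools = tools or {}

    if method == "tools/list":
        result = {"tools": [{"name": name} for name in sorted(tools)]}
    elif method == "tools/call":
        name = params.get("name")
        if name not in tools:
            return _error(req_id, -32602, f"unknown tool: {name}")
        args = dict(params.get("arguments") or {})
        # Default search_root only for tools that declare that parameter.
        if "search_root" in inspect.signature(tools[name]).parameters:
            args.setdefault("search_root", search_root)
        try:
            value = tools[name](**args)
        except TypeError as exc:  # missing or unexpected arguments
            return _error(req_id, -32602, str(exc))
        result = {"content": [{"type": "text", "text": json.dumps(value, sort_keys=True)}]}
    else:
        return _error(req_id, -32601, f"method not found: {method}")
    return {"jsonrpc": "2.0", "id": req_id, "result": result}
```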

The conftest guard from #151 ensures none of these tests touch a real
network — they read on-disk fixtures only.

Closes #126.

* feat: parse_reply + wait_for_reply for channel gate decisions (#130 Phase B)

Closes the reply-handling gap from #141 (Phase A): adds two new
functions to orchestrator.channels.

parse_reply(text) -> 'approve' | 'reject' | 'abort' | None
  Maps a free-form channel message to a gate Decision. Recognized
  tokens (case-insensitive, first-word match):
    approve | approved | lgtm | ok | yes      -> approve
    reject  | rejected | no   | redesign      -> reject
    abort   | stop     | cancel               -> abort
  Returns None when the reply doesn't decode to a decision so callers
  can keep waiting.

wait_for_reply(reply_provider, *, timeout_seconds, ...) -> str | None
  Polls reply_provider until it returns a recognized decision or
  timeout elapses. On timeout returns None — the issue's documented
  fall-back to --auto-approve semantics.

Both functions take dependency-injection seams (sleeper, clock,
reply_provider) for deterministic testing — no real wall-clock, no
real channel polling. The actual per-channel adapters (Slack
interactive messages, Telegram bot polling, etc.) plug into
reply_provider via small adapter functions; this PR ships the core
state machine.
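
A compact sketch of both functions. The token table and the seam names (sleeper, clock, reply_provider) come from this message; `poll_interval`, the punctuation stripping, and the defaults are assumptions.

```python
import time

_TOKENS = {
    **dict.fromkeys(("approve", "approved", "lgtm", "ok", "yes"), "approve"),
    **dict.fromkeys(("reject", "rejected", "no", "redesign"), "reject"),
    **dict.fromkeys(("abort", "stop", "cancel"), "abort"),
}


def parse_reply(text):
    """Case-insensitive first-word match; None when the reply is not a decision."""
    if not text:
        return None
    words = text.strip().lower().split()
    if not words:
        return None
    return _TOKENS.get(words[0].strip(".,!?:;"))


def wait_for_reply(reply_provider, *, timeout_seconds, poll_interval=5.0,
                   sleeper=time.sleep, clock=time.monotonic):
    """Poll until a recognized decision arrives; return None on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        decision = parse_reply(reply_provider())
        if decision is not None:
            return decision
        if clock() >= deadline:
            return None
        sleeper(poll_interval)
```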

Seven new behavioral tests:
  - parse_reply recognizes each token family (approve/reject/abort)
  - parse_reply returns None on unrecognized replies, empty string,
    and None input
  - wait_for_reply returns the decision on first recognized reply
  - wait_for_reply returns None on timeout
  - wait_for_reply keeps polling past unrecognized replies

All assertions describe the function's return value given inputs.
None inspect internal control flow or which sleeper/clock methods
were called.

Closes #130.

* feat: make_isolated_arm_runner factory for harness-managed worktrees (#133 Phase B)

Closes the harness-isolation gap from #143 (Phase A): adds
make_isolated_arm_runner(*, sdk_runner, repo_path, iter_dir, ...)
that returns an ArmRunner-shaped callable backed by a worktree-isolated
SDK subagent.

Per the no-live-LLM project principle, the factory takes an injected
sdk_runner — the real ClaudeAgentOptions(isolation='worktree')
construction lives behind that seam. Tests pass a recording fake and
assert the factory's contract (signature, returned-callable shape,
ArmUnit -> ArmUnitResult mapping); the harness call itself is verified
on soak.

The runner:
  * creates iter_dir/results/<arm>/<seed>/ before dispatch
  * passes a clear arm/command/seed prompt with explicit results-dir +
    patch-capture instructions
  * dispatches via sdk_runner with isolation='worktree' and
    subagent_type kwargs (with TypeError fallback to the basic-runner
    signature for forward/backward compatibility)
  * on is_error result, returns ArmUnitResult(status='failed') with
    the error message
  * on success, scans results_dir and returns ArmUnitResult with the
    sorted relative-file listing

This is the bridge between #143 (worktree GC) and #150 (parallel-arm
orchestration); once #123 wires this runner into the parallel-arm path,
the manual create_experiment_worktree / remove_experiment_worktree
lifecycle becomes vestigial — a follow-up cleanup PR drops it
(closing the issue's ≥60% LoC reduction acceptance criterion).

Two new behavioral tests:
  - test_returns_callable: factory returns a callable matching ArmRunner
    (skipped when parallel_arms is on a not-yet-merged branch).
  - test_factory_accepts_documented_kwargs: signature contract with
    model, max_turns, subagent_type kwargs. Construction must not
    raise.

Closes #133.

* feat: parallel-arm orchestration helpers (#123, Phase A)

Stacks on #133 (which stacks on #121). Phase A ships the orchestration
layer that turns experiment_plan.yaml into a flat list of independent
units, fans them out via an injected runner, and deterministically
merges their results into a findings-shaped dict. The actual SDK
subagent fan-out + worktree-isolation per unit (the issue's main thrust)
is Phase B once #121 + #133 merge.

Why partition first: the 5/18 mech-design-enforcement session ran 8
conditions × 3 seeds = 24 simulations sequentially in one Sonnet
session. That 2.5-hour mega-session is what produced the connection
drops and the race-two-executors bug. Decomposing into small
independent units is the prerequisite to parallel execution; once the
units exist as data, the run path can be sync (Phase A) or
anyio.gather over SDK subagents (Phase B) without touching the
partitioner or merge.

Phase A surface:

  partition_plan(plan) -> list[ArmUnit]
    Turns experiment_plan.yaml into one ArmUnit per (arm × condition × seed).
    Default seed when none specified is "seed-1"; multi-seed conditions
    fan out. Skips arms with no command. Each unit's
    relative_results_dir is unique by construction
    (results/<arm>/<seed>) — no two units write to the same path.

  run_units(units, *, runner, max_parallel) -> list[ArmUnitResult]
    Runs each unit through the injected runner. Catches runner
    exceptions and converts them to failed ArmUnitResults so a single
    arm crashing doesn't abort the iteration. Returns results in input
    order so callers can pair them deterministically.

  merge_unit_results(results, *, plan) -> dict
    Deterministic merge into a findings-shaped structure: arms grouped
    by arm_id (sorted), arm.status="failed" when any unit failed,
    units within an arm sorted by (seed, condition). Byte-equal across
    repeated calls — that's the criterion the issue asks for.

  failed_units(results) -> list[ArmUnit]
    Helper for partial-retry: which units need re-running?

  default_max_parallel() -> int
    The min(CPU, 4) default the issue calls out.
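
The run-and-merge half of this surface can be sketched as follows. Function names and the sort/grouping rules come from the text above; the dataclass fields, the merged dict's inner keys, and the omission of the `plan` argument and of partition_plan are simplifications for illustration. Execution is sequential, as in Phase A.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ArmUnit:
    arm_id: str
    condition: str
    seed: str


@dataclass(frozen=True)
class ArmUnitResult:
    unit: ArmUnit
    status: str  # "complete" or "failed"
    error: str = ""


def default_max_parallel() -> int:
    return min(os.cpu_count() or 1, 4)


def run_units(units, *, runner, max_parallel=1):
    """Run each unit via the injected runner; results come back in input order."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be >= 1")
    results = []
    for unit in units:
        try:
            results.append(runner(unit))
        except Exception as exc:  # one crashing arm must not abort the iteration
            results.append(ArmUnitResult(unit, "failed", str(exc)))
    return results


def merge_unit_results(results):
    """Deterministic merge: arms sorted by id, units sorted by (seed, condition)."""
    by_arm = {}
    for r in results:
        by_arm.setdefault(r.unit.arm_id, []).append(r)
    arms = []
    for arm_id in sorted(by_arm):
        units = sorted(by_arm[arm_id], key=lambda r: (r.unit.seed, r.unit.condition))
        failed = any(r.status == "failed" for r in units)
        arms.append({
            "arm_id": arm_id,
            "status": "failed" if failed else "complete",
            "units": [{"seed": r.unit.seed, "condition": r.unit.condition,
                       "status": r.status, "error": r.error} for r in units],
        })
    return {
        "arms": arms,
        "failed_unit_count": sum(r.status == "failed" for r in results),
        "total_unit_count": len(results),
    }


def failed_units(results):
    return [r.unit for r in results if r.status == "failed"]
```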

Behavioral tests (14 in tests/test_parallel_arms.py):

partition_plan:
  - single arm/condition with default seed
  - multi-seed condition fans out
  - multiple arms × conditions: 3 units; sorted assertion
  - results_dir doesn't overlap across seeds
  - arm without command skipped

run_units:
  - results in input order (the determinism contract for merge)
  - runner exception becomes failed unit, doesn't abort run
  - max_parallel < 1 raises ValueError

merge_unit_results:
  - arms grouped by arm_id, sorted
  - arm.status="failed" when any unit failed
  - failed_unit_count + total_unit_count correct
  - byte-equal across repeated calls
  - units within arm sorted by (seed, condition)

failed_units:
  - returns only failed units (the partial-retry contract)
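
The merge contract exercised by the tests above can be sketched roughly
as follows. This is an illustration under assumptions: each result is
modelled as a plain dict with arm_id, seed, condition and status keys,
the plan keyword is accepted but unused, and placing failed_unit_count
and total_unit_count at the top level is a guess at the findings shape.

```python
import json
from collections import defaultdict


def merge_unit_results(results, *, plan=None) -> dict:
    # Group units by arm; `plan` is unused in this sketch.
    by_arm = defaultdict(list)
    for r in results:
        by_arm[r["arm_id"]].append(r)

    arms = []
    for arm_id in sorted(by_arm):  # arms sorted by arm_id
        units = sorted(by_arm[arm_id],
                       key=lambda r: (r["seed"], r["condition"]))
        any_failed = any(u["status"] == "failed" for u in units)
        arms.append({
            "arm_id": arm_id,
            "status": "failed" if any_failed else "complete",
            "units": units,
        })

    return {
        "arms": arms,
        "failed_unit_count": sum(r["status"] == "failed" for r in results),
        "total_unit_count": len(results),
    }


def canonical_bytes(merged: dict) -> bytes:
    # sort_keys pins the key order so the serialization is reproducible.
    return json.dumps(merged, sort_keys=True).encode("utf-8")
```

Because both the arms and the units inside each arm are sorted on
explicit keys, serializing the merged structure twice yields identical
bytes in this sketch, provided (seed, condition) is unique within an arm.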

Out of scope (Phase B):
  - SDKDispatcher integration: a runner that actually spawns
    Agent(isolation="worktree") per unit
  - anyio.gather + semaphore for real parallelism
  - Wire-up into iteration.py so EXECUTE_ANALYZE picks parallel mode
    when max_parallel_arms > 1
  - Wall-clock measurement on a multi-arm campaign (the
    "significantly less wall-clock" criterion)

Test suite (this branch, stacked on #133): 346 + 14 new = 360 passing.

Refs #120, #123. Stacked on #143 (#133) which stacks on #136 (#121).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: end-to-end isolated-runner tests for parallel arms (#123 Phase B)

Closes the SDK-integration gap from #150 (Phase A): adds three
end-to-end behavioral tests that exercise the full chain:

  partition_plan -> make_isolated_arm_runner -> run_units -> merge_unit_results

The SDK side is injected via a fake (per the no-live-LLM project
principle, see CLAUDE.md). The tests assert the orchestration
contract — every unit dispatches with isolation='worktree' to a
non-overlapping results dir, failures are isolated to the affected arm,
and the merged output is deterministic.

Tests:

  test_three_units_dispatched_with_isolation_kwarg
    Plan with 1 arm × 1 condition + 1 arm × 1 condition × 2 seeds = 3
    units. All three dispatch with isolation='worktree'. Merged output
    has both arms in sorted order, both reported complete.

  test_partial_failure_isolated_to_one_arm
    Fake runner returns is_error for h-ablation; h-main succeeds.
    Merged output: h-main complete, h-ablation failed. Failed unit
    count = 2 (both ablation seeds). Total = 3. This covers the
    acceptance criterion 'one arm failure does not abort iteration'.

  test_no_two_units_share_results_dir
    Captures every Write-output-files-to path the runner sends to
    each subagent; asserts all 3 are unique. This covers the acceptance
    criterion 'no two subagents ever write to the same results/ subpath'.

A local _LocalSDKResult stand-in replaces the import from sdk_dispatch
so this branch doesn't depend on sdk_dispatch.py landing first; the
real SDKResult from #121 is duck-compatible (same field shape).

The full chain works against any sdk_runner respecting the SDKRunner
Protocol — production wiring (which constructs the real Anthropic SDK
runner with isolation kwarg) is verified on soak.
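
A rough sketch of the fake-injection pattern described above, assuming a
much simpler runner shape than the real one. FakeSDKResult,
RecordingSDKRunner, the dict-based unit and the prompt wording are all
hypothetical stand-ins; only the isolation='worktree' kwarg and the
is_error field come from the description.

```python
from dataclasses import dataclass


@dataclass
class FakeSDKResult:
    # Stand-in for the real SDKResult; only is_error is relied on here.
    text: str = ""
    is_error: bool = False


class RecordingSDKRunner:
    """Fake SDK runner: records every call and never contacts an LLM."""

    def __init__(self, fail_arms=()):
        self.calls = []
        self.fail_arms = set(fail_arms)

    def __call__(self, prompt, *, isolation=None):
        self.calls.append({"prompt": prompt, "isolation": isolation})
        failed = any(arm in prompt for arm in self.fail_arms)
        return FakeSDKResult(is_error=failed)


def make_isolated_arm_runner(sdk_runner):
    # Returns a per-unit runner that always dispatches with isolation.
    def run(unit: dict) -> dict:
        prompt = (
            f"Run arm {unit['arm_id']} ({unit['seed']}). "
            f"Write output files to {unit['results_dir']}"
        )
        result = sdk_runner(prompt, isolation="worktree")
        status = "failed" if result.is_error else "complete"
        return {**unit, "status": status}

    return run
```

A test can then inspect RecordingSDKRunner.calls to check the isolation
kwarg and the uniqueness of the results paths without any network
access.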

Closes #123.

* feat: make_sdk_explore_runner factory for Stage A (#132 Phase B)

Closes the SDK-integration gap from #149 (Phase A): adds
make_sdk_explore_runner(*, sdk_runner, cwd, model, max_turns) that
returns an ExploreRunner-shaped callable backed by a read-only
Explore subagent (subagent_type='Explore').

Per the no-live-LLM project principle (CLAUDE.md), the factory takes
an injected sdk_runner. Production wiring constructs the real Anthropic
SDK runner; tests inject a recording fake. Defaults model to Haiku
because read-only mapping is cheap and benefits from speed over depth;
deep synthesis happens in Stage B (the single Opus call), not Stage A.

Three new behavioral tests:

  test_dispatches_each_scope_with_explore_subagent_type:
    With four default scopes, the SDK runner is called four times,
    each with subagent_type='Explore'. Reports carry the runner's
    text + token counts; total_input_tokens aggregates correctly.

  test_falls_back_when_sdk_runner_lacks_subagent_kwarg:
    Older runners that lack the subagent_type kwarg are accommodated via
    a TypeError fallback to the base signature, which keeps the factory
    compatible in both directions as the SDK API evolves.

  test_uses_haiku_by_default:
    Default model is Haiku (read-only mapping should be cheap).

A local _LocalSDKResult stand-in keeps this branch independent of
sdk_dispatch.py; the real SDKResult is duck-compatible.
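
The TypeError fallback described above might look roughly like this. It
is a sketch under assumptions: call_explore is a hypothetical helper
name, and the runner signature (prompt plus a model keyword) is invented
for illustration rather than taken from the real SDKRunner Protocol.

```python
def call_explore(sdk_runner, prompt: str, *, model: str):
    """Dispatch one scope, preferring the Explore subagent when supported."""
    try:
        return sdk_runner(prompt, model=model, subagent_type="Explore")
    except TypeError:
        # Older runner without the subagent_type kwarg: retry with the
        # base signature so the call still goes through.
        return sdk_runner(prompt, model=model)
```

One trade-off of this approach is that a TypeError raised inside a
runner that does accept the kwarg would also trigger the retry. Checking
the signature up front with inspect.signature is an alternative if that
matters.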

Closes #132.

* docs: retro for the #120 Claude-Code-native uplift initiative

Closes the tracking epic with a written retrospective covering:
  * what landed (15 children + the no-live-LLM guard PR)
  * the architecture delta (subprocess claude -p -> Claude Agent SDK,
    methodology in CLAUDE.md, parallel subagents replacing mega-sessions)
  * the token-budget delta with each lever and how to verify it on soak
  * how the no-structural-tests + no-live-LLM-calls discipline shaped
    the design (pluggable seams everywhere)
  * what's deferred to soak (criteria that genuinely need a real campaign)
  * follow-up work for the next initiative

Closes #120.

* ci: add pytest workflow for push and pull_request

Adds .github/workflows/tests.yml — runs pytest on Python 3.11 + 3.12
for every push to main/reflective and every PR targeting them.

The job intentionally strips OPENAI_API_KEY / OPENAI_BASE_URL /
ANTHROPIC_API_KEY from the runner env. The no-live-LLM project
principle (CLAUDE.md + tests/conftest.py autouse guard) says tests
must never call real LLMs; this CI step is the outer line of defence,
the conftest guard the inner.
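
For illustration, the credential-stripping idea could be expressed as a
small helper like the one below. This is not the contents of
tests/conftest.py, which are not shown here; strip_llm_credentials is a
hypothetical name, and only the three variable names come from the
description above.

```python
import os

# Variables named in the workflow description.
LLM_ENV_VARS = ("OPENAI_API_KEY", "OPENAI_BASE_URL", "ANTHROPIC_API_KEY")


def strip_llm_credentials(environ=None) -> list:
    """Remove LLM credentials from a mapping; return the names removed."""
    env = os.environ if environ is None else environ
    removed = []
    for name in LLM_ENV_VARS:
        if env.pop(name, None) is not None:
            removed.append(name)
    return removed
```

Inside a pytest autouse fixture the same effect is usually obtained with
monkeypatch.delenv(name, raising=False), which also restores the
environment after each test.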

Concurrency: in-flight runs on the same PR are cancelled when a new
push lands so we don't burn CI minutes on stale commits.

Flags:
  pytest -ra              — surface skipped/xfailed in the log so
                            silent skips don't hide regressions
  pytest --strict-markers — fail the build if a test references an
                            unknown marker. Keeps the test surface
                            honest.

* ci: drop pull_request base-branch filter so any PR runs CI

Long-running integration branches (e.g. tracking-N) get CI feedback
without contributors having to special-case the base branch in the
workflow.

* docs: pip install + git clone use the reflective branch (#120)

The default branch is main, but reflective is where new work lands
first. Users following the README from a fresh clone of main got an
older Nous than what's actively being developed.

Also documents the optional [sdk] extra for --agent sdk users.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sriumcp sriumcp deleted the feat/133-harness-worktrees branch May 25, 2026 00:01