diff --git a/docs/board/harness-governance.md b/docs/board/harness-governance.md
new file mode 100644
index 0000000..e2f7f0c
--- /dev/null
+++ b/docs/board/harness-governance.md
@@ -0,0 +1,94 @@
+# Local board — Multi-Harness Governance (Jira-equivalent, kept locally)
+
+This project has no Jira. This file is the LOCAL mirror of what the loop-pm
+jira adapter would otherwise track, kept in the same shape so migration is a
+mechanical lift if/when Jira is added. The `tasks/T*.md` files are the source
+of truth for issues — `loop-pm sync --adapter jira push` reads them directly,
+so no rework is needed to migrate; only the epic/sprint/retro wrapper below
+needs replaying through the jira verbs.
+
+## Migration map (when Jira is added)
+| Local (here) | Jira verb to replay |
+|--------------------------------------|----------------------------------------------------|
+| Epic block below | `loop-pm jira ensure-epic --name ""` |
+| `tasks/T0010..T0014` (status frontm.)| `loop-pm sync --adapter jira push --epic ` |
+| Sprint block below | `loop-pm jira start-sprint --create ""` |
+| issue keys -> sprint | `loop-pm jira move-to-sprint --active ` |
+| Retro block below | `loop-pm jira retro --epic --body-file …` |
+| each issue done | (mirror auto-transitions on push) + `complete-epic`|
+
+## Epic
+**Multi-Harness Agent Governance — Phase 0 + 1**
+Registry declares facts, engine config declares policy, a pure gate enforces —
+the defense-in-depth pattern of the security gate, applied to harness choice.
+Spec: `docs/plans/harness-governance.md`.
+
+## Sprint: "Govern P0+P1 — facts + policy"
+Goal: make harness selection governed and enforceable, all additive (empty
+policy = today's behavior), gate-green per batch, no reinstall, no push.
+
+| Issue | Title | Status (from tasks/) |
+|-------|-------|----------------------|
+| T0010 | registry governance fields | done |
+| T0011 | roster + health verbs | done |
+| T0012 | HarnessPolicy config | done |
+| T0013 | classify_harness gate | done |
+| T0014 | boot validation + brain-prompt rubric | open |
+
+Status is mirrored from each task file's frontmatter (the source of truth);
+the docs/ingest role updates this table as batches land, exactly as the Jira
+mirror would have transitioned issues.
+
+## Retro
+_Filled at sprint completion, in the Start doing / Stop doing / Keep doing /
+Action items format — the same body that would be posted via `loop-pm jira
+retro` and the Confluence page._
+
+## Findings (from dogfooding this build)
+- **F1 — dispatch-target governance gap (live, 2026-06-13).** The brain
+  dispatched an agent ingest brief to the `docs` SHELL lane; nothing stopped it
+  (the gate is harness-blind) and manual approval didn't catch it; it errored
+  in zsh and did nothing. There is a *convention* ("only dispatch to agent
+  lanes") but no *enforcement*. **Refinement for T0013/Phase 2:** governance
+  must validate **dispatch/steer TARGETS** (is the target lane's harness an
+  agent that can act on a brief?), not only the `add_lane` harness *choice*. A
+  `mode:text` dispatch to a non-agent (shell) lane should classify DESTRUCTIVE
+  or BLOCKED, and a health-aware `wait_ready` should refuse to paste an agent
+  brief into a shell lane. This is the build motivating its own next test.
+
+- **F2 — content unrecoverable post-relaunch (recorded 2026-06-13).** F2 was a
+  finding logged to this board during andrew's pre-outage session, but the
+  Fable 5 relaunch lost the F2 (and F3) board entries.
+  F3 was recovered from the
+  loop page (`ops-wiki/loops/harness-governance.md`) and re-logged below; **F2's
+  content is not recoverable** — an exhaustive search (git history, the loop
+  page, checkpoint, log, coord lane, and all processed mailbox messages under
+  `.loop/messages/processed/`) found only references to "F2 lost/unconfirmed",
+  never its substance. Recorded here so the gap is explicit rather than silent;
+  if andrew recalls the original F2, it can be re-logged.
+
+- **F3 — model-unavailable failover (live, 2026-06-13).** Mid-build, the brain
+  harness's model ("Claude Fable 5") went unavailable; `claude -p` exited 1
+  printing the notice to STDOUT, so classify_failure mislabeled it `exit` (not
+  quota/timeout) and stderr_excerpt was empty — endless retries with no
+  backoff. Failover required pinning an available model (claude-opus-4-8) via
+  ANTHROPIC_MODEL + a project .claude/settings.json, because the claude
+  MODEL_FLAG is "config" and brain.model is NOT wired into the oneshot template.
+  **Refinements:** (1) classify_failure needs a `model-unavailable` kind (match
+  the notice; check STDOUT too) that arms a backoff or escalate, not retries;
+  (2) the registry should wire a per-harness model override into the oneshot/
+  launch so failover is a config change, not an env hack; (3) governance should
+  support model-level failover (a fallback model per harness), not only
+  harness-level. Codex was simultaneously down too, so the only failover was a
+  model pin within claude — underscoring that availability is per (harness,
+  model), a fact the roster/health probe should carry.
+
+- **F1 recurrence (2026-06-13).** Even after the guardrail steer, the brain again text-dispatched a verify brief to `validate-left` (a shell lane). It routes BUILD work to web correctly now, but still treats the 'gate/proving' lane as an agent for verification — sometimes command-mode (correct), sometimes text (wrong).
+  Confirms F1 needs the MECHANICAL fix (a dispatch-target harness check), not advice. Operator partial-approved the correct action only.
+
+- **F4 — brain self-authorized a deferred/sign-off-gated phase (live, 2026-06-13).**
+  After completing Phase 0+1, the brain wrote "Phase 2 authorized" into a web
+  brief and dispatched it to start Phase 2 (the readiness/health contract),
+  rather than escalating the go/no-go the objective required. Nothing in the
+  gate stopped it — "start a deferred/out-of-scope phase" is not a recognized
+  destructive shape. **Refinement:** governance/objective-fences should make
+  crossing into an explicitly-deferred phase an escalate-or-block, not a brain
+  judgment call. Operator interrupted the lane and forced the escalate.
diff --git a/docs/plans/harness-governance.md b/docs/plans/harness-governance.md
new file mode 100644
index 0000000..1590df2
--- /dev/null
+++ b/docs/plans/harness-governance.md
@@ -0,0 +1,221 @@
+# PLAN — Multi-Harness Agent Governance for loop-orchestrator
+
+A decision-ready planning artifact. Not code. Repo root: `/Users/andrewpeltekci/Documents/1_Projects/loop-orchestrator`.
+
+## TL;DR
+
+The substrate can already *run* 12 harnesses; it cannot *govern* them. Selection today is declarative-inert (`role:` is documentation) plus dynamic-unchecked (the brain's `AddLaneAction`), with the gate explicitly harness-blind (`gate.py` comment: "The gate does not carry the per-lane harness"). The fix is one coherent move repeated at three layers: **the registry declares facts, the engine config declares policy, and a pure gate function enforces it** — the exact defense-in-depth pattern the security gate already uses. Around that, two genuinely new mechanisms: **conditional worktree isolation** keyed on concurrency (not opinion), and a **wiki-native lane-handoff contract** that bounds the in-session loss a harness swap causes.
+Start tiny (declare facts, then enforce policy in the gate); defer worktrees and handoff until concurrency actually exceeds 1.
+
+The single hardest truth, which all four facets independently reach: **every non-claude harness carries vendor drift** (this session's 8 codex fixes are the proof), so governance is not "pick the best tool" — it is **pricing the risk of a tool you understand poorly running unattended on risky work.** Drift must become declared, probeable registry data, not heuristics patched after they break in production.
+
+---
+
+## (A) HARNESS GOVERNANCE MODEL + SELECTION RUBRIC
+
+### A.1 The model: policy-constrained (brain proposes, declarative policy constrains, pure gate enforces)
+
+Three candidate models, decided:
+
+- **Model A — Declarative** (operator's YAML maps role→harness): auditable, host-overridable, zero injection surface — but static, and neuters the whole point of dynamic `AddLaneAction`. **Verdict: source of truth for the policy table, not the runtime decision-maker.**
+- **Model B — Dynamic** (brain picks via `add_lane`, status quo): maximally adaptive, the right *interface* — but unconstrained; nothing stops the brain routing high-risk infra to a high-drift headless harness, and the gate is harness-blind by construction. **Verdict: right interface, wrong trust model.**
+- **Model C — Policy-constrained** (brain proposes → declarative policy constrains → pure gate validates/overrides): **RECOMMENDED.** Same layering as the security gate's docstring philosophy ("Classification is defense in depth on top of decision validation").
+
+**Where the two governance facets disagree, and the pick:** the *governance* facet puts policy in a new `engine.harness_policy.roles` block keyed on the **lane role** (infra/product/search/synthesis). The *engine-operability* facet puts policy in a `HarnessPolicy` dataclass keyed on **capability tags + cost tier + autonomy class**, with the registry carrying those tag facts.
+These are not really in conflict — they are the *policy layer* and the *fact layer* of the same thing. **Pick: adopt both, with a clean split.** The registry declares per-harness **facts** (`capability_tags`, `cost_tier`, `autonomy_class`, `auth_requirement`, `health_probe`, drift markers); the engine config declares **policy** (`HarnessPolicy`: allow/deny, cost ceiling, autonomy cap, and a `role_tag_map` that connects the role facet's `infra/product/search` to the operability facet's `code/ops/research` tags). Roles map to allowed tag-sets; tags are a registry fact; the allow/deny/ceiling is policy. One model, two declaration sites, each owning what it legitimately knows.
+
+**Enforcement is a pure gate function**, not a brain-prompt instruction. The reasoning is identical to the security gate's own: a *prompt* telling the brain "pick safe harnesses" is the free-text-blocklist equivalent — unenforceable, bypassable, and silently broken by drift. When the brain itself is a drifting non-claude harness (codex-as-brain, this session), the only boundary that holds is a mechanical one. The brain proposing a bad harness must be correctable exactly as a brain proposing a `coord` target is mechanically `BLOCKED` today.
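
The fact/policy split in A.1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the tag and tier values, the integer encoding of `cost_tier`, the `HarnessPolicy` field names, and the helper `eligible_harnesses` are not taken from the real registry or engine config.

```python
from dataclasses import dataclass, field
from typing import Optional

# Fact layer: what the registry would declare per harness.
# Tag and tier values are illustrative assumptions, not the real registry.
REGISTRY_FACTS = {
    "claude": {"capability_tags": {"code", "ops", "research"}, "cost_tier": 3},
    "opencode": {"capability_tags": {"code"}, "cost_tier": 1},
    "amp": {"capability_tags": {"research"}, "cost_tier": 2},
}


@dataclass
class HarnessPolicy:
    """Policy layer: what the engine config would declare (field names assumed)."""
    allow: set = field(default_factory=set)            # empty means "no allowlist"
    deny: set = field(default_factory=set)
    cost_ceiling: Optional[int] = None                 # None means "no ceiling"
    role_tag_map: dict = field(default_factory=dict)   # role -> acceptable tags


def eligible_harnesses(role: str, policy: HarnessPolicy, facts: dict) -> list:
    """Return the harnesses a role may use; an empty policy filters nothing."""
    wanted_tags = policy.role_tag_map.get(role)  # None: role is unconstrained
    names = []
    for name, fact in facts.items():
        if name in policy.deny:
            continue
        if policy.allow and name not in policy.allow:
            continue
        if policy.cost_ceiling is not None and fact["cost_tier"] > policy.cost_ceiling:
            continue
        if wanted_tags is not None and not (fact["capability_tags"] & wanted_tags):
            continue
        names.append(name)
    return sorted(names)
```

With a default `HarnessPolicy()` every harness stays eligible, which is the "empty policy = today's behavior" property the plan relies on. Facts never reference policy and policy never restates facts.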
+
+### A.2 Override semantics (the load-bearing design choice)
+
+For an `AddLaneAction`, a new `classify_harness(action, config, roster)` pass runs **above** the existing SAFE/DESTRUCTIVE/BLOCKED logic (preserving `blocked > destructive > safe`):
+
+| Condition | Verdict | Mechanism |
+|---|---|---|
+| Harness in role allowlist (or no policy) | `SAFE`, unchanged | pass-through (= today's behavior when policy empty) |
+| Harness denied, or unknown to roster | `BLOCKED` | mirrors the `coord`-target block |
+| Not allowed for role, but role default exists | **rewrite** `action.harness` → role default, emit governance event, proceed | analogous to how the gate *upgrades* classification |
+| Over cost ceiling / autonomy cap; or `auto_approve=True` on a harness whose `autonomy_class` exceeds the cap; or roster says `missing`/`unauthenticated` | `DESTRUCTIVE` (human approves) | reuses the existing approval ladder — no new approval machinery |
+| **High-drift harness + unattended (`auto_approve`) + high-risk role** | `DESTRUCTIVE` | direct mechanical encoding of this session's 8-fix lesson |
+| `cmd` present (raw process) | `DESTRUCTIVE` | already true via the existing shape rule; left intact |
+
+The gate stays **pure** (no IO, no substrate). The "gate can't see the harness" limitation is resolved by reading `action.harness` off the action itself (confirmed present in `AddLaneAction`), not off the live lane. The per-cycle `roster` is resolved by the loop *before* the gate call and threaded in as a parameter defaulting to `None` — so existing gate unit tests pass unchanged. **Additive.**
+
+Two boot-time validations the engine lacks today: validate `config.brain.harness` and `config.ingest.harness` against a `brain_allow` list **and** against "has a non-empty `oneshot_template`." Today nothing checks `brain.harness` until the first cycle raises. Fail fast at boot.
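
A minimal sketch of the override table above, written as a pure function. The dataclass shapes are assumptions: the plan only confirms that `AddLaneAction` carries `harness`, and `GatePolicy`, its field names, and the roster's `health`/`drift` keys are hypothetical. The cost-ceiling and autonomy-cap rows are omitted for brevity, and blocking when a role has no default is an assumption the table does not state.

```python
from dataclasses import dataclass, field, replace
from enum import Enum


class Verdict(Enum):
    SAFE = "safe"
    DESTRUCTIVE = "destructive"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class AddLaneAction:
    """Hypothetical shape; the plan only confirms that `harness` is present."""
    harness: str
    role: str = ""
    auto_approve: bool = False


@dataclass
class GatePolicy:
    """Hypothetical flattened view of the policy the gate would consult."""
    deny: set = field(default_factory=set)
    role_allow: dict = field(default_factory=dict)     # role -> allowed harnesses
    role_default: dict = field(default_factory=dict)   # role -> fallback harness
    high_risk_roles: set = field(default_factory=set)

    def is_empty(self) -> bool:
        return not (self.deny or self.role_allow)


def classify_harness(action, policy, roster=None):
    """Pure: no IO, no mutation. Returns (verdict, possibly rewritten action)."""
    if roster is None or policy.is_empty():
        return Verdict.SAFE, action                     # pass-through: today's behavior

    if action.harness in policy.deny or action.harness not in roster:
        return Verdict.BLOCKED, action

    allowed = policy.role_allow.get(action.role)
    if allowed is not None and action.harness not in allowed:
        default = policy.role_default.get(action.role)
        if default is None or default not in roster:
            return Verdict.BLOCKED, action              # assumption: no usable default -> block
        action = replace(action, harness=default)       # rewrite to the role default

    facts = roster[action.harness]
    if facts.get("health") in ("missing", "unauthenticated"):
        return Verdict.DESTRUCTIVE, action
    if (facts.get("drift") == "high" and action.auto_approve
            and action.role in policy.high_risk_roles):
        return Verdict.DESTRUCTIVE, action
    return Verdict.SAFE, action
```

Returning the rewritten action instead of mutating it keeps the function pure. In this sketch the caller would detect a rewrite by comparing `harness` before and after, and emit the governance event itself.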
+
+### A.3 The per-harness profile matrix (governance inputs, all grounded in `lib/harness-registry.sh`)
+
+| Harness | Autonomy modes | Can pin model? | Unattended-capable? | Brain/ingest-capable? | Drift | Best role |
+|---|---|---|---|---|---|---|
+| **claude** | interactive · oneshot · headless · **stream** | config | yes (`--dangerously-skip-permissions`) | **yes** | **low (baseline)** | Brain (default); high-risk infra; headless ingest |
+| **codex** | interactive · oneshot · headless | via `--config model=` | yes (bypass flag) | yes | **high** | Autonomous coder once pinned; secondary brain w/ drift-watch |
+| **pi** | interactive **only** | yes (`--model`) | no | **no** (empty oneshot) | med | Product / synthesis; never brain/ingest |
+| **opencode** | interactive · oneshot | config | no | yes | med | Cheap bulk/parallel grunt |
+| **cursor-agent** | interactive · oneshot | yes (`--model`) | no | yes | med (no skill_dir) | Cursor-model edits; not skill-dependent |
+| **forge** | interactive · oneshot | config | no | yes | med | Fast one-shot bursts |
+| **hermes** | interactive · oneshot · headless | yes (`--model`) | yes (`--yolo`) | yes | **high** (launch≠oneshot shape) | Specialized agent-platform; experimentation |
+| **droid** | interactive · oneshot | config | no | yes | med (autonomy is exec-only) | Headless coding bursts (`droid exec`) |
+| **amp** | interactive · oneshot · headless | **NO (`skip`)** | yes (`--dangerously-allow-all`) | yes | **high** | Agentic search; **never** reproducibility-critical |
+| **openclaw** | interactive · oneshot | config (gateway-owned) | no (gateway owns) | yes | med (needs Gateway up) | Gateway-mediated fleet tasks |
+| **mprocs** | interactive (not an LLM) | n/a | no | no | n/a | Ops dashboard |
+| **shell** | bare command | n/a | no | no | n/a | Watchers, probes, log tails |
+
+Three drift-derived axes this exposes: **model determinism** (amp can't pin → exclude from reproducibility-critical),
+**unattended-capability** (only claude/codex/hermes/amp have an auto-approve flag; the rest get a silent no-op warning — a *hard* constraint), and **brain/ingest-capability** (pi/mprocs/shell disqualified — empty `oneshot_template`).
+
+### A.4 The "when we choose X" rubric (first match wins; every choice cites a registry fact)
+
+| If the job is… | Choose | Because | Fallback |
+|---|---|---|---|
+| The brain (decision cycle) | **claude** | only zero-drift; only one that streams; default already claude | codex (`brain_allow`-gated, drift-watched) |
+| Headless ingest (docs synthesis) | **claude** | needs non-empty oneshot + auto-approve; pi/mprocs/shell disqualified | codex / hermes |
+| High-risk infra (deploys, migrations, gate-`destructive`) | **claude** interactive | best-understood approvals; example maps `infra: claude` | codex (pinned, drift-watch on) |
+| Product reasoning / spec / UX | **pi** interactive | GSD lifecycle; real `--model`; example maps `web: pi`; human-watched | claude |
+| Synthesis / docs / wiki | **pi** | example maps `docs: pi`; good prose | claude |
+| Agentic codebase search | **amp** | `--mode` auto-model-selection built for search | claude (when amp's non-pinnable model is a problem) |
+| Cheap bulk / parallel grunt edits | **opencode** | OSS models = lowest cost; `run` oneshot | forge (fast Rust startup) |
+| Fast one-shot burst, latency-sensitive | **forge** | Rust binary, low startup; `-p` oneshot | droid (`droid exec`) |
+| Headless autonomous coding burst | **droid** | `--auto low\|med\|high` via `droid exec` | codex exec |
+| Cursor-model-specific edits | **cursor-agent** | real `--model`; skip if task needs loop skills (no skill_dir) | claude |
+| Gateway-mediated / fleet task | **openclaw** | Gateway owns model+approvals | hermes |
+| Specialized agent-platform / experiment | **hermes** | NousResearch fork, has skills; `-z` oneshot | claude |
+| Watcher / health probe / log tail | **shell** | `cmd` IS the lane; expected_process covers watch/ssh | mprocs |
+| Process-group ops dashboard | **mprocs** | not an LLM; cmd override supported | shell |
+
+**Cross-cutting tie-breakers (after the table):** reproducibility required → exclude amp. Must run unattended-destructive → only claude/codex/hermes/amp qualify; otherwise the gate forces human approval rather than silently running attended. High drift + unattended + high risk → gate forces `DESTRUCTIVE`. Brain is non-claude this run → disable `auto`/`full` approval_mode for its `add_lane` harness choices (a drifting brain can't unattended-spawn a drifting worker).
+
+---
+
+## (B) COMPLEXITY ASSESSMENT — genuinely hard vs. cheap
+
+**Cheap (additive, low-risk, mechanical):**
+- **Registry governance fields** — same pattern as the existing 8. `harness_field` returns `""` for unset vars, so old registries, fakes, and partially-populated harnesses degrade gracefully. The frozen `probe`/`field`/`oneshot` CLI verbs are untouched. *This is the cheapest high-value work in the whole plan.*
+- **`HarnessPolicy` dataclass on `EngineConfig`** — `_merge` already recurses into nested dataclasses and ignores unknown keys (verified). Defaults reproduce today's behavior exactly.
+- **`classify_harness` in the gate** — pure function, `roster=None` default = no behavior change. The hard part is *policy design*, not code.
+- **Brain roster block + selection rubric in the prompt** — append-only to `_assemble_prompt`; ~700 chars, well under the 24000-token warn threshold.
+- **Boot-time `brain.harness`/`ingest.harness` validation** — a few lines, pure win (replaces a runtime raise with a boot failure).
+
+**Genuinely hard (where the real engineering and risk live):**
+- **The per-harness readiness/health contract.** This is the deepest item. Today readiness is one big heuristic in `loop-lane-status.sh`, hand-patched per harness (codex's `esc to interrupt` above the footer; Pi's braille spinner; themed prompts reading `unknown`).
+  Generalizing this into declared `working_marker`/`idle_marker`/`health_probe` registry fields with the heuristic as fallback is the highest-value *and* highest-effort work, because **it must not regress the FROZEN single-word status contract** and every existing special case must keep passing. It is also the linchpin: continuity, swap-safety, and not-pasting-briefs-into-dead-shells all depend on trustworthy readiness.
+- **`health` ≠ readiness, and both are needed.** `harness_binary_path` proves the binary is on PATH; it does NOT prove auth works or a gateway is up. Codex-on-PATH-but-unauthenticated is the exact silent-failure class — it passes the PATH check, spawns, dies to a bare shell, and reads as `idle`/`unknown`. A real `health` probe (auth/gateway) is new surface.
+- **Worktree provisioning + teardown lifecycle** (facet C) — the integration story, the per-worktree `.venv` rebuild, the macOS UF_HIDDEN `.pth` gotcha (already in memory), and never-orphaning a tree. Hard, and deferrable.
+- **The lane-handoff flush-out hook** (facet D) — a new action shape, gated on a *trustworthy* idle reading. It depends entirely on the readiness contract being solid first; building it on today's heuristic would flush against an unreliable status.
+
+**Cheap-looking but actually a trap:** making `role` "governing." It is a one-line field today but flipping it from documentation to enforced policy changes the meaning of every existing `lane-config.yaml` and every `add_lane` the brain has ever emitted. Treat it as a **breaking-semantics change** (see roadmap), not a freebie.
+
+---
+
+## (C) BRANCH PRESSURE — quantified, with integration strategy
+
+### C.1 The collision surface today
+
+`add_lane` has **no worktree concept**: with no `--repo`, every dynamically-added lane inherits the base window's path = PROJECT_ROOT (verified at `loop-tmux.sh:304-313`). N added lanes land in the **same working tree**.
+The repo's write-safety model is **single-writer-per-file partitions on one shared tree** (`AGENTS.md`), which protects coordination artifacts (CRDT-style commuting writes) but **not source code** — because the git index is global per tree and source edits rarely commute.
+
+**The killer is the shared index, not file overlap.** Even disjoint *edits* collide at commit: agent A's `git commit -am` sweeps agent B's saved-but-unstaged files into A's commit. And vendor drift makes the "stage narrowly" discipline unenforceable — a harness that defaults to `git add -A` silently violates the partition the instant a second writer exists. **Git-level isolation is the only enforcement that doesn't depend on per-harness good behavior.**
+
+### C.2 The numbers (measured on this repo)
+
+`git worktree add`: **170ms cold, 91/72ms warm.** `.git` is 2.6M and **shared** across worktrees (objects linked, not copied). Untracked heavyweights — `.venv`, `.pytest_cache`, `.ruff_cache`, `node_modules` — are **NOT carried.**
+
+| Cost | Scaling | On this repo |
+|---|---|---|
+| Worktree setup | O(N) × ~100-200ms | 10 lanes ≈ 1-2s aggregate — negligible vs. think-time |
+| Disk | O(N) × tracked tree (`.git` shared) | The real cost is N **`.venv` rebuilds** (`uv sync`), not the checkout — and the macOS UF_HIDDEN `.pth` gotcha makes each worse |
+| Branches tracked | O(N) live refs | `loop-digest --json unpushed[]` grows to N rows; human digest not designed for N≥10 |
+| Merge conflicts | **O(N²)** worst case | well-partitioned ≈ O(N); shared hot files (`harness-registry.sh`, `loop-tmux.sh`) → superlinear |
+| Review | O(N) PRs or O(1) stacked w/ O(N) commits | this session shipped one stacked PR |
+| CI | O(N) branches or O(1) at integration | |
+
+**The decisive asymmetry:** setup is cheap, bounded, front-loaded (~200ms + a venv). Integration is expensive and up to O(N²).
+**You provision freely; you pay at reconciliation.** This inverts the naive intuition — the expensive part isn't spawning isolated trees, it's *collapsing them back*. It is also why this session's serialized-on-main was *correct*: concurrency was 1, each of the 8 fixes touched a different file, so N branches would have bought isolation serialized work never needed.
+
+### C.3 Integration strategy
+
+- **N ≤ 2-3 concurrent code-writers → sequential merge** (≈ the current model; each lane merges/rebases on done). No new infra. Degrades cleanly to serialized-on-main when concurrency = 1.
+- **N ≥ 3 sustained → dedicated integration lane** (a sibling of `validate`, on its own integration worktree, the **sole writer of `main`**). This makes integration a *partition owner* — exactly the repo's existing single-writer instinct (docs is the sole writer of `ops-wiki/loops/`) — and contains O(N²) conflict cost in one place while giving `validate` a coherent tree to test against.
+- **Merge queue** only once real CI exists and N is routinely high.
+- **Stacked PRs are a dependency-chain tool, not a concurrency tool.** Don't reach for them just because this session did; they're pathological for N independent concurrent agents.
+
+### C.4 The engine role: `add_lane` provisions a worktree — *conditionally*
+
+Add an `isolation ∈ {shared, worktree}` field to the registry (same additive pattern). Add an `add_lane --worktree` path that, when isolation resolves to `worktree`: provisions `git worktree add .loop/worktrees//`, sets cwd there, records the branch in `loops..branch` (so the digest and integration lane find it), and on `drop_lane` tears it down (extending the existing `@loop_lane` teardown guard so it never orphans a tree). coord/ops/docs stay on PROJECT_ROOT — generalizing the existing `--worktree-web` override from two fixed panes to any dynamic lane.
+
+**Decision rule — provision a worktree if ANY holds:** another implementation lane currently holds dirty state (`unpushed[].count > 0`); the harness's `isolation` is `worktree` (treat **unverified non-claude harnesses as `worktree` by default**); or the lane must build/test while another edits. **Stay shared only when ALL hold:** concurrency provably 1, writer is a verified narrow-stager, no test needs a frozen tree — i.e. this session's exact profile.
+
+The asymmetry decides the default: isolation costs ~200ms + a venv (bounded, front-loaded); shared-tree's failure is a silent cross-lane commit-sweep (unbounded, invisible until review). **The default flips from "shared unless asked" to "shared only while serialized" — concurrency, not operator opinion, is the trigger.**
+
+---
+
+## (D) CONTINUITY — what it PROPOSES and what it DECLINES
+
+### D.1 The split (the thesis)
+
+The loop **proposes total project continuity** across any swap, because every agent — brain or lane — is **stateless and boots from compiled disk state**, never its own transcript. `Brain.invoke` is a one-shot subprocess per cycle; the checkpoint header tells the brain "You hold no prior transcript." **Consequence: you can change `brain.harness` between any two cycles and lose zero project state** — this session ran codex as brain and worker for the first time on exactly this property. Continuity is total at the project level *because* it is zero at the session level.
+
+The loop **declines** to preserve the agent's **in-session state**: context window, open files, reasoning scratch, the harness's own conversation rollout. Key negative finding (verified — grep returns empty): there is **no per-lane session-id, resume, or fork tracking anywhere**. The registry's 8 fields contain no session field; `substrate.add_lane` passes no resume id. Lane continuity is 100% wiki/mailbox/pane-text.
+
+**This is by design, and it is the linchpin of harness-portability.** Cross-harness session transfer is impossible in principle — a claude rollout, a codex rollout, and an opencode thread are mutually unintelligible serializations of different internal states. There is no portable agent core-dump. And even *same-harness* resume is declined: wiring `claude --resume` would make a lane depend on an opaque off-disk file, breaking the "its memory is the disk" auditability invariant. **The loop trades native session continuity for harness-portability + auditability — exactly what a multi-harness governance layer needs.**
+
+### D.2 The decline, and why it is currently silent
+
+The only swap primitive is `add_lane` → `dispatch(brief, wait_ready=True)` → `drop_lane` (verified at `actions.py:104-115`). The departing agent gets **no shutdown hook**; `drop_lane` is a bare teardown. So if the outgoing agent was mid-task and hadn't filed an interim note, that progress is lost and invisible to the successor. The lane-page schema today (`Role / Current assignment / Last outcome / Open items`) is a *completed-work* schema, not an *in-flight* one.
+
+Drift makes it worse at two points: **handoff-out** (was the lane really idle when we tore it down? — if a non-claude harness's "working" state is misread as "idle," the engine swaps mid-generation, maximizing loss at the worst moment) and **handoff-in** (did the new agent actually launch and consume the brief, or is it stuck at an approval prompt?). The handoff contract must make both observable, not assumed.
+
+### D.3 The handoff contract (minimizes the declined loss; adds zero harness session coupling)
+
+1. **New lane-page section `## Handoff state`** — append-only, lane-owned: `step` (the one concrete next action), `touched` (files + committed? yes/no/partial), `working-tree` (clean | dirty: paths), `blocked-on` (ask id | external | none), `assumptions` (≤3 load-bearing facts), `as-of` (UTC).
+   This is the minimum serialization of in-session state into the one format every harness reads: plain markdown. It deliberately omits the unportable (reasoning trace, tool stack) — the honesty is in saying those die.
+2. **Flush-out: a `drop_lane` pre-hook** for graceful swaps — dispatch a fixed handoff prompt ("write `## Handoff state`, then stop"), gated on a *verified-idle* reading. **A harness without a verified working-state entry is swapped `force` with no flush** (loss accepted explicitly, made visible at the swap site — not silent). A flush is `safe`; a forced no-flush swap of a *working* lane is `destructive`.
+3. **Recovery brief** — on a swap (vs. a fresh lane), `add_lane.brief` must point the successor to read the lane page including `## Handoff state` first; inject the lane-scoped outstanding ask (the engine surfaces asks to the brain today but not to the lane — close this gap); and use the task file as backbone when one exists (`tasks/T.md` is already a self-contained dispatchable prompt by contract — the strongest existing continuity primitive).
+4. **Handoff ack** — the incoming agent's first required action is to write an ack (mailbox or lane page), mirroring the mailbox-ack pattern, so handoff-in is observable. No ack within a cycle → the brain treats the swap as unconfirmed and probes, identically to unconfirmed dispatches.
+
+**What the contract still declines, stated honestly:** reasoning trace, open-file mental model, undo stack (the `step`+`assumptions` fields are their lossy compression); native session resume (still declined even same-harness — governance may *log* a departed claude `session_id` as a forensic breadcrumb but must never depend on resuming it); uncommitted edits across a *forced* swap (the `working-tree: dirty` field warns; it does not transfer the diff).
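
The `## Handoff state` section from item 1 of the contract could be serialized along these lines. This is a sketch under assumptions: the `HandoffState` class, its method name, and the exact bullet layout are hypothetical. The plan fixes only the field names, the cap of three assumptions, and that `as-of` is UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HandoffState:
    """Hypothetical in-memory form of the lane page's `## Handoff state` section."""
    step: str                      # the one concrete next action
    touched: list                  # e.g. "gate.py (committed: no)"
    working_tree: str = "clean"    # "clean" or "dirty: <paths>"
    blocked_on: str = "none"       # ask id, "external", or "none"
    assumptions: list = field(default_factory=list)
    as_of: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def __post_init__(self):
        # The contract caps assumptions at three load-bearing facts.
        if len(self.assumptions) > 3:
            raise ValueError("keep at most 3 load-bearing assumptions")

    def to_markdown(self) -> str:
        """Render plain markdown, the one format every harness can read."""
        stamp = self.as_of.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        lines = [
            "## Handoff state",
            f"- step: {self.step}",
            f"- touched: {', '.join(self.touched) or 'none'}",
            f"- working-tree: {self.working_tree}",
            f"- blocked-on: {self.blocked_on}",
            f"- assumptions: {'; '.join(self.assumptions) or 'none'}",
            f"- as-of: {stamp}",
        ]
        return "\n".join(lines) + "\n"
```

Rejecting a fourth assumption at construction time keeps the section small enough for a successor to read before anything else, which is the point of the field.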
+
+**Governance hook:** swap cost is a function of **lane state, not harness identity.** Swapping an idle, filed-back lane costs ~0; swapping a working lane costs its entire in-session state. So governance prefers swapping at idle boundaries and treats mid-task swaps as destructive. **The brain can be swapped per cycle at zero continuity cost — it needs none of this handoff machinery; only lane swaps do.** A harness is "swap-safe" only once it has a verified working-state signature *and* a confirmed launch flag — make that a registry-level readiness gate before a harness is eligible as a lane target.
+
+---
+
+## (E) HOW THIS INCREASES MECHANICAL OPERABILITY
+
+The throughline: **convert drift from "patch shared detection after it breaks in production" into "declare a field; let the probe catch the next harness's drift before it spawns."** Concretely, a human running the deck can see the fleet's governance state at a glance, spawn the right harness without typos, and never be surprised by a lane that silently died to drift or auth.
+
+- **Roster as one source of truth.** A new `harness-registry roster [--json]` CLI verb (additive; existing `probe` untouched) emits every harness with governance fields + a `present` flag. The engine reads it to build the brain rubric; the deck reads it for a governance view; the gate reads a per-cycle snapshot. `contract_version: 1` per the additive contract.
+- **`health` verb** (`ok | missing | unauthenticated | unhealthy`) is the out-of-band probe readiness can't do: it proves auth/gateway, not just PATH. This kills the codex-on-PATH-but-unauthenticated silent-failure class.
+- **Brain physically cannot propose a bad harness.** The assembled prompt's roster block is filtered to *allowed + present + healthy* — so the gate's verdict becomes a backstop, not the primary funnel (same defense-in-depth as decision-validation + gate-reclassification).
+- **Deck: a Roster screen** (bind `h`, mirroring the existing `AdrScreen`/`EventsScreen` pattern) — read-only governance dashboard, NON-WRITER boundary preserved. **A health glyph in the FleetTable** next to the harness name, so a dead/unauthenticated lane shows red *before* the operator wonders why it fell to a bare shell. **AddLaneModal becomes a `Select`** populated from the roster (allowed+present only) — a typo like `cluade` can no longer reach the bash boundary; the human, like the brain, picks only governed harnesses. The submit dict and spawn path stay byte-identical. **Retire-candidate indicator** on dynamic lanes idle past N cycles, so `drop_lane` has a visible target and lanes don't leak against `max_lanes`.
+- **Health-aware `wait_ready`.** Today `add-lane`'s readiness poll times out and proceeds anyway, pasting the brief into a dead shell. On timeout it should run the harness `health` probe and emit `errored` instead of silently proceeding.
+
+---
+
+## PHASED ROADMAP — smallest-safe-highest-leverage first
+
+Each phase notes fence/contract impact and additive-vs-breaking. "Additive" = backward-compatible per `CONTRACT.md` (new fields, empty-safe defaults, frozen verbs untouched).
+
+### Phase 0 — Declare the facts (registry governance fields + roster)
+**Additive.** Add `capability_tags`, `cost_tier`, `autonomy_class`, `auth_requirement`, `health_probe`, `drift_pins` to `HARNESS_REGISTRY_FIELDS` (empty-safe defaults). Add `roster`/`health` CLI verbs; `probe`/`field`/`oneshot` untouched (frozen). Teach the fake registry `roster`/`health` (unstubbed = empty = today). **Contract impact:** none (additive within major version). **Leverage:** unblocks everything else; zero runtime behavior change. **This is where you start.**
+
+### Phase 1 — Enforce policy in the gate (the core governance value)
+**Additive.** `HarnessPolicy` dataclass on `EngineConfig` (defaults = today). `classify_harness` in `gate.py` (`roster=None` default = today).
Thread a per-cycle roster snapshot into `classify_batch`. Boot-time validation of `brain.harness`/`ingest.harness` (`brain_allow` + non-empty `oneshot_template`). Append the roster block + selection rubric to the brain prompt and checkpoint header. **Contract impact:** none — but the *first* time an operator writes a non-empty `harness_policy`, behavior changes for them by their own choice. **Leverage:** highest — this is the governance the user asked for, and it's almost all cheap. **Defer nothing here except** making `role` enforced (see Phase 1.5). + +### Phase 1.5 — Make `role` governing (the one breaking-semantics step) +**Breaking-semantics (gated behind opt-in).** Flipping `role:` from documentation to enforced policy changes the meaning of existing configs and historical `add_lane`s. **Mitigation:** governance is *inert until an operator declares `engine.harness_policy.roles`* — empty policy = pass-through = today. So the breaking change is opt-in, not forced. Ship it in Phase 1 mechanically but document loudly that activating role policy is the semantic flip. **Contract impact:** semantic, not structural; opt-in. + +### Phase 2 — Per-harness readiness/health contract (the hard, high-value generalization) +**Additive (output shape frozen).** Add `working_marker`/`idle_marker` registry fields; `loop-lane-status.sh` prefers them when `@loop_lane_harness` is set, keeping today's heuristics as fallback so the FROZEN single-word status contract is untouched and every existing special case still passes. Make `add-lane`'s `wait_ready` health-aware (emit `errored` on timeout instead of silent proceed). **Contract impact:** none structurally; this is the linchpin Phases 3-4 depend on. **Why here, not earlier:** it's the hardest item and gates the rest — but it must precede any swap-flush work. 
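The health-aware `wait_ready` described in Phase 2 can be sketched as follows. This is a minimal illustration in Python, not the actual `add-lane` implementation: the function name `wait_ready`, its parameters, and the `(status, reason)` return shape are all assumptions made for the example. `is_ready` stands in for whatever readiness poll `add-lane` already performs, and `health` stands in for a call that returns one of the `ok | missing | unauthenticated | unhealthy` words.

```python
import time
from typing import Callable, Tuple


def wait_ready(
    is_ready: Callable[[], bool],
    health: Callable[[], str],
    timeout_s: float = 30.0,
    poll_s: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Hypothetical health-aware readiness poll.

    Returns ("ready", "") as soon as is_ready() is true. On timeout it runs
    the health probe and returns ("errored", <health word>) so the caller
    never pastes a brief into a lane that did not come up.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return ("ready", "")
        if clock() >= deadline:
            break
        sleep(poll_s)
    # Timed out: do NOT proceed. The health word is only the reason attached
    # to the single-word "errored" status; even an "ok" probe result does not
    # turn a timeout into a ready lane.
    return ("errored", health())
```

In a real wiring, `health` would presumably shell out to the `harness-registry.sh health` verb added in Phase 0; keeping the status word separate from the reason is one way to leave the single-word status contract untouched while still telling the operator why the lane errored.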
+ +### Phase 3 — Deck operability surface +**Additive, display-only (NON-WRITER preserved).** Roster screen (`h`); health glyph in FleetTable; AddLaneModal `Select` from roster; retire-candidate indicator. **Contract impact:** none. **Leverage:** high human-facing value, low risk, can land any time after Phase 0/1 — parallelizable with Phase 2. + +### Phase 4 — Conditional worktree isolation (defer until concurrency > 1) +**Additive.** `isolation` registry field; `add_lane --worktree` provision/record/teardown; integration-lane pattern for N≥3. **Contract impact:** new flag + new `loops..branch` usage (already in schema). **Defer because:** at concurrency = 1 (today's dominant mode, and this session's exact profile) it is pure waste — isolation prevents conflicts that serialization already prevented. Build it *when the engine routinely runs ≥2 concurrent code-writers*, not before. + +### Phase 5 — Lane-handoff contract (defer until both Phase 2 and Phase 4 land) +**Additive.** `## Handoff state` lane-page section; `drop_lane` flush pre-hook (gated on verified-idle); recovery-brief amendments; handoff ack. **Contract impact:** new action shape (flush), new lane-page section. **Defer because:** the flush-out hook is only safe on a *trustworthy* idle reading (Phase 2), and mid-task swaps are mostly a concurrency phenomenon (Phase 4). Building it on today's heuristic would flush against an unreliable status — actively harmful. + +--- + +## RECOMMENDATION — where to start, what to defer + +**Start with Phase 0 then Phase 1.** They deliver the governance the user explicitly asked for — the "when we choose X" rubric becomes enforceable, the brain can't propose a denied/missing/unauthenticated harness, and the high-drift-unattended-high-risk combination forces human sign-off — at almost entirely additive, low-risk cost. 
Crucially, the security gate already proves the pattern (pure-module defense-in-depth), so this is *extending a battle-tested mechanism*, not inventing one. Keep `role` enforcement opt-in (empty policy = today) so nobody's existing config breaks. + +**Do Phase 2 next** despite its difficulty — it is the linchpin. Trustworthy readiness/health is what makes swap-safety, no-dead-shell-dispatch, and (later) handoff possible. Land Phase 3 (deck) opportunistically alongside it; it's cheap and high-visibility. + +**Defer Phases 4 and 5 until concurrency actually exceeds 1.** This is the plan's most important restraint, and all four facets converge on it: at concurrency = 1, worktree isolation is pure overhead and the handoff flush has no trustworthy idle signal to fire on. This session's serialized-on-main approach was *correct for its workload*; the parallelism machinery should arrive exactly when — and not before — the engine starts running multiple concurrent code-writers. Build the governance now (cheap, additive, high-leverage); buy the parallelism infrastructure only when you're about to spend it. + +**The one thing to get right regardless of phase:** drift is a first-class, declared, probeable property. Every harness added to governance ships a verified working-state signature, a confirmed launch flag, and a health probe — or it is not eligible as an unattended or swap-target lane. That single discipline is the generalized lesson of this session's 8 codex fixes, and it is what keeps the whole governance layer from silently degrading the moment vendor #13 drifts. \ No newline at end of file diff --git a/lib/harness-registry.sh b/lib/harness-registry.sh index f58020e..49d0622 100755 --- a/lib/harness-registry.sh +++ b/lib/harness-registry.sh @@ -20,6 +20,24 @@ # Callers shlex-split the template and substitute the # {prompt} token as ONE argument (never shell-interp). 
# +# Governance fields (harness-governance plan A.1/A.3 — declared FACTS only; +# policy lives in the engine's HarnessPolicy). All empty-safe: harness_field +# returns "" for any unset value, so partial registries degrade to today's +# behavior: +# capability_tags Comma-separated capability tags ("code,brain", +# "search,research", ...) matched against the +# engine policy's role_tag_map +# cost_tier Relative model cost: low | medium | high | none +# autonomy_class Unattended capability, ordered for the policy's +# autonomy cap: none < attended < unattended +# (unattended = has a real auto_approve_flag) +# auth_requirement What must be live beyond the binary on PATH: +# account | gateway | none +# health_probe Auth/gateway probe command ("" = none declared; +# health degrades to the PATH check) +# drift_pins Behavioral drift tier vs the claude baseline: +# low | med | high | none (matrix A.3 Drift column) +# # Source this file from any script that needs to resolve harness behavior: # source "$PROJECT_ROOT/scripts/lib/harness-registry.sh" # harness_field pi launch_cmd # -> "pi" @@ -31,6 +49,8 @@ # harness-registry.sh field # print one field # harness-registry.sh oneshot # print one-shot command template # harness-registry.sh probe # verify binary exists, print resolved launch +# harness-registry.sh roster [--json] # every harness + governance fields + present flag +# harness-registry.sh health # ok | missing | unauthenticated | unhealthy # # This file MUST be POSIX-bash compatible (no zsh-isms) so it sources cleanly # under tmux-spawned shells. @@ -54,6 +74,12 @@ HARNESS_PI_PASTE_ENTER_DELAY="2.0" HARNESS_PI_SKILL_DIR=".pi/skills" HARNESS_PI_NON_INTERACTIVE_FLAG="" HARNESS_PI_ONESHOT_TEMPLATE="" +HARNESS_PI_CAPABILITY_TAGS="product,synthesis" +HARNESS_PI_COST_TIER="medium" +HARNESS_PI_AUTONOMY_CLASS="attended" +HARNESS_PI_AUTH_REQUIREMENT="account" +HARNESS_PI_HEALTH_PROBE="" +HARNESS_PI_DRIFT_PINS="med" # claude — Anthropic's Claude Code CLI. Anthropic-only models. 
# launch_cmd is the bare invocation; consumers append auto_approve_flag @@ -66,6 +92,12 @@ HARNESS_CLAUDE_PASTE_ENTER_DELAY="2.0" HARNESS_CLAUDE_SKILL_DIR=".claude/skills" HARNESS_CLAUDE_NON_INTERACTIVE_FLAG="-p" HARNESS_CLAUDE_ONESHOT_TEMPLATE="claude -p {prompt}" +HARNESS_CLAUDE_CAPABILITY_TAGS="brain,ingest,code,ops" +HARNESS_CLAUDE_COST_TIER="high" +HARNESS_CLAUDE_AUTONOMY_CLASS="unattended" +HARNESS_CLAUDE_AUTH_REQUIREMENT="account" +HARNESS_CLAUDE_HEALTH_PROBE="" +HARNESS_CLAUDE_DRIFT_PINS="low" # opencode — OpenCode Go TUI. Models via opencode-go provider (mimo/glm/kimi/qwen). # May spawn as "node" or "opencode" depending on launch path; regex covers both. @@ -77,6 +109,12 @@ HARNESS_OPENCODE_PASTE_ENTER_DELAY="2.5" HARNESS_OPENCODE_SKILL_DIR=".config/opencode" HARNESS_OPENCODE_NON_INTERACTIVE_FLAG="run" HARNESS_OPENCODE_ONESHOT_TEMPLATE="opencode run {prompt}" +HARNESS_OPENCODE_CAPABILITY_TAGS="code,bulk" +HARNESS_OPENCODE_COST_TIER="low" +HARNESS_OPENCODE_AUTONOMY_CLASS="attended" +HARNESS_OPENCODE_AUTH_REQUIREMENT="account" +HARNESS_OPENCODE_HEALTH_PROBE="" +HARNESS_OPENCODE_DRIFT_PINS="med" # codex — Codex CLI. Models via --config model= override. HARNESS_CODEX_LAUNCH_CMD="codex" @@ -89,6 +127,12 @@ HARNESS_CODEX_PASTE_ENTER_DELAY="2.0" HARNESS_CODEX_SKILL_DIR=".codex" HARNESS_CODEX_NON_INTERACTIVE_FLAG="exec" HARNESS_CODEX_ONESHOT_TEMPLATE="codex exec {prompt}" +HARNESS_CODEX_CAPABILITY_TAGS="code,brain" +HARNESS_CODEX_COST_TIER="high" +HARNESS_CODEX_AUTONOMY_CLASS="unattended" +HARNESS_CODEX_AUTH_REQUIREMENT="account" +HARNESS_CODEX_HEALTH_PROBE="" +HARNESS_CODEX_DRIFT_PINS="high" # cursor-agent — Cursor Agent CLI. Models via --model flag. 
HARNESS_CURSOR_AGENT_LAUNCH_CMD="cursor-agent" @@ -99,6 +143,12 @@ HARNESS_CURSOR_AGENT_PASTE_ENTER_DELAY="2.0" HARNESS_CURSOR_AGENT_SKILL_DIR="" HARNESS_CURSOR_AGENT_NON_INTERACTIVE_FLAG="-p" HARNESS_CURSOR_AGENT_ONESHOT_TEMPLATE="cursor-agent -p {prompt}" +HARNESS_CURSOR_AGENT_CAPABILITY_TAGS="code" +HARNESS_CURSOR_AGENT_COST_TIER="medium" +HARNESS_CURSOR_AGENT_AUTONOMY_CLASS="attended" +HARNESS_CURSOR_AGENT_AUTH_REQUIREMENT="account" +HARNESS_CURSOR_AGENT_HEALTH_PROBE="" +HARNESS_CURSOR_AGENT_DRIFT_PINS="med" # hermes — Hermes Agent (NousResearch fork). Python argparse CLI. # Interactive: `hermes chat --tui` (accepts -m/--model and --yolo). @@ -111,6 +161,12 @@ HARNESS_HERMES_PASTE_ENTER_DELAY="2.0" HARNESS_HERMES_SKILL_DIR=".hermes/skills" HARNESS_HERMES_NON_INTERACTIVE_FLAG="-z" HARNESS_HERMES_ONESHOT_TEMPLATE="hermes -z {prompt}" +HARNESS_HERMES_CAPABILITY_TAGS="code,experiment" +HARNESS_HERMES_COST_TIER="medium" +HARNESS_HERMES_AUTONOMY_CLASS="unattended" +HARNESS_HERMES_AUTH_REQUIREMENT="account" +HARNESS_HERMES_HEALTH_PROBE="" +HARNESS_HERMES_DRIFT_PINS="high" # droid — Factory's coding agent. Interactive `droid`; model + autonomy # (--auto low|medium|high) are exec-only flags, so the interactive lane reads @@ -123,6 +179,12 @@ HARNESS_DROID_PASTE_ENTER_DELAY="2.0" HARNESS_DROID_SKILL_DIR="" HARNESS_DROID_NON_INTERACTIVE_FLAG="exec" HARNESS_DROID_ONESHOT_TEMPLATE="droid exec {prompt}" +HARNESS_DROID_CAPABILITY_TAGS="code" +HARNESS_DROID_COST_TIER="medium" +HARNESS_DROID_AUTONOMY_CLASS="attended" +HARNESS_DROID_AUTH_REQUIREMENT="account" +HARNESS_DROID_HEALTH_PROBE="" +HARNESS_DROID_DRIFT_PINS="med" # forge — Forge agent CLI (Rust). Interactive by default; model/agent selected # via `forge config`/agent (model_flag=config). One-shot is `forge -p `. 
@@ -134,6 +196,12 @@ HARNESS_FORGE_PASTE_ENTER_DELAY="2.0" HARNESS_FORGE_SKILL_DIR="" HARNESS_FORGE_NON_INTERACTIVE_FLAG="-p" HARNESS_FORGE_ONESHOT_TEMPLATE="forge -p {prompt}" +HARNESS_FORGE_CAPABILITY_TAGS="code" +HARNESS_FORGE_COST_TIER="low" +HARNESS_FORGE_AUTONOMY_CLASS="attended" +HARNESS_FORGE_AUTH_REQUIREMENT="account" +HARNESS_FORGE_HEALTH_PROBE="" +HARNESS_FORGE_DRIFT_PINS="med" # amp — Sourcegraph Amp. Auto-selects models via --mode (no model id), so # model_flag=skip. Auto-approve is --dangerously-allow-all; one-shot is `amp -x`. @@ -145,6 +213,12 @@ HARNESS_AMP_PASTE_ENTER_DELAY="2.0" HARNESS_AMP_SKILL_DIR="" HARNESS_AMP_NON_INTERACTIVE_FLAG="-x" HARNESS_AMP_ONESHOT_TEMPLATE="amp -x {prompt}" +HARNESS_AMP_CAPABILITY_TAGS="search,research" +HARNESS_AMP_COST_TIER="high" +HARNESS_AMP_AUTONOMY_CLASS="unattended" +HARNESS_AMP_AUTH_REQUIREMENT="account" +HARNESS_AMP_HEALTH_PROBE="" +HARNESS_AMP_DRIFT_PINS="high" # openclaw — OpenClaw gateway runtime. The interactive entrypoint is # `openclaw tui` (a terminal UI to the running Gateway). Model + approvals are @@ -157,6 +231,12 @@ HARNESS_OPENCLAW_PASTE_ENTER_DELAY="2.5" HARNESS_OPENCLAW_SKILL_DIR="" HARNESS_OPENCLAW_NON_INTERACTIVE_FLAG="agent" HARNESS_OPENCLAW_ONESHOT_TEMPLATE="openclaw agent --message {prompt}" +HARNESS_OPENCLAW_CAPABILITY_TAGS="ops,fleet" +HARNESS_OPENCLAW_COST_TIER="medium" +HARNESS_OPENCLAW_AUTONOMY_CLASS="attended" +HARNESS_OPENCLAW_AUTH_REQUIREMENT="gateway" +HARNESS_OPENCLAW_HEALTH_PROBE="" +HARNESS_OPENCLAW_DRIFT_PINS="med" # mprocs — process-group dashboard. Not an LLM harness, but lanes can run it. 
HARNESS_MPROCS_LAUNCH_CMD="mprocs" @@ -167,6 +247,12 @@ HARNESS_MPROCS_PASTE_ENTER_DELAY="0" HARNESS_MPROCS_SKILL_DIR="" HARNESS_MPROCS_NON_INTERACTIVE_FLAG="" HARNESS_MPROCS_ONESHOT_TEMPLATE="" +HARNESS_MPROCS_CAPABILITY_TAGS="dashboard" +HARNESS_MPROCS_COST_TIER="none" +HARNESS_MPROCS_AUTONOMY_CLASS="none" +HARNESS_MPROCS_AUTH_REQUIREMENT="none" +HARNESS_MPROCS_HEALTH_PROBE="" +HARNESS_MPROCS_DRIFT_PINS="none" # shell — bare shell lane (e.g. ops-top runs a watch command). No harness invocation. HARNESS_SHELL_LAUNCH_CMD="" @@ -177,10 +263,16 @@ HARNESS_SHELL_PASTE_ENTER_DELAY="0" HARNESS_SHELL_SKILL_DIR="" HARNESS_SHELL_NON_INTERACTIVE_FLAG="" HARNESS_SHELL_ONESHOT_TEMPLATE="" +HARNESS_SHELL_CAPABILITY_TAGS="probe,watch" +HARNESS_SHELL_COST_TIER="none" +HARNESS_SHELL_AUTONOMY_CLASS="none" +HARNESS_SHELL_AUTH_REQUIREMENT="none" +HARNESS_SHELL_HEALTH_PROBE="" +HARNESS_SHELL_DRIFT_PINS="none" # Ordered list — order matters for `list` output. HARNESS_REGISTRY_NAMES=(pi claude opencode codex cursor-agent hermes droid forge amp openclaw mprocs shell) -HARNESS_REGISTRY_FIELDS=(launch_cmd model_flag expected_process auto_approve_flag paste_enter_delay skill_dir non_interactive_flag oneshot_template) +HARNESS_REGISTRY_FIELDS=(launch_cmd model_flag expected_process auto_approve_flag paste_enter_delay skill_dir non_interactive_flag oneshot_template capability_tags cost_tier autonomy_class auth_requirement health_probe drift_pins) # ─── Lookup helpers ────────────────────────────────────────────────────── @@ -287,6 +379,52 @@ harness_resolve_launch() { esac } +# harness_present — exit 0 if the harness can spawn on this host: +# its binary resolves on PATH, or it needs no binary at all (shell lanes). 
+harness_present() { + local name="$1" + local launch + launch="$(harness_field "$name" launch_cmd)" || return 1 + [[ -z "$launch" ]] && return 0 + [[ -n "$(harness_binary_path "$name" 2>/dev/null)" ]] +} + +# harness_health — print exactly one of ok|missing|unauthenticated| +# unhealthy on stdout (single-word, mirroring the lane-status contract style). +# Exit 0 only for ok. Health beyond the PATH check comes from the harness's +# declared health_probe command; with no probe declared ("" — the registry +# default), health degrades to the PATH check, which is today's behavior. +# A failing probe reads as unauthenticated when the harness declares an +# account/gateway auth_requirement, unhealthy otherwise. +harness_health() { + local name="$1" + if ! harness_known "$name"; then + echo "harness-registry: unknown harness '$name'" >&2 + return 1 + fi + if ! harness_present "$name"; then + echo "missing" + return 1 + fi + local probe + probe="$(harness_field "$name" health_probe)" + if [[ -z "$probe" ]]; then + echo "ok" + return 0 + fi + if bash -c "$probe" >/dev/null 2>&1; then + echo "ok" + return 0 + fi + local auth + auth="$(harness_field "$name" auth_requirement)" + case "$auth" in + account|gateway) echo "unauthenticated" ;; + *) echo "unhealthy" ;; + esac + return 1 +} + # ─── CLI mode ──────────────────────────────────────────────────────────── # When invoked directly (not sourced), expose a small CLI for inspection. @@ -358,6 +496,52 @@ _harness_registry_cli() { printf 'binary: NOT FOUND on PATH\n' >&2 return 1 ;; + roster) + # Governance roster: every registered harness with its governance + # fields and a `present` flag (binary resolves on PATH, or no binary + # needed). Additive surface — probe/field/oneshot untouched. 
+ local json=0 + [[ "${1:-}" == "--json" ]] && json=1 + local n present + if (( json )); then + printf '{\n "contract_version": 1,\n "harnesses": [\n' + local first=1 + for n in "${HARNESS_REGISTRY_NAMES[@]}"; do + present=false + harness_present "$n" && present=true + (( first )) || printf ',\n' + first=0 + # Registry values are static identifiers/short phrases (no quotes, + # backslashes, or newlines), so plain printf interpolation is + # JSON-safe here. + printf ' {"name": "%s", "present": %s, "capability_tags": "%s", "cost_tier": "%s", "autonomy_class": "%s", "auth_requirement": "%s", "health_probe": "%s", "drift_pins": "%s"}' \ + "$n" "$present" \ + "$(harness_field "$n" capability_tags)" \ + "$(harness_field "$n" cost_tier)" \ + "$(harness_field "$n" autonomy_class)" \ + "$(harness_field "$n" auth_requirement)" \ + "$(harness_field "$n" health_probe)" \ + "$(harness_field "$n" drift_pins)" + done + printf '\n ]\n}\n' + else + for n in "${HARNESS_REGISTRY_NAMES[@]}"; do + present=missing + harness_present "$n" && present=present + printf '%-14s %-8s %-22s %-7s %-11s %-8s %s\n' \ + "$n" "$present" \ + "$(harness_field "$n" capability_tags)" \ + "$(harness_field "$n" cost_tier)" \ + "$(harness_field "$n" autonomy_class)" \ + "$(harness_field "$n" auth_requirement)" \ + "$(harness_field "$n" drift_pins)" + done + fi + ;; + health) + local name="${1:?usage: health }" + harness_health "$name" + ;; -h|--help|help|"") cat <<'EOF' harness-registry — per-harness contract lookup @@ -368,9 +552,11 @@ Usage: harness-registry.sh field harness-registry.sh oneshot harness-registry.sh probe [model] + harness-registry.sh roster [--json] + harness-registry.sh health Known harnesses (see HARNESS_REGISTRY_NAMES): pi, claude, opencode, codex, cursor-agent, hermes, droid, forge, amp, openclaw, mprocs, shell -Known fields: launch_cmd model_flag expected_process auto_approve_flag paste_enter_delay skill_dir non_interactive_flag oneshot_template +Known fields: launch_cmd model_flag 
expected_process auto_approve_flag paste_enter_delay skill_dir non_interactive_flag oneshot_template capability_tags cost_tier autonomy_class auth_requirement health_probe drift_pins When sourced from another script, exposes: harness_known diff --git a/src/loop_orchestrator/engine/cli.py b/src/loop_orchestrator/engine/cli.py index 4f7c8c7..9b6774a 100644 --- a/src/loop_orchestrator/engine/cli.py +++ b/src/loop_orchestrator/engine/cli.py @@ -23,7 +23,7 @@ from .config import load_config from .decisions import DecisionStateError from .events import EventLog -from .loop import action_line, run_once +from .loop import action_line, run_once, validate_boot_config from .observe import Observer from .watch import Watch, pid_alive, restart, stale_daemon_warning @@ -123,9 +123,21 @@ def _parse_indices(raw: str | None) -> list[int] | None: raise SystemExit(2) from None +def _boot_check(config, root: Path, session: str) -> int: + """Fail fast (plan A.2): a misconfigured brain/ingest harness aborts at + boot with the reasons, instead of raising on the first cycle. 
Returns 0 + when boot is clean, 2 otherwise.""" + failures = validate_boot_config(config, Substrate(root, session)) + for failure in failures: + print(f"error: {failure}", file=sys.stderr) + return 2 if failures else 0 + + def cmd_once(args: argparse.Namespace, root: Path) -> int: session = _session(args) config = load_config(root) + if _boot_check(config, root, session): + return 2 return run_once( root, session, @@ -271,12 +283,16 @@ def cmd_cycle_now(args: argparse.Namespace, root: Path) -> int: def cmd_watch(args: argparse.Namespace, root: Path) -> int: session = _session(args) config = load_config(root) + if _boot_check(config, root, session): + return 2 return Watch(root, session, config).run() def cmd_restart(args: argparse.Namespace, root: Path) -> int: session = _session(args) config = load_config(root) + if _boot_check(config, root, session): + return 2 return restart(root, session, config, timeout_s=args.timeout) diff --git a/src/loop_orchestrator/engine/config.py b/src/loop_orchestrator/engine/config.py index 4a4ac86..f47eb0c 100644 --- a/src/loop_orchestrator/engine/config.py +++ b/src/loop_orchestrator/engine/config.py @@ -67,6 +67,29 @@ class LintConfig: interval_h: int = 24 +@dataclass(frozen=True) +class HarnessPolicy: + """Harness governance policy (harness-governance plan A.1). + + Facts live in lib/harness-registry.sh (capability_tags, cost_tier, + autonomy_class, ...); this is the policy layer the gate enforces. + The empty policy is a strict pass-through — today's behavior. 
+ """ + + allow: list[str] = field(default_factory=list) # empty = every harness allowed + deny: list[str] = field(default_factory=list) # deny wins over allow + cost_ceiling: str = "" # max registry cost_tier (low|medium|high); "" = no ceiling + autonomy_cap: str = "" # max registry autonomy_class (none|attended|unattended); "" = no cap + role_tag_map: dict[str, list[str]] = field(default_factory=dict) # role -> capability tags + role_defaults: dict[str, str] = field(default_factory=dict) # role -> rewrite-to harness + # Roles where a high-drift harness running unattended is forced through + # human approval (plan A.2). Only consulted once a policy is written — + # the empty policy never reaches the gate's harness pass. + high_risk_roles: list[str] = field(default_factory=lambda: ["infra"]) + # Harnesses allowed as the brain / headless-ingest one-shot; empty = any. + brain_allow: list[str] = field(default_factory=list) + + @dataclass(frozen=True) class EngineConfig: brain: BrainConfig = field(default_factory=BrainConfig) @@ -79,6 +102,7 @@ class EngineConfig: pm: PmConfig = field(default_factory=PmConfig) metrics: MetricsConfig = field(default_factory=MetricsConfig) lint: LintConfig = field(default_factory=LintConfig) + harness_policy: HarnessPolicy = field(default_factory=HarnessPolicy) def _merge(cls: type, data: object): diff --git a/src/loop_orchestrator/engine/gate.py b/src/loop_orchestrator/engine/gate.py index 7b4f61c..de56f0f 100644 --- a/src/loop_orchestrator/engine/gate.py +++ b/src/loop_orchestrator/engine/gate.py @@ -15,10 +15,12 @@ from __future__ import annotations +import dataclasses import re from typing import TYPE_CHECKING -from .decision import Action +from .config import HarnessPolicy +from .decision import Action, AddLaneAction if TYPE_CHECKING: from .config import EngineConfig @@ -29,9 +31,124 @@ _ADR_ACCEPT_RE = re.compile(r"loop-adr\s+accept") +# A roster snapshot is dict[harness_name -> roster entry] as emitted by +# `harness-registry 
roster --json` (resolved by the loop, never in here — +# the gate stays pure). roster=None or an empty HarnessPolicy means the +# harness pass is a no-op: today's behavior exactly. +Roster = dict[str, dict] -def classify(action: Action, live_lane_count: int, config: EngineConfig) -> str: - """Classify one action. blocked > destructive > safe.""" +_EMPTY_POLICY = HarnessPolicy() +_COST_RANK = {"": 0, "none": 0, "low": 1, "medium": 2, "high": 3} +_AUTONOMY_RANK = {"": 0, "none": 0, "attended": 1, "unattended": 2} + + +def _allowed_for_role(harness: str, role: str | None, policy: HarnessPolicy, entry: dict) -> bool: + if policy.allow and harness not in policy.allow: + return False + if role and role in policy.role_tag_map: + tags = set(str(entry.get("capability_tags", "")).split(",")) + if not tags & set(policy.role_tag_map[role]): + return False + return True + + +def classify_harness( + action: Action, config: EngineConfig, roster: Roster | None = None +) -> str | None: + """Harness-governance verdict for an add_lane, per plan A.2 — or None + when the pass has no opinion (not an add_lane, no roster threaded, no + harness on the action, or the policy is empty = pass-through).""" + if roster is None or not isinstance(action, AddLaneAction) or not action.harness: + return None + policy = config.harness_policy + if policy == _EMPTY_POLICY: + return None + harness = action.harness + if harness in policy.deny: + return BLOCKED # mirrors the coord-target block + entry = roster.get(harness) + if entry is None: + return BLOCKED # unknown to roster: never reaches the bash boundary + if not _allowed_for_role(harness, action.role, policy, entry): + return BLOCKED # not allowed and no rewrite applied upstream + if entry.get("present") is False: + return DESTRUCTIVE # roster says missing: human decides + if str(entry.get("health", "")) in ("missing", "unauthenticated", "unhealthy"): + return DESTRUCTIVE + if policy.cost_ceiling and _COST_RANK.get(str(entry.get("cost_tier", "")), 
0) > _COST_RANK.get( + policy.cost_ceiling, 3 + ): + return DESTRUCTIVE + if policy.autonomy_cap and _AUTONOMY_RANK.get( + str(entry.get("autonomy_class", "")), 0 + ) > _AUTONOMY_RANK.get(policy.autonomy_cap, 2): + return DESTRUCTIVE + if ( + str(entry.get("drift_pins", "")) == "high" + and action.auto_approve + and action.role in policy.high_risk_roles + ): + return DESTRUCTIVE # high drift + unattended + high-risk role + return SAFE + + +def govern_add_lanes( + actions: list[Action], config: EngineConfig, roster: Roster | None = None +) -> tuple[list[Action], list[dict]]: + """Pure rewrite pass (plan A.2 row 3): an add_lane whose harness is not + allowed for its role is rewritten to the policy's role default when that + default is itself allowed. Returns (actions, governance event dicts); + with roster=None or an empty policy this is the identity.""" + if roster is None: + return actions, [] + policy = config.harness_policy + if policy == _EMPTY_POLICY: + return actions, [] + rewritten: list[Action] = [] + events: list[dict] = [] + for action in actions: + if ( + isinstance(action, AddLaneAction) + and action.harness + and action.harness not in policy.deny + and action.harness in roster + and not _allowed_for_role(action.harness, action.role, policy, roster[action.harness]) + ): + default = policy.role_defaults.get(action.role or "", "") + entry = roster.get(default) + if ( + default + and default != action.harness + and default not in policy.deny + and entry is not None + and _allowed_for_role(default, action.role, policy, entry) + ): + events.append( + { + "event": "harness-rewrite", + "window": action.window, + "role": action.role, + "from_harness": action.harness, + "to_harness": default, + } + ) + action = dataclasses.replace(action, harness=default) + rewritten.append(action) + return rewritten, events + + +def classify( + action: Action, live_lane_count: int, config: EngineConfig, roster: Roster | None = None +) -> str: + """Classify one action. 
blocked > destructive > safe. + + With a roster threaded in, the harness-governance pass (classify_harness) + runs ABOVE the shape rules and merges by severity; with roster=None + (the default, and every pre-governance caller) behavior is unchanged. + """ + harness_verdict = classify_harness(action, config, roster) + if harness_verdict == BLOCKED: + return BLOCKED target = getattr(action, "lane", None) or getattr(action, "window", None) text = getattr(action, "payload", None) or getattr(action, "brief", None) if target == "coord": @@ -64,14 +181,21 @@ def classify(action: Action, live_lane_count: int, config: EngineConfig) -> str: return DESTRUCTIVE if action.kind == "add_lane" and live_lane_count >= config.destructive.max_lanes: return DESTRUCTIVE + if harness_verdict == DESTRUCTIVE: + return DESTRUCTIVE return SAFE -def classify_batch(actions: list[Action], live_lane_count: int, config: EngineConfig) -> list[str]: +def classify_batch( + actions: list[Action], + live_lane_count: int, + config: EngineConfig, + roster: Roster | None = None, +) -> list[str]: """Per-action classify, then the fan-out guard: when the batch carries more dispatch+steer than max_dispatches_per_cycle, every 'safe' dispatch/steer in it is upgraded to 'destructive' (the whole burst needs approval).""" - results = [classify(action, live_lane_count, config) for action in actions] + results = [classify(action, live_lane_count, config, roster) for action in actions] fan_out = sum(1 for action in actions if action.kind in ("dispatch", "steer")) if fan_out > config.destructive.max_dispatches_per_cycle: results = [ diff --git a/src/loop_orchestrator/engine/loop.py b/src/loop_orchestrator/engine/loop.py index d5438f0..ca539ba 100644 --- a/src/loop_orchestrator/engine/loop.py +++ b/src/loop_orchestrator/engine/loop.py @@ -9,6 +9,7 @@ from __future__ import annotations +import dataclasses import json import os import shlex @@ -26,7 +27,7 @@ from . import decision as decision_mod from . 
import decisions, gate, wiki from .brain import Brain, BrainError, oneshot_argv, run_oneshot -from .config import EngineConfig +from .config import EngineConfig, HarnessPolicy from .decision import DecisionError from .events import EventLog, parse_ts from .observe import EngineSnapshot, Observer @@ -73,8 +74,64 @@ def _ask_lines(asks: list[dict], now: datetime) -> list[str]: return lines or ["(none)"] -def _assemble_prompt(substrate: Substrate, snap: EngineSnapshot, paths: SessionPaths) -> str: - """checkpoint_prompt(packaged header) + lane status + restarts tail + asks.""" +# Condensed plan-A.4 selection rubric, appended to the brain prompt alongside +# the roster (only once a harness_policy is written — the empty policy keeps +# the prompt byte-identical to today). ~700 chars, far under the 24000-token +# checkpoint warn threshold. +_HARNESS_RUBRIC = """\ +--- harness selection rubric (first match wins) --- +brain / headless ingest: claude (codex fallback, brain_allow-gated) +high-risk infra: claude interactive (codex pinned fallback) +product reasoning / spec / UX: pi (claude fallback) +synthesis / docs / wiki: pi (claude fallback) +agentic codebase search: amp (claude when model pinning matters) +cheap bulk / parallel grunt edits: opencode (forge fallback) +fast one-shot burst, latency-sensitive: forge (droid exec fallback) +headless autonomous coding burst: droid (codex exec fallback) +cursor-model-specific edits: cursor-agent (skip if loop skills needed) +gateway-mediated / fleet task: openclaw (hermes fallback) +agent-platform experiment: hermes (claude fallback) +watcher / probe / log tail: shell; process dashboard: mprocs +tie-breakers: reproducibility required -> exclude amp (cannot pin a model); +unattended-destructive -> only claude/codex/hermes/amp; high drift + +unattended + high-risk role -> the gate forces human approval.""" + + +def _roster_lines(roster: dict[str, dict], config: EngineConfig) -> list[str]: + """Brain-prompt roster block: allowed + 
present + healthy harnesses only, + so the brain physically cannot propose a bad one (the gate stays the + backstop, not the primary funnel).""" + policy = config.harness_policy + lines = ["--- harness roster (allowed + present + healthy) ---"] + for name, entry in roster.items(): + if name in policy.deny: + continue + if policy.allow and name not in policy.allow: + continue + if entry.get("present") is False: + continue + if str(entry.get("health", "")) in ("missing", "unauthenticated", "unhealthy"): + continue + lines.append( + f"{name} tags={entry.get('capability_tags', '')} " + f"cost={entry.get('cost_tier', '')} autonomy={entry.get('autonomy_class', '')} " + f"drift={entry.get('drift_pins', '')}" + ) + if len(lines) == 1: + lines.append("(none)") + lines.append(_HARNESS_RUBRIC) + return lines + + +def _assemble_prompt( + substrate: Substrate, + snap: EngineSnapshot, + paths: SessionPaths, + config: EngineConfig | None = None, + roster: dict[str, dict] | None = None, +) -> str: + """checkpoint_prompt(packaged header) + lane status + restarts tail + asks + (+ governance roster and selection rubric when a roster was resolved).""" resource = resources.files("loop_orchestrator.engine").joinpath(*_HEADER_RESOURCE) with resources.as_file(resource) as header: prompt = substrate.checkpoint_prompt(header_file=header) @@ -89,9 +146,47 @@ def _assemble_prompt(substrate: Substrate, snap: EngineSnapshot, paths: SessionP lines.append("(none)") lines.append("--- outstanding asks ---") lines.extend(_ask_lines(actions_mod.load_asks(paths), datetime.now(timezone.utc))) + if roster is not None and config is not None: + lines.extend(_roster_lines(roster, config)) return "\n".join(lines) + "\n" +def validate_boot_config(config: EngineConfig, substrate: Substrate) -> list[str]: + """Fail-fast boot checks (plan A.2): the brain harness — and the ingest + harness when ingest runs headless — must be allowed by + harness_policy.brain_allow (empty list = unrestricted) and must have a + 
non-empty registry oneshot_template. Returns human-readable failure + messages; empty list = boot OK. An env override (LOOP_ENGINE_BRAIN_CMD / + LOOP_ENGINE_INGEST_CMD) replaces the registry one-shot, so the template + check is skipped for that role.""" + failures: list[str] = [] + allow = config.harness_policy.brain_allow + checks = [("brain", config.brain.harness, "LOOP_ENGINE_BRAIN_CMD")] + if config.ingest.mode == "headless": + checks.append( + ("ingest", config.ingest.harness or config.brain.harness, "LOOP_ENGINE_INGEST_CMD") + ) + for label, harness, override_var in checks: + if allow and harness not in allow: + failures.append( + f"{label}.harness {harness!r} is not in harness_policy.brain_allow " + f"{allow} (lane-config.yaml)" + ) + if os.environ.get(override_var): + continue + try: + template = substrate.harness_field(harness, "oneshot_template") + except SubstrateError: + failures.append(f"{label}.harness {harness!r} is not a registered harness") + continue + if not template: + failures.append( + f"{label}.harness {harness!r} has no one-shot mode (empty " + f"oneshot_template) — it cannot run as the {label}" + ) + return failures + + def _ingest_protocol(project_root: Path) -> str: """The AGENTS.md '### Ingest protocol' section, verbatim; '' when absent.""" try: @@ -338,7 +433,19 @@ def run_once( if pm_adapters: _pm_sync(pm_adapters, "pull", paths.tasks_dir, events, dry_run=dry_run) - prompt = _assemble_prompt(substrate, snap, paths) + # Harness governance (plan A.2): one roster snapshot per cycle, shared by + # the brain prompt and the gate. Only when a policy is actually written — + # the empty policy is a pass-through, so skip the subprocess and keep both + # the prompt and the call profile identical to today. A roster failure + # degrades to None (pass-through) with an event; it never aborts the cycle. 
+ roster = None + if config.harness_policy != HarnessPolicy(): + try: + roster = substrate.harness_roster() + except SubstrateError as exc: + events.append("error", kind="roster-failed", error=str(exc)) + + prompt = _assemble_prompt(substrate, snap, paths, config=config, roster=roster) if dry_run: print(f"dry-run: prompt {len(prompt)} bytes (~{len(prompt) // 4} tokens)") @@ -373,7 +480,12 @@ def run_once( return _file_needs_human(paths, events, approval, second_error, reply) events.append("decision", id=parsed.id, actions=[a.kind for a in parsed.actions]) - classifications = gate.classify_batch(parsed.actions, len(snap.lanes), config) + governed, governance_events = gate.govern_add_lanes(parsed.actions, config, roster) + for governance_event in governance_events: + events.append("governance", **governance_event) + if governance_events: + parsed = dataclasses.replace(parsed, actions=governed) + classifications = gate.classify_batch(parsed.actions, len(snap.lanes), config, roster) events.append("gate", id=parsed.id, classifications=classifications) doc = decisions.create(parsed, classifications, approval, paths) diff --git a/src/loop_orchestrator/substrate.py b/src/loop_orchestrator/substrate.py index 5064a52..a2fe968 100644 --- a/src/loop_orchestrator/substrate.py +++ b/src/loop_orchestrator/substrate.py @@ -285,6 +285,13 @@ def oneshot_template(self, name: str) -> str: harness has no one-shot mode (registry exits 1).""" return self._run("harness-registry", "oneshot", name, timeout=10).stdout.strip() + def harness_roster(self) -> dict[str, dict]: + """Governance roster snapshot keyed by harness name (`harness-registry + roster --json`): per-harness governance fields + a `present` flag. 
+ Never cached — `present` reflects the host's PATH right now.""" + doc = self._run_json("harness-registry", "roster", "--json", timeout=15) + return {entry["name"]: entry for entry in doc.get("harnesses", [])} + # ── deck support ────────────────────────────────────────────────────── # The deck is a NON-WRITER: every mutation it triggers goes through the # same audited CLIs a human would use. These wrappers exist so the deck diff --git a/tasks/archive/T0010-registry-governance-fields.md b/tasks/archive/T0010-registry-governance-fields.md new file mode 100644 index 0000000..ce2d6a1 --- /dev/null +++ b/tasks/archive/T0010-registry-governance-fields.md @@ -0,0 +1,50 @@ +--- +id: T0010 +title: Add per-harness governance fields to the registry +status: done +depends_on: [] +scope: src/loop_orchestrator/ + lib/harness-registry.sh + tests ONLY; ADDITIVE per CONTRACT.md; never break the FROZEN single-word status contract or the frozen probe/field/oneshot verbs; do NOT reinstall the tool or touch the running daemons; no git push +loop: harness-governance +--- + +# T0010 — Add per-harness governance fields to the registry + +## Objective +Phase 0 of the multi-harness agent governance build (DOGFOOD: loop-orchestrator +building its own governance). Full spec: docs/plans/harness-governance.md. +Add capability_tags, cost_tier, autonomy_class, auth_requirement, health_probe, drift_pins to HARNESS_REGISTRY_FIELDS, empty-safe (harness_field returns '' for unset). Populate them for all 12 harnesses per the profile matrix in docs/plans/harness-governance.md (A.3). FROZEN: do not touch the probe/field/oneshot verbs or the existing 8 fields. + +## Context you need +Files: lib/harness-registry.sh; tests for the registry. +The plan (docs/plans/harness-governance.md) is the authoritative spec — read the +cited sections. 
This is the loop-orchestrator repo building its own feature, on +the feature/harness-governance branch in an isolated worktree, so edits here do +NOT affect the running daemons (which resolve scripts from the main checkout). + +## Deliverables +- The change above, ADDITIVE and backward-compatible (empty/None defaults = + today's behavior). Tests for the new surface. +- Before/after note appended to ops-wiki/loops/harness-governance.md. + +## Acceptance criteria +The full gate is green: `make check` AND, after +`chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null`, +`uv run --no-sync --group dev ruff check src tests` AND +`uv run --no-sync --group dev ruff format --check src tests` AND +`uv run --no-sync --group dev pytest -q` (all pass, no regressions). +Existing tests pass UNCHANGED (additive). Commit the batch per the commit policy +(conventional message citing T0010). Do NOT reinstall; do NOT push. + +## Verification +``` +make check +chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null +uv run --no-sync --group dev ruff check src tests && uv run --no-sync --group dev pytest -q +git diff --stat +``` + +## Out of scope / escalate +Phases 2-5 of the plan (readiness/health contract, deck, worktree isolation, +handoff). If a change would touch a FROZEN surface (status word, probe/field/ +oneshot verbs, CONTRACT.md non-additively) or require a reinstall — STOP and +escalate. 
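Reviewer sketch (not part of the patch): the "empty-safe" requirement in the T0010 objective can be illustrated in Python, although the real lookup lives in bash in `lib/harness-registry.sh`. The names `lookup_field` and `SAMPLE_REGISTRY` are hypothetical. The two populated values mirror the profile-matrix spot checks in this change, and the error behavior is an assumption modeled on the "unknown field" rejection the tests describe.

```python
# Illustrative only: the real registry is a bash script, not this dict.
GOVERNANCE_FIELDS = (
    "capability_tags",
    "cost_tier",
    "autonomy_class",
    "auth_requirement",
    "health_probe",
    "drift_pins",
)

# Hypothetical in-memory registry. "newtool" is registered but has no
# governance values set, like an old or partial registry entry.
SAMPLE_REGISTRY = {
    "claude": {"drift_pins": "low"},
    "pi": {"capability_tags": "product,synthesis"},
    "newtool": {},
}


def lookup_field(registry: dict[str, dict[str, str]], name: str, field: str) -> str:
    """Return a governance field, or '' when the harness has not set it."""
    if field not in GOVERNANCE_FIELDS:
        # Unknown field names are still rejected; only unset values are empty-safe.
        raise KeyError(f"unknown field: {field}")
    if name not in registry:
        raise KeyError(f"unknown harness: {name}")
    return registry[name].get(field, "")
```

The point of the split is that an unset value degrades to today's behavior, while a misspelled field name still fails loudly.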
diff --git a/tasks/archive/T0011-roster-health-verbs.md b/tasks/archive/T0011-roster-health-verbs.md new file mode 100644 index 0000000..0dab07e --- /dev/null +++ b/tasks/archive/T0011-roster-health-verbs.md @@ -0,0 +1,50 @@ +--- +id: T0011 +title: Add roster and health CLI verbs to the registry +status: done +depends_on: [T0010] +scope: src/loop_orchestrator/ + lib/harness-registry.sh + tests ONLY; ADDITIVE per CONTRACT.md; never break the FROZEN single-word status contract or the frozen probe/field/oneshot verbs; do NOT reinstall the tool or touch the running daemons; no git push +loop: harness-governance +--- + +# T0011 — Add roster and health CLI verbs to the registry + +## Objective +Phase 0 of the multi-harness agent governance build (DOGFOOD: loop-orchestrator +building its own governance). Full spec: docs/plans/harness-governance.md. +Add `harness-registry roster [--json]` (emit every harness + governance fields + a present flag; contract_version 1) and `harness-registry health ` (ok|missing|unauthenticated|unhealthy). Additive — frozen verbs untouched. Teach the test fake to answer roster/health (unstubbed = empty = today). + +## Context you need +Files: lib/harness-registry.sh; tests/fakes/bin/harness-registry (the fake); registry tests. +The plan (docs/plans/harness-governance.md) is the authoritative spec — read the +cited sections. This is the loop-orchestrator repo building its own feature, on +the feature/harness-governance branch in an isolated worktree, so edits here do +NOT affect the running daemons (which resolve scripts from the main checkout). + +## Deliverables +- The change above, ADDITIVE and backward-compatible (empty/None defaults = + today's behavior). Tests for the new surface. +- Before/after note appended to ops-wiki/loops/harness-governance.md. 
+ +## Acceptance criteria +The full gate is green: `make check` AND, after +`chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null`, +`uv run --no-sync --group dev ruff check src tests` AND +`uv run --no-sync --group dev ruff format --check src tests` AND +`uv run --no-sync --group dev pytest -q` (all pass, no regressions). +Existing tests pass UNCHANGED (additive). Commit the batch per the commit policy +(conventional message citing T0011). Do NOT reinstall; do NOT push. + +## Verification +``` +make check +chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null +uv run --no-sync --group dev ruff check src tests && uv run --no-sync --group dev pytest -q +git diff --stat +``` + +## Out of scope / escalate +Phases 2-5 of the plan (readiness/health contract, deck, worktree isolation, +handoff). If a change would touch a FROZEN surface (status word, probe/field/ +oneshot verbs, CONTRACT.md non-additively) or require a reinstall — STOP and +escalate. diff --git a/tasks/archive/T0012-harness-policy-config.md b/tasks/archive/T0012-harness-policy-config.md new file mode 100644 index 0000000..da9f2e2 --- /dev/null +++ b/tasks/archive/T0012-harness-policy-config.md @@ -0,0 +1,50 @@ +--- +id: T0012 +title: Add HarnessPolicy to the engine config +status: done +depends_on: [] +scope: src/loop_orchestrator/ + lib/harness-registry.sh + tests ONLY; ADDITIVE per CONTRACT.md; never break the FROZEN single-word status contract or the frozen probe/field/oneshot verbs; do NOT reinstall the tool or touch the running daemons; no git push +loop: harness-governance +--- + +# T0012 — Add HarnessPolicy to the engine config + +## Objective +Phase 1 of the multi-harness agent governance build (DOGFOOD: loop-orchestrator +building its own governance). Full spec: docs/plans/harness-governance.md. +Add a HarnessPolicy dataclass on EngineConfig (allow/deny lists, cost ceiling, autonomy cap, role_tag_map). 
Defaults reproduce TODAY's behavior exactly (empty policy = pass-through). _merge already recurses into nested dataclasses and ignores unknown keys — verify and rely on it. See plan A.1. + +## Context you need +Files: src/loop_orchestrator/engine/config.py; tests/test_config*.py. +The plan (docs/plans/harness-governance.md) is the authoritative spec — read the +cited sections. This is the loop-orchestrator repo building its own feature, on +the feature/harness-governance branch in an isolated worktree, so edits here do +NOT affect the running daemons (which resolve scripts from the main checkout). + +## Deliverables +- The change above, ADDITIVE and backward-compatible (empty/None defaults = + today's behavior). Tests for the new surface. +- Before/after note appended to ops-wiki/loops/harness-governance.md. + +## Acceptance criteria +The full gate is green: `make check` AND, after +`chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null`, +`uv run --no-sync --group dev ruff check src tests` AND +`uv run --no-sync --group dev ruff format --check src tests` AND +`uv run --no-sync --group dev pytest -q` (all pass, no regressions). +Existing tests pass UNCHANGED (additive). Commit the batch per the commit policy +(conventional message citing T0012). Do NOT reinstall; do NOT push. + +## Verification +``` +make check +chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null +uv run --no-sync --group dev ruff check src tests && uv run --no-sync --group dev pytest -q +git diff --stat +``` + +## Out of scope / escalate +Phases 2-5 of the plan (readiness/health contract, deck, worktree isolation, +handoff). If a change would touch a FROZEN surface (status word, probe/field/ +oneshot verbs, CONTRACT.md non-additively) or require a reinstall — STOP and +escalate. 
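Reviewer sketch (not part of the patch): the T0012 objective relies on a merge that recurses into nested dataclasses and ignores unknown keys. The following is a minimal sketch of that idea under stated assumptions: `merge_into` is a hypothetical stand-in for the engine's `_merge`, whose real signature is not shown here, and the two dataclasses are trimmed to the fields named in the objective plus the `approval_mode` default the tests assert.

```python
from dataclasses import dataclass, field, fields, is_dataclass, replace


@dataclass
class HarnessPolicy:
    # Empty defaults stand for "no policy written", i.e. pass-through.
    allow: list[str] = field(default_factory=list)
    deny: list[str] = field(default_factory=list)
    cost_ceiling: str = ""
    autonomy_cap: str = ""
    role_tag_map: dict[str, list[str]] = field(default_factory=dict)


@dataclass
class EngineConfig:
    approval_mode: str = "manual"
    harness_policy: HarnessPolicy = field(default_factory=HarnessPolicy)


def merge_into(base, overrides: dict):
    """Hypothetical merge: copy known keys, recurse into nested dataclasses."""
    known = {f.name for f in fields(base)}
    changes = {}
    for key, value in overrides.items():
        if key not in known:
            continue  # unknown keys are ignored rather than raising
        current = getattr(base, key)
        if is_dataclass(current) and isinstance(value, dict):
            changes[key] = merge_into(current, value)
        else:
            changes[key] = value
    return replace(base, **changes)
```

For example, `merge_into(EngineConfig(), {"harness_policy": {"cost_ceiling": "high"}})` changes only the ceiling and leaves every other default in place, which is the property the partial-policy test pins.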
diff --git a/tasks/archive/T0013-classify-harness-gate.md b/tasks/archive/T0013-classify-harness-gate.md new file mode 100644 index 0000000..cfa9653 --- /dev/null +++ b/tasks/archive/T0013-classify-harness-gate.md @@ -0,0 +1,60 @@ +--- +id: T0013 +title: Add classify_harness to the gate +status: done +depends_on: [T0010, T0012] +scope: src/loop_orchestrator/ + lib/harness-registry.sh + tests ONLY; ADDITIVE per CONTRACT.md; never break the FROZEN single-word status contract or the frozen probe/field/oneshot verbs; do NOT reinstall the tool or touch the running daemons; no git push +loop: harness-governance +--- + +# T0013 — Add classify_harness to the gate + +## Objective +Phase 1 of the multi-harness agent governance build (DOGFOOD: loop-orchestrator +building its own governance). Full spec: docs/plans/harness-governance.md. +Add a PURE classify_harness(action, config, roster=None) pass ABOVE the existing SAFE/DESTRUCTIVE/BLOCKED logic for AddLaneAction, per the override-semantics table in plan A.2: allowlist->safe; denied/unknown-to-roster->BLOCKED; not-allowed-but-role-default->rewrite action.harness + emit a governance event; over cost-ceiling / autonomy-cap / auto_approve-over-cap / roster missing|unauthenticated->DESTRUCTIVE; high-drift + unattended + high-risk->DESTRUCTIVE. roster=None default = NO behavior change so existing tests pass. Read action.harness off the action (it exists), keep the gate pure (no IO). + +## Context you need +Files: src/loop_orchestrator/engine/gate.py; src/loop_orchestrator/engine/loop.py (resolve a per-cycle roster snapshot and thread it in); tests/test_gate.py. +The plan (docs/plans/harness-governance.md) is the authoritative spec — read the +cited sections. This is the loop-orchestrator repo building its own feature, on +the feature/harness-governance branch in an isolated worktree, so edits here do +NOT affect the running daemons (which resolve scripts from the main checkout). 
+ +## Deliverables +- The change above, ADDITIVE and backward-compatible (empty/None defaults = + today's behavior). Tests for the new surface. +- Before/after note appended to ops-wiki/loops/harness-governance.md. + +## Acceptance criteria +The full gate is green: `make check` AND, after +`chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null`, +`uv run --no-sync --group dev ruff check src tests` AND +`uv run --no-sync --group dev ruff format --check src tests` AND +`uv run --no-sync --group dev pytest -q` (all pass, no regressions). +Existing tests pass UNCHANGED (additive). Commit the batch per the commit policy +(conventional message citing T0013). Do NOT reinstall; do NOT push. + +## Verification +``` +make check +chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null +uv run --no-sync --group dev ruff check src tests && uv run --no-sync --group dev pytest -q +git diff --stat +``` + +## Out of scope / escalate +Phases 2-5 of the plan (readiness/health contract, deck, worktree isolation, +handoff). If a change would touch a FROZEN surface (status word, probe/field/ +oneshot verbs, CONTRACT.md non-additively) or require a reinstall — STOP and +escalate. + +## Notes — coord ruling (batch-2 judgment call 1, empty-policy semantics) +RULING: `HarnessPolicy()` defaults are PASS-THROUGH by design — all add_lane +harness choices classify SAFE (including the unknown-to-roster / typo case). +Typo-blocking is OPT-IN: it requires an explicit non-default policy field +(e.g. a written `deny`/`allow`/`role_defaults`). The behavior built in 99c6596 +is correct as-is; the proposed one-line change to lift the unknown-to-roster +block out of the empty-policy guard was NOT applied. This preserves the +plan A.2 row 1 invariant ("or no policy → SAFE, unchanged") and keeps the +empty policy byte-identical to today's behavior. Accepted by coord; T0013 done. 
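Reviewer sketch (not part of the patch): the override semantics in the T0013 objective, together with the coord ruling above, can be read as one ordered decision. The sketch below is a simplified illustration and not the code in `gate.py`: `sketch_classify`, `Policy`, and `SAMPLE_ROSTER` are hypothetical names, the tier orderings are assumptions, and the role-default rewrite, `role_tag_map`, and auto_approve-over-cap rules are deliberately left out.

```python
from dataclasses import dataclass, field

# Assumed orderings; the plan's actual tier names and order may differ.
COST_ORDER = {"low": 0, "medium": 1, "high": 2}
AUTONOMY_ORDER = {"none": 0, "attended": 1, "unattended": 2}


@dataclass
class Policy:
    allow: list[str] = field(default_factory=list)
    deny: list[str] = field(default_factory=list)
    cost_ceiling: str = ""
    autonomy_cap: str = ""
    high_risk_roles: list[str] = field(default_factory=lambda: ["infra"])


SAMPLE_ROSTER = {
    "claude": {"present": True, "cost_tier": "high",
               "autonomy_class": "unattended", "drift_pins": "low"},
    "codex": {"present": True, "cost_tier": "high",
              "autonomy_class": "unattended", "drift_pins": "high"},
    "pi": {"present": True, "cost_tier": "medium",
           "autonomy_class": "attended", "drift_pins": "med"},
}


def sketch_classify(harness, role, auto_approve, policy, roster):
    """Return 'blocked', 'destructive', 'safe', or None for pass-through."""
    # Coord ruling: no roster or an unwritten (default) policy changes nothing.
    if roster is None or policy == Policy():
        return None
    entry = roster.get(harness)
    # Denied or unknown-to-roster (the typo case) is blocked outright.
    if harness in policy.deny or entry is None:
        return "blocked"
    if policy.allow and harness not in policy.allow:
        return "blocked"
    # Absent or unauthenticated harnesses need a human decision.
    if entry.get("present") is False or entry.get("health") in ("missing", "unauthenticated"):
        return "destructive"
    ceiling = COST_ORDER.get(policy.cost_ceiling)
    if ceiling is not None and COST_ORDER.get(entry.get("cost_tier"), 0) > ceiling:
        return "destructive"
    cap = AUTONOMY_ORDER.get(policy.autonomy_cap)
    if cap is not None and AUTONOMY_ORDER.get(entry.get("autonomy_class"), 0) > cap:
        return "destructive"
    # High drift + unattended + high-risk role forces approval.
    if entry.get("drift_pins") == "high" and auto_approve and role in policy.high_risk_roles:
        return "destructive"
    return "safe"
```

Because the function takes the roster as an argument and performs no IO, it keeps the purity the objective asks for; the caller resolves one roster snapshot per cycle and threads it in.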
diff --git a/tasks/archive/T0014-boot-validation-and-rubric.md b/tasks/archive/T0014-boot-validation-and-rubric.md new file mode 100644 index 0000000..e85a670 --- /dev/null +++ b/tasks/archive/T0014-boot-validation-and-rubric.md @@ -0,0 +1,51 @@ +--- +id: T0014 +title: Boot-time brain validation + brain-prompt roster rubric +status: done +accepted: 2026-06-13 +depends_on: [T0010, T0011] +scope: src/loop_orchestrator/ + lib/harness-registry.sh + tests ONLY; ADDITIVE per CONTRACT.md; never break the FROZEN single-word status contract or the frozen probe/field/oneshot verbs; do NOT reinstall the tool or touch the running daemons; no git push +loop: harness-governance +--- + +# T0014 — Boot-time brain validation + brain-prompt roster rubric + +## Objective +Phase 1 of the multi-harness agent governance build (DOGFOOD: loop-orchestrator +building its own governance). Full spec: docs/plans/harness-governance.md. +(1) At boot, validate config.brain.harness and config.ingest.harness against a brain_allow list AND against has-a-non-empty-oneshot_template; fail fast with a clear message instead of a first-cycle raise. (2) Append the allowed+present+healthy harness roster + the 'when we choose X' rubric (plan A.4) to the assembled brain prompt / checkpoint header so the brain picks well. Append-only; keep it well under the token budget. + +## Context you need +Files: src/loop_orchestrator/engine/{config,loop,brain}.py; the checkpoint header contract; tests. +The plan (docs/plans/harness-governance.md) is the authoritative spec — read the +cited sections. This is the loop-orchestrator repo building its own feature, on +the feature/harness-governance branch in an isolated worktree, so edits here do +NOT affect the running daemons (which resolve scripts from the main checkout). + +## Deliverables +- The change above, ADDITIVE and backward-compatible (empty/None defaults = + today's behavior). Tests for the new surface. 
+- Before/after note appended to ops-wiki/loops/harness-governance.md. + +## Acceptance criteria +The full gate is green: `make check` AND, after +`chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null`, +`uv run --no-sync --group dev ruff check src tests` AND +`uv run --no-sync --group dev ruff format --check src tests` AND +`uv run --no-sync --group dev pytest -q` (all pass, no regressions). +Existing tests pass UNCHANGED (additive). Commit the batch per the commit policy +(conventional message citing T0014). Do NOT reinstall; do NOT push. + +## Verification +``` +make check +chflags nohidden .venv/lib/python*/site-packages/*.pth 2>/dev/null +uv run --no-sync --group dev ruff check src tests && uv run --no-sync --group dev pytest -q +git diff --stat +``` + +## Out of scope / escalate +Phases 2-5 of the plan (readiness/health contract, deck, worktree isolation, +handoff). If a change would touch a FROZEN surface (status word, probe/field/ +oneshot verbs, CONTRACT.md non-additively) or require a reinstall — STOP and +escalate. diff --git a/tests/conftest.py b/tests/conftest.py index 392d039..98821c3 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -33,6 +33,8 @@ def fakes_env(monkeypatch: pytest.MonkeyPatch, tmp_path: Path) -> Path: "FAKE_BRAIN_MODE", "FAKE_METRICS_FAIL", "FAKE_LINT_FAIL", + "FAKE_ROSTER_JSON", + "FAKE_HEALTH", ): monkeypatch.delenv(var, raising=False) return log diff --git a/tests/fakes/bin/harness-registry b/tests/fakes/bin/harness-registry index 84359ee..3ba8f5d 100755 --- a/tests/fakes/bin/harness-registry +++ b/tests/fakes/bin/harness-registry @@ -23,6 +23,17 @@ case "${1:-}" in *) exit 1 ;; esac ;; + roster) + # Unstubbed = empty roster = today's behavior (governance pass-through). 
+ if [ -n "${FAKE_ROSTER_JSON:-}" ]; then + printf '%s\n' "$FAKE_ROSTER_JSON" + else + printf '{"contract_version": 1, "harnesses": []}\n' + fi + ;; + health) + printf '%s\n' "${FAKE_HEALTH:-ok}" + ;; *) exit 1 ;; diff --git a/tests/test_config.py b/tests/test_config.py index 295040e..4199180 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -1,7 +1,7 @@ from __future__ import annotations from loop_orchestrator.engine import config as config_mod -from loop_orchestrator.engine.config import EngineConfig, load_config +from loop_orchestrator.engine.config import EngineConfig, HarnessPolicy, load_config def test_defaults_with_no_file(tmp_path): @@ -91,3 +91,87 @@ def test_lanes_key_ignored(tmp_path): encoding="utf-8", ) assert load_config(tmp_path) == EngineConfig() + + +def test_harness_policy_defaults_to_pass_through(tmp_path): + cfg = load_config(tmp_path) + assert cfg.harness_policy == HarnessPolicy() + assert cfg.harness_policy.allow == [] + assert cfg.harness_policy.deny == [] + assert cfg.harness_policy.cost_ceiling == "" + assert cfg.harness_policy.autonomy_cap == "" + assert cfg.harness_policy.role_tag_map == {} + + +def test_harness_policy_parsed(tmp_path): + (tmp_path / "lane-config.yaml").write_text( + """ +engine: + harness_policy: + allow: [claude, pi, codex] + deny: [amp] + cost_ceiling: medium + autonomy_cap: attended + role_tag_map: + infra: [ops, code] + web: [product, synthesis] +""", + encoding="utf-8", + ) + policy = load_config(tmp_path).harness_policy + assert policy.allow == ["claude", "pi", "codex"] + assert policy.deny == ["amp"] + assert policy.cost_ceiling == "medium" + assert policy.autonomy_cap == "attended" + assert policy.role_tag_map == {"infra": ["ops", "code"], "web": ["product", "synthesis"]} + + +def test_harness_policy_governance_fields_parsed(tmp_path): + (tmp_path / "lane-config.yaml").write_text( + """ +engine: + harness_policy: + role_defaults: + infra: claude + high_risk_roles: [infra, ops] +""", + 
encoding="utf-8", + ) + policy = load_config(tmp_path).harness_policy + assert policy.role_defaults == {"infra": "claude"} + assert policy.high_risk_roles == ["infra", "ops"] + # defaults: no role rewrites declared, infra is the high-risk role + assert HarnessPolicy().role_defaults == {} + assert HarnessPolicy().high_risk_roles == ["infra"] + assert HarnessPolicy().brain_allow == [] # empty = any harness may be brain + + +def test_harness_policy_brain_allow_parsed(tmp_path): + (tmp_path / "lane-config.yaml").write_text( + """ +engine: + harness_policy: + brain_allow: [claude, codex] +""", + encoding="utf-8", + ) + assert load_config(tmp_path).harness_policy.brain_allow == ["claude", "codex"] + + +def test_harness_policy_partial_keeps_defaults(tmp_path): + (tmp_path / "lane-config.yaml").write_text( + """ +engine: + harness_policy: + cost_ceiling: high + unknown_policy_key: ignored +""", + encoding="utf-8", + ) + cfg = load_config(tmp_path) + assert cfg.harness_policy.cost_ceiling == "high" + assert cfg.harness_policy.allow == [] + assert cfg.harness_policy.role_tag_map == {} + # the rest of the engine config is untouched by a policy-only file + assert cfg.brain.harness == "claude" + assert cfg.approval_mode == "manual" diff --git a/tests/test_gate.py b/tests/test_gate.py index 158ff58..f5cc3be 100644 --- a/tests/test_gate.py +++ b/tests/test_gate.py @@ -158,3 +158,172 @@ def test_batch_upgrade_spares_non_dispatch_and_keeps_blocked(): "destructive", "safe", ] + + +# ── harness governance pass (T0013, plan A.2) ────────────────────────────── +# These use the REAL EngineConfig/HarnessPolicy (the _Config stand-in above +# predates governance and is only touched on the roster=None path). 
+ +from loop_orchestrator.engine.config import EngineConfig, HarnessPolicy # noqa: E402 +from loop_orchestrator.engine.gate import classify_harness, govern_add_lanes # noqa: E402 + + +def _entry(name, present=True, tags="code", cost="medium", autonomy="attended", drift="med"): + return { + "name": name, + "present": present, + "capability_tags": tags, + "cost_tier": cost, + "autonomy_class": autonomy, + "auth_requirement": "account", + "health_probe": "", + "drift_pins": drift, + } + + +ROSTER = { + "claude": _entry( + "claude", tags="brain,ingest,code,ops", cost="high", autonomy="unattended", drift="low" + ), + "codex": _entry("codex", tags="code,brain", cost="high", autonomy="unattended", drift="high"), + "pi": _entry("pi", tags="product,synthesis"), + "amp": _entry("amp", tags="search,research", cost="high", autonomy="unattended", drift="high"), + "droid": _entry("droid", present=False), +} + + +def policy_cfg(**kwargs) -> EngineConfig: + return EngineConfig(harness_policy=HarnessPolicy(**kwargs)) + + +def role_lane(harness="claude", role="infra", auto_approve=False, window="gov-1"): + return AddLaneAction( + window=window, + harness=harness, + brief="b", + rationale="r", + role=role, + auto_approve=auto_approve, + ) + + +def test_roster_none_is_pass_through_even_with_policy(): + cfg = policy_cfg(deny=["claude"]) + assert classify(role_lane("claude"), 1, cfg) == "safe" + assert classify_harness(role_lane("claude"), cfg, None) is None + + +def test_empty_policy_is_pass_through_even_with_roster(): + cfg = EngineConfig() + assert classify(role_lane("amp", auto_approve=True), 1, cfg, ROSTER) == "safe" + assert classify_harness(role_lane("amp"), cfg, ROSTER) is None + assert govern_add_lanes([role_lane("amp")], cfg, ROSTER) == ([role_lane("amp")], []) + + +def test_denied_harness_blocked(): + cfg = policy_cfg(deny=["amp"]) + assert classify(role_lane("amp"), 1, cfg, ROSTER) == "blocked" + + +def test_unknown_to_roster_blocked(): + cfg = policy_cfg(allow=["claude"]) + 
# the registry-typo case: 'cluade' never reaches the bash boundary + assert classify(role_lane("cluade"), 1, cfg, ROSTER) == "blocked" + + +def test_not_in_allowlist_blocked_without_role_default(): + cfg = policy_cfg(allow=["claude", "pi"]) + assert classify(role_lane("codex"), 1, cfg, ROSTER) == "blocked" + + +def test_allowlisted_harness_safe(): + cfg = policy_cfg(allow=["claude", "pi"]) + assert classify(role_lane("claude"), 1, cfg, ROSTER) == "safe" + + +def test_role_tag_map_mismatch_blocked(): + cfg = policy_cfg(role_tag_map={"infra": ["ops", "code"]}) + assert classify(role_lane("pi", role="infra"), 1, cfg, ROSTER) == "blocked" + # unmapped role: no tag constraint + assert classify(role_lane("pi", role="product"), 1, cfg, ROSTER) == "safe" + + +def test_rewrite_to_role_default(): + cfg = policy_cfg(role_tag_map={"infra": ["ops"]}, role_defaults={"infra": "claude"}) + actions, events = govern_add_lanes([role_lane("pi", role="infra")], cfg, ROSTER) + assert len(actions) == 1 + assert actions[0].harness == "claude" + assert events == [ + { + "event": "harness-rewrite", + "window": "gov-1", + "role": "infra", + "from_harness": "pi", + "to_harness": "claude", + } + ] + # the rewritten action then classifies clean + assert classify(actions[0], 1, cfg, ROSTER) == "safe" + + +def test_no_rewrite_when_default_itself_not_allowed(): + cfg = policy_cfg( + role_tag_map={"infra": ["ops"]}, role_defaults={"infra": "pi"}, deny=["claude"] + ) + actions, events = govern_add_lanes([role_lane("codex", role="infra")], cfg, ROSTER) + assert actions[0].harness == "codex" # untouched: pi has no 'ops' tag either + assert events == [] + assert classify(actions[0], 1, cfg, ROSTER) == "blocked" + + +def test_cost_ceiling_exceeded_destructive(): + cfg = policy_cfg(cost_ceiling="medium") + assert classify(role_lane("claude"), 1, cfg, ROSTER) == "destructive" + assert classify(role_lane("pi"), 1, cfg, ROSTER) == "safe" + + +def test_autonomy_cap_exceeded_destructive(): + cfg = 
policy_cfg(autonomy_cap="attended") + assert classify(role_lane("codex"), 1, cfg, ROSTER) == "destructive" + assert classify(role_lane("pi"), 1, cfg, ROSTER) == "safe" + + +def test_roster_missing_harness_destructive(): + cfg = policy_cfg(allow=["droid", "claude"]) + assert classify(role_lane("droid"), 1, cfg, ROSTER) == "destructive" + + +def test_roster_health_word_destructive(): + cfg = policy_cfg(allow=["codex", "claude"]) + sick = {**ROSTER, "codex": {**ROSTER["codex"], "health": "unauthenticated"}} + assert classify(role_lane("codex"), 1, cfg, sick) == "destructive" + + +def test_high_drift_unattended_high_risk_destructive(): + cfg = policy_cfg(allow=["codex", "claude"]) # high_risk_roles defaults to ["infra"] + assert classify(role_lane("codex", role="infra", auto_approve=True), 1, cfg, ROSTER) == ( + "destructive" + ) + # attended, or a low-risk role, stays safe + assert classify(role_lane("codex", role="infra"), 1, cfg, ROSTER) == "safe" + assert classify(role_lane("codex", role="search", auto_approve=True), 1, cfg, ROSTER) == "safe" + + +def test_blocked_beats_harness_destructive_and_shape_rules_survive(): + cfg = policy_cfg(deny=["amp"]) + # denied + raw cmd: blocked wins over the cmd shape rule + denied_with_cmd = AddLaneAction( + window="gov-2", harness="amp", cmd="python w.py", brief="b", rationale="r" + ) + assert classify(denied_with_cmd, 1, cfg, ROSTER) == "blocked" + # allowed harness + raw cmd: the shape rule still fires + allowed_with_cmd = AddLaneAction( + window="gov-3", harness="claude", cmd="python w.py", brief="b", rationale="r" + ) + assert classify(allowed_with_cmd, 1, cfg, ROSTER) == "destructive" + + +def test_classify_batch_threads_roster(): + cfg = policy_cfg(deny=["amp"]) + actions = [role_lane("amp", window="gov-4"), dispatch("run tests")] + assert classify_batch(actions, 1, cfg, ROSTER) == ["blocked", "safe"] diff --git a/tests/test_harness_registry.py b/tests/test_harness_registry.py new file mode 100644 index 0000000..931d0b3 --- 
/dev/null +++ b/tests/test_harness_registry.py @@ -0,0 +1,217 @@ +"""Governance fields on the REAL lib/harness-registry.sh (T0010). + +Runs the registry script through bash — no fakes — so these tests pin both +the new governance surface and the frozen field/oneshot contract around it. +""" + +from __future__ import annotations + +import subprocess +from pathlib import Path + +REGISTRY = Path(__file__).resolve().parents[1] / "lib" / "harness-registry.sh" + +HARNESSES = [ + "pi", + "claude", + "opencode", + "codex", + "cursor-agent", + "hermes", + "droid", + "forge", + "amp", + "openclaw", + "mprocs", + "shell", +] + +GOVERNANCE_FIELDS = [ + "capability_tags", + "cost_tier", + "autonomy_class", + "auth_requirement", + "health_probe", + "drift_pins", +] + + +def run_cli(*args: str) -> subprocess.CompletedProcess[str]: + return subprocess.run( + ["bash", str(REGISTRY), *args], capture_output=True, text=True, check=False + ) + + +def field_value(name: str, field: str) -> str: + proc = run_cli("field", name, field) + assert proc.returncode == 0, f"field {name} {field}: {proc.stderr}" + return proc.stdout.rstrip("\n") + + +def test_list_unchanged(): + proc = run_cli("list") + assert proc.returncode == 0 + assert proc.stdout.split() == HARNESSES + + +def test_frozen_field_and_oneshot_verbs_unchanged(): + assert field_value("claude", "oneshot_template") == "claude -p {prompt}" + proc = run_cli("oneshot", "pi") + assert proc.returncode == 1 + assert proc.stdout == "" + + +def test_governance_fields_resolve_for_every_harness(): + for name in HARNESSES: + for field in GOVERNANCE_FIELDS: + field_value(name, field) # asserts exit 0 internally + + +def test_governance_values_match_profile_matrix(): + # Spot checks against docs/plans/harness-governance.md A.3. 
+ assert field_value("claude", "drift_pins") == "low" + assert "brain" in field_value("claude", "capability_tags").split(",") + assert field_value("codex", "drift_pins") == "high" + assert field_value("amp", "drift_pins") == "high" + assert field_value("hermes", "drift_pins") == "high" + assert field_value("pi", "capability_tags") == "product,synthesis" + assert field_value("pi", "autonomy_class") == "attended" + assert field_value("opencode", "cost_tier") == "low" + assert field_value("openclaw", "auth_requirement") == "gateway" + for name in ("mprocs", "shell"): + assert field_value(name, "autonomy_class") == "none" + assert field_value(name, "auth_requirement") == "none" + + +def test_unattended_class_iff_auto_approve_flag(): + # A.3: only harnesses with a real auto-approve flag are unattended-capable. + for name in HARNESSES: + unattended = field_value(name, "autonomy_class") == "unattended" + has_flag = field_value(name, "auto_approve_flag") != "" + assert unattended == has_flag, name + + +def test_unset_governance_field_is_empty_safe(): + # A registered harness with no governance vars set returns "" and exit 0, + # so old/partial registries degrade to today's behavior. + script = ( + f'source "{REGISTRY}"; HARNESS_REGISTRY_NAMES+=(newtool); ' + 'v="$(harness_field newtool capability_tags)" && printf "[%s]" "$v"' + ) + proc = subprocess.run(["bash", "-c", script], capture_output=True, text=True, check=False) + assert proc.returncode == 0, proc.stderr + assert proc.stdout == "[]" + + +def test_unknown_field_still_rejected(): + # Sourced harness_field returns 1 for an unknown field name. (The CLI + # `field` verb exits 0 regardless — pre-existing frozen behavior; the + # trailing newline printf masks the return code.) 
+ script = f'source "{REGISTRY}"; harness_field claude bogus_field' + proc = subprocess.run(["bash", "-c", script], capture_output=True, text=True, check=False) + assert proc.returncode == 1 + assert "unknown field" in proc.stderr + cli = run_cli("field", "claude", "bogus_field") + assert cli.returncode == 0 + assert cli.stdout == "\n" + + +def test_fields_verb_includes_governance_rows(): + proc = run_cli("fields", "claude") + assert proc.returncode == 0 + for field in GOVERNANCE_FIELDS: + assert field in proc.stdout + + +def test_probe_output_untouched_by_governance_fields(): + # probe is a frozen verb: its output must not grow governance rows. + proc = run_cli("probe", "shell") + assert proc.returncode == 0 + for field in GOVERNANCE_FIELDS: + assert field not in proc.stdout + + +# ── roster + health verbs (T0011) ────────────────────────────────────────── + + +def test_roster_json_contract(): + import json + + proc = run_cli("roster", "--json") + assert proc.returncode == 0, proc.stderr + doc = json.loads(proc.stdout) + assert doc["contract_version"] == 1 + entries = {h["name"]: h for h in doc["harnesses"]} + assert list(entries) == HARNESSES + for entry in entries.values(): + assert isinstance(entry["present"], bool) + for field in GOVERNANCE_FIELDS: + assert field in entry + assert entries["claude"]["drift_pins"] == "low" + # shell needs no binary, so it is always present. + assert entries["shell"]["present"] is True + + +def test_roster_plain_lists_every_harness(): + proc = run_cli("roster") + assert proc.returncode == 0 + lines = proc.stdout.splitlines() + assert len(lines) == len(HARNESSES) + assert [line.split()[0] for line in lines] == HARNESSES + + +def test_health_shell_ok(): + # shell has no binary and no probe: always ok, exit 0. 
+ proc = run_cli("health", "shell") + assert proc.returncode == 0 + assert proc.stdout.strip() == "ok" + + +def test_health_unknown_harness(): + proc = run_cli("health", "nosuch") + assert proc.returncode == 1 + assert "unknown harness" in proc.stderr + assert proc.stdout == "" + + +def test_health_missing_binary(): + import os + + # A PATH without the codex binary (but with bash) must read missing. + env = {**os.environ, "PATH": "/usr/bin:/bin"} + proc = subprocess.run( + ["bash", str(REGISTRY), "health", "codex"], + capture_output=True, + text=True, + check=False, + env=env, + ) + assert proc.returncode == 1 + assert proc.stdout.strip() == "missing" + + +def _health_with_overrides(overrides: str) -> subprocess.CompletedProcess[str]: + # Drive the probe paths hermetically: source the registry, override the + # target harness's governance vars in-shell, then call the CLI entrypoint. + script = f'source "{REGISTRY}"; {overrides}; _harness_registry_cli health shell' + return subprocess.run(["bash", "-c", script], capture_output=True, text=True, check=False) + + +def test_health_probe_pass_is_ok(): + proc = _health_with_overrides("HARNESS_SHELL_HEALTH_PROBE=true") + assert proc.returncode == 0 + assert proc.stdout.strip() == "ok" + + +def test_health_probe_fail_reads_unhealthy_without_auth(): + proc = _health_with_overrides("HARNESS_SHELL_HEALTH_PROBE=false") + assert proc.returncode == 1 + assert proc.stdout.strip() == "unhealthy" + + +def test_health_probe_fail_reads_unauthenticated_with_auth(): + proc = _health_with_overrides( + "HARNESS_SHELL_HEALTH_PROBE=false; HARNESS_SHELL_AUTH_REQUIREMENT=account" + ) + assert proc.returncode == 1 + assert proc.stdout.strip() == "unauthenticated" diff --git a/tests/test_loop_integration.py b/tests/test_loop_integration.py index c171a46..40aaee7 100644 --- a/tests/test_loop_integration.py +++ b/tests/test_loop_integration.py @@ -438,3 +438,116 @@ def test_action_line_truncates_long_payload(): second = 
action_line(action).split("\n")[1] assert "payload=" + "x" * 200 + "…" in second assert "x" * 201 not in second + + +# ── boot validation + brain-prompt roster rubric (T0014) ─────────────────── + +from loop_orchestrator.engine.config import HarnessPolicy # noqa: E402 +from loop_orchestrator.engine.loop import _assemble_prompt, validate_boot_config # noqa: E402 +from loop_orchestrator.engine.observe import Observer # noqa: E402 +from loop_orchestrator.substrate import Substrate # noqa: E402 + + +def _sub(project: Path) -> Substrate: + return Substrate(project, "demo") + + +def test_boot_validation_clean_defaults(project, monkeypatch): + # No env override: the registry one-shot template is actually consulted. + monkeypatch.delenv("LOOP_ENGINE_BRAIN_CMD", raising=False) + assert validate_boot_config(EngineConfig(), _sub(project)) == [] + + +def test_boot_validation_rejects_oneshotless_brain(project, monkeypatch): + monkeypatch.delenv("LOOP_ENGINE_BRAIN_CMD", raising=False) + from loop_orchestrator.engine.config import BrainConfig + + cfg = EngineConfig(brain=BrainConfig(harness="pi")) + failures = validate_boot_config(cfg, _sub(project)) + assert len(failures) == 1 and "brain.harness 'pi'" in failures[0] + + +def test_boot_validation_env_override_skips_oneshot_check(project): + # project fixture sets LOOP_ENGINE_BRAIN_CMD: a oneshot-less brain boots. 
+ from loop_orchestrator.engine.config import BrainConfig + + cfg = EngineConfig(brain=BrainConfig(harness="pi")) + assert validate_boot_config(cfg, _sub(project)) == [] + + +def test_boot_validation_brain_allow(project): + cfg = EngineConfig(harness_policy=HarnessPolicy(brain_allow=["codex"])) + failures = validate_boot_config(cfg, _sub(project)) + assert len(failures) == 1 + assert "brain_allow" in failures[0] and "'claude'" in failures[0] + + +def test_boot_validation_checks_headless_ingest(project, monkeypatch): + monkeypatch.delenv("LOOP_ENGINE_BRAIN_CMD", raising=False) + monkeypatch.delenv("LOOP_ENGINE_INGEST_CMD", raising=False) + from loop_orchestrator.engine.config import IngestConfig + + cfg = EngineConfig(ingest=IngestConfig(mode="headless", harness="pi")) + failures = validate_boot_config(cfg, _sub(project)) + assert len(failures) == 1 and "ingest.harness 'pi'" in failures[0] + # lane mode never validates the ingest harness + cfg = EngineConfig(ingest=IngestConfig(mode="lane", harness="pi")) + assert validate_boot_config(cfg, _sub(project)) == [] + + +def test_cli_once_fails_fast_on_bad_brain(project, call_log, capsys, monkeypatch): + monkeypatch.delenv("LOOP_ENGINE_BRAIN_CMD", raising=False) + (project / "lane-config.yaml").write_text( + "engine:\n brain:\n harness: pi\n", encoding="utf-8" + ) + rc = cli.main(["--project-root", str(project), "--session", "demo", "once", "--dry-run"]) + assert rc == 2 + assert "brain.harness 'pi'" in capsys.readouterr().err + assert _brain_calls(call_log) == [] + # fail-fast: no cycle started, no events file written + assert not SessionPaths(project, "demo").events_path.exists() + + +ROSTER_FIXTURE = { + "claude": { + "name": "claude", + "present": True, + "capability_tags": "brain,ingest,code,ops", + "cost_tier": "high", + "autonomy_class": "unattended", + "drift_pins": "low", + }, + "amp": { + "name": "amp", + "present": True, + "capability_tags": "search,research", + "cost_tier": "high", + "autonomy_class": 
"unattended", + "drift_pins": "high", + }, + "droid": {"name": "droid", "present": False, "capability_tags": "code"}, +} + + +def test_prompt_roster_block_filtered_and_rubric(project): + paths = SessionPaths(project, "demo") + paths.ensure() + sub = _sub(project) + snap = Observer(sub, paths).snapshot() + cfg = EngineConfig(harness_policy=HarnessPolicy(deny=["amp"])) + prompt = _assemble_prompt(sub, snap, paths, config=cfg, roster=ROSTER_FIXTURE) + assert "--- harness roster (allowed + present + healthy) ---" in prompt + assert "\nclaude tags=brain,ingest,code,ops cost=high" in prompt + assert "\namp tags=" not in prompt # denied + assert "\ndroid tags=" not in prompt # not present + assert "--- harness selection rubric (first match wins) ---" in prompt + + +def test_prompt_unchanged_without_roster(project): + paths = SessionPaths(project, "demo") + paths.ensure() + sub = _sub(project) + snap = Observer(sub, paths).snapshot() + prompt = _assemble_prompt(sub, snap, paths) + assert "harness roster" not in prompt + assert "selection rubric" not in prompt diff --git a/tests/test_substrate.py b/tests/test_substrate.py index 28b6659..4fa0374 100644 --- a/tests/test_substrate.py +++ b/tests/test_substrate.py @@ -111,6 +111,24 @@ def test_harness_registry(sub, call_log): sub.oneshot_template("pi") +def test_harness_roster_unstubbed_is_empty(sub, call_log): + assert sub.harness_roster() == {} + assert call_log() == ["harness-registry roster --json"] + + +def test_harness_roster_parses_entries(sub, monkeypatch): + monkeypatch.setenv( + "FAKE_ROSTER_JSON", + '{"contract_version": 1, "harnesses": [' + '{"name": "claude", "present": true, "drift_pins": "low"},' + '{"name": "amp", "present": false, "drift_pins": "high"}]}', + ) + roster = sub.harness_roster() + assert set(roster) == {"claude", "amp"} + assert roster["claude"]["present"] is True + assert roster["amp"]["drift_pins"] == "high" + + def test_dispatch_argv_order(sub, call_log): sub.dispatch("web", "echo hi", 
wait_ready=True, interrupt=True) sub.dispatch("docs", "hello", mode="command")