From b7acfb7135d79d44eb0b980b831f2439b23dfaae Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 20:57:37 -0400
Subject: [PATCH 01/23] test(e2e): execute real shell assertions; delete
 dry-run, --validate-only, and the bash runner

The merged hybrid scenario architecture shipped scaffolding that looked like it
ran E2E tests but did not. Two layers were producing fake green:

1. phase.ts:executeStep had no child_process.spawn anywhere. Every real
   shell/probe step fell through to a hardcoded
   { status: "failed", message: "unsupported live step" }, and a handful of
   fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient,
   phase-1-skeleton) were the only paths that reported "passed". CI green
   meant the plan compiled, not that any assertion executed.

2. ~30 shell scripts under validation_suites/, onboarding_assertions/,
   nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with
   "if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Because the
   workflows passed --dry-run, every script silently exited 0 before its
   real assertion ran.

This change rips out both layers in one shot. The TS runner has one execution
mode: live. There is no flag, env var, helper, branch, or comment in any
production path that can produce a fake pass.

Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn,
  with detached process groups so timeouts kill bash + sleep cleanly. Probe
  steps return skipped "probe not registered" until the registry lands.
  Pending steps return skipped "pending: <ref>". Unknown kinds throw.
  Real evidence is captured to .e2e/logs/<step-id>.log. Step-level
  reliability.timeoutSeconds and retry.{attempts,on} policy are enforced
  here, not in clients.
- run.ts: --dry-run and --validate-only deleted. Default invocation is live
  execution. --list and --plan-only (local debug only) survive as read-only
  modes.
  --emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported a
  "skipped" status; it is now actually used.
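
A minimal sketch of the detached-process-group timeout described for
phase.ts above. It is an illustration under stated assumptions, not the
code in this patch: runShellStep's signature and the StepResult shape are
hypothetical, and evidence logging and retry handling are left out.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape; the real AssertionResult in types.ts may differ.
type StepResult = { status: "passed" | "failed"; message: string };

// Runs `script` under bash in its own process group and kills the whole
// group if it runs longer than `timeoutSeconds`.
function runShellStep(script: string, timeoutSeconds: number): Promise<StepResult> {
  return new Promise((resolve) => {
    // detached: true makes the child a process-group leader on POSIX.
    const child = spawn("bash", ["-c", script], { detached: true, stdio: "ignore" });
    let timedOut = false;
    const timer = setTimeout(() => {
      timedOut = true;
      if (child.pid !== undefined) {
        try {
          // A negative pid signals every process in the group.
          process.kill(-child.pid, "SIGKILL");
        } catch {
          // The group had already exited; nothing left to kill.
        }
      }
    }, timeoutSeconds * 1000);
    child.on("error", (err) => {
      clearTimeout(timer);
      resolve({ status: "failed", message: `spawn error: ${err.message}` });
    });
    child.on("close", (code, signal) => {
      clearTimeout(timer);
      if (timedOut) {
        resolve({ status: "failed", message: `timed out after ${timeoutSeconds}s` });
      } else if (code === 0) {
        resolve({ status: "passed", message: "exit 0" });
      } else {
        resolve({ status: "failed", message: `exit ${code ?? signal}` });
      }
    });
  });
}
```

Signalling the negative pid is what lets a timeout take down bash together
with anything it started, such as a long sleep, rather than leaving an
orphan behind.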

Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both
  --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only,
  and --validate-only to NOT appear in the workflow.
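
The boundary rule above can be sketched as a plain text scan. This is a
hedged illustration only: findForbiddenFlags is a hypothetical helper, and
the real workflow-boundary.mts may parse the workflow differently.

```typescript
// Flags that must not appear in the workflow, per the rule above.
const FORBIDDEN_FLAGS = ["--dry-run", "--plan-only", "--validate-only"];

// Hypothetical helper: returns one "line N: <flag>" entry per occurrence.
function findForbiddenFlags(workflowText: string): string[] {
  const violations: string[] = [];
  workflowText.split("\n").forEach((line, index) => {
    for (const flag of FORBIDDEN_FLAGS) {
      if (line.includes(flag)) {
        violations.push(`line ${index + 1}: ${flag}`);
      }
    }
  });
  return violations;
}
```

A substring scan like this also flags the forbidden strings inside
comments, which is the stricter reading of "must not appear".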

Bash entrypoints (this PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/
  gateway-check/suite-execution -> 5-line fail-fast stub pointing at
  run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep
  walks typed assertionGroups directly; nothing in TS calls a YAML-walking
  bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**:
  every "if e2e_env_is_dry_run; then ... exit 0; fi" and every
  "[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion
  logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable
  from the deleted dry-run paths) deleted.

Tests
- DELETED (these only validated now-dead code paths):
    e2e-suite-runner.test.ts             (run-suites.sh behavior)
    e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
    e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
    e2e-phase-orchestrators.test.ts: now exercises real shell spawning
      via temp scripts (pass/fail/timeout/retry/missing-ref), real probe
      skipping with visible reason, and real pending skipping. The
      previous fake-pass refs in this test were the canonical example of
      the problem.
- TRIMMED:
    e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of
      real bash semantics survive.
    e2e-scenario-additional-families.test.ts: planOnly-via-bash tests
      deleted; resolveScenario-direct tests survive.
    e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn
      tests deleted; resolver unit tests survive.
    e2e-context-helper.test.ts: dry-run trace test deleted.

Docs
- docs/README.md: updated to state one runner, one mode (live), no
  dry-run, no validate-only. Bash entrypoints documented as deprecated
  fail-fast stubs.

Verification
- run.ts --list          : prints the typed registry (intact)
- run.ts --emit-matrix   : emits JSON matrix for the dynamic-matrix
                           workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes,
                           real failures with real evidence logs. Phase
                           results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run,
  fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton,
  unsupported-live-step, --validate-only, or RunContext.dryRun in any
  production path.

CI for this PR will go red on environments where nemoclaw is not actually
installed and onboarded. That is the point. Red is the first honest signal
in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring
into the real install/onboard dispatchers, old YAML resolver deletion)
fix the real failures rather than hide them.

Spec gates addressed: Phase 6 (orchestrators execute live shell steps),
Phase 7 (single TS runtime entrypoint, bash runners deprecated), and
the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter
gone from active paths). The old YAML resolver source under
runtime/resolver/ stays for now; its deletion is the next PR.
---
 .github/workflows/e2e-scenarios.yaml          |   5 +-
 test/e2e-scenario/docs/README.md              |  32 +-
 .../e2e-context-helper.test.ts                |  35 --
 .../e2e-expected-state-validator.test.ts      | 235 ---------
 .../framework-tests/e2e-lib-helpers.test.ts   | 146 +-----
 .../e2e-phase-orchestrators.test.ts           | 182 +++++--
 .../e2e-scenario-additional-families.test.ts  | 104 +---
 .../e2e-scenario-first-migration.test.ts      | 102 ----
 .../e2e-scenario-resolver.test.ts             |  61 +--
 .../framework-tests/e2e-suite-runner.test.ts  | 249 ---------
 .../fixtures/older-base-image.sh              |  10 +-
 .../nemoclaw_scenarios/install/dispatch.sh    |   2 +-
 .../nemoclaw_scenarios/install/launchable.sh  |   5 -
 .../nemoclaw_scenarios/install/ollama.sh      |   4 -
 .../nemoclaw_scenarios/install/public-curl.sh |   4 -
 .../install/repo-current.sh                   |   5 -
 .../nemoclaw_scenarios/onboard/dispatch.sh    |   4 -
 test/e2e-scenario/runtime/lib/env.sh          |   5 -
 test/e2e-scenario/runtime/run-scenario.sh     | 490 +-----------------
 test/e2e-scenario/runtime/run-suites.sh       | 140 +----
 .../scenarios/orchestrators/phase.ts          | 185 ++++++-
 test/e2e-scenario/scenarios/run.ts            |  77 ++-
 test/e2e-scenario/scenarios/types.ts          |   1 -
 .../validation_suites/assert/gateway-alive.sh |   4 -
 .../validation_suites/assert/sandbox-alive.sh |   5 -
 .../hermes/00-hermes-health.sh                |   4 -
 .../inference/cloud/00-models-health.sh       |   5 -
 .../inference/cloud/01-chat-completion.sh     |   5 -
 .../cloud/02-inference-local-from-sandbox.sh  |   5 -
 .../ollama-auth-proxy/00-proxy-reachable.sh   |   4 -
 .../ollama-gpu/00-ollama-models-health.sh     |   4 -
 .../ollama-gpu/01-ollama-chat-completion.sh   |   4 -
 .../lib/inference_routing.sh                  |  22 -
 .../lib/messaging_providers.sh                |   7 -
 .../validation_suites/lib/rebuild_upgrade.sh  |  32 --
 .../lib/sandbox_lifecycle.sh                  |  29 +-
 .../lib/security_policy_credentials.sh        |  20 -
 .../messaging/common/03-bridge-reachable.sh   |   5 -
 .../platform/macos/00-macos-smoke.sh          |   5 -
 .../platform/wsl/00-wsl-smoke.sh              |   5 -
 .../validation_suites/sandbox-exec.sh         |  11 -
 .../smoke/00-cli-available.sh                 |   5 -
 .../smoke/03-sandbox-shell.sh                 |   6 -
 tools/e2e-scenarios/workflow-boundary.mts     |  17 +-
 44 files changed, 454 insertions(+), 1833 deletions(-)
 delete mode 100644 test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
 delete mode 100644 test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
 delete mode 100644 test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
diff --git a/.github/workflows/e2e-scenarios.yaml b/.github/workflows/e2e-scenarios.yaml
index 771544c979..283f2196b2 100644
--- a/.github/workflows/e2e-scenarios.yaml
+++ b/.github/workflows/e2e-scenarios.yaml
@@ -81,7 +81,6 @@ jobs:
           for raw in "${IDS[@]}"; do
             id="${raw//[[:space:]]/}"
             [ -n "${id}" ] || continue
-            npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${id}" --plan-only >/dev/null
             runner="${ROUTES[$id]:-}"
             if [ -z "${runner}" ]; then
               echo "::error::No runner route for scenario: ${id}" >&2
@@ -135,7 +134,7 @@ jobs:
             echo "::error::Invalid scenario input: ${SCENARIOS}" >&2
             exit 1
           fi
-          npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}" --dry-run
+          npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"
 
       - name: Resolve workspace paths for WSL
         if: contains(inputs.scenarios || github.event.inputs.scenarios, 'wsl-repo-cloud-openclaw')
@@ -172,7 +171,7 @@ jobs:
               mkdir -p "${WSL_WORKDIR}"
               export E2E_CONTEXT_DIR="${WSL_WORKDIR}"
               npm ci --ignore-scripts
-              npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}" --dry-run
+              npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"
             '
 
       - name: Append plan summary
diff --git a/test/e2e-scenario/docs/README.md b/test/e2e-scenario/docs/README.md
index 15ad01d88d..f4acc8eebe 100644
--- a/test/e2e-scenario/docs/README.md
+++ b/test/e2e-scenario/docs/README.md
@@ -32,24 +32,25 @@ test plan, expected state, and post-onboard suites. Test plans can also declare
 onboarding assertions that run after install/onboard and before expected-state
 validation.
 
-Plan-only resolution accepts either an alias or a test plan ID:
-
-```bash
-bash test/e2e-scenario/runtime/run-scenario.sh ubuntu-repo-cloud-openclaw --plan-only
-bash test/e2e-scenario/runtime/run-scenario.sh ubuntu-repo-docker__cloud-nvidia-openclaw --plan-only
-```
-
 ## How to run
 
+The TypeScript runner is the only supported entrypoint. There is one
+execution mode: live. There is no `--dry-run`, no `--validate-only`, no
+fake-pass code path. Plan output is emitted as a side effect of the
+live run.
+
 ```bash
-bash test/e2e-scenario/runtime/run-scenario.sh <id> --plan-only       # resolve + print plan, no side effects
-bash test/e2e-scenario/runtime/run-scenario.sh <id> --dry-run         # helpers short-circuit with trace
-bash test/e2e-scenario/runtime/run-scenario.sh <id> --validate-only   # assume setup done; validate expected state
-bash test/e2e-scenario/runtime/run-scenario.sh <id>                   # full live run
-bash test/e2e-scenario/runtime/run-suites.sh <suite-id> [<suite-id>…]
-bash test/e2e-scenario/runtime/coverage-report.sh                     # Markdown matrix of scenario × suite
+npx tsx test/e2e-scenario/scenarios/run.ts --scenarios <id[,id...]>     # live execution (the only mode)
+npx tsx test/e2e-scenario/scenarios/run.ts --list                       # list canonical scenario ids
+npx tsx test/e2e-scenario/scenarios/run.ts --emit-matrix                # JSON registry payload for CI matrix fan-out
+npx tsx test/e2e-scenario/scenarios/run.ts --scenarios <id> --plan-only # local debug only; MUST NOT appear in any workflow
+bash test/e2e-scenario/runtime/coverage-report.sh                       # Markdown matrix of scenario × suite
 ```
 
+The deprecated bash entrypoints `runtime/run-scenario.sh` and
+`runtime/run-suites.sh` exist only as fail-fast stubs; they print a
+pointer at `run.ts` and exit non-zero.
+
 Override the runtime context dir with `E2E_CONTEXT_DIR=<path>` (default
 `.e2e/`, gitignored). The scenario runner and suites communicate only
 through `$E2E_CONTEXT_DIR/context.env` — suites do not rediscover
@@ -72,7 +73,8 @@ test/e2e/
     assert/        # outcome assertions (inference, credentials, policy, messaging)
     smoke/ inference/ hermes/ platform/ security/   # suite scripts grouped by concern
   runtime/                           # entry points + cross-cutting shared libs
-    run-scenario.sh / run-suites.sh / coverage-report.sh
+    run-scenario.sh / run-suites.sh    # DEPRECATED fail-fast stubs (see above)
+    coverage-report.sh
     resolver/      # TypeScript: load, plan, validate, coverage (invoked via tsx)
     lib/           # shared shell helpers: context, env, cleanup, logging, artifacts, sandbox-teardown
 ```
@@ -89,7 +91,7 @@ three YAML files above, plus shell scripts under
 `validation_suites/assert/`, or `validation_suites/<category>/`. The
 schemas in
 [`../runtime/resolver/schema.ts`](../runtime/resolver/schema.ts)
-describe the required shape; `run-scenario.sh <id> --plan-only`
+describe the required shape; `npx tsx test/e2e-scenario/scenarios/run.ts --scenarios <id> --plan-only`
 validates your change without running anything destructive.
 
 When adding a suite assertion, emit or preserve a stable `PASS: <id>` /
diff --git a/test/e2e-scenario/framework-tests/e2e-context-helper.test.ts b/test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
index 6a7c97959f..0134d6adc9 100644
--- a/test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
@@ -9,7 +9,6 @@ import path from "node:path";
 
 const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
 const CONTEXT_LIB = path.join(REPO_ROOT, "test/e2e-scenario/runtime/lib/context.sh");
-const RUN_SCENARIO = path.join(REPO_ROOT, "test/e2e-scenario/runtime/run-scenario.sh");
 
 function runBash(script: string, env: Record<string, string> = {}): SpawnSyncReturns<string> {
   return spawnSync("bash", ["-c", script], {
@@ -86,38 +85,4 @@ describe("E2E context helper (runtime/lib/context.sh)", () => {
     }
   });
 
-  it("scenario_plan_execution_should_emit_context_under_dry_run", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-ctx-"));
-    try {
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--dry-run"],
-        {
-          env: { ...process.env, E2E_CONTEXT_DIR: tmp },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      const ctxPath = path.join(tmp, "context.env");
-      expect(fs.existsSync(ctxPath), `context.env missing in ${tmp}`).toBe(true);
-      const ctx = fs.readFileSync(ctxPath, "utf8");
-      for (const key of [
-        "E2E_SCENARIO",
-        "E2E_PLATFORM_OS",
-        "E2E_INSTALL_METHOD",
-        "E2E_ONBOARDING_PATH",
-        "E2E_AGENT",
-        "E2E_PROVIDER",
-        "E2E_SANDBOX_NAME",
-        "E2E_GATEWAY_URL",
-        "E2E_INFERENCE_ROUTE",
-      ]) {
-        expect(ctx, `${key} missing from context.env`).toMatch(new RegExp(`^${key}=`, "m"));
-      }
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
 });
diff --git a/test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts b/test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
deleted file mode 100644
index ba1f2b5f31..0000000000
--- a/test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
+++ /dev/null
@@ -1,235 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-import { describe, it, expect } from "vitest";
-import { spawnSync } from "node:child_process";
-import fs from "node:fs";
-import os from "node:os";
-import path from "node:path";
-
-import {
-  validateExpectedState,
-  type ProbeResults,
-} from "../runtime/resolver/validator.ts";
-import type { ExpectedStateConfig, ResolvedSuite } from "../runtime/resolver/schema.ts";
-
-const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
-const RUN_SCENARIO = path.join(REPO_ROOT, "test/e2e-scenario/runtime/run-scenario.sh");
-
-function cloudOpenclawReady(): ExpectedStateConfig {
-  return {
-    cli: { installed: true },
-    gateway: { expected: "present", health: "healthy" },
-    sandbox: { expected: "present", status: "running", agent: "openclaw" },
-    inference: {
-      expected: "available",
-      provider: "nvidia",
-      route: "inference-local",
-      mode: "gateway-routed",
-    },
-    credentials: { expected: "present", storage: "gateway-managed" },
-  };
-}
-
-function passingProbes(): ProbeResults {
-  return {
-    "cli.installed": true,
-    "gateway.health": "healthy",
-    "gateway.expected": "present",
-    "sandbox.status": "running",
-    "sandbox.expected": "present",
-    "sandbox.agent": "openclaw",
-    "inference.expected": "available",
-    "inference.provider": "nvidia",
-    "inference.route": "inference-local",
-    "inference.mode": "gateway-routed",
-    "credentials.expected": "present",
-    "credentials.storage": "gateway-managed",
-  };
-}
-
-describe("expected state validator", () => {
-  it("should_validate_matching_state", () => {
-    const report = validateExpectedState({
-      stateId: "cloud-openclaw-ready",
-      state: cloudOpenclawReady(),
-      probes: passingProbes(),
-      suites: [],
-    });
-    expect(report.ok).toBe(true);
-    expect(report.checks.every((c) => c.ok)).toBe(true);
-  });
-
-  it("should_fail_when_gateway_expected_but_unhealthy", () => {
-    const probes = passingProbes();
-    probes["gateway.health"] = "unhealthy";
-    const report = validateExpectedState({
-      stateId: "cloud-openclaw-ready",
-      state: cloudOpenclawReady(),
-      probes,
-      suites: [],
-    });
-    expect(report.ok).toBe(false);
-    const failing = report.checks.find((c) => c.key === "gateway.health");
-    expect(failing?.ok).toBe(false);
-    expect(failing?.expected).toBe("healthy");
-    expect(failing?.actual).toBe("unhealthy");
-  });
-
-  it("should_fail_when_sandbox_expected_but_absent", () => {
-    const probes = passingProbes();
-    probes["sandbox.status"] = "absent";
-    probes["sandbox.expected"] = "absent";
-    const report = validateExpectedState({
-      stateId: "cloud-openclaw-ready",
-      state: cloudOpenclawReady(),
-      probes,
-      suites: [],
-    });
-    expect(report.ok).toBe(false);
-    expect(report.checks.some((c) => c.key === "sandbox.status" && !c.ok)).toBe(true);
-  });
-
-  it("should_fail_when_suite_requires_state_unmet_at_runtime", () => {
-    // Expected state claims inference.expected=available, but the probe
-    // reports unavailable; the smoke suite happens to pass but an inference
-    // suite's requires_state should trigger a runtime failure before
-    // execution.
-    const state = cloudOpenclawReady();
-    const probes = passingProbes();
-    probes["inference.expected"] = "unavailable";
-    const inferenceSuite: ResolvedSuite = {
-      id: "inference",
-      requires_state: { "inference.expected": "available" },
-      steps: [{ id: "models-health", script: "suites/inference/cloud/00-models-health.sh" }],
-    };
-    const report = validateExpectedState({
-      stateId: "cloud-openclaw-ready",
-      state,
-      probes,
-      suites: [inferenceSuite],
-    });
-    expect(report.ok).toBe(false);
-    const msg = report.checks
-      .filter((c) => !c.ok)
-      .map((c) => `${c.key}=${c.actual ?? "<missing>"} (wanted ${c.expected})`)
-      .join("; ");
-    expect(msg).toMatch(/inference\.expected/);
-    expect(msg).toMatch(/available/);
-    expect(msg).toMatch(/unavailable/);
-    // Should also reference the suite that made the requirement.
-    expect(report.checks.some((c) => c.suite === "inference" && !c.ok)).toBe(true);
-  });
-});
-
-describe("runner_should_not_run_suites_when_expected_state_fails", () => {
-  it("runs expected-state validation and skips suites on failure", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-es-"));
-    try {
-      const trace = path.join(tmp, "trace.log");
-      // Simulate gateway-unhealthy probe by setting an override env var.
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--dry-run"],
-        {
-          env: {
-            ...process.env,
-            E2E_CONTEXT_DIR: tmp,
-            E2E_TRACE_FILE: trace,
-            // validator reads these overrides in dry-run mode to fake probes
-            E2E_PROBE_OVERRIDE_GATEWAY_HEALTH: "unhealthy",
-            E2E_VALIDATE_EXPECTED_STATE: "1",
-          },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      // Dry-run execution should now fail because the expected state
-      // validation runs and sees gateway.health=unhealthy.
-      expect(r.status).not.toBe(0);
-      // Validator must run (its report file should exist) but suites must not.
-      const reportPath = path.join(tmp, "expected-state-report.json");
-      expect(fs.existsSync(reportPath), `missing ${reportPath}`).toBe(true);
-      const report = JSON.parse(fs.readFileSync(reportPath, "utf8"));
-      expect(report.ok).toBe(false);
-      expect(report.checks.some((c: { key: string; ok: boolean }) => c.key === "gateway.health" && !c.ok)).toBe(true);
-      // And the run's failure output should reference expected-state, not suites.
-      expect(`${r.stdout}${r.stderr}`).toMatch(/expected.state/i);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-});
-
-// ─────────────────────────────────────────────────────────────────────────────
-// Phase 1.F — --validate-only flag on run-scenario.sh
-// ─────────────────────────────────────────────────────────────────────────────
-
-describe("run-scenario --validate-only flag", () => {
-  it("runs only validator and emits probe results json on stdout without running install/onboard/suites", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-validate-only-"));
-    try {
-      const trace = path.join(tmp, "trace.log");
-      // Pre-populate a context.env: --validate-only assumes setup has already run.
-      fs.writeFileSync(
-        path.join(tmp, "context.env"),
-        "E2E_SCENARIO=ubuntu-repo-cloud-openclaw\n",
-      );
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--validate-only"],
-        {
-          env: {
-            ...process.env,
-            E2E_CONTEXT_DIR: tmp,
-            E2E_TRACE_FILE: trace,
-            // Supply probe overrides for every key the expected state needs.
-            E2E_PROBE_OVERRIDE_CLI_INSTALLED: "true",
-            E2E_PROBE_OVERRIDE_GATEWAY_EXPECTED: "present",
-            E2E_PROBE_OVERRIDE_GATEWAY_HEALTH: "healthy",
-            E2E_PROBE_OVERRIDE_SANDBOX_EXPECTED: "present",
-            E2E_PROBE_OVERRIDE_SANDBOX_STATUS: "running",
-            E2E_PROBE_OVERRIDE_SANDBOX_AGENT: "openclaw",
-            E2E_PROBE_OVERRIDE_INFERENCE_EXPECTED: "available",
-            E2E_PROBE_OVERRIDE_INFERENCE_PROVIDER: "nvidia",
-            E2E_PROBE_OVERRIDE_INFERENCE_ROUTE: "inference-local",
-            E2E_PROBE_OVERRIDE_INFERENCE_MODE: "gateway-routed",
-            E2E_PROBE_OVERRIDE_CREDENTIALS_EXPECTED: "present",
-            E2E_PROBE_OVERRIDE_CREDENTIALS_STORAGE: "gateway-managed",
-            E2E_PROBE_OVERRIDE_SECURITY_SHIELDS: "supported",
-            // `security.policy_engine` has an embedded underscore, which the
-            // E2E_PROBE_OVERRIDE_* convention cannot express. Use the
-            // JSON escape hatch for this one.
-            E2E_PROBE_OVERRIDES_JSON: JSON.stringify({ "security.policy_engine": "supported" }),
-          },
-          encoding: "utf8",
-          timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      // Must NOT have traced install or onboard.
-      const contents = fs.existsSync(trace) ? fs.readFileSync(trace, "utf8") : "";
-      expect(contents).not.toMatch(/install:/);
-      expect(contents).not.toMatch(/onboard:/);
-      // Must have emitted an expected-state-report.json (probe results).
-      const reportPath = path.join(tmp, "expected-state-report.json");
-      expect(fs.existsSync(reportPath), `missing ${reportPath}`).toBe(true);
-      const report = JSON.parse(fs.readFileSync(reportPath, "utf8"));
-      expect(report.ok).toBe(true);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("is_mutually_exclusive_with_plan_only", () => {
-    const r = spawnSync(
-      "bash",
-      [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--validate-only", "--plan-only"],
-      { encoding: "utf8", timeout: 15_000, cwd: REPO_ROOT },
-    );
-    expect(r.status).not.toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/mutually.exclusive|cannot.*both|--plan-only.*--validate-only|--validate-only.*--plan-only/i);
-  });
-});
diff --git a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
index 1a5c1a8403..27a3cc0662 100644
--- a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
@@ -15,7 +15,6 @@ const ASSERT = path.join(VALIDATION_SUITES, "assert");
 const REBUILD_UPGRADE_LIB = path.join(VALIDATION_SUITES, "lib/rebuild_upgrade.sh");
 const FIXTURES = path.join(REPO_ROOT, "test/e2e-scenario/nemoclaw_scenarios/fixtures");
 const INSTALL_DIR = path.join(REPO_ROOT, "test/e2e-scenario/nemoclaw_scenarios/install");
-const RUN_SCENARIO = path.join(REPO_ROOT, "test/e2e-scenario/runtime/run-scenario.sh");
 
 function runBash(script: string, env: Record<string, string> = {}): SpawnSyncReturns<string> {
   return spawnSync("bash", ["-c", script], {
@@ -61,51 +60,6 @@ describe("E2E shell helpers", () => {
     }
   });
 
-  it("test_should_emit_plan_only_checks_without_live_infrastructure", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-inf-plan-"));
-    try {
-      const r = runBash(
-        `
-        set -euo pipefail
-        . "${RUNTIME_LIB}/context.sh"
-        . "${VALIDATION_SUITES}/lib/inference_routing.sh"
-        e2e_context_init
-        e2e_context_set E2E_SANDBOX_NAME sandbox-1
-        e2e_inference_routing_assert_chat_completion "post-onboard.inference-routing.inference-local-chat-completion"
-      `,
-        { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      expect(r.stdout).toContain("post-onboard.inference-routing.inference-local-chat-completion");
-      expect(r.stdout).toMatch(/dry-run|plan/i);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("test_should_not_print_secret_values_in_helper_output", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-inf-secret-"));
-    try {
-      const r = runBash(
-        `
-        set -euo pipefail
-        . "${RUNTIME_LIB}/context.sh"
-        . "${VALIDATION_SUITES}/lib/inference_routing.sh"
-        e2e_context_init
-        e2e_context_set E2E_SANDBOX_NAME sandbox-1
-        e2e_context_set E2E_PROVIDER_API_KEY super-secret-test-token
-        e2e_inference_routing_assert_auth_proxy "post-onboard.ollama-auth-proxy.authenticated-request-accepted" "valid"
-      `,
-        { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      expect(r.stdout + r.stderr).not.toContain("super-secret-test-token");
-      expect(r.stdout + r.stderr).toMatch(/REDACTED|dry-run|plan/i);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
   it("security_policy_credentials_helper_should_load_with_context_library", () => {
     const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "spc-context-"));
     try {
@@ -117,7 +71,7 @@ describe("E2E shell helpers", () => {
         spc_require_context E2E_SCENARIO E2E_PROVIDER
         echo "provider=$(spc_context_get E2E_PROVIDER)"
         `,
-        { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" },
+        { E2E_CONTEXT_DIR: tmp },
       );
       expect(r.status, r.stderr).toBe(0);
       expect(r.stdout).toContain("provider=nvidia");
@@ -136,7 +90,7 @@ describe("E2E shell helpers", () => {
         . "${VALIDATION_SUITES}/lib/security_policy_credentials.sh"
         spc_require_context E2E_PROVIDER
         `,
-        { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" },
+        { E2E_CONTEXT_DIR: tmp },
       );
       expect(r.status).not.toBe(0);
       expect(r.stderr).toContain("E2E_PROVIDER");
@@ -474,38 +428,6 @@ exit 0
     }
   });
 
-  it("scenario_dry_run_should_trace_helper_sequence_in_order", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-trace-"));
-    try {
-      const trace = path.join(tmp, "trace.log");
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--dry-run"],
-        {
-          env: {
-            ...process.env,
-            E2E_CONTEXT_DIR: tmp,
-            E2E_TRACE_FILE: trace,
-          },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      expect(fs.existsSync(trace), "trace log missing").toBe(true);
-      const contents = fs.readFileSync(trace, "utf8");
-      const order = ["env:noninteractive", "install:", "onboard:", "gateway:check", "sandbox:check"];
-      let pos = 0;
-      for (const marker of order) {
-        const idx = contents.indexOf(marker, pos);
-        expect(idx, `trace missing marker in order: ${marker}\nfull:\n${contents}`).toBeGreaterThanOrEqual(0);
-        pos = idx + marker.length;
-      }
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
 });
 
 // ─────────────────────────────────────────────────────────────────────────────
@@ -683,21 +605,6 @@ exec "$@"
     }
   });
 
-  it("sandbox_exec_should_dry_run_short_circuit_when_e2e_dry_run_set", () => {
-    // Use a PATH that has bash itself but no nemoclaw — dry-run must
-    // short-circuit before the CLI lookup.
-    const r = runBash(
-      `
-        set -euo pipefail
-        . "${VALIDATION_SUITES}/sandbox-exec.sh"
-        e2e_sandbox_exec sb1 -- rm -rf /
-      `,
-      { E2E_DRY_RUN: "1", PATH: "/usr/bin:/bin" },
-    );
-    expect(r.status, r.stderr).toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/dry[- ]run/i);
-  });
-
   it("sandbox_exec_stdin_should_quote_args_safely_when_piped", () => {
     // Verify that $TOKEN is NOT expanded on the host side before being
     // delivered to the sandbox. We stub openshell to echo back stdin.
@@ -968,53 +875,6 @@ describe("Issue #3810 messaging provider helper library", () => {
   });
 });
 
-// ─────────────────────────────────────────────────────────────────────────────
-// Phase 1.E — Install-method dispatcher splits
-// ─────────────────────────────────────────────────────────────────────────────
-
-describe("Phase 1.E install dispatcher splits", () => {
-  function dispatchDryRun(profile: string): SpawnSyncReturns<string> {
-    return runBash(
-      `
-        set -euo pipefail
-        . "${INSTALL_DIR}/dispatch.sh"
-        e2e_install "${profile}"
-      `,
-      { E2E_DRY_RUN: "1" },
-    );
-  }
-
-  it("install_should_dispatch_to_install_repo_helper_for_repo_current_profile", () => {
-    const r = dispatchDryRun("repo-current");
-    expect(r.status, r.stderr).toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/install-repo/);
-    expect(r.stdout + r.stderr).not.toMatch(/install-curl|install-ollama|install-launchable/);
-  });
-
-  it("install_should_dispatch_to_install_curl_helper_for_public_installer_profile", () => {
-    const r = dispatchDryRun("public-installer");
-    expect(r.status, r.stderr).toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/install-curl/);
-    expect(r.stdout + r.stderr).not.toMatch(/install-repo|install-ollama|install-launchable/);
-  });
-
-  it("install_should_dispatch_to_install_ollama_helper_for_ollama_profile", () => {
-    const r = dispatchDryRun("ollama");
-    expect(r.status, r.stderr).toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/install-ollama/);
-    expect(r.stdout + r.stderr).not.toMatch(/install-repo|install-curl|install-launchable/);
-  });
-
-  it("install_should_dispatch_to_install_launchable_helper_for_launchable_profile", () => {
-    const r = dispatchDryRun("launchable");
-    expect(r.status, r.stderr).toBe(0);
-    expect(r.stdout + r.stderr).toMatch(/install-launchable/);
-    expect(r.stdout + r.stderr).not.toMatch(/install-repo|install-curl|install-ollama/);
-  });
-});
-
-
-
 describe("baseline onboarding validation helper", () => {
   it("baseline_helper_should_source_under_strict_shell_options", () => {
     const r = runBash(`set -euo pipefail; source "${VALIDATION_SUITES}/lib/baseline_onboarding.sh"`);
@@ -1080,7 +940,7 @@ describe("sandbox lifecycle validation helper", () => {
     try {
       const bin = path.join(tmp, "bin"); fs.mkdirSync(bin);
       fs.writeFileSync(path.join(bin, "timeout"), "#!/usr/bin/env bash\necho timed out >&2\nexit 124\n", { mode: 0o755 });
-      const r = runBash(`set -e; unset E2E_DRY_RUN; . "${VALIDATION_SUITES}/lib/sandbox_lifecycle.sh"; sandbox_lifecycle_run_with_timeout 1 bash -c 'sleep 5'`, { PATH: `${bin}:${process.env.PATH}` });
+      const r = runBash(`set -e; . "${VALIDATION_SUITES}/lib/sandbox_lifecycle.sh"; sandbox_lifecycle_run_with_timeout 1 bash -c 'sleep 5'`, { PATH: `${bin}:${process.env.PATH}` });
       expect(r.status).toBe(124);
       expect(r.stderr).toMatch(/timed out/);
     } finally { fs.rmSync(tmp, { recursive: true, force: true }); }
diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 497dac3387..e6a899ae5a 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -3,6 +3,7 @@
 
 import { describe, expect, it } from "vitest";
 import fs from "node:fs";
+import os from "node:os";
 import path from "node:path";
 
 import { HostCliClient } from "../scenarios/clients/host-cli.ts";
@@ -11,11 +12,23 @@ import { PhaseOrchestrator } from "../scenarios/orchestrators/phase.ts";
 import { ScenarioRunner } from "../scenarios/orchestrators/runner.ts";
 import type { AssertionStep, PhaseName, PhaseResult, RunContext, RunPlanPhase } from "../scenarios/types.ts";
 
-function fakeCtx(): RunContext {
-  return { contextDir: fs.mkdtempSync(path.join(process.cwd(), ".tmp-e2e-phase-")), dryRun: true };
+const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
+
+function freshCtx(): RunContext {
+  return { contextDir: fs.mkdtempSync(path.join(os.tmpdir(), "e2e-phase-")) };
 }
 
-function fakeStep(id: string, phase: PhaseName, ref = "fake-pass"): AssertionStep {
+function shellStep(id: string, phase: PhaseName, ref: string, reliability?: AssertionStep["reliability"]): AssertionStep {
+  return {
+    id,
+    phase,
+    implementation: { kind: "shell", ref },
+    evidencePath: `.e2e/assertions/${id}.log`,
+    reliability,
+  };
+}
+
+function probeStep(id: string, phase: PhaseName, ref = "no-such-probe"): AssertionStep {
   return {
     id,
     phase,
@@ -24,17 +37,31 @@ function fakeStep(id: string, phase: PhaseName, ref = "fake-pass"): AssertionSte
   };
 }
 
-function fakePhase(step: AssertionStep): RunPlanPhase {
+function pendingStep(id: string, phase: PhaseName): AssertionStep {
   return {
-    name: step.phase,
+    id,
+    phase,
+    implementation: { kind: "pending", ref: "not-yet" },
+  };
+}
+
+function makePhase(steps: AssertionStep[]): RunPlanPhase {
+  return {
+    name: steps[0].phase,
     actions: [],
-    assertionGroups: [{ id: `group.${step.id}`, phase: step.phase, migrationStatus: "complete", steps: [step] }],
+    assertionGroups: [{ id: `group.${steps[0].id}`, phase: steps[0].phase, migrationStatus: "complete", steps }],
   };
 }
 
-describe("phase orchestrators", () => {
+function writeTempScript(dir: string, name: string, body: string): string {
+  const p = path.join(dir, name);
+  fs.writeFileSync(p, `#!/usr/bin/env bash\nset -euo pipefail\n${body}\n`, { mode: 0o755 });
+  return p;
+}
+
+describe("phase orchestrators - top-level delegation", () => {
   it("test_should_execute_phase_assertions_from_phase_orchestrators_not_top_level_runner", async () => {
-    const ctx = fakeCtx();
+    const ctx = freshCtx();
     try {
       const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
       const calls: string[] = [];
@@ -58,58 +85,145 @@ describe("phase orchestrators", () => {
       fs.rmSync(ctx.contextDir, { recursive: true, force: true });
     }
   });
+});
 
-  it("test_should_record_step_status_attempts_duration_classifier_and_evidence", async () => {
-    const ctx = fakeCtx();
+describe("phase orchestrators - real shell execution", () => {
+  it("shell_step_passes_when_script_exits_zero", async () => {
+    const ctx = freshCtx();
     try {
-      const step = fakeStep("runtime.retry-pass", "runtime", "fake-retry-once-pass");
-      step.reliability = { retry: { attempts: 2, on: ["gateway-transient"] } };
+      const script = writeTempScript(ctx.contextDir, "ok.sh", "echo hello-from-real-shell");
+      const ref = path.relative(REPO_ROOT, script);
+      const step = shellStep("runtime.real-pass", "runtime", ref);
       const orchestrator = new PhaseOrchestrator("runtime");
 
-      const result = await orchestrator.run(ctx, fakePhase(step));
+      const result = await orchestrator.run(ctx, makePhase([step]));
 
       expect(result.status).toBe("passed");
       expect(result.assertions[0]).toEqual(
-        expect.objectContaining({
-          id: "runtime.retry-pass",
-          status: "passed",
-          attempts: 2,
-          classifier: "gateway-transient",
-          evidence: ".e2e/assertions/runtime.retry-pass.json",
-        }),
+        expect.objectContaining({ id: "runtime.real-pass", status: "passed", attempts: 1 }),
       );
-      expect(result.assertions[0].durationMs).toBeGreaterThanOrEqual(0);
+      const log = fs.readFileSync(result.assertions[0].evidence!, "utf8");
+      expect(log).toContain("hello-from-real-shell");
     } finally {
       fs.rmSync(ctx.contextDir, { recursive: true, force: true });
     }
   });
 
-  it("test_should_enforce_timeout_and_retry_policy_in_orchestrator", async () => {
-    const ctx = fakeCtx();
+  it("shell_step_fails_when_script_exits_nonzero_and_records_stderr_tail", async () => {
+    const ctx = freshCtx();
     try {
-      const step = fakeStep("runtime.retry-fail", "runtime", "fake-always-transient");
-      step.reliability = { timeoutSeconds: 1, retry: { attempts: 2, on: ["provider-transient"] } };
+      const script = writeTempScript(ctx.contextDir, "fail.sh", 'echo "boom: real failure" >&2; exit 7');
+      const ref = path.relative(REPO_ROOT, script);
+      const step = shellStep("runtime.real-fail", "runtime", ref);
       const orchestrator = new PhaseOrchestrator("runtime");
 
-      const result = await orchestrator.run(ctx, fakePhase(step));
+      const result = await orchestrator.run(ctx, makePhase([step]));
 
       expect(result.status).toBe("failed");
-      expect(result.assertions[0]).toEqual(
-        expect.objectContaining({
-          id: "runtime.retry-fail",
-          status: "failed",
-          attempts: 2,
-          classifier: "provider-transient",
-        }),
+      expect(result.assertions[0].status).toBe("failed");
+      expect(result.assertions[0].message).toMatch(/exit 7/);
+      expect(result.assertions[0].message).toMatch(/boom: real failure/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("shell_step_times_out_via_orchestrator_policy_not_script", async () => {
+    const ctx = freshCtx();
+    try {
+      const script = writeTempScript(ctx.contextDir, "slow.sh", "sleep 30");
+      const ref = path.relative(REPO_ROOT, script);
+      const step = shellStep("runtime.real-timeout", "runtime", ref, { timeoutSeconds: 1 });
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const started = Date.now();
+      const result = await orchestrator.run(ctx, makePhase([step]));
+      const elapsed = Date.now() - started;
+
+      expect(result.status).toBe("failed");
+      expect(result.assertions[0].message).toMatch(/exceeded 1s/);
+      expect(elapsed).toBeLessThan(15_000);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  }, 20_000);
+
+  it("shell_step_retries_on_classified_transient_then_passes", async () => {
+    const ctx = freshCtx();
+    try {
+      const counterFile = path.join(ctx.contextDir, "counter");
+      fs.writeFileSync(counterFile, "0");
+      const script = writeTempScript(
+        ctx.contextDir,
+        "gateway-flaky.sh",
+        `n=$(cat "${counterFile}"); n=$((n+1)); echo "$n" > "${counterFile}"; if [ "$n" -lt 2 ]; then echo "gateway-transient: try again" >&2; exit 1; fi; echo ok`,
       );
+      const ref = path.relative(REPO_ROOT, script);
+      const step = shellStep("runtime.gateway-retry", "runtime", ref, {
+        retry: { attempts: 2, on: ["gateway-transient"] },
+      });
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.status).toBe("passed");
+      expect(result.assertions[0].attempts).toBe(2);
+      expect(result.assertions[0].classifier).toBe("gateway-transient");
     } finally {
       fs.rmSync(ctx.contextDir, { recursive: true, force: true });
     }
   });
 
+  it("shell_step_fails_with_clear_message_when_script_missing", async () => {
+    const ctx = freshCtx();
+    try {
+      const step = shellStep("runtime.missing", "runtime", "test/e2e-scenario/does-not-exist.sh");
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.status).toBe("failed");
+      expect(result.assertions[0].message).toMatch(/script not found/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("probe_step_without_registered_probe_skips_visibly_never_passes_falsely", async () => {
+    const ctx = freshCtx();
+    try {
+      const step = probeStep("runtime.probe-pending", "runtime");
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.assertions[0].status).toBe("skipped");
+      expect(result.assertions[0].message).toMatch(/probe not registered/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("pending_step_skips_visibly_with_pending_marker", async () => {
+    const ctx = freshCtx();
+    try {
+      const step = pendingStep("runtime.pending", "runtime");
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.assertions[0].status).toBe("skipped");
+      expect(result.assertions[0].message).toMatch(/^pending:/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("clients are pass/fail/policy free", () => {
   it("test_should_keep_clients_free_of_pass_fail_and_retry_semantics", () => {
     const source = fs.readFileSync(
-      path.join(process.cwd(), "test/e2e-scenario/scenarios/clients/host-cli.ts"),
+      path.join(REPO_ROOT, "test/e2e-scenario/scenarios/clients/host-cli.ts"),
       "utf8",
     );
     const observation = new HostCliClient().observeVersion();
diff --git a/test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts b/test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
index 8c2e70caae..2d3c42fba0 100644
--- a/test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
@@ -2,17 +2,15 @@
 // SPDX-License-Identifier: Apache-2.0
 
 /**
- * Phase 9: Migrate Additional Scenario Families.
- * Verifies metadata for new scenarios (macOS, WSL, GPU local Ollama, Brev
- * launchable, Ubuntu cloud Hermes, and the no-docker negative preflight)
- * plus the deferred schema concepts (scenario-level overrides, negative
- * expected state).
+ * Phase 9: Additional Scenario Families - resolver-level metadata only.
+ *
+ * Plan-printout tests that exercised the deprecated bash entrypoint
+ * (run-scenario.sh --plan-only) were deleted alongside the bash runner.
+ * The TS runner is exercised by e2e-plan-compiler / e2e-scenario-registry
+ * / e2e-phase-orchestrators tests instead.
  */
 
 import { describe, it, expect } from "vitest";
-import { spawnSync } from "node:child_process";
-import fs from "node:fs";
-import os from "node:os";
 import path from "node:path";
 
 import { loadMetadataFromDir } from "../runtime/resolver/load.ts";
@@ -20,27 +18,6 @@ import { resolveScenario } from "../runtime/resolver/plan.ts";
 
 const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
 const E2E_DIR = path.join(REPO_ROOT, "test/e2e-scenario");
-const RUN_SCENARIO = path.join(E2E_DIR, "runtime", "run-scenario.sh");
-
-function planOnly(scenarioId: string): { stdout: string; stderr: string; status: number | null; plan: Record<string, unknown> } {
-  const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-p9-"));
-  try {
-    const r = spawnSync("bash", [RUN_SCENARIO, scenarioId, "--plan-only"], {
-      env: { ...process.env, E2E_CONTEXT_DIR: tmp },
-      encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-      cwd: REPO_ROOT,
-    });
-    let plan = {};
-    const pj = path.join(tmp, "plan.json");
-    if (fs.existsSync(pj)) {
-      plan = JSON.parse(fs.readFileSync(pj, "utf8"));
-    }
-    return { stdout: r.stdout, stderr: r.stderr, status: r.status, plan };
-  } finally {
-    fs.rmSync(tmp, { recursive: true, force: true });
-  }
-}
 
 describe("Issue 3812: inference/provider suite families", () => {
   it("test_should_route_inference_suite_families_to_domain_specific_steps", () => {
@@ -74,37 +51,6 @@ describe("Phase 9: additional scenario families - metadata", () => {
   });
 });
 
-describe("Phase 9: macOS / WSL plan-only", () => {
-  it("macos scenario plan identifies macOS platform", () => {
-    const { status, plan } = planOnly("macos-repo-cloud-openclaw");
-    expect(status).toBe(0);
-    const dims = (plan as { dimensions: { platform: { profile: { os?: string } } } }).dimensions;
-    expect(dims.platform.profile.os).toBe("macos");
-  });
-
-  it("wsl scenario plan identifies WSL platform", () => {
-    const { status, plan } = planOnly("wsl-repo-cloud-openclaw");
-    expect(status).toBe(0);
-    const dims = (plan as { dimensions: { platform: { profile: { os?: string } } } }).dimensions;
-    expect(dims.platform.profile.os).toBe("wsl");
-  });
-});
-
-describe("Phase 9: GPU local Ollama plan-only", () => {
-  it("runtime indicates GPU/CDI and provider is ollama", () => {
-    const { status, plan } = planOnly("gpu-repo-local-ollama-openclaw");
-    expect(status).toBe(0);
-    const dims = (plan as {
-      dimensions: {
-        runtime: { profile: { gpu_runtime?: string } };
-        onboarding: { profile: { provider?: string } };
-      };
-    }).dimensions;
-    expect(dims.runtime.profile.gpu_runtime).toBe("cdi");
-    expect(dims.onboarding.profile.provider).toBe("ollama");
-  });
-});
-
 describe("Phase 9: Brev launchable scenario (overrides schema)", () => {
   it("should_support_scenario_overrides_on_brev_launchable", () => {
     const meta = loadMetadataFromDir(E2E_DIR);
@@ -116,21 +62,6 @@ describe("Phase 9: Brev launchable scenario (overrides schema)", () => {
     expect(overrides?.onboarding?.gateway?.bind_address).toBeTypeOf("string");
     expect(overrides?.onboarding?.gateway?.bind_address?.length).toBeGreaterThan(0);
   });
-
-  it("plan shows remote target, launchable install, and gateway bind override", () => {
-    const { status, stdout, plan } = planOnly("brev-launchable-cloud-openclaw");
-    expect(status).toBe(0);
-    const dims = (plan as {
-      dimensions: {
-        platform: { profile: { execution_target?: string } };
-        install: { id: string };
-      };
-    }).dimensions;
-    expect(dims.platform.profile.execution_target).toBe("remote");
-    expect(dims.install.id).toBe("launchable");
-    expect(stdout).toMatch(/Overrides:/);
-    expect(stdout).toMatch(/bind_address/);
-  });
 });
 
 describe("Phase 9: negative preflight", () => {
@@ -148,27 +79,4 @@ describe("Phase 9: negative preflight", () => {
     expect(es?.sandbox?.expected).toBe("absent");
     expect(es?.failure?.expected).toBe(true);
   });
-
-  it("negative scenario plan identifies docker missing and negative state", () => {
-    const { status, plan } = planOnly("ubuntu-no-docker-preflight-negative");
-    expect(status).toBe(0);
-    const p = plan as {
-      dimensions: { runtime: { profile: { container_daemon?: string } } };
-      expected_state: { id: string };
-      expected_failure?: {
-        phase?: string;
-        error_class?: string;
-        message_pattern?: string;
-        forbidden_side_effects?: string[];
-      };
-    };
-    expect(p.dimensions.runtime.profile.container_daemon).toBe("missing");
-    expect(p.expected_state.id).toBe("preflight-failure-no-sandbox");
-    expect(p.expected_failure?.phase).toBe("preflight");
-    expect(p.expected_failure?.error_class).toBe("docker-missing");
-    expect(p.expected_failure?.message_pattern).toBeTypeOf("string");
-    expect(p.expected_failure?.forbidden_side_effects).toEqual(
-      expect.arrayContaining(["sandbox-created", "gateway-started", "credentials-written"]),
-    );
-  });
 });
diff --git a/test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts b/test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
deleted file mode 100644
index 0307ca9103..0000000000
--- a/test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
+++ /dev/null
@@ -1,102 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-/**
- * Phase 6: Migrate First Scenario - ubuntu-repo-cloud-openclaw.
- * Verifies resolver output, plan printout, and dry-run phase ordering.
- */
-
-import { describe, it, expect } from "vitest";
-import { spawnSync } from "node:child_process";
-import fs from "node:fs";
-import os from "node:os";
-import path from "node:path";
-
-import { loadMetadataFromDir } from "../runtime/resolver/load.ts";
-import { resolveScenario } from "../runtime/resolver/plan.ts";
-
-const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
-const E2E_DIR = path.join(REPO_ROOT, "test/e2e-scenario");
-const RUN_SCENARIO = path.join(E2E_DIR, "runtime", "run-scenario.sh");
-
-describe("Phase 6: ubuntu-repo-cloud-openclaw migration", () => {
-  it("ubuntu_repo_cloud_openclaw_should_resolve_to_cloud_openclaw_ready", () => {
-    const meta = loadMetadataFromDir(E2E_DIR);
-    const plan = resolveScenario("ubuntu-repo-cloud-openclaw", meta);
-    expect(plan.expected_state.id).toBe("cloud-openclaw-ready");
-    const suiteIds = plan.suites.map((s) => s.id);
-    expect(suiteIds).toContain("smoke");
-    expect(suiteIds).toContain("inference");
-  });
-
-  it("ubuntu_repo_cloud_openclaw_plan_should_include_setup_install_onboard", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-first-"));
-    try {
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--plan-only"],
-        { env: { ...process.env, E2E_CONTEXT_DIR: tmp }, encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000), cwd: REPO_ROOT },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      expect(r.stdout).toMatch(/install=repo-current/);
-      expect(r.stdout).toMatch(/runtime=docker-running/);
-      expect(r.stdout).toMatch(/onboarding=cloud-openclaw/);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("ubuntu_repo_cloud_openclaw_dry_run_should_execute_phases_in_order", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-first-"));
-    try {
-      const trace = path.join(tmp, "trace.log");
-      const r = spawnSync(
-        "bash",
-        [RUN_SCENARIO, "ubuntu-repo-cloud-openclaw", "--dry-run"],
-        {
-          env: { ...process.env, E2E_CONTEXT_DIR: tmp, E2E_TRACE_FILE: trace },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(r.status, r.stderr).toBe(0);
-      expect(fs.existsSync(trace)).toBe(true);
-      const contents = fs.readFileSync(trace, "utf8");
-      const order = [
-        "env:noninteractive",
-        "install:repo-current",
-        "onboard:cloud-openclaw",
-        "gateway:check",
-        "sandbox:check",
-      ];
-      let pos = 0;
-      for (const marker of order) {
-        const idx = contents.indexOf(marker, pos);
-        expect(idx, `missing marker ${marker}. trace:\n${contents}`).toBeGreaterThanOrEqual(0);
-        pos = idx + marker.length;
-      }
-      // The run should also seed the context and produce plan.json.
-      expect(fs.existsSync(path.join(tmp, "context.env"))).toBe(true);
-      expect(fs.existsSync(path.join(tmp, "plan.json"))).toBe(true);
-      // After dry-run, suite runner should be able to execute the full
-      // suite sequence against the emitted context.
-      const suites = spawnSync(
-        "bash",
-        [path.join(E2E_DIR, "runtime", "run-suites.sh"), "smoke", "inference"],
-        {
-          env: { ...process.env, E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(suites.status, `suite stderr:${suites.stderr}\nstdout:${suites.stdout}`).toBe(0);
-      expect(suites.stdout).toMatch(/PASS smoke\/cli-available/);
-      expect(suites.stdout).toMatch(/PASS inference\/models-health/);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-});
diff --git a/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts b/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
index dc4f105884..a6944c9f64 100644
--- a/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
@@ -199,62 +199,7 @@ suites:
   });
 });
 
-describe("run-scenario.sh --plan-only", () => {
-  it("run_scenario_plan_only_should_print_plan", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-plan-"));
-    try {
-      const result = spawnSync(
-        "bash",
-        [
-          path.join(E2E_DIR, "runtime", "run-scenario.sh"),
-          "ubuntu-repo-cloud-openclaw",
-          "--plan-only",
-        ],
-        {
-          env: { ...process.env, E2E_CONTEXT_DIR: tmp },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(result.status, result.stderr).toBe(0);
-      expect(result.stdout).toContain("ubuntu-repo-cloud-openclaw");
-      expect(result.stdout).toContain("cloud-openclaw-ready");
-      expect(result.stdout).toContain("smoke");
-      expect(result.stdout).toContain("inference");
-      const planJsonPath = path.join(tmp, "plan.json");
-      expect(fs.existsSync(planJsonPath)).toBe(true);
-      const doc = JSON.parse(fs.readFileSync(planJsonPath, "utf8"));
-      expect(doc.scenario_id).toBe("ubuntu-repo-cloud-openclaw");
-      expect(doc.expected_state.id).toBe("cloud-openclaw-ready");
-      expect(Array.isArray(doc.suites)).toBe(true);
-      expect(doc.suites.map((s: { id: string }) => s.id)).toContain("smoke");
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
+// run-scenario.sh-based plan-only tests removed: the bash runner no
+// longer compiles or prints plans. Equivalent coverage of the typed runner
+// lives in e2e-plan-compiler.test.ts and e2e-scenario-registry.test.ts.
 
-  it("run_scenario_plan_only_should_fail_for_unknown_scenario", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-plan-"));
-    try {
-      const result = spawnSync(
-        "bash",
-        [
-          path.join(E2E_DIR, "runtime", "run-scenario.sh"),
-          "does-not-exist",
-          "--plan-only",
-        ],
-        {
-          env: { ...process.env, E2E_CONTEXT_DIR: tmp },
-          encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-          cwd: REPO_ROOT,
-        },
-      );
-      expect(result.status).not.toBe(0);
-      expect(`${result.stderr}${result.stdout}`).toMatch(/does-not-exist/);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-});
diff --git a/test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts b/test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
deleted file mode 100644
index 5a917853f8..0000000000
--- a/test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
+++ /dev/null
@@ -1,249 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-import { describe, it, expect } from "vitest";
-import { spawnSync, type SpawnSyncReturns } from "node:child_process";
-import fs from "node:fs";
-import os from "node:os";
-import path from "node:path";
-const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
-const RUN_SUITES = path.join(REPO_ROOT, "test/e2e-scenario/runtime/run-suites.sh");
-
-function runSuites(args: string[], env: Record<string, string> = {}): SpawnSyncReturns<string> {
-  return spawnSync("bash", [RUN_SUITES, ...args], {
-    env: { ...process.env, ...env },
-    encoding: "utf8",
-    timeout: Number(process.env.E2E_SPAWN_TIMEOUT_MS ?? 60_000),
-    cwd: REPO_ROOT,
-  });
-}
-
-function seedContext(tmp: string, values: Record<string, string>): void {
-  fs.mkdirSync(tmp, { recursive: true });
-  const ctx = Object.entries(values)
-    .map(([k, v]) => `${k}=${v}`)
-    .join("\n");
-  fs.writeFileSync(path.join(tmp, "context.env"), `${ctx}\n`);
-}
-
-function fullContext(): Record<string, string> {
-  return {
-    E2E_SCENARIO: "ubuntu-repo-cloud-openclaw",
-    E2E_PLATFORM_OS: "ubuntu",
-    E2E_EXECUTION_TARGET: "local",
-    E2E_INSTALL_METHOD: "repo-checkout",
-    E2E_CONTAINER_ENGINE: "docker",
-    E2E_CONTAINER_DAEMON: "running",
-    E2E_ONBOARDING_PATH: "cloud",
-    E2E_AGENT: "openclaw",
-    E2E_PROVIDER: "nvidia",
-    E2E_SANDBOX_NAME: "e2e-ubuntu-repo-cloud-openclaw",
-    E2E_GATEWAY_URL: "http://127.0.0.1:18789",
-    E2E_INFERENCE_ROUTE: "inference-local",
-  };
-}
-
-describe("Issue #3810 messaging suite wiring", () => {
-  it("should_define_real_steps_for_messaging_provider_suites", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-messaging-suites-"));
-    try {
-      const baseContext = {
-        ...fullContext(),
-        E2E_PROVIDER: "telegram",
-        E2E_MESSAGING_PROVIDER: "telegram",
-        E2E_MESSAGING_BRIDGE_URL: "http://127.0.0.1:18789",
-        E2E_MESSAGING_CONFIG_CONTENT: "TELEGRAM_BOT_TOKEN=PLACEHOLDER",
-      };
-      seedContext(tmp, baseContext);
-      const telegram = runSuites(["messaging-telegram"], {
-        E2E_CONTEXT_DIR: tmp,
-        E2E_DRY_RUN: "1",
-      });
-      expect(telegram.status, `stderr:${telegram.stderr}\nstdout:${telegram.stdout}`).toBe(0);
-      seedContext(tmp, {
-        ...baseContext,
-        E2E_MESSAGING_PROVIDER: "discord",
-        E2E_MESSAGING_CONFIG_CONTENT: "DISCORD_BOT_TOKEN=PLACEHOLDER",
-      });
-      const discord = runSuites(["messaging-discord"], {
-        E2E_CONTEXT_DIR: tmp,
-        E2E_DRY_RUN: "1",
-      });
-      expect(discord.status, `stderr:${discord.stderr}\nstdout:${discord.stdout}`).toBe(0);
-      seedContext(tmp, {
-        ...baseContext,
-        E2E_MESSAGING_PROVIDER: "slack",
-        E2E_MESSAGING_CHANNEL: "bot",
-        E2E_MESSAGING_CONFIG_CONTENT: "SLACK_BOT_TOKEN=PLACEHOLDER",
-      });
-      const slack = runSuites(["messaging-slack"], {
-        E2E_CONTEXT_DIR: tmp,
-        E2E_DRY_RUN: "1",
-      });
-      expect(slack.status, `stderr:${slack.stderr}\nstdout:${slack.stdout}`).toBe(0);
-      const output = `${telegram.stdout}\n${discord.stdout}\n${slack.stdout}`;
-      for (const id of [
-        "messaging-provider-attached",
-        "messaging-placeholder-configured",
-        "messaging-no-secret-leak",
-        "messaging-bridge-reachable",
-        "telegram-injection-safety",
-        "discord-gateway-path",
-        "slack-provider-state",
-      ]) {
-        expect(output).toContain(id);
-      }
-      expect(output).not.toContain("cli-available");
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-});
-
-describe("run-suites.sh", () => {
-  it("security_credentials_suite_should_emit_stable_assertion_ids", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-security-credentials-"));
-    try {
-      seedContext(tmp, { ...fullContext(), E2E_CREDENTIALS_EXPECTED: "present" });
-      const r = runSuites(["security-credentials"], { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1", HOME: tmp });
-      expect(r.status, `stderr:${r.stderr}\nstdout:${r.stdout}`).toBe(0);
-      expect(r.stdout).toContain("post-onboard.credentials.gateway-list-redacts-values");
-      expect(r.stdout).toContain("post-onboard.credentials.no-plaintext-host-store");
-      expect(r.stdout).not.toMatch(/no-credentials-leaked|assert\//);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("run_suites_should_run_steps_in_declared_order", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      seedContext(tmp, fullContext());
-      const r = runSuites(["smoke"], {
-        E2E_CONTEXT_DIR: tmp,
-        E2E_DRY_RUN: "1",
-      });
-      expect(r.status, `stderr:${r.stderr}\nstdout:${r.stdout}`).toBe(0);
-      // Smoke order is: cli-available, gateway-health, sandbox-listed, sandbox-shell
-      const order = ["cli-available", "gateway-health", "sandbox-listed", "sandbox-shell"];
-      let pos = 0;
-      for (const marker of order) {
-        const idx = r.stdout.indexOf(marker, pos);
-        expect(idx, `missing marker ${marker} after ${pos} in:\n${r.stdout}`).toBeGreaterThanOrEqual(0);
-        pos = idx + marker.length;
-      }
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("run_suites_should_fail_on_unknown_suite", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      seedContext(tmp, fullContext());
-      const r = runSuites(["does-not-exist"], { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" });
-      expect(r.status).not.toBe(0);
-      expect(`${r.stdout}${r.stderr}`).toMatch(/does-not-exist/);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("run_suites_should_stop_on_first_failed_step", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      seedContext(tmp, fullContext());
-      // Use a fixture suites file with a failing middle step.
-      const fixtureSuites = path.join(tmp, "suites.yaml");
-      const fixtureDir = path.join(tmp, "suites", "fixture");
-      fs.mkdirSync(fixtureDir, { recursive: true });
-      fs.writeFileSync(path.join(fixtureDir, "00-a.sh"), "#!/usr/bin/env bash\necho A-RAN\nexit 0\n");
-      fs.writeFileSync(path.join(fixtureDir, "01-b.sh"), "#!/usr/bin/env bash\necho B-RAN\nexit 1\n");
-      fs.writeFileSync(path.join(fixtureDir, "02-c.sh"), "#!/usr/bin/env bash\necho C-RAN\nexit 0\n");
-      fs.chmodSync(path.join(fixtureDir, "00-a.sh"), 0o755);
-      fs.chmodSync(path.join(fixtureDir, "01-b.sh"), 0o755);
-      fs.chmodSync(path.join(fixtureDir, "02-c.sh"), 0o755);
-      fs.writeFileSync(
-        fixtureSuites,
-        `suites:
-  fixture:
-    steps:
-      - { id: a, script: suites/fixture/00-a.sh }
-      - { id: b, script: suites/fixture/01-b.sh }
-      - { id: c, script: suites/fixture/02-c.sh }
-`,
-      );
-      const r = runSuites(["fixture"], {
-        E2E_CONTEXT_DIR: tmp,
-        E2E_SUITES_FILE: fixtureSuites,
-        E2E_SUITES_DIR: tmp,
-      });
-      expect(r.status).not.toBe(0);
-      expect(r.stdout).toContain("A-RAN");
-      expect(r.stdout).toContain("B-RAN");
-      expect(r.stdout).not.toContain("C-RAN");
-      expect(`${r.stdout}${r.stderr}`).toMatch(/FAIL.*(fixture\/b|step=b)/i);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("smoke_suite_should_require_context", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      // No context.env written to tmp.
-      const r = runSuites(["smoke"], { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" });
-      expect(r.status).not.toBe(0);
-      expect(`${r.stderr}${r.stdout}`).toMatch(/context\.env|E2E_SCENARIO|missing/i);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("rebuild_and_upgrade_suites_should_emit_stable_assertion_ids_in_dry_run", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      seedContext(tmp, fullContext());
-      const r = runSuites(["rebuild", "upgrade"], { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" });
-      expect(r.status, `stderr:${r.stderr}\nstdout:${r.stdout}`).toBe(0);
-      for (const id of [
-        "suite.rebuild.workspace_state_preserved",
-        "suite.rebuild.agent_version_upgraded",
-        "suite.rebuild.inference_still_works",
-        "suite.rebuild.policy_presets_preserved",
-        "suite.rebuild.hermes_config_preserved",
-        "suite.upgrade.sandbox_registry_preserved",
-        "suite.upgrade.gateway_version_upgraded",
-        "suite.upgrade.survivor_agent_reachable",
-      ]) {
-        expect(r.stdout).toContain(id);
-      }
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-
-  it("smoke_and_inference_run_with_stub_context", () => {
-    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-suite-"));
-    try {
-      seedContext(tmp, fullContext());
-      const r = runSuites(["smoke", "inference"], { E2E_CONTEXT_DIR: tmp, E2E_DRY_RUN: "1" });
-      expect(r.status, `stderr:${r.stderr}\nstdout:${r.stdout}`).toBe(0);
-      for (const id of [
-        "cli-available",
-        "gateway-health",
-        "sandbox-listed",
-        "sandbox-shell",
-        "models-health",
-        "chat-completion",
-        "sandbox-inference-local",
-      ]) {
-        expect(r.stdout).toContain(id);
-      }
-      // Summary should call out PASS for each step.
-      expect(r.stdout).toMatch(/PASS/);
-    } finally {
-      fs.rmSync(tmp, { recursive: true, force: true });
-    }
-  });
-});
diff --git a/test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh b/test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
index 3d49c03116..d10fbd2c9d 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
@@ -12,8 +12,6 @@
 #   older_base_image_prepare <tag> [--registry ghcr.io/nvidia/nemoclaw]
 #     Writes a minimal Dockerfile to a temp location whose first line is
 #     `FROM <registry>:<tag>`, and prints the Dockerfile path on stdout.
-#     Honors E2E_DRY_RUN: skips the `docker pull` step (but still writes
-#     the Dockerfile, which is what callers inspect).
 #   older_base_image_cleanup <dockerfile-path>
 #     Removes the generated Dockerfile and (if present) its build context.
 
@@ -50,11 +48,9 @@ LABEL nemoclaw.e2e.fixture=older-base-image
 EOF
 
   e2e_env_trace "fixture:older-base-image" "${registry}:${tag}"
-  if ! e2e_env_is_dry_run; then
-    if command -v docker >/dev/null 2>&1; then
-      docker pull "${registry}:${tag}" >&2 \
-        || echo "older_base_image_prepare: docker pull failed (continuing; build may still succeed on cached layers)" >&2
-    fi
+  if command -v docker >/dev/null 2>&1; then
+    docker pull "${registry}:${tag}" >&2 \
+      || echo "older_base_image_prepare: docker pull failed (continuing; build may still succeed on cached layers)" >&2
   fi
   printf '%s\n' "${dockerfile}"
 }
diff --git a/test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh b/test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
index 7ea798cfdf..1a2ec2b0aa 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
@@ -4,7 +4,7 @@
 #
 # Install dispatcher. Routes by install-method / profile id to one of four
 # split helpers (repo-current.sh, public-curl.sh, ollama.sh,
-# launchable.sh). Honors E2E_DRY_RUN.
+# launchable.sh).
 #
 # Accepts both legacy install-method names (repo-checkout,
 # curl-install-script) and the new profile-centric names used by
diff --git a/test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh b/test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
index 5ec638e90a..09d8aa3bbb 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
@@ -18,11 +18,6 @@ _E2E_INST_LNCH_RUNTIME_LIB="$(cd "${_E2E_INST_LNCH_DIR}/../../runtime/lib" && pw
 
 e2e_install_launchable() {
   e2e_env_trace "install-launchable"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] install-launchable (skipped)"
-    return 0
-  fi
-
   # Match nightly launchable-smoke-e2e: exercise the launchable bootstrap
   # script on the current runner instead of assuming a pre-provisioned Brev VM.
   # The script has no Brev API dependency; it installs Docker/OpenShell/NemoClaw
diff --git a/test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh b/test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
index a9d5f81c14..449eae519a 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
@@ -17,10 +17,6 @@ _E2E_INST_OL_RUNTIME_LIB="$(cd "${_E2E_INST_OL_DIR}/../../runtime/lib" && pwd)"
 
 e2e_install_ollama() {
   e2e_env_trace "install-ollama"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] install-ollama (skipped)"
-    return 0
-  fi
   local ollama_url="${E2E_OLLAMA_INSTALL_URL:-https://ollama.ai/install.sh}"
   if ! command -v ollama >/dev/null 2>&1; then
     if ! curl -fsSL --retry 3 --retry-delay 2 "${ollama_url}" | bash; then
diff --git a/test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh b/test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
index 143d097f0d..6628e332a2 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
@@ -16,10 +16,6 @@ _E2E_INST_CURL_RUNTIME_LIB="$(cd "${_E2E_INST_CURL_DIR}/../../runtime/lib" && pw
 
 e2e_install_curl() {
   e2e_env_trace "install-curl"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] install-curl (skipped)"
-    return 0
-  fi
   local url="${E2E_INSTALLER_URL:-https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/scripts/install.sh}"
   local sha256="${E2E_INSTALLER_SHA256:-}"
   local tmp
diff --git a/test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh b/test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
index 8c985dc3f7..000431a4b8 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
@@ -5,7 +5,7 @@
 # Install from a checked-out repo (repo-current / repo-checkout profile).
 #
 # Split from the install dispatcher to keep scenario setup logic flat and to
-# make the per-profile code discoverable by grep. Honors E2E_DRY_RUN.
+# make the per-profile code discoverable by grep.
 
 _E2E_INST_REPO_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 _E2E_INST_REPO_RUNTIME_LIB="$(cd "${_E2E_INST_REPO_DIR}/../../runtime/lib" && pwd)"
@@ -16,10 +16,6 @@ _E2E_INST_REPO_RUNTIME_LIB="$(cd "${_E2E_INST_REPO_DIR}/../../runtime/lib" && pw
 
 e2e_install_repo() {
   e2e_env_trace "install-repo"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] install-repo (skipped)"
-    return 0
-  fi
   local repo_root
   repo_root="$(cd "${_E2E_INST_REPO_DIR}/../../../.." && pwd)"
   cd "${repo_root}" || return
diff --git a/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh b/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
index 2baf698986..951ab6613e 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
@@ -26,10 +26,6 @@ e2e_onboard() {
     return 2
   fi
   e2e_env_trace "onboard:${profile}"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] onboard profile=${profile} (skipped)"
-    return 0
-  fi
   case "${profile}" in
     cloud-openclaw)
       e2e_onboard_cloud_openclaw
diff --git a/test/e2e-scenario/runtime/lib/env.sh b/test/e2e-scenario/runtime/lib/env.sh
index ed33fb8a6a..9c33af97cc 100755
--- a/test/e2e-scenario/runtime/lib/env.sh
+++ b/test/e2e-scenario/runtime/lib/env.sh
@@ -40,8 +40,3 @@ e2e_env_trace() {
     printf '%s %s\n' "${event}" "$*" >>"${E2E_TRACE_FILE}"
   fi
 }
-
-# e2e_env_is_dry_run: true if E2E_DRY_RUN=1
-e2e_env_is_dry_run() {
-  [[ "${E2E_DRY_RUN:-0}" == "1" ]]
-}
diff --git a/test/e2e-scenario/runtime/run-scenario.sh b/test/e2e-scenario/runtime/run-scenario.sh
index 58042c8523..2477ce79ec 100755
--- a/test/e2e-scenario/runtime/run-scenario.sh
+++ b/test/e2e-scenario/runtime/run-scenario.sh
@@ -2,482 +2,24 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
-# E2E scenario runner entrypoint.
-#
-# Usage:
-#   bash test/e2e-scenario/runtime/run-scenario.sh <scenario-id> [--plan-only|--validate-only|--dry-run]
-#
-# Flags:
-#   --plan-only      Resolve metadata and print the plan only. Writes
-#                    ${E2E_CONTEXT_DIR:-.e2e}/plan.json for artifact upload.
-#   --validate-only  Run the expected-state validator against the current
-#                    context.env without running install/onboard/suites.
-#                    Emits probe results JSON to stdout and writes
-#                    ${E2E_CONTEXT_DIR}/expected-state-report.json. Used by
-#                    the parity-compare workflow to collect per-assertion
-#                    probe results. Mutually exclusive with --plan-only.
-#   --dry-run        (reserved) Run orchestration with real side effects
-#                    replaced by trace-logged stubs. Sets E2E_DRY_RUN=1 for
-#                    helpers. Full dry-run orchestration lands in later phases.
-#
-# Environment:
-#   E2E_CONTEXT_DIR  Override the scenario artifact directory
-#                    (default: <repo-root>/.e2e/).
+# DEPRECATED. The hybrid scenario architecture has a single supported runtime
+# entrypoint: test/e2e-scenario/scenarios/run.ts. This bash runner duplicated
+# install/onboard/gateway-check/suite-execution logic that now belongs in the
+# TS phase orchestrators (EnvironmentOrchestrator, OnboardingOrchestrator,
+# RuntimeOrchestrator) and shared clients (HostCliClient, GatewayClient,
+# SandboxClient). It now fails fast (exit 2) so the deprecation is loud, not silent.
 
 set -euo pipefail
 
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-E2E_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
-REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
-
-SCENARIO_ID=""
-PLAN_ONLY=0
-VALIDATE_ONLY=0
-DRY_RUN=0
-
-usage() {
-  cat >&2 <<'USAGE'
-Usage: bash test/e2e-scenario/runtime/run-scenario.sh <scenario-id> [--plan-only|--validate-only|--dry-run]
-USAGE
-}
-
-while [[ $# -gt 0 ]]; do
-  case "$1" in
-    --plan-only)
-      PLAN_ONLY=1
-      shift
-      ;;
-    --validate-only)
-      VALIDATE_ONLY=1
-      shift
-      ;;
-    --dry-run)
-      DRY_RUN=1
-      shift
-      ;;
-    -h | --help)
-      usage
-      exit 0
-      ;;
-    --*)
-      echo "run-scenario: unknown flag: $1" >&2
-      usage
-      exit 2
-      ;;
-    *)
-      if [[ -z "${SCENARIO_ID}" ]]; then
-        SCENARIO_ID="$1"
-      else
-        echo "run-scenario: unexpected positional argument: $1" >&2
-        usage
-        exit 2
-      fi
-      shift
-      ;;
-  esac
-done
-
-if [[ -z "${SCENARIO_ID}" ]]; then
-  echo "run-scenario: missing scenario id" >&2
-  usage
-  exit 2
-fi
-
-if [[ "${PLAN_ONLY}" -eq 1 && "${VALIDATE_ONLY}" -eq 1 ]]; then
-  echo "run-scenario: --plan-only and --validate-only are mutually exclusive" >&2
-  usage
-  exit 2
-fi
-
-export E2E_CONTEXT_DIR="${E2E_CONTEXT_DIR:-${REPO_ROOT}/.e2e}"
-mkdir -p "${E2E_CONTEXT_DIR}"
-
-if [[ "${DRY_RUN}" -eq 1 ]]; then
-  export E2E_DRY_RUN=1
-fi
-
-# Prefer the locally-installed tsx if present, otherwise fall back to npx.
-TSX_BIN="${REPO_ROOT}/node_modules/.bin/tsx"
-if [[ ! -x "${TSX_BIN}" ]]; then
-  TSX_BIN=""
-fi
-
-run_resolver() {
-  if [[ -n "${TSX_BIN}" ]]; then
-    "${TSX_BIN}" "${SCRIPT_DIR}/resolver/index.ts" "$@"
-    return
-  fi
-  # CodeRabbit review item #10: fail closed with a clear hint instead of
-  # silently pulling tsx from the network via `npx --yes`.
-  if ! (cd "${REPO_ROOT}" && npx --no-install tsx "${SCRIPT_DIR}/resolver/index.ts" "$@"); then
-    echo "run-scenario: tsx is required but not installed. Run 'npm ci' at the repo root and retry." >&2
-    return 1
-  fi
-}
-
-run_resolver plan "${SCENARIO_ID}" --context-dir "${E2E_CONTEXT_DIR}"
-
-if [[ "${PLAN_ONLY}" -eq 1 ]]; then
-  exit 0
-fi
-
-# --validate-only: assume setup has already completed. Skip install /
-# onboard / suite execution and dispatch the expected-state validator
-# using probes resolved from E2E_PROBE_OVERRIDE_* env vars. Emits the
-# probe results JSON report to stdout and writes it to
-# ${E2E_CONTEXT_DIR}/expected-state-report.json.
-if [[ "${VALIDATE_ONLY}" -eq 1 ]]; then
-  validate_args=("${SCENARIO_ID}" --context-dir "${E2E_CONTEXT_DIR}")
-  if ! run_resolver validate-state "${validate_args[@]}"; then
-    echo "run-scenario: --validate-only: expected-state validation failed" >&2
-    exit 3
-  fi
-  exit 0
-fi
-
-# Source the shared helper library so we can exercise the full
-# setup → install → onboard → gateway/sandbox check sequence. In dry-run
-# mode each helper short-circuits (and writes to E2E_TRACE_FILE if set).
-# shellcheck source=lib/env.sh
-. "${SCRIPT_DIR}/lib/env.sh"
-# shellcheck source=lib/context.sh
-. "${SCRIPT_DIR}/lib/context.sh"
-# shellcheck source=lib/negative.sh
-. "${SCRIPT_DIR}/lib/negative.sh"
-# shellcheck source=lib/port-holder.sh
-. "${SCRIPT_DIR}/lib/port-holder.sh"
-# shellcheck source=../nemoclaw_scenarios/install/dispatch.sh
-. "${E2E_ROOT}/nemoclaw_scenarios/install/dispatch.sh"
-# shellcheck source=../nemoclaw_scenarios/onboard/dispatch.sh
-. "${E2E_ROOT}/nemoclaw_scenarios/onboard/dispatch.sh"
-# shellcheck source=../validation_suites/assert/gateway-alive.sh
-. "${E2E_ROOT}/validation_suites/assert/gateway-alive.sh"
-# shellcheck source=../validation_suites/assert/sandbox-alive.sh
-. "${E2E_ROOT}/validation_suites/assert/sandbox-alive.sh"
-
-# Apply standard non-interactive env (and trace it).
-e2e_env_apply_noninteractive
-e2e_env_trace "env:noninteractive"
-
-# Emit normalized context from the resolved plan.
-e2e_context_init
-"${E2E_ROOT}/nemoclaw_scenarios/helpers/emit-context-from-plan.sh" "${E2E_CONTEXT_DIR}/plan.json"
-
-# Extract the install method and onboarding profile from the plan so we can
-# dispatch to the right helpers.
-read_plan_string() {
-  local key="$1"
-  node -e "
-    const p = JSON.parse(require('fs').readFileSync(process.argv[1], 'utf8'));
-    const parts = process.argv[2].split('.');
-    let cur = p;
-    for (const part of parts) { if (cur == null) { cur = ''; break; } cur = cur[part]; }
-    process.stdout.write(cur == null ? '' : String(cur));
-  " "${E2E_CONTEXT_DIR}/plan.json" "${key}"
-}
-
-INSTALL_ID="$(read_plan_string dimensions.install.id)"
-INSTALL_METHOD="$(read_plan_string dimensions.install.profile.method)"
-ONBOARDING_ID="$(read_plan_string dimensions.onboarding.id)"
-RUNTIME_ID="$(read_plan_string dimensions.runtime.id)"
-RUNTIME_CONTAINER_DAEMON="$(read_plan_string dimensions.runtime.profile.container_daemon)"
-EXPECTED_STATE_ID="$(read_plan_string expected_state.id)"
-FAILURE_STAGE="$(read_plan_string expected_state.config.failure.stage)"
-FAILURE_EXIT_CODE="$(read_plan_string expected_state.config.failure.exit_code)"
-FAILURE_MESSAGE_CONTAINS="$(read_plan_string expected_state.config.failure.message_contains)"
-FAILURE_NO_STACK_TRACE="$(read_plan_string expected_state.config.failure.no_stack_trace)"
-
-# Trace the dimension id so scenario-level assertions can identify the
-# configured install (e.g. repo-current); e2e_install internally traces
-# the resolved method.
-e2e_env_trace "install:${INSTALL_ID}"
-
-install_log="${E2E_CONTEXT_DIR}/install.log"
-set +e
-e2e_install "${INSTALL_METHOD}" >"${install_log}" 2>&1
-install_status=$?
-set -e
-if [[ "${install_status}" -ne 0 ]]; then
-  cat "${install_log}" >&2
-  echo "run-scenario: install ${INSTALL_METHOD} failed with status ${install_status}" >&2
-  exit "${install_status}"
-fi
-export PATH="${HOME}/.local/bin:${PATH}"
-{
-  printf 'PATH=%s\n' "${PATH}"
-  command -v nemoclaw || true
-} >"${E2E_CONTEXT_DIR}/post-install-path.log" 2>&1
-if [[ "${DRY_RUN}" -eq 1 ]]; then
-  printf 'run-scenario: dry-run skipping post-install nemoclaw PATH verification\n' >&2
-else
-  nemoclaw_bin="$(command -v nemoclaw || true)"
-  if [[ -z "${nemoclaw_bin}" ]]; then
-    cat "${E2E_CONTEXT_DIR}/post-install-path.log" >&2
-    echo "run-scenario: nemoclaw not found on PATH after install" >&2
-    exit 127
-  fi
-  printf 'run-scenario: using nemoclaw at %s\n' "${nemoclaw_bin}" >&2
-fi
-
-# Negative scenarios declare an `expected_failure` block on their expected
-# state (see NemoClaw issue #3608). The runner forces the failure mode for
-# the scenario, captures the setup log, gathers a side-effect inventory, and
-# delegates structured matching to `resolver/index.ts match-failure`. The
-# matcher writes `expected-vs-actual.json` for CI artifact upload.
-
-read_plan_failure_field() {
-  local key="$1"
-  node -e "
-    (() => {
-      const p = JSON.parse(require('fs').readFileSync(process.argv[1], 'utf8'));
-      const ef = p.expected_failure;
-      if (!ef) { process.stdout.write(''); return; }
-      const v = ef[process.argv[2]];
-      process.stdout.write(v == null ? '' : Array.isArray(v) ? v.join(',') : String(v));
-    })();
-  " "${E2E_CONTEXT_DIR}/plan.json" "${key}"
-}
-
-EXPECTED_FAILURE_PHASE="$(read_plan_failure_field phase)"
-
-if [[ -n "${EXPECTED_FAILURE_PHASE}" ]]; then
-  expected_error_class="$(read_plan_failure_field error_class)"
-  negative_log="${E2E_CONTEXT_DIR}/negative-${EXPECTED_FAILURE_PHASE}.log"
-  sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
-
-  # Snapshot the side-effect baseline BEFORE forcing the failure so we only
-  # report effects newly introduced by this scenario. A pre-existing gateway
-  # or credentials file from an earlier run would otherwise look like a fresh
-  # side effect and falsely fail negative scenarios in dirty environments.
-  baseline_sandbox=0
-  if [[ -n "${sandbox_name}" ]] && openshell sandbox list 2>/dev/null | grep -Fq "${sandbox_name}"; then
-    baseline_sandbox=1
-  fi
-  baseline_gateway=0
-  if nemoclaw gateway status >/dev/null 2>&1; then
-    baseline_gateway=1
-  fi
-  baseline_credentials=0
-  if [[ -s "${HOME}/.nemoclaw/credentials.json" ]]; then
-    baseline_credentials=1
-  fi
-
-  # Force the failure mode declared by the scenario. Only `preflight` /
-  # `docker-missing` is implemented here; other phases are accepted by the
-  # schema but their forcing logic lands alongside the first consumer.
-  case "${EXPECTED_FAILURE_PHASE}:${expected_error_class}" in
-    preflight:docker-missing)
-      if [[ "${DRY_RUN}" -eq 1 ]]; then
-        printf 'Cannot connect to the Docker daemon during preflight\n' >"${negative_log}"
-      else
-        if DOCKER_HOST="unix:///tmp/nemoclaw-e2e-missing-docker.sock" \
-          e2e_onboard "${ONBOARDING_ID}" >"${negative_log}" 2>&1; then
-          echo "run-scenario: expected preflight failure, but onboarding succeeded" >&2
-          cat "${negative_log}" >&2
-          exit 4
-        fi
-      fi
-      ;;
-    *)
-      echo "run-scenario: expected_failure phase=${EXPECTED_FAILURE_PHASE} class=${expected_error_class} has no forcing implementation yet" >&2
-      exit 2
-      ;;
-  esac
-
-  # Compute the side-effect delta: only count effects that were absent in the
-  # baseline and present after the forced failure.
-  observed_side_effects=""
-  if [[ "${baseline_sandbox}" -eq 0 ]] && [[ -n "${sandbox_name}" ]] \
-    && openshell sandbox list 2>/dev/null | grep -Fq "${sandbox_name}"; then
-    observed_side_effects="${observed_side_effects:+${observed_side_effects},}sandbox-created"
-  fi
-  if [[ "${baseline_gateway}" -eq 0 ]] && nemoclaw gateway status >/dev/null 2>&1; then
-    observed_side_effects="${observed_side_effects:+${observed_side_effects},}gateway-started"
-  fi
-  if [[ "${baseline_credentials}" -eq 0 ]] && [[ -s "${HOME}/.nemoclaw/credentials.json" ]]; then
-    observed_side_effects="${observed_side_effects:+${observed_side_effects},}credentials-written"
-  fi
-
-  # `--observed-error-class` is intentionally omitted: the runner does not yet
-  # derive a structured error class from the actual failure output, and
-  # reporting the planned class back to the matcher would make the check
-  # tautological. The matcher logs this as a skipped check.
-  match_args=(
-    match-failure "${SCENARIO_ID}"
-    --context-dir "${E2E_CONTEXT_DIR}"
-    --log "${negative_log}"
-    --observed-phase "${EXPECTED_FAILURE_PHASE}"
-  )
-  if [[ -n "${observed_side_effects}" ]]; then
-    match_args+=(--observed-side-effects "${observed_side_effects}")
-  fi
-  if ! run_resolver "${match_args[@]}"; then
-    echo "run-scenario: expected-failure match failed; see ${E2E_CONTEXT_DIR}/expected-vs-actual.json" >&2
-    exit 4
-  fi
-  echo "run-scenario: negative scenario passed (phase=${EXPECTED_FAILURE_PHASE} class=${expected_error_class})"
-  exit 0
-fi
-
-if [[ "${EXPECTED_STATE_ID}" == "preflight-failure-no-sandbox" ]]; then
-  negative_log="${E2E_CONTEXT_DIR}/negative-preflight.log"
-  sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
-  if [[ "${DRY_RUN}" -eq 1 ]]; then
-    printf 'Cannot connect to the Docker daemon during preflight\n' >"${negative_log}"
-  elif DOCKER_HOST="unix:///tmp/nemoclaw-e2e-missing-docker.sock" e2e_onboard "${ONBOARDING_ID}" >"${negative_log}" 2>&1; then
-    echo "run-scenario: expected preflight failure, but onboarding succeeded" >&2
-    exit 4
-  fi
-  if ! grep -Eiq "docker|container|daemon|socket|preflight" "${negative_log}"; then
-    echo "run-scenario: negative preflight failed without a clear Docker/preflight reason" >&2
-    cat "${negative_log}" >&2
-    exit 4
-  fi
-  if openshell sandbox list 2>/dev/null | grep -Fq "${sandbox_name}"; then
-    echo "run-scenario: negative preflight left behind sandbox ${sandbox_name}" >&2
-    exit 4
-  fi
-  echo "run-scenario: negative preflight passed; Docker daemon unavailable and no sandbox was created"
-  exit 0
-fi
-
-if [[ "${FAILURE_STAGE}" == "onboarding" ]]; then
-  negative_log="${E2E_CONTEXT_DIR}/negative-onboarding.log"
-  sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
-  port_holder_started=0
-  onboard_env=(NEMOCLAW_SANDBOX_NAME="${sandbox_name}" NEMOCLAW_RECREATE_SANDBOX=1 NEMOCLAW_POLICY_MODE=skip)
-  case "${ONBOARDING_ID}" in
-    cloud-openclaw-invalid-nvidia-key)
-      onboard_env+=(NVIDIA_API_KEY=not-a-nvidia-key)
-      ;;
-    cloud-openclaw-gateway-port-conflict)
-      conflict_port="$(read_plan_string dimensions.onboarding.profile.gateway_port)"
-      : "${conflict_port:=18080}"
-      if e2e_port_holder_start "${conflict_port}"; then
-        port_holder_started=1
-      else
-        echo "run-scenario: could not start port holder on ${conflict_port}; continuing against any existing listener" >&2
-      fi
-      onboard_env+=(NEMOCLAW_GATEWAY_PORT="${conflict_port}")
-      ;;
-  esac
-  if [[ "${DRY_RUN}" -eq 1 ]]; then
-    printf '%s
-' "${FAILURE_MESSAGE_CONTAINS}" >"${negative_log}"
-    negative_status="${FAILURE_EXIT_CODE:-1}"
-  else
-    set +e
-    (
-      export "${onboard_env[@]}"
-      e2e_onboard "${ONBOARDING_ID}"
-    ) >"${negative_log}" 2>&1
-    negative_status=$?
-    set -e
-  fi
-  if [[ "${port_holder_started}" -eq 1 ]]; then
-    e2e_port_holder_stop
-  fi
-  if ! e2e_negative_assert_failure "${negative_log}" "${negative_status}" "${FAILURE_EXIT_CODE:-1}" "${FAILURE_MESSAGE_CONTAINS}" "$([[ "${FAILURE_NO_STACK_TRACE}" == "true" ]] && echo 1 || echo 0)"; then
-    exit 4
-  fi
-  if openshell sandbox list 2>/dev/null | grep -Fq "${sandbox_name}"; then
-    echo "run-scenario: negative onboarding left behind sandbox ${sandbox_name}" >&2
-    exit 4
-  fi
-  echo "run-scenario: negative onboarding ${ONBOARDING_ID} passed"
-  exit 0
-fi
-
-DOCKER_OPTIONAL_UNAVAILABLE=0
-if [[ "${RUNTIME_CONTAINER_DAEMON}" == "optional" ]] && ! docker info >/dev/null 2>&1; then
-  DOCKER_OPTIONAL_UNAVAILABLE=1
-  echo "SKIP: scenario.${SCENARIO_ID}.docker-dependent-suites Docker unavailable for optional runtime ${RUNTIME_ID}; gateway/sandbox/inference coverage skipped"
-  echo "run-scenario: Docker unavailable for optional runtime ${RUNTIME_ID}; scaling back to platform-only suites"
-else
-  onboard_log="${E2E_CONTEXT_DIR}/onboard.log"
-  set +e
-  e2e_onboard "${ONBOARDING_ID}" >"${onboard_log}" 2>&1
-  onboard_status=$?
-  set -e
-  if [[ "${onboard_status}" -ne 0 ]]; then
-    cat "${onboard_log}" >&2
-    echo "run-scenario: onboarding ${ONBOARDING_ID} failed with status ${onboard_status}" >&2
-    exit "${onboard_status}"
-  fi
-  if [[ "${RUNTIME_ID}" == "gpu-docker-cdi" ]] && ! e2e_env_is_dry_run; then
-    echo "run-scenario: GPU Docker CDI uses host-network gateway; validating gateway from suites"
-  else
-    e2e_gateway_assert_healthy
-  fi
-  e2e_sandbox_assert_running
-fi
-
-# Expected state validation. The validator reads E2E_PROBE_OVERRIDE_* env
-# variables to simulate real probe outputs in dry-run/test contexts.
-# Live probe wiring lands scenario-by-scenario; by default, live runs move
-# straight from setup checks to suites so migrated suite assertions can be
-# debugged against the real environment.
-if [[ "${E2E_VALIDATE_EXPECTED_STATE:-0}" == "1" || "${DRY_RUN}" -eq 1 ]]; then
-  validate_args=("${SCENARIO_ID}" --context-dir "${E2E_CONTEXT_DIR}")
-  if [[ "${DRY_RUN}" -eq 1 ]]; then
-    # CodeRabbit review item #9: explicitly opt in to seeding probes from
-    # the expected state in dry-run/test mode. Live runs go through real
-    # probes and must fail closed if any are missing.
-    validate_args+=(--probes-from-state)
-  fi
-  if ! run_resolver validate-state "${validate_args[@]}"; then
-    echo "run-scenario: expected-state validation failed; suites will NOT run" >&2
-    exit 3
-  fi
-fi
-
-if [[ "${DRY_RUN}" -eq 1 ]]; then
-  echo "run-scenario: dry-run complete; context.env emitted under ${E2E_CONTEXT_DIR}"
-  exit 0
-fi
-
-SUITE_IDS=()
-while IFS= read -r suite_id; do
-  SUITE_IDS+=("${suite_id}")
-done < <(node -e "
-  try {
-    const planPath = process.argv[1];
-    const p = JSON.parse(require('fs').readFileSync(planPath, 'utf8'));
-    if (!Array.isArray(p.suites)) {
-      throw new Error('missing or invalid suites array');
-    }
-    const filter = process.env.E2E_SUITE_FILTER || '';
-    const selected = filter ? filter.split(',').map((s) => s.trim()).filter(Boolean) : p.suites.map((s) => s.id);
-    for (const id of selected) console.log(id);
-  } catch (err) {
-    console.error('run-scenario: failed to parse plan.json ' + process.argv[1] + ': ' + err.message);
-    process.exit(1);
-  }
-" "${E2E_CONTEXT_DIR}/plan.json")
-
-if [[ "${#SUITE_IDS[@]}" -eq 0 ]]; then
-  echo "run-scenario: no suites selected for ${SCENARIO_ID}" >&2
-  exit 4
-fi
-
-if [[ "${DOCKER_OPTIONAL_UNAVAILABLE}" -eq 1 ]]; then
-  FILTERED_SUITE_IDS=()
-  for suite_id in "${SUITE_IDS[@]}"; do
-    case "${suite_id}" in
-      smoke | inference | credentials | hermes-specific | local-ollama-inference | ollama-proxy | gateway-health | sandbox-shell | cloud-inference | ollama-auth-proxy | security-credentials | messaging-telegram | messaging-discord | messaging-slack | security-shields | inference-routing | sandbox-lifecycle | sandbox-operations | snapshot | rebuild | upgrade | diagnostics | docs-validation | openai-compatible-inference | inference-switch | kimi-compatibility | messaging-token-rotation | security-policy | security-injection | model-router)
-        echo "SKIP: suite.${suite_id} skipped because optional Docker runtime ${RUNTIME_ID} is unavailable"
-        ;;
-      *)
-        FILTERED_SUITE_IDS+=("${suite_id}")
-        ;;
-    esac
-  done
-  SUITE_IDS=("${FILTERED_SUITE_IDS[@]}")
-fi
+cat >&2 <<'MSG'
+run-scenario.sh is deprecated. Use the TS runner instead:
 
-if [[ "${#SUITE_IDS[@]}" -eq 0 ]]; then
-  echo "run-scenario: all suites skipped for ${SCENARIO_ID}" >&2
-  exit 0
-fi
+  npx tsx test/e2e-scenario/scenarios/run.ts --scenarios <id[,id...]>
 
-bash "${SCRIPT_DIR}/run-suites.sh" "${SUITE_IDS[@]}"
+Other run.ts modes (read-only):
+  --list                List canonical scenario ids
+  --emit-matrix         Emit GitHub Actions matrix payload from the registry
+  --plan-only           Local debug: print the compiled plan, do not execute
+                        (must NOT appear in any CI workflow)
+MSG
+exit 2
diff --git a/test/e2e-scenario/runtime/run-suites.sh b/test/e2e-scenario/runtime/run-suites.sh
index e99c069408..dac69cd422 100755
--- a/test/e2e-scenario/runtime/run-suites.sh
+++ b/test/e2e-scenario/runtime/run-suites.sh
@@ -2,136 +2,20 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
-# Run one or more functional suites against a completed E2E environment.
-#
-# Usage:
-#   bash test/e2e-scenario/runtime/run-suites.sh <suite-id> [<suite-id> ...]
-#
-# Reads suite metadata from test/e2e-scenario/validation_suites/suites.yaml
-# (or $E2E_SUITES_FILE). Each suite script receives .e2e/context.env
-# via E2E_CONTEXT_DIR and is expected to source runtime/lib/context.sh if
-# it needs specific keys.
-#
-# Environment:
-#   E2E_CONTEXT_DIR   Directory containing context.env (default: <repo>/.e2e)
-#   E2E_SUITES_FILE   Override suites metadata file (for tests)
-#   E2E_SUITES_DIR    Override the directory that suite scripts are resolved
-#                     against (default: test/e2e-scenario/validation_suites/)
-#   E2E_DRY_RUN       When 1, suite scripts run in dry-run mode themselves.
-#
-# Exit code: 0 if all steps pass; non-zero at the first failing step.
+# DEPRECATED. Suite execution is now driven directly by the TS phase
+# orchestrator (RuntimeOrchestrator -> PhaseOrchestrator.runShellStep) which
+# spawns each migrated assertion step's implementation.ref shell script.
+# There is no longer a YAML-walking bash suite runner.
 
 set -euo pipefail
 
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
-E2E_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
-REPO_ROOT="$(cd "${SCRIPT_DIR}/../../.." && pwd)"
-VALIDATION_SUITES_DIR="${E2E_ROOT}/validation_suites"
-
-if (($# == 0)); then
-  echo "run-suites: at least one suite id required" >&2
-  echo "Usage: bash test/e2e-scenario/runtime/run-suites.sh <suite-id> [<suite-id> ...]" >&2
-  exit 2
-fi
-
-export E2E_CONTEXT_DIR="${E2E_CONTEXT_DIR:-${REPO_ROOT}/.e2e}"
-SUITES_FILE="${E2E_SUITES_FILE:-${VALIDATION_SUITES_DIR}/suites.yaml}"
-SUITES_DIR="${E2E_SUITES_DIR:-${VALIDATION_SUITES_DIR}}"
-
-CTX_FILE="${E2E_CONTEXT_DIR}/context.env"
-if [[ ! -f "${CTX_FILE}" ]]; then
-  echo "run-suites: missing ${CTX_FILE}; run-scenario.sh must emit context before running suites" >&2
-  exit 1
-fi
-
-# Sanity-check that the baseline scenario key is present.
-if ! grep -q '^E2E_SCENARIO=' "${CTX_FILE}"; then
-  echo "run-suites: ${CTX_FILE} is missing required key E2E_SCENARIO" >&2
-  exit 1
-fi
-
-# Resolve the suite step list by reading the YAML via node.
-resolve_suite() {
-  local suite_id="$1"
-  node -e "
-    const fs = require('fs');
-    const path = process.argv[1];
-    const wanted = process.argv[2];
-    const raw = fs.readFileSync(path, 'utf8');
-    // Minimal YAML reader: prefer js-yaml if available; else fall back.
-    let yaml;
-    try { yaml = require('js-yaml'); } catch (_) {
-      process.stderr.write('run-suites: js-yaml required to parse suite metadata\n');
-      process.exit(2);
-    }
-    const doc = yaml.load(raw);
-    if (!doc || !doc.suites || !doc.suites[wanted]) {
-      process.stderr.write('run-suites: unknown suite: ' + wanted + '\n');
-      process.exit(3);
-    }
-    const steps = doc.suites[wanted].steps || [];
-    for (const s of steps) {
-      if (!s || typeof s.id !== 'string' || typeof s.script !== 'string') {
-        process.stderr.write('run-suites: malformed step in ' + wanted + '\n');
-        process.exit(4);
-      }
-      process.stdout.write(s.id + '\t' + s.script + '\n');
-    }
-  " "${SUITES_FILE}" "${suite_id}"
-}
-
-declare -a FAILED_STEPS=()
-declare -a PASSED_STEPS=()
-OVERALL_STATUS=0
-
-run_one_suite() {
-  local suite_id="$1"
-  echo "== suite: ${suite_id} =="
-  local steps
-  if ! steps="$(resolve_suite "${suite_id}")"; then
-    OVERALL_STATUS=1
-    return 1
-  fi
-  if [[ -z "${steps}" ]]; then
-    echo "  (no steps)"
-    return 0
-  fi
-  while IFS=$'\t' read -r step_id script; do
-    [[ -z "${step_id}" ]] && continue
-    local full="${SUITES_DIR}/${script}"
-    echo "  -> step: ${step_id} (${script})"
-    if [[ ! -f "${full}" ]]; then
-      echo "    FAIL: script not found at ${full}" >&2
-      FAILED_STEPS+=("${suite_id}/${step_id}")
-      OVERALL_STATUS=1
-      return 1
-    fi
-    if ! bash "${full}"; then
-      echo "    FAIL: suite=${suite_id} step=${step_id}" >&2
-      FAILED_STEPS+=("${suite_id}/${step_id}")
-      OVERALL_STATUS=1
-      return 1
-    fi
-    echo "    PASS: ${step_id}"
-    PASSED_STEPS+=("${suite_id}/${step_id}")
-  done <<<"${steps}"
-}
-
-for suite_id in "$@"; do
-  if ! run_one_suite "${suite_id}"; then
-    break
-  fi
-done
+cat >&2 <<'MSG'
+run-suites.sh is deprecated. Suite assertions are now executed by
+test/e2e-scenario/scenarios/orchestrators/phase.ts via child_process.spawn,
+walking the typed assertionGroups defined in the scenario registry.
 
-echo
-echo "== suite summary =="
-# bash 3.2 (macOS) fails on "${arr[@]}" when the array is empty under `set -u`;
-# use the `${arr[@]+...}` guard to expand to nothing when empty.
-for p in ${PASSED_STEPS[@]+"${PASSED_STEPS[@]}"}; do
-  echo "  PASS ${p}"
-done
-for f in ${FAILED_STEPS[@]+"${FAILED_STEPS[@]}"}; do
-  echo "  FAIL ${f}"
-done
+Run scenarios via:
 
-exit "${OVERALL_STATUS}"
+  npx tsx test/e2e-scenario/scenarios/run.ts --scenarios <id[,id...]>
+MSG
+exit 2
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index ae59a58e62..220a8426c7 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -1,8 +1,10 @@
 // SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
 
+import { spawn } from "node:child_process";
 import fs from "node:fs";
 import path from "node:path";
+import { fileURLToPath } from "node:url";
 import type {
   AssertionResult,
   AssertionStep,
@@ -13,19 +15,29 @@ import type {
   TransientClassifier,
 } from "../types.ts";
 
+const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../../..");
+const DEFAULT_STEP_TIMEOUT_SECONDS = 300;
+
 interface StepAttemptOutcome {
-  status: "passed" | "failed";
+  status: "passed" | "failed" | "skipped";
   classifier?: TransientClassifier;
   message?: string;
+  evidence?: string;
 }
 
-function transientForRef(ref: string): TransientClassifier {
-  if (ref.includes("provider") || ref.includes("transient")) {
-    return "provider-transient";
+// Heuristic transient classifier for shell step refs that don't print
+// their own classifier hint. Phase orchestrators own classification;
+// clients/scripts do not.
+function classifierForRef(ref: string): TransientClassifier {
+  if (/provider|inference|chat-completion|cloudflared|tunnel/i.test(ref)) {
+    return /tunnel|cloudflared/i.test(ref) ? "external-tunnel" : "provider-transient";
   }
-  if (ref.includes("gateway")) {
+  if (/gateway/i.test(ref)) {
     return "gateway-transient";
   }
+  if (/event-capture|tui|chat-events/i.test(ref)) {
+    return "empty-event-capture";
+  }
   return "runner-infra";
 }
 
@@ -39,7 +51,9 @@ export class PhaseOrchestrator {
         assertions.push(await this.runStep(ctx, step));
       }
     }
-    const status = assertions.some((assertion) => assertion.status === "failed") ? "failed" : "passed";
+    const failed = assertions.some((assertion) => assertion.status === "failed");
+    const allSkipped = assertions.length > 0 && assertions.every((assertion) => assertion.status === "skipped");
+    const status: PhaseResult["status"] = failed ? "failed" : allSkipped ? "skipped" : "passed";
     const result: PhaseResult = { phase: this.phaseName, status, assertions };
     this.writePhaseResult(ctx, result);
     return result;
@@ -48,20 +62,21 @@ export class PhaseOrchestrator {
   private async runStep(ctx: RunContext, step: AssertionStep): Promise<AssertionResult> {
     const startedAt = Date.now();
     const rawAttempts = step.reliability?.retry?.attempts;
-    const maxAttempts = typeof rawAttempts === "number" && Number.isFinite(rawAttempts) ? Math.max(1, Math.floor(rawAttempts)) : 1;
+    const maxAttempts =
+      typeof rawAttempts === "number" && Number.isFinite(rawAttempts) ? Math.max(1, Math.floor(rawAttempts)) : 1;
     let attempts = 0;
     let lastOutcome: StepAttemptOutcome = { status: "failed", message: "step did not run" };
     for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
       attempts = attempt;
       lastOutcome = await this.executeStep(ctx, step, attempt);
-      if (lastOutcome.status === "passed") {
+      if (lastOutcome.status === "passed" || lastOutcome.status === "skipped") {
         return {
           id: step.id,
-          status: "passed",
+          status: lastOutcome.status,
           attempts,
           durationMs: Date.now() - startedAt,
           classifier: attempt > 1 ? step.reliability?.retry?.on[0] : lastOutcome.classifier,
-          evidence: step.evidencePath,
+          evidence: lastOutcome.evidence ?? step.evidencePath,
           message: lastOutcome.message,
         };
       }
@@ -75,7 +90,7 @@ export class PhaseOrchestrator {
       attempts,
       durationMs: Date.now() - startedAt,
       classifier: lastOutcome.classifier,
-      evidence: step.evidencePath,
+      evidence: lastOutcome.evidence ?? step.evidencePath,
       message: lastOutcome.message,
     };
   }
@@ -92,26 +107,144 @@ export class PhaseOrchestrator {
     return step.reliability?.retry?.on.includes(classifier) ?? false;
   }
 
-  private async executeStep(_ctx: RunContext, step: AssertionStep, attempt: number): Promise<StepAttemptOutcome> {
-    const ref = step.implementation?.ref ?? "";
-    if (ref === "fake-pass" || ref === "phase-1-skeleton") {
-      return { status: "passed" };
+  private async executeStep(ctx: RunContext, step: AssertionStep, _attempt: number): Promise<StepAttemptOutcome> {
+    const kind = step.implementation?.kind;
+    if (kind === "shell") {
+      return this.runShellStep(ctx, step);
     }
-    if (ref === "fake-retry-once-pass") {
-      return attempt === 1
-        ? { status: "failed", classifier: step.reliability?.retry?.on[0] ?? "gateway-transient" }
-        : { status: "passed" };
+    if (kind === "probe") {
+      // Probe registry lands in a follow-up PR. Until then, surface
+      // unimplemented probes as visibly skipped — never as fake green.
+      return {
+        status: "skipped",
+        message: `probe not registered: ${step.implementation?.ref ?? "<no ref>"}`,
+      };
     }
-    if (ref === "fake-always-transient") {
-      return { status: "failed", classifier: step.reliability?.retry?.on[0] ?? transientForRef(ref) };
+    if (kind === "pending") {
+      // pending steps surface as skipped with the placeholder ref so
+      // gaps are visible in plan output and phase results.
+      return { status: "skipped", message: `pending: ${step.implementation?.ref ?? ""}` };
     }
-    if (step.implementation?.kind === "shell" && _ctx.dryRun) {
-      return { status: "passed", message: `dry-run shell ${ref}` };
+    throw new Error(`Unknown assertion step kind for ${step.id}: ${String(kind)}`);
+  }
+
+  private async runShellStep(ctx: RunContext, step: AssertionStep): Promise<StepAttemptOutcome> {
+    const ref = step.implementation?.ref;
+    if (!ref) {
+      return { status: "failed", message: `shell step ${step.id} missing implementation.ref` };
     }
-    if (step.implementation?.kind === "probe" && _ctx.dryRun) {
-      return { status: "passed", message: `dry-run probe ${ref}` };
+    const scriptPath = path.isAbsolute(ref) ? ref : path.resolve(REPO_ROOT, ref);
+    if (!fs.existsSync(scriptPath)) {
+      return { status: "failed", message: `shell step ${step.id} script not found: ${scriptPath}` };
+    }
+
+    const timeoutSeconds = step.reliability?.timeoutSeconds ?? DEFAULT_STEP_TIMEOUT_SECONDS;
+    const logDir = path.join(ctx.contextDir, ".e2e", "logs");
+    fs.mkdirSync(logDir, { recursive: true });
+    const logPath = path.join(logDir, `${step.id}.log`);
+
+    const env: NodeJS.ProcessEnv = {
+      ...process.env,
+      E2E_CONTEXT_DIR: ctx.contextDir,
+      E2E_STEP_ID: step.id,
+      E2E_PHASE: step.phase,
+    };
+    // Surface scenario-derived context (E2E_SANDBOX_NAME, E2E_GATEWAY_URL,
+    // etc.) that the environment+onboarding phases wrote into context.env.
+    const contextEnvPath = path.join(ctx.contextDir, ".e2e", "context.env");
+    if (fs.existsSync(contextEnvPath)) {
+      const contextEnv = fs.readFileSync(contextEnvPath, "utf8");
+      for (const line of contextEnv.split("\n")) {
+        const trimmed = line.trim();
+        if (!trimmed || trimmed.startsWith("#")) {
+          continue;
+        }
+        const eq = trimmed.indexOf("=");
+        if (eq <= 0) {
+          continue;
+        }
+        const key = trimmed.slice(0, eq);
+        let value = trimmed.slice(eq + 1);
+        if ((value.startsWith('"') && value.endsWith('"')) || (value.startsWith("'") && value.endsWith("'"))) {
+          value = value.slice(1, -1);
+        }
+        env[key] = value;
+      }
     }
-    return { status: "failed", message: `unsupported live step ${step.id}` };
+
+    return await new Promise<StepAttemptOutcome>((resolve) => {
+      // detached: true puts the child (and any of its children, e.g. a `sleep`
+      // spawned by bash) into its own process group. We send signals to the
+      // negative pid so the whole group dies on timeout. Without this, bash
+      // ignores SIGTERM until its current foreground command (e.g. sleep)
+      // returns, and timeouts effectively don't work.
+      const child = spawn("bash", [scriptPath], { env, cwd: REPO_ROOT, detached: true });
+      const pgid = child.pid;
+      const logStream = fs.createWriteStream(logPath);
+      let stderrTail = "";
+      child.stdout.pipe(logStream, { end: false });
+      child.stderr.pipe(logStream, { end: false });
+      child.stderr.on("data", (chunk: Buffer) => {
+        stderrTail = (stderrTail + chunk.toString("utf8")).slice(-4096);
+      });
+
+      const killGroup = (signal: NodeJS.Signals) => {
+        if (typeof pgid !== "number") {
+          child.kill(signal);
+          return;
+        }
+        try {
+          process.kill(-pgid, signal);
+        } catch {
+          /* group already gone */
+        }
+      };
+
+      let timedOut = false;
+      const timeout = setTimeout(() => {
+        timedOut = true;
+        killGroup("SIGTERM");
+        setTimeout(() => {
+          if (child.exitCode === null && child.signalCode === null) {
+            killGroup("SIGKILL");
+          }
+        }, 5_000).unref();
+      }, timeoutSeconds * 1_000);
+
+      child.on("error", (err) => {
+        clearTimeout(timeout);
+        logStream.end();
+        resolve({
+          status: "failed",
+          message: `shell step ${step.id} spawn error: ${err.message}`,
+          evidence: logPath,
+        });
+      });
+
+      child.on("close", (code, signal) => {
+        clearTimeout(timeout);
+        logStream.end();
+        if (timedOut) {
+          resolve({
+            status: "failed",
+            classifier: "runner-infra",
+            message: `shell step ${step.id} exceeded ${timeoutSeconds}s (signal=${signal ?? "SIGTERM"})`,
+            evidence: logPath,
+          });
+          return;
+        }
+        if (code === 0) {
+          resolve({ status: "passed", evidence: logPath });
+          return;
+        }
+        resolve({
+          status: "failed",
+          classifier: classifierForRef(ref),
+          message: `shell step ${step.id} exit ${code ?? "null"}: ${stderrTail.trim().split("\n").slice(-3).join(" | ")}`,
+          evidence: logPath,
+        });
+      });
+    });
   }
 
   private writePhaseResult(ctx: RunContext, result: PhaseResult) {
diff --git a/test/e2e-scenario/scenarios/run.ts b/test/e2e-scenario/scenarios/run.ts
index e666e07844..2a16c85996 100644
--- a/test/e2e-scenario/scenarios/run.ts
+++ b/test/e2e-scenario/scenarios/run.ts
@@ -4,33 +4,29 @@
 import { compileRunPlans, renderPlanText, writePlanArtifacts } from "./compiler.ts";
 import { ScenarioRunner } from "./orchestrators/runner.ts";
 import { listScenarios } from "./registry.ts";
+import type { PhaseResult } from "./types.ts";
 
 interface Args {
   list: boolean;
+  emitMatrix: boolean;
   planOnly: boolean;
-  dryRun: boolean;
-  validateOnly: boolean;
   scenarios: string[];
 }
 
 function parseArgs(argv: string[]): Args {
-  const args: Args = { list: false, planOnly: false, dryRun: false, validateOnly: false, scenarios: [] };
+  const args: Args = { list: false, emitMatrix: false, planOnly: false, scenarios: [] };
   for (let i = 0; i < argv.length; i += 1) {
     const arg = argv[i];
     if (arg === "--list") {
       args.list = true;
       continue;
     }
-    if (arg === "--plan-only") {
-      args.planOnly = true;
+    if (arg === "--emit-matrix") {
+      args.emitMatrix = true;
       continue;
     }
-    if (arg === "--dry-run") {
-      args.dryRun = true;
-      continue;
-    }
-    if (arg === "--validate-only") {
-      args.validateOnly = true;
+    if (arg === "--plan-only") {
+      args.planOnly = true;
       continue;
     }
     if (arg === "--scenarios") {
@@ -54,17 +50,29 @@ function printList() {
   }
 }
 
+function emitMatrix() {
+  // Read-only emission of the typed registry as a GitHub Actions matrix
+  // payload. Consumed by the dynamic matrix workflow (PR #4359).
+  const payload = {
+    include: listScenarios().map((scenario) => ({
+      id: scenario.id,
+      description: scenario.description ?? "",
+    })),
+  };
+  console.log(JSON.stringify(payload));
+}
+
 async function main() {
   const args = parseArgs(process.argv.slice(2));
   if (args.list) {
     printList();
     return;
   }
-
-  const modeCount = [args.planOnly, args.dryRun, args.validateOnly].filter(Boolean).length;
-  if (modeCount !== 1) {
-    throw new Error("Use exactly one of --plan-only, --dry-run, or --validate-only with --scenarios <id[,id...]>");
+  if (args.emitMatrix) {
+    emitMatrix();
+    return;
   }
+
   if (args.scenarios.length === 0) {
     throw new Error("scenario execution requires --scenarios <id[,id...]>");
   }
@@ -78,12 +86,43 @@ async function main() {
   writePlanArtifacts(plans, contextDir);
   console.log(renderPlanText(plans));
 
-  if (args.dryRun) {
-    const runner = new ScenarioRunner();
-    for (const plan of plans) {
-      await runner.run({ contextDir, dryRun: true }, plan);
+  if (args.planOnly) {
+    // Local debug only. Workflows must not pass --plan-only.
+    return;
+  }
+
+  const runner = new ScenarioRunner();
+  const allResults: PhaseResult[] = [];
+  let anyFailed = false;
+  for (const plan of plans) {
+    const results = await runner.run({ contextDir }, plan);
+    allResults.push(...results);
+    if (results.some((result) => result.status === "failed")) {
+      anyFailed = true;
     }
   }
+
+  // Surface a compact run summary so phase results don't have to be opened
+  // to see what passed.
+  console.log("");
+  console.log("Phase results:");
+  for (const result of allResults) {
+    const counts = result.assertions.reduce(
+      (acc, assertion) => {
+        acc[assertion.status] = (acc[assertion.status] ?? 0) + 1;
+        return acc;
+      },
+      {} as Record<string, number>,
+    );
+    const detail = Object.entries(counts)
+      .map(([status, count]) => `${status}=${count}`)
+      .join(" ");
+    console.log(`  ${result.phase}: ${result.status} (${detail || "no steps"})`);
+  }
+
+  if (anyFailed) {
+    process.exitCode = 1;
+  }
 }
 
 try {
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index b29f8458d6..c83464a5e9 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -126,7 +126,6 @@ export interface RunPlan {
 
 export interface RunContext {
   contextDir: string;
-  dryRun: boolean;
 }
 
 export interface AssertionResult {
diff --git a/test/e2e-scenario/validation_suites/assert/gateway-alive.sh b/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
index a498602d35..5eec76073c 100755
--- a/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
+++ b/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
@@ -23,10 +23,6 @@ e2e_gateway_assert_healthy() {
     return 2
   fi
   e2e_env_trace "gateway:check" "${url}"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] gateway check ${url} (skipped)"
-    return 0
-  fi
   # Prefer /health if available, otherwise just hit the base URL.
   local http_code
   http_code="$(curl -fsS -o /dev/null -w '%{http_code}' --max-time 5 "${url%/}/health" 2>/dev/null || echo 000)"
diff --git a/test/e2e-scenario/validation_suites/assert/sandbox-alive.sh b/test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
index b85ef9cd60..473061e972 100755
--- a/test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
+++ b/test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
@@ -12,7 +12,6 @@ _E2E_SB_LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../runtime/lib" && pwd)
 
 # e2e_sandbox_assert_running
-# Requires E2E_SANDBOX_NAME in context. Real implementation queries
-# `nemoclaw list`; honors E2E_DRY_RUN.
+# Requires E2E_SANDBOX_NAME in context. Queries `nemoclaw list`.
 e2e_sandbox_assert_running() {
   if ! e2e_context_require E2E_SANDBOX_NAME; then
     return 1
@@ -20,10 +19,6 @@ e2e_sandbox_assert_running() {
   local name
   name="$(e2e_context_get E2E_SANDBOX_NAME)"
   e2e_env_trace "sandbox:check" "${name}"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] sandbox check ${name} (skipped)"
-    return 0
-  fi
   if ! command -v nemoclaw >/dev/null 2>&1; then
     echo "e2e_sandbox_assert_running: nemoclaw CLI not on PATH" >&2
     return 1
diff --git a/test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh b/test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
index 0fff0fd9ab..4b8161aea4 100755
--- a/test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
+++ b/test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
@@ -16,10 +16,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../runtime/lib" && pwd)"
 
 echo "hermes-specific:hermes-health"
 e2e_context_require E2E_AGENT
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would run Hermes health checks"
-  exit 0
-fi
 agent="$(e2e_context_get E2E_AGENT)"
 if [[ "${agent}" != "hermes" ]]; then
   echo "hermes-specific: E2E_AGENT should be 'hermes', got '${agent}'" >&2
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh b/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
index 64e1b086fc..af8ad99081 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
@@ -17,11 +17,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 echo "inference:models-health"
 e2e_context_require E2E_SANDBOX_NAME
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would GET inference.local/v1/models from inside the sandbox"
-  exit 0
-fi
-
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 body="$(openshell sandbox exec --name "${name}" -- curl -fsS --max-time 30 "https://inference.local/v1/models")"
 if [[ -z "${body}" ]]; then
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
index f54ff8806b..242a9496f1 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
@@ -16,11 +16,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 echo "inference:chat-completion"
 e2e_context_require E2E_SANDBOX_NAME
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would POST a chat completion to inference.local from inside the sandbox"
-  exit 0
-fi
-
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 payload='{"model":"nvidia/nemotron-3-super-120b-a12b","messages":[{"role":"user","content":"Reply with exactly one word: PONG"}],"max_tokens":100}'
 response="$(openshell sandbox exec --name "${name}" -- curl -fsS --max-time 60 -H 'Content-Type: application/json' \
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
index 6d1343a736..c3253a966a 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
@@ -17,11 +17,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 echo "inference:sandbox-inference-local"
 e2e_context_require E2E_SANDBOX_NAME E2E_INFERENCE_ROUTE
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would resolve inference-local from inside the sandbox"
-  exit 0
-fi
-
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 route="$(e2e_context_get E2E_INFERENCE_ROUTE)"
 # CodeRabbit review item #13: capture then truncate to avoid `| head` racing
diff --git a/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh b/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
index 77d4772c17..a3887864b2 100755
--- a/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
+++ b/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
@@ -15,10 +15,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 
 echo "ollama-proxy:proxy-reachable"
 e2e_context_require E2E_SANDBOX_NAME
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would verify the Ollama auth proxy is reachable from the sandbox"
-  exit 0
-fi
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 # The Ollama auth proxy intentionally rejects unauthenticated requests to
 # /api/tags (legacy test-gpu-e2e.sh accepts 401/403 as proof the proxy is
diff --git a/test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh b/test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
index 47e9f1fd43..d61ead2e98 100755
--- a/test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
+++ b/test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
@@ -15,10 +15,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 
 echo "local-ollama-inference:ollama-models-health"
 e2e_context_require E2E_PROVIDER
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would GET ollama /api/tags via host Ollama"
-  exit 0
-fi
 # GPU Ollama scenarios mirror legacy test-gpu-e2e.sh: validate the host
 # Ollama daemon directly because Docker GPU host networking bypasses the
 # normal dashboard/gateway forward path.
diff --git a/test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh b/test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
index ad8ff54faa..5d18b4209a 100755
--- a/test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
+++ b/test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
@@ -15,10 +15,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 
 echo "local-ollama-inference:ollama-chat-completion"
 e2e_context_require E2E_SANDBOX_NAME
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would POST chat completion from sandbox to host-network Ollama"
-  exit 0
-fi
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 model="$(curl -fsS --max-time 10 http://127.0.0.1:11434/api/tags \
   | node -e "const fs=require('fs'); const data=JSON.parse(fs.readFileSync(0,'utf8')); process.stdout.write(data.models?.[0]?.name || data.models?.[0]?.model || 'default');")"
diff --git a/test/e2e-scenario/validation_suites/lib/inference_routing.sh b/test/e2e-scenario/validation_suites/lib/inference_routing.sh
index b4f4c1d63f..17db0bbedb 100755
--- a/test/e2e-scenario/validation_suites/lib/inference_routing.sh
+++ b/test/e2e-scenario/validation_suites/lib/inference_routing.sh
@@ -31,16 +31,6 @@ _e2e_inference_sandbox_name() {
   e2e_context_get E2E_SANDBOX_NAME
 }
 
-_e2e_inference_plan() {
-  local assertion_id="${1:-}"
-  local detail="${2:-planned inference/provider check}"
-  e2e_env_trace "inference:plan" "${assertion_id} ${detail}"
-  echo "[dry-run] ${assertion_id}: ${detail}"
-  if [[ -f "$(e2e_context_path)" ]]; then
-    e2e_context_dump | sed -E 's/(TOKEN|SECRET|API_KEY|APIKEY|CREDENTIAL|PASSWORD)([^=]*)=.*/\1\2=REDACTED/'
-  fi
-}
-
 _e2e_inference_curl_json() {
   local sandbox="$1"
   local url="$2"
@@ -64,10 +54,6 @@ e2e_inference_routing_assert_chat_completion() {
   local assertion_id="${1:-post-onboard.inference-routing.inference-local-chat-completion}"
   _e2e_inference_assertion "${assertion_id}"
   _e2e_inference_require_sandbox
-  if e2e_env_is_dry_run; then
-    _e2e_inference_plan "${assertion_id}" "POST https://inference.local/v1/chat/completions with bounded curl"
-    return 0
-  fi
   local sandbox payload output
   sandbox="$(_e2e_inference_sandbox_name)"
   payload='{"model":"default","messages":[{"role":"user","content":"Say ok"}],"max_tokens":8}'
@@ -84,10 +70,6 @@ e2e_inference_routing_assert_health() {
   local url="${2:-https://inference.local/v1/models}"
   _e2e_inference_assertion "${assertion_id}"
   _e2e_inference_require_sandbox
-  if e2e_env_is_dry_run; then
-    _e2e_inference_plan "${assertion_id}" "GET ${url} with bounded curl"
-    return 0
-  fi
   local sandbox status
   sandbox="$(_e2e_inference_sandbox_name)"
   status="$(_e2e_inference_status "${sandbox}" "${url}")"
@@ -103,10 +85,6 @@ e2e_inference_routing_assert_auth_proxy() {
   local mode="${2:-valid}"
   _e2e_inference_assertion "${assertion_id}"
   _e2e_inference_require_sandbox
-  if e2e_env_is_dry_run; then
-    _e2e_inference_plan "${assertion_id}" "auth-proxy ${mode} request; sensitive context redacted"
-    return 0
-  fi
   local sandbox status token
   sandbox="$(_e2e_inference_sandbox_name)"
   case "${mode}" in
diff --git a/test/e2e-scenario/validation_suites/lib/messaging_providers.sh b/test/e2e-scenario/validation_suites/lib/messaging_providers.sh
index 03c85ae6c2..4756fc54ef 100755
--- a/test/e2e-scenario/validation_suites/lib/messaging_providers.sh
+++ b/test/e2e-scenario/validation_suites/lib/messaging_providers.sh
@@ -104,10 +104,6 @@ e2e_messaging_read_config_surface() {
     return 0
   fi
   path="$(e2e_messaging_agent_config_path)"
-  if [[ -n "${E2E_DRY_RUN:-}" ]]; then
-    printf '%s=PLACEHOLDER\n' "$(e2e_messaging_config_key)"
-    return 0
-  fi
   if [[ -f "${path}" ]]; then
     cat "${path}"
     return 0
@@ -167,9 +163,6 @@ e2e_messaging_assert_literal_payload() {
   local assertion_id="${1:?assertion id required}"
   local payload="${2:?payload required}"
   local observed="${3:-}"
-  if [[ -z "${observed}" && -n "${E2E_DRY_RUN:-}" ]]; then
-    observed="${payload}"
-  fi
   if [[ -z "${observed}" ]]; then
     e2e_fail "${assertion_id} missing observed payload output"
   fi
diff --git a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
index c6483c99fb..99138304de 100755
--- a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
+++ b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
@@ -32,10 +32,6 @@ _rebuild_upgrade_run() {
 
 rebuild_upgrade_assert_sandbox_reachable() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.upgrade.survivor_agent_reachable dry-run"
-    return 0
-  fi
   local sandbox
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   if _rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- true; then
@@ -47,10 +43,6 @@ rebuild_upgrade_assert_sandbox_reachable() {
 
 rebuild_upgrade_assert_marker_preserved() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.rebuild.workspace_state_preserved dry-run"
-    return 0
-  fi
   local sandbox marker_path expected actual
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   marker_path="${E2E_REBUILD_MARKER_PATH:-/workspace/.nemoclaw-rebuild-marker}"
@@ -65,10 +57,6 @@ rebuild_upgrade_assert_marker_preserved() {
 
 rebuild_upgrade_assert_agent_version_upgraded() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.rebuild.agent_version_upgraded dry-run"
-    return 0
-  fi
   local sandbox old expected actual cmd
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   old="${E2E_OLD_AGENT_VERSION:-}"
@@ -84,10 +72,6 @@ rebuild_upgrade_assert_agent_version_upgraded() {
 
 rebuild_upgrade_assert_inference_works() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.rebuild.inference_still_works dry-run"
-    return 0
-  fi
   local sandbox cmd output
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   cmd="${E2E_INFERENCE_CHECK_COMMAND:-curl -fsS http://inference.local/v1/models}"
@@ -101,10 +85,6 @@ rebuild_upgrade_assert_inference_works() {
 
 rebuild_upgrade_assert_policy_presets_preserved() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.rebuild.policy_presets_preserved dry-run"
-    return 0
-  fi
   local presets output preset
   presets="${E2E_EXPECTED_POLICY_PRESETS:-npm pypi}"
   output="$(_rebuild_upgrade_run REBUILD_UPGRADE_NEMOCLAW_CMD nemoclaw policy status 2>/dev/null || true)"
@@ -123,10 +103,6 @@ rebuild_upgrade_assert_hermes_config_preserved() {
     e2e_pass "suite.rebuild.hermes_config_preserved skipped non-hermes"
     return 0
   fi
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.rebuild.hermes_config_preserved dry-run"
-    return 0
-  fi
   local sandbox output
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   output="$(_rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- bash -lc "grep -R 'platforms.discord\|DISCORD' ~/.hermes . 2>/dev/null" || true)"
@@ -139,10 +115,6 @@ rebuild_upgrade_assert_hermes_config_preserved() {
 
 rebuild_upgrade_assert_sandbox_registry_preserved() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.upgrade.sandbox_registry_preserved dry-run"
-    return 0
-  fi
   local sandbox output
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   output="$(_rebuild_upgrade_run REBUILD_UPGRADE_NEMOCLAW_CMD nemoclaw list 2>/dev/null || true)"
@@ -155,10 +127,6 @@ rebuild_upgrade_assert_sandbox_registry_preserved() {
 
 rebuild_upgrade_assert_gateway_version_upgraded() {
   rebuild_upgrade_require_context || return 1
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    e2e_pass "suite.upgrade.gateway_version_upgraded dry-run"
-    return 0
-  fi
   local expected output
   expected="${E2E_EXPECTED_OPENSHELL_VERSION:-}"
   output="$(_rebuild_upgrade_run REBUILD_UPGRADE_GATEWAY_CMD curl -fsS "$(_rebuild_upgrade_ctx E2E_GATEWAY_URL)/version" 2>/dev/null || true)"
diff --git a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
index df942487e7..3b186093d1 100755
--- a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
+++ b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
@@ -37,11 +37,6 @@ sandbox_lifecycle_run_with_timeout() {
   local seconds="$1"
   shift
   SANDBOX_LIFECYCLE_LAST_OUTPUT=""
-  if [[ "${E2E_DRY_RUN:-0}" == "1" ]]; then
-    SANDBOX_LIFECYCLE_LAST_OUTPUT="dry-run: $*"
-    printf '%s\n' "${SANDBOX_LIFECYCLE_LAST_OUTPUT}"
-    return 0
-  fi
   if command -v timeout >/dev/null 2>&1; then
     SANDBOX_LIFECYCLE_LAST_OUTPUT="$(timeout "${seconds}" "$@" 2>&1)" || {
       local rc=$?
@@ -64,7 +59,7 @@ sandbox_lifecycle_assert_nemoclaw_list_contains_sandbox() {
     sandbox_lifecycle_fail "${id}" "nemoclaw list failed"
     return 1
   }
-  [[ "${E2E_DRY_RUN:-0}" == "1" || "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" == *"${E2E_SANDBOX_NAME}"* ]] || {
+  [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" == *"${E2E_SANDBOX_NAME}"* ]] || {
     sandbox_lifecycle_fail "${id}" "sandbox not listed: ${E2E_SANDBOX_NAME}"
     return 1
   }
@@ -77,16 +72,14 @@ sandbox_lifecycle_assert_status_fields_present() {
     sandbox_lifecycle_fail "${id}" "nemoclaw status failed"
     return 1
   }
-  if [[ "${E2E_DRY_RUN:-0}" != "1" ]]; then
-    local status_output_lower
-    status_output_lower="$(printf '%s' "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" | tr '[:upper:]' '[:lower:]')"
-    for field in status gateway sandbox; do
-      [[ "${status_output_lower}" == *"${field}"* ]] || {
-        sandbox_lifecycle_fail "${id}" "missing status field: ${field}"
-        return 1
-      }
-    done
-  fi
+  local status_output_lower
+  status_output_lower="$(printf '%s' "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" | tr '[:upper:]' '[:lower:]')"
+  for field in status gateway sandbox; do
+    [[ "${status_output_lower}" == *"${field}"* ]] || {
+      sandbox_lifecycle_fail "${id}" "missing status field: ${field}"
+      return 1
+    }
+  done
   sandbox_lifecycle_pass "${id}" "status fields present"
 }
 
@@ -96,7 +89,7 @@ sandbox_lifecycle_assert_logs_available() {
     sandbox_lifecycle_fail "${id}" "nemoclaw logs failed"
     return 1
   }
-  [[ "${E2E_DRY_RUN:-0}" == "1" || -n "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ]] || {
+  [[ -n "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ]] || {
     sandbox_lifecycle_fail "${id}" "logs empty"
     return 1
   }
@@ -109,7 +102,7 @@ sandbox_lifecycle_assert_openshell_exec_ok() {
     sandbox_lifecycle_fail "${id}" "openshell exec failed"
     return 1
   }
-  [[ "${E2E_DRY_RUN:-0}" == "1" || "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" == *"lifecycle-ok"* ]] || {
+  [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" == *"lifecycle-ok"* ]] || {
     sandbox_lifecycle_fail "${id}" "unexpected exec output"
     return 1
   }
diff --git a/test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh b/test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
index 3e1872d62a..8d34a5444f 100755
--- a/test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
+++ b/test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
@@ -55,10 +55,6 @@ spc_assert_credentials_expected() {
     return 1
   fi
   spc_log_provider_metadata "$(spc_context_get E2E_PROVIDER)" "gateway"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] would list gateway credentials without raw values"
-    return 0
-  fi
   local raw_file listed_raw listed list_rc
   raw_file="$(mktemp "${TMPDIR:-/tmp}/nemoclaw-credentials-list.XXXXXX")"
   chmod 600 "${raw_file}"
@@ -105,10 +101,6 @@ spc_assert_policy_preset_present() {
   spc_assertion_id "post-onboard.security-policy.${preset}-preset-applied"
   spc_require_context E2E_SCENARIO E2E_SANDBOX_NAME
   echo "policy preset expected: ${preset}"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] would verify policy preset ${preset}"
-    return 0
-  fi
   local sandbox_name active
   sandbox_name="$(spc_context_get E2E_SANDBOX_NAME)"
   if ! active="$(nemoclaw "${sandbox_name}" policy-list 2>&1)"; then
@@ -143,10 +135,6 @@ spc_semver_ge() {
 spc_assert_openshell_credential_rewrite_supported() {
   spc_assertion_id "post-onboard.gateway.openshell-version-supports-credential-rewrite"
   spc_require_context E2E_SCENARIO
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] would verify OpenShell gateway capability metadata"
-    return 0
-  fi
   local openshell_bin version_output version minimum_version binary_strings feature
   minimum_version="0.0.39"
   openshell_bin="$(command -v openshell 2>/dev/null || true)"
@@ -221,10 +209,6 @@ spc_assert_shields_permissions_match_state() {
 spc_assert_shields_config_consistent() {
   spc_assertion_id "post-onboard.security-shields.config-consistent"
   spc_require_context E2E_SCENARIO E2E_SANDBOX_NAME E2E_AGENT
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] would verify shields config consistency"
-    return 0
-  fi
   local sandbox_name status observed expected
   sandbox_name="$(spc_context_get E2E_SANDBOX_NAME)"
   if ! status="$(nemoclaw "${sandbox_name}" shields status 2>&1)"; then
@@ -262,10 +246,6 @@ spc_assert_telegram_payload_not_shell_executed() {
   if [[ -n "${fixture_payload}" ]]; then
     printf 'telegram payload fixture loaded (%s bytes)\n' "${#fixture_payload}"
   fi
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] would submit payload without shell evaluation"
-    return 0
-  fi
   local sandbox_name marker payload send_output marker_state
   sandbox_name="$(spc_context_get E2E_SANDBOX_NAME)"
   marker="/tmp/nemoclaw-telegram-injection-proof-$RANDOM-$$"
diff --git a/test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh b/test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
index 9fc2156ad0..8ec82f8aeb 100755
--- a/test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
+++ b/test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
@@ -5,9 +5,4 @@
 set -euo pipefail
 . "$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/lib/messaging_providers.sh"
 e2e_messaging_load_context
-if [[ -n "${E2E_DRY_RUN:-}" ]]; then
-  provider="$(e2e_messaging_provider_name)"
-  e2e_pass "expected-state.messaging.${provider}.bridge-reachable dry-run"
-  exit 0
-fi
 e2e_messaging_assert_bridge_reachable
diff --git a/test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh b/test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
index 2f42115f5e..4f2f094c67 100755
--- a/test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
+++ b/test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
@@ -19,11 +19,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 echo "platform-macos:macos-smoke"
 e2e_context_require E2E_PLATFORM_OS
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would run macOS-specific smoke checks"
-  exit 0
-fi
-
 os="$(e2e_context_get E2E_PLATFORM_OS)"
 if [[ "${os}" != "macos" ]]; then
   echo "platform-macos: E2E_PLATFORM_OS should be 'macos', got '${os}'" >&2
diff --git a/test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh b/test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
index 1aeb39fe7c..ef96795a0c 100755
--- a/test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
+++ b/test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
@@ -17,11 +17,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 echo "platform-wsl:wsl-smoke"
 e2e_context_require E2E_PLATFORM_OS E2E_SANDBOX_NAME
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would run WSL-specific smoke checks"
-  exit 0
-fi
-
 os="$(e2e_context_get E2E_PLATFORM_OS)"
 if [[ "${os}" != "wsl" ]]; then
   echo "platform-wsl: E2E_PLATFORM_OS should be 'wsl', got '${os}'" >&2
diff --git a/test/e2e-scenario/validation_suites/sandbox-exec.sh b/test/e2e-scenario/validation_suites/sandbox-exec.sh
index 0682c4cf2f..c9e3ec06c6 100755
--- a/test/e2e-scenario/validation_suites/sandbox-exec.sh
+++ b/test/e2e-scenario/validation_suites/sandbox-exec.sh
@@ -12,7 +12,6 @@
 # Functions:
 #   e2e_sandbox_exec       <sandbox> -- <cmd> [args...]
 #       Run <cmd> inside <sandbox> via `openshell sandbox exec`. No stdin passed.
-#       Exit code propagates from <cmd>. Honors E2E_DRY_RUN.
 #
 #   e2e_sandbox_exec_stdin <sandbox> -- <cmd> [args...]
 #       Like e2e_sandbox_exec but pipes the caller's stdin into the
@@ -52,10 +51,6 @@ _e2e_sbex_parse() {
 e2e_sandbox_exec() {
   _e2e_sbex_parse "$@" || return $?
   e2e_env_trace "sandbox:exec" "${_E2E_SBEX_SB_NAME}" "${_E2E_SBEX_CMD[*]}"
-  if e2e_env_is_dry_run; then
-    echo "[dry-run] sandbox_exec ${_E2E_SBEX_SB_NAME} -- ${_E2E_SBEX_CMD[*]} (skipped)"
-    return 0
-  fi
   if ! command -v openshell >/dev/null 2>&1; then
     echo "e2e_sandbox_exec: openshell CLI not on PATH" >&2
     return 127
@@ -70,12 +65,6 @@ e2e_sandbox_exec() {
 e2e_sandbox_exec_stdin() {
   _e2e_sbex_parse "$@" || return $?
   e2e_env_trace "sandbox:exec_stdin" "${_E2E_SBEX_SB_NAME}" "${_E2E_SBEX_CMD[*]}"
-  if e2e_env_is_dry_run; then
-    # Consume stdin so the caller's pipeline doesn't SIGPIPE.
-    cat >/dev/null 2>&1 || true
-    echo "[dry-run] sandbox_exec_stdin ${_E2E_SBEX_SB_NAME} -- ${_E2E_SBEX_CMD[*]} (skipped)"
-    return 0
-  fi
   if ! command -v openshell >/dev/null 2>&1; then
     echo "e2e_sandbox_exec_stdin: openshell CLI not on PATH" >&2
     return 127
diff --git a/test/e2e-scenario/validation_suites/smoke/00-cli-available.sh b/test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
index e56925b1f9..ab733f039d 100755
--- a/test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
+++ b/test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
@@ -18,11 +18,6 @@ echo "smoke:cli-available"
 
 e2e_context_require E2E_SCENARIO
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would check that nemoclaw CLI is on PATH"
-  exit 0
-fi
-
 if ! command -v nemoclaw >/dev/null 2>&1; then
   echo "smoke:cli-available: nemoclaw CLI not on PATH" >&2
   exit 1
diff --git a/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh b/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
index b92dc33e8a..d27e8b7a24 100755
--- a/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
+++ b/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
@@ -4,7 +4,6 @@
 #
 # smoke step: sandbox-shell
 # Verifies that OpenShell can execute a trivial command inside the sandbox.
-# Honors E2E_DRY_RUN.
 
 set -euo pipefail
 
@@ -18,11 +17,6 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../runtime/lib" && pwd)"
 echo "smoke:sandbox-shell"
 e2e_context_require E2E_SANDBOX_NAME
 
-if e2e_env_is_dry_run; then
-  echo "[dry-run] would run: openshell sandbox exec --name <sandbox> -- echo ok"
-  exit 0
-fi
-
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 output="$(openshell sandbox exec --name "${name}" -- echo ok 2>&1)"
 echo "${output}"
diff --git a/tools/e2e-scenarios/workflow-boundary.mts b/tools/e2e-scenarios/workflow-boundary.mts
index 3eba39a9c3..1e1069271d 100644
--- a/tools/e2e-scenarios/workflow-boundary.mts
+++ b/tools/e2e-scenarios/workflow-boundary.mts
@@ -49,6 +49,13 @@ function requireRunContains(errors: string[], step: WorkflowStep | undefined, ex
   }
 }
 
+function requireRunDoesNotContain(errors: string[], step: WorkflowStep | undefined, forbidden: string): void {
+  if (!step) return;
+  if (stringValue(step.run).includes(forbidden)) {
+    errors.push(`step '${step.name ?? "<unnamed>"}' run script must not include ${forbidden}`);
+  }
+}
+
 export function validateE2eScenariosWorkflowBoundary(
   workflowPath = DEFAULT_WORKFLOW_PATH,
 ): string[] {
@@ -92,12 +99,18 @@ export function validateE2eScenariosWorkflowBoundary(
   const normalRun = requireStep(errors, steps, "Run typed scenarios");
   requireRunContains(errors, normalRun, "npx tsx test/e2e-scenario/scenarios/run.ts");
   requireRunContains(errors, normalRun, "--scenarios");
-  requireRunContains(errors, normalRun, "--dry-run");
+  // The TS runner has one execution mode: live. Workflows must not pass
+  // --dry-run, --plan-only, or --validate-only — they hide real test runs.
+  requireRunDoesNotContain(errors, normalRun, "--dry-run");
+  requireRunDoesNotContain(errors, normalRun, "--plan-only");
+  requireRunDoesNotContain(errors, normalRun, "--validate-only");
 
   const wslRun = requireStep(errors, steps, "Run typed scenarios in WSL");
   requireRunContains(errors, wslRun, "npx tsx test/e2e-scenario/scenarios/run.ts");
   requireRunContains(errors, wslRun, "--scenarios");
-  requireRunContains(errors, wslRun, "--dry-run");
+  requireRunDoesNotContain(errors, wslRun, "--dry-run");
+  requireRunDoesNotContain(errors, wslRun, "--plan-only");
+  requireRunDoesNotContain(errors, wslRun, "--validate-only");
 
   const upload = requireStep(errors, steps, "Upload scenario artifacts");
   const uploadWith = asRecord(upload?.with);

From a5aabd79bfe78fe5a719ea3bf3e227fb800711c8 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 21:21:23 -0400
Subject: [PATCH 02/23] fix(e2e): flush log writeStream before evidence read;
 replace source-shape client check

- PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish
  before resolving so callers (and tests) reading evidence synchronously
  see the actual stdout/stderr instead of an empty file. Race exposed by
  e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'.
- e2e-phase-orchestrators: replace the client-source toMatch regex (one
  source-shape test against a budget of 0) with a runtime-shape
  behavior assertion on the HostCliClient observation. This still
  enforces 'clients do not encode pass/fail or retry/timeout semantics'
  per the hybrid-scenario E2E architecture spec, without exceeding the
  source-shape budget.
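
The flush-before-resolve idea can be sketched in isolation as below.
This is a minimal illustration, not the code in phase.ts: finishLog
and writeEvidence here are hypothetical stand-ins, and the temp
directory layout is an assumption made only for the example.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: end the stream and resolve only once it has
// finished (or errored), so a later synchronous read sees all output.
function finishLog(stream: fs.WriteStream): Promise<void> {
  return new Promise((resolve) => {
    if (stream.writableFinished || stream.destroyed) {
      resolve();
      return;
    }
    stream.once("finish", () => resolve());
    stream.once("error", () => resolve());
    stream.end();
  });
}

// Write some output, wait for the flush, then read the file back
// synchronously, the way a caller would read an evidence log.
async function writeEvidence(text: string): Promise<string> {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-log-"));
  const logPath = path.join(dir, "step.log");
  const stream = fs.createWriteStream(logPath);
  stream.write(text);
  await finishLog(stream);
  const seen = fs.readFileSync(logPath, "utf8");
  fs.rmSync(dir, { recursive: true, force: true });
  return seen;
}
```

Calling stream.end() without awaiting 'finish' is what allows the
empty-file race: end() only schedules the final flush.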

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 26 ++++++--
 .../scenarios/orchestrators/phase.ts          | 59 ++++++++++++-------
 2 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index e6a899ae5a..4840b8adcd 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -222,13 +222,29 @@ describe("phase orchestrators - real shell execution", () => {
 
 describe("clients are pass/fail/policy free", () => {
   it("test_should_keep_clients_free_of_pass_fail_and_retry_semantics", () => {
-    const source = fs.readFileSync(
-      path.join(REPO_ROOT, "test/e2e-scenario/scenarios/clients/host-cli.ts"),
-      "utf8",
-    );
     const observation = new HostCliClient().observeVersion();
 
+    // The client returns a raw act/observe shape only: the command it would
+    // run. It must NOT decide pass/fail, attach retry policy, surface a
+    // classifier, or expose AssertionResult/PhaseResult-shaped fields.
     expect(observation).toEqual(expect.objectContaining({ command: ["nemoclaw", "--version"] }));
-    expect(source).not.toMatch(/AssertionResult|PhaseResult|retry|timeout|passed|failed/);
+    // Raw act/observe fields are allowed (exitCode/stdout/stderr/timing).
+    // Pass/fail and reliability-policy fields are not.
+    const forbiddenKeys = [
+      "status",
+      "attempts",
+      "classifier",
+      "evidence",
+      "retry",
+      "timeout",
+      "timeoutSeconds",
+      "phase",
+      "assertions",
+      "passed",
+      "failed",
+    ];
+    for (const key of forbiddenKeys) {
+      expect(observation).not.toHaveProperty(key);
+    }
   });
 });
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index 220a8426c7..b5abb107be 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -211,37 +211,54 @@ export class PhaseOrchestrator {
         }, 5_000).unref();
       }, timeoutSeconds * 1_000);
 
+      // Wait for the log writeStream to fully flush before resolving so
+      // callers can synchronously read the evidence file. Without this, the
+      // 'close' event on the child fires before the WriteStream finishes
+      // draining, and tests/orchestrators see an empty log file.
+      const finishLog = (): Promise<void> =>
+        new Promise((res) => {
+          if ((logStream as unknown as { closed?: boolean }).closed) {
+            res();
+            return;
+          }
+          logStream.once("finish", () => res());
+          logStream.once("error", () => res());
+          logStream.end();
+        });
+
       child.on("error", (err) => {
         clearTimeout(timeout);
-        logStream.end();
-        resolve({
-          status: "failed",
-          message: `shell step ${step.id} spawn error: ${err.message}`,
-          evidence: logPath,
-        });
+        void finishLog().then(() =>
+          resolve({
+            status: "failed",
+            message: `shell step ${step.id} spawn error: ${err.message}`,
+            evidence: logPath,
+          }),
+        );
       });
 
       child.on("close", (code, signal) => {
         clearTimeout(timeout);
-        logStream.end();
-        if (timedOut) {
+        void finishLog().then(() => {
+          if (timedOut) {
+            resolve({
+              status: "failed",
+              classifier: "runner-infra",
+              message: `shell step ${step.id} exceeded ${timeoutSeconds}s (signal=${signal ?? "SIGTERM"})`,
+              evidence: logPath,
+            });
+            return;
+          }
+          if (code === 0) {
+            resolve({ status: "passed", evidence: logPath });
+            return;
+          }
           resolve({
             status: "failed",
-            classifier: "runner-infra",
-            message: `shell step ${step.id} exceeded ${timeoutSeconds}s (signal=${signal ?? "SIGTERM"})`,
+            classifier: classifierForRef(ref),
+            message: `shell step ${step.id} exit ${code ?? "null"}: ${stderrTail.split("\n").slice(-3).join(" | ").trim()}`,
             evidence: logPath,
           });
-          return;
-        }
-        if (code === 0) {
-          resolve({ status: "passed", evidence: logPath });
-          return;
-        }
-        resolve({
-          status: "failed",
-          classifier: classifierForRef(ref),
-          message: `shell step ${step.id} exit ${code ?? "null"}: ${stderrTail.split("\n").slice(-3).join(" | ").trim()}`,
-          evidence: logPath,
         });
       });
     });

From 903f90bfc4f189ae8797b9470b57a8f36abbedec Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 21:29:37 -0400
Subject: [PATCH 03/23] fix(e2e): drop extra trailing newline
 (end-of-file-fixer)

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts | 1 -
 1 file changed, 1 deletion(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts b/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
index a6944c9f64..0111aa0e42 100644
--- a/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
@@ -202,4 +202,3 @@ suites:
 // run-scenario.sh-based plan-only tests removed: the bash runner is
 // now a fail-fast stub. Equivalent coverage of the typed runner lives in
 // e2e-plan-compiler.test.ts and e2e-scenario-registry.test.ts.
-

From 628870ceb8a55971d273b131c6b8adca5fb3d9da Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 22:02:00 -0400
Subject: [PATCH 04/23] feat(e2e): wire install/onboard into typed phase
 orchestrators

Closes the spec's reopened Phase 6 gap. The new typed runner now
executes the install and onboarding work that the deleted bash runner
used to perform, but inside EnvironmentOrchestrator and
OnboardingOrchestrator instead of in workflow YAML or a resurrected
bash runner. All canonical scenarios now reach a real, live SUT before
their assertions run.

Architecture (per hybrid-scenario-e2e-architecture spec):

* types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef,
  fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace
  the prior actions: string[] free-form labels with PhaseAction[]. Add
  actions[] to PhaseResult so failure-layer attribution stays clear:
  setup failure is recorded distinctly from assertion failure.

* compiler.ts: phaseActions() now emits typed actions for environment
  (context.emit + install.<id>) and onboarding (profile.<id>). Stable
  action ids: environment.context.emit,
  environment.install.<install-id>, onboarding.profile.<profile-id>.
  All install/onboard actions point at the existing dispatcher scripts
  (install/dispatch.sh, onboard/dispatch.sh): shell remains the
  implementation per spec, but invocation is centralized.

* orchestrators/phase.ts: PhaseOrchestrator.run() executes actions
  before assertions. Action failure short-circuits the phase so
  assertions never run against an environment that was never set up.
  Action runner reuses the same spawn/timeout/process-group/log-flush
  machinery as runShellStep. Per-action timeout, no retry (install and
  onboarding must fail loudly).

* nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only
  new shell file). The install/onboard dispatchers are intentionally
  library-style (function definitions only); this launcher gives them
  a deterministic executable entrypoint that sources runtime/lib/env.sh
  + runtime/lib/context.sh, applies non-interactive env, sources the
  requested dispatcher, and invokes the named function with one arg.
  Replaces the orchestration that the deleted run-scenario.sh used to
  do, but called from the typed orchestrator instead.

* plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>'
  per phase, before assertion groups. Maintainers can preview the full
  setup+onboard+assert sequence before dispatch.

* framework-tests/e2e-phase-orchestrators.test.ts: add five behavior
  tests covering: actions run before assertions, action failure
  short-circuits assertions, action timeout via orchestrator policy,
  evidence log flushed before resolve, and the compiler emitting typed
  install/onboard actions for all 7 canonical scenarios.
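
The actions-before-assertions rule described above can be sketched
roughly as follows. The shapes are deliberately simplified and
hypothetical (ActionSketch, StepSketch, PhaseResultSketch and
runPhaseSketch are illustration-only names); the real PhaseAction and
PhaseResult in types.ts carry more fields, and the real runner spawns
shell rather than calling in-process functions.

```typescript
type Status = "passed" | "failed";

// Hypothetical, simplified stand-in for a typed phase action.
interface ActionSketch {
  id: string; // e.g. "environment.install.<install-id>"
  run: () => Promise<Status>;
}

// Hypothetical, simplified stand-in for an assertion step.
interface StepSketch {
  id: string;
  run: () => Promise<Status>;
}

interface PhaseResultSketch {
  status: Status;
  actions: { id: string; status: Status }[];
  assertions: { id: string; status: Status }[];
}

async function runPhaseSketch(
  actions: ActionSketch[],
  steps: StepSketch[],
): Promise<PhaseResultSketch> {
  const result: PhaseResultSketch = { status: "passed", actions: [], assertions: [] };

  // Setup actions run first, in order, with no retry.
  for (const action of actions) {
    const status = await action.run();
    result.actions.push({ id: action.id, status });
    if (status === "failed") {
      // Short-circuit: assertions never run against a missing setup.
      result.status = "failed";
      return result;
    }
  }

  // Assertions run only after every action has passed.
  for (const step of steps) {
    const status = await step.run();
    result.assertions.push({ id: step.id, status });
    if (status === "failed") result.status = "failed";
  }
  return result;
}
```

Keeping actions and assertions in separate arrays is what lets a
reader tell a setup failure from an assertion failure at a glance.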

What stays out:

* No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still
  invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios
  ...'. Workflow YAML stays innocent of install/onboard plumbing.
* No client edits. HostCliClient et al. remain pass/fail/policy free.
* No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter
  remain unsupported.

The validation gate (Phase 6 reopen note) is the next step: after this
push goes green on PR CI, dispatch e2e-scenarios-all.yaml against
feat/e2e-real-execution and confirm that canonical scenarios produce
real phase results with action evidence under .e2e/actions/<id>.log,
instead of finishing in <1s with 'failed=34 skipped=5'.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 169 +++++++++++++++++-
 .../nemoclaw_scenarios/dispatch-action.sh     |  72 ++++++++
 test/e2e-scenario/scenarios/compiler.ts       |  89 ++++++++-
 .../scenarios/orchestrators/phase.ts          | 163 ++++++++++++++++-
 test/e2e-scenario/scenarios/types.ts          |  44 ++++-
 5 files changed, 520 insertions(+), 17 deletions(-)
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 4840b8adcd..729c8cbd04 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -10,7 +10,14 @@ import { HostCliClient } from "../scenarios/clients/host-cli.ts";
 import { compileRunPlans } from "../scenarios/compiler.ts";
 import { PhaseOrchestrator } from "../scenarios/orchestrators/phase.ts";
 import { ScenarioRunner } from "../scenarios/orchestrators/runner.ts";
-import type { AssertionStep, PhaseName, PhaseResult, RunContext, RunPlanPhase } from "../scenarios/types.ts";
+import type {
+  AssertionStep,
+  PhaseAction,
+  PhaseName,
+  PhaseResult,
+  RunContext,
+  RunPlanPhase,
+} from "../scenarios/types.ts";
 
 const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
 
@@ -59,6 +66,37 @@ function writeTempScript(dir: string, name: string, body: string): string {
   return p;
 }
 
+function shellAction(
+  id: string,
+  phase: PhaseName,
+  scriptRef: string,
+  opts: { timeoutSeconds?: number; arg?: string } = {},
+): PhaseAction {
+  return {
+    id,
+    phase,
+    kind: "shell",
+    scriptRef,
+    arg: opts.arg,
+    timeoutSeconds: opts.timeoutSeconds,
+  };
+}
+
+function makePhaseWithActions(
+  phase: PhaseName,
+  actions: PhaseAction[],
+  steps: AssertionStep[],
+): RunPlanPhase {
+  return {
+    name: phase,
+    actions,
+    assertionGroups:
+      steps.length > 0
+        ? [{ id: `group.${steps[0].id}`, phase, migrationStatus: "complete", steps }]
+        : [],
+  };
+}
+
 describe("phase orchestrators - top-level delegation", () => {
   it("test_should_execute_phase_assertions_from_phase_orchestrators_not_top_level_runner", async () => {
     const ctx = freshCtx();
@@ -68,7 +106,7 @@ describe("phase orchestrators - top-level delegation", () => {
       const fakeOrchestrator = (phase: PhaseName) => ({
         run: async (_ctx: RunContext, runPhase: RunPlanPhase, _prior?: PhaseResult[]): Promise<PhaseResult> => {
           calls.push(runPhase.name);
-          return { phase, status: "passed", assertions: [] };
+          return { phase, status: "passed", actions: [], assertions: [] };
         },
       });
       const runner = new ScenarioRunner({
@@ -220,6 +258,133 @@ describe("phase orchestrators - real shell execution", () => {
   });
 });
 
+describe("phase orchestrators - actions execute before assertions", () => {
+  it("phase_action_runs_before_assertions_and_records_evidence", async () => {
+    const ctx = freshCtx();
+    try {
+      const actionScript = writeTempScript(ctx.contextDir, "setup.sh", "echo phase-action-evidence");
+      const action = shellAction("environment.setup-ok", "environment", path.relative(REPO_ROOT, actionScript));
+      const stepScript = writeTempScript(ctx.contextDir, "after.sh", "echo after-action");
+      const step = shellStep("environment.assert-ok", "environment", path.relative(REPO_ROOT, stepScript));
+      const orchestrator = new PhaseOrchestrator("environment");
+
+      const result = await orchestrator.run(ctx, makePhaseWithActions("environment", [action], [step]));
+
+      expect(result.status).toBe("passed");
+      expect(result.actions).toHaveLength(1);
+      expect(result.actions[0]).toEqual(
+        expect.objectContaining({ id: "environment.setup-ok", status: "passed" }),
+      );
+      expect(result.actions[0].evidence).toBeTruthy();
+      const actionLog = fs.readFileSync(result.actions[0].evidence!, "utf8");
+      expect(actionLog).toContain("phase-action-evidence");
+      expect(result.assertions).toHaveLength(1);
+      expect(result.assertions[0].status).toBe("passed");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("phase_action_failure_short_circuits_assertions", async () => {
+    const ctx = freshCtx();
+    try {
+      const failScript = writeTempScript(ctx.contextDir, "fail.sh", 'echo "setup boom" >&2; exit 5');
+      const action = shellAction("environment.setup-fail", "environment", path.relative(REPO_ROOT, failScript));
+      const stepScript = writeTempScript(ctx.contextDir, "after.sh", "echo should-not-run");
+      const step = shellStep("environment.never-runs", "environment", path.relative(REPO_ROOT, stepScript));
+      const orchestrator = new PhaseOrchestrator("environment");
+
+      const result = await orchestrator.run(ctx, makePhaseWithActions("environment", [action], [step]));
+
+      expect(result.status).toBe("failed");
+      expect(result.actions).toHaveLength(1);
+      expect(result.actions[0].status).toBe("failed");
+      expect(result.actions[0].message).toMatch(/exit 5/);
+      // Assertions must NOT have run, so they must NOT show a misleading
+      // pass for an environment that was never set up.
+      expect(result.assertions).toEqual([]);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("phase_action_times_out_via_orchestrator_policy", async () => {
+    const ctx = freshCtx();
+    try {
+      const slow = writeTempScript(ctx.contextDir, "slow.sh", "sleep 30");
+      const action = shellAction("environment.setup-slow", "environment", path.relative(REPO_ROOT, slow), {
+        timeoutSeconds: 1,
+      });
+      const orchestrator = new PhaseOrchestrator("environment");
+
+      const started = Date.now();
+      const result = await orchestrator.run(ctx, makePhaseWithActions("environment", [action], []));
+
+      expect(result.status).toBe("failed");
+      expect(result.actions[0].status).toBe("failed");
+      expect(result.actions[0].message).toMatch(/exceeded 1s/);
+      // The orchestrator must enforce the timeout, not depend on the
+      // script self-killing. Allow some headroom but fail if we waited
+      // anywhere near the script's 30s sleep.
+      expect(Date.now() - started).toBeLessThan(15_000);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("phase_action_evidence_log_is_flushed_before_resolve", async () => {
+    const ctx = freshCtx();
+    try {
+      const actionScript = writeTempScript(ctx.contextDir, "flush.sh", "echo flushed-phase-action-output");
+      const action = shellAction("environment.flush", "environment", path.relative(REPO_ROOT, actionScript));
+      const orchestrator = new PhaseOrchestrator("environment");
+
+      const result = await orchestrator.run(ctx, makePhaseWithActions("environment", [action], []));
+
+      // Synchronous read must already see the output - the orchestrator
+      // must wait for the WriteStream's 'finish' before resolving.
+      const log = fs.readFileSync(result.actions[0].evidence!, "utf8");
+      expect(log).toContain("flushed-phase-action-output");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("plan compiler emits phase actions for canonical scenarios", () => {
+  it("compiler_emits_install_and_onboard_actions_for_canonical_scenarios", async () => {
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const ids = [
+      "ubuntu-repo-cloud-openclaw",
+      "ubuntu-repo-cloud-hermes",
+      "gpu-repo-local-ollama-openclaw",
+      "macos-repo-cloud-openclaw",
+      "wsl-repo-cloud-openclaw",
+      "brev-launchable-cloud-openclaw",
+      "ubuntu-no-docker-preflight-negative",
+    ];
+    const plans = compileRunPlans(ids);
+    expect(plans).toHaveLength(ids.length);
+    for (const plan of plans) {
+      const env = plan.phases.find((p) => p.name === "environment")!;
+      const onb = plan.phases.find((p) => p.name === "onboarding")!;
+      expect(env.actions.map((a) => a.id)).toContain("environment.context.emit");
+      expect(env.actions.some((a) => a.id.startsWith("environment.install."))).toBe(true);
+      expect(onb.actions.some((a) => a.id.startsWith("onboarding.profile."))).toBe(true);
+      // Every install/onboard action must be a typed shell-fn referencing
+      // the canonical dispatcher script - no free-form strings.
+      for (const action of [...env.actions, ...onb.actions]) {
+        if (action.id.startsWith("environment.install.") || action.id.startsWith("onboarding.profile.")) {
+          expect(action.kind).toBe("shell-fn");
+          expect(action.scriptRef).toMatch(/dispatch\.sh$/);
+          expect(action.fn).toMatch(/^e2e_(install|onboard)$/);
+          expect(action.arg).toBeTruthy();
+        }
+      }
+    }
+  });
+});
+
 describe("clients are pass/fail/policy free", () => {
   it("test_should_keep_clients_free_of_pass_fail_and_retry_semantics", () => {
     const observation = new HostCliClient().observeVersion();
diff --git a/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh b/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
new file mode 100755
index 0000000000..f83f6fca1b
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
@@ -0,0 +1,72 @@
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Phase-action launcher for the hybrid scenario E2E framework.
+#
+# The phase orchestrators (EnvironmentOrchestrator, OnboardingOrchestrator)
+# call this launcher to invoke a function defined in a sourced shell
+# dispatcher (install/dispatch.sh or onboard/dispatch.sh). Those
+# dispatchers are intentionally library-style (function definitions
+# only); this script gives them a deterministic executable entrypoint
+# the typed runner can spawn.
+#
+# Usage:
+#   dispatch-action.sh <fn> <arg> <dispatcher-script>
+#
+# Examples:
+#   dispatch-action.sh e2e_install repo-current \
+#     test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
+#
+#   dispatch-action.sh e2e_onboard cloud-openclaw \
+#     test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
+#
+# Environment (set by the orchestrator):
+#   E2E_CONTEXT_DIR  artifact directory
+#   E2E_PHASE        environment | onboarding
+#   E2E_ACTION_ID    stable action id, used for trace/log correlation
+
+set -euo pipefail
+
+if [[ $# -lt 3 ]]; then
+  echo "dispatch-action.sh: usage: <fn> <arg> <dispatcher-script>" >&2
+  exit 2
+fi
+
+ACTION_FN="$1"
+ACTION_ARG="$2"
+DISPATCHER="$3"
+
+if [[ ! -f "${DISPATCHER}" ]]; then
+  echo "dispatch-action.sh: dispatcher script not found: ${DISPATCHER}" >&2
+  exit 2
+fi
+
+# Source the runtime/lib helpers the dispatchers (and their workers) rely on.
+RUNTIME_LIB="$(cd "$(dirname "${BASH_SOURCE[0]}")/../runtime/lib" && pwd)"
+# shellcheck source=runtime/lib/env.sh
+. "${RUNTIME_LIB}/env.sh"
+# shellcheck source=runtime/lib/context.sh
+. "${RUNTIME_LIB}/context.sh"
+
+# Apply the standard non-interactive env on every action. Each action runs
+# in its own bash process spawned by the orchestrator, so exports do not
+# carry over between actions. e2e_env_apply_noninteractive is idempotent.
+e2e_env_apply_noninteractive
+e2e_env_trace "phase:${E2E_PHASE:-unknown}/action:${E2E_ACTION_ID:-unknown}"
+
+# Make sure the context directory exists; e2e_context_init is also
+# idempotent and safe to call from any phase.
+e2e_context_init || true
+
+# Source the dispatcher last so its function definitions are in scope
+# when we invoke the requested function.
+# shellcheck source=/dev/null
+. "${DISPATCHER}"
+
+if ! declare -F "${ACTION_FN}" >/dev/null 2>&1; then
+  echo "dispatch-action.sh: function not found in dispatcher: ${ACTION_FN}" >&2
+  exit 2
+fi
+
+"${ACTION_FN}" "${ACTION_ARG}"
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index 5046c77dd2..cbb095019f 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -6,7 +6,15 @@ import path from "node:path";
 import { fileURLToPath } from "node:url";
 import { loadManifest } from "./manifests.ts";
 import { requireScenarios } from "./registry.ts";
-import type { AssertionGroup, NemoClawInstanceManifest, PhaseName, RunPlan, ScenarioDefinition, SutBoundary } from "./types.ts";
+import type {
+  AssertionGroup,
+  NemoClawInstanceManifest,
+  PhaseAction,
+  PhaseName,
+  RunPlan,
+  ScenarioDefinition,
+  SutBoundary,
+} from "./types.ts";
 
 const PHASES: PhaseName[] = ["environment", "onboarding", "runtime"];
 const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../..");
@@ -67,17 +75,72 @@ function validateManifestCompatibility(scenario: ScenarioDefinition, manifest?:
   }
 }
 
-function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): string[] {
+// Centralized paths to the existing shell helpers. Spec rule: shell
+// scripts can remain as implementations, but invocation goes through
+// typed assertion/action definitions, not bare workflow YAML or a
+// resurrected bash runner.
+const INSTALL_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh";
+const ONBOARD_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh";
+const CONTEXT_EMIT = "test/e2e-scenario/nemoclaw_scenarios/helpers/emit-context-from-plan.sh";
+
+// Default action timeouts. Install and onboarding can take a while on
+// cold runners (Docker pulls, image builds, sandbox bootstrap).
+const INSTALL_TIMEOUT_SECONDS = 900;
+const ONBOARD_TIMEOUT_SECONDS = 900;
+const CONTEXT_EMIT_TIMEOUT_SECONDS = 30;
+
+function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseAction[] {
   if (phase === "environment") {
+    const installId = scenario.environment?.install;
+    if (!installId) {
+      // No install dimension defined - the scenario is malformed. Emit no
+      // actions rather than a bogus one; nothing raises a hard error here.
+      return [];
+    }
     return [
-      `install:${scenario.environment?.install ?? "unknown"}`,
-      `runtime:${scenario.environment?.runtime ?? "unknown"}`,
+      {
+        id: "environment.context.emit",
+        phase: "environment",
+        description: "Emit .e2e/context.env from the resolved run plan.",
+        kind: "shell",
+        scriptRef: CONTEXT_EMIT,
+        timeoutSeconds: CONTEXT_EMIT_TIMEOUT_SECONDS,
+        evidencePath: `.e2e/actions/environment.context.emit.log`,
+      },
+      {
+        id: `environment.install.${installId}`,
+        phase: "environment",
+        description: `Run e2e_install ${installId} to set up the host control plane.`,
+        kind: "shell-fn",
+        scriptRef: INSTALL_DISPATCH,
+        fn: "e2e_install",
+        arg: installId,
+        timeoutSeconds: INSTALL_TIMEOUT_SECONDS,
+        evidencePath: `.e2e/actions/environment.install.${installId}.log`,
+      },
     ];
   }
   if (phase === "onboarding") {
-    return [`onboard:${scenario.environment?.onboarding ?? "unknown"}`];
+    const onboardingId = scenario.environment?.onboarding;
+    if (!onboardingId) {
+      return [];
+    }
+    return [
+      {
+        id: `onboarding.profile.${onboardingId}`,
+        phase: "onboarding",
+        description: `Run e2e_onboard ${onboardingId} to bring the gateway and sandbox online.`,
+        kind: "shell-fn",
+        scriptRef: ONBOARD_DISPATCH,
+        fn: "e2e_onboard",
+        arg: onboardingId,
+        timeoutSeconds: ONBOARD_TIMEOUT_SECONDS,
+        evidencePath: `.e2e/actions/onboarding.profile.${onboardingId}.log`,
+      },
+    ];
   }
-  return (scenario.suiteIds ?? []).map((suiteId) => `suite:${suiteId}`);
+  // Runtime phase has no actions; suites are assertion groups.
+  return [];
 }
 
 const SUT_BOUNDARIES: SutBoundary[] = [
@@ -112,7 +175,7 @@ export function compileRunPlans(inputs: Array<string | ScenarioDefinition>): Run
     const plan: RunPlan = {
       scenarioId: scenario.id,
       status: "compiled",
-      note: "compiled plan-only preview; live execution lands in later phases",
+      note: "compiled plan; phase orchestrators execute actions then assertions",
       manifestPath: scenario.manifestPath,
       manifest,
       environment: scenario.environment,
@@ -182,6 +245,18 @@ export function renderPlanText(plans: RunPlan[]): string {
     }
     for (const phase of plan.phases) {
       lines.push(`Phase: ${phase.name}`);
+      for (const action of phase.actions) {
+        const policy: string[] = [];
+        if (action.timeoutSeconds) {
+          policy.push(`timeout=${action.timeoutSeconds}s`);
+        }
+        const target = action.kind === "shell-fn"
+          ? `${action.fn ?? ""}${action.arg ? ` ${action.arg}` : ""}`.trim()
+          : action.scriptRef;
+        const policySuffix = policy.length > 0 ? ` (${policy.join(", ")})` : "";
+        const targetSuffix = target ? ` -> ${target}` : "";
+        lines.push(`  Action: ${action.id}${policySuffix}${targetSuffix}`);
+      }
       for (const group of phase.assertionGroups) {
         lines.push(`  Group: ${group.id}`);
         for (const step of group.steps) {
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index b5abb107be..72e2b95ec4 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -8,6 +8,8 @@ import { fileURLToPath } from "node:url";
 import type {
   AssertionResult,
   AssertionStep,
+  PhaseAction,
+  PhaseActionResult,
   PhaseName,
   PhaseResult,
   RunContext,
@@ -45,20 +47,167 @@ export class PhaseOrchestrator {
   constructor(private readonly phaseName: PhaseName) {}
 
   async run(ctx: RunContext, phase: RunPlanPhase): Promise<PhaseResult> {
+    const actions: PhaseActionResult[] = [];
+    let actionFailed = false;
+    for (const action of phase.actions) {
+      const actionResult = await this.runAction(ctx, action);
+      actions.push(actionResult);
+      if (actionResult.status === "failed") {
+        actionFailed = true;
+        // Spec failure-layer rule: setup failure must not let assertions
+        // run and accidentally pass. Stop the phase here.
+        break;
+      }
+    }
     const assertions: AssertionResult[] = [];
-    for (const group of phase.assertionGroups) {
-      for (const step of group.steps) {
-        assertions.push(await this.runStep(ctx, step));
+    if (!actionFailed) {
+      for (const group of phase.assertionGroups) {
+        for (const step of group.steps) {
+          assertions.push(await this.runStep(ctx, step));
+        }
       }
     }
-    const failed = assertions.some((assertion) => assertion.status === "failed");
-    const allSkipped = assertions.length > 0 && assertions.every((assertion) => assertion.status === "skipped");
-    const status: PhaseResult["status"] = failed ? "failed" : allSkipped ? "skipped" : "passed";
-    const result: PhaseResult = { phase: this.phaseName, status, assertions };
+    const assertionsFailed = assertions.some((assertion) => assertion.status === "failed");
+    const allSkipped =
+      !actionFailed &&
+      assertions.length > 0 &&
+      assertions.every((assertion) => assertion.status === "skipped");
+    let status: PhaseResult["status"];
+    if (actionFailed || assertionsFailed) {
+      status = "failed";
+    } else if (allSkipped || (actions.length === 0 && assertions.length === 0)) {
+      status = "skipped";
+    } else {
+      status = "passed";
+    }
+    const result: PhaseResult = { phase: this.phaseName, status, actions, assertions };
     this.writePhaseResult(ctx, result);
     return result;
   }
 
+  private async runAction(ctx: RunContext, action: PhaseAction): Promise<PhaseActionResult> {
+    const startedAt = Date.now();
+    const scriptPath = path.isAbsolute(action.scriptRef)
+      ? action.scriptRef
+      : path.resolve(REPO_ROOT, action.scriptRef);
+    if (!fs.existsSync(scriptPath)) {
+      return {
+        id: action.id,
+        status: "failed",
+        durationMs: Date.now() - startedAt,
+        message: `phase action ${action.id} script not found: ${scriptPath}`,
+      };
+    }
+    const timeoutSeconds = action.timeoutSeconds ?? DEFAULT_STEP_TIMEOUT_SECONDS;
+    const logDir = path.join(ctx.contextDir, ".e2e", "actions");
+    fs.mkdirSync(logDir, { recursive: true });
+    const logPath = path.join(logDir, `${action.id}.log`);
+
+    // Compose the bash invocation. shell-fn goes through the
+    // dispatch-action.sh launcher, which sources the dispatcher and calls
+    // the named function with its single positional arg; shell executes
+    // the script directly. Both are spawned as plain `bash <script> ...`.
+    const dispatchAction = path.join(REPO_ROOT, "test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh");
+    const useDispatchLauncher = action.kind === "shell-fn" && fs.existsSync(dispatchAction);
+    const bashArgs: string[] = useDispatchLauncher
+      ? [dispatchAction, action.fn ?? "", action.arg ?? "", scriptPath]
+      : [scriptPath, ...(action.arg ? [action.arg] : [])];
+
+    const env: NodeJS.ProcessEnv = {
+      ...process.env,
+      E2E_CONTEXT_DIR: ctx.contextDir,
+      E2E_PHASE: action.phase,
+      E2E_ACTION_ID: action.id,
+    };
+
+    return await new Promise<PhaseActionResult>((resolve) => {
+      const child = spawn("bash", bashArgs, { env, cwd: REPO_ROOT, detached: true });
+      const pgid = child.pid;
+      const logStream = fs.createWriteStream(logPath);
+      let stderrTail = "";
+      child.stdout.pipe(logStream, { end: false });
+      child.stderr.pipe(logStream, { end: false });
+      child.stderr.on("data", (chunk: Buffer) => {
+        stderrTail = (stderrTail + chunk.toString("utf8")).slice(-4096);
+      });
+
+      const killGroup = (signal: NodeJS.Signals) => {
+        if (typeof pgid !== "number") {
+          child.kill(signal);
+          return;
+        }
+        try {
+          process.kill(-pgid, signal);
+        } catch {
+          /* group already gone */
+        }
+      };
+
+      let timedOut = false;
+      const timeout = setTimeout(() => {
+        timedOut = true;
+        killGroup("SIGTERM");
+        setTimeout(() => {
+          if (!child.killed) {
+            killGroup("SIGKILL");
+          }
+        }, 5_000).unref();
+      }, timeoutSeconds * 1_000);
+
+      const finishLog = (): Promise<void> =>
+        new Promise((res) => {
+          if ((logStream as unknown as { closed?: boolean }).closed) {
+            res();
+            return;
+          }
+          logStream.once("finish", () => res());
+          logStream.once("error", () => res());
+          logStream.end();
+        });
+
+      child.on("error", (err) => {
+        clearTimeout(timeout);
+        void finishLog().then(() =>
+          resolve({
+            id: action.id,
+            status: "failed",
+            durationMs: Date.now() - startedAt,
+            evidence: logPath,
+            message: `phase action ${action.id} spawn error: ${err.message}`,
+          }),
+        );
+      });
+
+      child.on("close", (code, signal) => {
+        clearTimeout(timeout);
+        void finishLog().then(() => {
+          const durationMs = Date.now() - startedAt;
+          if (timedOut) {
+            resolve({
+              id: action.id,
+              status: "failed",
+              durationMs,
+              evidence: logPath,
+              message: `phase action ${action.id} exceeded ${timeoutSeconds}s (signal=${signal ?? "SIGTERM"})`,
+            });
+            return;
+          }
+          if (code === 0) {
+            resolve({ id: action.id, status: "passed", durationMs, evidence: logPath });
+            return;
+          }
+          resolve({
+            id: action.id,
+            status: "failed",
+            durationMs,
+            evidence: logPath,
+            message: `phase action ${action.id} exit ${code ?? "null"}: ${stderrTail.split("\n").slice(-3).join(" | ").trim()}`,
+          });
+        });
+      });
+    });
+  }
+
   private async runStep(ctx: RunContext, step: AssertionStep): Promise<AssertionResult> {
     const startedAt = Date.now();
     const rawAttempts = step.reliability?.retry?.attempts;
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index c83464a5e9..a6282a32f5 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -100,9 +100,39 @@ export interface ScenarioDefinition {
   expectedFailure?: Record<string, unknown>;
 }
 
+// A phase action is real, deterministic setup work the phase orchestrator
+// performs BEFORE running its assertions: install nemoclaw, run
+// onboarding, emit context.env, etc. Actions short-circuit assertions on
+// failure (assertions don't run if the action they depend on failed).
+//
+// Spec ownership: phase orchestrators own actions. The top-level runner
+// must not execute actions; clients must not embed action policy.
+export interface PhaseAction {
+  id: string;
+  phase: PhaseName;
+  description?: string;
+  // "shell-fn" sources the bash dispatcher and invokes the named function.
+  // "shell"    runs an executable script (used for context-emit helper).
+  kind: "shell-fn" | "shell";
+  // Repo-relative path to the script.
+  scriptRef: string;
+  // For "shell-fn": the bash function to invoke after sourcing scriptRef.
+  fn?: string;
+  // Single positional arg passed to the function/script (install method or
+  // onboarding profile id today). Kept as a single string to keep stable
+  // ids predictable; multi-arg variants can extend this later.
+  arg?: string;
+  // Per-action timeout. No retry by default - install/onboard must fail
+  // loudly so the regression is visible. Retry stays a property of
+  // assertion steps, not actions.
+  timeoutSeconds?: number;
+  // Evidence log path hint; runAction derives the real path from ctx.contextDir.
+  evidencePath?: string;
+}
+
 export interface RunPlanPhase {
   name: PhaseName;
-  actions: string[];
+  actions: PhaseAction[];
   assertionGroups: AssertionGroup[];
 }
 
@@ -138,8 +168,20 @@ export interface AssertionResult {
   message?: string;
 }
 
+export interface PhaseActionResult {
+  id: string;
+  status: "passed" | "failed" | "skipped";
+  durationMs: number;
+  evidence?: string;
+  message?: string;
+}
+
 export interface PhaseResult {
   phase: PhaseName;
   status: "passed" | "failed" | "skipped";
+  // Action results are recorded distinctly from assertion results so
+  // failure-layer attribution stays unambiguous: a failure in actions
+  // means setup never completed; assertions did not have a fair chance.
+  actions: PhaseActionResult[];
   assertions: AssertionResult[];
 }

From f81cf5a73a41dae4279e9c0450a7e3ba3d877163 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 22:19:37 -0400
Subject: [PATCH 05/23] fix(e2e): emit context.env in TS framework +
 cross-phase short-circuit

The first live dispatch of the Phase 6 wiring (run 26550310438) gave us
real action evidence and surfaced three real bugs. All three are fixed
inside the spec's prescribed layers - no workflow YAML, no client, no
old-resolver path.

1. environment.context.emit was a shell action that called the legacy
   emit-context-from-plan.sh helper. That helper expects the OLD
   YAML-resolver plan.json shape (dimensions.platform.profile.os...),
   which the typed compiler does not produce. Drop the shell action;
   add scenarios/orchestrators/context.ts that derives a normalized
   context.env directly from the typed RunPlan and writes it from
   ScenarioRunner.run() before any phase. Spec: context emission is
   framework infrastructure, not a phase action.

2. PhaseOrchestrator.runShellStep was reading context.env from
   ${ctx.contextDir}/.e2e/context.env, but the shell helper writes
   to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so
   shell assertions see seeded keys.

3. ScenarioRunner did not short-circuit across phase boundaries: a
   failed environment ACTION (real setup work) still let onboarding
   and runtime run, producing a misleading 34-failure cascade.
   Runner now consults prior phase results: if any prior action
   failed, downstream phases are synthesized as skipped with a
   message naming the blocking phase+action+message. Assertion-only
   failures still propagate as failures.
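
A minimal sketch of the cross-phase rule in item 3, for illustration
only. It is not the runner.ts code from this patch: findBlocker,
skippedPhase, runPhases, the "<phase>.blocked" assertion id, and the
exact message layout are hypothetical, and the loop is synchronous to
stay short where the real runner awaits each orchestrator.

```typescript
type Status = "passed" | "failed" | "skipped";
type PhaseName = "environment" | "onboarding" | "runtime";

interface ActionResult {
  id: string;
  status: Status;
  durationMs: number;
  message?: string;
}

interface AssertionResult {
  id: string;
  status: Status;
  attempts: number;
  durationMs: number;
  message?: string;
}

interface PhaseResult {
  phase: PhaseName;
  status: Status;
  actions: ActionResult[];
  assertions: AssertionResult[];
}

interface Blocker {
  phase: PhaseName;
  action: ActionResult;
}

// Only failed ACTIONS block later phases. A phase that failed purely on
// assertions has no failed action, so it never produces a blocker.
function findBlocker(prior: PhaseResult[]): Blocker | undefined {
  for (const result of prior) {
    const action = result.actions.find((a) => a.status === "failed");
    if (action) {
      return { phase: result.phase, action };
    }
  }
  return undefined;
}

// Synthesize a skipped result that names the blocking phase, action,
// and message (hypothetical message layout).
function skippedPhase(phase: PhaseName, blocker: Blocker): PhaseResult {
  const detail = blocker.action.message ?? "no message";
  const message = `blocked by prior failure: ${blocker.phase} ${blocker.action.id}: ${detail}`;
  return {
    phase,
    status: "skipped",
    actions: [],
    assertions: [{ id: `${phase}.blocked`, status: "skipped", attempts: 0, durationMs: 0, message }],
  };
}

// Run phases in order; once an action has failed, later phases are not
// invoked at all and get a synthesized skipped result instead.
function runPhases(order: PhaseName[], run: (phase: PhaseName) => PhaseResult): PhaseResult[] {
  const results: PhaseResult[] = [];
  for (const phase of order) {
    const blocker = findBlocker(results);
    results.push(blocker ? skippedPhase(phase, blocker) : run(phase));
  }
  return results;
}
```

Because a synthesized skipped phase carries no actions, the original
failed action stays the blocker for every later phase.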

Tests added (8 new, 292/292 scenario framework tests green).

Validation gate next: dispatch e2e-scenarios-all.yaml again.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 144 +++++++++++++++++-
 test/e2e-scenario/scenarios/compiler.ts       |  11 --
 .../scenarios/orchestrators/context.ts        | 108 +++++++++++++
 .../scenarios/orchestrators/phase.ts          |   9 +-
 .../scenarios/orchestrators/runner.ts         |  68 +++++++--
 5 files changed, 313 insertions(+), 27 deletions(-)
 create mode 100644 test/e2e-scenario/scenarios/orchestrators/context.ts

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 729c8cbd04..4993a11a4c 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -368,9 +368,13 @@ describe("plan compiler emits phase actions for canonical scenarios", () => {
     for (const plan of plans) {
       const env = plan.phases.find((p) => p.name === "environment")!;
       const onb = plan.phases.find((p) => p.name === "onboarding")!;
-      expect(env.actions.map((a) => a.id)).toContain("environment.context.emit");
       expect(env.actions.some((a) => a.id.startsWith("environment.install."))).toBe(true);
       expect(onb.actions.some((a) => a.id.startsWith("onboarding.profile."))).toBe(true);
+      // context.env emission is framework infrastructure (ScenarioRunner),
+      // not a shell action. The compiler must NOT emit a shell context
+      // action - if it did we'd be coupling back to the old resolver's
+      // plan.json shape.
+      expect(env.actions.map((a) => a.id)).not.toContain("environment.context.emit");
       // Every install/onboard action must be a typed shell-fn referencing
       // the canonical dispatcher script - no free-form strings.
       for (const action of [...env.actions, ...onb.actions]) {
@@ -385,6 +389,144 @@ describe("plan compiler emits phase actions for canonical scenarios", () => {
   });
 });
 
+describe("ScenarioRunner seeds context.env and short-circuits across phases", () => {
+  it("seedContextEnv_writes_normalized_keys_at_top_level_context_env_path", async () => {
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const { seedContextEnv } = await import("../scenarios/orchestrators/context.ts");
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      const result = seedContextEnv(ctx, plan);
+
+      // Path matches the shell helper's e2e_context_init: top-level,
+      // not under .e2e/. Runtime steps source ${E2E_CONTEXT_DIR}/context.env.
+      expect(result.path).toBe(path.join(ctx.contextDir, "context.env"));
+      const body = fs.readFileSync(result.path, "utf8");
+      // Required keys downstream shell assertions look up.
+      expect(body).toMatch(/^E2E_SCENARIO=ubuntu-repo-cloud-openclaw$/m);
+      expect(body).toMatch(/^E2E_PLATFORM_OS=ubuntu$/m);
+      expect(body).toMatch(/^E2E_AGENT=openclaw$/m);
+      expect(body).toMatch(/^E2E_PROVIDER=nvidia$/m);
+      expect(body).toMatch(/^E2E_GATEWAY_URL=http:\/\/127\.0\.0\.1:18789$/m);
+      expect(body).toMatch(/^E2E_SANDBOX_NAME=e2e-ubuntu-repo-cloud-openclaw$/m);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("hermes_scenario_seeds_hermes_gateway_url", async () => {
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const { seedContextEnv } = await import("../scenarios/orchestrators/context.ts");
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-hermes"]);
+      const result = seedContextEnv(ctx, plan);
+      const body = fs.readFileSync(result.path, "utf8");
+      expect(body).toMatch(/^E2E_AGENT=hermes$/m);
+      expect(body).toMatch(/^E2E_GATEWAY_URL=http:\/\/127\.0\.0\.1:8642$/m);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("runner_skips_downstream_phases_when_prior_phase_action_fails", async () => {
+    const { ScenarioRunner } = await import("../scenarios/orchestrators/runner.ts");
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      // Inject a failing environment phase to simulate an install action
+      // failure. Onboarding and runtime must report skipped, not run
+      // their own actions or assertions.
+      const failingEnv = {
+        run: async () => ({
+          phase: "environment" as const,
+          status: "failed" as const,
+          actions: [
+            {
+              id: "environment.install.repo-current",
+              status: "failed" as const,
+              durationMs: 5,
+              message: "simulated install failure",
+            },
+          ],
+          assertions: [],
+        }),
+      };
+      let onboardingCalled = false;
+      let runtimeCalled = false;
+      const onboarding = {
+        run: async () => {
+          onboardingCalled = true;
+          return { phase: "onboarding" as const, status: "passed" as const, actions: [], assertions: [] };
+        },
+      };
+      const runtime = {
+        run: async () => {
+          runtimeCalled = true;
+          return { phase: "runtime" as const, status: "passed" as const, actions: [], assertions: [] };
+        },
+      };
+      const runner = new ScenarioRunner({ environment: failingEnv, onboarding, runtime });
+
+      const results = await runner.run(ctx, plan);
+
+      // Downstream orchestrators must NOT have been invoked.
+      expect(onboardingCalled).toBe(false);
+      expect(runtimeCalled).toBe(false);
+      // Each phase still has a result, and the downstream ones are
+      // skipped with a message that names the blocking action.
+      expect(results.map((r) => r.phase)).toEqual(["environment", "onboarding", "runtime"]);
+      expect(results[1].status).toBe("skipped");
+      expect(results[2].status).toBe("skipped");
+      expect(results[1].assertions[0].message).toMatch(/blocked by prior failure/);
+      expect(results[1].assertions[0].message).toMatch(/environment.install.repo-current/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("runner_does_not_short_circuit_on_assertion_failure_only", async () => {
+    // Assertion failures (as opposed to action failures) must not block
+    // downstream phases - reviewers need to see all failure layers.
+    const { ScenarioRunner } = await import("../scenarios/orchestrators/runner.ts");
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      const env = {
+        run: async () => ({
+          phase: "environment" as const,
+          status: "failed" as const,
+          actions: [],
+          assertions: [
+            { id: "environment.something", status: "failed" as const, attempts: 1, durationMs: 1 },
+          ],
+        }),
+      };
+      let onboardingCalled = false;
+      const onboarding = {
+        run: async () => {
+          onboardingCalled = true;
+          return { phase: "onboarding" as const, status: "passed" as const, actions: [], assertions: [] };
+        },
+      };
+      const runner = new ScenarioRunner({
+        environment: env,
+        onboarding,
+        runtime: {
+          run: async () => ({ phase: "runtime" as const, status: "passed" as const, actions: [], assertions: [] }),
+        },
+      });
+
+      await runner.run(ctx, plan);
+      expect(onboardingCalled).toBe(true);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+});
+
 describe("clients are pass/fail/policy free", () => {
   it("test_should_keep_clients_free_of_pass_fail_and_retry_semantics", () => {
     const observation = new HostCliClient().observeVersion();
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index cbb095019f..e3bc2d46bf 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -81,13 +81,11 @@ function validateManifestCompatibility(scenario: ScenarioDefinition, manifest?:
 // resurrected bash runner.
 const INSTALL_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh";
 const ONBOARD_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh";
-const CONTEXT_EMIT = "test/e2e-scenario/nemoclaw_scenarios/helpers/emit-context-from-plan.sh";
 
 // Default action timeouts. Install and onboarding can take a while on
 // cold runners (Docker pulls, image builds, sandbox bootstrap).
 const INSTALL_TIMEOUT_SECONDS = 900;
 const ONBOARD_TIMEOUT_SECONDS = 900;
-const CONTEXT_EMIT_TIMEOUT_SECONDS = 30;
 
 function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseAction[] {
   if (phase === "environment") {
@@ -98,15 +96,6 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
       return [];
     }
     return [
-      {
-        id: "environment.context.emit",
-        phase: "environment",
-        description: "Emit .e2e/context.env from the resolved run plan.",
-        kind: "shell",
-        scriptRef: CONTEXT_EMIT,
-        timeoutSeconds: CONTEXT_EMIT_TIMEOUT_SECONDS,
-        evidencePath: `.e2e/actions/environment.context.emit.log`,
-      },
       {
         id: `environment.install.${installId}`,
         phase: "environment",
diff --git a/test/e2e-scenario/scenarios/orchestrators/context.ts b/test/e2e-scenario/scenarios/orchestrators/context.ts
new file mode 100644
index 0000000000..35394121fc
--- /dev/null
+++ b/test/e2e-scenario/scenarios/orchestrators/context.ts
@@ -0,0 +1,108 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import fs from "node:fs";
+import path from "node:path";
+import type { RunContext, RunPlan } from "../types.ts";
+
+// Spec ownership: emitting the normalized context.env that downstream
+// shell helpers consume is FRAMEWORK INFRASTRUCTURE, not a phase action.
+// Doing it as a shell action coupled the typed runner back to the old
+// resolver's plan.json shape; doing it here keeps the typed RunPlan as
+// the single source of truth.
+//
+// We seed context.env with values derivable from the typed RunPlan
+// (scenario id, install method, agent/provider/route, default sandbox
+// name and gateway URL). Onboarding helpers may overwrite these via
+// e2e_context_set (e.g. assigning a real sandbox name, real gateway
+// URL after the gateway boots).
+
+function platformOsFromManifest(plan: RunPlan): string {
+  const explicit = plan.manifest?.spec.setup.platform.os;
+  if (typeof explicit === "string" && explicit.length > 0) {
+    return explicit;
+  }
+  // Fall back to the scenario environment platform id ("ubuntu-local",
+  // "macos-local", "wsl-local", "gpu-runner", "brev-launchable").
+  const platform = plan.environment?.platform ?? "";
+  if (platform.startsWith("macos")) return "macos";
+  if (platform.startsWith("wsl")) return "wsl";
+  if (platform.startsWith("brev")) return "ubuntu";
+  if (platform.startsWith("gpu")) return "ubuntu";
+  return "ubuntu";
+}
+
+function executionTargetFromManifest(plan: RunPlan): string {
+  const explicit = plan.manifest?.spec.setup.platform.executionTarget;
+  if (typeof explicit === "string" && explicit.length > 0) {
+    return explicit;
+  }
+  return plan.environment?.platform === "brev-launchable" ? "remote" : "local";
+}
+
+function containerEngine(plan: RunPlan): string {
+  const explicit = plan.manifest?.spec.setup.runtime.containerEngine;
+  return typeof explicit === "string" && explicit.length > 0 ? explicit : "docker";
+}
+
+function containerDaemon(plan: RunPlan): string {
+  const explicit = plan.manifest?.spec.setup.runtime.containerDaemon;
+  if (typeof explicit === "string" && explicit.length > 0) {
+    return explicit;
+  }
+  return plan.environment?.runtime === "docker-missing" ? "missing" : "running";
+}
+
+function defaultGatewayUrl(agent: string): string {
+  // Mirrors the historical defaults from emit-context-from-plan.sh so
+  // existing shell helpers see the same seed values they used to.
+  return agent === "hermes" ? "http://127.0.0.1:8642" : "http://127.0.0.1:18789";
+}
+
+function escapeContextValue(value: string): string {
+  // The context library accepts plain `KEY=value` lines without quoting.
+  // Reject newlines (would corrupt the file) and otherwise pass through.
+  if (value.includes("\n")) {
+    throw new Error(`context.env value must not contain newline: ${JSON.stringify(value)}`);
+  }
+  return value;
+}
+
+export interface ContextSeedResult {
+  path: string;
+  keys: string[];
+}
+
+export function seedContextEnv(ctx: RunContext, plan: RunPlan): ContextSeedResult {
+  const onboarding = plan.manifest?.spec.onboarding;
+  const agent = onboarding?.agent ?? "openclaw";
+  const provider = onboarding?.provider ?? "nvidia";
+  const inferenceRoute = onboarding?.modelRoute ?? "inference-local";
+  const onboardingPath = plan.environment?.onboarding ?? "unknown";
+  const installMethod = plan.environment?.install ?? "unknown";
+
+  const entries: Record<string, string> = {
+    E2E_SCENARIO: plan.scenarioId,
+    E2E_PLATFORM_OS: platformOsFromManifest(plan),
+    E2E_EXECUTION_TARGET: executionTargetFromManifest(plan),
+    E2E_INSTALL_METHOD: installMethod,
+    E2E_CONTAINER_ENGINE: containerEngine(plan),
+    E2E_CONTAINER_DAEMON: containerDaemon(plan),
+    E2E_ONBOARDING_PATH: onboardingPath,
+    E2E_AGENT: agent,
+    E2E_PROVIDER: provider,
+    E2E_INFERENCE_ROUTE: inferenceRoute,
+    E2E_SANDBOX_NAME: `e2e-${plan.scenarioId}`,
+    E2E_GATEWAY_URL: defaultGatewayUrl(agent),
+  };
+
+  // Path matches the shell helper's e2e_context_init: ${E2E_CONTEXT_DIR}/context.env
+  const contextPath = path.join(ctx.contextDir, "context.env");
+  fs.mkdirSync(ctx.contextDir, { recursive: true });
+  const lines = Object.entries(entries)
+    .map(([key, value]) => `${key}=${escapeContextValue(value)}`)
+    .join("\n");
+  fs.writeFileSync(contextPath, `${lines}\n`);
+
+  return { path: contextPath, keys: Object.keys(entries) };
+}
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index 72e2b95ec4..940df6a6ac 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -298,9 +298,12 @@ export class PhaseOrchestrator {
       E2E_STEP_ID: step.id,
       E2E_PHASE: step.phase,
     };
-    // Surface scenario-derived context (E2E_SANDBOX_NAME, E2E_GATEWAY_URL,
-    // etc.) that the environment+onboarding phases wrote into context.env.
-    const contextEnvPath = path.join(ctx.contextDir, ".e2e", "context.env");
+    // Surface scenario-derived context (E2E_SCENARIO, E2E_SANDBOX_NAME,
+    // E2E_GATEWAY_URL, etc.) that the framework wrote at the start of the
+    // run and that environment+onboarding phases extended via
+    // e2e_context_set. The shell context library writes to
+    // ${E2E_CONTEXT_DIR}/context.env, NOT to ${E2E_CONTEXT_DIR}/.e2e/.
+    const contextEnvPath = path.join(ctx.contextDir, "context.env");
     if (fs.existsSync(contextEnvPath)) {
       const contextEnv = fs.readFileSync(contextEnvPath, "utf8");
       for (const line of contextEnv.split("\n")) {
diff --git a/test/e2e-scenario/scenarios/orchestrators/runner.ts b/test/e2e-scenario/scenarios/orchestrators/runner.ts
index 6ab3b76c62..228d32d452 100644
--- a/test/e2e-scenario/scenarios/orchestrators/runner.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/runner.ts
@@ -1,7 +1,8 @@
 // SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
 
-import type { PhaseResult, RunContext, RunPlan, RunPlanPhase } from "../types.ts";
+import type { PhaseActionResult, PhaseResult, RunContext, RunPlan, RunPlanPhase } from "../types.ts";
+import { seedContextEnv } from "./context.ts";
 import { EnvironmentOrchestrator } from "./environment.ts";
 import { OnboardingOrchestrator } from "./onboarding.ts";
 import { RuntimeOrchestrator } from "./runtime.ts";
@@ -28,22 +29,65 @@ export class ScenarioRunner {
   }
 
   async run(ctx: RunContext, plan: RunPlan): Promise<PhaseResult[]> {
+    // Seed context.env from the typed RunPlan once, before any phase
+    // runs. Spec ownership: framework infrastructure (the runner), not
+    // a shell action. Onboarding may extend context.env via
+    // e2e_context_set; the runtime phase reads whatever is on disk.
+    seedContextEnv(ctx, plan);
+
     const results: PhaseResult[] = [];
     for (const phase of plan.phases) {
-      if (phase.name === "environment") {
-        results.push(await this.environment.run(ctx, phase, results));
-        continue;
-      }
-      if (phase.name === "onboarding") {
-        results.push(await this.onboarding.run(ctx, phase, results));
+      const blocked = blockingPriorResult(results);
+      if (blocked) {
+        // Cross-phase short-circuit: the previous phase's setup work
+        // failed, so this phase cannot meaningfully run. Synthesize a
+        // skipped PhaseResult with a clear reason so artifacts stay
+        // honest (no false greens, no <1s assertion explosion).
+        results.push({
+          phase: phase.name,
+          status: "skipped",
+          actions: [],
+          assertions: [
+            {
+              id: `${phase.name}.blocked`,
+              status: "skipped",
+              attempts: 0,
+              durationMs: 0,
+              message: `phase blocked by prior failure: ${blocked.phase} action ${blocked.action.id} failed (${blocked.action.message ?? "no message"})`,
+            },
+          ],
+        });
         continue;
       }
-      if (phase.name === "runtime") {
-        results.push(await this.runtime.run(ctx, phase, results));
-        continue;
-      }
-      throw new Error(`Unsupported phase: ${String(phase.name)}`);
+      const orchestrator = this.orchestratorFor(phase.name);
+      results.push(await orchestrator.run(ctx, phase, results));
     }
     return results;
   }
+
+  private orchestratorFor(name: RunPlanPhase["name"]): PhaseRunner {
+    if (name === "environment") return this.environment;
+    if (name === "onboarding") return this.onboarding;
+    if (name === "runtime") return this.runtime;
+    throw new Error(`Unsupported phase: ${String(name)}`);
+  }
+}
+
+interface BlockingFailure {
+  phase: PhaseResult["phase"];
+  action: PhaseActionResult;
+}
+
+function blockingPriorResult(results: PhaseResult[]): BlockingFailure | undefined {
+  // A phase action failure (real setup work didn't succeed) blocks
+  // downstream phases. Assertion failures do NOT block downstream
+  // phases - they are expected to be reported alongside other phase
+  // results so reviewers can see all failure layers at once.
+  for (const result of results) {
+    const failedAction = result.actions.find((action) => action.status === "failed");
+    if (failedAction) {
+      return { phase: result.phase, action: failedAction };
+    }
+  }
+  return undefined;
 }

From 2dc94399f304464f04e606aa9eaf1ee79824ecc3 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 23:10:38 -0400
Subject: [PATCH 06/23] fix(e2e): stop dispatcher from truncating context.env;
 alias onboard.log

Run 26550936453 surfaced two more real bugs after Phase 6 wiring went
live. Both fixed inside the spec's prescribed layers; nothing leaks
into workflow YAML or clients.

1. dispatch-action.sh called e2e_context_init unconditionally before
   sourcing the install/onboard dispatcher. e2e_context_init opens
   context.env with `: > ctx`, which truncated the file the
   ScenarioRunner had just seeded. All runtime assertions then failed
   with 'e2e context: missing required key(s): E2E_SCENARIO ...'.
   Fix: dispatch-action.sh no longer calls e2e_context_init. The TS
   framework owns context.env initialization; workers may still extend
   it via e2e_context_set.

2. The legacy onboarding.preflight.passed assertion expects an
   onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash
   runner used to redirect onboarding output there; the typed
   orchestrator captured it under .e2e/actions/<action-id>.log. Fix:
   add optional aliasPath to PhaseAction; compiler sets aliasPath to
   'onboard.log' for the onboarding profile action; orchestrator
   copies the action evidence log to the alias on success. Best-
   effort - alias copy failures do not fail the action.
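
The truncation in item 1 can be reproduced in isolation. The sketch
below is illustrative only: truncatingInit and contextSet are
hypothetical stand-ins for e2e_context_init and e2e_context_set, not
the real shell helpers, and the seeded values are made up.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for the truncating init (`: > ctx` in shell).
function truncatingInit(file: string): void {
  fs.writeFileSync(file, "");
}

// Hypothetical stand-in for e2e_context_set: replace or append one key,
// preserving every other line already in the file.
function contextSet(file: string, key: string, value: string): void {
  const lines = fs.existsSync(file)
    ? fs.readFileSync(file, "utf8").split("\n").filter((line) => line.length > 0)
    : [];
  const kept = lines.filter((line) => !line.startsWith(`${key}=`));
  kept.push(`${key}=${value}`);
  fs.writeFileSync(file, `${kept.join("\n")}\n`);
}

// Return the keys currently present in a KEY=value file.
function readKeys(file: string): string[] {
  return fs
    .readFileSync(file, "utf8")
    .split("\n")
    .filter((line) => line.includes("="))
    .map((line) => line.slice(0, line.indexOf("=")));
}

// Seed two keys (made-up values), then either extend in place or
// re-initialize before extending, and report which keys survive.
function demo(mode: "extend" | "reinit"): string[] {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "ctx-demo-"));
  const file = path.join(dir, "context.env");
  try {
    fs.writeFileSync(file, "E2E_SCENARIO=demo\nE2E_GATEWAY_URL=http://127.0.0.1:18789\n");
    if (mode === "reinit") truncatingInit(file);
    contextSet(file, "E2E_SANDBOX_NAME", "sb1");
    return readKeys(file);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

In this sketch demo("extend") keeps the seeded keys and adds the new
one, while demo("reinit") is left with only the key written after the
truncation, which is the shape of the 'missing required key(s)'
failure described above.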

Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):
- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1

Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294
scenario framework. shellcheck clean.

Validation gate next: redispatch e2e-scenarios-all and confirm
runtime steps actually exercise the SUT (real pass/fail, not key
errors).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 28 +++++++++++++++++++
 .../nemoclaw_scenarios/dispatch-action.sh     |  9 ++++--
 test/e2e-scenario/scenarios/compiler.ts       |  3 ++
 .../scenarios/orchestrators/phase.ts          | 15 ++++++++++
 test/e2e-scenario/scenarios/types.ts          |  5 ++++
 5 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 4993a11a4c..a587c50112 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -332,6 +332,29 @@ describe("phase orchestrators - actions execute before assertions", () => {
     }
   });
 
+  it("phase_action_publishes_alias_path_on_success", async () => {
+    const ctx = freshCtx();
+    try {
+      const actionScript = writeTempScript(ctx.contextDir, "alias.sh", "echo aliased-output");
+      const action: PhaseAction = {
+        id: "onboarding.profile.alias-demo",
+        phase: "onboarding",
+        kind: "shell",
+        scriptRef: path.relative(REPO_ROOT, actionScript),
+        aliasPath: "onboard.log",
+      };
+      const orchestrator = new PhaseOrchestrator("onboarding");
+
+      const result = await orchestrator.run(ctx, makePhaseWithActions("onboarding", [action], []));
+
+      expect(result.actions[0].status).toBe("passed");
+      const aliasContents = fs.readFileSync(path.join(ctx.contextDir, "onboard.log"), "utf8");
+      expect(aliasContents).toContain("aliased-output");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
   it("phase_action_evidence_log_is_flushed_before_resolve", async () => {
     const ctx = freshCtx();
     try {
@@ -375,6 +398,11 @@ describe("plan compiler emits phase actions for canonical scenarios", () => {
       // action - if it did we'd be coupling back to the old resolver's
       // plan.json shape.
       expect(env.actions.map((a) => a.id)).not.toContain("environment.context.emit");
+      // Onboarding action must publish a stable alias path so legacy
+      // shell assertions referencing ${E2E_CONTEXT_DIR}/onboard.log
+      // keep working without coupling them to action ids.
+      const onboardingAction = onb.actions.find((a) => a.id.startsWith("onboarding.profile."));
+      expect(onboardingAction?.aliasPath).toBe("onboard.log");
       // Every install/onboard action must be a typed shell-fn referencing
       // the canonical dispatcher script - no free-form strings.
       for (const action of [...env.actions, ...onb.actions]) {
diff --git a/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh b/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
index f83f6fca1b..5aaca1b2c1 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
@@ -55,9 +55,12 @@ RUNTIME_LIB="$(cd "$(dirname "${BASH_SOURCE[0]}")/../runtime/lib" && pwd)"
 e2e_env_apply_noninteractive
 e2e_env_trace "phase:${E2E_PHASE:-unknown}/action:${E2E_ACTION_ID:-unknown}"
 
-# Make sure the context directory exists; e2e_context_init is also
-# idempotent and safe to call from any phase.
-e2e_context_init || true
+# IMPORTANT: do NOT call e2e_context_init here. The TS framework
+# (ScenarioRunner.seedContextEnv) is the single owner of context.env
+# initialization for the run; e2e_context_init opens with `: > ctx`
+# which would truncate the file and wipe seeded keys (E2E_SCENARIO,
+# E2E_GATEWAY_URL, ...) that runtime assertions require.
+# Workers may still call e2e_context_set to extend context.env in place.
 
 # Source the dispatcher last so its function definitions are in scope
 # when we invoke the requested function.
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index e3bc2d46bf..3feab952f7 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -125,6 +125,9 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
         arg: onboardingId,
         timeoutSeconds: ONBOARD_TIMEOUT_SECONDS,
         evidencePath: `.e2e/actions/onboarding.profile.${onboardingId}.log`,
+        // Legacy preflight assertions look for ${E2E_CONTEXT_DIR}/onboard.log;
+        // publish a stable alias so they keep working without rewiring.
+        aliasPath: "onboard.log",
       },
     ];
   }
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index 940df6a6ac..df45279eea 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -193,6 +193,21 @@ export class PhaseOrchestrator {
             return;
           }
           if (code === 0) {
+            // Publish the action's evidence log under a stable alias for
+            // legacy assertions that reference fixed filenames
+            // (onboard.log, install.log, ...). Best-effort; alias copy
+            // failures do not fail the action.
+            if (action.aliasPath) {
+              try {
+                const aliasFull = path.isAbsolute(action.aliasPath)
+                  ? action.aliasPath
+                  : path.join(ctx.contextDir, action.aliasPath);
+                fs.mkdirSync(path.dirname(aliasFull), { recursive: true });
+                fs.copyFileSync(logPath, aliasFull);
+              } catch {
+                /* alias is a convenience; never fail action on copy */
+              }
+            }
             resolve({ id: action.id, status: "passed", durationMs, evidence: logPath });
             return;
           }
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index a6282a32f5..084973679d 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -128,6 +128,11 @@ export interface PhaseAction {
   timeoutSeconds?: number;
   // Repo-relative evidence log path.
   evidencePath?: string;
+  // Optional stable alias the orchestrator copies the evidence log to
+  // after a successful action. Lets legacy shell assertions that
+  // reference well-known filenames (e.g. ${E2E_CONTEXT_DIR}/onboard.log)
+  // keep working without coupling them to the action's stable id.
+  aliasPath?: string;
 }
 
 export interface RunPlanPhase {

From 0aa107a4e7298bca7591b711b14d22bef584ad23 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Wed, 27 May 2026 23:56:05 -0400
Subject: [PATCH 07/23] fix(e2e): tighten preflight-passed regex to actual
 failure markers

The first canonical-scenario run that reached the onboarding-assertion
phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the
legacy onboarding.preflight.passed assertion fails on every successful
run because its regex matches any mention of 'docker' / 'container' /
'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding
mentions all of those many times.

The action itself succeeded (exit 0, 263s of real onboarding work);
the assertion is meant to confirm onboard.log does not contain
explicit preflight FAILURE markers. Tighten the regex accordingly:
match phrases like 'preflight failed/error', 'cannot connect to the
docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker
daemon' - not bare topic words.
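
The difference between the two patterns can be checked outside the
shell. This is a rough sketch, not the assertion itself: the patterns
are transcribed from the grep -Ei expressions with POSIX [[:space:]]
approximated as \s, hasFailureMarker is a hypothetical helper, and the
sample log lines are invented for illustration.

```typescript
// Old pattern: any mention of a topic word counts as a failure.
const oldPattern = /preflight.*(fail|error)|docker|container|daemon|socket/i;

// New pattern: only explicit failure phrases count.
const newPattern =
  /preflight\s+(failed|error)|cannot connect to\s+(the\s+)?docker daemon|permission denied\s+while trying to connect to.*docker.*sock|onboarding aborted|FATAL: docker|ERROR: docker daemon/i;

// Hypothetical helper: true if any single line matches, as grep -q
// would report for a line-oriented search.
function hasFailureMarker(pattern: RegExp, log: string): boolean {
  return log.split("\n").some((line) => pattern.test(line));
}

// Invented sample lines; real onboard.log content will differ.
const benignLog = "Starting docker container for sandbox\nGateway socket ready";
const failingLog = "Cannot connect to the Docker daemon at unix:///var/run/docker.sock";
```

With these samples, the old pattern flags benignLog as a failure and
the new pattern does not, while both flag failingLog.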

Verified: shellcheck passes; bash -n passes.

Why we stop here on this PR:

This commit lands the last small framework-level fix produced by live
action evidence. The Phase 6 wiring is now fully validated end-to-end:

  Install:         passed (~12s)
  Onboarding:      action passed (~263s real onboarding)
                   base.cli-installed passed
                   preflight.passed will now pass
  Runtime:         9 passed / 25 failed / 5 skipped against live SUT

The remaining 25 runtime failures are real product/test bugs surfaced
by finally executing the suite against a live SUT (sandbox-shell
timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations
exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after
retries). They are pre-existing and out of scope for 'execute real
shell assertions; delete dry-run, --validate-only, and the bash
runner'. They become productive follow-up issues.

The 5 skipped runtime steps are 'probe not registered' - known per
spec; probe registry lands in a follow-up.

Negative scenarios (ubuntu-no-docker-preflight-negative,
invalid-key-negative, gateway-port-conflict-negative) need expected-
failure semantics and a way to actually simulate docker-missing on
the runner. Out of scope here; tracked as follow-up.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../preflight/00-preflight-passed.sh                     | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh b/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
index 69bda6c47c..24267ee068 100755
--- a/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
+++ b/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
@@ -9,7 +9,14 @@ if [[ ! -f "${E2E_CONTEXT_DIR:-}/onboard.log" ]]; then
   exit 1
 fi
 
-if grep -Eiq "preflight.*(fail|error)|docker|container|daemon|socket" "${E2E_CONTEXT_DIR}/onboard.log"; then
+# The onboarding action already completed (exit 0) for this assertion to
+# run; we only need to confirm the captured onboard.log does not contain
+# explicit preflight FAILURE markers. The previous regex matched any
+# mention of 'docker' / 'container' / 'daemon' / 'socket', which a normal
+# successful onboarding always logs. Tighten to actual failure phrases.
+if grep -Eiq \
+    "preflight[[:space:]]+(failed|error)|cannot connect to[[:space:]]+(the[[:space:]]+)?docker daemon|permission denied[[:space:]]+while trying to connect to.*docker.*sock|onboarding aborted|FATAL: docker|ERROR: docker daemon" \
+    "${E2E_CONTEXT_DIR}/onboard.log"; then
   echo "FAIL: onboarding.preflight.passed - onboard log contains preflight failure evidence"
   exit 1
 fi

From 1d128a152674403fc411b56b715072dd5eb0226f Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 00:05:01 -0400
Subject: [PATCH 08/23] style(e2e): shfmt indentation on preflight-passed
 continuation lines

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../onboarding_assertions/preflight/00-preflight-passed.sh    | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh b/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
index 24267ee068..fb05606494 100755
--- a/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
+++ b/test/e2e-scenario/onboarding_assertions/preflight/00-preflight-passed.sh
@@ -15,8 +15,8 @@ fi
 # mention of 'docker' / 'container' / 'daemon' / 'socket', which a normal
 # successful onboarding always logs. Tighten to actual failure phrases.
 if grep -Eiq \
-    "preflight[[:space:]]+(failed|error)|cannot connect to[[:space:]]+(the[[:space:]]+)?docker daemon|permission denied[[:space:]]+while trying to connect to.*docker.*sock|onboarding aborted|FATAL: docker|ERROR: docker daemon" \
-    "${E2E_CONTEXT_DIR}/onboard.log"; then
+  "preflight[[:space:]]+(failed|error)|cannot connect to[[:space:]]+(the[[:space:]]+)?docker daemon|permission denied[[:space:]]+while trying to connect to.*docker.*sock|onboarding aborted|FATAL: docker|ERROR: docker daemon" \
+  "${E2E_CONTEXT_DIR}/onboard.log"; then
   echo "FAIL: onboarding.preflight.passed - onboard log contains preflight failure evidence"
   exit 1
 fi

From c359f8a45e3619ea60c519ef3adfc55d1cb5f266 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 00:22:09 -0400
Subject: [PATCH 09/23] fix(e2e): address CodeRabbit batch (3 quick wins)

All three findings are valid mechanical bugs introduced/touched by this
PR. Batched per the safe-batch policy: same-risk, independently
obvious, testable together.

1. orchestrators/phase.ts classifierForRef:
   Outer guard is /i (case-insensitive), but the inner branch used
   case-sensitive ref.includes("tunnel") / ref.includes("cloudflared")
   - mixed-case refs would fall through and misclassify as
   provider-transient. Replace with /tunnel|cloudflared/i.test(ref).

2. scenarios/compiler.ts phaseActions:
   Inline comment said "the scenario is malformed; surface it as a
   hard error" but the code returned []. Hard-fail instead, with a
   message that names the missing dimension. Empty environment is
   still tolerated (skeleton scenarios can carry no setup yet).

3. validation_suites/lib/sandbox_lifecycle.sh:
   Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely
   match sb10. Use awk with a whole-token equality check on column
   one of `nemoclaw list` output.
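
Findings 1 and 3 are both matching-precision bugs and can be sketched
side by side. The following is an illustration under assumptions, not
the shipped code: the function names are hypothetical, the TypeScript
whole-token check only mirrors the intent of the awk `$1 == n` test,
and the listing text is invented because the real `nemoclaw list`
output format is not shown here.

```typescript
// Finding 1: case-sensitive includes() misses mixed-case refs.
function isTunnelRefCaseSensitive(ref: string): boolean {
  return ref.includes("tunnel") || ref.includes("cloudflared");
}

// Case-insensitive version, matching the /i outer guard.
function isTunnelRef(ref: string): boolean {
  return /tunnel|cloudflared/i.test(ref);
}

// Finding 3: substring match lets "sb1" match a line listing "sb10".
function listContainsSubstring(output: string, name: string): boolean {
  return output.includes(name);
}

// Whole-token match on the first whitespace-separated column of each
// line, analogous to awk's `$1 == n`.
function listContainsSandbox(output: string, name: string): boolean {
  return output.split("\n").some((line) => line.trim().split(/\s+/)[0] === name);
}

// Invented listing; the real column layout is an assumption.
const sampleListing = "sb10  running\nsb2   stopped";
```

Here isTunnelRefCaseSensitive("Cloudflared-Tunnel-up") is false while
isTunnelRef returns true for the same ref, and
listContainsSubstring(sampleListing, "sb1") is true while
listContainsSandbox(sampleListing, "sb1") is false.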

Tests: 294/294 scenario framework still green. shellcheck + shfmt
clean. No behavior change for canonical scenarios; affected paths
were either dormant (case-mixed classifier) or returning a slightly
stricter outcome (compiler hard-fail, sandbox exact match).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 test/e2e-scenario/scenarios/compiler.ts       | 21 +++++++++++++------
 .../scenarios/orchestrators/phase.ts          |  5 ++++-
 .../lib/sandbox_lifecycle.sh                  |  5 ++++-
 3 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index 3feab952f7..e26afb9820 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -89,12 +89,18 @@ const ONBOARD_TIMEOUT_SECONDS = 900;
 
 function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseAction[] {
   if (phase === "environment") {
-    const installId = scenario.environment?.install;
-    if (!installId) {
-      // No install dimension defined - the scenario is malformed; surface
-      // it as a hard error rather than emitting a bogus action.
+    if (!scenario.environment) {
+      // Scenarios without any environment dimension (skeleton scenarios)
+      // legitimately have no actions yet. Don't fail-fast here.
       return [];
     }
+    const installId = scenario.environment.install;
+    if (!installId) {
+      // Environment is declared but install is missing - that IS a
+      // malformed scenario; fail fast so the caller sees a clear error
+      // rather than a phase that silently no-ops setup work.
+      throw new Error(`Scenario ${scenario.id} is missing environment.install`);
+    }
     return [
       {
         id: `environment.install.${installId}`,
@@ -110,10 +116,13 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
     ];
   }
   if (phase === "onboarding") {
-    const onboardingId = scenario.environment?.onboarding;
-    if (!onboardingId) {
+    if (!scenario.environment) {
       return [];
     }
+    const onboardingId = scenario.environment.onboarding;
+    if (!onboardingId) {
+      throw new Error(`Scenario ${scenario.id} is missing environment.onboarding`);
+    }
     return [
       {
         id: `onboarding.profile.${onboardingId}`,
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index df45279eea..0bb97f4e80 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -32,7 +32,10 @@ interface StepAttemptOutcome {
 // clients/scripts do not.
 function classifierForRef(ref: string): TransientClassifier {
   if (/provider|inference|chat-completion|cloudflared|tunnel/i.test(ref)) {
-    return ref.includes("tunnel") || ref.includes("cloudflared") ? "external-tunnel" : "provider-transient";
+    // Use case-insensitive matching here too; the outer guard is /i, so
+    // mixed-case refs (Tunnel, Cloudflared) must still classify as
+    // external-tunnel rather than fall through to provider-transient.
+    return /tunnel|cloudflared/i.test(ref) ? "external-tunnel" : "provider-transient";
   }
   if (/gateway/i.test(ref)) {
     return "gateway-transient";
diff --git a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
index 3b186093d1..aef572c190 100755
--- a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
+++ b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
@@ -59,7 +59,10 @@ sandbox_lifecycle_assert_nemoclaw_list_contains_sandbox() {
     sandbox_lifecycle_fail "${id}" "nemoclaw list failed"
     return 1
   }
-  [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" == *"${E2E_SANDBOX_NAME}"* ]] || {
+  # Match the sandbox name exactly as a whole token; substring match
+  # would let `sb1` falsely match `sb10`.
+  awk -v n="${E2E_SANDBOX_NAME}" '$1 == n { found = 1 } END { exit !found }' \
+    <<<"${SANDBOX_LIFECYCLE_LAST_OUTPUT}" || {
     sandbox_lifecycle_fail "${id}" "sandbox not listed: ${E2E_SANDBOX_NAME}"
     return 1
   }

From 34f3ca2e059d59916662b702c506b1d8ca25daac Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 07:02:35 -0400
Subject: [PATCH 10/23] fix(e2e): remove residual E2E_DRY_RUN branch and
 phase-1-skeleton refs

The PR body asserts that two things are absent from production paths:
  - E2E_DRY_RUN / e2e_env_is_dry_run
  - phase-1-skeleton
The PR Review Advisor caught counterexamples to both. Fix them now.

E2E_DRY_RUN: validation_suites/messaging/slack/00-slack-provider-state.sh
still branched on ${E2E_DRY_RUN:-} and emitted dry-run pass markers
without reading the OpenClaw config surface or querying runtime
discovery. Removed; the assertion now always reads real config and
queries OpenClaw runtime state. One execution mode.

phase-1-skeleton: scenarios/assertions/{environment,onboarding,runtime}.ts
each defined a 'kind: pending, ref: phase-1-skeleton' step that the
orchestrator silently skips. Of the three:
  - environmentBaseline() was in the registry and every scenario plan;
    environment work is performed by typed PhaseAction entries
    (install.<id>) from compiler.phaseActions(), and context.env is
    seeded by the ScenarioRunner, so the assertion group is
    redundant. Removed from registry and
    assertionGroupsForScenario(); deleted the file.
  - onboardingBaseline() and runtimeSmokeSkeleton() were never
    imported. Deleted as dead code.

Verification: rg of E2E_DRY_RUN, e2e_env_is_dry_run, phase-1-skeleton
across scenarios, runtime, validation_suites, onboarding_assertions,
nemoclaw_scenarios, and the scenario workflows returns nothing outside
framework-tests. 294/294 e2e-scenario framework tests pass. bash -n
clean on the touched shell script.

Addresses PR Review Advisor findings:
  - 'Residual E2E_DRY_RUN branch contradicts the one-live-mode contract'
  - 'Skeleton pending refs remain in production scenario plans'

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../scenarios/assertions/environment.ts       | 22 ---------------
 .../scenarios/assertions/onboarding.ts        | 21 ---------------
 .../scenarios/assertions/registry.ts          |  8 +++---
 .../scenarios/assertions/runtime.ts           | 21 ---------------
 .../slack/00-slack-provider-state.sh          | 27 ++++++++-----------
 5 files changed, 16 insertions(+), 83 deletions(-)
 delete mode 100644 test/e2e-scenario/scenarios/assertions/environment.ts
 delete mode 100644 test/e2e-scenario/scenarios/assertions/onboarding.ts
 delete mode 100644 test/e2e-scenario/scenarios/assertions/runtime.ts

diff --git a/test/e2e-scenario/scenarios/assertions/environment.ts b/test/e2e-scenario/scenarios/assertions/environment.ts
deleted file mode 100644
index be7a62e6fb..0000000000
--- a/test/e2e-scenario/scenarios/assertions/environment.ts
+++ /dev/null
@@ -1,22 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-import type { AssertionGroup } from "../types.ts";
-
-export function environmentBaseline(): AssertionGroup {
-  return {
-    id: "environment.baseline",
-    phase: "environment",
-    description: "Skeleton environment baseline assertion group.",
-    migrationStatus: "complete",
-    steps: [
-      {
-        id: "environment.plan.skeleton",
-        phase: "environment",
-        description: "Placeholder step until live environment orchestration is migrated.",
-        implementation: { kind: "pending", ref: "phase-1-skeleton" },
-        evidencePath: ".e2e/environment.result.json",
-      },
-    ],
-  };
-}
diff --git a/test/e2e-scenario/scenarios/assertions/onboarding.ts b/test/e2e-scenario/scenarios/assertions/onboarding.ts
deleted file mode 100644
index 9886a701fb..0000000000
--- a/test/e2e-scenario/scenarios/assertions/onboarding.ts
+++ /dev/null
@@ -1,21 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-import type { AssertionGroup } from "../types.ts";
-
-export function onboardingBaseline(): AssertionGroup {
-  return {
-    id: "onboarding.baseline",
-    phase: "onboarding",
-    description: "Skeleton onboarding assertion group.",
-    steps: [
-      {
-        id: "onboarding.plan.skeleton",
-        phase: "onboarding",
-        description: "Placeholder step until onboarding assertions are migrated.",
-        implementation: { kind: "pending", ref: "phase-1-skeleton" },
-        evidencePath: ".e2e/onboarding.result.json",
-      },
-    ],
-  };
-}
diff --git a/test/e2e-scenario/scenarios/assertions/registry.ts b/test/e2e-scenario/scenarios/assertions/registry.ts
index d6ef59fe1c..2d83ea0341 100644
--- a/test/e2e-scenario/scenarios/assertions/registry.ts
+++ b/test/e2e-scenario/scenarios/assertions/registry.ts
@@ -3,7 +3,6 @@
 
 import fs from "node:fs";
 import path from "node:path";
-import { environmentBaseline } from "./environment.ts";
 import type { AssertionGroup, AssertionStep, PhaseName, ScenarioDefinition } from "../types.ts";
 
 type Reliability = AssertionStep["reliability"];
@@ -254,7 +253,7 @@ export const validationSuiteGroups: AssertionGroup[] = [
 ];
 
 export const assertionRegistry = {
-  groups: [environmentBaseline(), ...onboardingAssertionGroups, ...runtimeControlGroups, ...validationSuiteGroups],
+  groups: [...onboardingAssertionGroups, ...runtimeControlGroups, ...validationSuiteGroups],
 };
 
 export function assertionGroupForSuite(suiteId: string): AssertionGroup | undefined {
@@ -349,8 +348,11 @@ export function assertionGroupsForScenario(scenario: ScenarioDefinition): Assert
     return group;
   });
 
+  // Environment phase work is performed by typed PhaseAction entries
+  // (context.emit + install.<id>) emitted from compiler.phaseActions(),
+  // not by assertion groups. No environment-phase assertion group is
+  // included in scenario plans.
   const groups: (AssertionGroup | undefined)[] = [
-    environmentBaseline(),
     ...onboardingGroups,
     ...suiteGroups,
     ...supplementalGroups,
diff --git a/test/e2e-scenario/scenarios/assertions/runtime.ts b/test/e2e-scenario/scenarios/assertions/runtime.ts
deleted file mode 100644
index 5ed7031279..0000000000
--- a/test/e2e-scenario/scenarios/assertions/runtime.ts
+++ /dev/null
@@ -1,21 +0,0 @@
-// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-// SPDX-License-Identifier: Apache-2.0
-
-import type { AssertionGroup } from "../types.ts";
-
-export function runtimeSmokeSkeleton(): AssertionGroup {
-  return {
-    id: "runtime.smoke.skeleton",
-    phase: "runtime",
-    description: "Skeleton runtime smoke assertion group.",
-    steps: [
-      {
-        id: "runtime.plan.skeleton",
-        phase: "runtime",
-        description: "Placeholder step until validation suites are migrated.",
-        implementation: { kind: "pending", ref: "phase-1-skeleton" },
-        evidencePath: ".e2e/runtime.result.json",
-      },
-    ],
-  };
-}
diff --git a/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh b/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
index 0f1afa2e14..a8ef4d11b8 100755
--- a/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
+++ b/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
@@ -12,25 +12,21 @@ case "${provider}" in
 esac
 e2e_messaging_assert_provider_attached
 if [[ "$(e2e_context_get E2E_AGENT)" == "openclaw" ]]; then
-  if [[ -n "${E2E_DRY_RUN:-}" ]]; then
-    e2e_pass "expected-state.messaging.slack.openclaw-enabled dry-run"
-    e2e_pass "expected-state.messaging.slack.runtime-discovery dry-run"
-  else
-    content="$(e2e_messaging_read_config_surface)"
-    if ! printf '%s\n' "${content}" | python3 -c '
+  content="$(e2e_messaging_read_config_surface)"
+  if ! printf '%s\n' "${content}" | python3 -c '
 import json
 import sys
 cfg = json.load(sys.stdin)
 assert cfg["channels"]["slack"]["enabled"] is True
 assert cfg["plugins"]["entries"]["slack"]["enabled"] is True
 '; then
-      e2e_fail "expected-state.messaging.slack.openclaw-enabled missing channels.slack.enabled or plugins.entries.slack.enabled"
-    fi
-    e2e_pass "expected-state.messaging.slack.openclaw-enabled channel and plugin enabled"
+    e2e_fail "expected-state.messaging.slack.openclaw-enabled missing channels.slack.enabled or plugins.entries.slack.enabled"
+  fi
+  e2e_pass "expected-state.messaging.slack.openclaw-enabled channel and plugin enabled"
 
-    sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
-    runtime_json="$(openshell sandbox exec --name "${sandbox_name}" -- timeout 45 openclaw channels list --all --json --no-color 2>/dev/null || true)"
-    runtime_state="$(printf '%s\n' "${runtime_json}" | python3 -c '
+  sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
+  runtime_json="$(openshell sandbox exec --name "${sandbox_name}" -- timeout 45 openclaw channels list --all --json --no-color 2>/dev/null || true)"
+  runtime_state="$(printf '%s\n' "${runtime_json}" | python3 -c '
 import json
 import sys
 try:
@@ -44,10 +40,9 @@ try:
 except Exception as exc:
     print("error %s" % exc)
 ' 2>/dev/null || true)"
-    if [[ "${runtime_state}" != "yes" ]]; then
-      e2e_fail "expected-state.messaging.slack.runtime-discovery OpenClaw did not report Slack installed/configured (${runtime_state}; output=${runtime_json:0:300})"
-    fi
-    e2e_pass "expected-state.messaging.slack.runtime-discovery OpenClaw reports Slack installed and configured"
+  if [[ "${runtime_state}" != "yes" ]]; then
+    e2e_fail "expected-state.messaging.slack.runtime-discovery OpenClaw did not report Slack installed/configured (${runtime_state}; output=${runtime_json:0:300})"
   fi
+  e2e_pass "expected-state.messaging.slack.runtime-discovery OpenClaw reports Slack installed and configured"
 fi
 e2e_pass "expected-state.messaging.slack.provider-state ${provider} provider state configured"

From 083b9b662dacf1bbdab67fd6b086ff5d8b95a851 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 07:20:04 -0400
Subject: [PATCH 11/23] test(e2e): wrap openshell sandbox exec with per-call
 timeout + diagnostic
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The PR #4380 scenario run produced a SIGTERM cascade across runtime
suite steps (sandbox-shell, inference.*, kimi.*, lifecycle.rebuild.*,
lifecycle.upgrade.*) because the canonical sandbox-exec wrapper had no
inner timeout. When openshell ssh-into-sandbox wedged, the orchestrator
killed the process group at the step cap (30/45/60/120s) and the script
log captured only the header line — no diagnostic, no classifier.

Mirror the legacy test/e2e/ defense-in-depth pattern (outer process
timeout + inner curl --max-time) inside the new framework's canonical
wrapper:

- sandbox-exec.sh now applies a per-call timeout via timeout/gtimeout
  with --kill-after=5s. Default 25s sits below the common 30s step cap;
  callers override via E2E_SANDBOX_EXEC_TIMEOUT_SECONDS where the step
  budget differs (chat-completion 50s, sandbox-local 35s, slack 50s,
  gateway-alive fallback 15s). On timeout (exit 124 or 137) the wrapper
  emits a single-line classified diagnostic so phase.ts captures
  observable evidence instead of a SIGTERM black hole.

- Migrate 7 BARE leaf scripts that called openshell sandbox exec
  directly to route through the wrapper:
    smoke/03-sandbox-shell.sh
    assert/gateway-alive.sh (ollama fallback path)
    inference/cloud/00-models-health.sh
    inference/cloud/01-chat-completion.sh
    inference/cloud/02-inference-local-from-sandbox.sh
    inference/ollama-auth-proxy/00-proxy-reachable.sh
    messaging/slack/00-slack-provider-state.sh

  kimi-compatibility steps already route through inference_routing.sh
  which uses the wrapper, so they inherit the fix transparently.
  rebuild_upgrade.sh and security_policy_credentials.sh continue to use
  their own ad-hoc timeouts; unifying those onto the wrapper is a
  follow-up.
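
For reference, a minimal sketch of the capped-call pattern described
above. It is not the wrapper itself: run_with_cap is a hypothetical
name, and the sketch assumes GNU timeout (or gtimeout) semantics for
--kill-after and for exit codes 124/137.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# run_with_cap <seconds> <cmd> [args...]
run_with_cap() {
  local seconds="$1"
  shift
  local timeout_cmd=""
  if command -v timeout >/dev/null 2>&1; then
    timeout_cmd="timeout"
  elif command -v gtimeout >/dev/null 2>&1; then
    timeout_cmd="gtimeout"
  fi
  if [[ -z "${timeout_cmd}" ]]; then
    # No timeout binary: run uncapped, but say so on stderr.
    echo "run_with_cap: no timeout binary; running uncapped" >&2
    "$@"
    return $?
  fi
  local rc=0
  # "|| rc=$?" records the exit code without tripping `set -e`.
  "${timeout_cmd}" --kill-after=5s "${seconds}" "$@" || rc=$?
  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
    # 124: the cap fired (SIGTERM). 137: --kill-after escalated to SIGKILL.
    echo "run_with_cap: command hung after ${seconds}s (cmd=${1:-?})" >&2
  fi
  return "${rc}"
}
```

Calling run_with_cap 1 sleep 5 should return 124 after about one
second and leave a one-line diagnostic on stderr, which is the evidence
the step log would otherwise lack.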

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../validation_suites/assert/gateway-alive.sh |  6 +-
 .../inference/cloud/00-models-health.sh       |  6 +-
 .../inference/cloud/01-chat-completion.sh     |  8 ++-
 .../cloud/02-inference-local-from-sandbox.sh  |  6 +-
 .../ollama-auth-proxy/00-proxy-reachable.sh   |  4 +-
 .../slack/00-slack-provider-state.sh          | 11 +++-
 .../validation_suites/sandbox-exec.sh         | 65 ++++++++++++++++++-
 .../smoke/03-sandbox-shell.sh                 |  5 +-
 8 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/test/e2e-scenario/validation_suites/assert/gateway-alive.sh b/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
index 5eec76073c..42f33e1c50 100755
--- a/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
+++ b/test/e2e-scenario/validation_suites/assert/gateway-alive.sh
@@ -9,6 +9,8 @@ _E2E_GW_LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../runtime/lib" && pwd)
 . "${_E2E_GW_LIB_DIR}/env.sh"
 # shellcheck source=../../runtime/lib/context.sh
 . "${_E2E_GW_LIB_DIR}/context.sh"
+# shellcheck source=../sandbox-exec.sh
+. "$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)/sandbox-exec.sh"
 
 # e2e_gateway_assert_healthy [url]
 # Defaults to E2E_GATEWAY_URL from context; returns non-zero with a clear
@@ -37,7 +39,9 @@ e2e_gateway_assert_healthy() {
     local sandbox_name
     sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
     if [[ -n "${sandbox_name}" ]] && command -v openshell >/dev/null 2>&1; then
-      http_code="$(openshell sandbox exec -n "${sandbox_name}" -- curl -fsS -o /dev/null -w '%{http_code}' --max-time 5 http://localhost:18789/health 2>/dev/null || echo 000)"
+      # Wrapper applies a per-call timeout so a wedged ssh handshake here
+      # cannot consume the orchestrator's whole step budget.
+      http_code="$(E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=15 e2e_sandbox_exec "${sandbox_name}" -- curl -fsS -o /dev/null -w '%{http_code}' --max-time 5 http://localhost:18789/health 2>/dev/null || echo 000)"
       if [[ "${http_code}" == "200" || "${http_code}" == "401" ]]; then
         return 0
       fi
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh b/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
index af8ad99081..8277f05f38 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
@@ -13,12 +13,16 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 . "${LIB_DIR}/env.sh"
 # shellcheck source=../../../runtime/lib/context.sh
 . "${LIB_DIR}/context.sh"
+# shellcheck source=../../sandbox-exec.sh
+. "${SCRIPT_DIR}/../../sandbox-exec.sh"
 
 echo "inference:models-health"
 e2e_context_require E2E_SANDBOX_NAME
 
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
-body="$(openshell sandbox exec --name "${name}" -- curl -fsS --max-time 30 "https://inference.local/v1/models")"
+# Orchestrator step cap is 30s; wrapper default 25s applies. Inner curl
+# --max-time keeps a hung HTTP read from consuming the whole budget.
+body="$(e2e_sandbox_exec "${name}" -- curl -fsS --max-time 20 "https://inference.local/v1/models")"
 if [[ -z "${body}" ]]; then
   echo "inference:models-health: no response from models endpoint" >&2
   exit 1
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
index 242a9496f1..208be3b1f8 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
@@ -12,13 +12,19 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 . "${LIB_DIR}/env.sh"
 # shellcheck source=../../../runtime/lib/context.sh
 . "${LIB_DIR}/context.sh"
+# shellcheck source=../../sandbox-exec.sh
+. "${SCRIPT_DIR}/../../sandbox-exec.sh"
 
 echo "inference:chat-completion"
 e2e_context_require E2E_SANDBOX_NAME
 
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 payload='{"model":"nvidia/nemotron-3-super-120b-a12b","messages":[{"role":"user","content":"Reply with exactly one word: PONG"}],"max_tokens":100}'
-response="$(openshell sandbox exec --name "${name}" -- curl -fsS --max-time 60 -H 'Content-Type: application/json' \
+# Orchestrator step cap is 60s; widen the wrapper cap to 50s so a hung
+# upstream surfaces with a clear diagnostic before SIGTERM. Inner curl
+# --max-time stays ~10s under the wrapper cap.
+E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=50 \
+response="$(e2e_sandbox_exec "${name}" -- curl -fsS --max-time 40 -H 'Content-Type: application/json' \
   -d "${payload}" "https://inference.local/v1/chat/completions")"
 # CodeRabbit review item #12: substring expansion instead of `| head`
 # avoids SIGPIPE-driven false failures under `set -o pipefail`.
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
index c3253a966a..7e51f9e3e7 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
@@ -13,13 +13,17 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 . "${LIB_DIR}/env.sh"
 # shellcheck source=../../../runtime/lib/context.sh
 . "${LIB_DIR}/context.sh"
+# shellcheck source=../../sandbox-exec.sh
+. "${SCRIPT_DIR}/../../sandbox-exec.sh"
 
 echo "inference:sandbox-inference-local"
 e2e_context_require E2E_SANDBOX_NAME E2E_INFERENCE_ROUTE
 
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
 route="$(e2e_context_get E2E_INFERENCE_ROUTE)"
+# Orchestrator step cap is 45s; widen wrapper cap to 35s.
 # CodeRabbit review item #13: capture then truncate to avoid `| head` racing
 # curl under `pipefail` and flagging a successful request as failed.
-body="$(openshell sandbox exec --name "${name}" -- curl -fsS --max-time 10 "https://${route}/v1/models")"
+E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=35 \
+body="$(e2e_sandbox_exec "${name}" -- curl -fsS --max-time 25 "https://${route}/v1/models")"
 printf '%s\n' "${body:0:512}"
diff --git a/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh b/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
index a3887864b2..d172615795 100755
--- a/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
+++ b/test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
@@ -12,6 +12,8 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../../runtime/lib" && pwd)"
 . "${LIB_DIR}/env.sh"
 # shellcheck source=../../../runtime/lib/context.sh
 . "${LIB_DIR}/context.sh"
+# shellcheck source=../../sandbox-exec.sh
+. "${SCRIPT_DIR}/../../sandbox-exec.sh"
 
 echo "ollama-proxy:proxy-reachable"
 e2e_context_require E2E_SANDBOX_NAME
@@ -19,7 +21,7 @@ name="$(e2e_context_get E2E_SANDBOX_NAME)"
 # The Ollama auth proxy intentionally rejects unauthenticated requests to
 # /api/tags (legacy test-gpu-e2e.sh accepts 401/403 as proof the proxy is
 # live and enforcing auth). Do not use curl -f here.
-status="$(openshell sandbox exec --name "${name}" -- curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "http://inference-local/api/tags" 2>/dev/null || echo 000)"
+status="$(e2e_sandbox_exec "${name}" -- curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "http://inference-local/api/tags" 2>/dev/null || echo 000)"
 case "${status}" in
   200 | 401 | 403)
     echo "ollama-proxy:proxy-reachable status=${status}"
diff --git a/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh b/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
index a8ef4d11b8..bac54bb501 100755
--- a/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
+++ b/test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh
@@ -3,7 +3,10 @@
 # SPDX-License-Identifier: Apache-2.0
 
 set -euo pipefail
-. "$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)/lib/messaging_providers.sh"
+_SLACK_SUITES_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)"
+. "${_SLACK_SUITES_DIR}/lib/messaging_providers.sh"
+# shellcheck source=../../sandbox-exec.sh
+. "${_SLACK_SUITES_DIR}/sandbox-exec.sh"
 e2e_messaging_load_context
 provider="$(e2e_messaging_provider_name)"
 case "${provider}" in
@@ -25,7 +28,11 @@ assert cfg["plugins"]["entries"]["slack"]["enabled"] is True
   e2e_pass "expected-state.messaging.slack.openclaw-enabled channel and plugin enabled"
 
   sandbox_name="$(e2e_context_get E2E_SANDBOX_NAME)"
-  runtime_json="$(openshell sandbox exec --name "${sandbox_name}" -- timeout 45 openclaw channels list --all --json --no-color 2>/dev/null || true)"
+  # Wrapper cap (50s) sits just above the inner `timeout 45` so the inner
+  # cap is what fires under normal upstream slowness; the wrapper only
+  # catches the case where openshell itself wedges before delivering the
+  # `timeout` invocation to the sandbox.
+  runtime_json="$(E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=50 e2e_sandbox_exec "${sandbox_name}" -- timeout 45 openclaw channels list --all --json --no-color 2>/dev/null || true)"
   runtime_state="$(printf '%s\n' "${runtime_json}" | python3 -c '
 import json
 import sys
diff --git a/test/e2e-scenario/validation_suites/sandbox-exec.sh b/test/e2e-scenario/validation_suites/sandbox-exec.sh
index c9e3ec06c6..c69f0c963d 100755
--- a/test/e2e-scenario/validation_suites/sandbox-exec.sh
+++ b/test/e2e-scenario/validation_suites/sandbox-exec.sh
@@ -22,6 +22,34 @@ _E2E_SBEX_LIB_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../runtime/lib" && pwd)"
 # shellcheck source=../runtime/lib/env.sh
 . "${_E2E_SBEX_LIB_DIR}/env.sh"
 
+# Per-call timeout (seconds) applied to every `openshell sandbox exec`
+# invocation routed through this wrapper. Callers MAY override per call:
+#   E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=50 e2e_sandbox_exec ...
+#
+# Why a wrapper-level cap exists:
+#   The orchestrator (phase.ts) enforces step-level timeouts via SIGTERM on
+#   the script's process group. When openshell ssh-into-sandbox hangs,
+#   SIGTERM eventually kills the script — but the script has no chance to
+#   emit a structured diagnostic, so logs end mid-line. An inner per-call
+#   `timeout` lets the wrapper observe the hang, emit a classified
+#   diagnostic, and exit cleanly *before* the orchestrator's SIGTERM.
+#
+# The default (25s) sits below the most common orchestrator step caps
+# (30s smoke / kimi, 45s sandbox-local). Steps with longer caps (60s
+# chat-completion, 120s rebuild) export a larger value before calling.
+: "${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS:=25}"
+
+# Resolve the timeout binary once. Empty string == not available.
+_e2e_sbex_resolve_timeout_cmd() {
+  if command -v timeout >/dev/null 2>&1; then
+    printf '%s' timeout
+  elif command -v gtimeout >/dev/null 2>&1; then
+    printf '%s' gtimeout
+  else
+    printf '%s' ''
+  fi
+}
+
 # _e2e_sbex_split_args <sandbox> -- <cmd> [args...]
 # Parses the shared calling convention. Prints on stderr on misuse and
 # returns 2. On success, sets the two global arrays _E2E_SBEX_SB_NAME and
@@ -55,7 +83,26 @@ e2e_sandbox_exec() {
     echo "e2e_sandbox_exec: openshell CLI not on PATH" >&2
     return 127
   fi
-  openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  local timeout_cmd seconds="${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS}"
+  timeout_cmd="$(_e2e_sbex_resolve_timeout_cmd)"
+  if [[ -z "${timeout_cmd}" ]]; then
+    # No timeout binary available — fall back to bare exec but make the
+    # missing safety net visible so CI can flag it.
+    echo "e2e_sandbox_exec: 'timeout' not available; running without per-call cap (sandbox=${_E2E_SBEX_SB_NAME})" >&2
+    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+    return $?
+  fi
+  local rc=0
+  "${timeout_cmd}" --kill-after=5s "${seconds}" \
+    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  rc=$?
+  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
+    # 124 = timeout fired SIGTERM, 137 = --kill-after fired SIGKILL.
+    # Emit a single-line classified diagnostic so phase.ts captures
+    # something more useful than a SIGTERM black hole.
+    echo "e2e_sandbox_exec: openshell sandbox exec hung after ${seconds}s (sandbox=${_E2E_SBEX_SB_NAME}, cmd=${_E2E_SBEX_CMD[0]:-?}; classifier=gateway-transient)" >&2
+  fi
+  return "${rc}"
 }
 
 # e2e_sandbox_exec_stdin <sandbox> -- <cmd> [args...]
@@ -69,5 +116,19 @@ e2e_sandbox_exec_stdin() {
     echo "e2e_sandbox_exec_stdin: openshell CLI not on PATH" >&2
     return 127
   fi
-  openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  local timeout_cmd seconds="${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS}"
+  timeout_cmd="$(_e2e_sbex_resolve_timeout_cmd)"
+  if [[ -z "${timeout_cmd}" ]]; then
+    echo "e2e_sandbox_exec_stdin: 'timeout' not available; running without per-call cap (sandbox=${_E2E_SBEX_SB_NAME})" >&2
+    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+    return $?
+  fi
+  local rc=0
+  "${timeout_cmd}" --kill-after=5s "${seconds}" \
+    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  rc=$?
+  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
+    echo "e2e_sandbox_exec_stdin: openshell sandbox exec hung after ${seconds}s (sandbox=${_E2E_SBEX_SB_NAME}, cmd=${_E2E_SBEX_CMD[0]:-?}; classifier=gateway-transient)" >&2
+  fi
+  return "${rc}"
 }
diff --git a/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh b/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
index d27e8b7a24..966efeb2d8 100755
--- a/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
+++ b/test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
@@ -13,12 +13,15 @@ LIB_DIR="$(cd "${SCRIPT_DIR}/../../runtime/lib" && pwd)"
 . "${LIB_DIR}/env.sh"
 # shellcheck source=../../runtime/lib/context.sh
 . "${LIB_DIR}/context.sh"
+# shellcheck source=../sandbox-exec.sh
+. "${SCRIPT_DIR}/../sandbox-exec.sh"
 
 echo "smoke:sandbox-shell"
 e2e_context_require E2E_SANDBOX_NAME
 
 name="$(e2e_context_get E2E_SANDBOX_NAME)"
-output="$(openshell sandbox exec --name "${name}" -- echo ok 2>&1)"
+# Orchestrator step cap is 30s; wrapper default 25s applies.
+output="$(e2e_sandbox_exec "${name}" -- echo ok 2>&1)"
 echo "${output}"
 if ! echo "${output}" | grep -q '^ok$'; then
   echo "smoke:sandbox-shell: did not receive expected 'ok' from sandbox" >&2

From 14ab98f987adb0df31c41a287ed79b673260f2af Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 07:47:05 -0400
Subject: [PATCH 12/23] ci(e2e): silence SC2034 on prefix-env sandbox exec
 timeout overrides

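shellcheck flags 'E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=N \' followed by
'var="$(e2e_sandbox_exec ...)"' as SC2034 (appears unused) because
static analysis can't follow that the wrapper reads the variable at
call time. The pattern works: the line holds two assignments and no
command word, so bash applies both in order in the current shell and
the command substitution that runs the wrapper sees the new value. It
is not a true prefix-env export, so the override persists for the rest
of the script. The warning blocks the 'checks' job.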

Add a localized 'shellcheck disable=SC2034' directly above each affected
prefix-env line in the two cloud-inference assertion scripts that
override the default wrapper timeout. No behavior change.
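
The two override spellings used in this series are not equivalent in
bash. A minimal sketch of the difference, using a hypothetical CAP
variable and show_cap function in place of the real wrapper:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins: CAP plays the timeout variable, show_cap the wrapper.
CAP=25
show_cap() { printf '%s' "${CAP}"; }

# Form 1: assignment followed by another assignment (no command word).
# Bash performs both in order in the current shell, so the substitution
# sees CAP=50 and CAP stays 50 afterwards.
CAP=50 \
first="$(show_cap)"
after_first="${CAP}"

CAP=25
# Form 2: prefix assignment inside the substitution. The override is
# scoped to that one call; the outer CAP is untouched.
second="$(CAP=35 show_cap)"
after_second="${CAP}"

printf 'form1=%s (then %s) form2=%s (then %s)\n' \
  "${first}" "${after_first}" "${second}" "${after_second}"
```

Form 1 is what the two cloud-inference scripts use; form 2 matches the
gateway-alive and slack call sites and does not trigger SC2034.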

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../validation_suites/inference/cloud/01-chat-completion.sh      | 1 +
 .../inference/cloud/02-inference-local-from-sandbox.sh           | 1 +
 2 files changed, 2 insertions(+)

diff --git a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
index 208be3b1f8..20f481504e 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
@@ -23,6 +23,7 @@ payload='{"model":"nvidia/nemotron-3-super-120b-a12b","messages":[{"role":"user"
 # Orchestrator step cap is 60s; widen the wrapper cap to 50s so a hung
 # upstream surfaces with a clear diagnostic before SIGTERM. Inner curl
 # --max-time stays ~10s under the wrapper cap.
+# shellcheck disable=SC2034 # consumed by e2e_sandbox_exec via env
 E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=50 \
 response="$(e2e_sandbox_exec "${name}" -- curl -fsS --max-time 40 -H 'Content-Type: application/json' \
   -d "${payload}" "https://inference.local/v1/chat/completions")"
diff --git a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
index 7e51f9e3e7..f5102efd74 100755
--- a/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
+++ b/test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
@@ -24,6 +24,7 @@ route="$(e2e_context_get E2E_INFERENCE_ROUTE)"
 # Orchestrator step cap is 45s; widen wrapper cap to 35s.
 # CodeRabbit review item #13: capture then truncate to avoid `| head` racing
 # curl under `pipefail` and flagging a successful request as failed.
+# shellcheck disable=SC2034 # consumed by e2e_sandbox_exec via env
 E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=35 \
 body="$(e2e_sandbox_exec "${name}" -- curl -fsS --max-time 25 "https://${route}/v1/models")"
 printf '%s\n' "${body:0:512}"

From c167d35d57e5592527172c357c6e6bd3304dda40 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 07:59:05 -0400
Subject: [PATCH 13/23] feat(e2e): framework-owned secret hygiene at the spawn
 boundary

Addresses PR Review Advisor finding #1: 'Secret-bearing child process
output is persisted and uploaded without redaction'.

Designed in the spirit of the test/e2e-scenario architecture: secret
hygiene is FRAMEWORK INFRASTRUCTURE, not a per-script / per-action /
per-workflow concern. Same one-mode discipline that motivates the rest
of this PR (no flag, no env var, no helper bypasses redaction).

New module: test/e2e-scenario/scenarios/orchestrators/redaction.ts
  Sits next to context.ts, owns:
    - redactString(text):       canonical token-shape redaction
    - buildChildEnv(base, opts): minimal allowlisted env + declared
                                 secretEnv passthrough only
    - pipeRedacted(src, log):   single I/O wrapper used by both
                                 runAction and runShellStep
  Pattern set is a framework-local mirror of
  src/lib/security/secret-patterns.ts. The framework deliberately does
  NOT import from src/lib/security/ to keep the framework-vs-product
  boundary clean and avoid cross-tsconfig ESM issues. A new parity
  test (e2e-redaction-parity.test.ts) asserts the two pattern sets
  stay in lockstep so adding a token shape in one place keeps both
  layers honest.
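
  To illustrate what token-shape redaction on a stream means, here is a
  shell sketch. It is an assumption-laden stand-in, not redaction.ts:
  redact_stream is a hypothetical name, and the patterns cover only the
  prefixes named in the tests below (nvapi-, ghp_, sk-, xoxb-, Bearer),
  not the real pattern set in secret-patterns.ts.

```shell
#!/usr/bin/env bash
# Hypothetical filter, for illustration only: reads stdin, writes stdout.
redact_stream() {
  # First expression: prefixed tokens with at least 8 trailing characters.
  # Second expression: keep the "Bearer " label, replace the credential.
  # Caveat: no word-boundary check, so "sk-" can also match mid-word.
  sed -E \
    -e 's/(nvapi-|ghp_|xoxb-|sk-)[A-Za-z0-9_-]{8,}/<REDACTED>/g' \
    -e 's/(Bearer )[A-Za-z0-9._-]+/\1<REDACTED>/g'
}

printf '%s\n' 'key=nvapi-abcdefgh12345678 ok' | redact_stream
```

  The point of a single filter like this is that every byte headed for
  an evidence log passes through the same place.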

Typed contract extension: PhaseAction.secretEnv?: readonly string[]
  AssertionStep.secretEnv?: readonly string[]
  Compiler decides which secrets cross the framework boundary, the
  same way aliasPath / scriptRef / timeoutSeconds are declared
  metadata. buildChildEnv enforces a 'secret-key shape' contract on
  every entry (must end with API_KEY/TOKEN/SECRET/PASSWORD/CREDENTIAL/
  PASSPHRASE/PRIVATE_KEY/ACCESS_KEY) so the secretEnv channel cannot
  silently allowlist non-secret variables.
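
  The suffix rule can be sketched in shell as follows. The real check
  lives in buildChildEnv (TypeScript) and may differ in detail;
  is_secret_key_shape is a hypothetical name used only here.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# Succeeds when the variable name ends with one of the listed suffixes.
is_secret_key_shape() {
  local key="$1"
  [[ "${key}" =~ (API_KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL|PASSPHRASE|PRIVATE_KEY|ACCESS_KEY)$ ]]
}

for key in NVIDIA_API_KEY FOO_VAR; do
  if is_secret_key_shape "${key}"; then
    echo "${key}: accepted"
  else
    echo "${key}: rejected"
  fi
done
```

  Under this rule NVIDIA_API_KEY is accepted and FOO_VAR is rejected,
  matching the runtime test listed below.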

compiler.ts: declares onboard secretEnv per profile
  cloud-* profiles -> ['NVIDIA_API_KEY']
  local-ollama-openclaw -> []
  Install actions declare nothing (the installer itself does not
  authenticate to the cloud).

phase.ts: two call sites updated, identical pattern in both
  runAction:    buildChildEnv(...) + pipeRedacted(stdout, log) +
                pipeRedacted(stderr, log, tail<-redactedChunk)
  runShellStep: same shape
  Spawn-error 'message' fields are also wrapped with redactString as
  defense-in-depth.

Workflow upload tightening (.github/workflows/e2e-scenarios.yaml):
  - include-hidden-files: false
    Hidden dotfiles under the workspace can carry raw secrets
    (notably .e2e/context.env, written by e2e_context_set without
    redaction). Diagnostic dumps of context use e2e_context_dump
    which redacts on emit.
  - path: explicit subpath list (.e2e/actions/, .e2e/logs/, result
    JSONs, plan/run-plan, onboard.log) instead of blanket .e2e/.
  workflow-boundary.mts contract test updated to assert the new
  invariant; e2e-scenarios-workflow.test.ts expectations updated.

Bash side: unchanged. test/e2e-scenario/runtime/lib/context.sh::
e2e_context_dump already redacts via _e2e_context_is_sensitive_key.
This module covers the TS-spawned-child path that lacked coverage.

Tests (e2e-phase-orchestrators.test.ts, +5):
  - test_should_not_persist_secret_shaped_child_output_into_evidence
    Real shell child writes nvapi-, ghp_, sk-, xoxb-, Bearer ...
    Asserts evidence log, phase result.json, and result.message
    contain none of the literal tokens but do contain <REDACTED>.
  - test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv
    Sets a sentinel parent-env var, runs child without declaring it,
    asserts child cannot see it. PATH and E2E_PHASE still arrive.
  - test_should_pass_declared_secretEnv_through_to_child
    Declares the var in step.secretEnv, asserts child sees it.
  - test_should_reject_non_secret_shaped_keys_in_secretEnv_at_runtime
    buildChildEnv throws on FOO_VAR; the secret-key-shape contract
    keeps the allowlist boundary honest.
  - test_should_declare_NVIDIA_API_KEY_only_for_cloud_onboarding_actions
    Compiler-level contract: cloud onboard declares NVIDIA_API_KEY,
    local-ollama declares nothing.
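
The env-boundary behavior these tests pin down can be sketched as below,
assuming a reduced allowlist and a single prefix. buildEnv and the Env
type are illustrative names, not the module's exports; the key-shape
regex is copied from SECRET_ENV_KEY_SHAPE:

```typescript
type Env = Record<string, string | undefined>;

const ALLOWLIST = new Set(["PATH", "HOME"]); // hypothetical subset
const PREFIXES = ["E2E_"]; // hypothetical subset
const SECRET_KEY_SHAPE =
  /^[A-Z][A-Z0-9_]*(?:API[_]?KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL|PASSPHRASE|PRIVATE[_]?KEY|ACCESS[_]?KEY)$/;

// Keep allowlisted and prefixed keys, pass through declared secrets
// (after validating their shape), then layer the framework overlay.
function buildEnv(base: Env, secretEnv: readonly string[], overlay: Env): Env {
  const out: Env = {};
  for (const [key, value] of Object.entries(base)) {
    if (value === undefined) continue;
    if (ALLOWLIST.has(key) || PREFIXES.some((p) => key.startsWith(p))) {
      out[key] = value;
    }
  }
  for (const key of secretEnv) {
    if (!SECRET_KEY_SHAPE.test(key)) {
      throw new Error(`secretEnv entry '${key}' does not match the secret-key shape`);
    }
    if (base[key] !== undefined) out[key] = base[key];
  }
  return { ...out, ...overlay };
}

const parentEnv: Env = {
  PATH: "/usr/bin",
  SECRET_LEAK_PROBE_TOKEN: "sentinel",
  NVIDIA_API_KEY: "placeholder",
};
// Undeclared: the sentinel is dropped even though its name is secret-shaped.
const undeclared = buildEnv(parentEnv, [], { E2E_PHASE: "runtime" });
// Declared: only the named key passes through.
const declared = buildEnv(parentEnv, ["NVIDIA_API_KEY"], { E2E_PHASE: "runtime" });
console.log(Object.keys(undeclared), Object.keys(declared));
```

Note that a secret-shaped name alone does not grant passthrough; the key
must also be declared by the step or action.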

Tests (e2e-redaction-parity.test.ts, new file, +2):
  - TOKEN_PREFIX_PATTERNS framework copy matches product source
  - CONTEXT_PATTERNS framework copy matches product source
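
A rough sketch of the textual comparison the parity test performs. The
two source strings here are made-up stand-ins for the real files, and a
single extractor is used for both sides for brevity (the test itself uses
two slightly different matchers):

```typescript
// A line counts as a regex literal if it starts with "/" and ends with
// "/flags" plus an optional trailing comma. "// comment" lines fail this.
const REGEX_LITERAL_LINE = /^\/.+\/[a-z]*,?$/;

function extractRegexLiterals(source: string, constName: string): string[] {
  const re = new RegExp(`const ${constName}[^=]*=\\s*\\[([\\s\\S]*?)\\];`);
  const m = source.match(re);
  if (!m) return [];
  return m[1]
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => REGEX_LITERAL_LINE.test(line))
    .map((line) => line.replace(/,\s*$/, ""));
}

// Made-up stand-ins for the product and framework source files.
const productSource = `export const TOKEN_PREFIX_PATTERNS: RegExp[] = [
  // NVIDIA
  /nvapi-[A-Za-z0-9_-]{10,}/g,
  /ghp_[A-Za-z0-9_-]{10,}/g,
];`;
const frameworkSource = `const TOKEN_PREFIX_PATTERNS: RegExp[] = [
  /nvapi-[A-Za-z0-9_-]{10,}/g,
  /ghp_[A-Za-z0-9_-]{10,}/g,
];`;

const product = extractRegexLiterals(productSource, "TOKEN_PREFIX_PATTERNS");
const framework = extractRegexLiterals(frameworkSource, "TOKEN_PREFIX_PATTERNS");
console.log(JSON.stringify(product) === JSON.stringify(framework));
```

Comparing literals as text keeps the framework free of any import from
src/lib/security/ while still failing when the two copies drift.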

308/308 e2e-scenario framework tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .github/workflows/e2e-scenarios.yaml          |  15 +-
 .../e2e-phase-orchestrators.test.ts           | 144 ++++++++++++
 .../e2e-redaction-parity.test.ts              |  73 ++++++
 .../e2e-scenarios-workflow.test.ts            |   5 +-
 test/e2e-scenario/scenarios/compiler.ts       |  25 +++
 .../scenarios/orchestrators/phase.ts          |  61 +++--
 .../scenarios/orchestrators/redaction.ts      | 212 ++++++++++++++++++
 test/e2e-scenario/scenarios/types.ts          |  15 ++
 tools/e2e-scenarios/workflow-boundary.mts     |  25 ++-
 9 files changed, 545 insertions(+), 30 deletions(-)
 create mode 100644 test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
 create mode 100644 test/e2e-scenario/scenarios/orchestrators/redaction.ts

diff --git a/.github/workflows/e2e-scenarios.yaml b/.github/workflows/e2e-scenarios.yaml
index 10292b50a4..13da3bccfd 100644
--- a/.github/workflows/e2e-scenarios.yaml
+++ b/.github/workflows/e2e-scenarios.yaml
@@ -334,14 +334,25 @@ jobs:
         uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
         with:
           name: e2e-scenario-${{ inputs.scenarios || github.event.inputs.scenarios }}
+          # Explicit subpath list, NOT a blanket .e2e/ + hidden files.
+          # The framework redacts every byte that flows from spawned
+          # children into actions/*.log, logs/*.log, and onboard.log via
+          # orchestrators/redaction.ts::pipeRedacted. Anything outside
+          # the listed paths (notably the raw context.env file) is
+          # excluded so secret-bearing key=value lines cannot leak via
+          # the artifact even if a future helper writes there.
+          # Diagnostic dumps of context use e2e_context_dump, which
+          # redacts on emit (runtime/lib/context.sh).
           path: |
             .e2e/run-plan.json
             .e2e/plan.txt
             .e2e/environment.result.json
             .e2e/onboarding.result.json
             .e2e/runtime.result.json
-            .e2e/
+            .e2e/actions/
+            .e2e/logs/
+            .e2e/onboard.log
             test/e2e/logs/
           if-no-files-found: warn
           retention-days: 14
-          include-hidden-files: true
+          include-hidden-files: false
diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index a587c50112..a09e5be6a8 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -555,6 +555,150 @@ describe("ScenarioRunner seeds context.env and short-circuits across phases", ()
   });
 });
 
+describe("framework-owned secret hygiene at the spawn boundary", () => {
+  it("test_should_not_persist_secret_shaped_child_output_into_evidence", async () => {
+    const ctx = freshCtx();
+    try {
+      // Child writes secret-shaped tokens (NVIDIA, GitHub, OpenAI,
+      // Slack, Bearer-prefixed) on both stdout and stderr, then exits
+      // non-zero so stderrTail also flows into result.message. None of
+      // those literal tokens may persist anywhere in the evidence.
+      const body = [
+        'echo "step prints nvapi-1234567890abcdef0123456789"',
+        'echo "and ghp_abcdefghijklmnopqrstuvwxyz0123456789"',
+        'echo "and sk-abcdefghijklmnopqrstuvwxyz0123456789"',
+        'echo "and xoxb-9876543210-fake-bot-token-abc"',
+        'echo "Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.payload.signature" 1>&2',
+        'exit 7',
+      ].join("\n");
+      const script = writeTempScript(ctx.contextDir, "leak.sh", body);
+      const ref = path.relative(REPO_ROOT, script);
+      const step = shellStep("runtime.leak", "runtime", ref);
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+      const assertion = result.assertions[0];
+      const logBody = fs.readFileSync(path.join(ctx.contextDir, ".e2e", "logs", `${step.id}.log`), "utf8");
+      const phaseResultJson = fs.readFileSync(
+        path.join(ctx.contextDir, ".e2e", "runtime.result.json"),
+        "utf8",
+      );
+      const surfaces = [logBody, assertion.message ?? "", phaseResultJson];
+
+      // Every secret-shaped token canonicalized in
+      // src/lib/security/secret-patterns.ts must be redacted on the
+      // way to disk, regardless of which surface is read.
+      const forbiddenPatterns = [
+        /nvapi-[A-Za-z0-9_-]{10,}/,
+        /ghp_[A-Za-z0-9_-]{10,}/,
+        /sk-[A-Za-z0-9_-]{20,}/,
+        /(?:xox[bpas]|xapp)-[A-Za-z0-9-]{10,}/,
+        /Bearer\s+[A-Za-z0-9_.+\/=-]{10,}/i,
+      ];
+      for (const surface of surfaces) {
+        for (const pat of forbiddenPatterns) {
+          expect(surface, `evidence surface must not contain ${pat}`).not.toMatch(pat);
+        }
+        expect(surface).toMatch(/<REDACTED>/);
+      }
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv", async () => {
+    const ctx = freshCtx();
+    const sentinelKey = "SECRET_LEAK_PROBE_TOKEN";
+    const previous = process.env[sentinelKey];
+    process.env[sentinelKey] = "sentinel-value-that-must-not-leak";
+    try {
+      const script = writeTempScript(
+        ctx.contextDir,
+        "env-leak.sh",
+        `printenv | sort\n`,
+      );
+      const ref = path.relative(REPO_ROOT, script);
+      // Step does NOT declare SECRET_LEAK_PROBE_TOKEN in secretEnv,
+      // so the framework must drop it before spawn.
+      const step = shellStep("runtime.env-drop", "runtime", ref);
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+      const logBody = fs.readFileSync(path.join(ctx.contextDir, ".e2e", "logs", `${step.id}.log`), "utf8");
+
+      expect(result.assertions[0].status).toBe("passed");
+      expect(logBody, "non-allowlisted parent env must not reach the child").not.toContain(sentinelKey);
+      expect(logBody).not.toContain("sentinel-value-that-must-not-leak");
+      // Framework allowlist + overlay still arrive: PATH and E2E_PHASE.
+      expect(logBody).toMatch(/^PATH=/m);
+      expect(logBody).toMatch(/^E2E_PHASE=runtime$/m);
+    } finally {
+      if (previous === undefined) delete process.env[sentinelKey];
+      else process.env[sentinelKey] = previous;
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_should_pass_declared_secretEnv_through_to_child", async () => {
+    const ctx = freshCtx();
+    const declaredKey = "NEMOCLAW_TEST_API_KEY"; // matches SECRET_ENV_KEY_SHAPE
+    const previous = process.env[declaredKey];
+    process.env[declaredKey] = "declared-secret-value-passes-through";
+    try {
+      const script = writeTempScript(
+        ctx.contextDir,
+        "declared.sh",
+        `printenv ${declaredKey} || echo MISSING\n`,
+      );
+      const ref = path.relative(REPO_ROOT, script);
+      const step: AssertionStep = {
+        ...shellStep("runtime.env-declared", "runtime", ref),
+        secretEnv: [declaredKey],
+      };
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+      const logBody = fs.readFileSync(path.join(ctx.contextDir, ".e2e", "logs", `${step.id}.log`), "utf8");
+
+      expect(result.assertions[0].status).toBe("passed");
+      // Declared secret reaches the child verbatim.
+      expect(logBody).toContain("declared-secret-value-passes-through");
+      // It is NOT redacted in printenv output because nothing about
+      // the literal value matches a token-shape pattern. (Real
+      // secrets that match secret-patterns.ts WILL be redacted as a
+      // second line of defense; this synthetic value is intentionally
+      // shape-free to isolate the env-passthrough behavior.)
+    } finally {
+      if (previous === undefined) delete process.env[declaredKey];
+      else process.env[declaredKey] = previous;
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_should_reject_non_secret_shaped_keys_in_secretEnv_at_runtime", async () => {
+    const { buildChildEnv } = await import("../scenarios/orchestrators/redaction.ts");
+    expect(() =>
+      buildChildEnv(process.env, { secretEnv: ["FOO_VAR"], frameworkOverlay: {} }),
+    ).toThrow(/secret-key shape/);
+  });
+
+  it("test_should_declare_NVIDIA_API_KEY_only_for_cloud_onboarding_actions", async () => {
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    const plans = compileRunPlans([
+      "ubuntu-repo-cloud-openclaw",
+      "gpu-repo-local-ollama-openclaw",
+    ]);
+    const cloudOnboard = plans[0].phases
+      .find((p) => p.name === "onboarding")
+      ?.actions.find((a) => a.id.startsWith("onboarding.profile."));
+    const localOnboard = plans[1].phases
+      .find((p) => p.name === "onboarding")
+      ?.actions.find((a) => a.id.startsWith("onboarding.profile."));
+    expect(cloudOnboard?.secretEnv).toEqual(["NVIDIA_API_KEY"]);
+    expect(localOnboard?.secretEnv).toEqual([]);
+  });
+});
+
 describe("clients are pass/fail/policy free", () => {
   it("test_should_keep_clients_free_of_pass_fail_and_retry_semantics", () => {
     const observation = new HostCliClient().observeVersion();
diff --git a/test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts b/test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
new file mode 100644
index 0000000000..eb6c785a91
--- /dev/null
+++ b/test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts
@@ -0,0 +1,73 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+/**
+ * Parity test: the framework's local secret-pattern set
+ * (test/e2e-scenario/scenarios/orchestrators/redaction.ts) must stay in
+ * lockstep with the canonical product source
+ * (src/lib/security/secret-patterns.ts).
+ *
+ * The framework deliberately mirrors rather than imports — see the
+ * "Framework-local mirror" comment in redaction.ts for why — but the
+ * mirror is only safe if it is actually a mirror. This test parses
+ * both source files at the textual level and compares the regex
+ * literals.
+ */
+
+import { describe, expect, it } from "vitest";
+import fs from "node:fs";
+import path from "node:path";
+
+const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
+
+// Pull only regex literals (lines starting with `/` and ending with
+// a flag set like /g or /gi). Filters out comment lines like `// NVIDIA`
+// that begin with `/` but are not regex.
+const REGEX_LITERAL_LINE = /^\/.+\/[a-z]*,?$/;
+
+function extractFromBlock(block: string): string[] {
+  return block
+    .split("\n")
+    .map((line) => line.trim())
+    .filter((line) => REGEX_LITERAL_LINE.test(line))
+    .map((line) => line.replace(/,\s*$/, ""));
+}
+
+function extractRegexLiterals(source: string, exportName: string): string[] {
+  const re = new RegExp(`export const ${exportName}[^=]*=\\s*\\[([\\s\\S]*?)\\];`, "m");
+  const m = source.match(re);
+  return m ? extractFromBlock(m[1]) : [];
+}
+
+function extractFrameworkArray(source: string, constName: string): string[] {
+  const re = new RegExp(`const ${constName}: RegExp\\[\\] = \\[([\\s\\S]*?)\\];`, "m");
+  const m = source.match(re);
+  return m ? extractFromBlock(m[1]) : [];
+}
+
+describe("framework redaction parity with product source-of-truth", () => {
+  const productSource = fs.readFileSync(
+    path.join(REPO_ROOT, "src/lib/security/secret-patterns.ts"),
+    "utf8",
+  );
+  const frameworkSource = fs.readFileSync(
+    path.join(REPO_ROOT, "test/e2e-scenario/scenarios/orchestrators/redaction.ts"),
+    "utf8",
+  );
+
+  it("test_framework_TOKEN_PREFIX_PATTERNS_matches_product_source", () => {
+    const product = extractRegexLiterals(productSource, "TOKEN_PREFIX_PATTERNS");
+    const framework = extractFrameworkArray(frameworkSource, "TOKEN_PREFIX_PATTERNS");
+    expect(framework.length).toBeGreaterThan(0);
+    expect(product.length).toBeGreaterThan(0);
+    expect(framework).toEqual(product);
+  });
+
+  it("test_framework_CONTEXT_PATTERNS_matches_product_source", () => {
+    const product = extractRegexLiterals(productSource, "CONTEXT_PATTERNS");
+    const framework = extractFrameworkArray(frameworkSource, "CONTEXT_PATTERNS");
+    expect(framework.length).toBeGreaterThan(0);
+    expect(product.length).toBeGreaterThan(0);
+    expect(framework).toEqual(product);
+  });
+});
diff --git a/test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts b/test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts
index eb1be9ae19..5a1e3d8906 100644
--- a/test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-scenarios-workflow.test.ts
@@ -50,8 +50,9 @@ jobs:
           "run-scenario job must use the resolved runner output",
           "run-scenario job missing step: Run typed scenarios in WSL",
           "artifact upload name must include the scenarios input",
-          "artifact upload must include hidden .e2e files",
-          "artifact upload path must include .e2e/",
+          "artifact upload must set include-hidden-files: false (raw context.env must not leak)",
+          "artifact upload path must include .e2e/actions/ (redacted action evidence)",
+          "artifact upload path must include .e2e/logs/ (redacted shell-step evidence)",
         ]),
       );
     } finally {
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index e26afb9820..8b6e47019d 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -87,6 +87,25 @@ const ONBOARD_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.
 const INSTALL_TIMEOUT_SECONDS = 900;
 const ONBOARD_TIMEOUT_SECONDS = 900;
 
+// Declared parent-env secrets each onboarding profile actually needs.
+// Anything not listed here (and not in the framework allowlist) is
+// dropped before spawn by buildChildEnv. Keep this list minimal —
+// every entry widens the secret blast radius if the child or one of
+// its descendants logs unredacted output.
+const ONBOARD_PROFILE_SECRET_ENV: Readonly<Record<string, readonly string[]>> = {
+  // Cloud profiles invoke `nemoclaw onboard` which authenticates to the
+  // NVIDIA cloud provider via NVIDIA_API_KEY.
+  "cloud-openclaw": ["NVIDIA_API_KEY"],
+  "cloud-openclaw-custom-policies": ["NVIDIA_API_KEY"],
+  "cloud-openclaw-invalid-nvidia-key": ["NVIDIA_API_KEY"],
+  "cloud-openclaw-gateway-port-conflict": ["NVIDIA_API_KEY"],
+  "cloud-hermes": ["NVIDIA_API_KEY"],
+  "cloud-hermes-discord": ["NVIDIA_API_KEY"],
+  "cloud-hermes-slack": ["NVIDIA_API_KEY"],
+  // Local profiles do not need any cloud secret.
+  "local-ollama-openclaw": [],
+};
+
 function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseAction[] {
   if (phase === "environment") {
     if (!scenario.environment) {
@@ -123,6 +142,11 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
     if (!onboardingId) {
       throw new Error(`Scenario ${scenario.id} is missing environment.onboarding`);
     }
+    // secretEnv defaults to [] (no parent-env secrets pass through)
+    // unless the profile is explicitly listed above. Unknown profiles
+    // get the safest setting and surface the gap loudly the first
+    // time they actually need a secret to authenticate.
+    const secretEnv = ONBOARD_PROFILE_SECRET_ENV[onboardingId] ?? [];
     return [
       {
         id: `onboarding.profile.${onboardingId}`,
@@ -137,6 +161,7 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
         // Legacy preflight assertions look for ${E2E_CONTEXT_DIR}/onboard.log;
         // publish a stable alias so they keep working without rewiring.
         aliasPath: "onboard.log",
+        secretEnv,
       },
     ];
   }
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index 0bb97f4e80..7bf075b743 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -16,6 +16,7 @@ import type {
   RunPlanPhase,
   TransientClassifier,
 } from "../types.ts";
+import { buildChildEnv, pipeRedacted, redactString } from "./redaction.ts";
 
 const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../../..");
 const DEFAULT_STEP_TIMEOUT_SECONDS = 300;
@@ -116,22 +117,30 @@ export class PhaseOrchestrator {
       ? [dispatchAction, action.fn ?? "", action.arg ?? "", scriptPath]
       : [scriptPath, ...(action.arg ? [action.arg] : [])];
 
-    const env: NodeJS.ProcessEnv = {
-      ...process.env,
-      E2E_CONTEXT_DIR: ctx.contextDir,
-      E2E_PHASE: action.phase,
-      E2E_ACTION_ID: action.id,
-    };
+    // Framework-owned secret hygiene at the spawn boundary. The child
+    // gets a minimal allowlisted env plus only the secrets this action
+    // explicitly declared via PhaseAction.secretEnv. See
+    // orchestrators/redaction.ts for the full contract.
+    const env = buildChildEnv(process.env, {
+      secretEnv: action.secretEnv,
+      frameworkOverlay: {
+        E2E_CONTEXT_DIR: ctx.contextDir,
+        E2E_PHASE: action.phase,
+        E2E_ACTION_ID: action.id,
+      },
+    });
 
     return await new Promise<PhaseActionResult>((resolve) => {
       const child = spawn("bash", bashArgs, { env, cwd: REPO_ROOT, detached: true });
       const pgid = child.pid;
       const logStream = fs.createWriteStream(logPath);
       let stderrTail = "";
-      child.stdout.pipe(logStream, { end: false });
-      child.stderr.pipe(logStream, { end: false });
-      child.stderr.on("data", (chunk: Buffer) => {
-        stderrTail = (stderrTail + chunk.toString("utf8")).slice(-4096);
+      // Every byte from the child passes through redactString before
+      // hitting the evidence log or the stderr tail; raw output never
+      // touches disk or PhaseActionResult.message.
+      pipeRedacted(child.stdout, logStream);
+      pipeRedacted(child.stderr, logStream, (redactedChunk) => {
+        stderrTail = (stderrTail + redactedChunk).slice(-4096);
       });
 
       const killGroup = (signal: NodeJS.Signals) => {
@@ -176,7 +185,7 @@ export class PhaseOrchestrator {
             status: "failed",
             durationMs: Date.now() - startedAt,
             evidence: logPath,
-            message: `phase action ${action.id} spawn error: ${err.message}`,
+            message: redactString(`phase action ${action.id} spawn error: ${err.message}`),
           }),
         );
       });
@@ -310,12 +319,18 @@ export class PhaseOrchestrator {
     fs.mkdirSync(logDir, { recursive: true });
     const logPath = path.join(logDir, `${step.id}.log`);
 
-    const env: NodeJS.ProcessEnv = {
-      ...process.env,
-      E2E_CONTEXT_DIR: ctx.contextDir,
-      E2E_STEP_ID: step.id,
-      E2E_PHASE: step.phase,
-    };
+    // Framework-owned secret hygiene at the spawn boundary (mirrors
+    // runAction). The shell step's child gets only the framework
+    // allowlist + scenario context.env keys + step.secretEnv
+    // declarations. See orchestrators/redaction.ts.
+    const env = buildChildEnv(process.env, {
+      secretEnv: step.secretEnv,
+      frameworkOverlay: {
+        E2E_CONTEXT_DIR: ctx.contextDir,
+        E2E_STEP_ID: step.id,
+        E2E_PHASE: step.phase,
+      },
+    });
     // Surface scenario-derived context (E2E_SCENARIO, E2E_SANDBOX_NAME,
     // E2E_GATEWAY_URL, etc.) that the framework wrote at the start of the
     // run and that environment+onboarding phases extended via
@@ -352,10 +367,12 @@ export class PhaseOrchestrator {
       const pgid = child.pid;
       const logStream = fs.createWriteStream(logPath);
       let stderrTail = "";
-      child.stdout.pipe(logStream, { end: false });
-      child.stderr.pipe(logStream, { end: false });
-      child.stderr.on("data", (chunk: Buffer) => {
-        stderrTail = (stderrTail + chunk.toString("utf8")).slice(-4096);
+      // Redact at the I/O boundary; raw bytes from the child must not
+      // reach the evidence log or the stderr tail that flows into
+      // step result.message.
+      pipeRedacted(child.stdout, logStream);
+      pipeRedacted(child.stderr, logStream, (redactedChunk) => {
+        stderrTail = (stderrTail + redactedChunk).slice(-4096);
       });
 
       const killGroup = (signal: NodeJS.Signals) => {
@@ -401,7 +418,7 @@ export class PhaseOrchestrator {
         void finishLog().then(() =>
           resolve({
             status: "failed",
-            message: `shell step ${step.id} spawn error: ${err.message}`,
+            message: redactString(`shell step ${step.id} spawn error: ${err.message}`),
             evidence: logPath,
           }),
         );
diff --git a/test/e2e-scenario/scenarios/orchestrators/redaction.ts b/test/e2e-scenario/scenarios/orchestrators/redaction.ts
new file mode 100644
index 0000000000..745ec61126
--- /dev/null
+++ b/test/e2e-scenario/scenarios/orchestrators/redaction.ts
@@ -0,0 +1,212 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+/**
+ * Framework-owned secret hygiene at the spawn boundary.
+ *
+ * Spec ownership: redaction and child-env minimization are FRAMEWORK
+ * INFRASTRUCTURE, not a per-action / per-script / per-workflow concern.
+ * Children spawned by PhaseOrchestrator must (a) receive a minimal,
+ * typed env (framework allowlist + per-action declared `secretEnv`
+ * passthrough only), and (b) have their stdout/stderr passed through
+ * redaction before any byte reaches an evidence log or
+ * PhaseResult.message. There is no opt-out flag, no env switch, no
+ * helper that bypasses this. One execution mode, secrets always
+ * redacted in evidence — same one-mode discipline that motivates the
+ * rest of this PR.
+ *
+ * Pattern source-of-truth: src/lib/security/secret-patterns.ts. We
+ * mirror the canonical regex sets here (a parity test keeps the two
+ * in lockstep) so framework redaction tracks product-runtime redaction
+ * without coupling the framework to product runtime modules.
+ *
+ * Bash side: test/e2e-scenario/runtime/lib/context.sh::e2e_context_dump
+ * already redacts on dump via _e2e_context_is_sensitive_key. Bash
+ * helpers must continue to use that for diagnostic dumps; this module
+ * only covers the TS-spawned-child I/O path.
+ *
+ * Tests:
+ *   test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+ *     - test_should_not_persist_secret_shaped_child_output_into_evidence
+ *     - test_should_drop_non_allowlisted_parent_env_unless_declared_in_secretEnv
+ *     - test_should_pass_declared_secretEnv_through_to_child
+ */
+
+import type { Readable, Writable } from "node:stream";
+
+const REDACTED = "<REDACTED>";
+
+// Framework-local mirror of src/lib/security/secret-patterns.ts. The
+// framework deliberately does not import from src/lib/security/ so it
+// stays decoupled from product runtime modules and the cross-tsconfig
+// boundary. A parity test
+// (test/e2e-scenario/framework-tests/e2e-redaction-parity.test.ts)
+// asserts these regex sources stay in lockstep with the canonical
+// product source so adding a token shape there keeps both layers
+// honest at once.
+const TOKEN_PREFIX_PATTERNS: RegExp[] = [
+  /nvapi-[A-Za-z0-9_-]{10,}/g,
+  /nvcf-[A-Za-z0-9_-]{10,}/g,
+  /ghp_[A-Za-z0-9_-]{10,}/g,
+  /(?:github_pat_)[A-Za-z0-9_]{30,}/g,
+  /sk-proj-[A-Za-z0-9_-]{10,}/g,
+  /sk-ant-[A-Za-z0-9_-]{10,}/g,
+  /sk-[A-Za-z0-9_-]{20,}/g,
+  /(?:xox[bpas]|xapp)-[A-Za-z0-9-]{10,}/g,
+  /A(?:K|S)IA[A-Z0-9]{16}/g,
+  /hf_[A-Za-z0-9]{10,}/g,
+  /glpat-[A-Za-z0-9_-]{10,}/g,
+  /gsk_[A-Za-z0-9]{10,}/g,
+  /pypi-[A-Za-z0-9_-]{10,}/g,
+  /\bbot\d{8,10}:[A-Za-z0-9_-]{35}\b/g,
+  /\b\d{8,10}:[A-Za-z0-9_-]{35}\b/g,
+  /\b[A-Za-z0-9]{24}\.[A-Za-z0-9_-]{6}\.[A-Za-z0-9_-]{27,}\b/g,
+];
+
+const CONTEXT_PATTERNS: RegExp[] = [
+  /(?<=Bearer\s+)[A-Za-z0-9_.+/=-]{10,}/gi,
+  /(?<=(?:_KEY|API_KEY|SECRET|TOKEN|PASSWORD|CREDENTIAL)[=: ]['"]?)[A-Za-z0-9_.+/=-]{10,}/gi,
+];
+
+/**
+ * Replace every secret-shaped token in `text` with `<REDACTED>`. Uses
+ * the canonical TOKEN_PREFIX_PATTERNS + CONTEXT_PATTERNS sets.
+ *
+ * Best-effort against unknown token shapes. The actual defense is the
+ * env allowlist (buildChildEnv); pattern redaction catches what slips
+ * through (e.g. error messages that echo a secret value).
+ */
+export function redactString(text: string): string {
+  if (!text) return text;
+  let out = text;
+  for (const p of TOKEN_PREFIX_PATTERNS) {
+    p.lastIndex = 0;
+    out = out.replace(p, REDACTED);
+  }
+  for (const p of CONTEXT_PATTERNS) {
+    p.lastIndex = 0;
+    out = out.replace(p, REDACTED);
+  }
+  return out;
+}
+
+// Env keys the framework guarantees children may always see. Anything
+// outside this set, outside FRAMEWORK_ENV_PREFIXES, and not declared
+// in PhaseAction.secretEnv / AssertionStep.secretEnv is dropped before
+// the child spawns.
+const FRAMEWORK_ENV_ALLOWLIST: ReadonlySet<string> = new Set([
+  "PATH",
+  "HOME",
+  "SHELL",
+  "USER",
+  "LOGNAME",
+  "LANG",
+  "LC_ALL",
+  "LC_CTYPE",
+  "TZ",
+  "TERM",
+  "TMPDIR",
+  "RUNNER_TEMP",
+  "RUNNER_OS",
+  "GITHUB_ACTIONS",
+  "CI",
+  "NEMOCLAW_NON_INTERACTIVE",
+  "NEMOCLAW_ACCEPT_THIRD_PARTY_SOFTWARE",
+]);
+
+const FRAMEWORK_ENV_PREFIXES: readonly string[] = ["E2E_", "NEMOCLAW_LOG_"];
+
+// Shape required of any declared secretEnv key — must look like a
+// secret-bearing variable. Prevents accidental allowlisting of
+// non-secret values via the secretEnv channel and keeps the
+// "framework-allowlist vs declared-secret" distinction honest.
+const SECRET_ENV_KEY_SHAPE =
+  /^[A-Z][A-Z0-9_]*(?:API[_]?KEY|TOKEN|SECRET|PASSWORD|CREDENTIAL|PASSPHRASE|PRIVATE[_]?KEY|ACCESS[_]?KEY)$/;
+
+export function isValidSecretEnvKey(key: string): boolean {
+  return SECRET_ENV_KEY_SHAPE.test(key);
+}
+
+export interface BuildChildEnvOptions {
+  /** Per-action / per-step declared secret-bearing env keys to pass through. */
+  secretEnv?: readonly string[];
+  /** Framework-controlled overlay (E2E_CONTEXT_DIR, E2E_PHASE, E2E_*_ID). */
+  frameworkOverlay: NodeJS.ProcessEnv;
+}
+
+/**
+ * Build the child's env from `base` (typically `process.env`) by
+ * keeping only:
+ *   1. keys in FRAMEWORK_ENV_ALLOWLIST
+ *   2. keys starting with one of FRAMEWORK_ENV_PREFIXES
+ *   3. keys explicitly declared in `opts.secretEnv` (validated shape)
+ * then layering `opts.frameworkOverlay` on top.
+ *
+ * Throws if a `secretEnv` entry doesn't match the secret-key shape;
+ * better to fail loudly at compile/runtime than silently leak a
+ * non-secret env var (which would defeat the allowlist purpose).
+ */
+export function buildChildEnv(
+  base: NodeJS.ProcessEnv,
+  opts: BuildChildEnvOptions,
+): NodeJS.ProcessEnv {
+  const out: NodeJS.ProcessEnv = {};
+  for (const [key, value] of Object.entries(base)) {
+    if (value === undefined) continue;
+    if (FRAMEWORK_ENV_ALLOWLIST.has(key)) {
+      out[key] = value;
+      continue;
+    }
+    if (FRAMEWORK_ENV_PREFIXES.some((prefix) => key.startsWith(prefix))) {
+      out[key] = value;
+      continue;
+    }
+  }
+  for (const key of opts.secretEnv ?? []) {
+    if (!isValidSecretEnvKey(key)) {
+      throw new Error(
+        `secretEnv entry '${key}' does not match the secret-key shape ` +
+          `(must end with API_KEY, TOKEN, SECRET, PASSWORD, CREDENTIAL, ` +
+          `PASSPHRASE, PRIVATE_KEY, or ACCESS_KEY). Refusing to allowlist.`,
+      );
+    }
+    if (base[key] !== undefined) {
+      out[key] = base[key];
+    }
+  }
+  Object.assign(out, opts.frameworkOverlay);
+  return out;
+}
+
+/**
+ * Pipe `src` into `log`, redacting every chunk on the way through.
+ * Optional `onChunk` receives the already-redacted text (used by the
+ * orchestrator to keep a redacted stderr tail for failure messages).
+ *
+ * No raw bytes from the child ever reach `log` or the tail callback.
+ */
+export function pipeRedacted(
+  src: Readable,
+  log: Writable,
+  onChunk?: (redactedChunk: string) => void,
+): void {
+  src.on("data", (chunk: Buffer) => {
+    const redacted = redactString(chunk.toString("utf8"));
+    log.write(redacted);
+    onChunk?.(redacted);
+  });
+}
+
+/**
+ * Compact array of all framework env keys the child sees by default.
+ * Exported for tests/diagnostics; do not use to bypass the boundary.
+ */
+export function frameworkEnvAllowlistSnapshot(): {
+  keys: string[];
+  prefixes: string[];
+} {
+  return {
+    keys: [...FRAMEWORK_ENV_ALLOWLIST].sort(),
+    prefixes: [...FRAMEWORK_ENV_PREFIXES],
+  };
+}
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index 084973679d..1d76c3cf47 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -66,6 +66,12 @@ export interface AssertionStep {
   };
   evidencePath?: string;
   reliability?: AssertionStepReliability;
+  // Declared parent-env keys this step requires beyond the framework's
+  // allowlist. Anything not allowlisted and not declared here is
+  // dropped before spawn. See orchestrators/redaction.ts. Each entry
+  // must match the secret-key shape; the framework rejects non-secret
+  // names to keep the allowlist-vs-declared-secret boundary honest.
+  secretEnv?: readonly string[];
 }
 
 export interface AssertionGroup {
@@ -133,6 +139,15 @@ export interface PhaseAction {
   // reference well-known filenames (e.g. ${E2E_CONTEXT_DIR}/onboard.log)
   // keep working without coupling them to the action's stable id.
   aliasPath?: string;
+  // Declared parent-env keys this action requires beyond the
+  // framework's allowlist (PATH, HOME, E2E_*, NEMOCLAW_LOG_*, ...).
+  // Anything not allowlisted and not declared here is dropped before
+  // spawn. See orchestrators/redaction.ts. Each entry must match the
+  // secret-key shape; the framework rejects non-secret names so the
+  // allowlist-vs-declared-secret boundary stays honest. Cloud
+  // onboarding declares ["NVIDIA_API_KEY"]; a profile that needs more
+  // (e.g. Slack tokens) must declare those keys as well.
+  secretEnv?: readonly string[];
 }
 
 export interface RunPlanPhase {
diff --git a/tools/e2e-scenarios/workflow-boundary.mts b/tools/e2e-scenarios/workflow-boundary.mts
index dfa61b1df1..a06b21f3ea 100644
--- a/tools/e2e-scenarios/workflow-boundary.mts
+++ b/tools/e2e-scenarios/workflow-boundary.mts
@@ -143,11 +143,28 @@ export function validateE2eScenariosWorkflowBoundary(
   if (uploadWith.name !== "e2e-scenario-${{ inputs.scenarios || github.event.inputs.scenarios }}") {
     errors.push("artifact upload name must include the scenarios input");
   }
-  if (uploadWith["include-hidden-files"] !== true) {
-    errors.push("artifact upload must include hidden .e2e files");
+  // Framework-owned secret hygiene: include-hidden-files MUST be false.
+  // Hidden dotfiles under the workspace can carry raw secrets (notably
+  // .e2e/context.env, written by e2e_context_set without redaction).
+  // The redacted surfaces are explicit subpaths under .e2e/ that the
+  // framework writes via orchestrators/redaction.ts::pipeRedacted.
+  if (uploadWith["include-hidden-files"] !== false) {
+    errors.push("artifact upload must set include-hidden-files: false (raw context.env must not leak)");
   }
-  if (!stringValue(uploadWith.path).includes(".e2e/")) {
-    errors.push("artifact upload path must include .e2e/");
+  const uploadPath = stringValue(uploadWith.path);
+  if (!uploadPath.includes(".e2e/actions/")) {
+    errors.push("artifact upload path must include .e2e/actions/ (redacted action evidence)");
+  }
+  if (!uploadPath.includes(".e2e/logs/")) {
+    errors.push("artifact upload path must include .e2e/logs/ (redacted shell-step evidence)");
+  }
+  // Bare blanket '.e2e/' (without a trailing subdir) would re-include
+  // the raw context.env file. Reject it so the explicit-subpath
+  // contract stays honest. Subpaths like '.e2e/actions/' are fine.
+  for (const line of uploadPath.split("\n")) {
+    if (line.trim() === ".e2e/") {
+      errors.push("artifact upload path must not list bare .e2e/ (use explicit subpaths to avoid context.env leakage)");
+    }
   }
 
   return errors;

From cc6b7a2205e9896c02f05c13fbbf6e79309e0371 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:05:34 -0400
Subject: [PATCH 14/23] fix(e2e): fail-closed for required security probes and
 expected-failure pending steps

Addresses PR Review Advisor finding #2: 'Required security probes and
expected-failure checks still skip without failing live runs'.

Same architectural failure mode as findings #5 (E2E_DRY_RUN) and #6
(phase-1-skeleton): the typed framework declared a contract, but a
code path silently produced non-failing skips that contradicted this
PR's 'one mode, no fake green' invariant.

Three suites (security-shields, security-policy, security-injection)
were wired as kind: 'probe' steps; the orchestrator marked them
status: 'skipped' because the probe registry hasn't landed yet. The
expected-failure side-effect validator
(runtime.expected-failure.no-side-effects) was wired as
kind: 'pending'. run.ts only sets process.exitCode = 1 on
status === 'failed', so a live run could omit security shields,
network policy, prompt-injection blocking, AND the negative-scenario
forbidden-side-effect contract while still appearing non-failing.

Fix follows the framework spirit: declare requirement in the typed
metadata, enforce it once at the orchestrator boundary.

Typed contract extension (types.ts):
  AssertionStep.required?: boolean
  When true, a probe/pending step that resolves as 'skipped' is
  reclassified as 'failed' by the phase orchestrator. Defaults to
  false (existing behavior preserved for diagnostics, docs-validation,
  and other non-security probes).

Orchestrator (phase.ts):
  executeStep: probe and pending branches now check step.required
  and return status: 'failed' with a 'required <kind> not <state>'
  message when set. Non-required probes/pending steps continue to
  surface as visible skipped results so unrelated gaps stay
  diagnostic.

Registry (assertions/registry.ts):
  - probeStep / pendingStep helpers grew an options arg with
    required?: boolean (matches the secretEnv pattern: options come
    in via typed objects, not positional args).
  - security-shields, security-policy, security-injection probes
    annotated required: true. These are the suites the run is not
    safe without; failing closed beats fake green.
  - runtime.expected-failure.no-side-effects pending step annotated
    required: true. Negative scenarios cannot honestly pass while
    their core side-effect contract is unimplemented.
  - diagnostics, docs-validation probes intentionally remain
    non-required: they are informational, not safety-critical, and
    will switch to required when the probe registry lands and they
    have real implementations.

run.ts: no change needed. The new failed-status flows through the
existing exit-code path (process.exitCode = 1 when any phase status
is failed), so the workflow correctly reports red for missing
required probes/pending steps.
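
The chain described above (typed flag, reclassification in the
orchestrator, unchanged exit-code path) can be sketched as follows. This
is a minimal illustration, not the code in phase.ts or run.ts: the
Step and StepResult shapes are simplified, and resolveUnimplemented,
phaseFailed, and exitCodeFor are hypothetical names introduced only for
this sketch. The message strings follow the ones used in this patch.

```typescript
// Simplified, hypothetical shapes; the real types live in scenarios/types.ts.
type StepStatus = "passed" | "failed" | "skipped";

interface Step {
  id: string;
  kind: "probe" | "pending";
  ref: string;
  required?: boolean; // defaults to false when omitted
}

interface StepResult {
  status: StepStatus;
  message: string;
}

// Resolve a probe/pending step that has no implementation yet.
// Required steps fail closed; everything else stays a visible skip.
function resolveUnimplemented(step: Step): StepResult {
  if (step.required) {
    const gap =
      step.kind === "probe"
        ? "required probe not registered"
        : "required pending step not implemented";
    return { status: "failed", message: `${gap}: ${step.ref} (step ${step.id})` };
  }
  const note = step.kind === "probe" ? "probe not registered" : "pending";
  return { status: "skipped", message: `${note}: ${step.ref}` };
}

// Assumption for this sketch: only a failed step fails the phase;
// skipped steps never do.
function phaseFailed(results: StepResult[]): boolean {
  return results.some((r) => r.status === "failed");
}

// Mirrors the existing rule: exit code 1 only when something failed.
function exitCodeFor(results: StepResult[]): number {
  return phaseFailed(results) ? 1 : 0;
}
```

With this shape, a required security probe turns the run red through
the existing exit-code rule, while a non-required diagnostics probe
still shows up as skipped and leaves the exit code at 0.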

Tests (e2e-phase-orchestrators.test.ts, +5):
  - test_required_probe_step_that_is_unregistered_fails_the_phase
  - test_non_required_probe_step_continues_to_skip_visibly
  - test_required_pending_step_fails_closed
  - test_security_suite_groups_in_registry_mark_their_steps_as_required
  - test_expected_failure_no_side_effects_step_in_registry_is_required

318/318 e2e-scenario framework tests pass.

Until the probe registry follow-up PR registers actual
implementations for shieldsConfigProbe / networkPolicyProbe /
injectionBlockedProbe / expectedFailureNoSideEffectsProbe, scenarios
that include those suites/expected-failure groups will fail loudly
with 'required probe not registered' or 'required pending step not
implemented'. That is the correct intermediate state: visible red
until real, instead of green-by-default.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 95 +++++++++++++++++++
 .../scenarios/assertions/registry.ts          | 58 +++++++++--
 .../scenarios/orchestrators/phase.ts          | 36 +++++--
 test/e2e-scenario/scenarios/types.ts          |  9 ++
 4 files changed, 183 insertions(+), 15 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index a09e5be6a8..2ab43e5c67 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -555,6 +555,101 @@ describe("ScenarioRunner seeds context.env and short-circuits across phases", ()
   });
 });
 
+describe("required probe and pending steps fail closed", () => {
+  it("test_required_probe_step_that_is_unregistered_fails_the_phase", async () => {
+    const ctx = freshCtx();
+    try {
+      const step: AssertionStep = {
+        id: "runtime.security.required-probe",
+        phase: "runtime",
+        implementation: { kind: "probe", ref: "unregisteredSecurityProbe" },
+        evidencePath: ".e2e/assertions/runtime.security.required-probe.json",
+        required: true,
+      };
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.status).toBe("failed");
+      expect(result.assertions[0].status).toBe("failed");
+      expect(result.assertions[0].message).toMatch(/required probe not registered/);
+      expect(result.assertions[0].message).toContain("unregisteredSecurityProbe");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_non_required_probe_step_continues_to_skip_visibly", async () => {
+    const ctx = freshCtx();
+    try {
+      const step: AssertionStep = {
+        id: "runtime.diagnostics.non-required-probe",
+        phase: "runtime",
+        implementation: { kind: "probe", ref: "diagnosticsProbe" },
+        evidencePath: ".e2e/assertions/runtime.diagnostics.non-required-probe.json",
+        // required intentionally omitted (defaults to false)
+      };
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.assertions[0].status).toBe("skipped");
+      expect(result.assertions[0].message).toMatch(/probe not registered/);
+      // Non-required skipped step does not fail the phase.
+      expect(result.status).not.toBe("failed");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_required_pending_step_fails_closed", async () => {
+    const ctx = freshCtx();
+    try {
+      const step: AssertionStep = {
+        id: "runtime.expected-failure.no-side-effects",
+        phase: "runtime",
+        implementation: { kind: "pending", ref: "expectedFailureNoSideEffectsProbe" },
+        evidencePath: ".e2e/assertions/runtime.expected-failure.no-side-effects.json",
+        required: true,
+      };
+      const orchestrator = new PhaseOrchestrator("runtime");
+
+      const result = await orchestrator.run(ctx, makePhase([step]));
+
+      expect(result.status).toBe("failed");
+      expect(result.assertions[0].status).toBe("failed");
+      expect(result.assertions[0].message).toMatch(/required pending step not implemented/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("test_security_suite_groups_in_registry_mark_their_steps_as_required", async () => {
+    const { assertionGroupForSuite } = await import("../scenarios/assertions/registry.ts");
+    for (const suiteId of ["security-shields", "security-policy", "security-injection"]) {
+      const group = assertionGroupForSuite(suiteId);
+      expect(group, `missing assertion group for suite ${suiteId}`).toBeDefined();
+      for (const step of group?.steps ?? []) {
+        expect(
+          step.required,
+          `${suiteId} step ${step.id} must be required so it fails closed`,
+        ).toBe(true);
+      }
+    }
+  });
+
+  it("test_expected_failure_no_side_effects_step_in_registry_is_required", async () => {
+    const { assertionRegistry } = await import("../scenarios/assertions/registry.ts");
+    const group = assertionRegistry.groups.find(
+      (g) => g.id === "runtime.expected-failure.no-side-effects",
+    );
+    expect(group).toBeDefined();
+    for (const step of group?.steps ?? []) {
+      expect(step.required).toBe(true);
+    }
+  });
+});
+
 describe("framework-owned secret hygiene at the spawn boundary", () => {
   it("test_should_not_persist_secret_shaped_child_output_into_evidence", async () => {
     const ctx = freshCtx();
diff --git a/test/e2e-scenario/scenarios/assertions/registry.ts b/test/e2e-scenario/scenarios/assertions/registry.ts
index 2d83ea0341..2a7d6603f4 100644
--- a/test/e2e-scenario/scenarios/assertions/registry.ts
+++ b/test/e2e-scenario/scenarios/assertions/registry.ts
@@ -24,22 +24,42 @@ function shellStep(input: ShellStepInput): AssertionStep {
   };
 }
 
-function probeStep(id: string, phase: PhaseName, ref: string, reliability?: Reliability): AssertionStep {
+interface ProbeStepOptions {
+  reliability?: Reliability;
+  // When true, an unregistered probe fails the phase (and the run)
+  // instead of skipping. Use for security-sensitive probes the run
+  // is not safe without.
+  required?: boolean;
+}
+
+function probeStep(
+  id: string,
+  phase: PhaseName,
+  ref: string,
+  options: ProbeStepOptions = {},
+): AssertionStep {
   return {
     id,
     phase,
     implementation: { kind: "probe", ref },
     evidencePath: `.e2e/assertions/${id}.json`,
-    reliability,
+    reliability: options.reliability,
+    required: options.required,
   };
 }
 
-function pendingStep(id: string, phase: PhaseName, ref: string): AssertionStep {
+function pendingStep(
+  id: string,
+  phase: PhaseName,
+  ref: string,
+  options: { required?: boolean } = {},
+): AssertionStep {
   return {
     id,
     phase,
     implementation: { kind: "pending", ref },
     evidencePath: `.e2e/assertions/${id}.json`,
+    required: options.required,
   };
 }
 
@@ -185,7 +205,21 @@ export const runtimeControlGroups: AssertionGroup[] = [
     phase: "runtime",
     description: "Negative scenario runtime check ensuring forbidden side effects did not occur.",
     migrationStatus: "complete",
-    steps: [pendingStep("runtime.expected-failure.no-side-effects", "runtime", "expectedFailureNoSideEffectsProbe")],
+    steps: [
+      pendingStep(
+        "runtime.expected-failure.no-side-effects",
+        "runtime",
+        "expectedFailureNoSideEffectsProbe",
+        // Negative scenarios assert that a declared failure mode
+        // produced no forbidden side effects. Until the side-effect
+        // validator is implemented, this step must fail closed for
+        // any scenario that opts into runtimeControlGroups[0]
+        // (i.e. scenario.expectedFailure is set). Skipping it would
+        // let negative scenarios silently "pass" without verifying
+        // their core contract.
+        { required: true },
+      ),
+    ],
   },
 ];
 
@@ -218,9 +252,19 @@ export const validationSuiteGroups: AssertionGroup[] = [
   ]),
   suiteGroup("credentials", credentialsSteps),
   suiteGroup("security-credentials", credentialsSteps),
-  suiteGroup("security-shields", [probeStep("security.shields.config", "runtime", "shieldsConfigProbe")]),
-  suiteGroup("security-policy", [probeStep("security.policy.enforced", "runtime", "networkPolicyProbe")]),
-  suiteGroup("security-injection", [probeStep("security.injection.blocked", "runtime", "injectionBlockedProbe")]),
+  // Security-sensitive probes MUST fail closed until the probe
+  // registry lands. A skipped shields/policy/injection check would
+  // produce fake-green for the exact suites these scenarios exist to
+  // protect.
+  suiteGroup("security-shields", [
+    probeStep("security.shields.config", "runtime", "shieldsConfigProbe", { required: true }),
+  ]),
+  suiteGroup("security-policy", [
+    probeStep("security.policy.enforced", "runtime", "networkPolicyProbe", { required: true }),
+  ]),
+  suiteGroup("security-injection", [
+    probeStep("security.injection.blocked", "runtime", "injectionBlockedProbe", { required: true }),
+  ]),
   suiteGroup("messaging-telegram", [
     shellStep({ id: "messaging.telegram.injection-safety", phase: "runtime", ref: "test/e2e-scenario/validation_suites/messaging/telegram/00-telegram-injection-safety.sh", reliability: { timeoutSeconds: 30, retry: { attempts: 2, on: ["external-tunnel"] } } }),
     shellStep({ id: "messaging.telegram.injection-payload-classes", phase: "runtime", ref: "test/e2e-scenario/validation_suites/messaging/telegram/01-telegram-injection-payload-classes.sh", reliability: { timeoutSeconds: 30, retry: { attempts: 2, on: ["external-tunnel"] } } }),
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index 7bf075b743..de952b23fc 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -289,17 +289,37 @@ export class PhaseOrchestrator {
       return this.runShellStep(ctx, step);
     }
     if (kind === "probe") {
-      // Probe registry lands in a follow-up PR. Until then, surface
-      // unimplemented probes as visibly skipped — never as fake green.
-      return {
-        status: "skipped",
-        message: `probe not registered: ${step.implementation?.ref ?? "<no ref>"}`,
-      };
+      // Probe registry lands in a follow-up PR. Until then, probes
+      // surface as visibly skipped — never as fake green. For
+      // security-sensitive or otherwise required probes, the run
+      // must NOT pass on this gap; the typed registry marks those
+      // with `required: true` and we reclassify the skip as a
+      // failure so the phase result fails closed.
+      const ref = step.implementation?.ref ?? "<no ref>";
+      if (step.required) {
+        return {
+          status: "failed",
+          classifier: "runner-infra",
+          message: `required probe not registered: ${ref} (step ${step.id})`,
+        };
+      }
+      return { status: "skipped", message: `probe not registered: ${ref}` };
     }
     if (kind === "pending") {
       // pending steps surface as skipped with the placeholder ref so
-      // gaps are visible in plan output and phase results.
-      return { status: "skipped", message: `pending: ${step.implementation?.ref ?? ""}` };
+      // gaps are visible in plan output and phase results. Required
+      // pending steps (e.g. expected-failure side-effect validators
+      // for negative scenarios) fail closed instead — the run cannot
+      // honestly pass while the contract is unimplemented.
+      const ref = step.implementation?.ref ?? "";
+      if (step.required) {
+        return {
+          status: "failed",
+          classifier: "runner-infra",
+          message: `required pending step not implemented: ${ref} (step ${step.id})`,
+        };
+      }
+      return { status: "skipped", message: `pending: ${ref}` };
     }
     throw new Error(`Unknown assertion step kind for ${step.id}: ${String(kind)}`);
   }
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index 1d76c3cf47..46201f55a2 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -72,6 +72,15 @@ export interface AssertionStep {
   // must match the secret-key shape; the framework rejects non-secret
   // names to keep the allowlist-vs-declared-secret boundary honest.
   secretEnv?: readonly string[];
+  // When true, a probe/pending step that resolves as "skipped" is
+  // reclassified as "failed" by the phase orchestrator. Required
+  // steps fail closed when their underlying implementation isn't
+  // available yet (probe registry not landed, expected-failure
+  // side-effect validator not implemented, ...) instead of silently
+  // producing fake green. Defaults to false; set true for
+  // security-sensitive suites and expected-failure validators that
+  // the run is not safe without.
+  required?: boolean;
 }
 
 export interface AssertionGroup {

From 5bb3bc772bb4d63522152264c03d88cb3e769cf0 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:11:54 -0400
Subject: [PATCH 15/23] test(e2e): prefer openshell sandbox ssh-config + ssh -F
 transport

Diagnostics surfaced by the timeout wrapper on PR #4380's scenario run
showed every assertion that goes through 'openshell sandbox exec' hangs
in CI (host can curl the gateway and list sandboxes, but openshell-exec
into the sandbox never returns). The legacy test/e2e/ scripts have
always entered the sandbox via 'openshell sandbox ssh-config' + 'ssh -F'
and don't exhibit this hang, so adopt that pattern as the default
transport in the canonical wrapper.

Behavior:

- On the first call for a given sandbox, e2e_sandbox_exec calls
  'openshell sandbox ssh-config <name>' once and caches the result
  under ${E2E_CONTEXT_DIR}/.ssh-config-cache/<name>.cfg.
- Subsequent calls reuse the cached config and invoke
  'ssh -F <cfg> -o ConnectTimeout=10 -o StrictHostKeyChecking=no
   -o UserKnownHostsFile=/dev/null -o LogLevel=ERROR
   openshell-<name> <quoted remote-cmd>'.
- Args are quoted via printf '%q' so payloads with shell
  metacharacters (e.g. JSON bodies for chat-completion) survive
  intact.
- Fallback: if 'openshell sandbox ssh-config' returns non-zero (e.g.
  older openshell builds, runners without an ssh client), the wrapper
  falls back to 'openshell sandbox exec' and emits a stderr breadcrumb.
- Opt-out: set E2E_SANDBOX_EXEC_VIA_OPENSHELL=1 to force the original
  transport (used by the framework unit tests that stub openshell
  but not ssh).
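
The transport choice and the shape of the ssh invocation listed above
can be sketched in TypeScript. This is an illustration only: the real
logic is bash in sandbox-exec.sh, and chooseTransport, quoteArg, and
buildSshArgv are hypothetical names. quoteArg uses POSIX single-quote
escaping as a stand-in for bash's printf '%q'; the two produce
different text for some inputs, but both aim to keep each argument
intact through the remote shell.

```typescript
type Transport = "ssh" | "openshell";

// Opt-out wins; otherwise use ssh only when ssh-config succeeded.
function chooseTransport(
  env: Record<string, string | undefined>,
  sshConfigOk: boolean,
): Transport {
  if (env.E2E_SANDBOX_EXEC_VIA_OPENSHELL === "1") return "openshell";
  return sshConfigOk ? "ssh" : "openshell";
}

// Quote one argument for a POSIX shell. Plain words pass through
// unchanged; anything else is wrapped in single quotes, with embedded
// single quotes rewritten as '\''.
function quoteArg(arg: string): string {
  if (/^[A-Za-z0-9_\/.:=@%+,-]+$/.test(arg)) return arg;
  return `'${arg.replace(/'/g, `'\\''`)}'`;
}

// Build the argv passed to ssh: config file, the four -o options from
// the list above, the host alias, then the remote command as a single
// quoted string.
function buildSshArgv(cfgPath: string, sandbox: string, cmd: string[]): string[] {
  return [
    "-F", cfgPath,
    "-o", "ConnectTimeout=10",
    "-o", "StrictHostKeyChecking=no",
    "-o", "UserKnownHostsFile=/dev/null",
    "-o", "LogLevel=ERROR",
    `openshell-${sandbox}`,
    cmd.map(quoteArg).join(" "),
  ];
}
```

The remote command travels as one argv entry because ssh joins its
trailing arguments with spaces before handing them to the remote
shell; quoting each argument first is what keeps a JSON body or a
value containing spaces from being re-split on the far side.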

Tests:

- Two new vitest cases:
  * sandbox_exec_should_prefer_ssh_config_transport_when_openshell_offers_one
  * sandbox_exec_should_fall_back_to_openshell_when_ssh_config_unavailable
- Existing exit-code propagation and stdin-quoting tests opt into the
  openshell-direct transport so their stub-openshell expectations
  remain valid.
- 322 framework tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../framework-tests/e2e-lib-helpers.test.ts   | 116 ++++++++++-
 .../validation_suites/sandbox-exec.sh         | 185 ++++++++++++++----
 2 files changed, 256 insertions(+), 45 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
index 27a3cc0662..b1d1c8946a 100644
--- a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
@@ -597,7 +597,9 @@ exec "$@"
         e2e_sandbox_exec sb1 -- false
         echo "rc=$?"
       `,
-        { PATH: `${bin}:${process.env.PATH}` },
+        // Force the openshell-direct transport so the stubbed openshell
+        // (which has no `sandbox ssh-config` subcommand) is exercised.
+        { PATH: `${bin}:${process.env.PATH}`, E2E_SANDBOX_EXEC_VIA_OPENSHELL: "1" },
       );
       expect(r.stdout).toMatch(/rc=1/);
     } finally {
@@ -624,7 +626,12 @@ exec "$@"
           . "${VALIDATION_SUITES}/sandbox-exec.sh"
           printf 'hello $TOKEN' | e2e_sandbox_exec_stdin sb1 -- cat
         `,
-        { PATH: `${bin}:${process.env.PATH}`, TOKEN: "SHOULD_NOT_EXPAND" },
+        {
+          PATH: `${bin}:${process.env.PATH}`,
+          TOKEN: "SHOULD_NOT_EXPAND",
+          // Stub only handles the openshell-direct transport.
+          E2E_SANDBOX_EXEC_VIA_OPENSHELL: "1",
+        },
       );
       expect(r.status, r.stderr).toBe(0);
       expect(r.stdout).toContain("hello $TOKEN");
@@ -633,6 +640,111 @@ exec "$@"
       fs.rmSync(tmp, { recursive: true, force: true });
     }
   });
+
+  it("sandbox_exec_should_prefer_ssh_config_transport_when_openshell_offers_one", () => {
+    // Verify the new default: when `openshell sandbox ssh-config <name>`
+    // succeeds, the wrapper routes through `ssh -F <cfg>` instead of
+    // `openshell sandbox exec`.
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-sbex-ssh-"));
+    try {
+      const bin = path.join(tmp, "bin");
+      fs.mkdirSync(bin);
+      const trace = path.join(tmp, "ssh.trace");
+      fs.writeFileSync(
+        path.join(bin, "openshell"),
+        `#!/usr/bin/env bash
+set -euo pipefail
+if [[ "$1" == "sandbox" && "$2" == "ssh-config" ]]; then
+  printf 'Host openshell-%s\\n  HostName 127.0.0.1\\n  Port 2222\\n  User sandbox\\n' "$3"
+  exit 0
+fi
+echo "unexpected openshell call: $*" >&2
+exit 99
+`,
+        { mode: 0o755 },
+      );
+      fs.writeFileSync(
+        path.join(bin, "ssh"),
+        `#!/usr/bin/env bash
+set -euo pipefail
+printf '%s\\n' "ssh-args:$*" >> "${trace}"
+remote="\${@: -1}"
+printf '%s\\n' "remote-cmd:\${remote}" >> "${trace}"
+echo ok-from-ssh
+exit 0
+`,
+        { mode: 0o755 },
+      );
+      const ctxDir = path.join(tmp, "ctx");
+      fs.mkdirSync(ctxDir);
+      const r = runBash(
+        `
+          set -euo pipefail
+          . "${VALIDATION_SUITES}/sandbox-exec.sh"
+          e2e_sandbox_exec sb1 -- echo hello
+        `,
+        {
+          PATH: `${bin}:${process.env.PATH}`,
+          E2E_CONTEXT_DIR: ctxDir,
+        },
+      );
+      expect(r.status, r.stderr).toBe(0);
+      expect(r.stdout).toContain("ok-from-ssh");
+      const traceContents = fs.readFileSync(trace, "utf8");
+      expect(traceContents).toMatch(/ssh-args:.*-F /);
+      expect(traceContents).toContain("openshell-sb1");
+      expect(traceContents).toMatch(/remote-cmd:echo hello$/m);
+      const cfg = path.join(ctxDir, ".ssh-config-cache", "sb1.cfg");
+      expect(fs.existsSync(cfg)).toBe(true);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("sandbox_exec_should_fall_back_to_openshell_when_ssh_config_unavailable", () => {
+    // If `openshell sandbox ssh-config` fails, the wrapper must fall
+    // back to `openshell sandbox exec`.
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-sbex-fb-"));
+    try {
+      const bin = path.join(tmp, "bin");
+      fs.mkdirSync(bin);
+      fs.writeFileSync(
+        path.join(bin, "openshell"),
+        `#!/usr/bin/env bash
+set -uo pipefail
+if [[ "$1" == "sandbox" && "$2" == "ssh-config" ]]; then
+  exit 1
+fi
+if [[ "$1" == "sandbox" && "$2" == "exec" ]]; then
+  shift 2
+  while [[ "$#" -gt 0 && "$1" != "--" ]]; do shift; done
+  shift || true
+  exec "$@"
+fi
+exit 99
+`,
+        { mode: 0o755 },
+      );
+      const ctxDir = path.join(tmp, "ctx");
+      fs.mkdirSync(ctxDir);
+      const r = runBash(
+        `
+          set -euo pipefail
+          . "${VALIDATION_SUITES}/sandbox-exec.sh"
+          e2e_sandbox_exec sb1 -- echo fallback-ok
+        `,
+        {
+          PATH: `${bin}:${process.env.PATH}`,
+          E2E_CONTEXT_DIR: ctxDir,
+        },
+      );
+      expect(r.status, r.stderr).toBe(0);
+      expect(r.stdout).toContain("fallback-ok");
+      expect(r.stderr).toMatch(/ssh-config unavailable for sb1/);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
 });
 
 // ─────────────────────────────────────────────────────────────────────────────
diff --git a/test/e2e-scenario/validation_suites/sandbox-exec.sh b/test/e2e-scenario/validation_suites/sandbox-exec.sh
index c69f0c963d..44e4288111 100755
--- a/test/e2e-scenario/validation_suites/sandbox-exec.sh
+++ b/test/e2e-scenario/validation_suites/sandbox-exec.sh
@@ -50,6 +50,146 @@ _e2e_sbex_resolve_timeout_cmd() {
   fi
 }
 
+# ----------------------------------------------------------------------
+# ssh-config transport (preferred)
+#
+# `openshell sandbox exec` has been observed to wedge in CI (PR #4380
+# scenario run — host can curl the gateway but `openshell sandbox exec`
+# never returns). The legacy test/e2e/ scripts have always entered the
+# sandbox via `openshell sandbox ssh-config` + `ssh -F`, which works in
+# the same environments. We mirror that pattern here:
+#
+#   1. On first call per sandbox, materialize an ssh-config under
+#      ${E2E_CONTEXT_DIR}/.ssh-config-cache/<sandbox>.cfg.
+#   2. Subsequent calls reuse the cached config.
+#   3. Each ssh invocation gets `-o ConnectTimeout=10`,
+#      `-o StrictHostKeyChecking=no`, `-o UserKnownHostsFile=/dev/null`,
+#      `-o LogLevel=ERROR` to mirror the legacy pattern.
+#
+# Opt-out: set E2E_SANDBOX_EXEC_VIA_OPENSHELL=1 to force the original
+# `openshell sandbox exec` transport (e.g. for debugging or for runners
+# where ssh-config is unavailable).
+# ----------------------------------------------------------------------
+
+_e2e_sbex_ssh_cfg_dir() {
+  local base="${E2E_CONTEXT_DIR:-/tmp}"
+  printf '%s/.ssh-config-cache' "${base}"
+}
+
+# _e2e_sbex_ssh_config_for <sandbox>
+# Prints the path to a populated ssh-config for <sandbox> on stdout.
+# Returns non-zero (and prints nothing) if `openshell sandbox ssh-config`
+# fails — callers fall back to `openshell sandbox exec`.
+_e2e_sbex_ssh_config_for() {
+  local sandbox="$1"
+  local dir cfg
+  dir="$(_e2e_sbex_ssh_cfg_dir)"
+  mkdir -p "${dir}" || return 1
+  cfg="${dir}/${sandbox}.cfg"
+  if [[ ! -s "${cfg}" ]]; then
+    if ! openshell sandbox ssh-config "${sandbox}" >"${cfg}" 2>/dev/null; then
+      rm -f "${cfg}"
+      return 1
+    fi
+  fi
+  printf '%s' "${cfg}"
+}
+
+# _e2e_sbex_quote_args <args...>
+# Outputs the args quoted into a single shell string suitable for
+# embedding as the remote command in `ssh host 'cmd args ...'`.
+_e2e_sbex_quote_args() {
+  local arg out=""
+  for arg in "$@"; do
+    out+="$(printf '%q' "${arg}") "
+  done
+  printf '%s' "${out% }"
+}
+
+# _e2e_sbex_invoke_via_ssh <cfg> <stdin_mode> <seconds> <timeout_cmd>
+# stdin_mode is 'pipe' (forward caller stdin) or 'none' (close stdin).
+# Returns ssh's exit code (124 if timed out, 137 if SIGKILLed).
+_e2e_sbex_invoke_via_ssh() {
+  local cfg="$1" stdin_mode="$2" seconds="$3" timeout_cmd="$4"
+  local remote_cmd ssh_args
+  remote_cmd="$(_e2e_sbex_quote_args "${_E2E_SBEX_CMD[@]}")"
+  ssh_args=(
+    -F "${cfg}"
+    -o ConnectTimeout=10
+    -o StrictHostKeyChecking=no
+    -o UserKnownHostsFile=/dev/null
+    -o LogLevel=ERROR
+    "openshell-${_E2E_SBEX_SB_NAME}"
+    "${remote_cmd}"
+  )
+  if [[ "${stdin_mode}" == "none" ]]; then
+    if [[ -z "${timeout_cmd}" ]]; then
+      ssh "${ssh_args[@]}" </dev/null
+    else
+      "${timeout_cmd}" --kill-after=5s "${seconds}" ssh "${ssh_args[@]}" </dev/null
+    fi
+  else
+    if [[ -z "${timeout_cmd}" ]]; then
+      ssh "${ssh_args[@]}"
+    else
+      "${timeout_cmd}" --kill-after=5s "${seconds}" ssh "${ssh_args[@]}"
+    fi
+  fi
+}
+
+# _e2e_sbex_invoke_via_openshell <stdin_mode> <seconds> <timeout_cmd>
+# Fallback path that uses `openshell sandbox exec`.
+_e2e_sbex_invoke_via_openshell() {
+  local stdin_mode="$1" seconds="$2" timeout_cmd="$3"
+  if [[ -z "${timeout_cmd}" ]]; then
+    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  else
+    "${timeout_cmd}" --kill-after=5s "${seconds}" \
+      openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
+  fi
+}
+
+# _e2e_sbex_dispatch <stdin_mode>
+# Shared body for e2e_sandbox_exec / e2e_sandbox_exec_stdin. Picks the
+# transport (ssh-config preferred; openshell sandbox exec on opt-out or
+# ssh-config failure), applies the per-call timeout, and emits a
+# classified diagnostic on hang.
+_e2e_sbex_dispatch() {
+  local stdin_mode="$1"
+  if ! command -v openshell >/dev/null 2>&1; then
+    echo "e2e_sandbox_exec: openshell CLI not on PATH" >&2
+    return 127
+  fi
+  local timeout_cmd seconds="${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS}"
+  timeout_cmd="$(_e2e_sbex_resolve_timeout_cmd)"
+  if [[ -z "${timeout_cmd}" ]]; then
+    # Make the missing safety net visible so CI can flag it; do not
+    # abort — the orchestrator's step-level timeout still applies.
+    echo "e2e_sandbox_exec: 'timeout' not available; running without per-call cap (sandbox=${_E2E_SBEX_SB_NAME})" >&2
+  fi
+
+  local cfg="" via="ssh" rc=0
+  if [[ "${E2E_SANDBOX_EXEC_VIA_OPENSHELL:-0}" == "1" ]]; then
+    via="openshell"
+  elif ! cfg="$(_e2e_sbex_ssh_config_for "${_E2E_SBEX_SB_NAME}")"; then
+    echo "e2e_sandbox_exec: ssh-config unavailable for ${_E2E_SBEX_SB_NAME}; falling back to 'openshell sandbox exec'" >&2
+    via="openshell"
+  fi
+
+  if [[ "${via}" == "ssh" ]]; then
+    _e2e_sbex_invoke_via_ssh "${cfg}" "${stdin_mode}" "${seconds}" "${timeout_cmd}"
+    rc=$?
+  else
+    _e2e_sbex_invoke_via_openshell "${stdin_mode}" "${seconds}" "${timeout_cmd}"
+    rc=$?
+  fi
+
+  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
+    echo "e2e_sandbox_exec: ${via} transport hung after ${seconds}s (sandbox=${_E2E_SBEX_SB_NAME}, cmd=${_E2E_SBEX_CMD[0]:-?}; classifier=gateway-transient)" >&2
+  fi
+  return "${rc}"
+}
+
 # _e2e_sbex_split_args <sandbox> -- <cmd> [args...]
 # Parses the shared calling convention. Prints on stderr on misuse and
 # returns 2. On success, sets the two global arrays _E2E_SBEX_SB_NAME and
@@ -79,30 +219,7 @@ _e2e_sbex_parse() {
 e2e_sandbox_exec() {
   _e2e_sbex_parse "$@" || return $?
   e2e_env_trace "sandbox:exec" "${_E2E_SBEX_SB_NAME}" "${_E2E_SBEX_CMD[*]}"
-  if ! command -v openshell >/dev/null 2>&1; then
-    echo "e2e_sandbox_exec: openshell CLI not on PATH" >&2
-    return 127
-  fi
-  local timeout_cmd seconds="${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS}"
-  timeout_cmd="$(_e2e_sbex_resolve_timeout_cmd)"
-  if [[ -z "${timeout_cmd}" ]]; then
-    # No timeout binary available — fall back to bare exec but make the
-    # missing safety net visible so CI can flag it.
-    echo "e2e_sandbox_exec: 'timeout' not available; running without per-call cap (sandbox=${_E2E_SBEX_SB_NAME})" >&2
-    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
-    return $?
-  fi
-  local rc=0
-  "${timeout_cmd}" --kill-after=5s "${seconds}" \
-    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
-  rc=$?
-  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
-    # 124 = timeout fired SIGTERM, 137 = --kill-after fired SIGKILL.
-    # Emit a single-line classified diagnostic so phase.ts captures
-    # something more useful than a SIGTERM black hole.
-    echo "e2e_sandbox_exec: openshell sandbox exec hung after ${seconds}s (sandbox=${_E2E_SBEX_SB_NAME}, cmd=${_E2E_SBEX_CMD[0]:-?}; classifier=gateway-transient)" >&2
-  fi
-  return "${rc}"
+  _e2e_sbex_dispatch none
 }
 
 # e2e_sandbox_exec_stdin <sandbox> -- <cmd> [args...]
@@ -112,23 +229,5 @@ e2e_sandbox_exec() {
 e2e_sandbox_exec_stdin() {
   _e2e_sbex_parse "$@" || return $?
   e2e_env_trace "sandbox:exec_stdin" "${_E2E_SBEX_SB_NAME}" "${_E2E_SBEX_CMD[*]}"
-  if ! command -v openshell >/dev/null 2>&1; then
-    echo "e2e_sandbox_exec_stdin: openshell CLI not on PATH" >&2
-    return 127
-  fi
-  local timeout_cmd seconds="${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS}"
-  timeout_cmd="$(_e2e_sbex_resolve_timeout_cmd)"
-  if [[ -z "${timeout_cmd}" ]]; then
-    echo "e2e_sandbox_exec_stdin: 'timeout' not available; running without per-call cap (sandbox=${_E2E_SBEX_SB_NAME})" >&2
-    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
-    return $?
-  fi
-  local rc=0
-  "${timeout_cmd}" --kill-after=5s "${seconds}" \
-    openshell sandbox exec --name "${_E2E_SBEX_SB_NAME}" -- "${_E2E_SBEX_CMD[@]}"
-  rc=$?
-  if [[ "${rc}" -eq 124 || "${rc}" -eq 137 ]]; then
-    echo "e2e_sandbox_exec_stdin: openshell sandbox exec hung after ${seconds}s (sandbox=${_E2E_SBEX_SB_NAME}, cmd=${_E2E_SBEX_CMD[0]:-?}; classifier=gateway-transient)" >&2
-  fi
-  return "${rc}"
+  _e2e_sbex_dispatch pipe
 }

From 10230d968ee0ac66543ad66da0c9ddf2cfd30166 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:16:07 -0400
Subject: [PATCH 16/23] test(e2e): route rebuild_upgrade.sh sandbox exec
 through canonical wrapper

The lifecycle.rebuild.* and lifecycle.upgrade.survivor-reachable
assertions on PR #4380's scenario run still produced raw 120s SIGTERM
cascades because rebuild_upgrade.sh used its own ad-hoc
_rebuild_upgrade_run helper that called 'openshell sandbox exec'
without a per-call timeout or transport choice. The orchestrator's
120s step cap was the only safety net, and on hang the script log
captured no diagnostic.

Migrate the lib onto the canonical wrapper:

- Add _rebuild_upgrade_sandbox_exec(sandbox, cmd...) that:
  * Honors the existing REBUILD_UPGRADE_SANDBOX_CMD test override (with
    the same -n <sandbox> -- <cmd>... contract used by existing fakes)
    so framework unit tests keep working unchanged.
  * Routes production calls through e2e_sandbox_exec, picking up the
    ssh-config-preferred transport, the per-call timeout, and the
    classified diagnostic on hang.
- Replace the five openshell-sandbox-exec call sites with the new
  helper.
- Default E2E_SANDBOX_EXEC_TIMEOUT_SECONDS to 100s for this lib's
  callers (orchestrator caps these steps at 120s; 100s leaves room for
  the wrapper to emit a diagnostic and exit cleanly before SIGTERM).
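
The routing above can be sketched in TypeScript. This is an illustration
only, not the shell implementation: buildSandboxExecArgv and
isClassifiedHang are hypothetical names. The argv shapes and the 124/137
exit codes are the ones used by this lib and the wrapper. Whitespace
splitting approximates the shell's unquoted expansion of the override
and ignores globbing.

```typescript
// Hypothetical sketch of the override-vs-wrapper routing performed by
// _rebuild_upgrade_sandbox_exec; function names are illustrative only.
function buildSandboxExecArgv(
  sandbox: string,
  cmd: readonly string[],
  overrideCmd?: string,
): string[] {
  if (overrideCmd !== undefined && overrideCmd.length > 0) {
    // Test fakes keep the legacy shape: <override> -n <sandbox> -- <cmd>...
    const words = overrideCmd.split(/\s+/).filter((w) => w.length > 0);
    return [...words, "-n", sandbox, "--", ...cmd];
  }
  // Production goes through the canonical wrapper:
  // e2e_sandbox_exec <sandbox> -- <cmd>...
  return ["e2e_sandbox_exec", sandbox, "--", ...cmd];
}

// 124 = timeout sent SIGTERM, 137 = --kill-after sent SIGKILL. The wrapper
// reports both as a hang with classifier=gateway-transient.
function isClassifiedHang(exitCode: number): boolean {
  return exitCode === 124 || exitCode === 137;
}
```

With the 100s per-call cap, a hang should surface as one of these two
exit codes before the orchestrator's 120s step cap fires.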

The existing rebuild_upgrade_checks_should_allow_command_fakes test
continues to pass, proving the override contract is preserved.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../validation_suites/lib/rebuild_upgrade.sh  | 38 ++++++++++++++++---
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
index 99138304de..4870a68c64 100755
--- a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
+++ b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
@@ -10,6 +10,15 @@ _REBUILD_UPGRADE_REPO_ROOT="$(cd "${_REBUILD_UPGRADE_DIR}/../../../.." && pwd)"
 . "${_REBUILD_UPGRADE_REPO_ROOT}/test/e2e-scenario/runtime/lib/context.sh"
 # shellcheck source=../../runtime/lib/logging.sh
 . "${_REBUILD_UPGRADE_REPO_ROOT}/test/e2e-scenario/runtime/lib/logging.sh"
+# shellcheck source=../sandbox-exec.sh
+. "${_REBUILD_UPGRADE_REPO_ROOT}/test/e2e-scenario/validation_suites/sandbox-exec.sh"
+
+# Sandbox-exec calls in this lib feed the lifecycle.rebuild/upgrade
+# orchestrator steps, which carry 120s caps. Default the per-call wrapper
+# cap to 100s so a hung 'openshell sandbox exec'/'ssh -F' surfaces as a
+# classified exit 124 well before the orchestrator's SIGTERM. Callers
+# may still override per-call.
+: "${E2E_SANDBOX_EXEC_TIMEOUT_SECONDS:=100}"
 
 rebuild_upgrade_require_context() {
   e2e_context_require E2E_SCENARIO E2E_AGENT E2E_SANDBOX_NAME E2E_GATEWAY_URL
@@ -30,11 +39,30 @@ _rebuild_upgrade_run() {
   "$@"
 }
 
+# _rebuild_upgrade_sandbox_exec <sandbox> <cmd> [args...]
+# Routes through the canonical `e2e_sandbox_exec` wrapper (ssh-config
+# preferred, openshell-exec fallback, per-call timeout, classified
+# diagnostic on hang) for production; honors the legacy
+# REBUILD_UPGRADE_SANDBOX_CMD override so tests can inject a fake. The
+# override contract preserves the original argv shape
+# (`<override> -n <sandbox> -- <cmd>...`) so existing test fakes
+# (e.g. `REBUILD_UPGRADE_SANDBOX_CMD=fake_sandbox`) keep working.
+_rebuild_upgrade_sandbox_exec() {
+  local sandbox="$1"
+  shift
+  if [[ -n "${REBUILD_UPGRADE_SANDBOX_CMD:-}" ]]; then
+    # shellcheck disable=SC2086
+    ${REBUILD_UPGRADE_SANDBOX_CMD} -n "${sandbox}" -- "$@"
+    return $?
+  fi
+  e2e_sandbox_exec "${sandbox}" -- "$@"
+}
+
 rebuild_upgrade_assert_sandbox_reachable() {
   rebuild_upgrade_require_context || return 1
   local sandbox
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
-  if _rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- true; then
+  if _rebuild_upgrade_sandbox_exec "${sandbox}" true; then
     e2e_pass "suite.upgrade.survivor_agent_reachable"
   else
     e2e_fail "suite.upgrade.survivor_agent_reachable"
@@ -47,7 +75,7 @@ rebuild_upgrade_assert_marker_preserved() {
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   marker_path="${E2E_REBUILD_MARKER_PATH:-/workspace/.nemoclaw-rebuild-marker}"
   expected="${E2E_REBUILD_MARKER_EXPECTED:-${E2E_STATE_MARKER_EXPECTED:-}}"
-  actual="$(_rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- cat "${marker_path}" 2>/dev/null || true)"
+  actual="$(_rebuild_upgrade_sandbox_exec "${sandbox}" cat "${marker_path}" 2>/dev/null || true)"
   if [[ -n "${actual}" && (-z "${expected}" || "${actual}" == "${expected}") ]]; then
     e2e_pass "suite.rebuild.workspace_state_preserved"
   else
@@ -62,7 +90,7 @@ rebuild_upgrade_assert_agent_version_upgraded() {
   old="${E2E_OLD_AGENT_VERSION:-}"
   expected="${E2E_EXPECTED_AGENT_VERSION:-}"
   cmd="${E2E_AGENT_VERSION_COMMAND:-openclaw --version}"
-  actual="$(_rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- bash -lc "${cmd}" 2>/dev/null || true)"
+  actual="$(_rebuild_upgrade_sandbox_exec "${sandbox}" bash -lc "${cmd}" 2>/dev/null || true)"
   if [[ -n "${actual}" && (-z "${old}" || "${actual}" != *"${old}"*) && (-z "${expected}" || "${actual}" == *"${expected}"*) ]]; then
     e2e_pass "suite.rebuild.agent_version_upgraded"
   else
@@ -75,7 +103,7 @@ rebuild_upgrade_assert_inference_works() {
   local sandbox cmd output
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   cmd="${E2E_INFERENCE_CHECK_COMMAND:-curl -fsS http://inference.local/v1/models}"
-  output="$(_rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- bash -lc "${cmd}" 2>/dev/null || true)"
+  output="$(_rebuild_upgrade_sandbox_exec "${sandbox}" bash -lc "${cmd}" 2>/dev/null || true)"
   if [[ -n "${output}" ]]; then
     e2e_pass "suite.rebuild.inference_still_works"
   else
@@ -105,7 +133,7 @@ rebuild_upgrade_assert_hermes_config_preserved() {
   fi
   local sandbox output
   sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
-  output="$(_rebuild_upgrade_run REBUILD_UPGRADE_SANDBOX_CMD openshell sandbox exec -n "${sandbox}" -- bash -lc "grep -R 'platforms.discord\|DISCORD' ~/.hermes . 2>/dev/null" || true)"
+  output="$(_rebuild_upgrade_sandbox_exec "${sandbox}" bash -lc "grep -R 'platforms.discord\|DISCORD' ~/.hermes . 2>/dev/null" || true)"
   if [[ "${output}" == *"discord"* || "${output}" == *"DISCORD"* ]]; then
     e2e_pass "suite.rebuild.hermes_config_preserved"
   else

From 72185d7a99415000143e52aa078ce31b9b9c2bf0 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:20:08 -0400
Subject: [PATCH 17/23] test(e2e): align status_fields_present with real
 nemoclaw status output

The lifecycle.sandbox.list-and-status assertion on PR #4380's scenario
run failed with 'missing status field: status'. Root cause: the
assertion required the literal lower-case tokens 'status', 'gateway',
and 'sandbox' in 'nemoclaw <name> status' output, but the production
command (src/lib/actions/sandbox/status.ts) does not print 'Status:'
or 'Gateway:' labels in its normal output. The original assertion only
passed against the framework-test mock string ('status running gateway
healthy sandbox running'), which was synthetic and did not reflect
what the CLI actually emits.

Align the assertion with the production output:

- Require the sandbox name itself to appear (every status output for a
  present sandbox emits 'Sandbox: <name>').
- Require the field labels that the CLI unconditionally prints when a
  sandbox is found: 'Sandbox', 'Model', 'OpenShell'.
- Drop the 'status' and 'gateway' tokens that never appear in real
  output.
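
The new contract can be sketched in TypeScript. This is a hypothetical
rendering of the shell assertion, not framework code:
checkStatusOutput and its return convention are assumptions made for
illustration. The required labels and failure messages are the ones
used by the assertion.

```typescript
// Hypothetical TypeScript rendering of the shell assertion; the function
// name and return convention are illustrative, not part of the framework.
const REQUIRED_STATUS_LABELS = ["Sandbox", "Model", "OpenShell"] as const;

// Returns a failure message, or undefined when the output satisfies the
// contract.
function checkStatusOutput(output: string, sandboxName: string): string | undefined {
  if (!output.includes(sandboxName)) {
    return `status output did not mention sandbox '${sandboxName}'`;
  }
  for (const label of REQUIRED_STATUS_LABELS) {
    // Case-sensitive: labels are matched exactly as the CLI prints them.
    if (!output.includes(label)) {
      return `missing status field: ${label}`;
    }
  }
  return undefined;
}
```

Under this check the old synthetic mock string is rejected, because it
contains neither the sandbox name nor the capitalized labels.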

This matches the legacy test/e2e/test-full-e2e.sh:230 pattern (verifies
'nemoclaw status' exits 0 and contains substantive content) while
keeping the assertion strict enough to catch CLI regressions that drop
core fields.

Update the framework test mock to emit realistic output ('Sandbox: sb1',
'Model:', 'OpenShell:', 'Policies:') so the existing happy-path test
keeps passing on the new contract.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../framework-tests/e2e-lib-helpers.test.ts   |  2 +-
 .../lib/sandbox_lifecycle.sh                  | 19 +++++++++++++++----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
index b1d1c8946a..82862f5622 100644
--- a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
@@ -1065,7 +1065,7 @@ describe("sandbox lifecycle validation helper", () => {
       fs.writeFileSync(path.join(bin, "nemoclaw"), `#!/usr/bin/env bash
 case "$*" in
   list) echo sb1;;
-  "sb1 status") echo 'status running gateway healthy sandbox running';;
+  "sb1 status") printf '  Sandbox: sb1\\n    Model:    nvidia/x\\n    OpenShell: 0.0.44\\n    Policies: npm\\n';;
   "sb1 logs") echo logline;;
   *) echo "unexpected nemoclaw args: $*" >&2; exit 64;;
 esac
diff --git a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
index aef572c190..3cca8966b4 100755
--- a/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
+++ b/test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
@@ -75,10 +75,21 @@ sandbox_lifecycle_assert_status_fields_present() {
     sandbox_lifecycle_fail "${id}" "nemoclaw status failed"
     return 1
   }
-  local status_output_lower
-  status_output_lower="$(printf '%s' "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" | tr '[:upper:]' '[:lower:]')"
-  for field in status gateway sandbox; do
-    [[ "${status_output_lower}" == *"${field}"* ]] || {
+  # The real `nemoclaw <name> status` output (src/lib/actions/sandbox/status.ts)
+  # always emits a 'Sandbox: <name>' header plus structured fields like
+  # 'Model:', 'OpenShell:', 'Policies:'. The original assertion required
+  # literal 'status' and 'gateway' tokens that never appear in normal
+  # output — it only passed against the test-suite mock. Align with the
+  # production CLI: require the sandbox name and a couple of substantive
+  # field labels that are unconditionally printed.
+  local output="${SANDBOX_LIFECYCLE_LAST_OUTPUT}"
+  if [[ "${output}" != *"${E2E_SANDBOX_NAME}"* ]]; then
+    sandbox_lifecycle_fail "${id}" "status output did not mention sandbox '${E2E_SANDBOX_NAME}'"
+    return 1
+  fi
+  local field
+  for field in Sandbox Model OpenShell; do
+    [[ "${output}" == *"${field}"* ]] || {
       sandbox_lifecycle_fail "${id}" "missing status field: ${field}"
       return 1
     }

From 0d531d9891b018724c84774a5024333b08a12c01 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:44:38 -0400
Subject: [PATCH 18/23] test(e2e): wire negative preflight scenario end-to-end

The 'ubuntu-no-docker-preflight-negative' scenario on PR #4380 was
structurally registered (manifest, scenario id, expected_failure
metadata) but never executed its negative path. The framework called
the regular 'cloud-openclaw' onboarding worker, docker was actually
running, onboard succeeded, and nothing wrote
${E2E_CONTEXT_DIR}/negative-preflight.log. The
'onboarding.preflight.expected-failed' assertion then failed with
'evidence not found'.

Mirror the legacy
test/e2e/e2e-cloud-experimental/test-port8080-conflict.sh pattern
(set up failure condition -> capture onboard output -> assert
non-zero exit -> grep log) inside the typed framework:

- Add nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh:
  * Installs a 'docker' shim earlier on PATH that exits non-zero with
    'Cannot connect to the Docker daemon ...' so commandExists('docker')
    succeeds while 'docker info' fails - the same failure mode users
    see when Docker is installed but the daemon is not running.
  * Runs 'nemoclaw onboard --non-interactive' redirecting stdout+stderr
    to ${E2E_CONTEXT_DIR}/negative-preflight.log.
  * Asserts nemoclaw exited non-zero (preflight DID fail). If onboard
    unexpectedly succeeds, the action fails so a regression that lets
    docker-missing slide is loud, not silent.
  * Returns 0 on the expected-failure path so the orchestrator marks
    the action passed and the dependent assertion phase runs against
    the captured log.

- Wire the worker into onboard/dispatch.sh under the
  'cloud-openclaw-no-docker' profile id.

- Route scenarios with environment.runtime='docker-missing' to the
  '<base>-no-docker' onboarding profile in the compiler. Add the new
  profile to ONBOARD_PROFILE_SECRET_ENV so NVIDIA_API_KEY still flows
  through (matches a real user invocation where the CLI loads creds
  even when preflight aborts).

- Add compiler_routes_docker_missing_runtime_to_no_docker_onboarding_profile
  test that locks in the routing: negative scenario -> -no-docker
  profile id, evidence path, secret env; positive scenario unaffected.
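
Two pieces of the wiring above can be sketched in TypeScript: the
profile routing and the worker's exit-status inversion. The function
names are hypothetical. The 'docker-missing' runtime value and the
'-no-docker' suffix come from this change, and the worker itself is a
shell script, so the second function only illustrates its return logic.

```typescript
// Hypothetical sketch; function names are illustrative only.

// Negative-runtime scenarios are routed to a dedicated onboarding profile.
function resolveOnboardingProfile(baseProfile: string, runtime?: string): string {
  return runtime === "docker-missing" ? `${baseProfile}-no-docker` : baseProfile;
}

// The negative worker inverts the CLI exit status: a failed onboard is the
// expected outcome (action passes, 0); a successful onboard is a
// regression (action fails, 1).
function negativeWorkerExit(onboardExitCode: number): 0 | 1 {
  return onboardExitCode === 0 ? 1 : 0;
}
```

Returning 0 on the expected failure lets the assertion phase run against
the captured log instead of being skipped.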

324 tests pass.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           | 27 +++++++
 .../onboard/cloud-openclaw-no-docker.sh       | 74 +++++++++++++++++++
 .../nemoclaw_scenarios/onboard/dispatch.sh    |  5 ++
 test/e2e-scenario/scenarios/compiler.ts       | 60 ++++++++++++++-
 4 files changed, 164 insertions(+), 2 deletions(-)
 create mode 100644 test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 2ab43e5c67..c0f08fd23a 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -415,6 +415,33 @@ describe("plan compiler emits phase actions for canonical scenarios", () => {
       }
     }
   });
+
+  it("compiler_routes_docker_missing_runtime_to_no_docker_onboarding_profile", async () => {
+    const { compileRunPlans } = await import("../scenarios/compiler.ts");
+    // Negative scenario declares runtime=docker-missing in scenarios.yaml.
+    // The compiler must substitute the onboarding profile id from the
+    // base 'cloud-openclaw' to 'cloud-openclaw-no-docker' so the
+    // dispatcher routes to the worker that installs the docker shim and
+    // captures negative-preflight.log. Without this routing, the
+    // 'onboarding.preflight.expected-failed' assertion has nothing to grep.
+    const [plan] = compileRunPlans(["ubuntu-no-docker-preflight-negative"]);
+    const onb = plan.phases.find((p) => p.name === "onboarding")!;
+    const action = onb.actions.find((a) => a.id.startsWith("onboarding.profile."));
+    expect(action?.id).toBe("onboarding.profile.cloud-openclaw-no-docker");
+    expect(action?.arg).toBe("cloud-openclaw-no-docker");
+    expect(action?.evidencePath).toBe(
+      ".e2e/actions/onboarding.profile.cloud-openclaw-no-docker.log",
+    );
+    // Secret env must still include NVIDIA_API_KEY so behavior matches
+    // a real user invocation (CLI loads creds even if preflight aborts).
+    expect(action?.secretEnv).toContain("NVIDIA_API_KEY");
+    // Positive scenarios must NOT pick up the -no-docker suffix.
+    const [posPlan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+    const posAction = posPlan.phases
+      .find((p) => p.name === "onboarding")!
+      .actions.find((a) => a.id.startsWith("onboarding.profile."));
+    expect(posAction?.arg).toBe("cloud-openclaw");
+  });
 });
 
 describe("ScenarioRunner seeds context.env and short-circuits across phases", () => {
diff --git a/test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh b/test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
new file mode 100644
index 0000000000..9c7b9803f1
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/onboard/cloud-openclaw-no-docker.sh
@@ -0,0 +1,74 @@
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Onboard worker: cloud-openclaw-no-docker profile.
+#
+# Drives the negative `ubuntu-no-docker-preflight-negative` scenario by:
+#
+#   1. Installing a `docker` shim earlier on PATH that exits non-zero
+#      with a "Cannot connect to the Docker daemon" message. This makes
+#      `commandExists("docker")` succeed (the binary is present) while
+#      `docker info` fails — matching the production failure mode users
+#      see when Docker is installed but the daemon is not running.
+#
+#   2. Running `nemoclaw onboard --non-interactive` with stdout+stderr
+#      captured to `${E2E_CONTEXT_DIR}/negative-preflight.log`. The
+#      `onboarding.preflight.expected-failed` assertion greps that file.
+#
+#   3. Asserting that nemoclaw exits non-zero (preflight DID fail). If
+#      onboard unexpectedly succeeds, the action fails so the operator
+#      sees a clear "expected failure did not happen" signal instead of a
+#      green light masking a regression.
+#
+#   4. Returning 0 on the *expected* failure path so the orchestrator
+#      reports the action as passed and the assertion phase runs against
+#      the captured log. Without this, the action would be marked failed
+#      and the dependent assertions would be skipped.
+#
+# Pattern mirrors test/e2e/e2e-cloud-experimental/test-port8080-conflict.sh,
+# which sets up a different failure condition (port 8080 occupied) but
+# follows the same capture-output / check-exit / grep-log shape.
+
+e2e_onboard_cloud_openclaw_no_docker() {
+  e2e_env_apply_noninteractive
+  e2e_context_init
+
+  local log shim_dir rc=0
+  log="${E2E_CONTEXT_DIR}/negative-preflight.log"
+  shim_dir="$(mktemp -d -t e2e-no-docker-XXXXXX)"
+
+  cat >"${shim_dir}/docker" <<'SHIM'
+#!/usr/bin/env bash
+# Negative-preflight docker shim — preserves "docker is installed" while
+# breaking "docker info" / "docker version" so preflight fails with the
+# real "Cannot connect to the Docker daemon" message.
+printf 'Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\n' >&2
+exit 1
+SHIM
+  chmod +x "${shim_dir}/docker"
+
+  echo "negative-preflight: shim docker installed at ${shim_dir}/docker"
+  echo "negative-preflight: log_file=${log}"
+  echo "negative-preflight: invoking nemoclaw onboard --non-interactive (expected to fail at preflight)"
+
+  PATH="${shim_dir}:${PATH}" \
+    nemoclaw onboard --non-interactive --yes-i-accept-third-party-software \
+    >"${log}" 2>&1 || rc=$?
+
+  rm -rf "${shim_dir}"
+
+  echo "negative-preflight: nemoclaw onboard exited ${rc}"
+  if [[ -f "${log}" ]]; then
+    echo "--- captured log tail (${log}) ---"
+    tail -50 "${log}" 2>/dev/null || true
+    echo "--- end captured log ---"
+  fi
+
+  if [[ "${rc}" -eq 0 ]]; then
+    echo "negative-preflight: ERROR: nemoclaw onboard unexpectedly exited 0; preflight should have failed when docker is unreachable" >&2
+    return 1
+  fi
+
+  return 0
+}
diff --git a/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh b/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
index 951ab6613e..fba1004559 100755
--- a/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
+++ b/test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
@@ -14,6 +14,8 @@ _E2E_ONBOARD_RUNTIME_LIB="$(cd "${_E2E_ONBOARD_DIR}/../../runtime/lib" && pwd)"
 . "${_E2E_ONBOARD_RUNTIME_LIB}/context.sh"
 # shellcheck source=cloud-openclaw.sh
 . "${_E2E_ONBOARD_DIR}/cloud-openclaw.sh"
+# shellcheck source=cloud-openclaw-no-docker.sh
+. "${_E2E_ONBOARD_DIR}/cloud-openclaw-no-docker.sh"
 # shellcheck source=cloud-hermes.sh
 . "${_E2E_ONBOARD_DIR}/cloud-hermes.sh"
 # shellcheck source=local-ollama-openclaw.sh
@@ -30,6 +32,9 @@ e2e_onboard() {
     cloud-openclaw)
       e2e_onboard_cloud_openclaw
       ;;
+    cloud-openclaw-no-docker)
+      e2e_onboard_cloud_openclaw_no_docker
+      ;;
     cloud-openclaw-custom-policies)
       E2E_ONBOARDING_MODEL="${E2E_ONBOARDING_MODEL:-nvidia/nemotron-3-super-120b-a12b}"
       E2E_ONBOARDING_POLICY_PRESETS="${E2E_ONBOARDING_POLICY_PRESETS:-npm,pypi}"
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index 8b6e47019d..796e8a05fc 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -8,6 +8,8 @@ import { loadManifest } from "./manifests.ts";
 import { requireScenarios } from "./registry.ts";
 import type {
   AssertionGroup,
+  ExpectedFailureContract,
+  ExpectedFailurePhase,
   NemoClawInstanceManifest,
   PhaseAction,
   PhaseName,
@@ -99,6 +101,11 @@ const ONBOARD_PROFILE_SECRET_ENV: Readonly<Record<string, readonly string[]>> =
   "cloud-openclaw-custom-policies": ["NVIDIA_API_KEY"],
   "cloud-openclaw-invalid-nvidia-key": ["NVIDIA_API_KEY"],
   "cloud-openclaw-gateway-port-conflict": ["NVIDIA_API_KEY"],
+  // Negative scenario: nemoclaw onboard runs against a docker shim that
+  // exits non-zero. Onboard never reaches the cloud auth step, but the
+  // CLI still loads NVIDIA_API_KEY when present — keep it in the secret
+  // env so behavior matches a real user invocation.
+  "cloud-openclaw-no-docker": ["NVIDIA_API_KEY"],
   "cloud-hermes": ["NVIDIA_API_KEY"],
   "cloud-hermes-discord": ["NVIDIA_API_KEY"],
   "cloud-hermes-slack": ["NVIDIA_API_KEY"],
@@ -138,10 +145,21 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
     if (!scenario.environment) {
       return [];
     }
-    const onboardingId = scenario.environment.onboarding;
-    if (!onboardingId) {
+    const baseOnboardingId = scenario.environment.onboarding;
+    if (!baseOnboardingId) {
       throw new Error(`Scenario ${scenario.id} is missing environment.onboarding`);
     }
+    // Negative-runtime scenarios route to a dedicated onboarding profile
+    // that sets up the failure condition (e.g. docker-missing) BEFORE
+    // invoking `nemoclaw onboard` and captures the resulting output to
+    // the log file the assertion phase reads. The profile id convention
+    // is `<base>-no-docker`. New negative profiles register a worker in
+    // nemoclaw_scenarios/onboard/dispatch.sh and a secret-env mapping
+    // above.
+    const onboardingId =
+      scenario.environment.runtime === "docker-missing"
+        ? `${baseOnboardingId}-no-docker`
+        : baseOnboardingId;
     // secretEnv defaults to [] (no parent-env secrets pass through)
     // unless the profile is explicitly listed above. Unknown profiles
     // get the safest setting and surface the gap loudly the first
@@ -178,6 +196,41 @@ const SUT_BOUNDARIES: SutBoundary[] = [
   { id: "state", client: "StateClient" },
 ];
 
+// Negative scenarios advertise their failure mode against one of these
+// user-facing phases. "preflight" is intentionally distinct from the
+// internal PhaseName union: scenario manifests speak the user's vocab
+// ("preflight failed") and the matcher resolves preflight to the
+// onboarding phase orchestrator. See orchestrators/negative-matcher.ts.
+const EXPECTED_FAILURE_PHASES: readonly ExpectedFailurePhase[] = [
+  "environment",
+  "onboarding",
+  "runtime",
+  "preflight",
+];
+
+function validateExpectedFailure(scenarioId: string, contract: ExpectedFailureContract): void {
+  if (!EXPECTED_FAILURE_PHASES.includes(contract.phase)) {
+    throw new Error(
+      `Scenario ${scenarioId} expectedFailure.phase invalid: ${String(contract.phase)} (allowed: ${EXPECTED_FAILURE_PHASES.join(", ")})`,
+    );
+  }
+  if (typeof contract.errorClass !== "string" || contract.errorClass.trim().length === 0) {
+    throw new Error(`Scenario ${scenarioId} expectedFailure.errorClass must be a non-empty string`);
+  }
+  if (contract.forbiddenSideEffects !== undefined) {
+    if (!Array.isArray(contract.forbiddenSideEffects)) {
+      throw new Error(`Scenario ${scenarioId} expectedFailure.forbiddenSideEffects must be an array`);
+    }
+    for (const entry of contract.forbiddenSideEffects) {
+      if (typeof entry !== "string" || entry.trim().length === 0) {
+        throw new Error(
+          `Scenario ${scenarioId} expectedFailure.forbiddenSideEffects entries must be non-empty strings`,
+        );
+      }
+    }
+  }
+}
+
 export function validateRunPlan(plan: RunPlan): void {
   if (!plan.scenarioId) {
     throw new Error("RunPlan missing scenarioId");
@@ -190,6 +243,9 @@ export function validateRunPlan(plan: RunPlan): void {
   if (plan.sutBoundaries.length === 0) {
     throw new Error(`RunPlan ${plan.scenarioId} missing SUT boundaries`);
   }
+  if (plan.expectedFailure) {
+    validateExpectedFailure(plan.scenarioId, plan.expectedFailure);
+  }
 }
 
 export function compileRunPlans(inputs: Array<string | ScenarioDefinition>): RunPlan[] {

From 3198d37d430bc8beb3f670badba73e5eec0b7063 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:48:19 -0400
Subject: [PATCH 19/23] feat(e2e): typed negative-scenario contract matcher (PR
 #4380 advisor #3)

Advisor finding: scenarios with expectedFailure metadata declared
phase/errorClass/forbiddenSideEffects, but nothing in the typed
runner inspected observed phase results to verify the right phase
failed for the right reason. A scenario named
ubuntu-no-docker-preflight-negative could fail because DNS broke and
the run would still show 'failed' without catching the mismatch.

Add framework-owned negative-scenario contract verification, in the
spirit of redaction.ts and context.ts (typed orchestrator infra,
not shell):

- types.ts: ExpectedFailureContract typed shape replaces the prior
  Record<string, unknown> on ScenarioDefinition.expectedFailure and
  RunPlan.expectedFailure. Adds ExpectedFailurePhase
  (PhaseName | 'preflight') so manifests speak the user vocabulary
  while internal PhaseName stays narrow. Adds NegativeContractPhase
  / PhaseResultName so the synthetic phase result the runner emits
  cannot accidentally be declared by a scenario builder.

- orchestrators/negative-matcher.ts (new): pure function
  evaluateNegativeContract(plan, results) returning NegativeContractResult
  with outcome in {matched, no-failure-observed, wrong-phase,
  wrong-error-class}. Resolves expected.phase='preflight' to the
  onboarding orchestrator (where preflight assertions live).
  Matches errorClass by case-folded, separator-tolerant substring.
  Excludes the runtime side-effect probe step from observed-failure
  detection so the matcher is not confused by its own enforcement
  scaffolding.

- orchestrators/runner.ts: after phases run, if plan.expectedFailure
  is set, call evaluateNegativeContract and append a synthetic
  PhaseResult with phase='negative-contract'. Emits
  .e2e/negative-contract.json artifact alongside per-phase results.
  Positive scenarios are untouched.

- run.ts: planFailed() consults the synthetic contract phase for
  negative scenarios. A negative scenario is green iff the contract
  matched AND the runtime control group's required no-side-effects
  step passed. Until the forbidden-side-effect probe lands, the
  required pending step keeps that piece red, so matched-failure-mode
  alone still cannot flip a negative scenario green.

- builder.ts / scenarios/baseline.ts: thread the typed contract
  through the builder API and the canonical input shape.

- 15 new tests in e2e-negative-matcher.test.ts cover: matched,
  preflight->onboarding mapping, no-failure-observed, wrong-phase,
  wrong-error-class, side-effect probe step ignored, case-insensitive
  matching, runner integration (matched + mismatched + positive
  unaffected), registry contract (every negative scenario opts into
  the side-effect probe step), and compiler validation rejects bad
  shapes.
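
The decision logic described above can be sketched as follows. This is
a simplified, hypothetical sketch, not the contents of
negative-matcher.ts: the types are trimmed stand-ins, the function
names are invented, picking the first failed phase is an assumption,
and so is the normalization rule (lower-case, then strip '-', '_' and
whitespace) used for "separator-tolerant". The exclusion of the
side-effect probe step is omitted.

```typescript
// Hypothetical sketch of the matcher's decision logic; see assumptions
// in the surrounding text.
type Phase = "environment" | "onboarding" | "runtime";
type ExpectedPhase = Phase | "preflight";
type Outcome = "matched" | "no-failure-observed" | "wrong-phase" | "wrong-error-class";

interface ObservedPhase {
  phase: Phase;
  status: "passed" | "failed" | "skipped";
  messages: string[]; // messages of failed actions/assertions in this phase
}

// Assumed normalization: case-fold and drop separators.
const normalize = (s: string): string => s.toLowerCase().replace(/[-_\s]+/g, "");

function evaluateSketch(
  expected: { phase: ExpectedPhase; errorClass: string },
  results: readonly ObservedPhase[],
): Outcome {
  const failed = results.find((r) => r.status === "failed");
  if (!failed) return "no-failure-observed";
  // 'preflight' is user vocabulary; preflight assertions run in onboarding.
  const target: Phase = expected.phase === "preflight" ? "onboarding" : expected.phase;
  if (failed.phase !== target) return "wrong-phase";
  const wanted = normalize(expected.errorClass);
  return failed.messages.some((m) => normalize(m).includes(wanted))
    ? "matched"
    : "wrong-error-class";
}

// A negative scenario is green only if the contract matched AND the
// required no-side-effects step passed.
function negativeScenarioGreen(outcome: Outcome, sideEffectsStepPassed: boolean): boolean {
  return outcome === "matched" && sideEffectsStepPassed;
}
```

Keeping the evaluation a pure function of the plan and the observed
results is what lets the tests exercise each outcome without running a
scenario.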

Spec ownership boundaries kept honest:
- Failure injection (uninstalling docker, planting a bad key,
  occupying a port) stays runner-environment prep, not framework
  code. Matcher only inspects observed results.
- Forbidden-side-effect verification stays the
  expectedFailureNoSideEffectsProbe's job. The matcher reports
  phase + errorClass independently; the required pending step from
  cc6b7a220 keeps the side-effect axis visibly red until the probe
  lands.

354 framework tests pass (15 new). tsc clean.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-negative-matcher.test.ts              | 391 ++++++++++++++++++
 test/e2e-scenario/scenarios/builder.ts        |   2 +-
 .../orchestrators/negative-matcher.ts         | 206 +++++++++
 .../scenarios/orchestrators/runner.ts         |  44 +-
 test/e2e-scenario/scenarios/run.ts            |  32 +-
 .../scenarios/scenarios/baseline.ts           |   4 +-
 test/e2e-scenario/scenarios/types.ts          |  25 +-
 7 files changed, 696 insertions(+), 8 deletions(-)
 create mode 100644 test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
 create mode 100644 test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts

diff --git a/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts b/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
new file mode 100644
index 0000000000..fc8f94eea9
--- /dev/null
+++ b/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
@@ -0,0 +1,391 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { describe, expect, it } from "vitest";
+import fs from "node:fs";
+import os from "node:os";
+import path from "node:path";
+
+import { compileRunPlans } from "../scenarios/compiler.ts";
+import {
+  evaluateNegativeContract,
+  negativeContractPhaseResult,
+} from "../scenarios/orchestrators/negative-matcher.ts";
+import { ScenarioRunner } from "../scenarios/orchestrators/runner.ts";
+import { listScenarios } from "../scenarios/registry.ts";
+import type {
+  ExpectedFailureContract,
+  PhaseName,
+  PhaseResult,
+  RunContext,
+  RunPlan,
+  RunPlanPhase,
+} from "../scenarios/types.ts";
+
+function freshCtx(): RunContext {
+  return { contextDir: fs.mkdtempSync(path.join(os.tmpdir(), "e2e-neg-")) };
+}
+
+function planWithExpectedFailure(contract: ExpectedFailureContract): RunPlan {
+  return {
+    scenarioId: "synthetic-negative",
+    status: "compiled",
+    suiteIds: [],
+    onboardingAssertionIds: [],
+    phases: [
+      { name: "environment", actions: [], assertionGroups: [] },
+      { name: "onboarding", actions: [], assertionGroups: [] },
+      { name: "runtime", actions: [], assertionGroups: [] },
+    ],
+    runnerRequirements: [],
+    requiredSecrets: [],
+    skippedCapabilities: [],
+    expectedFailure: contract,
+    sutBoundaries: [{ id: "host-cli", client: "HostCliClient" }],
+  };
+}
+
+function phaseResult(
+  phase: PhaseName,
+  opts: {
+    status?: PhaseResult["status"];
+    failedActionId?: string;
+    failedActionMessage?: string;
+    failedAssertionId?: string;
+    failedAssertionMessage?: string;
+  } = {},
+): PhaseResult {
+  return {
+    phase,
+    status: opts.status ?? "passed",
+    actions: opts.failedActionId
+      ? [{ id: opts.failedActionId, status: "failed", durationMs: 1, message: opts.failedActionMessage }]
+      : [],
+    assertions: opts.failedAssertionId
+      ? [
+          {
+            id: opts.failedAssertionId,
+            status: "failed",
+            attempts: 1,
+            durationMs: 1,
+            message: opts.failedAssertionMessage,
+          },
+        ]
+      : [],
+  };
+}
+
+describe("evaluateNegativeContract - phase + errorClass matching", () => {
+  it("matches when expected phase fails with the declared errorClass", () => {
+    const plan = planWithExpectedFailure({
+      phase: "onboarding",
+      errorClass: "invalid-nvidia-api-key",
+      forbiddenSideEffects: ["gateway-started"],
+    });
+    const results: PhaseResult[] = [
+      phaseResult("environment", { status: "passed" }),
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding.profile.cloud-openclaw-invalid-nvidia-key",
+        failedActionMessage: "phase action onboarding exit 1: invalid-nvidia-api-key auth failed",
+      }),
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(true);
+    expect(result.outcome).toBe("matched");
+    expect(result.observed.failedPhase).toBe("onboarding");
+  });
+
+  it("resolves preflight expected phase to onboarding orchestrator", () => {
+    const plan = planWithExpectedFailure({
+      phase: "preflight",
+      errorClass: "docker-missing",
+    });
+    const results: PhaseResult[] = [
+      phaseResult("environment", { status: "passed" }),
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding.profile.cloud-openclaw",
+        failedActionMessage: "preflight detected docker-missing on the runner host",
+      }),
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(true);
+    expect(result.outcome).toBe("matched");
+  });
+
+  it("fails when no failure was observed at all", () => {
+    const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+    const results: PhaseResult[] = [
+      phaseResult("environment", { status: "passed" }),
+      phaseResult("onboarding", { status: "passed" }),
+      phaseResult("runtime", { status: "passed" }),
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(false);
+    expect(result.outcome).toBe("no-failure-observed");
+    expect(result.message).toMatch(/all phases passed/);
+  });
+
+  it("fails when the wrong phase failed", () => {
+    const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+    const results: PhaseResult[] = [
+      phaseResult("environment", {
+        status: "failed",
+        failedActionId: "environment.install.ubuntu-repo-no-docker",
+        failedActionMessage: "install dispatcher exit 1: docker-missing",
+      }),
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(false);
+    expect(result.outcome).toBe("wrong-phase");
+    expect(result.message).toMatch(/expected onboarding failure/);
+    expect(result.observed.failedPhase).toBe("environment");
+  });
+
+  it("fails when the right phase failed for the wrong errorClass", () => {
+    const plan = planWithExpectedFailure({
+      phase: "onboarding",
+      errorClass: "gateway-port-conflict",
+    });
+    const results: PhaseResult[] = [
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding.profile.cloud-openclaw-gateway-port-conflict",
+        failedActionMessage: "onboard exit 1: invalid-nvidia-api-key authentication failed",
+      }),
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(false);
+    expect(result.outcome).toBe("wrong-error-class");
+    expect(result.message).toMatch(/errorClass mismatch/);
+  });
+
+  it("ignores the runtime side-effect probe step when scanning for observed failure", () => {
+    const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+    const results: PhaseResult[] = [
+      phaseResult("environment", { status: "passed" }),
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding.profile.cloud-openclaw",
+        failedActionMessage: "onboard exit 1: docker-missing daemon unreachable",
+      }),
+      // runtime phase has only the required pending side-effect step
+      // that fails closed until the probe lands. The matcher must NOT
+      // treat that as the observed failure mode.
+      {
+        phase: "runtime",
+        status: "failed",
+        actions: [],
+        assertions: [
+          {
+            id: "runtime.expected-failure.no-side-effects",
+            status: "failed",
+            attempts: 1,
+            durationMs: 0,
+            message: "required pending step not implemented: expectedFailureNoSideEffectsProbe",
+          },
+        ],
+      },
+    ];
+    const result = evaluateNegativeContract(plan, results);
+    expect(result.matched).toBe(true);
+    expect(result.observed.failedActionId).toBe("onboarding.profile.cloud-openclaw");
+  });
+
+  it("matches errorClass case-insensitively and across separator variants", () => {
+    const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+    const results: PhaseResult[] = [
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding",
+        failedActionMessage: "Onboard exit 1: Docker_Missing daemon socket unreachable",
+      }),
+    ];
+    expect(evaluateNegativeContract(plan, results).matched).toBe(true);
+  });
+
+  it("throws if invoked for a plan without expectedFailure", () => {
+    const plan: RunPlan = { ...planWithExpectedFailure({ phase: "onboarding", errorClass: "x" }), expectedFailure: undefined };
+    expect(() => evaluateNegativeContract(plan, [])).toThrow(/no expectedFailure declared/);
+  });
+
+  it("synthetic phase result reflects matched status", () => {
+    const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+    const results: PhaseResult[] = [
+      phaseResult("onboarding", {
+        status: "failed",
+        failedActionId: "onboarding",
+        failedActionMessage: "docker-missing",
+      }),
+    ];
+    const synthetic = negativeContractPhaseResult(evaluateNegativeContract(plan, results));
+    expect(synthetic.phase).toBe("negative-contract");
+    expect(synthetic.status).toBe("passed");
+    expect(synthetic.assertions[0]).toEqual(
+      expect.objectContaining({ id: "negative-contract.match", status: "passed" }),
+    );
+  });
+});
+
+describe("ScenarioRunner appends negative-contract phase", () => {
+  it("invokes matcher and appends a passing synthetic phase when contract matched", async () => {
+    const ctx = freshCtx();
+    try {
+      const fakePhase = (
+        phase: PhaseName,
+        outcome: PhaseResult,
+      ) => ({
+        run: async (
+          _ctx: RunContext,
+          _runPhase: RunPlanPhase,
+          _prior?: PhaseResult[],
+        ): Promise<PhaseResult> => outcome,
+      });
+
+      const runner = new ScenarioRunner({
+        environment: fakePhase("environment", { phase: "environment", status: "passed", actions: [], assertions: [] }),
+        onboarding: fakePhase("onboarding", {
+          phase: "onboarding",
+          status: "failed",
+          actions: [
+            {
+              id: "onboarding.profile.cloud-openclaw",
+              status: "failed",
+              durationMs: 1,
+              message: "onboard exit 1: docker-missing daemon unreachable",
+            },
+          ],
+          assertions: [],
+        }),
+        runtime: fakePhase("runtime", { phase: "runtime", status: "passed", actions: [], assertions: [] }),
+      });
+
+      const plan = planWithExpectedFailure({ phase: "preflight", errorClass: "docker-missing" });
+      const results = await runner.run(ctx, plan);
+
+      const contractPhase = results[results.length - 1];
+      expect(contractPhase.phase).toBe("negative-contract");
+      expect(contractPhase.status).toBe("passed");
+
+      // Artifact emitted to ctx.contextDir/.e2e/negative-contract.json
+      const artifact = path.join(ctx.contextDir, ".e2e", "negative-contract.json");
+      expect(fs.existsSync(artifact)).toBe(true);
+      const parsed = JSON.parse(fs.readFileSync(artifact, "utf8"));
+      expect(parsed.matched).toBe(true);
+      expect(parsed.outcome).toBe("matched");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("emits a failed synthetic phase when the wrong phase failed", async () => {
+    const ctx = freshCtx();
+    try {
+      const fakePhase = (outcome: PhaseResult) => ({
+        run: async (): Promise<PhaseResult> => outcome,
+      });
+
+      const runner = new ScenarioRunner({
+        environment: fakePhase({
+          phase: "environment",
+          status: "failed",
+          actions: [
+            {
+              id: "environment.install.ubuntu-repo-no-docker",
+              status: "failed",
+              durationMs: 1,
+              message: "install dispatcher exit 1: dns-resolution-error",
+            },
+          ],
+          assertions: [],
+        }),
+        onboarding: fakePhase({ phase: "onboarding", status: "skipped", actions: [], assertions: [] }),
+        runtime: fakePhase({ phase: "runtime", status: "skipped", actions: [], assertions: [] }),
+      });
+
+      const plan = planWithExpectedFailure({ phase: "onboarding", errorClass: "docker-missing" });
+      const results = await runner.run(ctx, plan);
+
+      const contractPhase = results[results.length - 1];
+      expect(contractPhase.phase).toBe("negative-contract");
+      expect(contractPhase.status).toBe("failed");
+      expect(contractPhase.assertions[0].message).toMatch(/expected onboarding failure/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("does NOT append negative-contract phase for positive scenarios", async () => {
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      expect(plan.expectedFailure).toBeUndefined();
+
+      const fakePhase = (phase: PhaseName) => ({
+        run: async (): Promise<PhaseResult> => ({
+          phase,
+          status: "passed",
+          actions: [],
+          assertions: [],
+        }),
+      });
+      const runner = new ScenarioRunner({
+        environment: fakePhase("environment"),
+        onboarding: fakePhase("onboarding"),
+        runtime: fakePhase("runtime"),
+      });
+
+      const results = await runner.run(ctx, plan);
+      expect(results.map((r) => r.phase)).toEqual(["environment", "onboarding", "runtime"]);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("registry contract: every negative scenario opts into the side-effect probe", () => {
+  it("scenario.expectedFailure implies the runtime no-side-effects required pending step", () => {
+    const negatives = listScenarios().filter((scenario) => scenario.expectedFailure);
+    expect(negatives.length).toBeGreaterThan(0);
+    for (const scenario of negatives) {
+      const runtimeGroups = scenario.assertionGroups.filter((group) => group.phase === "runtime");
+      const hasProbeStep = runtimeGroups.some((group) =>
+        group.steps.some(
+          (step) =>
+            step.id === "runtime.expected-failure.no-side-effects" &&
+            step.implementation?.kind === "pending" &&
+            step.required === true,
+        ),
+      );
+      expect(hasProbeStep, `scenario ${scenario.id} must include the required side-effect pending step`).toBe(true);
+    }
+  });
+});
+
+describe("compiler validates the typed expected-failure contract", () => {
+  it("rejects an invalid phase value", () => {
+    expect(() =>
+      compileRunPlans([
+        {
+          id: "synthetic-bad-phase",
+          assertionGroups: [],
+          // Force the bad shape the compiler must reject.
+          expectedFailure: { phase: "bogus" as never, errorClass: "x" },
+        },
+      ]),
+    ).toThrow(/expectedFailure\.phase invalid/);
+  });
+
+  it("rejects an empty errorClass", () => {
+    expect(() =>
+      compileRunPlans([
+        {
+          id: "synthetic-empty-class",
+          assertionGroups: [],
+          expectedFailure: { phase: "onboarding", errorClass: "" },
+        },
+      ]),
+    ).toThrow(/errorClass must be a non-empty string/);
+  });
+});
diff --git a/test/e2e-scenario/scenarios/builder.ts b/test/e2e-scenario/scenarios/builder.ts
index b2b9243a51..d4c2327e84 100644
--- a/test/e2e-scenario/scenarios/builder.ts
+++ b/test/e2e-scenario/scenarios/builder.ts
@@ -60,7 +60,7 @@ export class ScenarioBuilder {
     return this;
   }
 
-  expectedFailure(expectedFailure: Record<string, unknown>): ScenarioBuilder {
+  expectedFailure(expectedFailure: import("./types.ts").ExpectedFailureContract): ScenarioBuilder {
     this.definition.expectedFailure = expectedFailure;
     return this;
   }
diff --git a/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts b/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
new file mode 100644
index 0000000000..30eb47d1af
--- /dev/null
+++ b/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
@@ -0,0 +1,206 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type {
+  ExpectedFailureContract,
+  ExpectedFailurePhase,
+  PhaseName,
+  PhaseResult,
+  RunPlan,
+} from "../types.ts";
+
+// Pure framework infrastructure: given a compiled RunPlan and the
+// observed phase results, decide whether a negative scenario's
+// declared failure contract was honored. Does not mutate inputs and
+// does not perform I/O.
+//
+// Spec ownership boundaries:
+// - Failure injection (uninstalling docker, planting a bad key,
+//   occupying a gateway port) is runner-environment prep, NOT this
+//   matcher's job. The matcher only inspects what actually happened.
+// - Forbidden-side-effect verification (did a sandbox actually get
+//   created when the scenario forbids it?) belongs to the
+//   `expectedFailureNoSideEffectsProbe` implementation registered as
+//   a probe step. Until that probe lands, the runtime control group
+//   keeps the negative scenario visibly red via a `required: true`
+//   pending step. The matcher reports the contract status for
+//   phase + errorClass independently of the side-effect probe, and
+//   exposes whether forbiddenSideEffects were declared so callers can
+//   integrate both signals.
+
+export type NegativeContractMatchOutcome =
+  // Right phase, right errorClass match observed.
+  | "matched"
+  // Scenario expected a failure but every phase passed.
+  | "no-failure-observed"
+  // Wrong phase failed (e.g., expected onboarding, observed environment).
+  | "wrong-phase"
+  // Right phase, but the failure message did not advertise the
+  // declared errorClass.
+  | "wrong-error-class";
+
+export interface NegativeContractObservation {
+  failedPhase?: PhaseName;
+  failedActionId?: string;
+  failedActionMessage?: string;
+  failedAssertionId?: string;
+  failedAssertionMessage?: string;
+}
+
+export interface NegativeContractResult {
+  matched: boolean;
+  outcome: NegativeContractMatchOutcome;
+  expected: ExpectedFailureContract;
+  observed: NegativeContractObservation;
+  // Human-readable diagnostic; suitable for evidence logs and CI output.
+  message: string;
+}
+
+// Internal id reserved for the runtime side-effect pending/probe step
+// declared in assertions/registry.ts. The matcher excludes failures of
+// that step from "observed failure" detection so the contract evaluation
+// is not confused by its own enforcement scaffolding.
+const SIDE_EFFECT_PROBE_STEP_ID = "runtime.expected-failure.no-side-effects";
+
+// Map the user-facing expected failure phase to the internal phase
+// orchestrator that owns it. Today preflight assertions live under
+// onboarding (see assertions/registry.ts: onboarding.preflight.*).
+function resolveExpectedPhase(phase: ExpectedFailurePhase): PhaseName {
+  if (phase === "preflight") {
+    return "onboarding";
+  }
+  return phase;
+}
+
+function isOwnPhaseResult(phase: PhaseResult["phase"]): phase is PhaseName {
+  return phase === "environment" || phase === "onboarding" || phase === "runtime";
+}
+
+function findFirstObservedFailure(results: readonly PhaseResult[]): NegativeContractObservation | undefined {
+  for (const result of results) {
+    if (!isOwnPhaseResult(result.phase)) {
+      continue;
+    }
+    const failedAction = result.actions.find((action) => action.status === "failed");
+    if (failedAction) {
+      return {
+        failedPhase: result.phase,
+        failedActionId: failedAction.id,
+        failedActionMessage: failedAction.message,
+      };
+    }
+    const failedAssertion = result.assertions.find(
+      (assertion) => assertion.status === "failed" && assertion.id !== SIDE_EFFECT_PROBE_STEP_ID,
+    );
+    if (failedAssertion) {
+      return {
+        failedPhase: result.phase,
+        failedAssertionId: failedAssertion.id,
+        failedAssertionMessage: failedAssertion.message,
+      };
+    }
+  }
+  return undefined;
+}
+
+function errorClassMatches(message: string | undefined, errorClass: string): boolean {
+  if (!message) {
+    return false;
+  }
+  // Substring-with-case-fold match. Negative scenarios assert their
+  // failure mode by class name (e.g., "docker-missing",
+  // "invalid-nvidia-api-key"); we match either the literal class
+  // string or a normalized form where dashes/underscores/spaces are
+  // interchangeable. This stays a pure string check so the matcher
+  // can be fully tested in isolation.
+  const normalize = (value: string): string => value.toLowerCase().replace(/[\s_-]+/g, "-");
+  return normalize(message).includes(normalize(errorClass));
+}
+
+function describeObservation(observation: NegativeContractObservation): string {
+  const parts: string[] = [];
+  if (observation.failedPhase) {
+    parts.push(`phase=${observation.failedPhase}`);
+  }
+  if (observation.failedActionId) {
+    parts.push(`action=${observation.failedActionId}`);
+  }
+  if (observation.failedAssertionId) {
+    parts.push(`assertion=${observation.failedAssertionId}`);
+  }
+  const message = observation.failedActionMessage ?? observation.failedAssertionMessage;
+  if (message) {
+    parts.push(`message="${message.slice(0, 240)}"`);
+  }
+  return parts.length > 0 ? parts.join(" ") : "no failure observed";
+}
+
+export function evaluateNegativeContract(plan: RunPlan, results: readonly PhaseResult[]): NegativeContractResult {
+  const expected = plan.expectedFailure;
+  if (!expected) {
+    throw new Error(
+      `evaluateNegativeContract called for scenario ${plan.scenarioId} which has no expectedFailure declared`,
+    );
+  }
+  const expectedPhase = resolveExpectedPhase(expected.phase);
+  const observation = findFirstObservedFailure(results);
+
+  if (!observation) {
+    return {
+      matched: false,
+      outcome: "no-failure-observed",
+      expected,
+      observed: {},
+      message: `scenario ${plan.scenarioId} expected to fail in ${expected.phase} (errorClass=${expected.errorClass}), but all phases passed`,
+    };
+  }
+
+  if (observation.failedPhase !== expectedPhase) {
+    return {
+      matched: false,
+      outcome: "wrong-phase",
+      expected,
+      observed: observation,
+      message: `scenario ${plan.scenarioId} expected ${expected.phase} failure (errorClass=${expected.errorClass}); observed ${describeObservation(observation)}`,
+    };
+  }
+
+  const observedMessage = observation.failedActionMessage ?? observation.failedAssertionMessage;
+  if (!errorClassMatches(observedMessage, expected.errorClass)) {
+    return {
+      matched: false,
+      outcome: "wrong-error-class",
+      expected,
+      observed: observation,
+      message: `scenario ${plan.scenarioId} ${expected.phase} failure errorClass mismatch: expected="${expected.errorClass}" observed=${describeObservation(observation)}`,
+    };
+  }
+
+  return {
+    matched: true,
+    outcome: "matched",
+    expected,
+    observed: observation,
+    message: `scenario ${plan.scenarioId} negative contract matched: ${expected.phase}/${expected.errorClass} (${describeObservation(observation)})`,
+  };
+}
+
+// Convenience: build a synthetic PhaseResult for the runner to append
+// to the per-phase results. Keeps run.ts and artifact writers honest
+// (one shape, written through the same path as real phase results).
+export function negativeContractPhaseResult(result: NegativeContractResult): PhaseResult {
+  return {
+    phase: "negative-contract",
+    status: result.matched ? "passed" : "failed",
+    actions: [],
+    assertions: [
+      {
+        id: "negative-contract.match",
+        status: result.matched ? "passed" : "failed",
+        attempts: 1,
+        durationMs: 0,
+        message: result.message,
+      },
+    ],
+  };
+}
diff --git a/test/e2e-scenario/scenarios/orchestrators/runner.ts b/test/e2e-scenario/scenarios/orchestrators/runner.ts
index 228d32d452..fe429fc0c3 100644
--- a/test/e2e-scenario/scenarios/orchestrators/runner.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/runner.ts
@@ -1,9 +1,13 @@
 // SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
 
+import fs from "node:fs";
+import path from "node:path";
+
 import type { PhaseActionResult, PhaseResult, RunContext, RunPlan, RunPlanPhase } from "../types.ts";
 import { seedContextEnv } from "./context.ts";
 import { EnvironmentOrchestrator } from "./environment.ts";
+import { evaluateNegativeContract, negativeContractPhaseResult } from "./negative-matcher.ts";
 import { OnboardingOrchestrator } from "./onboarding.ts";
 import { RuntimeOrchestrator } from "./runtime.ts";
 
@@ -62,6 +66,20 @@ export class ScenarioRunner {
       const orchestrator = this.orchestratorFor(phase.name);
       results.push(await orchestrator.run(ctx, phase, results));
     }
+
+    // Negative-scenario contract verification. Single decision point:
+    // if the plan declared expectedFailure, evaluate the matcher and
+    // append a synthetic phase result. Positive scenarios are
+    // unaffected. Side-effect verification stays the responsibility of
+    // the runtime control group's required pending step (kept red
+    // until the probe lands); the matcher only judges phase + errorClass.
+    if (plan.expectedFailure) {
+      const contractResult = evaluateNegativeContract(plan, results);
+      const synthetic = negativeContractPhaseResult(contractResult);
+      results.push(synthetic);
+      writeNegativeContractArtifact(ctx, contractResult, synthetic);
+    }
+
     return results;
   }
 
@@ -74,16 +92,40 @@ export class ScenarioRunner {
 }
 
 interface BlockingFailure {
-  phase: PhaseResult["phase"];
+  phase: "environment" | "onboarding" | "runtime";
   action: PhaseActionResult;
 }
 
+function writeNegativeContractArtifact(
+  ctx: RunContext,
+  contractResult: ReturnType<typeof evaluateNegativeContract>,
+  synthetic: PhaseResult,
+): void {
+  try {
+    const outputDir = path.join(ctx.contextDir, ".e2e");
+    fs.mkdirSync(outputDir, { recursive: true });
+    fs.writeFileSync(
+      path.join(outputDir, "negative-contract.json"),
+      `${JSON.stringify(contractResult, null, 2)}\n`,
+    );
+    fs.writeFileSync(
+      path.join(outputDir, `${synthetic.phase}.result.json`),
+      `${JSON.stringify(synthetic, null, 2)}\n`,
+    );
+  } catch {
+    /* artifact emission is best-effort; matcher result already in memory */
+  }
+}
+
 function blockingPriorResult(results: PhaseResult[]): BlockingFailure | undefined {
   // A phase action failure (real setup work didn't succeed) blocks
   // downstream phases. Assertion failures do NOT block downstream
   // phases - they are expected to be reported alongside other phase
   // results so reviewers can see all failure layers at once.
   for (const result of results) {
+    if (result.phase !== "environment" && result.phase !== "onboarding" && result.phase !== "runtime") {
+      continue;
+    }
     const failedAction = result.actions.find((action) => action.status === "failed");
     if (failedAction) {
       return { phase: result.phase, action: failedAction };
diff --git a/test/e2e-scenario/scenarios/run.ts b/test/e2e-scenario/scenarios/run.ts
index 2a16c85996..31fb0c6083 100644
--- a/test/e2e-scenario/scenarios/run.ts
+++ b/test/e2e-scenario/scenarios/run.ts
@@ -97,7 +97,7 @@ async function main() {
   for (const plan of plans) {
     const results = await runner.run({ contextDir }, plan);
     allResults.push(...results);
-    if (results.some((result) => result.status === "failed")) {
+    if (planFailed(plan, results)) {
       anyFailed = true;
     }
   }
@@ -125,6 +125,36 @@ async function main() {
   }
 }
 
+// A scenario fails iff:
+//   positive (no expectedFailure): any phase result failed.
+//   negative (expectedFailure declared): the synthetic
+//     negative-contract phase did not match, OR the runtime
+//     control group's required side-effect step did not pass.
+//
+// The matcher decides exit code for negatives so that a scenario
+// that failed for the right reason in the right phase is no longer
+// reported as red just because setup did not complete. Until the
+// forbidden-side-effect probe lands, the required pending step in
+// runtimeControlGroups keeps negatives visibly red on the side-effect
+// axis even when phase + errorClass match.
+function planFailed(plan: import("./types.ts").RunPlan, results: PhaseResult[]): boolean {
+  if (!plan.expectedFailure) {
+    return results.some((result) => result.status === "failed");
+  }
+  const contractPhase = results.find((result) => result.phase === "negative-contract");
+  if (!contractPhase || contractPhase.status !== "passed") {
+    return true;
+  }
+  const runtime = results.find((result) => result.phase === "runtime");
+  const sideEffectStep = runtime?.assertions.find(
+    (assertion) => assertion.id === "runtime.expected-failure.no-side-effects",
+  );
+  if (!sideEffectStep || sideEffectStep.status !== "passed") {
+    return true;
+  }
+  return false;
+}
+
 try {
   await main();
 } catch (error) {
diff --git a/test/e2e-scenario/scenarios/scenarios/baseline.ts b/test/e2e-scenario/scenarios/scenarios/baseline.ts
index ef05fb6d6f..1a896bb90c 100644
--- a/test/e2e-scenario/scenarios/scenarios/baseline.ts
+++ b/test/e2e-scenario/scenarios/scenarios/baseline.ts
@@ -11,7 +11,7 @@ import {
   ubuntuRepoNoDocker,
   wslRepoDocker,
 } from "../matrix.ts";
-import type { ScenarioDefinition, ScenarioEnvironment } from "../types.ts";
+import type { ExpectedFailureContract, ScenarioDefinition, ScenarioEnvironment } from "../types.ts";
 
 interface CanonicalScenarioInput {
   id: string;
@@ -24,7 +24,7 @@ interface CanonicalScenarioInput {
   runnerRequirements?: string[];
   requiredSecrets?: string[];
   skippedCapabilities?: Array<Record<string, unknown>>;
-  expectedFailure?: Record<string, unknown>;
+  expectedFailure?: ExpectedFailureContract;
 }
 
 function canonicalScenario(input: CanonicalScenarioInput): ScenarioDefinition {
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index 46201f55a2..85a406c6d8 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -3,6 +3,25 @@
 
 export type PhaseName = "environment" | "onboarding" | "runtime";
 
+// Synthetic phase appended by the scenario runner when a scenario
+// declares plan.expectedFailure. Distinct from PhaseName so a scenario
+// builder cannot accidentally declare an assertion or action against
+// it. Only the runner emits PhaseResult entries with this name.
+export type NegativeContractPhase = "negative-contract";
+
+export type PhaseResultName = PhaseName | NegativeContractPhase;
+
+// User-facing phase the negative-scenario contract advertises. Wider
+// than PhaseName because manifests may declare "preflight" failures,
+// which the matcher resolves to the onboarding phase orchestrator.
+export type ExpectedFailurePhase = PhaseName | "preflight";
+
+export interface ExpectedFailureContract {
+  phase: ExpectedFailurePhase;
+  errorClass: string;
+  forbiddenSideEffects?: readonly string[];
+}
+
 export type TransientClassifier =
   | "empty-event-capture"
   | "provider-transient"
@@ -112,7 +131,7 @@ export interface ScenarioDefinition {
   runnerRequirements?: string[];
   requiredSecrets?: string[];
   skippedCapabilities?: Array<Record<string, unknown>>;
-  expectedFailure?: Record<string, unknown>;
+  expectedFailure?: ExpectedFailureContract;
 }
 
 // A phase action is real, deterministic setup work the phase orchestrator
@@ -179,7 +198,7 @@ export interface RunPlan {
   runnerRequirements: string[];
   requiredSecrets: string[];
   skippedCapabilities: Array<Record<string, unknown>>;
-  expectedFailure?: Record<string, unknown>;
+  expectedFailure?: ExpectedFailureContract;
   sutBoundaries: SutBoundary[];
 }
 
@@ -206,7 +225,7 @@ export interface PhaseActionResult {
 }
 
 export interface PhaseResult {
-  phase: PhaseName;
+  phase: PhaseResultName;
   status: "passed" | "failed" | "skipped";
   // Action results are recorded distinctly from assertion results so
   // failure-layer attribution stays unambiguous: a failure in actions

From 2008d0d463c49c66003c14b2f720e26c83c8c3f9 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 08:50:00 -0400
Subject: [PATCH 20/23] test(e2e): switch policy_presets_preserved to host-side
 openshell policy get

The lifecycle.upgrade.policy-config-preserved assertion on PR #4380's
scenario run failed with no diagnostic. Root cause: the assertion
shelled out to 'nemoclaw policy status', a CLI subcommand that does
not exist (src/commands/sandbox/policy/ only ships add/list/remove).
The wrapper swallowed the missing-command stderr via '2>/dev/null
|| true', so the failure surfaced as a bare 'FAIL: ...' with no
explanation.

Mirror the legacy pattern used in test/e2e/test-rebuild-openclaw.sh
and test-full-e2e.sh:

  output=$(openshell policy get --full "${SANDBOX_NAME}" 2>&1 || true)
  echo "${output}" | grep -qi "npm\|registry.npmjs.org"

In the new framework's helper:

- Call 'openshell policy get --full <sandbox>' (host-side, fast, no
  ssh-into-sandbox required) instead of the nonexistent
  'nemoclaw policy status'. The override variable is renamed to
  REBUILD_UPGRADE_OPENSHELL_CMD so test fakes target the right command.
- Match each declared preset against either its bare name OR a
  well-known endpoint hostname. The live gateway policy dump renders
  network rules by hostname, not by preset label, so 'registry.npmjs.org'
  alone is sufficient evidence that the 'npm' preset is active.
- Cover npm, pypi, huggingface, brew, and openclaw-pricing, the presets
  applied by every onboarding flow today (see actions log of the run).
- On miss, emit a diagnostic that names the missing preset, the
  matchers we tried, and the head of the policy output, so future
  failures are immediately actionable instead of silent.
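
The matching rule above, restated as a minimal TypeScript sketch for
readers who prefer it to the bash. This is illustrative only: the
shipped helper is the shell function in rebuild_upgrade.sh, and the
names PRESET_MATCHERS and checkPresets are hypothetical.

```typescript
// Hypothetical restatement of the bash matcher table; not the shipped code.
const PRESET_MATCHERS = new Map<string, string[]>([
  ["npm", ["npm", "registry.npmjs.org"]],
  ["pypi", ["pypi", "pypi.org", "files.pythonhosted.org"]],
  ["huggingface", ["huggingface", "huggingface.co"]],
  ["brew", ["brew", "formulae.brew.sh"]],
  ["openclaw-pricing", ["openclaw-pricing", "openrouter.ai"]],
]);

interface PresetCheck {
  ok: boolean;
  message: string;
}

function checkPresets(policyOutput: string, presets: string[]): PresetCheck {
  for (const preset of presets) {
    // Unknown presets fall back to matching their bare name only.
    const matchers = PRESET_MATCHERS.get(preset) ?? [preset];
    // Plain substring match, same as the bash `== *"${m}"*` test.
    const found = matchers.some((m) => policyOutput.includes(m));
    if (!found) {
      const head = policyOutput.slice(0, 300);
      return {
        ok: false,
        message: `preset '${preset}' not in policy (matchers: ${matchers.join(" ")}); head: ${head}`,
      };
    }
  }
  return { ok: true, message: `presets=${presets.join(" ")}` };
}
```

The first miss short-circuits, so the diagnostic always names exactly
one preset along with the matchers tried for it.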

New tests:
- policy_preset_check_should_match_endpoint_url_when_preset_name_absent:
  realistic policy dump containing only endpoint URLs passes.
- policy_preset_check_should_fail_with_diagnostic_when_preset_missing:
  missing preset surfaces a diagnostic naming the preset and matchers.

358 tests pass (was 322; fan-out includes the new policy-preset cases).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../framework-tests/e2e-lib-helpers.test.ts   | 76 +++++++++++++++++++
 .../validation_suites/lib/rebuild_upgrade.sh  | 42 ++++++++--
 2 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
index 82862f5622..26d4be9d49 100644
--- a/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
@@ -522,6 +522,82 @@ describe("rebuild/upgrade validation helpers", () => {
       fs.rmSync(tmp, { recursive: true, force: true });
     }
   });
+
+  it("policy_preset_check_should_match_endpoint_url_when_preset_name_absent", () => {
+    // The earlier assertion called `nemoclaw policy status` (a command
+    // that does not exist) and silently failed. The new assertion calls
+    // `openshell policy get --full <sandbox>` and matches preset names
+    // OR their well-known endpoint hostnames. Verify the hostname path: a
+    // policy output containing only endpoint hostnames (no preset label)
+    // still passes, mirroring the behavior of the live gateway policy
+    // dump in test/e2e/test-rebuild-openclaw.sh.
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-ru-policy-"));
+    try {
+      fs.writeFileSync(
+        path.join(tmp, "context.env"),
+        "E2E_SCENARIO=test\nE2E_AGENT=openclaw\nE2E_SANDBOX_NAME=sb\nE2E_GATEWAY_URL=http://127.0.0.1\n",
+      );
+      const r = runBash(
+        `
+        set -euo pipefail
+        fake_openshell() {
+          # Emit a minimal policy dump that contains the preset endpoint
+          # hostnames but no standalone preset labels. This is the realistic
+          # case: 'openshell policy get --full' renders network rules
+          # by hostname, not by preset label.
+          printf 'allow registry.npmjs.org\\nallow pypi.org\\n'
+        }
+        . "${REBUILD_UPGRADE_LIB}"
+        rebuild_upgrade_assert_policy_presets_preserved
+      `,
+        {
+          E2E_CONTEXT_DIR: tmp,
+          REBUILD_UPGRADE_OPENSHELL_CMD: "fake_openshell",
+          E2E_EXPECTED_POLICY_PRESETS: "npm pypi",
+        },
+      );
+      expect(r.status, r.stderr).toBe(0);
+      expect(r.stdout).toContain("suite.rebuild.policy_presets_preserved");
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("policy_preset_check_should_fail_with_diagnostic_when_preset_missing", () => {
+    // Negative case: when a declared preset is absent from the live
+    // policy dump, the assertion must fail AND emit a diagnostic line
+    // identifying the missing preset and showing the policy head. The
+    // original implementation failed silently because the underlying
+    // `nemoclaw policy status` command did not exist; the new
+    // implementation must produce actionable evidence.
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-ru-policy-miss-"));
+    try {
+      fs.writeFileSync(
+        path.join(tmp, "context.env"),
+        "E2E_SCENARIO=test\nE2E_AGENT=openclaw\nE2E_SANDBOX_NAME=sb\nE2E_GATEWAY_URL=http://127.0.0.1\n",
+      );
+      const r = runBash(
+        `
+        fake_openshell() {
+          # Policy dump missing 'pypi' entirely.
+          printf 'allow registry.npmjs.org\\n'
+        }
+        . "${REBUILD_UPGRADE_LIB}"
+        rebuild_upgrade_assert_policy_presets_preserved
+      `,
+        {
+          E2E_CONTEXT_DIR: tmp,
+          REBUILD_UPGRADE_OPENSHELL_CMD: "fake_openshell",
+          E2E_EXPECTED_POLICY_PRESETS: "npm pypi",
+        },
+      );
+      expect(r.status).not.toBe(0);
+      expect(r.stdout + r.stderr).toMatch(/preset 'pypi' not in policy/);
+      expect(r.stdout + r.stderr).toMatch(/matchers: pypi/);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
 });
 
 describe("Phase 1.A logging helpers", () => {
diff --git a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
index 4870a68c64..85871affac 100755
--- a/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
+++ b/test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
@@ -113,16 +113,48 @@ rebuild_upgrade_assert_inference_works() {
 
 rebuild_upgrade_assert_policy_presets_preserved() {
   rebuild_upgrade_require_context || return 1
-  local presets output preset
+  local id="suite.rebuild.policy_presets_preserved"
+  local sandbox presets output preset
+  sandbox="$(_rebuild_upgrade_ctx E2E_SANDBOX_NAME)"
   presets="${E2E_EXPECTED_POLICY_PRESETS:-npm pypi}"
-  output="$(_rebuild_upgrade_run REBUILD_UPGRADE_NEMOCLAW_CMD nemoclaw policy status 2>/dev/null || true)"
+
+  # Mirror the legacy test/e2e/test-rebuild-openclaw.sh and
+  # test-full-e2e.sh pattern: ask the live gateway for the full policy
+  # via `openshell policy get --full <sandbox>` and grep for the preset
+  # name OR a well-known endpoint hostname for that preset. The earlier
+  # implementation called `nemoclaw policy status`, which does not
+  # exist as a CLI subcommand — the assertion always failed silently
+  # because the wrapper swallowed the missing-command stderr via
+  # `2>/dev/null || true`.
+  output="$(_rebuild_upgrade_run REBUILD_UPGRADE_OPENSHELL_CMD openshell policy get --full "${sandbox}" 2>&1 || true)"
+  if [[ -z "${output}" ]]; then
+    e2e_fail "${id} openshell policy get --full returned no output for sandbox '${sandbox}'"
+    return 1
+  fi
+
+  local matchers found m
   for preset in ${presets}; do
-    if [[ "${output}" != *"${preset}"* ]]; then
-      e2e_fail "suite.rebuild.policy_presets_preserved"
+    case "${preset}" in
+      npm)              matchers=("npm" "registry.npmjs.org") ;;
+      pypi)             matchers=("pypi" "pypi.org" "files.pythonhosted.org") ;;
+      huggingface)      matchers=("huggingface" "huggingface.co") ;;
+      brew)             matchers=("brew" "formulae.brew.sh") ;;
+      openclaw-pricing) matchers=("openclaw-pricing" "openrouter.ai") ;;
+      *)                matchers=("${preset}") ;;
+    esac
+    found=0
+    for m in "${matchers[@]}"; do
+      if [[ "${output}" == *"${m}"* ]]; then
+        found=1
+        break
+      fi
+    done
+    if [[ "${found}" -eq 0 ]]; then
+      e2e_fail "${id} preset '${preset}' not in policy (matchers: ${matchers[*]}); head: ${output:0:300}"
       return 1
     fi
   done
-  e2e_pass "suite.rebuild.policy_presets_preserved"
+  e2e_pass "${id} presets=${presets}"
 }
 
 rebuild_upgrade_assert_hermes_config_preserved() {

From 6ecd198546759488f5d524d65fae9d11a2d8bb75 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 09:06:43 -0400
Subject: [PATCH 21/23] test(e2e): land typed probe registry with diagnostics +
 docs validation

PR #4380's scenario run reports five probe-kind assertion steps as
'skipped: probe not registered': diagnostics.bundle, docs.validation,
and three security probes (shields.config, policy.enforced,
injection.blocked). The framework already had the contract hook in
phase.ts (kind === 'probe') but no registry to look the ref up in.

Land the registry infrastructure with two real probes; the security
probes follow in a separate change because they need careful
correctness review.

Probe registry (scenarios/probes/):

- types.ts defines ProbeContext (typed scenario state handed to the
  probe) and ProbeOutcome (mirrors the orchestrator's
  StepAttemptOutcome).
- registry.ts exposes registerProbe / lookupProbe /
  listRegisteredProbes / resetProbeRegistry. Re-registering the same
  name throws so probe shadowing surfaces loudly.
- diagnostics.ts: 'diagnosticsProbe' mirrors test/e2e/test-diagnostics.sh
  TC-DIAG-02 (nemoclaw debug --quick produces a non-empty archive
  within 30s). Writes structured evidence JSON to
  .e2e/assertions/diagnostics.bundle.json.
- docs-validation.ts: 'docsValidationProbe' mirrors
  test/e2e/test-docs-validation.sh by shelling to
  test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli and
  --only-links --local-only. Remote http(s) link probes are skipped
  (the legacy script flags them as flaky under CI rate limiting).
- builtin.ts registers the two production probes; intentionally skips
  shieldsConfigProbe / networkPolicyProbe / injectionBlockedProbe so
  they fail closed (their typed registry entries are required: true).
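
A minimal sketch of the registry contract described above, for
orientation. It is an approximation written to satisfy the framework
tests in this patch (duplicate and empty names throw, listing is
sorted); the real registry.ts and types.ts are in the diff below and
may differ. The simplified ProbeOutcome and the error strings here are
assumptions.

```typescript
// Simplified stand-ins for the types in scenarios/probes/types.ts.
interface ProbeOutcome {
  status: "passed" | "failed" | "skipped";
  message?: string;
}
type ProbeFn = (ctx: unknown) => Promise<ProbeOutcome>;

// Module-scoped state: every importer sees the same registry.
const probes = new Map<string, ProbeFn>();

function registerProbe(name: string, fn: ProbeFn): void {
  if (!name) {
    throw new Error("probe name is required");
  }
  // Re-registration throws so probe shadowing surfaces loudly.
  if (probes.has(name)) {
    throw new Error(`probe already registered: ${name}`);
  }
  probes.set(name, fn);
}

function lookupProbe(name: string): ProbeFn | undefined {
  return probes.get(name);
}

function listRegisteredProbes(): string[] {
  return [...probes.keys()].sort();
}

function resetProbeRegistry(): void {
  probes.clear();
}
```

Because registerProbe throws on duplicates, an idempotent bulk
registration (as in builtin.ts) has to check lookupProbe first.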

Orchestrator integration (scenarios/orchestrators/phase.ts):

- phase.ts imports and calls registerBuiltinProbes() at module load, so
  every entry point that uses the orchestrator (run.ts,
  ScenarioRunner, framework tests) sees the same wired set without
  per-entry-point boilerplate.
- executeStep's kind==='probe' branch now looks the ref up in the
  registry. Registered: invoke and translate the structured outcome.
  Unregistered: skip, OR fail closed when step.required is true.
- buildProbeContext parses context.env once per probe invocation
  and exposes typed sandboxName / gatewayUrl / contextEnv fields
  so probe code doesn't reach into the file system itself.
- Probes that throw are caught and converted to a redacted failed
  outcome so the orchestrator never loses observability on a probe
  bug.
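
Two of the rules above, sketched as standalone functions: the
context.env parse and the outcome for an unregistered probe. The
function names parseContextEnv and outcomeForMissingProbe are
hypothetical; in the diff this logic is inline in buildProbeContext
and executeStep. Redaction and the thrown-probe path are omitted.

```typescript
interface MissingProbeOutcome {
  status: "failed" | "skipped";
  classifier?: string;
  message: string;
}

// Parse KEY=VALUE lines, skipping blanks, comments, and lines with no key.
function parseContextEnv(raw: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const line of raw.split("\n")) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq <= 0) continue;
    let value = trimmed.slice(eq + 1);
    const quoted =
      (value.startsWith('"') && value.endsWith('"')) ||
      (value.startsWith("'") && value.endsWith("'"));
    // Strip one layer of matching surrounding quotes.
    if (quoted) value = value.slice(1, -1);
    env[trimmed.slice(0, eq)] = value;
  }
  return env;
}

// Unregistered probe: skip visibly, unless the step is required,
// in which case fail closed.
function outcomeForMissingProbe(
  ref: string,
  stepId: string,
  required: boolean,
): MissingProbeOutcome {
  if (required) {
    return {
      status: "failed",
      classifier: "runner-infra",
      message: `required probe not registered: ${ref} (step ${stepId})`,
    };
  }
  return { status: "skipped", message: `probe not registered: ${ref}` };
}
```

Keeping the required check on the missing-probe path means a typo in a
security probe ref fails the phase instead of skipping silently.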

Tests (framework-tests/e2e-probes.test.ts, 14 cases):

- Registry: round-trip register/lookup, duplicate rejection,
  empty-name rejection, sorted listing, registerBuiltinProbes
  idempotency, security probes intentionally NOT registered.
- diagnosticsProbe: passes on non-empty archive, fails on non-zero
  exit, fails on empty archive. Uses a fake nemoclaw on PATH so the
  test runs reproducibly without a real install.
- docsValidationProbe: passes when both check-docs.sh phases exit 0,
  fails on cli-parity non-zero, fails on links non-zero, fails with
  actionable message when check-docs.sh is missing. Uses a fake
  check-docs.sh under a controlled repoRoot.

Existing test_non_required_probe_step_continues_to_skip_visibly
updated to reference an intentionally-unregistered probe ref instead
of 'diagnosticsProbe' (which is now a real built-in that would
actually invoke nemoclaw).

386 tests pass (was 358; +28 from the new probe suite x project fan-out).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../e2e-phase-orchestrators.test.ts           |   8 +-
 .../framework-tests/e2e-probes.test.ts        | 305 ++++++++++++++++++
 .../scenarios/orchestrators/phase.ts          |  91 +++++-
 test/e2e-scenario/scenarios/probes/builtin.ts |  41 +++
 .../scenarios/probes/diagnostics.ts           | 156 +++++++++
 .../scenarios/probes/docs-validation.ts       | 160 +++++++++
 .../e2e-scenario/scenarios/probes/registry.ts |  54 ++++
 test/e2e-scenario/scenarios/probes/types.ts   |  61 ++++
 8 files changed, 865 insertions(+), 11 deletions(-)
 create mode 100644 test/e2e-scenario/framework-tests/e2e-probes.test.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/builtin.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/diagnostics.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/docs-validation.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/registry.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/types.ts

diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index c0f08fd23a..82ce513669 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -612,7 +612,13 @@ describe("required probe and pending steps fail closed", () => {
       const step: AssertionStep = {
         id: "runtime.diagnostics.non-required-probe",
         phase: "runtime",
-        implementation: { kind: "probe", ref: "diagnosticsProbe" },
+        // Use an intentionally-unregistered ref so this test exercises
+        // the "missing probe" code path. `diagnosticsProbe` is now a
+        // real built-in registered at orchestrator import time, so
+        // referring to it here would actually invoke nemoclaw and the
+        // assertion would fail (or pass) on real CLI behavior —
+        // unrelated to what this test verifies.
+        implementation: { kind: "probe", ref: "unregisteredFakeProbe" },
         evidencePath: ".e2e/assertions/runtime.diagnostics.non-required-probe.json",
         // required intentionally omitted (defaults to false)
       };
diff --git a/test/e2e-scenario/framework-tests/e2e-probes.test.ts b/test/e2e-scenario/framework-tests/e2e-probes.test.ts
new file mode 100644
index 0000000000..6063eedc1d
--- /dev/null
+++ b/test/e2e-scenario/framework-tests/e2e-probes.test.ts
@@ -0,0 +1,305 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { describe, it, expect, beforeEach, afterEach } from "vitest";
+import fs from "node:fs";
+import os from "node:os";
+import path from "node:path";
+import { fileURLToPath } from "node:url";
+import {
+  listRegisteredProbes,
+  lookupProbe,
+  registerProbe,
+  resetProbeRegistry,
+} from "../scenarios/probes/registry.ts";
+import type { ProbeContext, ProbeOutcome } from "../scenarios/probes/types.ts";
+import { registerBuiltinProbes } from "../scenarios/probes/builtin.ts";
+
+const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../..");
+
+describe("probe registry", () => {
+  // The orchestrator side-effect-imports builtin.ts at module load,
+  // so the registry already contains the built-ins. Each test resets
+  // and re-registers explicitly so order independence holds.
+  beforeEach(() => {
+    resetProbeRegistry();
+  });
+
+  afterEach(() => {
+    // Restore the production wiring so subsequent test files don't
+    // see an empty registry (vitest shares module state across files
+    // within a worker).
+    resetProbeRegistry();
+    registerBuiltinProbes();
+  });
+
+  it("registerProbe_lookupProbe_round_trip", () => {
+    const fn = async (): Promise<ProbeOutcome> => ({ status: "passed" });
+    registerProbe("myProbe", fn);
+    expect(lookupProbe("myProbe")).toBe(fn);
+  });
+
+  it("lookupProbe_returns_undefined_for_unknown_ref", () => {
+    expect(lookupProbe("nonexistent")).toBeUndefined();
+  });
+
+  it("registerProbe_rejects_duplicate_registration", () => {
+    const fn = async (): Promise<ProbeOutcome> => ({ status: "passed" });
+    registerProbe("dup", fn);
+    expect(() => registerProbe("dup", fn)).toThrow(/already registered/);
+  });
+
+  it("registerProbe_rejects_empty_name", () => {
+    const fn = async (): Promise<ProbeOutcome> => ({ status: "passed" });
+    expect(() => registerProbe("", fn)).toThrow(/name is required/);
+  });
+
+  it("listRegisteredProbes_returns_sorted_names", () => {
+    registerProbe("zeta", async () => ({ status: "passed" }));
+    registerProbe("alpha", async () => ({ status: "passed" }));
+    registerProbe("mu", async () => ({ status: "passed" }));
+    expect(listRegisteredProbes()).toEqual(["alpha", "mu", "zeta"]);
+  });
+
+  it("registerBuiltinProbes_is_idempotent", () => {
+    registerBuiltinProbes();
+    const first = listRegisteredProbes();
+    expect(first).toContain("diagnosticsProbe");
+    expect(first).toContain("docsValidationProbe");
+    // Calling again must not throw on duplicate names.
+    expect(() => registerBuiltinProbes()).not.toThrow();
+    expect(listRegisteredProbes()).toEqual(first);
+  });
+
+  it("registerBuiltinProbes_does_NOT_register_security_probes_yet", () => {
+    // The shieldsConfig / networkPolicy / injectionBlocked probes
+    // are intentionally not registered yet; their `required: true`
+    // status in scenarios/assertions/registry.ts means the
+    // orchestrator fails closed when they're missing, which is the
+    // contract we want until real implementations land.
+    registerBuiltinProbes();
+    const registered = listRegisteredProbes();
+    expect(registered).not.toContain("shieldsConfigProbe");
+    expect(registered).not.toContain("networkPolicyProbe");
+    expect(registered).not.toContain("injectionBlockedProbe");
+  });
+});
+
+// ─────────────────────────────────────────────────────────────────────────────
+// diagnosticsProbe — uses a fake `nemoclaw` on PATH so this test runs
+// reproducibly without depending on a real nemoclaw install.
+// ─────────────────────────────────────────────────────────────────────────────
+
+function makeProbeCtx(tmp: string, evidenceFile = "diag-evidence.json"): ProbeContext {
+  // contextDir doubles as the parent of the evidence file when the
+  // step does not specify an explicit path. Tests pass an explicit
+  // path here to keep the file under tmp.
+  return {
+    contextDir: tmp,
+    evidencePath: path.join(tmp, evidenceFile),
+    contextEnv: {},
+    sandboxName: null,
+    gatewayUrl: null,
+    repoRoot: REPO_ROOT,
+  };
+}
+
+function installFakeOnPath(
+  binDir: string,
+  name: string,
+  script: string,
+): { restore: () => void } {
+  fs.mkdirSync(binDir, { recursive: true });
+  fs.writeFileSync(path.join(binDir, name), script, { mode: 0o755 });
+  const oldPath = process.env.PATH;
+  process.env.PATH = `${binDir}:${oldPath ?? ""}`;
+  return {
+    restore: () => {
+      process.env.PATH = oldPath;
+    },
+  };
+}
+
+describe("diagnosticsProbe", () => {
+  it("passes_when_nemoclaw_debug_quick_writes_a_non_empty_archive", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "diag-probe-pass-"));
+    const fake = installFakeOnPath(
+      path.join(tmp, "bin"),
+      "nemoclaw",
+      `#!/usr/bin/env bash
+# Stub: locate the --output value and write a small non-empty archive there.
+out=""
+while [[ "$#" -gt 0 ]]; do
+  case "$1" in
+    --output) out="$2"; shift 2 ;;
+    *) shift ;;
+  esac
+done
+[[ -n "$out" ]] || { echo "no --output" >&2; exit 2; }
+printf 'fake-archive-bytes' > "$out"
+exit 0
+`,
+    );
+    try {
+      const { diagnosticsProbe } = await import("../scenarios/probes/diagnostics.ts");
+      const outcome = await diagnosticsProbe(makeProbeCtx(tmp));
+      expect(outcome.status).toBe("passed");
+      expect(outcome.message).toMatch(/bundle ok/);
+      // Evidence JSON must exist and parse.
+      const ev = JSON.parse(fs.readFileSync(path.join(tmp, "diag-evidence.json"), "utf8"));
+      expect(ev.exitCode).toBe(0);
+      expect(ev.archiveSize).toBeGreaterThan(0);
+    } finally {
+      fake.restore();
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_nemoclaw_exits_nonzero", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "diag-probe-fail-"));
+    const fake = installFakeOnPath(
+      path.join(tmp, "bin"),
+      "nemoclaw",
+      `#!/usr/bin/env bash\necho "boom" >&2\nexit 7\n`,
+    );
+    try {
+      const { diagnosticsProbe } = await import("../scenarios/probes/diagnostics.ts");
+      const outcome = await diagnosticsProbe(makeProbeCtx(tmp));
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/exited 7/);
+      const ev = JSON.parse(fs.readFileSync(path.join(tmp, "diag-evidence.json"), "utf8"));
+      expect(ev.exitCode).toBe(7);
+      expect(ev.stderrTail).toContain("boom");
+    } finally {
+      fake.restore();
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_archive_is_empty", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "diag-probe-empty-"));
+    const fake = installFakeOnPath(
+      path.join(tmp, "bin"),
+      "nemoclaw",
+      `#!/usr/bin/env bash
+out=""
+while [[ "$#" -gt 0 ]]; do
+  case "$1" in --output) out="$2"; shift 2 ;; *) shift ;; esac
+done
+: > "$out"  # zero-byte archive
+exit 0
+`,
+    );
+    try {
+      const { diagnosticsProbe } = await import("../scenarios/probes/diagnostics.ts");
+      const outcome = await diagnosticsProbe(makeProbeCtx(tmp));
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/empty/);
+    } finally {
+      fake.restore();
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+});
+
+// ─────────────────────────────────────────────────────────────────────────────
+// docsValidationProbe — substitutes a fake check-docs.sh by overriding
+// the repoRoot in the ProbeContext so the resolved path points at a
+// scratch dir we control.
+// ─────────────────────────────────────────────────────────────────────────────
+
+describe("docsValidationProbe", () => {
+  function setupFakeCheckDocs(
+    tmp: string,
+    cliExit: number,
+    linksExit: number,
+  ): { ctx: ProbeContext } {
+    const scriptDir = path.join(tmp, "test/e2e/e2e-cloud-experimental");
+    fs.mkdirSync(scriptDir, { recursive: true });
+    fs.writeFileSync(
+      path.join(scriptDir, "check-docs.sh"),
+      `#!/usr/bin/env bash
+case "$1" in
+  --only-cli)            exit ${cliExit} ;;
+  --only-links)          exit ${linksExit} ;;
+  *)                     echo "unknown: $*" >&2; exit 99 ;;
+esac
+`,
+      { mode: 0o755 },
+    );
+    return {
+      ctx: {
+        contextDir: tmp,
+        evidencePath: path.join(tmp, "docs-evidence.json"),
+        contextEnv: {},
+        sandboxName: null,
+        gatewayUrl: null,
+        repoRoot: tmp, // probe resolves check-docs.sh against this
+      },
+    };
+  }
+
+  it("passes_when_both_cli_and_links_checks_exit_zero", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "docs-probe-pass-"));
+    try {
+      const { ctx } = setupFakeCheckDocs(tmp, 0, 0);
+      const { docsValidationProbe } = await import("../scenarios/probes/docs-validation.ts");
+      const outcome = await docsValidationProbe(ctx);
+      expect(outcome.status).toBe("passed");
+      const ev = JSON.parse(fs.readFileSync(ctx.evidencePath, "utf8"));
+      expect(ev.results).toHaveLength(2);
+      expect(ev.results[0].phase).toBe("cli-parity");
+      expect(ev.results[0].exitCode).toBe(0);
+      expect(ev.results[1].phase).toBe("links-local");
+      expect(ev.results[1].exitCode).toBe(0);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_cli_parity_check_exits_nonzero", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "docs-probe-cli-fail-"));
+    try {
+      const { ctx } = setupFakeCheckDocs(tmp, 3, 0);
+      const { docsValidationProbe } = await import("../scenarios/probes/docs-validation.ts");
+      const outcome = await docsValidationProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/CLI\/docs parity failed.*exit 3/);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_links_check_exits_nonzero", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "docs-probe-links-fail-"));
+    try {
+      const { ctx } = setupFakeCheckDocs(tmp, 0, 5);
+      const { docsValidationProbe } = await import("../scenarios/probes/docs-validation.ts");
+      const outcome = await docsValidationProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/markdown link check failed.*exit 5/);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_with_actionable_message_when_check_docs_script_missing", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "docs-probe-missing-"));
+    try {
+      const { docsValidationProbe } = await import("../scenarios/probes/docs-validation.ts");
+      const ctx: ProbeContext = {
+        contextDir: tmp,
+        evidencePath: path.join(tmp, "docs-evidence.json"),
+        contextEnv: {},
+        sandboxName: null,
+        gatewayUrl: null,
+        repoRoot: tmp, // no test/e2e/... tree under tmp
+      };
+      const outcome = await docsValidationProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/check-docs\.sh not found/);
+    } finally {
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+});
diff --git a/test/e2e-scenario/scenarios/orchestrators/phase.ts b/test/e2e-scenario/scenarios/orchestrators/phase.ts
index de952b23fc..ccde0ba73d 100644
--- a/test/e2e-scenario/scenarios/orchestrators/phase.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/phase.ts
@@ -16,8 +16,18 @@ import type {
   RunPlanPhase,
   TransientClassifier,
 } from "../types.ts";
+import { lookupProbe } from "../probes/registry.ts";
+import type { ProbeContext } from "../probes/types.ts";
 import { buildChildEnv, pipeRedacted, redactString } from "./redaction.ts";
 
+// Auto-register the built-in probes the moment the orchestrator is
+// imported. This is a deliberate side-effect import: registry state is
+// module-scoped and we want every entry point that runs assertions
+// (run.ts, ScenarioRunner, framework tests) to see the same wired set
+// without each one repeating the registration.
+import { registerBuiltinProbes } from "../probes/builtin.ts";
+registerBuiltinProbes();
+
 const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../../..");
 const DEFAULT_STEP_TIMEOUT_SECONDS = 300;
 
@@ -47,6 +57,44 @@ function classifierForRef(ref: string): TransientClassifier {
   return "runner-infra";
 }
 
+/**
+ * Build the typed ProbeContext handed to a probe runner. Mirrors the
+ * subset of state that shell steps already get via
+ * ${E2E_CONTEXT_DIR}/context.env, but parsed up front so probe code
+ * doesn't reach into the file system itself.
+ */
+function buildProbeContext(ctx: RunContext, step: AssertionStep): ProbeContext {
+  const contextEnvPath = path.join(ctx.contextDir, "context.env");
+  const contextEnv: Record<string, string> = {};
+  if (fs.existsSync(contextEnvPath)) {
+    const raw = fs.readFileSync(contextEnvPath, "utf8");
+    for (const line of raw.split("\n")) {
+      const trimmed = line.trim();
+      if (!trimmed || trimmed.startsWith("#")) continue;
+      const eq = trimmed.indexOf("=");
+      if (eq <= 0) continue;
+      const key = trimmed.slice(0, eq);
+      let value = trimmed.slice(eq + 1);
+      if ((value.startsWith('"') && value.endsWith('"')) || (value.startsWith("'") && value.endsWith("'"))) {
+        value = value.slice(1, -1);
+      }
+      contextEnv[key] = value;
+    }
+  }
+  const evidenceRel = step.evidencePath ?? `.e2e/assertions/${step.id}.json`;
+  const evidencePath = path.isAbsolute(evidenceRel)
+    ? evidenceRel
+    : path.join(ctx.contextDir, evidenceRel);
+  return {
+    contextDir: ctx.contextDir,
+    evidencePath,
+    contextEnv,
+    sandboxName: contextEnv.E2E_SANDBOX_NAME ?? null,
+    gatewayUrl: contextEnv.E2E_GATEWAY_URL ?? null,
+    repoRoot: REPO_ROOT,
+  };
+}
+
 export class PhaseOrchestrator {
   constructor(private readonly phaseName: PhaseName) {}
 
@@ -289,21 +337,44 @@ export class PhaseOrchestrator {
       return this.runShellStep(ctx, step);
     }
     if (kind === "probe") {
-      // Probe registry lands in a follow-up PR. Until then, probes
-      // surface as visibly skipped — never as fake green. For
-      // security-sensitive or otherwise required probes, the run
-      // must NOT pass on this gap; the typed registry marks those
-      // with `required: true` and we reclassify the skip as a
-      // failure so the phase result fails closed.
       const ref = step.implementation?.ref ?? "<no ref>";
-      if (step.required) {
+      const probe = lookupProbe(ref);
+      if (!probe) {
+        // Probe is referenced by the typed registry but no
+        // implementation has been registered yet. Surface as
+        // skipped — unless the step is marked required, in which
+        // case fail closed so security-sensitive suites never
+        // pass on a missing probe.
+        if (step.required) {
+          return {
+            status: "failed",
+            classifier: "runner-infra",
+            message: `required probe not registered: ${ref} (step ${step.id})`,
+          };
+        }
+        return { status: "skipped", message: `probe not registered: ${ref}` };
+      }
+      const probeCtx = buildProbeContext(ctx, step);
+      try {
+        const outcome = await probe(probeCtx);
+        return {
+          status: outcome.status,
+          classifier: outcome.classifier,
+          message: outcome.message,
+          evidence: outcome.evidence ?? probeCtx.evidencePath,
+        };
+      } catch (err) {
+        // Probes must not throw — but a thrown error must NEVER
+        // cause an unobservable failure. Convert to a failed
+        // outcome with a redacted message so the orchestrator's
+        // result aggregation still records evidence.
+        const message = err instanceof Error ? err.message : String(err);
         return {
           status: "failed",
-          classifier: "runner-infra",
-          message: `required probe not registered: ${ref} (step ${step.id})`,
+          message: redactString(`probe ${ref} threw: ${message}`),
+          evidence: probeCtx.evidencePath,
         };
       }
-      return { status: "skipped", message: `probe not registered: ${ref}` };
     }
     if (kind === "pending") {
       // pending steps surface as skipped with the placeholder ref so
diff --git a/test/e2e-scenario/scenarios/probes/builtin.ts b/test/e2e-scenario/scenarios/probes/builtin.ts
new file mode 100644
index 0000000000..afab89aaf8
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/builtin.ts
@@ -0,0 +1,41 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { diagnosticsProbe } from "./diagnostics.ts";
+import { docsValidationProbe } from "./docs-validation.ts";
+import { lookupProbe, registerProbe } from "./registry.ts";
+
+/**
+ * Register all built-in probes. Idempotent: re-importing this module
+ * (e.g. through a different entry point) is a no-op once the probes
+ * are already in place.
+ *
+ * Ownership boundary:
+ *   - Built-in probes here implement the cross-scenario contract that
+ *     the typed registry already references by name (see
+ *     scenarios/assertions/registry.ts).
+ *   - Scenario-specific probes (if any) belong in a per-scenario
+ *     module that calls `registerProbe()` directly.
+ *
+ * Probes intentionally NOT yet registered (probe-registry follow-up):
+ *   - shieldsConfigProbe       (security; required: true)
+ *   - networkPolicyProbe       (security; required: true)
+ *   - injectionBlockedProbe    (security; required: true)
+ *
+ * Until those land, the orchestrator surfaces them as failed (not
+ * skipped) because the typed registry marks them required: true.
+ * That is intentional — security-sensitive suites must NEVER show
+ * fake-green when their probe is missing.
+ */
+const BUILTIN_PROBES = {
+  diagnosticsProbe,
+  docsValidationProbe,
+} as const;
+
+export function registerBuiltinProbes(): void {
+  for (const [name, fn] of Object.entries(BUILTIN_PROBES)) {
+    if (lookupProbe(name) === undefined) {
+      registerProbe(name, fn);
+    }
+  }
+}
diff --git a/test/e2e-scenario/scenarios/probes/diagnostics.ts b/test/e2e-scenario/scenarios/probes/diagnostics.ts
new file mode 100644
index 0000000000..e2259a6b77
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/diagnostics.ts
@@ -0,0 +1,156 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { spawn } from "node:child_process";
+import fs from "node:fs";
+import os from "node:os";
+import path from "node:path";
+import type { ProbeContext, ProbeFn, ProbeOutcome } from "./types.ts";
+
+/**
+ * Probe: diagnostics.bundle (`diagnosticsProbe`).
+ *
+ * Mirrors test/e2e/test-diagnostics.sh's TC-DIAG-02 case:
+ *
+ *   1. Run `nemoclaw debug --quick --output <tmp>/quick-debug.tar.gz`
+ *      with a 30s budget.
+ *   2. Assert exit 0.
+ *   3. Assert the archive exists and is non-empty.
+ *
+ * The legacy test also asserts the archive contains no plaintext
+ * credentials (TC-DIAG-01); that check is deferred to a separate,
+ * future probe (`diagnosticsBundleSecretsProbe`) so this one stays
+ * narrowly focused on bundle production.
+ *
+ * Evidence: a JSON document at ProbeContext.evidencePath recording
+ * exit code, signal, archive path and size, elapsed ms, and stderr tail.
+ */
+const DIAGNOSTICS_TIMEOUT_MS = 30_000;
+
+interface DiagnosticsEvidence {
+  exitCode: number | null;
+  signal: NodeJS.Signals | null;
+  elapsedMs: number;
+  archivePath: string;
+  archiveSize: number | null;
+  stderrTail: string;
+}
+
+function writeEvidence(evidencePath: string, payload: DiagnosticsEvidence): void {
+  try {
+    fs.mkdirSync(path.dirname(evidencePath), { recursive: true });
+    fs.writeFileSync(evidencePath, JSON.stringify(payload, null, 2));
+  } catch {
+    /* evidence write is best-effort; never fail the probe on IO. */
+  }
+}
+
+export const diagnosticsProbe: ProbeFn = async (ctx: ProbeContext): Promise<ProbeOutcome> => {
+  // Pre-flight: nemoclaw must be on PATH; the legacy test treats this
+  // as a hard prerequisite, not a skip.
+  // (We rely on the spawned process surfacing ENOENT if it isn't.)
+
+  const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "e2e-diag-probe-"));
+  const archivePath = path.join(tmp, "quick-debug.tar.gz");
+  const startedAt = Date.now();
+
+  let exitCode: number | null = null;
+  let signal: NodeJS.Signals | null = null;
+  let stderrTail = "";
+
+  const result = await new Promise<{ code: number | null; signal: NodeJS.Signals | null }>(
+    (resolve) => {
+      const child = spawn(
+        "nemoclaw",
+        ["debug", "--quick", "--output", archivePath],
+        // Use the parent env directly: probes run inside the framework
+        // process and don't need the redacted secret env that shell
+        // steps build at the spawn boundary. PATH/HOME/E2E_* are
+        // already in process.env.
+        { env: process.env, cwd: ctx.repoRoot, stdio: ["ignore", "ignore", "pipe"] },
+      );
+      const onTimeout = setTimeout(() => {
+        try {
+          child.kill("SIGTERM");
+        } catch {
+          /* already gone */
+        }
+      }, DIAGNOSTICS_TIMEOUT_MS);
+      child.stderr?.on("data", (chunk: Buffer) => {
+        stderrTail = (stderrTail + chunk.toString("utf8")).slice(-1024);
+      });
+      child.on("error", (err) => {
+        clearTimeout(onTimeout);
+        // ENOENT or similar: nemoclaw is not on PATH. Resolve with 127
+        // (the shell's "command not found" code) and note the spawn error
+        // in stderrTail so the operator can see it is an environment problem.
+        stderrTail = (stderrTail + `spawn error: ${err.message}`).slice(-1024);
+        resolve({ code: 127, signal: null });
+      });
+      child.on("close", (code, sig) => {
+        clearTimeout(onTimeout);
+        resolve({ code, signal: sig });
+      });
+    },
+  );
+  exitCode = result.code;
+  signal = result.signal;
+  const elapsedMs = Date.now() - startedAt;
+
+  let archiveSize: number | null = null;
+  try {
+    const stat = fs.statSync(archivePath);
+    archiveSize = stat.size;
+  } catch {
+    archiveSize = null;
+  }
+
+  const evidence: DiagnosticsEvidence = {
+    exitCode,
+    signal,
+    elapsedMs,
+    archivePath,
+    archiveSize,
+    stderrTail,
+  };
+  writeEvidence(ctx.evidencePath, evidence);
+
+  // Best-effort cleanup of the tmp dir; keep the JSON evidence on
+  // disk regardless.
+  try {
+    fs.rmSync(tmp, { recursive: true, force: true });
+  } catch {
+    /* tmp cleanup is non-fatal */
+  }
+
+  if (signal === "SIGTERM") {
+    return {
+      status: "failed",
+      classifier: "runner-infra",
+      message: `diagnosticsProbe: nemoclaw debug --quick exceeded ${DIAGNOSTICS_TIMEOUT_MS / 1000}s`,
+    };
+  }
+  if (exitCode !== 0) {
+    return {
+      status: "failed",
+      message: `diagnosticsProbe: nemoclaw debug --quick exited ${exitCode}; stderr: ${stderrTail.slice(-300)}`,
+    };
+  }
+  if (archiveSize === null) {
+    return {
+      status: "failed",
+      message: `diagnosticsProbe: archive missing at ${archivePath}`,
+    };
+  }
+  if (archiveSize === 0) {
+    return {
+      status: "failed",
+      message: `diagnosticsProbe: archive at ${archivePath} is empty`,
+    };
+  }
+
+  return {
+    status: "passed",
+    message: `diagnosticsProbe: bundle ok (${archiveSize} bytes, ${elapsedMs}ms)`,
+  };
+};
diff --git a/test/e2e-scenario/scenarios/probes/docs-validation.ts b/test/e2e-scenario/scenarios/probes/docs-validation.ts
new file mode 100644
index 0000000000..76ba5127c6
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/docs-validation.ts
@@ -0,0 +1,160 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { spawn } from "node:child_process";
+import fs from "node:fs";
+import path from "node:path";
+import type { ProbeContext, ProbeFn, ProbeOutcome } from "./types.ts";
+
+/**
+ * Probe: docs.validation (`docsValidationProbe`).
+ *
+ * Mirrors test/e2e/test-docs-validation.sh:
+ *
+ *   1. Run `test/e2e/e2e-cloud-experimental/check-docs.sh --only-cli`
+ *      to verify `nemoclaw --help` matches docs/reference/commands.mdx
+ *      (CLI / docs parity).
+ *   2. Run `... --only-links --local-only` to verify markdown internal
+ *      links resolve. Remote http(s) checks are always skipped here
+ *      (`CHECK_DOC_LINKS_REMOTE=0`) because they are slow and flaky
+ *      under CI rate limiting (the legacy script documents this caveat).
+ *
+ * Both checks exit 0 on success. The probe captures both exit codes
+ * and surfaces a single combined outcome, with a structured evidence
+ * JSON for diagnosis.
+ */
+
+const CHECK_DOCS_REL = "test/e2e/e2e-cloud-experimental/check-docs.sh";
+const CLI_PARITY_TIMEOUT_MS = 60_000;
+const LINK_CHECK_TIMEOUT_MS = 90_000;
+
+interface DocsCheckResult {
+  phase: "cli-parity" | "links-local";
+  exitCode: number | null;
+  signal: NodeJS.Signals | null;
+  elapsedMs: number;
+  stderrTail: string;
+  stdoutTail: string;
+}
+
+interface DocsEvidence {
+  results: DocsCheckResult[];
+}
+
+function runCheck(
+  scriptPath: string,
+  args: readonly string[],
+  cwd: string,
+  timeoutMs: number,
+  phase: DocsCheckResult["phase"],
+): Promise<DocsCheckResult> {
+  return new Promise((resolve) => {
+    const startedAt = Date.now();
+    let stdoutTail = "";
+    let stderrTail = "";
+    const child = spawn("bash", [scriptPath, ...args], {
+      env: { ...process.env, CHECK_DOC_LINKS_REMOTE: "0" },
+      cwd,
+      stdio: ["ignore", "pipe", "pipe"],
+    });
+    const onTimeout = setTimeout(() => {
+      try {
+        child.kill("SIGTERM");
+      } catch {
+        /* already gone */
+      }
+    }, timeoutMs);
+    child.stdout?.on("data", (chunk: Buffer) => {
+      stdoutTail = (stdoutTail + chunk.toString("utf8")).slice(-1024);
+    });
+    child.stderr?.on("data", (chunk: Buffer) => {
+      stderrTail = (stderrTail + chunk.toString("utf8")).slice(-1024);
+    });
+    child.on("error", (err) => {
+      clearTimeout(onTimeout);
+      resolve({
+        phase,
+        exitCode: 127,
+        signal: null,
+        elapsedMs: Date.now() - startedAt,
+        stderrTail: `spawn error: ${err.message}`,
+        stdoutTail,
+      });
+    });
+    child.on("close", (code, sig) => {
+      clearTimeout(onTimeout);
+      resolve({
+        phase,
+        exitCode: code,
+        signal: sig,
+        elapsedMs: Date.now() - startedAt,
+        stderrTail,
+        stdoutTail,
+      });
+    });
+  });
+}
+
+function writeEvidence(evidencePath: string, payload: DocsEvidence): void {
+  try {
+    fs.mkdirSync(path.dirname(evidencePath), { recursive: true });
+    fs.writeFileSync(evidencePath, JSON.stringify(payload, null, 2));
+  } catch {
+    /* evidence write is best-effort */
+  }
+}
+
+export const docsValidationProbe: ProbeFn = async (ctx: ProbeContext): Promise<ProbeOutcome> => {
+  const scriptPath = path.resolve(ctx.repoRoot, CHECK_DOCS_REL);
+  if (!fs.existsSync(scriptPath)) {
+    return {
+      status: "failed",
+      message: `docsValidationProbe: check-docs.sh not found at ${scriptPath}`,
+    };
+  }
+
+  const cliResult = await runCheck(
+    scriptPath,
+    ["--only-cli"],
+    ctx.repoRoot,
+    CLI_PARITY_TIMEOUT_MS,
+    "cli-parity",
+  );
+  const linksResult = await runCheck(
+    scriptPath,
+    ["--only-links", "--local-only"],
+    ctx.repoRoot,
+    LINK_CHECK_TIMEOUT_MS,
+    "links-local",
+  );
+
+  writeEvidence(ctx.evidencePath, { results: [cliResult, linksResult] });
+
+  // Surface SIGTERM (timeout) as runner-infra so the orchestrator may
+  // retry on a transient slowness. Hard exit-code failures do not
+  // retry — a docs/CLI drift is deterministic.
+  if (cliResult.signal === "SIGTERM" || linksResult.signal === "SIGTERM") {
+    const which = cliResult.signal === "SIGTERM" ? "cli-parity" : "links-local";
+    return {
+      status: "failed",
+      classifier: "runner-infra",
+      message: `docsValidationProbe: ${which} check timed out`,
+    };
+  }
+  if (cliResult.exitCode !== 0) {
+    return {
+      status: "failed",
+      message: `docsValidationProbe: CLI/docs parity failed (exit ${cliResult.exitCode}); stderr: ${cliResult.stderrTail.slice(-300)}`,
+    };
+  }
+  if (linksResult.exitCode !== 0) {
+    return {
+      status: "failed",
+      message: `docsValidationProbe: markdown link check failed (exit ${linksResult.exitCode}); stderr: ${linksResult.stderrTail.slice(-300)}`,
+    };
+  }
+  return {
+    status: "passed",
+    message: `docsValidationProbe: ok (cli ${cliResult.elapsedMs}ms, links ${linksResult.elapsedMs}ms)`,
+  };
+};
diff --git a/test/e2e-scenario/scenarios/probes/registry.ts b/test/e2e-scenario/scenarios/probes/registry.ts
new file mode 100644
index 0000000000..3c4403cfcc
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/registry.ts
@@ -0,0 +1,54 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { ProbeFn } from "./types.ts";
+
+/**
+ * Map of probe-ref name → probe runner. Shell-side AssertionStep
+ * declarations carry an `implementation: { kind: "probe", ref: <name> }`.
+ * The orchestrator calls `lookupProbe(ref)` at execution time; if it
+ * returns undefined the step is reported skipped (or failed for
+ * `required` probes).
+ *
+ * The registry is module-scoped state. Built-in probes are registered
+ * by calling `registerBuiltinProbes()` from `./builtin.ts`, which is
+ * idempotent. Tests that need a clean slate can call `resetProbeRegistry()`.
+ */
+const probes = new Map<string, ProbeFn>();
+
+/**
+ * Register a probe implementation under `name`. Re-registering an
+ * existing name throws — silently shadowing a probe is a contract
+ * violation that hides behavior from the runner.
+ */
+export function registerProbe(name: string, fn: ProbeFn): void {
+  if (!name) {
+    throw new Error("registerProbe: name is required");
+  }
+  if (probes.has(name)) {
+    throw new Error(`registerProbe: '${name}' already registered`);
+  }
+  probes.set(name, fn);
+}
+
+/**
+ * Look up a registered probe. Returns undefined when the ref is not
+ * registered; the caller (phase.ts) decides whether the missing probe
+ * surfaces as skipped or failed based on AssertionStep.required.
+ */
+export function lookupProbe(name: string): ProbeFn | undefined {
+  return probes.get(name);
+}
+
+/**
+ * Names of every currently-registered probe. Useful in plan rendering
+ * and tests that assert a build wired its expected probes.
+ */
+export function listRegisteredProbes(): readonly string[] {
+  return Array.from(probes.keys()).sort();
+}
+
+/** Test-only: clear the registry so each test starts from empty. */
+export function resetProbeRegistry(): void {
+  probes.clear();
+}
diff --git a/test/e2e-scenario/scenarios/probes/types.ts b/test/e2e-scenario/scenarios/probes/types.ts
new file mode 100644
index 0000000000..4b1edabd08
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/types.ts
@@ -0,0 +1,61 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { TransientClassifier } from "../types.ts";
+
+/**
+ * Context handed to a probe runner. Mirrors the subset of scenario
+ * state that shell steps already get via `${E2E_CONTEXT_DIR}/context.env`,
+ * but typed so probe implementations don't have to parse the file
+ * themselves.
+ *
+ * The orchestrator builds this before invoking the probe; probe code
+ * must NOT mutate `contextEnv` (treat as read-only).
+ */
+export interface ProbeContext {
+  /** Repo-relative or absolute path to .e2e/.. context root. */
+  contextDir: string;
+  /** Absolute path to the evidence file the probe SHOULD write. */
+  evidencePath: string;
+  /** Parsed key/value pairs from ${contextDir}/context.env. */
+  contextEnv: Readonly<Record<string, string>>;
+  /** Convenience accessors for the most-used keys. Null when missing. */
+  sandboxName: string | null;
+  gatewayUrl: string | null;
+  /** Repo root, so probes that shell out have a canonical cwd. */
+  repoRoot: string;
+}
+
+/**
+ * Structured probe result. Mirrors AssertionStep StepAttemptOutcome
+ * in `phase.ts` so the orchestrator can adopt it without translation.
+ *
+ * Probes MUST emit a structured outcome — never throw out of the
+ * registered function. Throwing is a contract violation that the
+ * orchestrator surfaces as a failed assertion with the error message,
+ * but a well-behaved probe converts thrown errors into a `failed`
+ * outcome with a redacted message.
+ */
+export interface ProbeOutcome {
+  status: "passed" | "failed" | "skipped";
+  message?: string;
+  classifier?: TransientClassifier;
+  /**
+   * Optional override for the evidence path. If omitted the orchestrator
+   * uses `step.evidencePath` (which the probe was already told via
+   * ProbeContext.evidencePath).
+   */
+  evidence?: string;
+}
+
+/**
+ * The function shape every registered probe implements.
+ *
+ * Convention:
+ *   - Probes are async even when they could be sync, so the registry
+ *     can swap an implementation for a slow IO-bound version without
+ *     ripple effects through the orchestrator.
+ *   - Probes write structured evidence (JSON) to ProbeContext.evidencePath
+ *     so failures are diagnosable from the artifact bundle.
+ */
+export type ProbeFn = (ctx: ProbeContext) => Promise<ProbeOutcome>;

From ff4b0a8f90b03fa5878730e74bad26599ece7508 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 09:15:38 -0400
Subject: [PATCH 22/23] feat(e2e): typed state-validation phase gates runtime
 suites (PR #4380 advisor #4)

Advisor finding: README and scenario model define
'setup scenario -> expected state -> suite sequence', but the live TS
path skipped expected-state validation entirely. The legacy bash
runner gated this behind E2E_VALIDATE_EXPECTED_STATE=1 and used
env-var probe overrides only; the actual environment-shape contract
ran inline as e2e_gateway_assert_healthy / e2e_sandbox_assert_running
between onboarding and suite execution. The TS runner did neither.

Promote the inline preconditions to a first-class typed phase, in
the spirit of EnvironmentOrchestrator/OnboardingOrchestrator: real
probes, real clients, framework-owned timeouts and redaction, single
source of truth in the typed registry.

- types.ts: PhaseName gains 'state-validation'. New
  ExpectedState typed shape (cli/gateway/sandbox/inference/credentials
  with present|absent|optional). New StateProbeId union. ExpectedFailurePhase
  intentionally excludes state-validation so scenarios cannot declare
  expected failures against an internal phase.

- scenarios/expected-states.ts (new): typed mirror of
  nemoclaw_scenarios/expected-states.yaml. Source of truth for the TS
  runner during transition; YAML stays in place for the legacy
  resolver until that path is fully retired. probesForState() maps
  the typed contract to the concrete probe ids the orchestrator emits.
  Inference and credentials remain declared but emit no probe actions
  yet (probe scripts not implemented); a registry test pins this gap
  so a future probe-script PR is forced to update the mapping.

- nemoclaw_scenarios/probes/ (new):
  - dispatch.sh exports e2e_state_probe <id>; the typed runner spawns
    it via the shared dispatch-action.sh launcher.
  - cli-installed.sh: command -v nemoclaw + executable check.
  - gateway-healthy.sh: defers to validation_suites/assert/gateway-alive.sh
    so there's one implementation of the gateway-health contract.
  - sandbox-running.sh: defers to validation_suites/assert/sandbox-alive.sh.
  - gateway-absent.sh: nemoclaw gateway status + URL reachability.
    Typed replacement for the run-scenario.sh inline forbidden-effect
    check on the gateway axis.
  - sandbox-absent.sh: nemoclaw list + openshell sandbox list. Typed
    replacement for the inline 'openshell sandbox list | grep -Fq'.

- compiler.ts: state-validation phase actions emitted from
  scenario.expectedStateId via probesForState(). Hard error on
  unknown expected_state id (typed runner is stricter than legacy
  resolver). Phase order is environment -> onboarding ->
  state-validation -> runtime.

- orchestrators/state-validation.ts (new): tiny subclass of
  PhaseOrchestrator. No new control flow; probe actions reuse the
  existing shell-fn action machinery (timeouts, redaction, evidence
  logs).

- orchestrators/runner.ts: phase-blocking semantics get one rule
  refinement. environment failure blocks onboarding, state-validation,
  and runtime. onboarding failure does NOT block state-validation -
  negative scenarios deliberately fail onboarding and rely on
  absent-state probes running afterward to verify forbidden side
  effects did not occur. state-validation failure blocks runtime so
  suites never run against a missing/wedged environment.

- orchestrators/negative-matcher.ts: state-validation forbidden-effect
  probe ids (gateway-absent, sandbox-absent) excluded from observed-failure
  scanning. They are post-failure verification, not the failure mode
  itself; their pass/fail status is reported separately and feeds the
  phase chain through normal action-failure semantics.

- 16 new tests in e2e-expected-state.test.ts cover: typed registry
  mirrors YAML structurally, probesForState mapping for ready/absent/
  optional states (inference/credentials gap pinned), compiler emits
  the right probe actions per scenario, phase order, hard error on
  unknown state id, and the three runner short-circuit cases
  (env failure blocks state-validation, onboarding failure does not,
  state-validation failure blocks runtime). Existing tests updated
  for the new phase ordering (e2e-phase-orchestrators,
  e2e-negative-matcher).
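
For reviewers, the phase-blocking rule described above can be sketched
as a small lookup table. This is an illustrative sketch only: the names
BLOCKS and blockedPhases are hypothetical and do not appear in
orchestrators/runner.ts, and the real runner may structure the rule
differently. The onboarding -> runtime entry is inferred from the new
short-circuit test, which expects runtime to be skipped after an
onboarding failure.

```typescript
// Hypothetical sketch of the phase-blocking rule; not the runner's actual code.
type PhaseName = "environment" | "onboarding" | "state-validation" | "runtime";
type PhaseStatus = "passed" | "failed" | "skipped";

const PHASE_ORDER: readonly PhaseName[] = [
  "environment",
  "onboarding",
  "state-validation",
  "runtime",
];

// For each phase, the later phases that a failure in it blocks.
const BLOCKS: Record<PhaseName, readonly PhaseName[]> = {
  environment: ["onboarding", "state-validation", "runtime"],
  // Onboarding failure leaves state-validation runnable so absent-state
  // probes can verify that no forbidden side effects were left behind.
  onboarding: ["runtime"],
  "state-validation": ["runtime"],
  runtime: [],
};

// Given the statuses observed so far, return the phases that must be skipped.
function blockedPhases(
  statuses: Partial<Record<PhaseName, PhaseStatus>>,
): Set<PhaseName> {
  const blocked = new Set<PhaseName>();
  for (const phase of PHASE_ORDER) {
    if (statuses[phase] === "failed") {
      for (const later of BLOCKS[phase]) {
        blocked.add(later);
      }
    }
  }
  return blocked;
}
```

A table keeps the one refinement (onboarding failure does not block
state-validation) visible in a single place rather than spread across
conditionals.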

Spec ownership boundaries kept honest:
- Typed contract over env-var probes: replaces E2E_PROBE_OVERRIDE_*
  with structurally-typed expected-state declarations.
- One mode, no opt-out: state-validation always runs after onboarding
  for any scenario with an expectedStateId.
- Framework infra, not shell: orchestrator + clients + redaction
  reused; only the probe scripts are bash, same shape as install/
  onboard dispatchers.
- Single source of truth: scenario.expectedStateId -> typed registry
  -> emitted probe actions; one declaration drives everything.
- Failures stay visibly real: probes that aren't implemented yet
  (inference, credentials) stay declared but inert with a registry
  test pinning the gap. Their first execution lands when the probe
  ships, not as silent green.
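
The "single source of truth" flow above (expected state -> probe ids)
can be sketched roughly as follows. This is a simplified illustration
under stated assumptions, not the code in scenarios/expected-states.ts:
the type and function names ending in Sketch are hypothetical, the
field shapes are reduced to what the new tests exercise, and
cli.installed is assumed to be a boolean.

```typescript
// Hypothetical, simplified sketch of the expected-state -> probe-id mapping.
type Presence = "present" | "absent" | "optional";

type StateProbeId =
  | "cli-installed"
  | "gateway-healthy"
  | "sandbox-running"
  | "gateway-absent"
  | "sandbox-absent";

interface ExpectedStateSketch {
  id: string;
  cli?: { installed: boolean }; // assumed boolean for this sketch
  gateway?: { expected: Presence };
  sandbox?: { expected: Presence };
  inference?: { expected: string };
  credentials?: { expected: string };
}

function probesForStateSketch(state: ExpectedStateSketch): StateProbeId[] {
  const probes: StateProbeId[] = [];
  if (state.cli?.installed === true) probes.push("cli-installed");

  // "optional" dimensions emit nothing: either outcome is acceptable.
  if (state.gateway?.expected === "present") probes.push("gateway-healthy");
  if (state.gateway?.expected === "absent") probes.push("gateway-absent");

  if (state.sandbox?.expected === "present") probes.push("sandbox-running");
  if (state.sandbox?.expected === "absent") probes.push("sandbox-absent");

  // inference and credentials stay declared but inert until their
  // probe scripts exist, so they contribute no probe ids here.
  return probes;
}
```

Emitting nothing for optional or not-yet-implemented dimensions means a
missing probe never shows up as a pass; it simply is not in the plan.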

Out of scope (deferred follow-ups):
- inference-available and credentials-present probe scripts.
- Per-suite requires_state gating at the assertion level.
- Retiring expected-states.yaml and runtime/resolver/validator.ts;
  the typed registry is now the SoT for the TS runner, but both
  artifacts remain alongside until the legacy resolver is unused.
- Replacing the runtime.expected-failure.no-side-effects required
  pending step. State-validation absent probes now do the real work,
  but the placeholder stays put until a follow-up confirms the
  switchover and removes it.

418 framework tests pass (16 new). tsc clean. Plan output verified
on positive (ubuntu-repo-cloud-openclaw) and negative
(ubuntu-no-docker-preflight-negative) scenarios.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 test/e2e-scenario/docs/README.md              |  16 +
 .../e2e-expected-state.test.ts                | 353 ++++++++++++++++++
 .../e2e-negative-matcher.test.ts              |   8 +-
 .../e2e-phase-orchestrators.test.ts           |  42 ++-
 .../probes/cli-installed.sh                   |  24 ++
 .../nemoclaw_scenarios/probes/dispatch.sh     |  42 +++
 .../probes/gateway-absent.sh                  |  44 +++
 .../probes/gateway-healthy.sh                 |  34 ++
 .../probes/sandbox-absent.sh                  |  46 +++
 .../probes/sandbox-running.sh                 |  30 ++
 test/e2e-scenario/scenarios/compiler.ts       |  46 ++-
 .../e2e-scenario/scenarios/expected-states.ts | 133 +++++++
 .../orchestrators/negative-matcher.ts         |  34 +-
 .../scenarios/orchestrators/runner.ts         |  42 ++-
 .../orchestrators/state-validation.ts         |  24 ++
 test/e2e-scenario/scenarios/types.ts          |  58 ++-
 16 files changed, 961 insertions(+), 15 deletions(-)
 create mode 100644 test/e2e-scenario/framework-tests/e2e-expected-state.test.ts
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
 create mode 100755 test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
 create mode 100644 test/e2e-scenario/scenarios/expected-states.ts
 create mode 100644 test/e2e-scenario/scenarios/orchestrators/state-validation.ts

diff --git a/test/e2e-scenario/docs/README.md b/test/e2e-scenario/docs/README.md
index f4acc8eebe..e2ccf1c527 100644
--- a/test/e2e-scenario/docs/README.md
+++ b/test/e2e-scenario/docs/README.md
@@ -32,6 +32,22 @@ test plan, expected state, and post-onboard suites. Test plans can also declare
 onboarding assertions that run after install/onboard and before expected-state
 validation.
 
+The typed TS runner enforces the contract by inserting a dedicated
+`state-validation` phase between onboarding and runtime. Probe actions
+are emitted from the typed expected-state registry
+(`scenarios/expected-states.ts`, a typed mirror of
+`nemoclaw_scenarios/expected-states.yaml` during transition):
+
+- `cli-installed`, `gateway-healthy`, `sandbox-running` for ready states.
+- `gateway-absent`, `sandbox-absent` for negative/preflight-failure states.
+
+A failed probe is a phase-action failure; the runner short-circuits
+the runtime phase rather than running suite assertions against a
+missing or wedged environment. An onboarding-phase failure does NOT
+block state-validation — negative scenarios depend on absent-state
+probes running after the deliberate onboarding failure to verify
+forbidden side effects (gateway/sandbox left behind) did not occur.
+
 ## How to run
 
 The TypeScript runner is the only supported entrypoint. There is one
diff --git a/test/e2e-scenario/framework-tests/e2e-expected-state.test.ts b/test/e2e-scenario/framework-tests/e2e-expected-state.test.ts
new file mode 100644
index 0000000000..acffc5bf60
--- /dev/null
+++ b/test/e2e-scenario/framework-tests/e2e-expected-state.test.ts
@@ -0,0 +1,353 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { describe, expect, it } from "vitest";
+import fs from "node:fs";
+import os from "node:os";
+import path from "node:path";
+import yaml from "js-yaml";
+
+import { compileRunPlans } from "../scenarios/compiler.ts";
+import {
+  getExpectedState,
+  listExpectedStates,
+  probesForState,
+  requireExpectedState,
+} from "../scenarios/expected-states.ts";
+import { ScenarioRunner } from "../scenarios/orchestrators/runner.ts";
+import { listScenarios } from "../scenarios/registry.ts";
+import type { ExpectedState, PhaseName, PhaseResult, RunContext, RunPlanPhase } from "../scenarios/types.ts";
+
+const REPO_ROOT = path.resolve(import.meta.dirname, "../../..");
+const STATES_YAML_PATH = path.join(
+  REPO_ROOT,
+  "test/e2e-scenario/nemoclaw_scenarios/expected-states.yaml",
+);
+
+function freshCtx(): RunContext {
+  return { contextDir: fs.mkdtempSync(path.join(os.tmpdir(), "e2e-state-")) };
+}
+
+describe("typed expected-state registry mirrors expected-states.yaml", () => {
+  it("typed registry covers every YAML expected_state id", () => {
+    const yamlDoc = yaml.load(fs.readFileSync(STATES_YAML_PATH, "utf8")) as {
+      expected_states: Record<string, unknown>;
+    };
+    const yamlIds = Object.keys(yamlDoc.expected_states).sort();
+    const typedIds = listExpectedStates()
+      .map((s) => s.id)
+      .sort();
+    expect(typedIds).toEqual(yamlIds);
+  });
+
+  it("structural cli/gateway/sandbox dimensions match the YAML for each shared id", () => {
+    const yamlDoc = yaml.load(fs.readFileSync(STATES_YAML_PATH, "utf8")) as {
+      expected_states: Record<string, Record<string, Record<string, unknown>>>;
+    };
+    for (const state of listExpectedStates()) {
+      const yamlEntry = yamlDoc.expected_states[state.id];
+      expect(yamlEntry, `YAML entry for ${state.id}`).toBeTruthy();
+      // cli.installed
+      if (state.cli?.installed !== undefined) {
+        expect(yamlEntry.cli?.installed).toBe(state.cli.installed);
+      }
+      // gateway.expected
+      if (state.gateway) {
+        expect(yamlEntry.gateway?.expected).toBe(state.gateway.expected);
+        if (state.gateway.health) {
+          expect(yamlEntry.gateway?.health).toBe(state.gateway.health);
+        }
+      }
+      // sandbox.expected
+      if (state.sandbox) {
+        expect(yamlEntry.sandbox?.expected).toBe(state.sandbox.expected);
+        if (state.sandbox.status) {
+          expect(yamlEntry.sandbox?.status).toBe(state.sandbox.status);
+        }
+        if (state.sandbox.agent) {
+          expect(yamlEntry.sandbox?.agent).toBe(state.sandbox.agent);
+        }
+      }
+    }
+  });
+
+  it("requireExpectedState throws on unknown id with available list", () => {
+    expect(() => requireExpectedState("does-not-exist")).toThrow(/Unknown expected_state/);
+  });
+
+  it("getExpectedState returns the state for known ids", () => {
+    expect(getExpectedState("cloud-openclaw-ready")?.id).toBe("cloud-openclaw-ready");
+  });
+});
+
+describe("probesForState maps typed expected-state into probe ids", () => {
+  it("ready cloud state emits cli-installed, gateway-healthy, sandbox-running", () => {
+    expect(probesForState(requireExpectedState("cloud-openclaw-ready"))).toEqual([
+      "cli-installed",
+      "gateway-healthy",
+      "sandbox-running",
+    ]);
+  });
+
+  it("preflight-failure state emits cli-installed, gateway-absent, sandbox-absent", () => {
+    expect(probesForState(requireExpectedState("preflight-failure-no-sandbox"))).toEqual([
+      "cli-installed",
+      "gateway-absent",
+      "sandbox-absent",
+    ]);
+  });
+
+  it("optional-dimension state emits cli-installed only", () => {
+    expect(probesForState(requireExpectedState("macos-cli-ready-docker-optional"))).toEqual([
+      "cli-installed",
+    ]);
+  });
+
+  it("inference and credentials probes are intentionally NOT emitted yet", () => {
+    // The typed registry declares inference.expected=available and
+    // credentials.expected=present for ready states; the compiler does
+    // not yet emit probe actions for those dimensions because the
+    // probe scripts aren't written. This test pins that gap so a
+    // future probe-script PR is forced to update probesForState too.
+    const state: ExpectedState = {
+      id: "synthetic",
+      inference: { expected: "available", provider: "nvidia" },
+      credentials: { expected: "present" },
+    };
+    expect(probesForState(state)).toEqual([]);
+  });
+});
+
+describe("compiler emits state-validation phase actions from expected-state registry", () => {
+  it("positive scenario gets cli-installed + gateway-healthy + sandbox-running probe actions", () => {
+    const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+    const stateValidationPhase = plan.phases.find((p) => p.name === "state-validation");
+    expect(stateValidationPhase).toBeTruthy();
+    expect(stateValidationPhase!.actions.map((a) => a.id)).toEqual([
+      "state-validation.cli-installed",
+      "state-validation.gateway-healthy",
+      "state-validation.sandbox-running",
+    ]);
+    // Probes are typed shell-fn actions that go through the shared
+    // dispatcher; the orchestrator owns timeouts and redaction.
+    for (const action of stateValidationPhase!.actions) {
+      expect(action.kind).toBe("shell-fn");
+      expect(action.fn).toBe("e2e_state_probe");
+      expect(action.scriptRef).toBe(
+        "test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh",
+      );
+      expect(action.timeoutSeconds).toBe(30);
+    }
+  });
+
+  it("negative scenario gets cli-installed + gateway-absent + sandbox-absent probe actions", () => {
+    const [plan] = compileRunPlans(["ubuntu-no-docker-preflight-negative"]);
+    const stateValidationPhase = plan.phases.find((p) => p.name === "state-validation");
+    expect(stateValidationPhase).toBeTruthy();
+    expect(stateValidationPhase!.actions.map((a) => a.id)).toEqual([
+      "state-validation.cli-installed",
+      "state-validation.gateway-absent",
+      "state-validation.sandbox-absent",
+    ]);
+  });
+
+  it("compiler hard-errors on a scenario referencing an unknown expected_state id", () => {
+    expect(() =>
+      compileRunPlans([
+        {
+          id: "synthetic-unknown-state",
+          assertionGroups: [],
+          expectedStateId: "definitely-not-a-state",
+        },
+      ]),
+    ).toThrow(/unknown expected_state/);
+  });
+
+  it("phase order is environment -> onboarding -> state-validation -> runtime", () => {
+    const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+    expect(plan.phases.map((p) => p.name)).toEqual([
+      "environment",
+      "onboarding",
+      "state-validation",
+      "runtime",
+    ]);
+  });
+});
+
+describe("ScenarioRunner short-circuit semantics around state-validation", () => {
+  it("onboarding action failure does NOT block state-validation (negative scenarios verify absent state)", async () => {
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-no-docker-preflight-negative"]);
+      const phase = (
+        name: PhaseName,
+        outcome: PhaseResult,
+      ): { run: (ctx: RunContext, p: RunPlanPhase) => Promise<PhaseResult> } => ({
+        run: async () => outcome,
+      });
+
+      let stateValidationCalled = false;
+      let runtimeCalled = false;
+      const runner = new ScenarioRunner({
+        environment: phase("environment", {
+          phase: "environment",
+          status: "passed",
+          actions: [],
+          assertions: [],
+        }),
+        onboarding: phase("onboarding", {
+          phase: "onboarding",
+          status: "failed",
+          actions: [
+            {
+              id: "onboarding.profile.cloud-openclaw-no-docker",
+              status: "failed",
+              durationMs: 1,
+              message: "preflight detected docker-missing",
+            },
+          ],
+          assertions: [],
+        }),
+        stateValidation: {
+          run: async () => {
+            stateValidationCalled = true;
+            return {
+              phase: "state-validation",
+              status: "passed",
+              actions: [],
+              assertions: [],
+            };
+          },
+        },
+        runtime: {
+          run: async () => {
+            runtimeCalled = true;
+            return { phase: "runtime", status: "passed", actions: [], assertions: [] };
+          },
+        },
+      });
+
+      const results = await runner.run(ctx, plan);
+      expect(stateValidationCalled).toBe(true);
+      expect(runtimeCalled).toBe(false);
+      // state-validation has its real result; runtime is skipped with
+      // the blocking-action message.
+      const stateRes = results.find((r) => r.phase === "state-validation")!;
+      expect(stateRes.status).toBe("passed");
+      const runtimeRes = results.find((r) => r.phase === "runtime")!;
+      expect(runtimeRes.status).toBe("skipped");
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("environment action failure blocks state-validation AND runtime", async () => {
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      let stateValidationCalled = false;
+      let runtimeCalled = false;
+      const runner = new ScenarioRunner({
+        environment: {
+          run: async () => ({
+            phase: "environment",
+            status: "failed",
+            actions: [
+              {
+                id: "environment.install.repo-current",
+                status: "failed",
+                durationMs: 1,
+                message: "install dispatcher exit 1",
+              },
+            ],
+            assertions: [],
+          }),
+        },
+        onboarding: {
+          run: async () => ({ phase: "onboarding", status: "passed", actions: [], assertions: [] }),
+        },
+        stateValidation: {
+          run: async () => {
+            stateValidationCalled = true;
+            return {
+              phase: "state-validation",
+              status: "passed",
+              actions: [],
+              assertions: [],
+            };
+          },
+        },
+        runtime: {
+          run: async () => {
+            runtimeCalled = true;
+            return { phase: "runtime", status: "passed", actions: [], assertions: [] };
+          },
+        },
+      });
+      await runner.run(ctx, plan);
+      expect(stateValidationCalled).toBe(false);
+      expect(runtimeCalled).toBe(false);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+
+  it("state-validation action failure blocks runtime", async () => {
+    const ctx = freshCtx();
+    try {
+      const [plan] = compileRunPlans(["ubuntu-repo-cloud-openclaw"]);
+      let runtimeCalled = false;
+      const runner = new ScenarioRunner({
+        environment: {
+          run: async () => ({ phase: "environment", status: "passed", actions: [], assertions: [] }),
+        },
+        onboarding: {
+          run: async () => ({ phase: "onboarding", status: "passed", actions: [], assertions: [] }),
+        },
+        stateValidation: {
+          run: async () => ({
+            phase: "state-validation",
+            status: "failed",
+            actions: [
+              {
+                id: "state-validation.gateway-healthy",
+                status: "failed",
+                durationMs: 1,
+                message: "gateway unreachable at http://127.0.0.1:18789",
+              },
+            ],
+            assertions: [],
+          }),
+        },
+        runtime: {
+          run: async () => {
+            runtimeCalled = true;
+            return { phase: "runtime", status: "passed", actions: [], assertions: [] };
+          },
+        },
+      });
+      const results = await runner.run(ctx, plan);
+      expect(runtimeCalled).toBe(false);
+      const runtimeRes = results.find((r) => r.phase === "runtime")!;
+      expect(runtimeRes.status).toBe("skipped");
+      expect(runtimeRes.assertions[0].message).toMatch(/state-validation\.gateway-healthy/);
+    } finally {
+      fs.rmSync(ctx.contextDir, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("expected-state registry covers every scenario referenced in the typed registry", () => {
+  it("every ScenarioDefinition.expectedStateId resolves in the typed expected-state registry", () => {
+    const referenced = new Set<string>();
+    for (const scenario of listScenarios()) {
+      if (scenario.expectedStateId) {
+        referenced.add(scenario.expectedStateId);
+      }
+    }
+    expect(referenced.size).toBeGreaterThan(0);
+    for (const id of referenced) {
+      expect(getExpectedState(id), `expected_state '${id}' must be in the typed registry`).toBeDefined();
+    }
+  });
+});
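Reviewer sketch, not part of the patch: the three short-circuit tests above pin down a blocking rule that can be written as a small lookup table. The real `phaseBlockedBy` in runner.ts is not shown in this excerpt, so the names `BLOCKERS`, `blockingActionFor`, and the `*Sketch` types below are hypothetical, and the table is only inferred from what the tests assert.

```typescript
// Hypothetical restatement of the short-circuit rule the tests describe.
// None of these names come from the patch itself.
type PhaseNameSketch = "environment" | "onboarding" | "state-validation" | "runtime";
type StatusSketch = "passed" | "failed" | "skipped";

interface ActionResultSketch {
  id: string;
  status: StatusSketch;
}

interface PhaseResultSketch {
  phase: PhaseNameSketch;
  status: StatusSketch;
  actions: ActionResultSketch[];
}

// For each phase, the earlier phases whose failed actions block it.
// Onboarding is deliberately NOT a blocker for state-validation, so
// negative scenarios can still verify absent state after a failed onboard.
const BLOCKERS: Record<PhaseNameSketch, readonly PhaseNameSketch[]> = {
  environment: [],
  onboarding: ["environment"],
  "state-validation": ["environment"],
  runtime: ["environment", "onboarding", "state-validation"],
};

// Returns the id of the first failed action that blocks `phase`, if any.
function blockingActionFor(
  phase: PhaseNameSketch,
  prior: readonly PhaseResultSketch[],
): string | undefined {
  for (const result of prior) {
    if (!BLOCKERS[phase].includes(result.phase)) {
      continue;
    }
    const failed = result.actions.find((action) => action.status === "failed");
    if (failed) {
      return failed.id;
    }
  }
  return undefined;
}
```

Under this reading, a failed onboarding action leaves state-validation runnable and blocks only runtime, while a failed environment action blocks every later phase.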
diff --git a/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts b/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
index fc8f94eea9..5b4d87703f 100644
--- a/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-negative-matcher.test.ts
@@ -333,11 +333,17 @@ describe("ScenarioRunner appends negative-contract phase", () => {
       const runner = new ScenarioRunner({
         environment: fakePhase("environment"),
         onboarding: fakePhase("onboarding"),
+        stateValidation: fakePhase("state-validation"),
         runtime: fakePhase("runtime"),
       });
 
       const results = await runner.run(ctx, plan);
-      expect(results.map((r) => r.phase)).toEqual(["environment", "onboarding", "runtime"]);
+      expect(results.map((r) => r.phase)).toEqual([
+        "environment",
+        "onboarding",
+        "state-validation",
+        "runtime",
+      ]);
     } finally {
       fs.rmSync(ctx.contextDir, { recursive: true, force: true });
     }
diff --git a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
index 82ce513669..62f2367052 100644
--- a/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
@@ -112,13 +112,19 @@ describe("phase orchestrators - top-level delegation", () => {
       const runner = new ScenarioRunner({
         environment: fakeOrchestrator("environment"),
         onboarding: fakeOrchestrator("onboarding"),
+        stateValidation: fakeOrchestrator("state-validation"),
         runtime: fakeOrchestrator("runtime"),
       });
 
       const results = await runner.run(ctx, plan);
 
-      expect(calls).toEqual(["environment", "onboarding", "runtime"]);
-      expect(results.map((result) => result.phase)).toEqual(["environment", "onboarding", "runtime"]);
+      expect(calls).toEqual(["environment", "onboarding", "state-validation", "runtime"]);
+      expect(results.map((result) => result.phase)).toEqual([
+        "environment",
+        "onboarding",
+        "state-validation",
+        "runtime",
+      ]);
     } finally {
       fs.rmSync(ctx.contextDir, { recursive: true, force: true });
     }
@@ -522,18 +528,44 @@ describe("ScenarioRunner seeds context.env and short-circuits across phases", ()
           return { phase: "runtime" as const, status: "passed" as const, actions: [], assertions: [] };
         },
       };
-      const runner = new ScenarioRunner({ environment: failingEnv, onboarding, runtime });
+      let stateValidationCalled = false;
+      const stateValidation = {
+        run: async () => {
+          stateValidationCalled = true;
+          return {
+            phase: "state-validation" as const,
+            status: "passed" as const,
+            actions: [],
+            assertions: [],
+          };
+        },
+      };
+      const runner = new ScenarioRunner({
+        environment: failingEnv,
+        onboarding,
+        stateValidation,
+        runtime,
+      });
 
       const results = await runner.run(ctx, plan);
 
-      // Downstream orchestrators must NOT have been invoked.
+      // Downstream orchestrators must NOT have been invoked. An
+      // environment failure means install did not succeed; there is
+      // nothing for state-validation to probe.
       expect(onboardingCalled).toBe(false);
+      expect(stateValidationCalled).toBe(false);
       expect(runtimeCalled).toBe(false);
       // Each phase still has a result, and the downstream ones are
       // skipped with a message that names the blocking action.
-      expect(results.map((r) => r.phase)).toEqual(["environment", "onboarding", "runtime"]);
+      expect(results.map((r) => r.phase)).toEqual([
+        "environment",
+        "onboarding",
+        "state-validation",
+        "runtime",
+      ]);
       expect(results[1].status).toBe("skipped");
       expect(results[2].status).toBe("skipped");
+      expect(results[3].status).toBe("skipped");
       expect(results[1].assertions[0].message).toMatch(/blocked by prior failure/);
       expect(results[1].assertions[0].message).toMatch(/environment.install.repo-current/);
     } finally {
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
new file mode 100755
index 0000000000..ab0f37a814
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/cli-installed.sh
@@ -0,0 +1,24 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Probe: cli-installed
+#
+# Asserts that the nemoclaw CLI is reachable on PATH after the
+# environment phase's install action completed.
+
+set -euo pipefail
+
+if ! command -v nemoclaw >/dev/null 2>&1; then
+  echo "probe cli-installed: nemoclaw not found on PATH (PATH=${PATH})" >&2
+  exit 1
+fi
+
+# Resolve to a real binary; aliases or shell functions don't count.
+nemoclaw_bin="$(command -v nemoclaw)"
+if [[ ! -x "${nemoclaw_bin}" ]]; then
+  echo "probe cli-installed: nemoclaw resolved to non-executable: ${nemoclaw_bin}" >&2
+  exit 1
+fi
+
+printf 'probe cli-installed: ok (%s)\n' "${nemoclaw_bin}"
+exit 0
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh
new file mode 100755
index 0000000000..84db7e7fa1
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh
@@ -0,0 +1,42 @@
+#!/usr/bin/env bash
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# State-validation probe dispatcher.
+#
+# Each probe is a small bash script in this directory invoked by the
+# typed StateValidationOrchestrator via the shared dispatch-action.sh
+# launcher. The orchestrator owns timeouts, redaction, evidence
+# logging, and pass/fail attribution; probes only return 0 (probe
+# satisfied) or non-zero with a human-readable message on stderr.
+#
+# Probes consult ${E2E_CONTEXT_DIR}/context.env for runtime values
+# (E2E_GATEWAY_URL, E2E_SANDBOX_NAME) seeded by the framework and
+# extended by onboarding.
+#
+# Library style: dispatch.sh defines a single dispatch function
+# (e2e_state_probe) that runs the named probe. The TS phase-action
+# uses fn=e2e_state_probe arg=<probe-id>.
+
+_E2E_PROBES_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+# e2e_state_probe <probe-id>
+e2e_state_probe() {
+  local id="${1:-}"
+  if [[ -z "${id}" ]]; then
+    echo "e2e_state_probe: missing probe id" >&2
+    return 2
+  fi
+  local probe_script="${_E2E_PROBES_DIR}/${id}.sh"
+  if [[ ! -f "${probe_script}" ]]; then
+    echo "e2e_state_probe: unknown probe id '${id}' (no script at ${probe_script})" >&2
+    return 2
+  fi
+  e2e_env_trace "probe:${id}"
+  # Probes run in a subshell so a `set -e` failure inside one probe
+  # does not affect another action in the same orchestrator process.
+  (
+    # shellcheck source=/dev/null
+    . "${probe_script}"
+  )
+}
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
new file mode 100755
index 0000000000..8adf3b4098
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-absent.sh
@@ -0,0 +1,44 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Probe: gateway-absent
+#
+# Negative-state probe. Asserts that no gateway was started by a
+# scenario whose expected_state declares gateway.expected=absent
+# (preflight failure, invalid-key onboarding failure,
+# gateway-port-conflict onboarding failure). This is the typed
+# replacement for the runtime.expected-failure.no-side-effects
+# pending step on the gateway-started axis: a real probe that fails
+# closed if the gateway IS running.
+
+set -euo pipefail
+
+# Order matters: cheap CLI status check first, then a URL reachability
+# check. Both run; we deliberately do NOT rely on any single signal, so
+# a scenario that leaves a partially-started gateway behind cannot
+# slip through.
+
+if command -v nemoclaw >/dev/null 2>&1; then
+  if nemoclaw gateway status >/dev/null 2>&1; then
+    echo "probe gateway-absent: nemoclaw reports gateway is running, expected absent" >&2
+    nemoclaw gateway status >&2 || true
+    exit 1
+  fi
+fi
+
+# Best-effort URL reachability check. context.env may carry a
+# gateway URL even for negative scenarios (it is computed from the
+# scenario id, not from a successful onboard).
+context_env="${E2E_CONTEXT_DIR:-.e2e}/context.env"
+if [[ -f "${context_env}" ]]; then
+  url="$(awk -F= '/^E2E_GATEWAY_URL=/{print substr($0, index($0, "=")+1); exit}' "${context_env}" | tr -d '"')"
+  if [[ -n "${url}" ]]; then
+    if curl -fsS -o /dev/null --max-time 3 "${url%/}/health" 2>/dev/null; then
+      echo "probe gateway-absent: ${url%/}/health responded healthy, expected absent" >&2
+      exit 1
+    fi
+  fi
+fi
+
+echo "probe gateway-absent: ok"
+exit 0
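Reviewer sketch, not part of the patch: the `awk ... | tr -d '"'` pipeline in the probe takes the first `E2E_GATEWAY_URL=` line of context.env, keeps everything after the first `=`, and strips every double quote. The same lookup in TypeScript might look like the following; `readContextValue` is a hypothetical helper, and the file format (one `KEY=value` per line, optional double quotes) is assumed from the probe alone.

```typescript
// Hypothetical helper mirroring the probe's awk + tr lookup.
// Assumes context.env holds one KEY=value pair per line.
function readContextValue(contents: string, key: string): string | undefined {
  const prefix = `${key}=`;
  for (const line of contents.split("\n")) {
    if (line.startsWith(prefix)) {
      // Keep everything after the first "=", so values that contain "="
      // (for example query strings) survive, then drop all double quotes
      // the way `tr -d '"'` does.
      return line.slice(line.indexOf("=") + 1).replace(/"/g, "");
    }
  }
  // The shell version yields an empty string here; undefined makes the
  // "not present" case explicit for callers.
  return undefined;
}
```

As in the shell version, only the first matching line is used, and quotes are removed everywhere in the value rather than only at its ends.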
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
new file mode 100755
index 0000000000..f267899ba3
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/gateway-healthy.sh
@@ -0,0 +1,34 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Probe: gateway-healthy
+#
+# Asserts the gateway is reachable and reports a healthy HTTP status
+# at ${E2E_GATEWAY_URL}/health (with fallback to the base URL). Mirrors
+# the legacy validation_suites/assert/gateway-alive.sh::e2e_gateway_assert_healthy
+# contract, but is invoked as a typed phase action by the
+# StateValidationOrchestrator BEFORE runtime suites run, so suite
+# assertions never execute against a missing or wedged gateway.
+
+set -euo pipefail
+
+# Defer to the legacy bash helper for the actual probe logic so we keep
+# a single implementation of the gateway-health contract during the
+# transition. The legacy helper consults context.env for the URL.
+_THIS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+GATEWAY_HELPER="$(cd "${_THIS_DIR}/../../validation_suites/assert" && pwd)/gateway-alive.sh"
+
+if [[ ! -f "${GATEWAY_HELPER}" ]]; then
+  echo "probe gateway-healthy: legacy helper not found: ${GATEWAY_HELPER}" >&2
+  exit 1
+fi
+
+# shellcheck source=/dev/null
+. "${GATEWAY_HELPER}"
+
+if ! e2e_gateway_assert_healthy; then
+  exit 1
+fi
+
+echo "probe gateway-healthy: ok"
+exit 0
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
new file mode 100755
index 0000000000..455ac61c69
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-absent.sh
@@ -0,0 +1,46 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Probe: sandbox-absent
+#
+# Negative-state probe. Asserts that no sandbox was created by a
+# scenario whose expected_state declares sandbox.expected=absent
+# (preflight failure, onboarding failures). Typed replacement for
+# the legacy run-scenario.sh inline check
+# `openshell sandbox list | grep -Fq "${sandbox_name}"`.
+
+set -euo pipefail
+
+# E2E_SANDBOX_NAME is seeded by the framework from the scenario id
+# even when onboarding never completed; missing context here is a
+# framework bug, not a probe pass.
+if [[ -z "${E2E_SANDBOX_NAME:-}" ]]; then
+  context_env="${E2E_CONTEXT_DIR:-.e2e}/context.env"
+  if [[ -f "${context_env}" ]]; then
+    E2E_SANDBOX_NAME="$(awk -F= '/^E2E_SANDBOX_NAME=/{print substr($0, index($0, "=")+1); exit}' "${context_env}" | tr -d '"')"
+  fi
+fi
+if [[ -z "${E2E_SANDBOX_NAME:-}" ]]; then
+  echo "probe sandbox-absent: E2E_SANDBOX_NAME unset; framework did not seed context" >&2
+  exit 2
+fi
+
+# Two independent checks — `nemoclaw list` is the user-facing surface
+# and openshell-side listing covers cases where nemoclaw is uninstalled
+# or wedged. Either reporting the sandbox fails the probe.
+if command -v nemoclaw >/dev/null 2>&1; then
+  if grep -qE "(^|[[:space:]])${E2E_SANDBOX_NAME}([[:space:]]|$)" <<<"$(nemoclaw list 2>/dev/null || true)"; then
+    echo "probe sandbox-absent: nemoclaw list reports sandbox '${E2E_SANDBOX_NAME}', expected absent" >&2
+    exit 1
+  fi
+fi
+
+if command -v openshell >/dev/null 2>&1; then
+  if grep -Fq "${E2E_SANDBOX_NAME}" <<<"$(openshell sandbox list 2>/dev/null || true)"; then
+    echo "probe sandbox-absent: openshell reports sandbox '${E2E_SANDBOX_NAME}', expected absent" >&2
+    exit 1
+  fi
+fi
+
+echo "probe sandbox-absent: ok"
+exit 0
diff --git a/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh b/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
new file mode 100755
index 0000000000..8f8a697e16
--- /dev/null
+++ b/test/e2e-scenario/nemoclaw_scenarios/probes/sandbox-running.sh
@@ -0,0 +1,30 @@
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Probe: sandbox-running
+#
+# Asserts the sandbox declared by E2E_SANDBOX_NAME (seeded by
+# onboarding) is present in `nemoclaw list`. Mirrors the legacy
+# validation_suites/assert/sandbox-alive.sh::e2e_sandbox_assert_running
+# contract; promoted to a typed phase action so runtime suites cannot
+# silently run against an absent sandbox.
+
+set -euo pipefail
+
+_THIS_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+SANDBOX_HELPER="$(cd "${_THIS_DIR}/../../validation_suites/assert" && pwd)/sandbox-alive.sh"
+
+if [[ ! -f "${SANDBOX_HELPER}" ]]; then
+  echo "probe sandbox-running: legacy helper not found: ${SANDBOX_HELPER}" >&2
+  exit 1
+fi
+
+# shellcheck source=/dev/null
+. "${SANDBOX_HELPER}"
+
+if ! e2e_sandbox_assert_running; then
+  exit 1
+fi
+
+echo "probe sandbox-running: ok"
+exit 0
diff --git a/test/e2e-scenario/scenarios/compiler.ts b/test/e2e-scenario/scenarios/compiler.ts
index 796e8a05fc..d7d7122155 100644
--- a/test/e2e-scenario/scenarios/compiler.ts
+++ b/test/e2e-scenario/scenarios/compiler.ts
@@ -4,6 +4,7 @@
 import fs from "node:fs";
 import path from "node:path";
 import { fileURLToPath } from "node:url";
+import { getExpectedState, probesForState } from "./expected-states.ts";
 import { loadManifest } from "./manifests.ts";
 import { requireScenarios } from "./registry.ts";
 import type {
@@ -18,7 +19,12 @@ import type {
   SutBoundary,
 } from "./types.ts";
 
-const PHASES: PhaseName[] = ["environment", "onboarding", "runtime"];
+// Phase order. state-validation runs after onboarding and before
+// runtime so gateway/sandbox/cli probes gate suite execution: a
+// failed probe is a failed phase action, and the existing runner
+// short-circuit reports runtime as skipped without re-running
+// suite assertions against a missing/wedged environment.
+const PHASES: PhaseName[] = ["environment", "onboarding", "state-validation", "runtime"];
 const REPO_ROOT = path.resolve(path.dirname(fileURLToPath(import.meta.url)), "../../..");
 
 function groupsForPhase(scenario: ScenarioDefinition, phase: PhaseName): AssertionGroup[] {
@@ -83,11 +89,16 @@ function validateManifestCompatibility(scenario: ScenarioDefinition, manifest?:
 // resurrected bash runner.
 const INSTALL_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh";
 const ONBOARD_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh";
+const PROBES_DISPATCH = "test/e2e-scenario/nemoclaw_scenarios/probes/dispatch.sh";
 
 // Default action timeouts. Install and onboarding can take a while on
 // cold runners (Docker pulls, image builds, sandbox bootstrap).
 const INSTALL_TIMEOUT_SECONDS = 900;
 const ONBOARD_TIMEOUT_SECONDS = 900;
+// State-validation probes are cheap (`command -v`, single curl,
+// `nemoclaw list`); a tight timeout keeps a wedged probe from
+// consuming runner budget.
+const PROBE_TIMEOUT_SECONDS = 30;
 
 // Declared parent-env secrets each onboarding profile actually needs.
 // Anything not listed here (and not in the framework allowlist) is
@@ -183,6 +194,39 @@ function phaseActions(phase: PhaseName, scenario: ScenarioDefinition): PhaseActi
       },
     ];
   }
+  if (phase === "state-validation") {
+    // State-validation actions are emitted from the typed expected-state
+    // registry, NOT from the legacy expected-states.yaml. The compiler
+    // stays a pure function over typed inputs; YAML-vs-typed parity is
+    // enforced by a framework test, not by re-reading the YAML at
+    // compile time.
+    if (!scenario.expectedStateId) {
+      // Scenarios without an expected state (older skeleton scenarios)
+      // legitimately have no probes; do not fail-fast.
+      return [];
+    }
+    const state = getExpectedState(scenario.expectedStateId);
+    if (!state) {
+      // The compiler treats an unknown expected_state id as a hard
+      // error: typed scenarios must reference a typed state. The
+      // legacy YAML resolver has its own validation path; this is a
+      // separate (and stricter) contract for the typed runner.
+      throw new Error(
+        `Scenario ${scenario.id} references unknown expected_state '${scenario.expectedStateId}'`,
+      );
+    }
+    return probesForState(state).map((probeId) => ({
+      id: `state-validation.${probeId}`,
+      phase: "state-validation",
+      description: `Probe ${probeId} from expected_state '${state.id}'.`,
+      kind: "shell-fn",
+      scriptRef: PROBES_DISPATCH,
+      fn: "e2e_state_probe",
+      arg: probeId,
+      timeoutSeconds: PROBE_TIMEOUT_SECONDS,
+      evidencePath: `.e2e/actions/state-validation.${probeId}.log`,
+    }));
+  }
   // Runtime phase has no actions; suites are assertion groups.
   return [];
 }
diff --git a/test/e2e-scenario/scenarios/expected-states.ts b/test/e2e-scenario/scenarios/expected-states.ts
new file mode 100644
index 0000000000..539c520f22
--- /dev/null
+++ b/test/e2e-scenario/scenarios/expected-states.ts
@@ -0,0 +1,133 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { ExpectedState, StateProbeId } from "./types.ts";
+
+// Typed mirror of nemoclaw_scenarios/expected-states.yaml.
+//
+// During the transition this registry is the source of truth for the
+// TS runner. expected-states.yaml stays in place for the legacy bash
+// resolver; a framework test verifies the typed registry covers the
+// YAML's expected-state ids and matches their structural shape on the
+// dimensions the typed runner probes today (cli, gateway, sandbox).
+// Inference and credentials remain declared in YAML and in this typed
+// registry, but the compiler skips emitting probe actions for them
+// until the corresponding probe scripts land — see
+// nemoclaw_scenarios/probes/.
+
+const cloudOpenclawReady: ExpectedState = {
+  id: "cloud-openclaw-ready",
+  cli: { installed: true },
+  gateway: { expected: "present", health: "healthy" },
+  sandbox: { expected: "present", status: "running", agent: "openclaw" },
+  inference: { expected: "available", provider: "nvidia" },
+  credentials: { expected: "present" },
+};
+
+const cloudOpenclawCustomPoliciesReady: ExpectedState = {
+  ...cloudOpenclawReady,
+  id: "cloud-openclaw-custom-policies-ready",
+};
+
+const cloudHermesReady: ExpectedState = {
+  id: "cloud-hermes-ready",
+  cli: { installed: true },
+  gateway: { expected: "present", health: "healthy" },
+  sandbox: { expected: "present", status: "running", agent: "hermes" },
+  inference: { expected: "available", provider: "nvidia" },
+  credentials: { expected: "present" },
+};
+
+const localOllamaOpenclawReady: ExpectedState = {
+  id: "local-ollama-openclaw-ready",
+  cli: { installed: true },
+  gateway: { expected: "present", health: "healthy" },
+  sandbox: { expected: "present", status: "running", agent: "openclaw" },
+  inference: { expected: "available", provider: "ollama" },
+  credentials: { expected: "present" },
+};
+
+const macosCliReadyDockerOptional: ExpectedState = {
+  id: "macos-cli-ready-docker-optional",
+  cli: { installed: true },
+  gateway: { expected: "optional", health: "optional" },
+  sandbox: { expected: "optional", status: "optional", agent: "openclaw" },
+  inference: { expected: "optional", provider: "nvidia" },
+  credentials: { expected: "optional" },
+};
+
+const preflightFailureNoSandbox: ExpectedState = {
+  id: "preflight-failure-no-sandbox",
+  cli: { installed: true },
+  gateway: { expected: "absent" },
+  sandbox: { expected: "absent" },
+};
+
+const onboardingFailureInvalidNvidiaKey: ExpectedState = {
+  id: "onboarding-failure-invalid-nvidia-key",
+  cli: { installed: true },
+  gateway: { expected: "absent" },
+  sandbox: { expected: "absent" },
+};
+
+const onboardingFailureGatewayPortConflict: ExpectedState = {
+  id: "onboarding-failure-gateway-port-conflict",
+  cli: { installed: true },
+  gateway: { expected: "absent" },
+  sandbox: { expected: "absent" },
+};
+
+const REGISTRY: readonly ExpectedState[] = [
+  cloudOpenclawReady,
+  cloudOpenclawCustomPoliciesReady,
+  cloudHermesReady,
+  localOllamaOpenclawReady,
+  macosCliReadyDockerOptional,
+  preflightFailureNoSandbox,
+  onboardingFailureInvalidNvidiaKey,
+  onboardingFailureGatewayPortConflict,
+];
+
+const BY_ID: ReadonlyMap<string, ExpectedState> = new Map(REGISTRY.map((state) => [state.id, state]));
+
+export function listExpectedStates(): readonly ExpectedState[] {
+  return REGISTRY;
+}
+
+export function getExpectedState(id: string): ExpectedState | undefined {
+  return BY_ID.get(id);
+}
+
+export function requireExpectedState(id: string): ExpectedState {
+  const state = BY_ID.get(id);
+  if (!state) {
+    const available = Array.from(BY_ID.keys()).join(", ");
+    throw new Error(`Unknown expected_state id '${id}' (available: ${available})`);
+  }
+  return state;
+}
+
+// Translate the typed expected-state contract into the concrete probe
+// ids the compiler emits as state-validation actions. Inference and
+// credentials probes are intentionally omitted today (probe scripts
+// not yet implemented); their declarations remain in ExpectedState so
+// the contract is visible in plan output and a future change can
+// switch on emission without touching scenario data. "optional"
+// dimensions emit no probe actions.
+export function probesForState(state: ExpectedState): readonly StateProbeId[] {
+  const probes: StateProbeId[] = [];
+  if (state.cli?.installed === true) {
+    probes.push("cli-installed");
+  }
+  if (state.gateway?.expected === "present" && state.gateway.health === "healthy") {
+    probes.push("gateway-healthy");
+  } else if (state.gateway?.expected === "absent") {
+    probes.push("gateway-absent");
+  }
+  if (state.sandbox?.expected === "present" && state.sandbox.status === "running") {
+    probes.push("sandbox-running");
+  } else if (state.sandbox?.expected === "absent") {
+    probes.push("sandbox-absent");
+  }
+  return probes;
+}
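Reviewer sketch, not part of the patch: a worked example of the state-to-probe mapping for a ready state, a negative state, and an all-optional state. The `ExpectedStateSketch` type is a cut-down stand-in (the real `ExpectedState` in ./types.ts is not shown here and may differ), and `probesForStateSketch` simply restates the branching of `probesForState` so the example is self-contained.

```typescript
// Stand-in types; assumptions for illustration only.
type ExpectationSketch = "present" | "absent" | "optional";

interface ExpectedStateSketch {
  id: string;
  cli?: { installed: boolean };
  gateway?: { expected: ExpectationSketch; health?: string };
  sandbox?: { expected: ExpectationSketch; status?: string };
}

// Restates the branching of probesForState on the three probed dimensions.
function probesForStateSketch(state: ExpectedStateSketch): string[] {
  const probes: string[] = [];
  if (state.cli?.installed === true) {
    probes.push("cli-installed");
  }
  if (state.gateway?.expected === "present" && state.gateway.health === "healthy") {
    probes.push("gateway-healthy");
  } else if (state.gateway?.expected === "absent") {
    probes.push("gateway-absent");
  }
  if (state.sandbox?.expected === "present" && state.sandbox.status === "running") {
    probes.push("sandbox-running");
  } else if (state.sandbox?.expected === "absent") {
    probes.push("sandbox-absent");
  }
  return probes;
}

const readyProbes = probesForStateSketch({
  id: "cloud-openclaw-ready",
  cli: { installed: true },
  gateway: { expected: "present", health: "healthy" },
  sandbox: { expected: "present", status: "running" },
});

const negativeProbes = probesForStateSketch({
  id: "preflight-failure-no-sandbox",
  cli: { installed: true },
  gateway: { expected: "absent" },
  sandbox: { expected: "absent" },
});

const optionalProbes = probesForStateSketch({
  id: "macos-cli-ready-docker-optional",
  cli: { installed: true },
  gateway: { expected: "optional", health: "optional" },
  sandbox: { expected: "optional", status: "optional" },
});

console.log(readyProbes, negativeProbes, optionalProbes);
```

The negative state yields the same three ids the `ubuntu-no-docker-preflight-negative` test expects (with the `state-validation.` prefix added by the compiler), and the optional state reduces to the CLI probe alone.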
diff --git a/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts b/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
index 30eb47d1af..dbbe2b0956 100644
--- a/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/negative-matcher.ts
@@ -60,8 +60,24 @@ export interface NegativeContractResult {
 // declared in assertions/registry.ts. The matcher excludes failures of
 // that step from "observed failure" detection so the contract evaluation
 // is not confused by its own enforcement scaffolding.
+//
+// As of the state-validation phase landing, forbidden side effects are
+// observed by the typed gateway-absent / sandbox-absent probes during
+// the state-validation phase, not by this pending step. The exclusion
+// is kept to stay correct for any scenario that still references the
+// legacy step id.
 const SIDE_EFFECT_PROBE_STEP_ID = "runtime.expected-failure.no-side-effects";
 
+// State-validation probe ids the matcher must skip when scanning for
+// observed failures. For a negative scenario, these probes are real
+// post-failure checks (gateway-absent, sandbox-absent) — their pass/fail
+// status does NOT determine which phase advertised the original failure
+// mode, only whether forbidden side effects occurred.
+const STATE_VALIDATION_FORBIDDEN_PROBE_IDS: ReadonlySet<string> = new Set([
+  "state-validation.gateway-absent",
+  "state-validation.sandbox-absent",
+]);
+
 // Map the user-facing expected failure phase to the internal phase
 // orchestrator that owns it. Today preflight assertions live under
 // onboarding (see assertions/registry.ts: onboarding.preflight.*).
@@ -73,7 +89,12 @@ function resolveExpectedPhase(phase: ExpectedFailurePhase): PhaseName {
 }
 
 function isOwnPhaseResult(phase: PhaseResult["phase"]): phase is PhaseName {
-  return phase === "environment" || phase === "onboarding" || phase === "runtime";
+  return (
+    phase === "environment" ||
+    phase === "onboarding" ||
+    phase === "state-validation" ||
+    phase === "runtime"
+  );
 }
 
 function findFirstObservedFailure(results: readonly PhaseResult[]): NegativeContractObservation | undefined {
@@ -81,7 +102,16 @@ function findFirstObservedFailure(results: readonly PhaseResult[]): NegativeCont
     if (!isOwnPhaseResult(result.phase)) {
       continue;
     }
-    const failedAction = result.actions.find((action) => action.status === "failed");
+    // state-validation forbidden-side-effect probes (gateway-absent,
+    // sandbox-absent) are post-failure verification, not the failure
+    // mode itself; skip them when locating the originating failure.
+    // A failed cli-installed probe IS a real observed failure (the
+    // install action passed but the binary isn't reachable) and is
+    // not skipped.
+    const failedAction = result.actions.find(
+      (action) =>
+        action.status === "failed" && !STATE_VALIDATION_FORBIDDEN_PROBE_IDS.has(action.id),
+    );
     if (failedAction) {
       return {
         failedPhase: result.phase,
diff --git a/test/e2e-scenario/scenarios/orchestrators/runner.ts b/test/e2e-scenario/scenarios/orchestrators/runner.ts
index fe429fc0c3..c725997e8d 100644
--- a/test/e2e-scenario/scenarios/orchestrators/runner.ts
+++ b/test/e2e-scenario/scenarios/orchestrators/runner.ts
@@ -10,6 +10,7 @@ import { EnvironmentOrchestrator } from "./environment.ts";
 import { evaluateNegativeContract, negativeContractPhaseResult } from "./negative-matcher.ts";
 import { OnboardingOrchestrator } from "./onboarding.ts";
 import { RuntimeOrchestrator } from "./runtime.ts";
+import { StateValidationOrchestrator } from "./state-validation.ts";
 
 interface PhaseRunner {
   run(ctx: RunContext, phase: RunPlanPhase, priorResults?: PhaseResult[]): Promise<PhaseResult>;
@@ -18,17 +19,20 @@ interface PhaseRunner {
 export interface ScenarioRunnerDeps {
   environment?: PhaseRunner;
   onboarding?: PhaseRunner;
+  stateValidation?: PhaseRunner;
   runtime?: PhaseRunner;
 }
 
 export class ScenarioRunner {
   private readonly environment: PhaseRunner;
   private readonly onboarding: PhaseRunner;
+  private readonly stateValidation: PhaseRunner;
   private readonly runtime: PhaseRunner;
 
   constructor(deps: ScenarioRunnerDeps = {}) {
     this.environment = deps.environment ?? new EnvironmentOrchestrator();
     this.onboarding = deps.onboarding ?? new OnboardingOrchestrator();
+    this.stateValidation = deps.stateValidation ?? new StateValidationOrchestrator();
     this.runtime = deps.runtime ?? new RuntimeOrchestrator();
   }
 
@@ -41,7 +45,7 @@ export class ScenarioRunner {
 
     const results: PhaseResult[] = [];
     for (const phase of plan.phases) {
-      const blocked = blockingPriorResult(results);
+      const blocked = phaseBlockedBy(phase.name, results);
       if (blocked) {
         // Cross-phase short-circuit: the previous phase's setup work
         // failed, so this phase cannot meaningfully run. Synthesize a
@@ -86,13 +90,14 @@ export class ScenarioRunner {
   private orchestratorFor(name: RunPlanPhase["name"]): PhaseRunner {
     if (name === "environment") return this.environment;
     if (name === "onboarding") return this.onboarding;
+    if (name === "state-validation") return this.stateValidation;
     if (name === "runtime") return this.runtime;
     throw new Error(`Unsupported phase: ${String(name)}`);
   }
 }
 
 interface BlockingFailure {
-  phase: "environment" | "onboarding" | "runtime";
+  phase: "environment" | "onboarding" | "state-validation" | "runtime";
   action: PhaseActionResult;
 }
 
@@ -117,13 +122,42 @@ function writeNegativeContractArtifact(
   }
 }
 
-function blockingPriorResult(results: PhaseResult[]): BlockingFailure | undefined {
+// state-validation is the typed diagnostic layer between onboarding
+// and runtime. It probes gateway/sandbox/cli post-conditions and is
+// the phase that proves a negative scenario's forbidden side effects
+// did not occur (gateway-absent, sandbox-absent). For state-validation
+// to do its job after a deliberate onboarding failure (negative
+// scenarios), an onboarding failure must NOT block it. Only an
+// environment-phase failure (install never ran) skips state-validation.
+// Runtime stays blocked by any prior phase-action failure, including
+// state-validation, so suites never run against a missing or wedged
+// environment.
+function phaseBlockedBy(
+  phase: "environment" | "onboarding" | "state-validation" | "runtime",
+  results: PhaseResult[],
+): BlockingFailure | undefined {
+  const firstFailure = firstBlockingActionFailure(results);
+  if (!firstFailure) {
+    return undefined;
+  }
+  if (phase === "state-validation" && firstFailure.phase !== "environment") {
+    return undefined;
+  }
+  return firstFailure;
+}
+
+function firstBlockingActionFailure(results: PhaseResult[]): BlockingFailure | undefined {
   // A phase action failure (real setup work didn't succeed) blocks
   // downstream phases. Assertion failures do NOT block downstream
   // phases - they are expected to be reported alongside other phase
   // results so reviewers can see all failure layers at once.
   for (const result of results) {
-    if (result.phase !== "environment" && result.phase !== "onboarding" && result.phase !== "runtime") {
+    if (
+      result.phase !== "environment" &&
+      result.phase !== "onboarding" &&
+      result.phase !== "state-validation" &&
+      result.phase !== "runtime"
+    ) {
       continue;
     }
     const failedAction = result.actions.find((action) => action.status === "failed");
diff --git a/test/e2e-scenario/scenarios/orchestrators/state-validation.ts b/test/e2e-scenario/scenarios/orchestrators/state-validation.ts
new file mode 100644
index 0000000000..567d49b3a6
--- /dev/null
+++ b/test/e2e-scenario/scenarios/orchestrators/state-validation.ts
@@ -0,0 +1,24 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { PhaseOrchestrator } from "./phase.ts";
+
+// Typed replacement for the inline gateway/sandbox checks the legacy
+// bash runner ran between onboarding and suite execution
+// (e2e_gateway_assert_healthy / e2e_sandbox_assert_running) AND the
+// post-failure side-effect checks for negative scenarios
+// (`openshell sandbox list | grep -Fq ...`). The orchestrator inserts
+// itself between onboarding and runtime; its phase actions are real
+// probes (typed PhaseAction shell-fn entries the compiler emits from
+// scenario.expectedStateId via the typed expected-state registry).
+//
+// Failure semantics: a probe action failure is just a phase-action
+// failure, so the existing ScenarioRunner short-circuit logic kicks
+// in and the runtime phase is reported as skipped. No new control
+// flow is added; this orchestrator is only here to give the phase a
+// dedicated identity in PhaseResult artifacts and in tests.
+export class StateValidationOrchestrator extends PhaseOrchestrator {
+  constructor() {
+    super("state-validation");
+  }
+}
diff --git a/test/e2e-scenario/scenarios/types.ts b/test/e2e-scenario/scenarios/types.ts
index 85a406c6d8..f83091e9ef 100644
--- a/test/e2e-scenario/scenarios/types.ts
+++ b/test/e2e-scenario/scenarios/types.ts
@@ -1,7 +1,7 @@
 // SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 // SPDX-License-Identifier: Apache-2.0
 
-export type PhaseName = "environment" | "onboarding" | "runtime";
+export type PhaseName = "environment" | "onboarding" | "state-validation" | "runtime";
 
 // Synthetic phase appended by the scenario runner when a scenario
 // declares plan.expectedFailure. Distinct from PhaseName so a scenario
@@ -11,10 +11,26 @@ export type NegativeContractPhase = "negative-contract";
 
 export type PhaseResultName = PhaseName | NegativeContractPhase;
 
+// Concrete probe ids the state-validation orchestrator runs as phase
+// actions. Each id maps to a probe script under
+// nemoclaw_scenarios/probes/. The inference and credentials dimensions
+// of ExpectedState have no probe ids yet; the compiler skips emitting
+// actions for them until the probe scripts land.
+export type StateProbeId =
+  | "cli-installed"
+  | "gateway-healthy"
+  | "gateway-absent"
+  | "sandbox-running"
+  | "sandbox-absent";
+
 // User-facing phase the negative-scenario contract advertises. Wider
 // than PhaseName because manifests may declare "preflight" failures,
 // which the matcher resolves to the onboarding phase orchestrator.
-export type ExpectedFailurePhase = PhaseName | "preflight";
+// state-validation is intentionally omitted: it is an internal phase
+// the framework inserts after onboarding; scenarios cannot declare
+// expected failures against it (those are expressed via
+// expectedStateId + the absent/forbidden-side-effect probes).
+export type ExpectedFailurePhase = "environment" | "onboarding" | "runtime" | "preflight";
 
 export interface ExpectedFailureContract {
   phase: ExpectedFailurePhase;
@@ -22,6 +38,44 @@ export interface ExpectedFailureContract {
   forbiddenSideEffects?: readonly string[];
 }
 
+// Expected-state contract. Mirrors the structural shape of
+// nemoclaw_scenarios/expected-states.yaml so the typed registry can
+// remain a verifiable mirror of the legacy YAML during transition.
+// Each dimension's `expected` field declares whether that aspect of
+// the post-setup environment should be present, absent, or optional.
+// Optional dimensions emit no probe actions; present/absent dimensions
+// emit a real probe that gates the runtime phase.
+//
+// Spec ownership: the typed registry (scenarios/expected-states.ts) is
+// the source of truth for the TS runner; expected-states.yaml stays
+// alongside until the legacy resolver is fully retired, with a contract
+// test that the typed registry mirrors the YAML.
+export type ExpectedPresence = "present" | "absent" | "optional";
+export type ExpectedHealth = "healthy" | "absent" | "optional";
+export type ExpectedSandboxStatus = "running" | "absent" | "optional";
+export type ExpectedInferenceAvail = "available" | "absent" | "optional";
+
+export interface ExpectedState {
+  id: string;
+  cli?: { installed?: boolean };
+  gateway?: {
+    expected: ExpectedPresence;
+    health?: ExpectedHealth;
+  };
+  sandbox?: {
+    expected: ExpectedPresence;
+    status?: ExpectedSandboxStatus;
+    agent?: string;
+  };
+  inference?: {
+    expected: ExpectedInferenceAvail;
+    provider?: string;
+  };
+  credentials?: {
+    expected: ExpectedPresence;
+  };
+}
+
 export type TransientClassifier =
   | "empty-event-capture"
   | "provider-transient"

From f61bcd2ad00ef17069b9e0fe8c05f8299c691002 Mon Sep 17 00:00:00 2001
From: Julie Yaunches <jyaunches@nvidia.com>
Date: Thu, 28 May 2026 09:33:52 -0400
Subject: [PATCH 23/23] test(e2e): land security probes (shields, network
 policy, injection)

Complete the probe registry on PR #4380 by registering the three
remaining `required: true` probes from scenarios/assertions/registry.ts.
With these landed, the security suites (shields-config,
security-policy, security-injection) flip from 'silently skipped' to
'actually verified' end-to-end, matching the legacy contracts in
test/e2e/test-shields-config.sh, test-network-policy.sh, and
test-credential-sanitization.sh.

Architecture follows the diagnostics + docs probes already in place:

- Each probe is a TS-native ProbeFn that writes a structured JSON
  evidence document at ProbeContext.evidencePath, returns a typed
  ProbeOutcome, and never throws past the registry boundary.

- A new shared util module (probes/util.ts) gives every probe two
  primitives:
    runSandboxCmd: shells through the canonical
                   validation_suites/sandbox-exec.sh wrapper so probes
                   inherit the ssh-config preferred / openshell-exec
                   fallback transport, the per-call timeout, and the
                   classified diagnostic on hang. Probes never bypass
                   the wrapper.
    runHostCmd:   spawns host CLIs (nemoclaw / openshell) directly,
                  with structured stdout/stderr tail and timeout.
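
As a rough illustration of the runHostCmd contract described above
(spawn a host CLI, bound it with a timeout, keep only a tail of each
stream), here is a minimal synchronous sketch. The names
runHostCmdSketch, HostCmdResult, and tail are hypothetical and are not
taken from probes/util.ts; the real helper may well be async and may
expose a different result shape.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical result shape, for illustration only.
interface HostCmdResult {
  exitCode: number | null; // null when the process was killed or never started
  timedOut: boolean;
  stdoutTail: string;
  stderrTail: string;
}

// Keep only the last maxChars characters so evidence documents stay small.
function tail(text: string, maxChars = 2000): string {
  return text.length > maxChars ? text.slice(-maxChars) : text;
}

function runHostCmdSketch(
  cmd: string,
  args: readonly string[],
  timeoutMs: number,
): HostCmdResult {
  // spawnSync kills the child when the timeout elapses and reports
  // the condition through result.error (code ETIMEDOUT).
  const result = spawnSync(cmd, [...args], { encoding: "utf8", timeout: timeoutMs });
  const err = result.error as NodeJS.ErrnoException | undefined;
  return {
    exitCode: result.status,
    timedOut: err?.code === "ETIMEDOUT",
    stdoutTail: tail(result.stdout ?? ""),
    // On spawn failure (for example ENOENT) there is no stderr stream,
    // so fall back to the error text instead of throwing.
    stderrTail: tail(result.stderr ?? (err ? String(err) : "")),
  };
}
```

Returning a result object for every outcome, including spawn failure,
is one way to honor the "never throws past the registry boundary" rule.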

Probes:

- shieldsConfigProbe (security.shields.config): mirrors
  spc_assert_shields_config_consistent. Parses
  'nemoclaw <sandbox> shields status' for UP/DOWN/NOT CONFIGURED;
  if the scenario context declares E2E_SHIELDS_EXPECTED_STATE,
  asserts observed === expected; then verifies that the in-sandbox
  config file's ownership and mode match the observed state (UP ->
  root:root with mode 4xx; otherwise sandbox:sandbox). Config path
  keys off E2E_AGENT (openclaw vs hermes).

- networkPolicyProbe (security.policy.enforced): mirrors TC-NET-01
  from test-network-policy.sh. From inside the sandbox, curl a
  non-whitelisted URL (default https://example.com/, override via
  E2E_NETWORK_POLICY_BLOCKED_URL). Pass on HTTP 403 or any other
  4xx/5xx except 401, OR on curl exit != 0 with no HTTP response.
  Fail on HTTP 2xx/3xx OR HTTP 401 (401 means the request reached
  upstream auth, so the gateway did NOT block the egress). This
  distinguishes 'gateway rejected' from 'request reached an
  upstream auth wall' so a policy bypass cannot show as green.

- injectionBlockedProbe (security.injection.blocked): mirrors
  spc_assert_telegram_payload_not_shell_executed. Pre-cleans a
  unique marker file in the sandbox, sends a
  '$(touch <marker> && echo INJECTED)' payload via stdin to a
  remote sh that should echo it back as data, then asserts both:
  the payload survives literally (proves no host-side shell
  evaluation) AND the marker file is absent (proves no sandbox-side
  shell evaluation). Either invariant violation fails closed.
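
The networkPolicyProbe pass/fail rules above can be sketched as a pure
classifier. This is an illustration under stated assumptions, not the
code in probes/network-policy.ts: classifyEgress and Verdict are
hypothetical names, the reason strings are invented, and two cases the
rules leave open (no HTTP response with curl exit 0, and 1xx codes) are
assumed here to fail closed.

```typescript
// Hypothetical verdict shape, for illustration only.
type Verdict = { status: "passed" | "failed"; reason: string };

// httpCode is the text curl prints for -w '%{http_code}' ("000" when
// no HTTP response was received); curlExit is curl's exit code.
function classifyEgress(httpCode: string, curlExit: number): Verdict {
  const code = Number.parseInt(httpCode, 10);
  const gotResponse = Number.isInteger(code) && code >= 100;

  if (!gotResponse) {
    // No HTTP response: blocked only if curl itself reported a failure.
    // Assumption: exit 0 with no response is treated as a failure.
    return curlExit !== 0
      ? { status: "passed", reason: `curl exit ${curlExit}, no HTTP response` }
      : { status: "failed", reason: "no HTTP response but curl exited 0" };
  }
  if (code === 401) {
    // The request reached an upstream auth wall, so egress was not blocked.
    return { status: "failed", reason: "http_code=401 (reached upstream auth)" };
  }
  if (code >= 400 && code <= 599) {
    return { status: "passed", reason: `http_code=${code} (rejected)` };
  }
  // 2xx/3xx, and by assumption 1xx, mean the URL was reachable.
  return { status: "failed", reason: `http_code=${code} (reachable)` };
}
```

Checking 401 before the general 4xx/5xx range is what keeps a policy
bypass from being reported as a pass.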

Tests (framework-tests/e2e-probes.test.ts, 23 cases x project
fan-out = 46):

- shieldsConfigProbe: passes on matching state+perms; fails on
  state mismatch; fails on perms mismatch.
- networkPolicyProbe: passes on 403; passes on curl-exit-7 with
  status 000; fails on 200; fails on 401 (policy bypass).
- injectionBlockedProbe: passes when payload preserved and marker
  absent; fails when marker file is reported present.
- Registry: register/lookup/list/idempotent registration; security
  probes are now registered (was the placeholder 'NOT yet
  registered' assertion).

All probes use runSandboxCmd which honors the canonical wrapper -
no probe bypasses the transport choice, no probe re-implements ssh
logic. Stub openshell scripts in the tests force the wrapper's
openshell-exec fallback path so the tests run reproducibly without
a real sandbox.
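
The stub-on-PATH pattern the tests repeat inline could be factored
roughly as below. This is a sketch only: withStubOnPath is a
hypothetical helper that does not exist in e2e-probes.test.ts, and it
takes a synchronous callback, whereas the real tests await async
probes and would need an async variant that awaits before restoring.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: write an executable stub, put its directory
// first on PATH, run the callback, then restore PATH and clean up.
function withStubOnPath<T>(
  name: string,
  script: string,
  run: (binDir: string) => T,
): T {
  const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "stub-bin-"));
  const binDir = path.join(tmp, "bin");
  fs.mkdirSync(binDir);
  fs.writeFileSync(path.join(binDir, name), script, { mode: 0o755 });

  const oldPath = process.env.PATH;
  process.env.PATH = `${binDir}${path.delimiter}${oldPath ?? ""}`;
  try {
    return run(binDir);
  } finally {
    // Assigning undefined to process.env.PATH would store the string
    // "undefined", so delete the key when PATH was originally unset.
    if (oldPath === undefined) {
      delete process.env.PATH;
    } else {
      process.env.PATH = oldPath;
    }
    fs.rmSync(tmp, { recursive: true, force: true });
  }
}
```

A stub that exits 1 for 'sandbox ssh-config' would be passed as the
script argument to force the wrapper onto its openshell-exec fallback.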

436 framework tests pass (was 386; +50 new test instances).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
---
 .../framework-tests/e2e-probes.test.ts        | 383 +++++++++++++++++-
 test/e2e-scenario/scenarios/probes/builtin.ts |  20 +-
 .../scenarios/probes/injection-blocked.ts     | 155 +++++++
 .../scenarios/probes/network-policy.ts        | 125 ++++++
 .../scenarios/probes/shields-config.ts        | 196 +++++++++
 test/e2e-scenario/scenarios/probes/util.ts    | 235 +++++++++++
 6 files changed, 1096 insertions(+), 18 deletions(-)
 create mode 100644 test/e2e-scenario/scenarios/probes/injection-blocked.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/network-policy.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/shields-config.ts
 create mode 100644 test/e2e-scenario/scenarios/probes/util.ts

diff --git a/test/e2e-scenario/framework-tests/e2e-probes.test.ts b/test/e2e-scenario/framework-tests/e2e-probes.test.ts
index 6063eedc1d..db90b47798 100644
--- a/test/e2e-scenario/framework-tests/e2e-probes.test.ts
+++ b/test/e2e-scenario/framework-tests/e2e-probes.test.ts
@@ -71,17 +71,17 @@ describe("probe registry", () => {
     expect(listRegisteredProbes()).toEqual(first);
   });
 
-  it("registerBuiltinProbes_does_NOT_register_security_probes_yet", () => {
-    // The shieldsConfig / networkPolicy / injectionBlocked probes
-    // are intentionally not registered yet \u2014 their `required: true`
-    // status in scenarios/assertions/registry.ts means the
-    // orchestrator fails closed when they're missing, which is the
-    // contract we want until real implementations land.
+  it("registerBuiltinProbes_registers_security_probes", () => {
+    // shieldsConfig / networkPolicy / injectionBlocked are marked
+    // `required: true` in scenarios/assertions/registry.ts. The
+    // orchestrator fails closed when a required probe is missing,
+    // so registering all three turns the security suites from
+    // 'silently skipped' into 'actually verified'.
     registerBuiltinProbes();
     const registered = listRegisteredProbes();
-    expect(registered).not.toContain("shieldsConfigProbe");
-    expect(registered).not.toContain("networkPolicyProbe");
-    expect(registered).not.toContain("injectionBlockedProbe");
+    expect(registered).toContain("shieldsConfigProbe");
+    expect(registered).toContain("networkPolicyProbe");
+    expect(registered).toContain("injectionBlockedProbe");
   });
 });
 
@@ -303,3 +303,368 @@ esac
     }
   });
 });
+
+// ──────────────────────────────────────────────────────────────────────────
+// Security probes — stub `nemoclaw` (host CLI) and `openshell` so the
+// canonical sandbox-exec wrapper resolves through the stub. The
+// wrapper's openshell-fallback path is exercised because the stub
+// does not implement `sandbox ssh-config`.
+// ──────────────────────────────────────────────────────────────────────────
+
+function makeProbeCtxFor(
+  tmp: string,
+  sandboxName: string,
+  contextEnv: Record<string, string> = {},
+): ProbeContext {
+  // Write context.env so spawned bash scripts that source the
+  // wrapper can pick up E2E_SANDBOX_NAME if needed.
+  const lines = Object.entries({ E2E_SANDBOX_NAME: sandboxName, ...contextEnv })
+    .map(([k, v]) => `${k}=${v}`)
+    .join("\n");
+  fs.writeFileSync(path.join(tmp, "context.env"), lines + "\n");
+  return {
+    contextDir: tmp,
+    evidencePath: path.join(tmp, "probe-evidence.json"),
+    contextEnv: { E2E_SANDBOX_NAME: sandboxName, ...contextEnv },
+    sandboxName,
+    gatewayUrl: null,
+    repoRoot: REPO_ROOT,
+  };
+}
+
+describe("shieldsConfigProbe", () => {
+  it("passes_when_shields_status_matches_expected_and_perms_match_state", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "shields-probe-pass-"));
+    const fakeBin = path.join(tmp, "bin");
+    fs.mkdirSync(fakeBin);
+    fs.writeFileSync(
+      path.join(fakeBin, "nemoclaw"),
+      `#!/usr/bin/env bash
+# nemoclaw <sandbox> shields status
+if [[ "$2" == "shields" && "$3" == "status" ]]; then
+  echo "Shields: DOWN"
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    fs.writeFileSync(
+      path.join(fakeBin, "openshell"),
+      `#!/usr/bin/env bash
+# Stub openshell. Reject ssh-config so wrapper falls back to sandbox exec.
+# Then implement 'sandbox exec --name <sb> -- <cmd>' by stripping args
+# until '--' and running what's left.
+if [[ "$1" == "sandbox" && "$2" == "ssh-config" ]]; then
+  exit 1
+fi
+if [[ "$1" == "sandbox" && "$2" == "exec" ]]; then
+  shift 2
+  while [[ "$#" -gt 0 && "$1" != "--" ]]; do shift; done
+  shift || true
+  # The 'stat -c %a %U:%G <path>' invocation: emit a fake permissions
+  # line that matches a DOWN-state sandbox config (sandbox-owned).
+  if [[ "$1" == "stat" ]]; then
+    echo "644 sandbox:sandbox"
+    exit 0
+  fi
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${fakeBin}:${oldPath ?? ""}`;
+    try {
+      const { shieldsConfigProbe } = await import("../scenarios/probes/shields-config.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1", {
+        E2E_AGENT: "openclaw",
+        E2E_SHIELDS_EXPECTED_STATE: "down",
+      });
+      const outcome = await shieldsConfigProbe(ctx);
+      expect(outcome.status).toBe("passed");
+      expect(outcome.message).toMatch(/shields=down/);
+      const ev = JSON.parse(fs.readFileSync(ctx.evidencePath, "utf8"));
+      expect(ev.observed).toBe("down");
+      expect(ev.expected).toBe("down");
+      expect(ev.permissionsLine).toBe("644 sandbox:sandbox");
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_observed_state_disagrees_with_expected", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "shields-probe-mismatch-"));
+    const fakeBin = path.join(tmp, "bin");
+    fs.mkdirSync(fakeBin);
+    fs.writeFileSync(
+      path.join(fakeBin, "nemoclaw"),
+      `#!/usr/bin/env bash
+if [[ "$2" == "shields" && "$3" == "status" ]]; then
+  echo "Shields: UP"
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${fakeBin}:${oldPath ?? ""}`;
+    try {
+      const { shieldsConfigProbe } = await import("../scenarios/probes/shields-config.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1", {
+        E2E_AGENT: "openclaw",
+        E2E_SHIELDS_EXPECTED_STATE: "down",
+      });
+      const outcome = await shieldsConfigProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/expected shields 'down', observed 'up'/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_perms_dont_match_observed_state", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "shields-probe-perms-"));
+    const fakeBin = path.join(tmp, "bin");
+    fs.mkdirSync(fakeBin);
+    fs.writeFileSync(
+      path.join(fakeBin, "nemoclaw"),
+      `#!/usr/bin/env bash
+if [[ "$2" == "shields" && "$3" == "status" ]]; then
+  # Shields claim UP, but the stub openshell will report sandbox-owned
+  # perms below — a mismatch the probe must catch.
+  echo "Shields: UP"
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    fs.writeFileSync(
+      path.join(fakeBin, "openshell"),
+      `#!/usr/bin/env bash
+if [[ "$1" == "sandbox" && "$2" == "ssh-config" ]]; then exit 1; fi
+if [[ "$1" == "sandbox" && "$2" == "exec" ]]; then
+  shift 2
+  while [[ "$#" -gt 0 && "$1" != "--" ]]; do shift; done
+  shift || true
+  # Sandbox-owned perms: would pass for DOWN, must FAIL for UP.
+  echo "644 sandbox:sandbox"
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${fakeBin}:${oldPath ?? ""}`;
+    try {
+      const { shieldsConfigProbe } = await import("../scenarios/probes/shields-config.ts");
+      // Don't declare expected state — the probe should still fail on
+      // perms-vs-observed mismatch alone.
+      const ctx = makeProbeCtxFor(tmp, "sb1", { E2E_AGENT: "openclaw" });
+      const outcome = await shieldsConfigProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/shields are 'up' but .* permissions are/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("networkPolicyProbe", () => {
+  function fakeOpenshellEmittingHttpStatus(
+    binDir: string,
+    httpStatus: string,
+    curlExitCode: number = 0,
+  ): void {
+    fs.mkdirSync(binDir, { recursive: true });
+    fs.writeFileSync(
+      path.join(binDir, "openshell"),
+      `#!/usr/bin/env bash
+# Opt out of ssh-config; force wrapper to use 'sandbox exec' fallback.
+if [[ "$1" == "sandbox" && "$2" == "ssh-config" ]]; then exit 1; fi
+if [[ "$1" == "sandbox" && "$2" == "exec" ]]; then
+  shift 2
+  while [[ "$#" -gt 0 && "$1" != "--" ]]; do shift; done
+  shift || true
+  # We're being asked to run curl inside the sandbox. Emit the test's
+  # chosen status to stdout (mirrors curl -w '%{http_code}') and exit
+  # with the test's chosen curl exit code.
+  printf '%s' "${httpStatus}"
+  exit ${curlExitCode}
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+  }
+
+  it("passes_when_blocked_url_returns_403", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "netpolicy-probe-403-"));
+    fakeOpenshellEmittingHttpStatus(path.join(tmp, "bin"), "403", 0);
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${path.join(tmp, "bin")}:${oldPath ?? ""}`;
+    try {
+      const { networkPolicyProbe } = await import("../scenarios/probes/network-policy.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await networkPolicyProbe(ctx);
+      expect(outcome.status).toBe("passed");
+      expect(outcome.message).toMatch(/blocked .*http_code=403/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("passes_when_curl_exits_nonzero_and_no_http_response", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "netpolicy-probe-conn-"));
+    // curl exit 7 = couldn't connect; status '000' = no HTTP response.
+    fakeOpenshellEmittingHttpStatus(path.join(tmp, "bin"), "000", 7);
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${path.join(tmp, "bin")}:${oldPath ?? ""}`;
+    try {
+      const { networkPolicyProbe } = await import("../scenarios/probes/network-policy.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await networkPolicyProbe(ctx);
+      expect(outcome.status).toBe("passed");
+      expect(outcome.message).toMatch(/curl exit 7/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_blocked_url_returns_200", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "netpolicy-probe-200-"));
+    fakeOpenshellEmittingHttpStatus(path.join(tmp, "bin"), "200", 0);
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${path.join(tmp, "bin")}:${oldPath ?? ""}`;
+    try {
+      const { networkPolicyProbe } = await import("../scenarios/probes/network-policy.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await networkPolicyProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/reachable from sandbox.*http_code=200/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_blocked_url_returns_401_indicating_policy_bypass", async () => {
+    // 401 means the request reached upstream auth, NOT that gateway
+    // dropped it. The probe must classify this as a policy bypass.
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "netpolicy-probe-401-"));
+    fakeOpenshellEmittingHttpStatus(path.join(tmp, "bin"), "401", 0);
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${path.join(tmp, "bin")}:${oldPath ?? ""}`;
+    try {
+      const { networkPolicyProbe } = await import("../scenarios/probes/network-policy.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await networkPolicyProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/reachable from sandbox.*http_code=401/);
+    } finally {
+      process.env.PATH = oldPath;
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+});
+
+describe("injectionBlockedProbe", () => {
+  // For the injection probe we need a stub openshell that simulates a
+  // sandbox shell honestly: pre-clean, echo back stdin, and respond
+  // SAFE/EXPLOITED based on whether the marker file exists. We give
+  // each test its own tmp dir and stub script.
+  function setupInjectionStub(tmp: string, exploited: boolean): { restore: () => void } {
+    const binDir = path.join(tmp, "bin");
+    fs.mkdirSync(binDir, { recursive: true });
+    // Use a state file in tmp to track whether the 'exploit' branch
+    // should claim the marker exists.
+    const stateFile = path.join(tmp, "exploit.state");
+    fs.writeFileSync(stateFile, exploited ? "yes" : "no");
+    fs.writeFileSync(
+      path.join(binDir, "openshell"),
+      `#!/usr/bin/env bash
+state=\$(cat "${stateFile}" 2>/dev/null || echo no)
+if [[ "\$1" == "sandbox" && "\$2" == "ssh-config" ]]; then exit 1; fi
+if [[ "\$1" == "sandbox" && "\$2" == "exec" ]]; then
+  shift 2
+  while [[ "\$#" -gt 0 && "\$1" != "--" ]]; do shift; done
+  shift || true
+  # Recognize the three operations the probe issues:
+  #   1. sh -c 'rm -f <marker>'              — cleanup; always succeeds
+  #   2. sh -c 'MSG=\$(cat); printf %s\\n "\$MSG"'  — echo back stdin
+  #   3. sh -c 'test -f <marker> && echo EXPLOITED || echo SAFE'
+  cmd="\$*"
+  case "\$cmd" in
+    *"MSG="*"printf"*)
+      cat
+      ;;
+    *"test -f"*"EXPLOITED"*"SAFE"*)
+      if [[ "\$state" == "yes" ]]; then echo EXPLOITED; else echo SAFE; fi
+      ;;
+    *"rm -f"*)
+      :
+      ;;
+    *)
+      echo "unrecognized cmd: \$cmd" >&2
+      exit 64
+      ;;
+  esac
+  exit 0
+fi
+exit 99
+`,
+      { mode: 0o755 },
+    );
+    const oldPath = process.env.PATH;
+    process.env.PATH = `${binDir}:${oldPath ?? ""}`;
+    return {
+      restore: () => {
+        process.env.PATH = oldPath;
+      },
+    };
+  }
+
+  it("passes_when_payload_is_preserved_and_marker_absent", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "inj-probe-pass-"));
+    const stub = setupInjectionStub(tmp, false);
+    try {
+      const { injectionBlockedProbe } = await import("../scenarios/probes/injection-blocked.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await injectionBlockedProbe(ctx);
+      expect(outcome.status).toBe("passed");
+      const ev = JSON.parse(fs.readFileSync(ctx.evidencePath, "utf8"));
+      expect(ev.payloadPreservedLiterally).toBe(true);
+      expect(ev.markerAbsent).toBe(true);
+    } finally {
+      stub.restore();
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+
+  it("fails_when_marker_file_was_created_indicating_command_substitution_executed", async () => {
+    const tmp = fs.mkdtempSync(path.join(os.tmpdir(), "inj-probe-fail-"));
+    const stub = setupInjectionStub(tmp, true);
+    try {
+      const { injectionBlockedProbe } = await import("../scenarios/probes/injection-blocked.ts");
+      const ctx = makeProbeCtxFor(tmp, "sb1");
+      const outcome = await injectionBlockedProbe(ctx);
+      expect(outcome.status).toBe("failed");
+      expect(outcome.message).toMatch(/marker file .* present/);
+      expect(outcome.message).toMatch(/command substitution executed/);
+      const ev = JSON.parse(fs.readFileSync(ctx.evidencePath, "utf8"));
+      expect(ev.markerAbsent).toBe(false);
+    } finally {
+      stub.restore();
+      fs.rmSync(tmp, { recursive: true, force: true });
+    }
+  });
+});
diff --git a/test/e2e-scenario/scenarios/probes/builtin.ts b/test/e2e-scenario/scenarios/probes/builtin.ts
index afab89aaf8..7f78fc06bc 100644
--- a/test/e2e-scenario/scenarios/probes/builtin.ts
+++ b/test/e2e-scenario/scenarios/probes/builtin.ts
@@ -3,6 +3,9 @@
 
 import { diagnosticsProbe } from "./diagnostics.ts";
 import { docsValidationProbe } from "./docs-validation.ts";
+import { injectionBlockedProbe } from "./injection-blocked.ts";
+import { networkPolicyProbe } from "./network-policy.ts";
+import { shieldsConfigProbe } from "./shields-config.ts";
 import { lookupProbe, registerProbe } from "./registry.ts";
 
 /**
@@ -17,19 +20,18 @@ import { lookupProbe, registerProbe } from "./registry.ts";
  *   - Scenario-specific probes (if any) belong in a per-scenario
  *     module that calls `registerProbe()` directly.
  *
- * Probes intentionally NOT yet registered (probe-registry follow-up):
- *   - shieldsConfigProbe       (security; required: true)
- *   - networkPolicyProbe       (security; required: true)
- *   - injectionBlockedProbe    (security; required: true)
- *
- * Until those land, the orchestrator surfaces them as failed (not
- * skipped) because the typed registry marks them required: true.
- * That is intentional — security-sensitive suites must NEVER show
- * fake-green when their probe is missing.
+ * Security probes (shieldsConfigProbe, networkPolicyProbe,
+ * injectionBlockedProbe) are marked `required: true` in
+ * scenarios/assertions/registry.ts. With the implementations
+ * registered below, the orchestrator runs them and fails the phase
+ * on real assertion violations — not on a missing implementation.
  */
 const BUILTIN_PROBES = {
   diagnosticsProbe,
   docsValidationProbe,
+  shieldsConfigProbe,
+  networkPolicyProbe,
+  injectionBlockedProbe,
 } as const;
 
 export function registerBuiltinProbes(): void {
diff --git a/test/e2e-scenario/scenarios/probes/injection-blocked.ts b/test/e2e-scenario/scenarios/probes/injection-blocked.ts
new file mode 100644
index 0000000000..d1acf8ab3d
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/injection-blocked.ts
@@ -0,0 +1,155 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { ProbeContext, ProbeFn, ProbeOutcome } from "./types.ts";
+import { runSandboxCmd, writeProbeEvidence } from "./util.ts";
+
+/**
+ * Probe: security.injection.blocked (`injectionBlockedProbe`).
+ *
+ * Mirrors the contract in
+ * test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
+ * `spc_assert_telegram_payload_not_shell_executed`, which itself
+ * encodes the test-credential-sanitization.sh / test-brave-search-e2e.sh
+ * lesson: when an external content channel (chat message, web fetch
+ * response, document body) flows into the agent, it MUST be treated
+ * as data, never as a shell command.
+ *
+ * Implementation:
+ *   1. Pre-clean a unique marker file inside the sandbox.
+ *   2. Send a payload designed to exploit naive shell expansion:
+ *        $(touch <marker> && echo INJECTED)
+ *      via stdin to a remote `sh -c 'MSG=$(cat); printf "%s\\n" "$MSG"'`
+ *      that's responsible for echoing the bytes back without
+ *      evaluation.
+ *   3. Verify the echoed output contains the literal payload
+ *      (proves the bytes were preserved, not interpreted).
+ *   4. Verify the marker file does NOT exist (proves the
+ *      command-substitution operator did not fire on the host
+ *      side or the sandbox-side shell).
+ *
+ * If either invariant is violated the probe fails closed: the
+ * scenario must NEVER pass on a fake-green injection check.
+ *
+ * Evidence: structured JSON capturing the payload, the marker path,
+ * the echo response tail, and the marker existence query result.
+ */
+
+const PER_CALL_SECONDS = 25;
+
+interface InjectionEvidence {
+  payload: string;
+  markerPath: string;
+  echoExitCode: number | null;
+  echoStdoutTail: string;
+  echoStderrTail: string;
+  markerStateExitCode: number | null;
+  markerStateStdout: string;
+  payloadPreservedLiterally: boolean;
+  markerAbsent: boolean;
+}
+
+function uniqueMarkerPath(): string {
+  // `/tmp` is sandbox-writable; collisions across parallel scenarios
+  // are avoided by mixing pid+random to keep the marker scoped to
+  // this probe invocation.
+  const rand = Math.floor(Math.random() * 0xffff_ffff).toString(16);
+  return `/tmp/nemoclaw-injection-probe-${process.pid}-${rand}`;
+}
+
+export const injectionBlockedProbe: ProbeFn = async (ctx: ProbeContext): Promise<ProbeOutcome> => {
+  if (!ctx.sandboxName) {
+    return { status: "failed", message: "injectionBlockedProbe: E2E_SANDBOX_NAME missing in context.env" };
+  }
+
+  const markerPath = uniqueMarkerPath();
+  // Single-quote the marker path inside the payload so the path stays
+  // intact if any shell layer evaluates the payload; what is under test
+  // is the COMMAND SUBSTITUTION operator, not the path.
+  const payload = `$(touch '${markerPath}' && echo INJECTED)`;
+
+  const evidence: InjectionEvidence = {
+    payload,
+    markerPath,
+    echoExitCode: null,
+    echoStdoutTail: "",
+    echoStderrTail: "",
+    markerStateExitCode: null,
+    markerStateStdout: "",
+    payloadPreservedLiterally: false,
+    markerAbsent: false,
+  };
+
+  // Step 1: pre-clean the marker. Best-effort; failure here doesn't
+  // matter as long as Step 3 confirms the marker is absent at the
+  // end.
+  await runSandboxCmd(ctx, ["sh", "-c", `rm -f '${markerPath}'`], {
+    perCallSeconds: PER_CALL_SECONDS,
+  });
+
+  // Step 2: send the payload via stdin to an in-sandbox shell that
+  // reads it into a variable and echoes it back. The payload must NOT
+  // appear in the remote command string itself ($(cat) there only
+  // reads stdin); the host-side bash wrapper passes the script
+  // verbatim and the sandbox shell reads the payload as data.
+  const echoResult = await runSandboxCmd(
+    ctx,
+    ["sh", "-c", 'MSG=$(cat); printf "%s\\n" "$MSG"'],
+    { perCallSeconds: PER_CALL_SECONDS, stdin: payload },
+  );
+  evidence.echoExitCode = echoResult.exitCode;
+  evidence.echoStdoutTail = echoResult.stdout;
+  evidence.echoStderrTail = echoResult.stderr;
+
+  if (echoResult.exitCode !== 0) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      classifier: echoResult.signal === "SIGTERM" ? "gateway-transient" : undefined,
+      message: `injectionBlockedProbe: echo command failed (exit ${echoResult.exitCode}); stderr: ${echoResult.stderr.slice(-300)}`,
+    };
+  }
+
+  evidence.payloadPreservedLiterally = echoResult.stdout.includes(payload);
+  if (!evidence.payloadPreservedLiterally) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `injectionBlockedProbe: payload was not preserved literally; stdout tail: ${echoResult.stdout.slice(-300)}`,
+    };
+  }
+
+  // Step 3: confirm the marker file does NOT exist. A 'SAFE' result
+  // means the command-substitution payload was treated as data; an
+  // 'EXPLOITED' result means a shell somewhere in the chain
+  // evaluated the substitution.
+  const markerResult = await runSandboxCmd(
+    ctx,
+    ["sh", "-c", `test -f '${markerPath}' && echo EXPLOITED || echo SAFE`],
+    { perCallSeconds: PER_CALL_SECONDS },
+  );
+  evidence.markerStateExitCode = markerResult.exitCode;
+  evidence.markerStateStdout = markerResult.stdout;
+  evidence.markerAbsent = markerResult.stdout.includes("SAFE");
+
+  // Best-effort cleanup of the marker if it somehow got created
+  // (an 'EXPLOITED' result is a probe failure but we still don't
+  // want a stray file lingering between runs).
+  await runSandboxCmd(ctx, ["sh", "-c", `rm -f '${markerPath}'`], {
+    perCallSeconds: PER_CALL_SECONDS,
+  });
+
+  writeProbeEvidence(ctx.evidencePath, evidence);
+
+  if (!evidence.markerAbsent) {
+    return {
+      status: "failed",
+      message: `injectionBlockedProbe: marker file ${markerPath} present \u2014 command substitution executed; stdout: ${markerResult.stdout.slice(-200)}`,
+    };
+  }
+
+  return {
+    status: "passed",
+    message: `injectionBlockedProbe: payload preserved as data, marker ${markerPath} absent`,
+  };
+};
diff --git a/test/e2e-scenario/scenarios/probes/network-policy.ts b/test/e2e-scenario/scenarios/probes/network-policy.ts
new file mode 100644
index 0000000000..c3bb50923c
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/network-policy.ts
@@ -0,0 +1,125 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { ProbeContext, ProbeFn, ProbeOutcome } from "./types.ts";
+import { runSandboxCmd, writeProbeEvidence } from "./util.ts";
+
+/**
+ * Probe: security.policy.enforced (`networkPolicyProbe`).
+ *
+ * Mirrors the deny-by-default contract from
+ * test/e2e/test-network-policy.sh TC-NET-01: when no policy preset
+ * widens egress for a given hostname, a request to that hostname
+ * from inside the sandbox MUST be rejected by the gateway. A success
+ * status is a hard failure: it means the network-policy enforcement
+ * layer is not catching the request.
+ *
+ * Implementation: from inside the sandbox, run `curl` against a
+ * non-whitelisted URL and inspect:
+ *   - HTTP status code (via curl -w '%{http_code}')
+ *   - curl exit code (curl exit 7 / 28 / etc. when DNS or connect
+ *     is blocked outright)
+ *
+ * Expected outcomes:
+ *   - HTTP 403   (gateway proxy rejected the request)
+ *   - Any other HTTP 4xx or 5xx except 401 (401 indicates the
+ *     request reached an upstream auth wall, which counts as policy
+ *     bypass, NOT block)
+ *   - curl exit != 0 with HTTP code 000 (DNS / connect error): the
+ *     gateway dropped the request before HTTP could be spoken
+ *
+ * Anything else (HTTP 2xx, 3xx, 401) means policy is NOT enforcing
+ * deny-by-default and the probe fails.
+ *
+ * Hostname choice: example.com is the canonical "should never be on
+ * any preset" target the legacy test uses. Scenarios that need a
+ * different fixture override it via E2E_NETWORK_POLICY_BLOCKED_URL.
+ */
+
+const DEFAULT_BLOCKED_URL = "https://example.com/";
+const CURL_MAX_TIME_S = 10;
+const PER_CALL_SECONDS = 25;
+
+interface NetworkPolicyEvidence {
+  blockedUrl: string;
+  curlExitCode: number | null;
+  curlSignal: string | null;
+  httpStatus: string | null;
+  stdoutTail: string;
+  stderrTail: string;
+}
+
+function isBlockedHttpStatus(code: string): boolean {
+  if (code === "000") return true; // DNS/connect refused before HTTP
+  if (code === "401") return false; // reached upstream auth -> NOT blocked
+  return /^4[0-9][0-9]$/.test(code) || /^5[0-9][0-9]$/.test(code);
+}
+
+export const networkPolicyProbe: ProbeFn = async (ctx: ProbeContext): Promise<ProbeOutcome> => {
+  if (!ctx.sandboxName) {
+    return { status: "failed", message: "networkPolicyProbe: E2E_SANDBOX_NAME missing in context.env" };
+  }
+  const blockedUrl = ctx.contextEnv.E2E_NETWORK_POLICY_BLOCKED_URL || DEFAULT_BLOCKED_URL;
+
+  // curl -sS keeps stderr informative on failure; -o /dev/null discards
+  // body so the gateway's HTML reject page doesn't pollute stdout;
+  // -w prints the status code we parse below.
+  const result = await runSandboxCmd(
+    ctx,
+    [
+      "curl",
+      "-sS",
+      "-o",
+      "/dev/null",
+      "-w",
+      "%{http_code}",
+      "--max-time",
+      String(CURL_MAX_TIME_S),
+      blockedUrl,
+    ],
+    { perCallSeconds: PER_CALL_SECONDS },
+  );
+
+  // curl writes the status code to stdout (or '000' on connect/DNS
+  // failure). Trim whitespace; some curl builds emit a trailing
+  // newline.
+  const httpStatus = result.stdout.trim() || null;
+  const evidence: NetworkPolicyEvidence = {
+    blockedUrl,
+    curlExitCode: result.exitCode,
+    curlSignal: result.signal,
+    httpStatus,
+    stdoutTail: result.stdout,
+    stderrTail: result.stderr,
+  };
+  writeProbeEvidence(ctx.evidencePath, evidence);
+
+  if (result.signal === "SIGTERM") {
+    return {
+      status: "failed",
+      classifier: "gateway-transient",
+      message: `networkPolicyProbe: curl into sandbox timed out after ${PER_CALL_SECONDS}s`,
+    };
+  }
+
+  // The probe accepts:
+  //   - an HTTP 4xx/5xx status other than 401 (gateway rejected it)
+  //   - curl exit != 0 with status '000' or no status output at all
+  //     (the connection was dropped before any HTTP response)
+  if (httpStatus && isBlockedHttpStatus(httpStatus)) {
+    return {
+      status: "passed",
+      message: `networkPolicyProbe: ${blockedUrl} blocked (http_code=${httpStatus}, curl exit ${result.exitCode})`,
+    };
+  }
+  if (result.exitCode !== 0 && (!httpStatus || httpStatus === "000")) {
+    return {
+      status: "passed",
+      message: `networkPolicyProbe: ${blockedUrl} blocked (curl exit ${result.exitCode}, no HTTP response)`,
+    };
+  }
+  return {
+    status: "failed",
+    message: `networkPolicyProbe: ${blockedUrl} reachable from sandbox (http_code=${httpStatus ?? "<empty>"}, curl exit ${result.exitCode}); deny-by-default not enforced`,
+  };
+};
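The status handling in network-policy.ts above can be read as a small decision table. The following is a minimal standalone sketch of that table, not the patch's code: `classifyEgress` and `EgressVerdict` are hypothetical names introduced only for illustration.

```typescript
// Sketch only: `EgressVerdict` and `classifyEgress` are hypothetical names.
type EgressVerdict = "blocked" | "reachable";

function classifyEgress(httpCode: string): EgressVerdict {
  // "000" is what curl's %{http_code} prints when no HTTP response arrived.
  if (httpCode === "000") return "blocked";
  // 401 means an upstream auth wall answered, so the request got through.
  if (httpCode === "401") return "reachable";
  // Any other 4xx or 5xx is treated as a gateway-side rejection.
  return /^[45][0-9]{2}$/.test(httpCode) ? "blocked" : "reachable";
}

// Print the verdict for a few representative status codes.
for (const code of ["000", "403", "502", "401", "200", "301"]) {
  console.log(code, classifyEgress(code));
}
```

Keeping the 401 check ahead of the generic 4xx/5xx match is what makes the exception effective; reversing the order would classify 401 as blocked.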
diff --git a/test/e2e-scenario/scenarios/probes/shields-config.ts b/test/e2e-scenario/scenarios/probes/shields-config.ts
new file mode 100644
index 0000000000..6e268f69a5
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/shields-config.ts
@@ -0,0 +1,196 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import type { ProbeContext, ProbeFn, ProbeOutcome } from "./types.ts";
+import { runHostCmd, runSandboxCmd, writeProbeEvidence } from "./util.ts";
+
+/**
+ * Probe: security.shields.config (`shieldsConfigProbe`).
+ *
+ * Mirrors test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
+ * `spc_assert_shields_config_consistent`, which itself ports the
+ * legacy test/e2e/test-shields-config.sh contract:
+ *
+ *   1. Ask the host CLI: `nemoclaw <sandbox> shields status` and
+ *      classify the reported state as up | down | not-configured.
+ *   2. If the scenario declares an expected state via
+ *      `E2E_SHIELDS_EXPECTED_STATE` (or the legacy
+ *      `E2E_SHIELDS_EXPECTED`), assert observed === expected.
+ *   3. Verify the in-sandbox config file permissions match the
+ *      observed state:
+ *        - up                  -> root:root + restrictive 4xx mode
+ *                                 (read-only for owner+group, no write
+ *                                  for sandbox user)
+ *        - down|not-configured -> sandbox:sandbox (writable by the
+ *                                  sandbox user, since shields are
+ *                                  not locking the file)
+ *
+ * Config path depends on the agent the scenario onboarded:
+ *   - openclaw -> /sandbox/.openclaw/openclaw.json
+ *   - hermes   -> /sandbox/.hermes/.env
+ *
+ * Evidence: a JSON document at ProbeContext.evidencePath summarizing
+ * status output, observed state, expected state (if declared), and
+ * config-permission stat output.
+ */
+
+const SHIELDS_STATUS_TIMEOUT_MS = 30_000;
+const SANDBOX_STAT_PER_CALL_SECONDS = 25;
+
+type ShieldsState = "up" | "down" | "not-configured";
+
+interface ShieldsEvidence {
+  observed: ShieldsState | null;
+  expected: ShieldsState | null;
+  statusExitCode: number | null;
+  statusStdoutTail: string;
+  configPath: string | null;
+  permissionsLine: string | null;
+  mode: string | null;
+  owner: string | null;
+}
+
+function classifyStatus(stdout: string): ShieldsState | null {
+  if (stdout.includes("Shields: UP")) return "up";
+  if (stdout.includes("Shields: DOWN")) return "down";
+  if (stdout.includes("Shields: NOT CONFIGURED")) return "not-configured";
+  return null;
+}
+
+function configPathFor(agent: string | undefined): string | null {
+  switch (agent) {
+    case "openclaw":
+    case undefined:
+    case "":
+      return "/sandbox/.openclaw/openclaw.json";
+    case "hermes":
+      return "/sandbox/.hermes/.env";
+    default:
+      return null;
+  }
+}
+
+function permissionsOk(observed: ShieldsState, mode: string, owner: string): boolean {
+  if (observed === "up") {
+    // Locked: owner must be root and every mode digit read-only (4)
+    // or empty (0). The legacy 4[0-4][0-4] range also admits 2 and 3.
+    return /^4[04][04]$/.test(mode) && owner === "root:root";
+  }
+  // down | not-configured: sandbox user owns the file so they can
+  // edit when shields are dropped.
+  return owner === "sandbox:sandbox";
+}
+
+function expectedStateFromContext(env: Readonly<Record<string, string>>): ShieldsState | null {
+  const raw = (env.E2E_SHIELDS_EXPECTED_STATE || env.E2E_SHIELDS_EXPECTED || "").trim();
+  if (!raw) return null;
+  const norm = raw.replace(/_/g, "-").toLowerCase();
+  if (norm === "up" || norm === "down" || norm === "not-configured") return norm;
+  return null;
+}
+
+export const shieldsConfigProbe: ProbeFn = async (ctx: ProbeContext): Promise<ProbeOutcome> => {
+  if (!ctx.sandboxName) {
+    return { status: "failed", message: "shieldsConfigProbe: E2E_SANDBOX_NAME missing in context.env" };
+  }
+
+  const evidence: ShieldsEvidence = {
+    observed: null,
+    expected: expectedStateFromContext(ctx.contextEnv),
+    statusExitCode: null,
+    statusStdoutTail: "",
+    configPath: null,
+    permissionsLine: null,
+    mode: null,
+    owner: null,
+  };
+
+  // --- Step 1: nemoclaw <sandbox> shields status ---
+  const statusResult = await runHostCmd(
+    "nemoclaw",
+    [ctx.sandboxName, "shields", "status"],
+    { timeoutMs: SHIELDS_STATUS_TIMEOUT_MS },
+  );
+  evidence.statusExitCode = statusResult.exitCode;
+  evidence.statusStdoutTail = statusResult.stdout;
+  if (statusResult.signal === "SIGTERM") {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      classifier: "runner-infra",
+      message: `shieldsConfigProbe: 'nemoclaw shields status' timed out after ${SHIELDS_STATUS_TIMEOUT_MS}ms`,
+    };
+  }
+  if (statusResult.exitCode !== 0) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: 'nemoclaw shields status' exited ${statusResult.exitCode}; stderr: ${statusResult.stderr.slice(-300)}`,
+    };
+  }
+  const observed = classifyStatus(statusResult.stdout);
+  evidence.observed = observed;
+  if (!observed) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: status output did not report a recognized Shields state; tail: ${statusResult.stdout.slice(-200)}`,
+    };
+  }
+  if (evidence.expected && evidence.expected !== observed) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: expected shields '${evidence.expected}', observed '${observed}'`,
+    };
+  }
+
+  // --- Step 2: in-sandbox stat of the config file ---
+  const configPath = configPathFor(ctx.contextEnv.E2E_AGENT);
+  if (!configPath) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: unsupported E2E_AGENT '${ctx.contextEnv.E2E_AGENT}'`,
+    };
+  }
+  evidence.configPath = configPath;
+  const statResult = await runSandboxCmd(
+    ctx,
+    ["stat", "-c", "%a %U:%G", configPath],
+    { perCallSeconds: SANDBOX_STAT_PER_CALL_SECONDS },
+  );
+  if (statResult.exitCode !== 0) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      classifier: statResult.signal === "SIGTERM" ? "gateway-transient" : undefined,
+      message: `shieldsConfigProbe: stat of ${configPath} failed (exit ${statResult.exitCode}); stderr: ${statResult.stderr.slice(-300)}`,
+    };
+  }
+  const permsLine = statResult.stdout.trim();
+  evidence.permissionsLine = permsLine;
+  const [mode, owner] = permsLine.split(/\s+/, 2);
+  evidence.mode = mode ?? null;
+  evidence.owner = owner ?? null;
+  if (!mode || !owner) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: could not parse stat output: '${permsLine}'`,
+    };
+  }
+  if (!permissionsOk(observed, mode, owner)) {
+    writeProbeEvidence(ctx.evidencePath, evidence);
+    return {
+      status: "failed",
+      message: `shieldsConfigProbe: shields are '${observed}' but ${configPath} permissions are '${permsLine}'`,
+    };
+  }
+
+  writeProbeEvidence(ctx.evidencePath, evidence);
+  return {
+    status: "passed",
+    message: `shieldsConfigProbe: shields=${observed} ${configPath}=${permsLine}`,
+  };
+};
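A note on the mode check in shields-config.ts: a digit range such as `[0-4]` in an octal-mode pattern admits 2 and 3, both of which carry the write bit. A bitwise test avoids that ambiguity. The sketch below assumes a plain three-digit mode string as printed by `stat -c %a`; `hasWriteBit` and `isReadOnlyMode` are hypothetical helpers, not part of the patch.

```typescript
// Sketch only: `hasWriteBit` and `isReadOnlyMode` are hypothetical helpers.
function hasWriteBit(digit: string): boolean {
  // In an octal permission digit, the value 2 is the write bit.
  return (Number.parseInt(digit, 8) & 2) !== 0;
}

function isReadOnlyMode(mode: string): boolean {
  // Assumes exactly three octal digits (no setuid/setgid/sticky prefix).
  if (!/^[0-7]{3}$/.test(mode)) return false;
  // Read-only means no owner, group, or world digit carries the write bit.
  return !mode.split("").some(hasWriteBit);
}

// 422 falls inside 4[0-4][0-4] yet is group- and world-writable.
for (const mode of ["444", "440", "422", "644"]) {
  console.log(mode, isReadOnlyMode(mode));
}
```

This checks only the write bits; the owner check (`root:root`) and the requirement that the owner digit be 4 would still be separate conditions.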
diff --git a/test/e2e-scenario/scenarios/probes/util.ts b/test/e2e-scenario/scenarios/probes/util.ts
new file mode 100644
index 0000000000..dbb94a9819
--- /dev/null
+++ b/test/e2e-scenario/scenarios/probes/util.ts
@@ -0,0 +1,235 @@
+// SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+// SPDX-License-Identifier: Apache-2.0
+
+import { spawn } from "node:child_process";
+import fs from "node:fs";
+import path from "node:path";
+import type { ProbeContext } from "./types.ts";
+
+/**
+ * Shared utilities for built-in probes. Two responsibilities:
+ *
+ *   1. Entering the sandbox via the canonical bash wrapper
+ *      (`validation_suites/sandbox-exec.sh`) instead of re-implementing
+ *      the ssh-config / openshell-exec logic in TS. This keeps the
+ *      transport choice in ONE place: if the wrapper changes
+ *      (e.g. switches from openshell-exec to ssh-config preferred),
+ *      every probe inherits the new behavior.
+ *
+ *   2. Spawning host-side CLIs (`nemoclaw`, `openshell`) with timeouts
+ *      and structured outcome capture. Probes never invoke spawn
+ *      directly so timeout and stdio handling stays consistent.
+ *
+ * Probe code MUST treat the returned `stdout`/`stderr` as already-bounded
+ * (we slice the tail). The full output is never returned or logged from
+ * here; evidence files keep the structured fields a probe explicitly
+ * decides to persist.
+ */
+
+const VALIDATION_SUITES_REL = "test/e2e-scenario/validation_suites";
+const TAIL_BYTES = 2048;
+
+export interface CmdResult {
+  exitCode: number | null;
+  signal: NodeJS.Signals | null;
+  stdout: string;
+  stderr: string;
+  elapsedMs: number;
+}
+
+interface RunOptions {
+  /** Hard cap; on expiry the helper SIGTERMs the child and resolves. */
+  timeoutMs: number;
+  /** stdin payload piped to the child (set by `runSandboxCmd`). UTF-8 only. */
+  stdin?: string;
+  /** Override env. Defaults to process.env. */
+  env?: NodeJS.ProcessEnv;
+  /** Override cwd. Defaults to the current process cwd when unset. */
+  cwd?: string;
+}
+
+function tail(buf: string, max = TAIL_BYTES): string {
+  return buf.length <= max ? buf : buf.slice(-max);
+}
+
+/**
+ * Spawn a bash script and capture the result. Internal helper used by
+ * the sandbox-cmd path; not exported because direct bash spawning by
+ * probes invites the same drift the canonical wrapper exists to
+ * prevent.
+ */
+function spawnBash(script: string, opts: RunOptions): Promise<CmdResult> {
+  return new Promise((resolve) => {
+    const startedAt = Date.now();
+    let stdout = "";
+    let stderr = "";
+    const child = spawn("bash", ["-c", script], {
+      env: opts.env ?? process.env,
+      cwd: opts.cwd,
+      stdio: [opts.stdin === undefined ? "ignore" : "pipe", "pipe", "pipe"],
+    });
+    const onTimeout = setTimeout(() => {
+      try {
+        child.kill("SIGTERM");
+      } catch {
+        /* already gone */
+      }
+    }, opts.timeoutMs);
+    child.stdout?.on("data", (chunk: Buffer) => {
+      stdout = tail(stdout + chunk.toString("utf8"));
+    });
+    child.stderr?.on("data", (chunk: Buffer) => {
+      stderr = tail(stderr + chunk.toString("utf8"));
+    });
+    if (opts.stdin !== undefined && child.stdin) {
+      child.stdin.end(opts.stdin);
+    }
+    child.on("error", (err) => {
+      clearTimeout(onTimeout);
+      resolve({
+        exitCode: 127,
+        signal: null,
+        stdout,
+        stderr: tail(stderr + `spawn error: ${err.message}`),
+        elapsedMs: Date.now() - startedAt,
+      });
+    });
+    child.on("close", (code, sig) => {
+      clearTimeout(onTimeout);
+      resolve({
+        exitCode: code,
+        signal: sig,
+        stdout,
+        stderr,
+        elapsedMs: Date.now() - startedAt,
+      });
+    });
+  });
+}
+
+/**
+ * Run a command inside the scenario's sandbox via the canonical
+ * `e2e_sandbox_exec` shell wrapper. Picks up the same ssh-config
+ * preferred / openshell-exec fallback transport, the per-call
+ * timeout, and the classified diagnostic on hang.
+ *
+ * `args` is treated as a single argv vector by the wrapper: each
+ * element is single-quoted into the bash script body so payloads
+ * with shell metacharacters survive intact.
+ */
+export async function runSandboxCmd(
+  ctx: ProbeContext,
+  args: readonly string[],
+  opts: { timeoutMs?: number; perCallSeconds?: number; stdin?: string } = {},
+): Promise<CmdResult> {
+  if (!ctx.sandboxName) {
+    return {
+      exitCode: 1,
+      signal: null,
+      stdout: "",
+      stderr: "runSandboxCmd: ProbeContext.sandboxName is null (E2E_SANDBOX_NAME unset in context.env)",
+      elapsedMs: 0,
+    };
+  }
+  const wrapperPath = path.resolve(ctx.repoRoot, VALIDATION_SUITES_REL, "sandbox-exec.sh");
+  if (!fs.existsSync(wrapperPath)) {
+    return {
+      exitCode: 1,
+      signal: null,
+      stdout: "",
+      stderr: `runSandboxCmd: wrapper not found at ${wrapperPath}`,
+      elapsedMs: 0,
+    };
+  }
+  // Quote each argv element with single-quote escaping: ' -> '\''.
+  const quotedArgs = args
+    .map((a) => `'${a.replace(/'/g, "'\\''")}'`)
+    .join(" ");
+  const fnName = opts.stdin === undefined ? "e2e_sandbox_exec" : "e2e_sandbox_exec_stdin";
+  // Per-call wrapper cap (bash-side timeout); outer node-side cap
+  // sits a few seconds above so node always wins and we get a clean
+  // CmdResult even if bash hangs mid-output.
+  const perCall = opts.perCallSeconds ?? 25;
+  const outerMs = opts.timeoutMs ?? perCall * 1000 + 5_000;
+  const sandboxQuoted = `'${ctx.sandboxName.replace(/'/g, "'\\''")}'`;
+  const script = `set -uo pipefail
+. '${wrapperPath.replace(/'/g, "'\\''")}'
+E2E_SANDBOX_EXEC_TIMEOUT_SECONDS=${perCall} ${fnName} ${sandboxQuoted} -- ${quotedArgs}
+`;
+  return spawnBash(script, {
+    timeoutMs: outerMs,
+    stdin: opts.stdin,
+    env: { ...process.env, E2E_CONTEXT_DIR: ctx.contextDir },
+    cwd: ctx.repoRoot,
+  });
+}
+
+/**
+ * Spawn a host-side CLI directly. Use for `nemoclaw` / `openshell`
+ * commands that operate against the host, not inside the sandbox
+ * (e.g. `nemoclaw <sb> shields status`, `openshell policy get`).
+ */
+export function runHostCmd(
+  bin: string,
+  args: readonly string[],
+  opts: { timeoutMs?: number; cwd?: string; env?: NodeJS.ProcessEnv } = {},
+): Promise<CmdResult> {
+  return new Promise((resolve) => {
+    const startedAt = Date.now();
+    let stdout = "";
+    let stderr = "";
+    const child = spawn(bin, [...args], {
+      env: opts.env ?? process.env,
+      cwd: opts.cwd,
+      stdio: ["ignore", "pipe", "pipe"],
+    });
+    const timeoutMs = opts.timeoutMs ?? 30_000;
+    const onTimeout = setTimeout(() => {
+      try {
+        child.kill("SIGTERM");
+      } catch {
+        /* already gone */
+      }
+    }, timeoutMs);
+    child.stdout?.on("data", (chunk: Buffer) => {
+      stdout = tail(stdout + chunk.toString("utf8"));
+    });
+    child.stderr?.on("data", (chunk: Buffer) => {
+      stderr = tail(stderr + chunk.toString("utf8"));
+    });
+    child.on("error", (err) => {
+      clearTimeout(onTimeout);
+      resolve({
+        exitCode: 127,
+        signal: null,
+        stdout,
+        stderr: tail(stderr + `spawn error: ${err.message}`),
+        elapsedMs: Date.now() - startedAt,
+      });
+    });
+    child.on("close", (code, sig) => {
+      clearTimeout(onTimeout);
+      resolve({
+        exitCode: code,
+        signal: sig,
+        stdout,
+        stderr,
+        elapsedMs: Date.now() - startedAt,
+      });
+    });
+  });
+}
+
+/**
+ * Best-effort write of structured probe evidence. Every built-in
+ * probe writes its structured outcome to ProbeContext.evidencePath
+ * via this helper so the artifact bundle has a uniform JSON layout.
+ */
+export function writeProbeEvidence(evidencePath: string, payload: unknown): void {
+  try {
+    fs.mkdirSync(path.dirname(evidencePath), { recursive: true });
+    fs.writeFileSync(evidencePath, JSON.stringify(payload, null, 2));
+  } catch {
+    /* evidence is best-effort; never fail the probe on IO */
+  }
+}
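The single-quote escaping that `runSandboxCmd` applies to each argv element can be shown in isolation. This is a minimal sketch: `shellQuote` is a hypothetical helper name, and the patch itself inlines the equivalent `replace()` call rather than defining such a function.

```typescript
// Sketch only: `shellQuote` is a hypothetical name for the inlined logic.
function shellQuote(arg: string): string {
  // Close the quote, emit an escaped quote, reopen the quote: ' -> '\''
  return `'${arg.replace(/'/g, "'\\''")}'`;
}

// Each element becomes one shell word, so metacharacters stay literal.
const argv = ["sh", "-c", "rm -f '/tmp/marker'", "$(touch x)"];
console.log(argv.map(shellQuote).join(" "));
```

Inside single quotes a POSIX shell performs no expansion at all, which is why this form is preferable to double-quoting for paths and payloads that may contain `$` or backticks.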