test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner by jyaunches · Pull Request #4380 · NVIDIA/NemoClaw

jyaunches · 2026-05-28T01:04:14Z

Why

This PR removes --dry-run and plan-mode execution from the E2E runner because they let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.

Two seams needed closing:

phase.ts:executeStep had no child_process.spawn — every real shell/probe step fell through to a hardcoded failed: unsupported live step, and four placeholder refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported passed.
~30 shell scripts under validation_suites/, onboarding_assertions/, and nemoclaw_scenarios/ began with if e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed --dry-run the scripts exited 0 before running.

What changed

Orchestrator (TypeScript)

phase.ts:executeStep now spawns shell steps via child_process.spawn, using detached process groups so that timeouts kill bash and sleep cleanly via a negative-pid signal. Probe steps return skipped: "probe not registered" until the registry lands. Pending steps return skipped: "pending: <ref>". Unknown kinds throw. Real evidence is captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} are enforced here, not in clients.
run.ts: --dry-run and --validate-only are deleted. The default invocation is live execution. --list and --plan-only (local debug) survive as read-only modes. --emit-matrix is added for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359).
types.ts: RunContext.dryRun deleted.
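
The step handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in phase.ts: the Step and StepResult shapes, the free-standing runShellStep signature, and the choice of SIGKILL are all assumptions made for the example, and log writing to .e2e/logs/ is omitted in favour of an in-memory output string.

```typescript
import { spawn } from "node:child_process";

// Hypothetical shapes for illustration; the real types live in types.ts.
type Step =
  | { kind: "shell"; id: string; script: string; timeoutSeconds: number }
  | { kind: "probe"; id: string; ref: string }
  | { kind: "pending"; id: string; ref: string };

type StepResult = {
  status: "passed" | "failed" | "skipped";
  message?: string;
  output?: string;
};

// Run one script under bash in its own process group so a timeout can
// signal bash and every child it started (for example a long `sleep`).
function runShellStep(script: string, timeoutSeconds: number): Promise<StepResult> {
  return new Promise((resolve) => {
    const child = spawn("bash", ["-c", script], {
      detached: true, // on POSIX the child leads a new process group
      stdio: ["ignore", "pipe", "pipe"],
    });
    let output = "";
    let timedOut = false;
    child.stdout?.on("data", (chunk) => (output += String(chunk)));
    child.stderr?.on("data", (chunk) => (output += String(chunk)));

    const timer = setTimeout(() => {
      timedOut = true;
      try {
        // A negative pid addresses the whole process group (SIGKILL is assumed here).
        if (child.pid !== undefined) process.kill(-child.pid, "SIGKILL");
      } catch {
        // The group may already have exited; nothing left to kill.
      }
    }, timeoutSeconds * 1000);

    child.on("error", (err) => {
      clearTimeout(timer);
      resolve({ status: "failed", message: `spawn error: ${err.message}`, output });
    });
    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        resolve({ status: "failed", message: `timeout after ${timeoutSeconds}s`, output });
      } else if (code === 0) {
        resolve({ status: "passed", output });
      } else {
        resolve({ status: "failed", message: `exit code ${code}`, output });
      }
    });
  });
}

// Dispatch on step kind: shell runs for real, probe and pending are
// reported as skipped with a visible reason, anything else throws.
async function executeStep(step: Step): Promise<StepResult> {
  switch (step.kind) {
    case "shell":
      return runShellStep(step.script, step.timeoutSeconds);
    case "probe":
      return { status: "skipped", message: "probe not registered" };
    case "pending":
      return { status: "skipped", message: `pending: ${step.ref}` };
    default:
      throw new Error(`unknown step kind: ${JSON.stringify(step)}`);
  }
}
```

Killing the group rather than the bash pid alone is what keeps a backgrounded child from outliving the step and holding the output pipes open.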

Workflow

e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
tools/e2e-scenarios/workflow-boundary.mts validator now rejects --dry-run, --plan-only, --validate-only in the workflow.
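
A rough sketch of the kind of check the validator performs, assuming it scans the workflow text for the three flags. The function name findForbiddenFlags and the Violation shape are hypothetical; the real logic in workflow-boundary.mts may parse the YAML instead of scanning lines.

```typescript
// Flags the PR says the validator rejects in the workflow.
const FORBIDDEN_FLAGS = ["--dry-run", "--plan-only", "--validate-only"];

// Hypothetical result shape for this sketch.
type Violation = { line: number; flag: string; text: string };

// Scan workflow text line by line and report every forbidden flag.
function findForbiddenFlags(workflowText: string): Violation[] {
  const violations: Violation[] = [];
  workflowText.split(/\r?\n/).forEach((text, index) => {
    for (const flag of FORBIDDEN_FLAGS) {
      // Match the flag as a whole token so "--dry-runner" does not trip it.
      const pattern = new RegExp(`(^|[\\s"'])${flag}(?=$|[\\s"'=])`);
      if (pattern.test(text)) {
        violations.push({ line: index + 1, flag, text: text.trim() });
      }
    }
  });
  return violations;
}
```

A line-based scan like this also flags the strings inside YAML comments, which may or may not be desirable for a boundary check.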

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)

runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing at run.ts.
runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)

validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every if e2e_env_is_dry_run; then ... exit 0; fi and every [[ "${E2E_DRY_RUN:-0}" == "1" ]] short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.
runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
inference_routing.sh: dead _e2e_inference_plan helper deleted.

Tests

DELETED (validated dead code paths): e2e-suite-runner.test.ts, e2e-scenario-first-migration.test.ts, e2e-expected-state-validator.test.ts.
REWRITTEN e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.
TRIMMED e2e-lib-helpers.test.ts, e2e-scenario-additional-families.test.ts, e2e-scenario-resolver.test.ts, e2e-context-helper.test.ts: dry-run-mode / run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.
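
The retry-on-classified-transient case exercised by the rewritten tests might look like the sketch below. It assumes retry.attempts means total attempts and retry.on is a list of classifier labels; runWithRetry, the classify callback, and the label strings are hypothetical names for this example only.

```typescript
// Hypothetical shapes for this sketch.
type StepResult = { status: "passed" | "failed" | "skipped"; message?: string };
type RetryPolicy = { attempts: number; on: string[] };

// Re-run a failed step only while its failure class is listed in policy.on
// and the attempt budget is not exhausted.
async function runWithRetry(
  run: () => Promise<StepResult>,
  classify: (result: StepResult) => string,
  policy: RetryPolicy,
): Promise<{ result: StepResult; attemptsUsed: number }> {
  const maxAttempts = Math.max(1, policy.attempts);
  let result = await run();
  let attemptsUsed = 1;
  while (
    result.status === "failed" &&
    attemptsUsed < maxAttempts &&
    policy.on.includes(classify(result))
  ) {
    result = await run();
    attemptsUsed += 1;
  }
  return { result, attemptsUsed };
}
```

Failures whose class is not in the list return after the first attempt, so a genuine assertion failure is never masked by a retry.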

Docs

test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only.

Verification

$ npx tsx test/e2e-scenario/scenarios/run.ts --list
hybrid scenario registry
- brev-launchable-cloud-openclaw: ...
- ubuntu-repo-cloud-openclaw: ...
(22 scenarios listed)

$ npx tsx test/e2e-scenario/scenarios/run.ts --emit-matrix
{"include":[{"id":"brev-launchable-cloud-openclaw", ...}, ...]}
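
As an illustration of the matrix shape shown above, a registry could be mapped to a GitHub Actions include-style matrix along these lines. The Scenario type and emitMatrix name are hypothetical, and the sketch emits only the id field because the other fields are elided in the output above.

```typescript
// Hypothetical registry entry; only `id` is confirmed by the output above.
type Scenario = { id: string };

// Build the {"include":[...]} JSON string consumed by a dynamic matrix job.
function emitMatrix(scenarios: Scenario[]): string {
  const include = scenarios.map((scenario) => ({ id: scenario.id }));
  return JSON.stringify({ include });
}

console.log(emitMatrix([{ id: "brev-launchable-cloud-openclaw" }, { id: "ubuntu-repo-cloud-openclaw" }]));
```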

$ E2E_CONTEXT_DIR=/tmp/x npx tsx test/e2e-scenario/scenarios/run.ts \
    --scenarios ubuntu-repo-cloud-openclaw
... (compiled plan output) ...
Phase results:
  environment: skipped (skipped=1)
  onboarding: failed (passed=1 failed=1)
  runtime: failed (failed=34 skipped=5)
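
The per-phase lines above could be produced by an aggregation like the following sketch. The rules are inferred from the three sample lines only: zero counts are omitted, a phase is failed if any step failed, and an exit code of 1 on failure is an assumption. The names summarizePhase and exitCodeFor are hypothetical.

```typescript
type Status = "passed" | "failed" | "skipped";

// Hypothetical input shape: the step statuses observed in one phase.
type PhaseStatuses = { phase: string; statuses: Status[] };

// Format one phase as "<phase>: <overall> (<status>=<count> ...)".
function summarizePhase({ phase, statuses }: PhaseStatuses): string {
  const order: Status[] = ["passed", "failed", "skipped"];
  const detail = order
    .map((status) => ({ status, count: statuses.filter((s) => s === status).length }))
    .filter(({ count }) => count > 0)
    .map(({ status, count }) => `${status}=${count}`)
    .join(" ");
  const overall: Status = statuses.includes("failed")
    ? "failed"
    : statuses.includes("passed")
      ? "passed"
      : "skipped";
  return `${phase}: ${overall} (${detail})`;
}

// Non-zero exit when any phase contains a failed step (the value 1 is assumed).
function exitCodeFor(phases: PhaseStatuses[]): number {
  return phases.some((p) => p.statuses.includes("failed")) ? 1 : 0;
}
```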

$ cat /tmp/x/.e2e/logs/runtime.smoke.cli-available.log
smoke:cli-available
e2e context: missing required key(s): E2E_SCENARIO

$ vitest run test/e2e-scenario/
Test Files  32 passed (32)
     Tests  274 passed (274)

Audited absent in production paths: --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, RunContext.dryRun.

Expected CI behavior

This PR's CI will surface real failures wherever the orchestrator does not yet hand the runtime phase enough state (e.g. a seeded context.env). That is intentional: this change exposes the remaining wiring gaps so subsequent PRs (probe registry, OnboardingOrchestrator wiring into the real install/onboard dispatchers, old YAML resolver deletion) can address them directly.
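
To make the context.env gap concrete, here is a minimal sketch of a required-key check that would produce a message like the one in the log above. It assumes context.env is a plain KEY=VALUE file and that multiple missing keys are joined with a comma; parseContextEnv and describeMissing are hypothetical names, and the real check lives in the bash context helper.

```typescript
// Parse KEY=VALUE lines, ignoring blanks and # comments (format assumed).
function parseContextEnv(text: string): Map<string, string> {
  const env = new Map<string, string>();
  for (const raw of text.split(/\r?\n/)) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no key before "=", skip the line
    env.set(line.slice(0, eq).trim(), line.slice(eq + 1).trim());
  }
  return env;
}

// Return an error message for absent or empty required keys, or null if all are set.
function describeMissing(env: Map<string, string>, required: string[]): string | null {
  const missing = required.filter((key) => !env.get(key));
  return missing.length === 0
    ? null
    : `e2e context: missing required key(s): ${missing.join(", ")}`;
}
```

Until the orchestrator seeds this file before the runtime phase, every step that calls the helper fails fast with that message instead of reaching its assertion.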

Spec gates addressed

Phase 6 — orchestrators execute live shell/probe assertion steps.
Phase 7 — single TS runtime entrypoint; bash runners deprecated; workflows use the typed runner with no --dry-run.
Workflow side of Phase 9 — --dry-run, --validate-only, and suite_filter gone from active paths.

The old YAML resolver source under runtime/resolver/ is left intact; its deletion is the next PR.

Stat

44 files changed, 454 insertions(+), 1833 deletions(-)

Net deletion of ~1400 lines, primarily dry-run scaffolding and the duplicated bash orchestration layer.

Summary by CodeRabbit

New Features
- Scenarios run live by default with a compact per-phase results summary and an emit-matrix mode for CI.
- Phases execute declared actions and real shell steps with per-action timeouts, retries, and per-step logs/evidence.
Chores
- Deprecates legacy bash E2E runners in favor of the TypeScript entrypoint.
- Removes dry-run/plan-only short-circuits so installs, validations and assertions execute consistently.
Documentation
- Updates E2E docs to reflect the TypeScript entrypoint and deprecated bash scripts.
Tests
- Refactors/removes multiple E2E tests to align with live-run behavior and the new orchestrator.

…only, and the bash runner

The merged hybrid scenario architecture shipped scaffolding that looked like it ran E2E tests but did not. Two layers were producing fake green:

1. phase.ts:executeStep had no child_process.spawn anywhere. Every real shell/probe step fell through to a hardcoded { status: "failed", message: "unsupported live step" }, and a handful of fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported "passed". CI green meant the plan compiled, not that any assertion executed.
2. ~30 shell scripts under validation_suites/, onboarding_assertions/, nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with "if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the dry-run flag flowed in (which workflows did pass), every script silently exited 0 before its real assertion ran.

This change rips out both layers in one shot. The TS runner has one execution mode: live. There is no flag, env var, helper, branch, or comment in any production path that can produce a fake pass.

Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn, with detached process groups so timeouts kill bash + sleep cleanly. Probe steps return skipped "probe not registered" until the registry lands. Pending steps return skipped "pending: <ref>". Unknown kinds throw. Real evidence is captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} policy are enforced here, not in clients.
- run.ts: --dry-run, --validate-only deleted. Default invocation is live execution. --list and --plan-only (local debug) survive read-only. --emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported "skipped" status, now actually used.

Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only, and --validate-only to NOT appear in the workflow.

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution -> 5-line fail-fast stub pointing at run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every "if e2e_env_is_dry_run; then ... exit 0; fi" and every "[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable from the deleted dry-run paths) deleted.

Tests
- DELETED (validated dead code paths):
  e2e-suite-runner.test.ts (run-suites.sh behavior)
  e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
  e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
  e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass/fail/timeout/retry/missing-ref), real probe skipping with visible reason, and real pending skipping. The previous fake-pass refs in this test were the canonical example of the problem.
- TRIMMED:
  e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of real bash semantics survive.
  e2e-scenario-additional-families.test.ts: planOnly-via-bash tests deleted; resolveScenario-direct tests survive.
  e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn tests deleted; resolver unit tests survive.
  e2e-context-helper.test.ts: dry-run trace test deleted.

Docs
- docs/README.md: updated to state one runner, one mode (live), no dry-run, no validate-only. Bash entrypoints documented as deprecated fail-fast stubs.

Verification
- run.ts --list: prints the typed registry (intact)
- run.ts --emit-matrix: emits JSON matrix for the dynamic-matrix workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes, real failures with real evidence logs. Phase results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, or RunContext.dryRun in any production path.

CI for this PR will go red on environments where nemoclaw is not actually installed and onboarded. That is the point. Red is the first honest signal in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring into the real install/onboard dispatchers, old YAML resolver deletion) fix the real failures rather than hide them.

Spec gates addressed: Phase 6 (orchestrators execute live shell steps), Phase 7 (single TS runtime entrypoint, bash runners deprecated), and the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter gone from active paths).

The old YAML resolver source under runtime/resolver/ stays for now; its deletion is the next PR.

coderabbitai · 2026-05-28T01:04:26Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

Removes plan-only/dry-run modes, makes the TypeScript runner the canonical live executor (adds matrix emission), deprecates bash runners, updates the Phase orchestrator for real shell execution, removes dry-run guards from validation/install scripts, and updates tests and workflow checks.

Changes

Live E2E Execution Framework

Layer / File(s)	Summary
Workflow and docs `.github/workflows/e2e-scenarios.yaml`, `tools/e2e-scenarios/workflow-boundary.mts`, `test/e2e-scenario/docs/README.md`	Workflow steps no longer invoke `run.ts` with `--dry-run`/`--plan-only`; validator enforces run scripts don't contain execution-hiding flags; README documents TS runner as canonical.
CLI runner and matrix emission `test/e2e-scenario/scenarios/run.ts`	`run.ts` adds `--emit-matrix`/`--plan-only`, drops `--dry-run`/`--validate-only`, emits matrices, compiles and executes plans live, aggregates PhaseResult and sets exit code on failure.
Types and plan compiler `test/e2e-scenario/scenarios/types.ts`, `test/e2e-scenario/scenarios/compiler.ts`	Introduce `PhaseAction` and `PhaseActionResult`, change `RunPlanPhase.actions` to `PhaseAction[]`, remove `RunContext.dryRun`, and compile plans with typed actions plus rendered action lines in plan text.
Phase orchestrator real execution `test/e2e-scenario/scenarios/orchestrators/phase.ts`	Execute phase actions before assertions; shell steps run real scripts (repo-relative resolution, existence checks, context.env parsing, per-step logs, detached process groups, timeouts, evidence and classifier handling); treat skipped/pending appropriately.
Bash runner deprecation & env cleanup `test/e2e-scenario/runtime/run-scenario.sh`, `test/e2e-scenario/runtime/run-suites.sh`, `test/e2e-scenario/runtime/lib/env.sh`	Legacy bash runners are now fail-fast stubs that exit code 2; `e2e_env_is_dry_run` helper removed; README updated to mark bash scripts deprecated.
Validation suites: remove dry-run guards `test/e2e-scenario/validation_suites/**`	Remove `E2E_DRY_RUN` short-circuits so inference/gateway/sandbox/Hermes/messaging/rebuild/sandbox lifecycle/platform smoke/CLI checks run live and produce real pass/fail outcomes.
Install & onboard scripts `test/e2e-scenario/nemoclaw_scenarios/install/*`, `.../fixtures/older-base-image.sh`, `.../onboard/dispatch.sh`	Installers, repo installs, Docker pulls, and onboarding dispatch no longer short-circuit on dry-run and execute their steps unconditionally.
E2E framework tests `test/e2e-scenario/framework-tests/*`	PhaseOrchestrator tests refactored to real temp-dir/script execution; many plan-only/dry-run tests removed and some test files deleted; tests updated to stop setting `E2E_DRY_RUN`.
Dispatch entrypoint `test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`	New deterministic dispatcher script added to invoke dispatcher functions after standard env/context setup.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/NemoClaw#4017: Overlapping changes to security policy credential checks and redaction/version handling.
NVIDIA/NemoClaw#4283: Prior changes to typed E2E workflow/runner flags and plan-mode invocations related to this adjustment.

Suggested labels

E2E, enhancement: testing, refactor, CI/CD

Suggested reviewers

cv
ericksoa

"🐰 I hopped through scripts at break of day,
No dry-run shadows now in my way.
Shells run for real, logs flourish and play,
Tests march forward, live truth on display.
A carrot for CI — steady, not gray."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and concisely summarizes the main objective: removing dry-run execution modes and making the TypeScript orchestrator the single execution path for real shell assertions.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".

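One way to apply the suggestion above is to normalize the ref once and reuse the lowercase copy for every check. This is only a sketch: classifyRef is a hypothetical name, "external-tunnel" comes from the review comment, and "provider-transient" is a placeholder label because the comment does not say what the other branch returns.

```typescript
// Classify a step ref by keyword, case-insensitively throughout.
function classifyRef(ref: string): string | undefined {
  const lower = ref.toLowerCase(); // normalize once, reuse for every check
  if (/provider|inference|chat-completion|cloudflared|tunnel/.test(lower)) {
    if (lower.includes("tunnel") || lower.includes("cloudflared")) {
      return "external-tunnel";
    }
    return "provider-transient"; // hypothetical label for the non-tunnel branch
  }
  return undefined;
}
```

With the original mixed approach, a ref such as "Cloudflared-Check" passes the /i regex but fails the case-sensitive includes, so it would fall into the non-tunnel branch.
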
In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.
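
The difference between the substring check and an exact-token check can be illustrated in TypeScript. The script under review is bash, so this only demonstrates the matching logic, not the fix itself; outputListsSandbox is a hypothetical name and tokens are assumed to be whitespace-delimited.

```typescript
// True only when `name` appears as a whole whitespace-delimited token.
function outputListsSandbox(output: string, name: string): boolean {
  return output.split(/\s+/).some((token) => token === name);
}

const listing = "sb10 running\nsb2 stopped";
console.log(listing.includes("sb1")); // substring check: true (false positive)
console.log(outputListsSandbox(listing, "sb1")); // token check: false
console.log(outputListsSandbox(listing, "sb10")); // token check: true
```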

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 401b8a23-1471-429d-9905-b413e6bae24c

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and b7acfb7.

📒 Files selected for processing (44)

.github/workflows/e2e-scenarios.yaml
test/e2e-scenario/docs/README.md
test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/runtime/run-scenario.sh
test/e2e-scenario/runtime/run-suites.sh
test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/scenarios/run.ts
test/e2e-scenario/scenarios/types.ts
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
tools/e2e-scenarios/workflow-boundary.mts

💤 Files with no reviewable changes (30)

test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/scenarios/types.ts
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/framework-tests/e2e-context-helper.test.ts

github-actions · 2026-05-28T01:13:18Z

E2E Advisor Recommendation

Required E2E: ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, ubuntu-no-docker-preflight-negative, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw
Optional E2E: ubuntu-repo-cloud-openclaw-telegram, ubuntu-repo-cloud-openclaw-discord, ubuntu-repo-cloud-openclaw-custom-policies, ubuntu-repo-cloud-openclaw-token-rotation

Dispatch hint: workflow_dispatch; runs ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw, ubuntu-no-docker-preflight-negative

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

ubuntu-repo-cloud-openclaw (medium): Baseline live Ubuntu repo install, cloud OpenClaw onboarding, gateway/sandbox smoke, cloud inference, and credentials checks cover the highest-risk workflow and typed-runner changes.
ubuntu-repo-cloud-hermes (medium): Validates the Hermes-specific path after changes to the shared runner/orchestrator and Hermes health assertion.
ubuntu-no-docker-preflight-negative (low): Preflight assertion and action orchestration changed; this negative scenario proves setup fails before gateway/sandbox side effects when Docker is unavailable.
gpu-repo-local-ollama-openclaw (high): Directly exercises changed Ollama install/profile code plus local Ollama inference and auth-proxy validation suites.
macos-repo-cloud-openclaw (medium): Validates the macOS scenario route and changed macOS platform smoke script under the live typed runner.
wsl-repo-cloud-openclaw (medium): Validates the WSL-specific workflow branch and changed WSL platform smoke script under live execution instead of dry-run.
brev-launchable-cloud-openclaw (high): Launchable install code and action dispatch changed; this scenario is the existing coverage for the launchable install/deployment path.

Optional E2E

ubuntu-repo-cloud-openclaw-telegram (medium): Changed messaging provider helpers and common bridge reachability assertion; useful to validate one real messaging provider flow.
ubuntu-repo-cloud-openclaw-discord (medium): Additional confidence for the changed messaging provider library and bridge assertions on Discord-specific OpenClaw onboarding.
ubuntu-repo-cloud-openclaw-custom-policies (medium): Exercises credentials, policy/onboarding state, model-router, and snapshot lifecycle assertions adjacent to changed security and lifecycle helper libraries.
ubuntu-repo-cloud-openclaw-token-rotation (medium): Useful confidence for credential/messaging boundary behavior after security and messaging helper changes.

New E2E recommendations

public-curl-installer (high): The public curl installer implementation changed, but the current typed scenario registry/workflow route list does not expose a canonical public-curl scenario to run in CI.
- Suggested test: Add a canonical typed scenario such as ubuntu-public-curl-cloud-openclaw that installs through the public curl installer and runs smoke plus a minimal inference check.
workflow-route-registry-parity (medium): The workflow keeps a manual route table while the typed registry is the source of scenario IDs; this PR removes plan validation in the resolve-runner step, increasing the chance of route/registry drift.
- Suggested test: Add an E2E workflow-boundary canary that dispatches every registry scenario through e2e-scenarios.yaml in plan/list validation mode or asserts ROUTES exactly covers the registry without invoking live infrastructure.
messaging-scenario-fanout (medium): Messaging helpers changed, but the all-scenarios fanout only covers the first seven scenarios and does not include Telegram/Discord/Slack assistant flows.
- Suggested test: Extend the scenario fanout or add a focused messaging E2E workflow job that runs one OpenClaw messaging scenario per provider family.
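
The route/registry parity assertion suggested above could be as small as a two-way set difference. In this sketch diffRoutes and its return shape are hypothetical, and the two id lists would come from the workflow's manual route table and the typed registry.

```typescript
// Hypothetical result shape: ids present on one side but not the other.
type RouteDrift = { missingFromRoutes: string[]; unknownInRoutes: string[] };

// Compare the workflow route table against the registry without running anything live.
function diffRoutes(routeIds: string[], registryIds: string[]): RouteDrift {
  const routes = new Set(routeIds);
  const registry = new Set(registryIds);
  return {
    missingFromRoutes: registryIds.filter((id) => !routes.has(id)),
    unknownInRoutes: routeIds.filter((id) => !registry.has(id)),
  };
}
```

A canary test would fail when either list is non-empty, which catches drift in both directions without touching live infrastructure.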

Dispatch hint

Workflow: .github/workflows/e2e-scenarios-all.yaml
jobs input: workflow_dispatch; runs ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw, ubuntu-no-docker-preflight-negative

github-actions · 2026-05-28T01:13:19Z

E2E Scenario Advisor Recommendation

Required scenario E2E: e2e-scenarios-all, ubuntu-repo-cloud-hermes-discord:smoke, ubuntu-repo-cloud-hermes-slack:smoke, ubuntu-repo-cloud-openclaw-brave:smoke, ubuntu-repo-cloud-openclaw-custom-policies:inference,smoke, ubuntu-repo-cloud-openclaw-discord:messaging-discord,smoke, ubuntu-repo-cloud-openclaw-double-provider-switch:smoke, ubuntu-repo-cloud-openclaw-double-same-provider:smoke, ubuntu-repo-cloud-openclaw-repair:smoke, ubuntu-repo-cloud-openclaw-resume:smoke, ubuntu-repo-cloud-openclaw-slack:messaging-slack,smoke, ubuntu-repo-cloud-openclaw-telegram:messaging-telegram,smoke, ubuntu-repo-cloud-openclaw-token-rotation:smoke, ubuntu-repo-openai-compatible-openclaw:smoke
Optional scenario E2E: brev-launchable-cloud-openclaw:inference,smoke, gpu-repo-local-ollama-openclaw:smoke, macos-repo-cloud-openclaw:platform-macos, wsl-repo-cloud-openclaw:platform-wsl,smoke

Dispatch required scenario E2E:

gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-slack
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-brave
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-custom-policies
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-provider-switch
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-repair
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-token-rotation
gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-openai-compatible-openclaw

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

e2e-scenarios-all: the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>
ubuntu-repo-cloud-hermes-discord:smoke: Scenario ubuntu-repo-cloud-hermes-discord exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. The reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord
ubuntu-repo-cloud-hermes-slack:smoke: Scenario ubuntu-repo-cloud-hermes-slack exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. The reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-slack
ubuntu-repo-cloud-openclaw-brave:smoke: Scenario ubuntu-repo-cloud-openclaw-brave exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. The reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-brave
ubuntu-repo-cloud-openclaw-custom-policies:inference,smoke: Scenario ubuntu-repo-cloud-openclaw-custom-policies exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-custom-policies
ubuntu-repo-cloud-openclaw-discord:messaging-discord,smoke: Scenario ubuntu-repo-cloud-openclaw-discord exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord
ubuntu-repo-cloud-openclaw-double-provider-switch:smoke: Scenario ubuntu-repo-cloud-openclaw-double-provider-switch exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-provider-switch
ubuntu-repo-cloud-openclaw-double-same-provider:smoke: Scenario ubuntu-repo-cloud-openclaw-double-same-provider exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider
ubuntu-repo-cloud-openclaw-repair:smoke: Scenario ubuntu-repo-cloud-openclaw-repair exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-repair
ubuntu-repo-cloud-openclaw-resume:smoke: Scenario ubuntu-repo-cloud-openclaw-resume exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
ubuntu-repo-cloud-openclaw-slack:messaging-slack,smoke: Scenario ubuntu-repo-cloud-openclaw-slack exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
ubuntu-repo-cloud-openclaw-telegram:messaging-telegram,smoke: Scenario ubuntu-repo-cloud-openclaw-telegram exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
ubuntu-repo-cloud-openclaw-token-rotation:smoke: Scenario ubuntu-repo-cloud-openclaw-token-rotation exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-token-rotation
ubuntu-repo-openai-compatible-openclaw:smoke: Scenario ubuntu-repo-openai-compatible-openclaw exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-openai-compatible-openclaw

Optional scenario E2E

brev-launchable-cloud-openclaw:inference,smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=brev-launchable-cloud-openclaw
gpu-repo-local-ollama-openclaw:smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
macos-repo-cloud-openclaw:platform-macos: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=macos-repo-cloud-openclaw
wsl-repo-cloud-openclaw:platform-wsl,smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
- Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

.github/workflows/e2e-scenarios.yaml
test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
test/e2e-scenario/runtime/lib/env.sh
test/e2e-scenario/runtime/run-scenario.sh
test/e2e-scenario/runtime/run-suites.sh
test/e2e-scenario/validation_suites/assert/gateway-alive.sh
test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
test/e2e-scenario/validation_suites/lib/inference_routing.sh
test/e2e-scenario/validation_suites/lib/messaging_providers.sh
test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
test/e2e-scenario/validation_suites/sandbox-exec.sh
test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh

github-actions · 2026-05-28T01:17:13Z

PR Review Advisor

Findings: 8 needs attention, 10 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 14 still apply, 0 new items found

Review findings

🛠️ Needs attention

Secret-bearing child process output is persisted and uploaded without redaction (test/e2e-scenario/scenarios/orchestrators/phase.ts:112): The live runner now executes install, onboarding, and assertion subprocesses with the full workflow environment, including provider secrets, then writes raw stdout/stderr to evidence logs and copies stderr tails into phase result JSON. The workflow uploads the entire .e2e tree with hidden files included, so any installer, onboarding command, provider response, or CLI error that prints a token can persist that secret as an artifact.
- Recommendation: Use a minimal allowlisted subprocess environment, centrally redact known secret values and secret-shaped strings before writing logs/result JSON, and restrict artifact upload paths to sanitized outputs. Add a regression test where a child prints a secret-shaped value and no .e2e log/result artifact contains it.
- Evidence: phase.ts builds child envs with `...process.env`, pipes child stdout/stderr to createWriteStream logs, and includes `stderrTail` in failure messages. .github/workflows/e2e-scenarios.yaml passes `NVIDIA_API_KEY` to the runner and uploads `.e2e/` with `include-hidden-files: true`.
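The allowlist and redaction recommendations above could look roughly like the sketch below. The names `ALLOWED_ENV`, `buildChildEnv`, and `redact` are hypothetical and do not come from phase.ts, and the allowlist contents are an assumption for illustration only.

```typescript
// Hypothetical helpers; names and the allowlist are illustrative, not from phase.ts.
const ALLOWED_ENV = ["PATH", "HOME", "LANG"];

// Build a child environment from an explicit allowlist instead of `...process.env`.
function buildChildEnv(
  source: Record<string, string | undefined>,
  extra: Record<string, string> = {},
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of ALLOWED_ENV) {
    const value = source[key];
    if (value !== undefined) env[key] = value;
  }
  // Secrets a step genuinely needs are passed explicitly through `extra`.
  return { ...env, ...extra };
}

// Replace every occurrence of each known secret value before text reaches a log or result JSON.
function redact(text: string, secrets: string[]): string {
  let out = text;
  for (const secret of secrets) {
    if (secret.length === 0) continue; // an empty secret would match everywhere
    out = out.split(secret).join("[REDACTED]");
  }
  return out;
}
```

When output is streamed, a secret can straddle two chunks, so redaction of this kind is safer applied to buffered lines than to raw stream chunks.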
Required security probes and expected-failure checks still skip without failing live runs (test/e2e-scenario/scenarios/orchestrators/phase.ts:266): Security-sensitive suites for shields, policy enforcement, and injection blocking are represented as probe steps, and negative expected-failure side-effect validation is represented as a pending step. The orchestrator marks both kinds as skipped, while the top-level runner only exits nonzero for failed phase results. A live run can therefore omit required security and negative checks while still appearing non-failing if other assertions pass.
- Recommendation: Fail closed for required/security probe steps and expected-failure side-effect checks, or exclude those suites/scenarios from live workflow selection until implemented. Add tests proving skipped security probes and pending expected-failure checks make run.ts exit nonzero.
- Evidence: phase.ts returns `status: "skipped"` for `kind === "probe"` and `kind === "pending"`. The assertion registry maps `security-shields`, `security-policy`, and `security-injection` to probe steps and maps `runtime.expected-failure.no-side-effects` to a pending step. run.ts sets `process.exitCode` only when a phase status is `failed`.
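One way to express the fail-closed behavior recommended above is sketched below. The `required` flag and the `exitCodeFor` helper are assumptions for illustration; the PR does not define either.

```typescript
type StepStatus = "passed" | "failed" | "skipped";

interface StepResult {
  id: string;
  status: StepStatus;
  required: boolean; // hypothetical flag; not part of the PR's types
}

// Returns a nonzero exit code when any step failed, or when a required step was skipped.
function exitCodeFor(results: StepResult[]): number {
  const blocking = results.filter(
    (r) => r.status === "failed" || (r.status === "skipped" && r.required),
  );
  return blocking.length > 0 ? 1 : 0;
}
```

With this rule a skipped `security-shields` probe marked as required would fail the run, while an optional skipped step would not.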
Negative scenarios no longer force or verify their expected-failure contracts (test/e2e-scenario/runtime/run-scenario.sh:1): Stubbing the bash runner removed the previous negative orchestration that forced Docker-missing, invalid-key, and gateway-port-conflict failures, captured negative logs, matched the expected failure, and checked forbidden side effects. The TS path has expectedFailure metadata but only a pending side-effect placeholder, so negative scenarios do not prove the declared failure mode or side-effect contract.
- Recommendation: Port expected-failure orchestration into the TS phase model before relying on these live runs: force the declared failure mode, write the negative log, derive observed phase/error/log/side effects, invoke the matcher, and fail on forbidden side effects. Add tests for Docker-missing, invalid NVIDIA key, and gateway-port-conflict scenarios.
- Evidence: runtime/run-scenario.sh is now a fail-fast stub. baseline.ts declares expectedFailure for `ubuntu-no-docker-preflight-negative`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`; the TS assertion registry only adds `runtime.expected-failure.no-side-effects` as a pending step. `onboarding_assertions/preflight/00-preflight-expected-failed.sh` requires `negative-preflight.log`, but the TS runner/orchestrators do not create it.
Live installer paths execute mutable network scripts without mandatory verification (test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh:19): Removing dry-run guards makes installer helpers consequential in a secret-bearing workflow. The public installer defaults to a mutable raw GitHub main-branch URL and verifies SHA256 only when an optional environment variable is provided. The Ollama helper still pipes a remote installer directly into bash.
- Recommendation: Pin installer sources to immutable refs or require expected digests in CI. Avoid curl|bash by downloading to a file and verifying it before execution, or use a trusted package source. Add tests that CI installer profiles fail when no pin/digest is configured.
- Evidence: public-curl.sh defaults `E2E_INSTALLER_URL` to `https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/scripts/install.sh` and only checks `E2E_INSTALLER_SHA256` when non-empty. ollama.sh runs `curl -fsSL --retry 3 --retry-delay 2 "${ollama_url}" | bash`.
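A mandatory digest check of the kind recommended above might be sketched as follows. The helper name `assertInstallerDigest` is hypothetical, and the sketch assumes the installer has already been downloaded into memory or read from a file rather than piped to bash.

```typescript
import { createHash } from "node:crypto";

// Throws unless an expected SHA-256 digest is provided and matches the downloaded bytes.
function assertInstallerDigest(contents: Buffer, expectedSha256: string | undefined): void {
  if (!expectedSha256) {
    // Fail closed: a missing digest is an error, not a reason to skip verification.
    throw new Error("installer digest is required");
  }
  const actual = createHash("sha256").update(contents).digest("hex");
  if (actual !== expectedSha256.toLowerCase()) {
    throw new Error(`installer digest mismatch: got ${actual}`);
  }
}
```

The difference from the behavior described in the evidence is the first branch: an empty or unset digest stops the run instead of allowing an unverified script to execute.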
Residual E2E_DRY_RUN branch contradicts the one-live-mode contract (test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh:15): The PR states that no flag, environment variable, helper, or branch in production paths bypasses real assertion execution, but the Slack provider assertion still honors E2E_DRY_RUN and emits pass markers without reading the runtime Slack config or querying OpenClaw runtime discovery.
- Recommendation: Remove the E2E_DRY_RUN branch or prove this script is outside every production scenario path. Add a static regression test that fails if `E2E_DRY_RUN`, `e2e_env_is_dry_run`, dry-run pass markers, or other dry-run bypasses appear in executable scenario/assertion paths.
- Evidence: 00-slack-provider-state.sh branches on `[[ -n "${E2E_DRY_RUN:-}" ]]` and emits dry-run `e2e_pass` markers. The assertion registry wires `messaging-slack` to this script for Slack scenarios.
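The static regression test recommended above could be built on a scanner like the sketch below. The function name and the in-memory `files` map are assumptions; a real test would first read the scenario and assertion scripts from disk, and the pattern list covers only the two identifiers named in this finding.

```typescript
// Patterns taken from the recommendation above; extend as needed.
const DRY_RUN_PATTERNS = [/E2E_DRY_RUN/, /e2e_env_is_dry_run/];

// Given file contents keyed by path, return "path:line" for every line matching a dry-run pattern.
function findDryRunBypasses(files: Record<string, string>): string[] {
  const hits: string[] = [];
  for (const [file, text] of Object.entries(files)) {
    text.split("\n").forEach((line, index) => {
      if (DRY_RUN_PATTERNS.some((pattern) => pattern.test(line))) {
        hits.push(`${file}:${index + 1}`);
      }
    });
  }
  return hits;
}
```

A test would then assert that the returned list is empty for every executable scenario and assertion path.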
Skeleton pending refs remain in production scenario plans (test/e2e-scenario/scenarios/assertions/environment.ts:17): The PR body lists `phase-1-skeleton` as audited absent from production paths, but the environment baseline still defines a pending step with that ref and scenario plan construction includes that baseline for every scenario. Pending steps are skipped, creating non-failing gaps in live runs.
- Recommendation: Remove skeleton assertion modules from production scenario plans, or fail closed when any `phase-1-skeleton` or pending production step is encountered. Add a static test for the audited-absent list.
- Evidence: environment.ts defines `implementation: { kind: "pending", ref: "phase-1-skeleton" }`, and `assertionGroupsForScenario()` includes `environmentBaseline()` for scenario plans. runtime.ts also still defines a `phase-1-skeleton` placeholder module.
Workflow runner routing still drifts from the typed scenario registry (.github/workflows/e2e-scenarios.yaml:58): The workflow still uses a hardcoded ROUTES map while the typed registry contains canonical scenarios that are not routed here. The previous registry validation before route lookup was removed, so registry-accepted scenarios can be rejected by the executable workflow path. This contradicts the linked matrix-generation goal that adding a scenario to baseline.ts should be sufficient.
- Recommendation: Make workflow runner selection consume the same typed routing source as the registry, or add a static contract test that every `listScenarios()` ID has a workflow route. Until then, add the missing routes or exclude unsupported scenarios from workflow dispatch.
- Evidence: e2e-scenarios.yaml defines ROUTES and omits registry IDs such as `ubuntu-repo-cloud-openclaw-custom-policies`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`. The diff removed the prior `npx tsx ... --plan-only` validation before route lookup.
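The static contract test suggested above reduces to a set difference between registry IDs and workflow routes. In the sketch below, `unroutedScenarios` is a hypothetical helper, and the route map is assumed to have been parsed out of the workflow file already.

```typescript
// Returns registry scenario IDs that have no entry in the workflow route map.
function unroutedScenarios(scenarioIds: string[], routes: Record<string, string>): string[] {
  return scenarioIds.filter((id) => !Object.prototype.hasOwnProperty.call(routes, id));
}
```

A contract test would pass the output of `listScenarios()` mapped to IDs and fail when the result is non-empty.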
--emit-matrix does not match the matrix contract of linked PR #4359 (feat(e2e): generate scenario fan-out matrix from typed registry) (test/e2e-scenario/scenarios/run.ts:56): This PR adds --emit-matrix for the dynamic-matrix workflow, but it emits an object whose include entries contain only id and description. PR #4359 describes a single-line JSON array whose entries include id, runner, label, platform, and suites. The two contracts are incompatible.
- Recommendation: Converge on one matrix contract before merging: implement the #4359 shape here, remove or defer --emit-matrix from this PR, or coordinate with the linked PR so all callers and tests consume the same JSON shape.
- Evidence: run.ts builds `payload = { include: listScenarios().map(({ id, description }) => ({ id, description })) }`. PR #4359 states that `--emit-matrix` prints a single-line JSON array with `id`, `runner`, `label`, `platform`, and `suites`.
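For comparison, the array-shaped contract described in this finding could be sketched as below. The field names come from the finding itself; the value types, the `MatrixEntry` name, and the `emitMatrix` signature are assumptions, not the code from either PR.

```typescript
// Field names come from the contract quoted above; the value types are assumptions.
interface MatrixEntry {
  id: string;
  runner: string;
  label: string;
  platform: string;
  suites: string[];
}

// JSON.stringify without an indent argument produces a single line.
function emitMatrix(entries: MatrixEntry[]): string {
  return JSON.stringify(entries);
}
```

A consumer written against this shape parses a top-level array, so it would not accept the `{ include: [...] }` object that run.ts currently builds.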

🔎 Worth checking

Source-of-truth review needed: Probe registry fallback: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: phase.ts returns `status: "skipped"` for probes; the assertion registry wires security suites to probes.
Source-of-truth review needed: Expected-failure side-effect placeholder: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: runtime/run-scenario.sh was stubbed; registry adds only a pending side-effect check; preflight expected-failed assertion requires a negative log that TS does not create.
Source-of-truth review needed: Workflow runner routing source of truth: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: ROUTES omits registry IDs such as custom-policies and negative scenarios; previous plan-only registry validation was removed.
Source-of-truth review needed: --emit-matrix compatibility with linked PR #4359 (feat(e2e): generate scenario fan-out matrix from typed registry): The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: run.ts emitMatrix builds an object with include entries containing only id and description.
Source-of-truth review needed: Installer trust policy for live scenario runs: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: public-curl.sh defaults to raw GitHub main with optional SHA; ollama.sh pipes `https://ollama.ai/install.sh` to bash.
Source-of-truth review needed: Shell action/assertion path resolution: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: phase.ts uses `path.isAbsolute` branches without realpath containment.
Source-of-truth review needed: Stable alias path for legacy onboard.log: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: e2e-phase-orchestrators.test.ts checks the alias; compiler.ts and phase.ts implement the compatibility copy.
Shell refs and alias paths lack repo-containment guards (test/e2e-scenario/scenarios/orchestrators/phase.ts:83): The orchestrator accepts absolute action/script refs and absolute alias paths, resolving them without realpath containment. Current scenario refs are repository-defined, but this primitive is broad for future scenario, manifest, or registry inputs and could turn a metadata mistake into execution or write outside the intended E2E tree.
- Recommendation: Reject absolute refs unless explicitly allowlisted, realpath-check repo-relative scripts under the repository root, and constrain alias paths under the context directory. Add negative tests for absolute script refs, `..` traversal, symlink escapes, and absolute alias paths.
- Evidence: phase.ts uses `path.isAbsolute(action.scriptRef) ? action.scriptRef : path.resolve(REPO_ROOT, action.scriptRef)` and similarly resolves shell step refs. On success it copies evidence to `path.isAbsolute(action.aliasPath) ? action.aliasPath : path.join(ctx.contextDir, action.aliasPath)`.
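The containment guard recommended above could be sketched as follows. `resolveContained` is a hypothetical helper, not a function in phase.ts, and it assumes the referenced script already exists on disk so that its real path can be resolved.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Resolve `ref` under `root` and reject anything that escapes it, including via symlinks.
function resolveContained(root: string, ref: string): string {
  if (path.isAbsolute(ref)) {
    throw new Error(`absolute ref rejected: ${ref}`);
  }
  const realRoot = fs.realpathSync(root);
  // realpathSync throws if the target does not exist, which also rejects dangling refs.
  const realTarget = fs.realpathSync(path.resolve(realRoot, ref));
  const relative = path.relative(realRoot, realTarget);
  const escapes =
    relative === "" ||
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative);
  if (escapes) {
    throw new Error(`ref escapes root: ${ref}`);
  }
  return realTarget;
}
```

The same check, with the context directory as `root`, would apply to alias paths, although an alias that does not exist yet would need its parent directory resolved instead of the file itself.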
Workflow invokes npx without a no-install/local guard (.github/workflows/e2e-scenarios.yaml:134): The workflow installs dependencies with lifecycle scripts disabled, but then invokes `npx tsx` directly. If dependency state drifts or tsx is absent, this can become a network dependency path in a trusted workflow that also receives provider secrets.
- Recommendation: Invoke the repository-local `./node_modules/.bin/tsx` after `npm ci`, or use `npx --no-install tsx`, and add a workflow-boundary test for the no-network execution contract.
- Evidence: Both Linux and WSL workflow paths run `npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"` after `npm ci --ignore-scripts`.
Legacy onboard.log alias is a compatibility workaround without a removal contract (test/e2e-scenario/scenarios/compiler.ts:126): The compiler emits `aliasPath: "onboard.log"` so legacy assertions can keep reading a fixed filename, and the orchestrator copies evidence there best-effort. The test covers the alias behavior, but the source-of-truth review does not identify when legacy assertions will be migrated to action evidence paths or when the alias can be removed.
- Recommendation: Document the invalid legacy contract, keep the regression test, and add an explicit migration/removal condition. Prefer updating assertions to consume typed action evidence directly once that source is available.
- Evidence: compiler.ts assigns `aliasPath: "onboard.log"` for onboarding actions; phase.ts copies the evidence log to the alias best-effort. e2e-phase-orchestrators.test.ts asserts the alias exists for legacy shell assertions.

🌱 Nice ideas

None.

Since last review details

Current findings:

Source-of-truth review needed: Probe registry fallback: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: phase.ts returns `status: "skipped"` for probes; the assertion registry wires security suites to probes.
Source-of-truth review needed: Expected-failure side-effect placeholder: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: runtime/run-scenario.sh was stubbed; registry adds only a pending side-effect check; preflight expected-failed assertion requires a negative log that TS does not create.
Source-of-truth review needed: Workflow runner routing source of truth: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: ROUTES omits registry IDs such as custom-policies and negative scenarios; previous plan-only registry validation was removed.
Source-of-truth review needed: --emit-matrix compatibility with linked PR #4359 (feat(e2e): generate scenario fan-out matrix from typed registry): The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: run.ts emitMatrix builds an object with include entries containing only id and description.
Source-of-truth review needed: Installer trust policy for live scenario runs: The advisor marked localized patch analysis as missing.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: public-curl.sh defaults to raw GitHub main with optional SHA; ollama.sh pipes `https://ollama.ai/install.sh` to bash.
Source-of-truth review needed: Shell action/assertion path resolution: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: phase.ts uses `path.isAbsolute` branches without realpath containment.
Source-of-truth review needed: Stable alias path for legacy onboard.log: The advisor marked localized patch analysis as needs_followup.
- Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
- Evidence: e2e-phase-orchestrators.test.ts checks the alias; compiler.ts and phase.ts implement the compatibility copy.
Secret-bearing child process output is persisted and uploaded without redaction (test/e2e-scenario/scenarios/orchestrators/phase.ts:112): The live runner now executes install, onboarding, and assertion subprocesses with the full workflow environment, including provider secrets, then writes raw stdout/stderr to evidence logs and copies stderr tails into phase result JSON. The workflow uploads the entire .e2e tree with hidden files included, so any installer, onboarding command, provider response, or CLI error that prints a token can persist that secret as an artifact.
- Recommendation: Use a minimal allowlisted subprocess environment, centrally redact known secret values and secret-shaped strings before writing logs/result JSON, and restrict artifact upload paths to sanitized outputs. Add a regression test where a child prints a secret-shaped value and no .e2e log/result artifact contains it.
- Evidence: phase.ts builds child envs with `...process.env`, pipes child stdout/stderr to createWriteStream logs, and includes `stderrTail` in failure messages. .github/workflows/e2e-scenarios.yaml passes `NVIDIA_API_KEY` to the runner and uploads `.e2e/` with `include-hidden-files: true`.
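
The recommendation above (allowlisted child environment plus central redaction) could look roughly like the following. This is a minimal sketch, not the PR's code: `redactSecrets`, `minimalEnv`, the `[REDACTED]` marker, and the token prefixes in `SECRET_SHAPED` are all illustrative assumptions.

```typescript
// Hypothetical helpers; none of these names come from the PR.

// Illustrative token shapes only (assumed prefixes); a real list would be project-specific.
const SECRET_SHAPED = /\b(?:nvapi-|sk-|ghp_)[A-Za-z0-9_-]{16,}/g;

/** Replace known secret values and secret-shaped strings with a marker. */
export function redactSecrets(text: string, knownSecrets: readonly string[]): string {
  let out = text;
  // Longest first, so a secret that contains another one is removed whole.
  const ordered = [...knownSecrets]
    .filter((s) => s.length >= 8) // skip short values that would over-redact
    .sort((a, b) => b.length - a.length);
  for (const secret of ordered) {
    out = out.split(secret).join("[REDACTED]");
  }
  return out.replace(SECRET_SHAPED, "[REDACTED]");
}

/** Build a child environment from an explicit allowlist instead of `...process.env`. */
export function minimalEnv(
  source: Readonly<Record<string, string | undefined>>,
  allow: readonly string[],
): Record<string, string> {
  const env: Record<string, string> = {};
  for (const key of allow) {
    const value = source[key];
    if (value !== undefined) env[key] = value;
  }
  return env;
}
```

Under these assumptions, every chunk written to an evidence log or to `stderrTail` would pass through `redactSecrets` first, and only steps that genuinely need a provider key would have it added to their allowlist.
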
Required security probes and expected-failure checks still skip without failing live runs (test/e2e-scenario/scenarios/orchestrators/phase.ts:266): Security-sensitive suites for shields, policy enforcement, and injection blocking are represented as probe steps, and negative expected-failure side-effect validation is represented as a pending step. The orchestrator marks both kinds as skipped, while the top-level runner only exits nonzero for failed phase results. A live run can therefore omit required security and negative checks while still appearing non-failing if other assertions pass.
- Recommendation: Fail closed for required/security probe steps and expected-failure side-effect checks, or exclude those suites/scenarios from live workflow selection until implemented. Add tests proving skipped security probes and pending expected-failure checks make run.ts exit nonzero.
- Evidence: phase.ts returns `status: "skipped"` for `kind === "probe"` and `kind === "pending"`. The assertion registry maps `security-shields`, `security-policy`, and `security-injection` to probe steps and maps `runtime.expected-failure.no-side-effects` to a pending step. run.ts sets `process.exitCode` only when a phase status is `failed`.
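
One way to express the fail-closed rule is sketched below. The `StepResult` shape and its `required` flag are assumptions made for illustration; the PR's actual result types are not shown in this review.

```typescript
// Hypothetical result shape; `required` is an assumed flag, not a field from the PR.
type StepStatus = "passed" | "failed" | "skipped";

interface StepResult {
  id: string;
  status: StepStatus;
  required: boolean;
}

/** Steps that should make the run exit nonzero: failures, plus skipped required steps. */
export function blockingSteps(steps: readonly StepResult[]): StepResult[] {
  return steps.filter(
    (step) => step.status === "failed" || (step.status === "skipped" && step.required),
  );
}

/** Exit code for the whole run under the fail-closed rule. */
export function exitCodeFor(steps: readonly StepResult[]): number {
  return blockingSteps(steps).length > 0 ? 1 : 0;
}
```

With this rule, a skipped `security-shields` probe marked as required would fail the run, while an optional skipped step would not.
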
Negative scenarios no longer force or verify their expected-failure contracts (test/e2e-scenario/runtime/run-scenario.sh:1): Stubbing the bash runner removed the previous negative orchestration that forced Docker-missing, invalid-key, and gateway-port-conflict failures, captured negative logs, matched the expected failure, and checked forbidden side effects. The TS path has expectedFailure metadata but only a pending side-effect placeholder, so negative scenarios do not prove the declared failure mode or side-effect contract.
- Recommendation: Port expected-failure orchestration into the TS phase model before relying on these live runs: force the declared failure mode, write the negative log, derive observed phase/error/log/side effects, invoke the matcher, and fail on forbidden side effects. Add tests for Docker-missing, invalid NVIDIA key, and gateway-port-conflict scenarios.
- Evidence: runtime/run-scenario.sh is now a fail-fast stub. baseline.ts declares expectedFailure for `ubuntu-no-docker-preflight-negative`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`; the TS assertion registry only adds `runtime.expected-failure.no-side-effects` as a pending step. `onboarding_assertions/preflight/00-preflight-expected-failed.sh` requires `negative-preflight.log`, but the TS runner/orchestrators do not create it.
Live installer paths execute mutable network scripts without mandatory verification (test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh:19): Removing dry-run guards makes installer helpers consequential in a secret-bearing workflow. The public installer defaults to a mutable raw GitHub main-branch URL and verifies SHA256 only when an optional environment variable is provided. The Ollama helper still pipes a remote installer directly into bash.
- Recommendation: Pin installer sources to immutable refs or require expected digests in CI. Avoid curl|bash by downloading to a file and verifying it before execution, or use a trusted package source. Add tests that CI installer profiles fail when no pin/digest is configured.
- Evidence: public-curl.sh defaults `E2E_INSTALLER_URL` to `https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/scripts/install.sh` and only checks `E2E_INSTALLER_SHA256` when non-empty. ollama.sh runs `curl -fsSL --retry 3 --retry-delay 2 "${ollama_url}" | bash`.
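
The "download, verify, then execute" recommendation can be sketched as a mandatory digest check. The real installer helpers are shell scripts; this TypeScript version only illustrates the rule that a missing digest is an error rather than a skipped check. `sha256Hex` and `assertInstallerDigest` are hypothetical names.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

/** Hex-encoded SHA-256 of the downloaded installer content. */
export function sha256Hex(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

/**
 * Hypothetical guard: throws unless a well-formed expected digest is supplied
 * and matches the content. A missing digest fails closed instead of skipping.
 */
export function assertInstallerDigest(
  content: string | Uint8Array,
  expectedHex: string | undefined,
): void {
  const normalized = (expectedHex ?? "").trim().toLowerCase();
  if (!/^[0-9a-f]{64}$/.test(normalized)) {
    throw new Error("installer digest missing or malformed; refusing to execute");
  }
  const actual = Buffer.from(sha256Hex(content), "hex");
  const expected = Buffer.from(normalized, "hex");
  // Both buffers are 32 bytes here, so timingSafeEqual cannot throw on length.
  if (!timingSafeEqual(actual, expected)) {
    throw new Error("installer digest mismatch; refusing to execute");
  }
}
```
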
Residual E2E_DRY_RUN branch contradicts the one-live-mode contract (test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh:15): The PR states that no flag, environment variable, helper, or branch in production paths bypasses real assertion execution, but the Slack provider assertion still honors E2E_DRY_RUN and emits pass markers without reading the runtime Slack config or querying OpenClaw runtime discovery.
- Recommendation: Remove the E2E_DRY_RUN branch or prove this script is outside every production scenario path. Add a static regression test that fails if `E2E_DRY_RUN`, `e2e_env_is_dry_run`, dry-run pass markers, or other dry-run bypasses appear in executable scenario/assertion paths.
- Evidence: 00-slack-provider-state.sh branches on `[[ -n "${E2E_DRY_RUN:-}" ]]` and emits dry-run `e2e_pass` markers. The assertion registry wires `messaging-slack` to this script for Slack scenarios.
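
The static regression test suggested above could be built on a scanner like this one. It is a sketch under assumptions: `findDryRunBypasses` is a hypothetical helper, the caller is assumed to have read the scenario and assertion scripts into a map of path to source, and only the two markers named in the recommendation are checked.

```typescript
// Markers taken from the recommendation text; extend as needed.
const DRY_RUN_MARKERS: readonly RegExp[] = [/\bE2E_DRY_RUN\b/, /\be2e_env_is_dry_run\b/];

/** Return "file:line" locations where a dry-run marker appears. */
export function findDryRunBypasses(files: ReadonlyMap<string, string>): string[] {
  const hits: string[] = [];
  files.forEach((source, file) => {
    source.split("\n").forEach((line, index) => {
      if (DRY_RUN_MARKERS.some((marker) => marker.test(line))) {
        hits.push(`${file}:${index + 1}`);
      }
    });
  });
  return hits;
}
```

A framework test would then assert that the returned list is empty for every executable scenario and assertion path.
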
Skeleton pending refs remain in production scenario plans (test/e2e-scenario/scenarios/assertions/environment.ts:17): The PR body lists `phase-1-skeleton` as audited absent from production paths, but the environment baseline still defines a pending step with that ref and scenario plan construction includes that baseline for every scenario. Pending steps are skipped, creating non-failing gaps in live runs.
- Recommendation: Remove skeleton assertion modules from production scenario plans, or fail closed when any `phase-1-skeleton` or pending production step is encountered. Add a static test for the audited-absent list.
- Evidence: environment.ts defines `implementation: { kind: "pending", ref: "phase-1-skeleton" }`, and `assertionGroupsForScenario()` includes `environmentBaseline()` for scenario plans. runtime.ts also still defines a `phase-1-skeleton` placeholder module.
Workflow runner routing still drifts from the typed scenario registry (.github/workflows/e2e-scenarios.yaml:58): The workflow still uses a hardcoded ROUTES map while the typed registry contains canonical scenarios that are not routed here. The previous registry validation before route lookup was removed, so registry-accepted scenarios can be rejected by the executable workflow path. This contradicts the linked matrix-generation goal that adding a scenario to baseline.ts should be sufficient.
- Recommendation: Make workflow runner selection consume the same typed routing source as the registry, or add a static contract test that every `listScenarios()` ID has a workflow route. Until then, add the missing routes or exclude unsupported scenarios from workflow dispatch.
- Evidence: e2e-scenarios.yaml defines ROUTES and omits registry IDs such as `ubuntu-repo-cloud-openclaw-custom-policies`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`. The diff removed the prior `npx tsx ... --plan-only` validation before route lookup.
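
The static contract test proposed above reduces to a set difference between registry IDs and routed IDs. In this sketch `missingRoutes` is a hypothetical helper, and the route values (runner labels) are placeholders; the review does not show the actual ROUTES contents.

```typescript
/** Registry scenario IDs that have no entry in the workflow route map. */
export function missingRoutes(
  scenarioIds: readonly string[],
  routes: Readonly<Record<string, string>>,
): string[] {
  return scenarioIds.filter((id) => !Object.prototype.hasOwnProperty.call(routes, id));
}

// Usage sketch: the runner label below is a placeholder, not a value from the workflow.
const exampleRoutes = { "ubuntu-repo-cloud-openclaw": "example-runner" };
export const exampleGaps = missingRoutes(
  ["ubuntu-repo-cloud-openclaw", "ubuntu-invalid-nvidia-key-negative"],
  exampleRoutes,
);
```

A contract test would fail whenever the returned list is non-empty, so adding a scenario to the registry without routing it is caught before dispatch.
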
--emit-matrix does not match the matrix contract of linked PR #4359, feat(e2e): generate scenario fan-out matrix from typed registry (test/e2e-scenario/scenarios/run.ts:56): This PR adds --emit-matrix for the dynamic-matrix workflow, but it emits an object with include entries containing only id and description. PR #4359 describes a single-line JSON array whose entries include id, runner, label, platform, and suites. The two contracts are incompatible.
- Recommendation: Converge on one matrix contract before merging: implement the #4359 shape here, remove or defer --emit-matrix from this PR, or coordinate the linked PR so all callers and tests consume the same JSON shape.
- Evidence: run.ts builds `payload = { include: listScenarios().map(({ id, description }) => ({ id, description })) }`. PR #4359 states `--emit-matrix` prints a single-line JSON array with `id`, `runner`, `label`, `platform`, and `suites`.
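
For comparison, the array-shaped contract described for #4359 might be emitted as sketched below. Only the five field names come from the finding; the field types (for example `suites` as a string array) and the `ScenarioInfo` name are assumptions.

```typescript
// Field names from the finding; types are assumed for illustration.
interface ScenarioInfo {
  id: string;
  runner: string;
  label: string;
  platform: string;
  suites: string[];
}

/** Serialize scenarios as a single-line JSON array with exactly the five contract fields. */
export function emitMatrix(scenarios: readonly ScenarioInfo[]): string {
  const entries = scenarios.map(({ id, runner, label, platform, suites }) => ({
    id,
    runner,
    label,
    platform,
    suites,
  }));
  // JSON.stringify without an indent argument produces one line.
  return JSON.stringify(entries);
}
```
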
Shell refs and alias paths lack repo-containment guards (test/e2e-scenario/scenarios/orchestrators/phase.ts:83): The orchestrator accepts absolute action/script refs and absolute alias paths, resolving them without realpath containment. Current scenario refs are repository-defined, but this primitive is broad for future scenario, manifest, or registry inputs and could turn a metadata mistake into execution or write outside the intended E2E tree.
- Recommendation: Reject absolute refs unless explicitly allowlisted, realpath-check repo-relative scripts under the repository root, and constrain alias paths under the context directory. Add negative tests for absolute script refs, `..` traversal, symlink escapes, and absolute alias paths.
- Evidence: phase.ts uses `path.isAbsolute(action.scriptRef) ? action.scriptRef : path.resolve(REPO_ROOT, action.scriptRef)` and similarly resolves shell step refs. On success it copies evidence to `path.isAbsolute(action.aliasPath) ? action.aliasPath : path.join(ctx.contextDir, action.aliasPath)`.
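
A containment guard along the lines of the recommendation could look like this. It is a minimal sketch: `resolveContained` and `selfCheck` are hypothetical names, and the policy shown (reject every absolute ref, no allowlist) is the simplest variant rather than the one the PR must adopt.

```typescript
import { mkdirSync, mkdtempSync, realpathSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import * as path from "node:path";

/** Resolve a repo-relative ref and require its real path to stay under root. */
export function resolveContained(root: string, ref: string): string {
  if (path.isAbsolute(ref)) throw new Error(`absolute ref rejected: ${ref}`);
  const realRoot = realpathSync(root);
  // realpathSync follows symlinks and throws if the target does not exist.
  const realTarget = realpathSync(path.resolve(realRoot, ref));
  const rel = path.relative(realRoot, realTarget);
  const escapes =
    rel === "" || rel === ".." || rel.startsWith(`..${path.sep}`) || path.isAbsolute(rel);
  if (escapes) throw new Error(`ref escapes root: ${ref}`);
  return realTarget;
}

/** Builds a throwaway tree in the OS temp dir and exercises the guard. */
export function selfCheck(): { inside: boolean; traversalRejected: boolean; absoluteRejected: boolean } {
  const base = mkdtempSync(path.join(tmpdir(), "contain-"));
  const root = path.join(base, "repo");
  mkdirSync(path.join(root, "scripts"), { recursive: true });
  writeFileSync(path.join(root, "scripts", "ok.sh"), "echo ok\n");
  writeFileSync(path.join(base, "outside.sh"), "echo outside\n");
  const rejects = (ref: string): boolean => {
    try {
      resolveContained(root, ref);
      return false;
    } catch {
      return true;
    }
  };
  return {
    inside: resolveContained(root, "scripts/ok.sh").endsWith(path.join("scripts", "ok.sh")),
    traversalRejected: rejects("../outside.sh"),
    absoluteRejected: rejects(path.join(base, "outside.sh")),
  };
}
```

Because the check compares real paths, a symlink inside the repository that points outside it is rejected as well. Alias paths could use the same guard with the context directory as root, adapted for destinations that do not exist yet.
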
Workflow invokes npx without a no-install/local guard (.github/workflows/e2e-scenarios.yaml:134): The workflow installs dependencies with lifecycle scripts disabled, but then invokes `npx tsx` directly. If dependency state drifts or tsx is absent, this can become a network dependency path in a trusted workflow that also receives provider secrets.
- Recommendation: Invoke the repository-local `./node_modules/.bin/tsx` after `npm ci`, or use `npx --no-install tsx`, and add a workflow-boundary test for the no-network execution contract.
- Evidence: Both Linux and WSL workflow paths run `npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"` after `npm ci --ignore-scripts`.
Legacy onboard.log alias is a compatibility workaround without a removal contract (test/e2e-scenario/scenarios/compiler.ts:126): The compiler emits `aliasPath: "onboard.log"` so legacy assertions can keep reading a fixed filename, and the orchestrator copies evidence there best-effort. The test covers the alias behavior, but the source-of-truth review does not identify when legacy assertions will be migrated to action evidence paths or when the alias can be removed.
- Recommendation: Document the invalid legacy contract, keep the regression test, and add an explicit migration/removal condition. Prefer updating assertions to consume typed action evidence directly once that source is available.
- Evidence: compiler.ts assigns `aliasPath: "onboard.log"` for onboarding actions; phase.ts copies the evidence log to the alias best-effort. e2e-phase-orchestrators.test.ts asserts the alias exists for legacy shell assertions.

This is an automated advisory review. A human maintainer must make the final merge decision.

…shape client check

- PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish before resolving so callers (and tests) reading evidence synchronously see the actual stdout/stderr instead of an empty file. Race exposed by e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'.
- e2e-phase-orchestrators: replace client-source toMatch regex (1 source-shape test, budget=0) with a runtime-shape behavior assertion on the HostCliClient observation. Still enforces 'clients do not encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E architecture spec, without violating source-shape budget.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>


Closes the spec's reopened Phase 6 gap. The new typed runner now executes the install and onboarding work that the deleted bash runner used to perform, but inside EnvironmentOrchestrator and OnboardingOrchestrator instead of in workflow YAML or a resurrected bash runner. All canonical scenarios now reach a real, live SUT before their assertions run.

Architecture (per hybrid-scenario-e2e-architecture spec):

* types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef, fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace the prior actions: string[] free-form labels with PhaseAction[]. Add actions[] to PhaseResult so failure-layer attribution stays clear: setup failure is recorded distinctly from assertion failure.
* compiler.ts: phaseActions() now emits typed actions for environment (context.emit + install.<id>) and onboarding (profile.<id>). Stable action ids: environment.context.emit, environment.install.<install-id>, onboarding.profile.<profile-id>. All install/onboard actions point at the existing dispatcher scripts (install/dispatch.sh, onboard/dispatch.sh) - shell remains the implementation per spec, invocation is centralized.
* orchestrators/phase.ts: PhaseOrchestrator.run() executes actions before assertions. Action failure short-circuits the phase so assertions never run against an environment that was never set up. Action runner reuses the same spawn/timeout/process-group/log-flush machinery as runShellStep. Per-action timeout, no retry (install and onboarding must fail loudly).
* nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only new shell file). The install/onboard dispatchers are intentionally library-style (function definitions only); this launcher gives them a deterministic executable entrypoint that sources runtime/lib/env.sh + runtime/lib/context.sh, applies non-interactive env, sources the requested dispatcher, and invokes the named function with one arg. Replaces the orchestration that the deleted run-scenario.sh used to do, but called from the typed orchestrator instead.
* plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>' per phase, before assertion groups. Maintainers can preview the full setup+onboard+assert sequence before dispatch.
* framework-tests/e2e-phase-orchestrators.test.ts: add five behavior tests covering action-runs-before-assertions, action-failure short-circuits-assertions, action timeout via orchestrator policy, evidence-log flushed-before-resolve, and compiler emits typed install/onboard actions for all 7 canonical scenarios.

What stays out:

* No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios ...'. Workflow YAML stays innocent of install/onboard plumbing.
* No client edits. HostCliClient et al. remain pass/fail/policy free.
* No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter remain unsupported.

Validation gate (Phase 6 reopen note) is the next step: after this push goes green on PR CI, dispatch e2e-scenarios-all.yaml against feat/e2e-real-execution and confirm canonical scenarios produce real phase results with action evidence under .e2e/actions/<id>.log, instead of <1s 'failed=34 skipped=5'.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

test/e2e-scenario/scenarios/types.ts (1)

110-131: ⚡ Quick win

Strengthen PhaseAction typing (discriminated union) to enforce per-kind required fields.

Current runtime already fails loudly if a "shell-fn" action is missing fn (the runner passes action.fn ?? "", and dispatch-action.sh errors when the function name isn’t found). Also, current PhaseAction objects are produced in test/e2e-scenario/scenarios/compiler.ts with fn populated for "shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.

Suggested refactor

-export interface PhaseAction {
-  id: string;
-  phase: PhaseName;
-  description?: string;
-  // "shell-fn" sources the bash dispatcher and invokes the named function.
-  // "shell"    runs an executable script (used for context-emit helper).
-  kind: "shell-fn" | "shell";
-  // Repo-relative path to the script.
-  scriptRef: string;
-  // For "shell-fn": the bash function to invoke after sourcing scriptRef.
-  fn?: string;
-  // Single positional arg passed to the function/script (install method or
-  // onboarding profile id today). Kept as a single string to keep stable
-  // ids predictable; multi-arg variants can extend this later.
-  arg?: string;
-  // Per-action timeout. No retry by default - install/onboard must fail
-  // loudly so the regression is visible. Retry stays a property of
-  // assertion steps, not actions.
-  timeoutSeconds?: number;
-  // Repo-relative evidence log path.
-  evidencePath?: string;
-}
+interface PhaseActionBase {
+  id: string;
+  phase: PhaseName;
+  description?: string;
+  scriptRef: string;
+  timeoutSeconds?: number;
+  evidencePath?: string;
+}
+
+export type PhaseAction =
+  | (PhaseActionBase & {
+      kind: "shell-fn";
+      fn: string;
+      arg?: string;
+    })
+  | (PhaseActionBase & {
+      kind: "shell";
+      arg?: string;
+    });
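
To illustrate what the suggested union buys a consumer, here is a small sketch of a runner-side helper that narrows on `kind`. `argvFor` is hypothetical, `PhaseName` is assumed to be a string alias, and the function name in the usage example is a placeholder rather than a real dispatcher function.

```typescript
type PhaseName = string; // assumption: the real PhaseName is likely a narrower union

interface PhaseActionBase {
  id: string;
  phase: PhaseName;
  description?: string;
  scriptRef: string;
  timeoutSeconds?: number;
  evidencePath?: string;
}

export type PhaseAction =
  | (PhaseActionBase & { kind: "shell-fn"; fn: string; arg?: string })
  | (PhaseActionBase & { kind: "shell"; arg?: string });

/** Hypothetical helper: build launcher arguments without any `action.fn ?? ""` fallback. */
export function argvFor(action: PhaseAction): string[] {
  const tail = action.arg !== undefined ? [action.arg] : [];
  switch (action.kind) {
    case "shell-fn":
      // `fn` is guaranteed to be a string in this branch.
      return [action.scriptRef, action.fn, ...tail];
    case "shell":
      return [action.scriptRef, ...tail];
  }
}

// Usage sketch; "example_install_fn" is a placeholder function name.
export const exampleArgv = argvFor({
  id: "environment.install.repo-current",
  phase: "environment",
  kind: "shell-fn",
  scriptRef: "install/dispatch.sh",
  fn: "example_install_fn",
  arg: "repo-current",
});
```

With the union in place, omitting `fn` from a `"shell-fn"` action is a compile error rather than a runtime failure inside the launcher.
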

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change
PhaseAction from a single broad interface into a discriminated union so
TypeScript enforces per-kind fields: define one variant for kind: "shell-fn"
that requires fn: string (plus shared fields like id, phase, scriptRef, arg?,
timeoutSeconds?, evidencePath?, description?) and another variant for kind:
"shell" that omits/marks fn as disallowed/undefined; update any usages (e.g.,
the action objects created in test/e2e-scenario/scenarios/compiler.ts and the
runner that reads action.fn) to satisfy the new types so compilation ensures
"shell-fn" always has fn and "shell" never does.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).

In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.


📥 Commits

Reviewing files that changed from the base of the PR and between 903f90b and 628870c.

📒 Files selected for processing (5)

test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
test/e2e-scenario/scenarios/compiler.ts
test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/scenarios/types.ts

🚧 Files skipped from review as they are similar to previous changes (2)

test/e2e-scenario/scenarios/orchestrators/phase.ts
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts

The first live dispatch of the Phase 6 wiring (run 26550310438) gave us real action evidence and surfaced three real bugs. All three are fixed inside the spec's prescribed layers - no workflow YAML, no client, no old-resolver path.

1. environment.context.emit was a shell action that called the legacy emit-context-from-plan.sh helper. That helper expects the OLD YAML-resolver plan.json shape (dimensions.platform.profile.os...), which the typed compiler does not produce. Drop the shell action; add scenarios/orchestrators/context.ts that derives a normalized context.env directly from the typed RunPlan and writes it from ScenarioRunner.run() before any phase. Spec: context emission is framework infrastructure, not a phase action.
2. PhaseOrchestrator.runShellStep was reading context.env from ${ctx.contextDir}/.e2e/context.env, but the shell helper writes to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so shell assertions see seeded keys.
3. ScenarioRunner did not short-circuit across phase boundaries: a failed environment ACTION (real setup work) still let onboarding and runtime run, producing a misleading 34-failure cascade. Runner now consults prior phase results: if any prior action failed, downstream phases are synthesized as skipped with a message naming the blocking phase+action+message. Assertion-only failures still propagate as failures.

Tests added (8 new, 292/292 scenario framework tests green). Validation gate next: dispatch e2e-scenarios-all.yaml again.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
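
The cross-phase short-circuit described in this commit can be sketched as follows. The types and function names are illustrative assumptions, not the PR's ScenarioRunner; only the rule (a failed action blocks later phases, an assertion-only failure does not) comes from the commit message.

```typescript
type Status = "passed" | "failed" | "skipped";

interface ActionResult {
  id: string;
  status: Status;
  message?: string;
}

interface PhaseResult {
  phase: string;
  status: Status;
  actions: ActionResult[];
  message?: string;
}

/** First failed action in any earlier phase, if there is one. */
export function findBlocker(
  prior: readonly PhaseResult[],
): { phase: string; action: ActionResult } | undefined {
  for (const result of prior) {
    const action = result.actions.find((a) => a.status === "failed");
    if (action) return { phase: result.phase, action };
  }
  return undefined;
}

/** Run phases in order; once an action has failed, synthesize later phases as skipped. */
export function runPhases(
  phases: readonly string[],
  runPhase: (phase: string) => PhaseResult,
): PhaseResult[] {
  const results: PhaseResult[] = [];
  for (const phase of phases) {
    const blocker = findBlocker(results);
    if (blocker) {
      const reason = blocker.action.message ?? "action failed";
      results.push({
        phase,
        status: "skipped",
        actions: [],
        message: `blocked by ${blocker.phase}/${blocker.action.id}: ${reason}`,
      });
    } else {
      results.push(runPhase(phase));
    }
  }
  return results;
}
```

Because synthesized phases carry no actions, every downstream phase names the original blocking action rather than the phase immediately before it.
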

Run 26550936453 surfaced two more real bugs after Phase 6 wiring went live. Both fixed inside the spec's prescribed layers; nothing leaks into workflow YAML or clients.

1. dispatch-action.sh called e2e_context_init unconditionally before sourcing the install/onboard dispatcher. e2e_context_init opens context.env with `: > ctx`, which truncated the file the ScenarioRunner had just seeded. All runtime assertions then failed with 'e2e context: missing required key(s): E2E_SCENARIO ...'. Fix: dispatch-action.sh no longer calls e2e_context_init. The TS framework owns context.env initialization; workers may still extend it via e2e_context_set.
2. The legacy onboarding.preflight.passed assertion expects an onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash runner used to redirect onboarding output there; the typed orchestrator captured it under .e2e/actions/<action-id>.log. Fix: add optional aliasPath to PhaseAction; compiler sets aliasPath to 'onboard.log' for the onboarding profile action; orchestrator copies the action evidence log to the alias on success. Best-effort - alias copy failures do not fail the action.

Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):

- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1

Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294 scenario framework. shellcheck clean. Validation gate next: redispatch e2e-scenarios-all and confirm runtime steps actually exercise the SUT (real pass/fail, not key errors).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
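
The best-effort alias copy from item 2 might be sketched like this. `copyAliasBestEffort` is a hypothetical name, and the sketch omits any containment check on the alias path.

```typescript
import { copyFileSync, mkdirSync } from "node:fs";
import * as path from "node:path";

/**
 * Copy an action's evidence log to a stable alias under the context directory.
 * Returns false instead of throwing, so a failed copy never fails the action.
 */
export function copyAliasBestEffort(
  evidenceLog: string,
  contextDir: string,
  aliasPath: string,
): boolean {
  try {
    const destination = path.join(contextDir, aliasPath);
    mkdirSync(path.dirname(destination), { recursive: true });
    copyFileSync(evidenceLog, destination);
    return true;
  } catch {
    return false; // best-effort: the caller may log this but should not fail
  }
}
```

The trade-off is that a missing alias then shows up later as a failing legacy assertion, so a caller would probably want to record the `false` result in the action's evidence.
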

The first canonical-scenario run that reached the onboarding-assertion phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the legacy onboarding.preflight.passed assertion fails on every successful run because its regex matches any mention of 'docker' / 'container' / 'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding mentions all of those many times. The action itself succeeded (exit 0, 263s of real onboarding work); the assertion is meant to confirm onboard.log does not contain explicit preflight FAILURE markers.

Tighten the regex accordingly: match phrases like 'preflight failed/error', 'cannot connect to the docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker daemon' - not bare topic words.

Verified: shellcheck passes; bash -n passes.

Why we stop here on this PR: This commit lands the last small framework-level fix produced by live action evidence. The Phase 6 wiring is now fully validated end-to-end:

  Install:    passed (~12s)
  Onboarding: action passed (~263s real onboarding)
              base.cli-installed passed
              preflight.passed will now pass
  Runtime:    9 passed / 25 failed / 5 skipped against live SUT

The remaining 25 runtime failures are real product/test bugs surfaced by finally executing the suite against a live SUT (sandbox-shell timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after retries). They are pre-existing and out of scope for 'execute real shell assertions; delete dry-run, --validate-only, and the bash runner'. They become productive follow-up issues.

The 5 skipped runtime steps are 'probe not registered' - known per spec; probe registry lands in a follow-up.

Negative scenarios (ubuntu-no-docker-preflight-negative, invalid-key-negative, gateway-port-conflict-negative) need expected-failure semantics and a way to actually simulate docker-missing on the runner. Out of scope here; tracked as follow-up.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
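
The difference between matching topic words and matching failure phrases can be shown with a short sketch. The real assertion is a shell script; this TypeScript version only mirrors the phrases listed in the commit message, and the case-insensitive flag is an assumption.

```typescript
// Old behavior, per the commit message: any mention of a topic word counts as a failure.
const TOPIC_WORDS = /docker|container|daemon|socket/i;

// Tightened pattern built from the phrases listed in the commit message.
const PREFLIGHT_FAILURE =
  /preflight (?:failed|error)|cannot connect to the docker daemon|onboarding aborted|FATAL: docker|ERROR: docker daemon/i;

export function looksFailedOld(log: string): boolean {
  return TOPIC_WORDS.test(log);
}

export function hasPreflightFailure(log: string): boolean {
  return PREFLIGHT_FAILURE.test(log);
}
```

A healthy onboarding log that merely mentions the Docker daemon trips the old check but not the tightened one, while an explicit "cannot connect to the docker daemon" line trips both.
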


All three findings are valid mechanical bugs introduced/touched by this PR. Batched per the safe-batch policy: same-risk, independently obvious, testable together.

1. orchestrators/phase.ts classifierForRef: Outer guard is /i (case-insensitive), but the inner branch used case-sensitive ref.includes("tunnel") / ref.includes("cloudflared") - mixed-case refs would fall through and misclassify as provider-transient. Replace with /tunnel|cloudflared/i.test(ref).
2. scenarios/compiler.ts phaseActions: Inline comment said "the scenario is malformed; surface it as a hard error" but the code returned []. Hard-fail instead, with a message that names the missing dimension. Empty environment is still tolerated (skeleton scenarios can carry no setup yet).
3. validation_suites/lib/sandbox_lifecycle.sh: Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely match sb10. Use awk with a whole-token equality check on column one of `nemoclaw list` output.

Tests: 294/294 scenario framework still green. shellcheck + shfmt clean. No behavior change for canonical scenarios; affected paths were either dormant (case-mixed classifier) or returning a slightly stricter outcome (compiler hard-fail, sandbox exact match).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
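
Items 1 and 3 are both matching bugs, and a small sketch makes the before/after behavior concrete. The sandbox check in the PR is awk in a shell library; the TypeScript below is only an illustration, and the `nemoclaw list` output format (name in the first whitespace-separated column) is assumed from the commit message.

```typescript
/** Item 1: case-insensitive check, so mixed-case refs classify the same way. */
export function mentionsTunnel(ref: string): boolean {
  return /tunnel|cloudflared/i.test(ref);
}

/** Item 3, old behavior: substring match, which lets "sb1" match "sb10". */
export function sandboxListedBySubstring(listOutput: string, name: string): boolean {
  return listOutput.includes(name);
}

/** Item 3, fixed behavior: whole-token equality on the first column of each line. */
export function sandboxListed(listOutput: string, name: string): boolean {
  return listOutput
    .split("\n")
    .some((line) => line.trim().split(/\s+/)[0] === name);
}
```
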

Merge remote-tracking branch 'origin/main' into feat/e2e-real-execution

2716ab0

# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

coderabbitai Bot reviewed May 28, 2026

Comment thread test/e2e-scenario/scenarios/orchestrators/phase.ts

Comment thread test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh Outdated

jyaunches added the v0.0.55 Release target label May 28, 2026

jyaunches added 3 commits May 27, 2026 21:21

fix(e2e): drop extra trailing newline (end-of-file-fixer)

903f90b

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>

coderabbitai Bot reviewed May 28, 2026

Comment thread test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh Outdated

Comment thread test/e2e-scenario/scenarios/compiler.ts Outdated

jyaunches added 5 commits May 27, 2026 22:19

style(e2e): shfmt indentation on preflight-passed continuation lines

1d128a1

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
