test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner #4380

Open
jyaunches wants to merge 10 commits into main from feat/e2e-real-execution

@jyaunches jyaunches commented May 28, 2026

Why

This PR removes --dry-run and plan-mode execution from the E2E runner because both let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.

Two seams needed closing:

  1. phase.ts:executeStep had no child_process.spawn — every real shell/probe step fell through to a hardcoded failed: unsupported live step, and four placeholder refs (fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton) were the only paths that reported passed.
  2. ~30 shell scripts under validation_suites/, onboarding_assertions/, and nemoclaw_scenarios/ began with if e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed --dry-run the scripts exited 0 before running.

What changed

Orchestrator (TypeScript)

  • phase.ts:executeStep now spawns shell steps via child_process.spawn, with detached process groups so timeouts kill bash + sleep cleanly via negative-pid signal. Probe steps return skipped: "probe not registered" until the registry lands. Pending steps return skipped: "pending: <ref>". Unknown kinds throw. Real evidence captured to .e2e/logs/<step-id>.log. Step-level reliability.timeoutSeconds and retry.{attempts,on} enforced here, not in clients.
  • run.ts: --dry-run, --validate-only deleted. Default invocation is live execution. --list and --plan-only (local debug) survive read-only. --emit-matrix added for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359).
  • types.ts: RunContext.dryRun deleted.
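The spawn-and-timeout behavior described for phase.ts:executeStep can be illustrated with a minimal sketch. This is not the PR's code: `runShellStep`, `StepResult`, and the message strings are hypothetical names chosen for illustration, and the sketch assumes a POSIX host where `bash` is on the PATH. It shows only the technique the description names: spawn with `detached: true` so the child leads its own process group, then signal the negative pid on timeout so that bash and anything it started (for example a `sleep`) are killed together.

```typescript
import { spawn } from "node:child_process";

// Hypothetical result shape; the PR's real AssertionResult type is not shown in this thread.
interface StepResult {
  status: "passed" | "failed";
  message: string;
}

// Hypothetical helper illustrating the technique, not the PR's executeStep.
function runShellStep(command: string, timeoutSeconds: number): Promise<StepResult> {
  return new Promise((resolve) => {
    // detached: true makes the child the leader of a new process group (POSIX).
    const child = spawn("bash", ["-c", command], { detached: true, stdio: "ignore" });
    let timedOut = false;

    const timer = setTimeout(() => {
      timedOut = true;
      if (child.pid !== undefined) {
        try {
          // A negative pid signals the whole group: bash and its descendants.
          process.kill(-child.pid, "SIGKILL");
        } catch {
          // The group exited between the timer firing and the kill; nothing to do.
        }
      }
    }, timeoutSeconds * 1000);

    child.on("error", (err) => {
      clearTimeout(timer);
      resolve({ status: "failed", message: `spawn error: ${err.message}` });
    });

    child.on("close", (code) => {
      clearTimeout(timer);
      if (timedOut) {
        resolve({ status: "failed", message: `timed out after ${timeoutSeconds}s` });
      } else if (code === 0) {
        resolve({ status: "passed", message: "exit 0" });
      } else {
        resolve({ status: "failed", message: `exit ${code}` });
      }
    });
  });
}
```

Killing only `child.pid` would leave a backgrounded or long-running grandchild alive after the timeout, which is why the group signal matters here.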

Workflow

  • e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
  • tools/e2e-scenarios/workflow-boundary.mts validator now rejects --dry-run, --plan-only, --validate-only in the workflow.
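A check of the kind the workflow-boundary validator performs might look like the sketch below. The actual implementation in tools/e2e-scenarios/workflow-boundary.mts is not shown in this thread, so `findForbiddenFlags` and `BoundaryViolation` are hypothetical names, and the line-by-line text scan is an assumption about one simple way to do it. Only the three flag names come from the description above.

```typescript
// Flags the PR description says must not appear in the workflow.
const FORBIDDEN_FLAGS = ["--dry-run", "--plan-only", "--validate-only"];

// Hypothetical shape for a reported violation.
interface BoundaryViolation {
  line: number; // 1-based line number in the workflow text
  flag: string;
  text: string;
}

// Hypothetical helper: scan workflow text line by line for forbidden flags.
function findForbiddenFlags(workflowText: string): BoundaryViolation[] {
  const violations: BoundaryViolation[] = [];
  workflowText.split(/\r?\n/).forEach((text, index) => {
    for (const flag of FORBIDDEN_FLAGS) {
      // Match the flag as a whole token: preceded by start-of-line or whitespace,
      // and not followed by a word character or hyphen (so --dry-run-foo is ignored).
      const pattern = new RegExp(`(^|\\s)${flag}(?![\\w-])`);
      if (pattern.test(text)) {
        violations.push({ line: index + 1, flag, text: text.trim() });
      }
    }
  });
  return violations;
}
```

A plain text scan like this also flags occurrences in YAML comments; whether that is desirable depends on how strict the boundary is meant to be.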

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)

  • runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing at run.ts.
  • runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep walks typed assertionGroups directly; nothing in TS calls a YAML-walking bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)

  • validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**: every if e2e_env_is_dry_run; then ... exit 0; fi and every [[ "${E2E_DRY_RUN:-0}" == "1" ]] short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.
  • runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
  • inference_routing.sh: dead _e2e_inference_plan helper deleted.

Tests

  • DELETED (validated dead code paths): e2e-suite-runner.test.ts, e2e-scenario-first-migration.test.ts, e2e-expected-state-validator.test.ts.
  • REWRITTEN e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.
  • TRIMMED e2e-lib-helpers.test.ts, e2e-scenario-additional-families.test.ts, e2e-scenario-resolver.test.ts, e2e-context-helper.test.ts: dry-run-mode / run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.
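The retry-on-classified-transient case exercised by the rewritten tests, together with the retry.{attempts,on} policy the orchestrator enforces, could be sketched as follows. The types and the `runWithRetry` helper are hypothetical. The sketch assumes that `attempts` is the total number of tries and that `on` lists the failure classifications that are eligible for retry; the thread does not confirm either reading.

```typescript
// Hypothetical policy shape, modeled on the retry.{attempts,on} fields named above.
interface RetryPolicy {
  attempts: number; // assumed: total tries, including the first
  on: string[]; // assumed: failure classifications that may be retried
}

// Hypothetical per-attempt result.
interface AttemptResult {
  status: "passed" | "failed";
  classification?: string;
}

// Hypothetical helper: re-run a step only while its failure is classified as retryable.
async function runWithRetry(
  policy: RetryPolicy,
  attempt: () => Promise<AttemptResult>,
): Promise<{ result: AttemptResult; attemptsUsed: number }> {
  const maxAttempts = Math.max(1, policy.attempts);
  let last: AttemptResult = { status: "failed" };
  for (let i = 1; i <= maxAttempts; i++) {
    last = await attempt();
    if (last.status === "passed") {
      return { result: last, attemptsUsed: i };
    }
    const retryable =
      last.classification !== undefined && policy.on.includes(last.classification);
    if (!retryable) {
      // Unclassified or non-listed failures are reported immediately.
      return { result: last, attemptsUsed: i };
    }
  }
  return { result: last, attemptsUsed: maxAttempts };
}
```

Keeping the loop in the orchestrator, as the description says, means individual shell scripts never need their own retry logic.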

Docs

  • test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only.

Verification

$ npx tsx test/e2e-scenario/scenarios/run.ts --list
hybrid scenario registry
- brev-launchable-cloud-openclaw: ...
- ubuntu-repo-cloud-openclaw: ...
(22 scenarios listed)

$ npx tsx test/e2e-scenario/scenarios/run.ts --emit-matrix
{"include":[{"id":"brev-launchable-cloud-openclaw", ...}, ...]}

$ E2E_CONTEXT_DIR=/tmp/x npx tsx test/e2e-scenario/scenarios/run.ts \
    --scenarios ubuntu-repo-cloud-openclaw
... (compiled plan output) ...
Phase results:
  environment: skipped (skipped=1)
  onboarding: failed (passed=1 failed=1)
  runtime: failed (failed=34 skipped=5)

$ cat /tmp/x/.e2e/logs/runtime.smoke.cli-available.log
smoke:cli-available
e2e context: missing required key(s): E2E_SCENARIO

$ vitest run test/e2e-scenario/
Test Files  32 passed (32)
     Tests  274 passed (274)
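The `--emit-matrix` output in the transcript above has the `{"include":[...]}` shape that a GitHub Actions dynamic matrix consumes. A minimal sketch of producing that shape from a scenario list follows. `ScenarioEntry`, `buildMatrix`, and `emitMatrix` are hypothetical names, and only the `id` field is carried over because the other fields are elided in the output shown above.

```typescript
// Hypothetical registry entry; only `id` is visible in the emitted matrix above.
interface ScenarioEntry {
  id: string;
}

// Shape accepted by a GitHub Actions `strategy.matrix` built from JSON.
interface Matrix {
  include: Array<{ id: string }>;
}

// Hypothetical helper: one matrix row per registered scenario.
function buildMatrix(registry: ScenarioEntry[]): Matrix {
  return { include: registry.map((scenario) => ({ id: scenario.id })) };
}

// Hypothetical helper: serialize on a single line so a workflow step can capture it.
function emitMatrix(registry: ScenarioEntry[]): string {
  return JSON.stringify(buildMatrix(registry));
}
```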

Audited absent in production paths: --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run, fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton, unsupported-live-step, --validate-only, RunContext.dryRun.

Expected CI behavior

This PR's CI will surface real failures wherever the orchestrator does not yet hand the runtime phase enough state (e.g. a seeded context.env). That is intentional: this change exposes the remaining wiring gaps so subsequent PRs (probe registry, OnboardingOrchestrator wiring into the real install/onboard dispatchers, old YAML resolver deletion) can address them directly.

Spec gates addressed

  • Phase 6: orchestrators execute live shell/probe assertion steps.
  • Phase 7: single TS runtime entrypoint; bash runners deprecated; workflows use the typed runner with no --dry-run.
  • Workflow side of Phase 9: --dry-run, --validate-only, and suite_filter gone from active paths.

The old YAML resolver source under runtime/resolver/ is left intact; its deletion is the next PR.

Stat

44 files changed, 454 insertions(+), 1833 deletions(-)

Net deletion of ~1400 lines, primarily dry-run scaffolding and the duplicated bash orchestration layer.

Summary by CodeRabbit

  • New Features

    • Scenarios run live by default with a compact per-phase results summary and an emit-matrix mode for CI.
    • Phases execute declared actions and real shell steps with per-action timeouts, retries, and per-step logs/evidence.
  • Chores

    • Deprecates legacy bash E2E runners in favor of the TypeScript entrypoint.
    • Removes dry-run/plan-only short-circuits so installs, validations and assertions execute consistently.
  • Documentation

    • Updates E2E docs to reflect the TypeScript entrypoint and deprecated bash scripts.
  • Tests

    • Refactors/removes multiple E2E tests to align with live-run behavior and the new orchestrator.

…only, and the bash runner

The merged hybrid scenario architecture shipped scaffolding that looked like it
ran E2E tests but did not. Two layers were producing fake green:

1. phase.ts:executeStep had no child_process.spawn anywhere. Every real
   shell/probe step fell through to a hardcoded
   { status: "failed", message: "unsupported live step" }, and a handful of
   fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient,
   phase-1-skeleton) were the only paths that reported "passed". CI green
   meant the plan compiled, not that any assertion executed.

2. ~30 shell scripts under validation_suites/, onboarding_assertions/,
   nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with
   "if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the
   dry-run flag flowed in (which workflows did pass), every script silently
   exited 0 before its real assertion ran.

This change rips out both layers in one shot. The TS runner has one execution
mode: live. There is no flag, env var, helper, branch, or comment in any
production path that can produce a fake pass.

Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn,
  with detached process groups so timeouts kill bash + sleep cleanly. Probe
  steps return skipped "probe not registered" until the registry lands.
  Pending steps return skipped "pending: <ref>". Unknown kinds throw.
  Real evidence is captured to .e2e/logs/<step-id>.log. Step-level
  reliability.timeoutSeconds and retry.{attempts,on} policy are enforced
  here, not in clients.
- run.ts: --dry-run, --validate-only deleted. Default invocation is live
  execution. --list and --plan-only (local debug) survive read-only.
  --emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported
  "skipped" status, now actually used.

Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both
  --dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only,
  and --validate-only to NOT appear in the workflow.

Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/
  gateway-check/suite-execution -> 5-line fail-fast stub pointing at
  run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep
  walks typed assertionGroups directly; nothing in TS calls a YAML-walking
  bash runner.

Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**:
  every "if e2e_env_is_dry_run; then ... exit 0; fi" and every
  "[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion
  logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable
  from the deleted dry-run paths) deleted.

Tests
- DELETED (validated dead code paths):
    e2e-suite-runner.test.ts             (run-suites.sh behavior)
    e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
    e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
    e2e-phase-orchestrators.test.ts: now exercises real shell spawning
      via temp scripts (pass/fail/timeout/retry/missing-ref), real probe
      skipping with visible reason, and real pending skipping. The
      previous fake-pass refs in this test were the canonical example of
      the problem.
- TRIMMED:
    e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of
      real bash semantics survive.
    e2e-scenario-additional-families.test.ts: planOnly-via-bash tests
      deleted; resolveScenario-direct tests survive.
    e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn
      tests deleted; resolver unit tests survive.
    e2e-context-helper.test.ts: dry-run trace test deleted.

Docs
- docs/README.md: updated to state one runner, one mode (live), no
  dry-run, no validate-only. Bash entrypoints documented as deprecated
  fail-fast stubs.

Verification
- run.ts --list          : prints the typed registry (intact)
- run.ts --emit-matrix   : emits JSON matrix for the dynamic-matrix
                           workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes,
                           real failures with real evidence logs. Phase
                           results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run,
  fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton,
  unsupported-live-step, --validate-only, or RunContext.dryRun in any
  production path.

CI for this PR will go red on environments where nemoclaw is not actually
installed and onboarded. That is the point. Red is the first honest signal
in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring
into the real install/onboard dispatchers, old YAML resolver deletion)
fix the real failures rather than hide them.

Spec gates addressed: Phase 6 (orchestrators execute live shell steps),
Phase 7 (single TS runtime entrypoint, bash runners deprecated), and
the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter
gone from active paths). The old YAML resolver source under
runtime/resolver/ stays for now; its deletion is the next PR.

coderabbitai Bot commented May 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Removes plan-only/dry-run modes, makes the TypeScript runner the canonical live executor (adds matrix emission), deprecates bash runners, updates the Phase orchestrator for real shell execution, removes dry-run guards from validation/install scripts, and updates tests and workflow checks.

Changes

Live E2E Execution Framework

| Layer | File(s) | Summary |
| --- | --- | --- |
| Workflow and docs | .github/workflows/e2e-scenarios.yaml, tools/e2e-scenarios/workflow-boundary.mts, test/e2e-scenario/docs/README.md | Workflow steps no longer invoke run.ts with --dry-run/--plan-only; validator enforces run scripts don't contain execution-hiding flags; README documents TS runner as canonical. |
| CLI runner and matrix emission | test/e2e-scenario/scenarios/run.ts | run.ts adds --emit-matrix/--plan-only, drops --dry-run/--validate-only, emits matrices, compiles and executes plans live, aggregates PhaseResult and sets exit code on failure. |
| Types and plan compiler | test/e2e-scenario/scenarios/types.ts, test/e2e-scenario/scenarios/compiler.ts | Introduce PhaseAction and PhaseActionResult, change RunPlanPhase.actions to PhaseAction[], remove RunContext.dryRun, and compile plans with typed actions plus rendered action lines in plan text. |
| Phase orchestrator real execution | test/e2e-scenario/scenarios/orchestrators/phase.ts | Execute phase actions before assertions; shell steps run real scripts (repo-relative resolution, existence checks, context.env parsing, per-step logs, detached process groups, timeouts, evidence and classifier handling); treat skipped/pending appropriately. |
| Bash runner deprecation & env cleanup | test/e2e-scenario/runtime/run-scenario.sh, test/e2e-scenario/runtime/run-suites.sh, test/e2e-scenario/runtime/lib/env.sh | Legacy bash runners are now fail-fast stubs that exit code 2; e2e_env_is_dry_run helper removed; README updated to mark bash scripts deprecated. |
| Validation suites: remove dry-run guards | test/e2e-scenario/validation_suites/** | Remove E2E_DRY_RUN short-circuits so inference/gateway/sandbox/Hermes/messaging/rebuild/sandbox lifecycle/platform smoke/CLI checks run live and produce real pass/fail outcomes. |
| Install & onboard scripts | test/e2e-scenario/nemoclaw_scenarios/install/*, .../fixtures/older-base-image.sh, .../onboard/dispatch.sh | Installers, repo installs, Docker pulls, and onboarding dispatch no longer short-circuit on dry-run and execute their steps unconditionally. |
| E2E framework tests | test/e2e-scenario/framework-tests/* | PhaseOrchestrator tests refactored to real temp-dir/script execution; many plan-only/dry-run tests removed and some test files deleted; tests updated to stop setting E2E_DRY_RUN. |
| Dispatch entrypoint | test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh | New deterministic dispatcher script added to invoke dispatcher functions after standard env/context setup. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4017: Overlapping changes to security policy credential checks and redaction/version handling.
  • NVIDIA/NemoClaw#4283: Prior changes to typed E2E workflow/runner flags and plan-mode invocations related to this adjustment.

Suggested labels

E2E, enhancement: testing, refactor, CI/CD

Suggested reviewers

  • cv
  • ericksoa

"🐰 I hopped through scripts at break of day,
No dry-run shadows now in my way.
Shells run for real, logs flourish and play,
Tests march forward, live truth on display.
A carrot for CI — steady, not gray."

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main objective: removing dry-run execution modes and making the TypeScript orchestrator the single execution path for real shell assertions. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

# Conflicts:
#	test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".

In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.
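
The first finding above (mixed-case refs in phase.ts) can be illustrated with a small sketch of the suggested fix: normalize `ref` to lower case once and use the normalized value for every check. The function name `classifyRef` and the `"external-provider"` label are hypothetical; only the regex alternatives and the `"external-tunnel"` return value come from the review comment, and the surrounding code in phase.ts is not shown in this thread.

```typescript
// Hypothetical classifier illustrating the case-insensitive fix suggested above.
function classifyRef(ref: string): string | undefined {
  // Normalize once so the regex test and the substring checks agree on case.
  const lower = ref.toLowerCase();
  if (/provider|inference|chat-completion|cloudflared|tunnel/.test(lower)) {
    if (lower.includes("tunnel") || lower.includes("cloudflared")) {
      return "external-tunnel";
    }
    // Hypothetical label for the remaining matches; the real value is not shown in this thread.
    return "external-provider";
  }
  return undefined;
}
```

With the original mix of a case-insensitive regex and case-sensitive `includes`, a ref such as `Cloudflared-check` would pass the outer test but miss the tunnel branch.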

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 401b8a23-1471-429d-9905-b413e6bae24c

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and b7acfb7.

📒 Files selected for processing (44)
  • .github/workflows/e2e-scenarios.yaml
  • test/e2e-scenario/docs/README.md
  • test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
  • test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
  • test/e2e-scenario/framework-tests/e2e-lib-helpers.test.ts
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
  • test/e2e-scenario/framework-tests/e2e-scenario-resolver.test.ts
  • test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
  • test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/runtime/run-scenario.sh
  • test/e2e-scenario/runtime/run-suites.sh
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/scenarios/run.ts
  • test/e2e-scenario/scenarios/types.ts
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
  • tools/e2e-scenarios/workflow-boundary.mts
💤 Files with no reviewable changes (30)
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/scenarios/types.ts
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/framework-tests/e2e-context-helper.test.ts

github-actions Bot commented May 28, 2026

E2E Advisor Recommendation

Required E2E: ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, ubuntu-no-docker-preflight-negative, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw
Optional E2E: ubuntu-repo-cloud-openclaw-telegram, ubuntu-repo-cloud-openclaw-discord, ubuntu-repo-cloud-openclaw-custom-policies, ubuntu-repo-cloud-openclaw-token-rotation

Dispatch hint: workflow_dispatch; runs ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw, ubuntu-no-docker-preflight-negative

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • ubuntu-repo-cloud-openclaw (medium): Baseline live Ubuntu repo install, cloud OpenClaw onboarding, gateway/sandbox smoke, cloud inference, and credentials checks cover the highest-risk workflow and typed-runner changes.
  • ubuntu-repo-cloud-hermes (medium): Validates the Hermes-specific path after changes to the shared runner/orchestrator and Hermes health assertion.
  • ubuntu-no-docker-preflight-negative (low): Preflight assertion and action orchestration changed; this negative scenario proves setup fails before gateway/sandbox side effects when Docker is unavailable.
  • gpu-repo-local-ollama-openclaw (high): Directly exercises changed Ollama install/profile code plus local Ollama inference and auth-proxy validation suites.
  • macos-repo-cloud-openclaw (medium): Validates the macOS scenario route and changed macOS platform smoke script under the live typed runner.
  • wsl-repo-cloud-openclaw (medium): Validates the WSL-specific workflow branch and changed WSL platform smoke script under live execution instead of dry-run.
  • brev-launchable-cloud-openclaw (high): Launchable install code and action dispatch changed; this scenario is the existing coverage for the launchable install/deployment path.

Optional E2E

  • ubuntu-repo-cloud-openclaw-telegram (medium): Changed messaging provider helpers and common bridge reachability assertion; useful to validate one real messaging provider flow.
  • ubuntu-repo-cloud-openclaw-discord (medium): Additional confidence for the changed messaging provider library and bridge assertions on Discord-specific OpenClaw onboarding.
  • ubuntu-repo-cloud-openclaw-custom-policies (medium): Exercises credentials, policy/onboarding state, model-router, and snapshot lifecycle assertions adjacent to changed security and lifecycle helper libraries.
  • ubuntu-repo-cloud-openclaw-token-rotation (medium): Useful confidence for credential/messaging boundary behavior after security and messaging helper changes.

New E2E recommendations

  • public-curl-installer (high): The public curl installer implementation changed, but the current typed scenario registry/workflow route list does not expose a canonical public-curl scenario to run in CI.
    • Suggested test: Add a canonical typed scenario such as ubuntu-public-curl-cloud-openclaw that installs through the public curl installer and runs smoke plus a minimal inference check.
  • workflow-route-registry-parity (medium): The workflow keeps a manual route table while the typed registry is the source of scenario IDs; this PR removes plan validation in the resolve-runner step, increasing the chance of route/registry drift.
    • Suggested test: Add an E2E workflow-boundary canary that dispatches every registry scenario through e2e-scenarios.yaml in plan/list validation mode or asserts ROUTES exactly covers the registry without invoking live infrastructure.
  • messaging-scenario-fanout (medium): Messaging helpers changed, but the all-scenarios fanout only covers the first seven scenarios and does not include Telegram/Discord/Slack assistant flows.
    • Suggested test: Extend the scenario fanout or add a focused messaging E2E workflow job that runs one OpenClaw messaging scenario per provider family.

Dispatch hint

  • Workflow: .github/workflows/e2e-scenarios-all.yaml
  • jobs input: workflow_dispatch; runs ubuntu-repo-cloud-openclaw, ubuntu-repo-cloud-hermes, gpu-repo-local-ollama-openclaw, macos-repo-cloud-openclaw, wsl-repo-cloud-openclaw, brev-launchable-cloud-openclaw, ubuntu-no-docker-preflight-negative

github-actions Bot commented May 28, 2026

E2E Scenario Advisor Recommendation

Required scenario E2E: e2e-scenarios-all, ubuntu-repo-cloud-hermes-discord:smoke, ubuntu-repo-cloud-hermes-slack:smoke, ubuntu-repo-cloud-openclaw-brave:smoke, ubuntu-repo-cloud-openclaw-custom-policies:inference,smoke, ubuntu-repo-cloud-openclaw-discord:messaging-discord,smoke, ubuntu-repo-cloud-openclaw-double-provider-switch:smoke, ubuntu-repo-cloud-openclaw-double-same-provider:smoke, ubuntu-repo-cloud-openclaw-repair:smoke, ubuntu-repo-cloud-openclaw-resume:smoke, ubuntu-repo-cloud-openclaw-slack:messaging-slack,smoke, ubuntu-repo-cloud-openclaw-telegram:messaging-telegram,smoke, ubuntu-repo-cloud-openclaw-token-rotation:smoke, ubuntu-repo-openai-compatible-openclaw:smoke
Optional scenario E2E: brev-launchable-cloud-openclaw:inference,smoke, gpu-repo-local-ollama-openclaw:smoke, macos-repo-cloud-openclaw:platform-macos, wsl-repo-cloud-openclaw:platform-wsl,smoke

Dispatch required scenario E2E:

  • gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-slack
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-brave
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-custom-policies
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-provider-switch
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-repair
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-token-rotation
  • gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-openai-compatible-openclaw

Workflow run

Full scenario advisor summary

E2E Scenario Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required scenario E2E

  • e2e-scenarios-all: the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios-all.yaml --ref <pr-head-ref>
  • ubuntu-repo-cloud-hermes-discord:smoke: Scenario ubuntu-repo-cloud-hermes-discord exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-discord
  • ubuntu-repo-cloud-hermes-slack:smoke: Scenario ubuntu-repo-cloud-hermes-slack exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-hermes-slack
  • ubuntu-repo-cloud-openclaw-brave:smoke: Scenario ubuntu-repo-cloud-openclaw-brave exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-brave
  • ubuntu-repo-cloud-openclaw-custom-policies:inference,smoke: Scenario ubuntu-repo-cloud-openclaw-custom-policies exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-custom-policies
  • ubuntu-repo-cloud-openclaw-discord:messaging-discord,smoke: Scenario ubuntu-repo-cloud-openclaw-discord exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-discord
  • ubuntu-repo-cloud-openclaw-double-provider-switch:smoke: Scenario ubuntu-repo-cloud-openclaw-double-provider-switch exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-provider-switch
  • ubuntu-repo-cloud-openclaw-double-same-provider:smoke: Scenario ubuntu-repo-cloud-openclaw-double-same-provider exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-double-same-provider
  • ubuntu-repo-cloud-openclaw-repair:smoke: Scenario ubuntu-repo-cloud-openclaw-repair exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-repair
  • ubuntu-repo-cloud-openclaw-resume:smoke: Scenario ubuntu-repo-cloud-openclaw-resume exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-resume
  • ubuntu-repo-cloud-openclaw-slack:messaging-slack,smoke: Scenario ubuntu-repo-cloud-openclaw-slack exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-slack
  • ubuntu-repo-cloud-openclaw-telegram:messaging-telegram,smoke: Scenario ubuntu-repo-cloud-openclaw-telegram exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-telegram
  • ubuntu-repo-cloud-openclaw-token-rotation:smoke: Scenario ubuntu-repo-cloud-openclaw-token-rotation exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-cloud-openclaw-token-rotation
  • ubuntu-repo-openai-compatible-openclaw:smoke: Scenario ubuntu-repo-openai-compatible-openclaw exercises the changed scenario E2E surface. Changed suite(s): assert, hermes-specific, inference, lib, messaging-discord, messaging-slack, messaging-telegram, platform-macos, platform-wsl, sandbox-exec.sh, smoke. the reusable single-scenario workflow changed; scenario install/onboard helper code changed; shared scenario runner/runtime code changed; validation suite assert changed; validation suite hermes-specific changed; validation suite inference changed; validation suite lib changed; validation suite messaging-discord changed; validation suite messaging-slack changed; validation suite messaging-telegram changed; validation suite platform-macos changed; validation suite platform-wsl changed; validation suite sandbox-exec.sh changed; validation suite smoke changed
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=ubuntu-repo-openai-compatible-openclaw

Optional scenario E2E

  • brev-launchable-cloud-openclaw:inference,smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=brev-launchable-cloud-openclaw
  • gpu-repo-local-ollama-openclaw:smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=gpu-repo-local-ollama-openclaw
  • macos-repo-cloud-openclaw:platform-macos: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=macos-repo-cloud-openclaw
  • wsl-repo-cloud-openclaw:platform-wsl,smoke: Special-runner scenario covers a changed suite but may require scarce hardware/secrets.
    • Dispatch: gh workflow run e2e-scenarios.yaml --ref <pr-head-ref> --field scenarios=wsl-repo-cloud-openclaw

Relevant changed files

  • .github/workflows/e2e-scenarios.yaml
  • test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
  • test/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/dispatch.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
  • test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
  • test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
  • test/e2e-scenario/runtime/lib/env.sh
  • test/e2e-scenario/runtime/run-scenario.sh
  • test/e2e-scenario/runtime/run-suites.sh
  • test/e2e-scenario/validation_suites/assert/gateway-alive.sh
  • test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
  • test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
  • test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
  • test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
  • test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
  • test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
  • test/e2e-scenario/validation_suites/lib/inference_routing.sh
  • test/e2e-scenario/validation_suites/lib/messaging_providers.sh
  • test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
  • test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh
  • test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
  • test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
  • test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
  • test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
  • test/e2e-scenario/validation_suites/sandbox-exec.sh
  • test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
  • test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh

@jyaunches jyaunches added the v0.0.55 Release target label May 28, 2026
@github-actions
Contributor

github-actions Bot commented May 28, 2026

PR Review Advisor

Findings: 8 needs attention, 10 worth checking, 0 nice ideas
Since last review: 2 prior items resolved, 14 still apply, 0 new items found

Review findings

🛠️ Needs attention

  • Secret-bearing child process output is persisted and uploaded without redaction (test/e2e-scenario/scenarios/orchestrators/phase.ts:112): The live runner now executes install, onboarding, and assertion subprocesses with the full workflow environment, including provider secrets, then writes raw stdout/stderr to evidence logs and copies stderr tails into phase result JSON. The workflow uploads the entire .e2e tree with hidden files included, so any installer, onboarding command, provider response, or CLI error that prints a token can persist that secret as an artifact.
    • Recommendation: Use a minimal allowlisted subprocess environment, centrally redact known secret values and secret-shaped strings before writing logs/result JSON, and restrict artifact upload paths to sanitized outputs. Add a regression test where a child prints a secret-shaped value and no .e2e log/result artifact contains it.
    • Evidence: phase.ts builds child envs with `...process.env`, pipes child stdout/stderr to createWriteStream logs, and includes `stderrTail` in failure messages. .github/workflows/e2e-scenarios.yaml passes `NVIDIA_API_KEY` to the runner and uploads `.e2e/` with `include-hidden-files: true`.
  • Required security probes and expected-failure checks still skip without failing live runs (test/e2e-scenario/scenarios/orchestrators/phase.ts:266): Security-sensitive suites for shields, policy enforcement, and injection blocking are represented as probe steps, and negative expected-failure side-effect validation is represented as a pending step. The orchestrator marks both kinds as skipped, while the top-level runner only exits nonzero for failed phase results. A live run can therefore omit required security and negative checks while still appearing non-failing if other assertions pass.
    • Recommendation: Fail closed for required/security probe steps and expected-failure side-effect checks, or exclude those suites/scenarios from live workflow selection until implemented. Add tests proving skipped security probes and pending expected-failure checks make run.ts exit nonzero.
    • Evidence: phase.ts returns `status: "skipped"` for `kind === "probe"` and `kind === "pending"`. The assertion registry maps `security-shields`, `security-policy`, and `security-injection` to probe steps and maps `runtime.expected-failure.no-side-effects` to a pending step. run.ts sets `process.exitCode` only when a phase status is `failed`.
  • Negative scenarios no longer force or verify their expected-failure contracts (test/e2e-scenario/runtime/run-scenario.sh:1): Stubbing the bash runner removed the previous negative orchestration that forced Docker-missing, invalid-key, and gateway-port-conflict failures, captured negative logs, matched the expected failure, and checked forbidden side effects. The TS path has expectedFailure metadata but only a pending side-effect placeholder, so negative scenarios do not prove the declared failure mode or side-effect contract.
    • Recommendation: Port expected-failure orchestration into the TS phase model before relying on these live runs: force the declared failure mode, write the negative log, derive observed phase/error/log/side effects, invoke the matcher, and fail on forbidden side effects. Add tests for Docker-missing, invalid NVIDIA key, and gateway-port-conflict scenarios.
    • Evidence: runtime/run-scenario.sh is now a fail-fast stub. baseline.ts declares expectedFailure for `ubuntu-no-docker-preflight-negative`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`; the TS assertion registry only adds `runtime.expected-failure.no-side-effects` as a pending step. `onboarding_assertions/preflight/00-preflight-expected-failed.sh` requires `negative-preflight.log`, but the TS runner/orchestrators do not create it.
  • Live installer paths execute mutable network scripts without mandatory verification (test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh:19): Removing dry-run guards makes installer helpers consequential in a secret-bearing workflow. The public installer defaults to a mutable raw GitHub main-branch URL and verifies SHA256 only when an optional environment variable is provided. The Ollama helper still pipes a remote installer directly into bash.
    • Recommendation: Pin installer sources to immutable refs or require expected digests in CI. Avoid curl|bash by downloading to a file and verifying it before execution, or use a trusted package source. Add tests that CI installer profiles fail when no pin/digest is configured.
    • Evidence: public-curl.sh defaults `E2E_INSTALLER_URL` to `https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/scripts/install.sh` and only checks `E2E_INSTALLER_SHA256` when non-empty. ollama.sh runs `curl -fsSL --retry 3 --retry-delay 2 "${ollama_url}" | bash`.
  • Residual E2E_DRY_RUN branch contradicts the one-live-mode contract (test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh:15): The PR states that no flag, environment variable, helper, or branch in production paths bypasses real assertion execution, but the Slack provider assertion still honors E2E_DRY_RUN and emits pass markers without reading the runtime Slack config or querying OpenClaw runtime discovery.
    • Recommendation: Remove the E2E_DRY_RUN branch or prove this script is outside every production scenario path. Add a static regression test that fails if `E2E_DRY_RUN`, `e2e_env_is_dry_run`, dry-run pass markers, or other dry-run bypasses appear in executable scenario/assertion paths.
    • Evidence: 00-slack-provider-state.sh branches on `[[ -n "${E2E_DRY_RUN:-}" ]]` and emits dry-run `e2e_pass` markers. The assertion registry wires `messaging-slack` to this script for Slack scenarios.
  • Skeleton pending refs remain in production scenario plans (test/e2e-scenario/scenarios/assertions/environment.ts:17): The PR body lists `phase-1-skeleton` as audited absent from production paths, but the environment baseline still defines a pending step with that ref and scenario plan construction includes that baseline for every scenario. Pending steps are skipped, creating non-failing gaps in live runs.
    • Recommendation: Remove skeleton assertion modules from production scenario plans, or fail closed when any `phase-1-skeleton` or pending production step is encountered. Add a static test for the audited-absent list.
    • Evidence: environment.ts defines `implementation: { kind: "pending", ref: "phase-1-skeleton" }`, and `assertionGroupsForScenario()` includes `environmentBaseline()` for scenario plans. runtime.ts also still defines a `phase-1-skeleton` placeholder module.
  • Workflow runner routing still drifts from the typed scenario registry (.github/workflows/e2e-scenarios.yaml:58): The workflow still uses a hardcoded ROUTES map while the typed registry contains canonical scenarios that are not routed here. The previous registry validation before route lookup was removed, so registry-accepted scenarios can be rejected by the executable workflow path. This contradicts the linked matrix-generation goal that adding a scenario to baseline.ts should be sufficient.
    • Recommendation: Make workflow runner selection consume the same typed routing source as the registry, or add a static contract test that every `listScenarios()` ID has a workflow route. Until then, add the missing routes or exclude unsupported scenarios from workflow dispatch.
    • Evidence: e2e-scenarios.yaml defines ROUTES and omits registry IDs such as `ubuntu-repo-cloud-openclaw-custom-policies`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`. The diff removed the prior `npx tsx ... --plan-only` validation before route lookup.
  • --emit-matrix does not match the matrix contract of the linked #4359, "feat(e2e): generate scenario fan-out matrix from typed registry" (test/e2e-scenario/scenarios/run.ts:56): This PR adds --emit-matrix for the dynamic-matrix workflow, but it emits an object whose include entries contain only id and description. The linked #4359 describes a single-line JSON array whose entries include id, runner, label, platform, and suites. The two contracts are incompatible.
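
The redaction and allowlisted-environment recommendation in the first finding above could look roughly like the sketch below. It is illustrative only: `redact`, `allowlistedEnv`, `SECRET_ENV_NAMES`, and the token-prefix pattern are hypothetical names and assumptions, not code from phase.ts. Only `NVIDIA_API_KEY` comes from the evidence quoted above.

```typescript
// Hypothetical helpers; none of these names exist in phase.ts.
// Names of environment variables whose values must never reach a log.
const SECRET_ENV_NAMES: readonly string[] = ["NVIDIA_API_KEY"];

// Assumed token shapes (prefix plus at least 8 token characters).
// The prefixes are examples; adjust them to the providers actually in use.
const SECRET_SHAPED = /\b(?:nvapi-|sk-|xox[abp]-)[A-Za-z0-9_-]{8,}/g;

// Replace known secret values first, then anything that merely looks like a token.
function redact(text: string, env: Record<string, string | undefined>): string {
  let out = text;
  for (const name of SECRET_ENV_NAMES) {
    const value = env[name];
    // Skip very short values so ordinary words are not blanked out.
    if (value !== undefined && value.length >= 8) {
      out = out.split(value).join(`[REDACTED:${name}]`);
    }
  }
  return out.replace(SECRET_SHAPED, "[REDACTED]");
}

// Build a child environment from an explicit allowlist instead of spreading
// the whole parent environment into the subprocess.
function allowlistedEnv(
  env: Record<string, string | undefined>,
  allow: readonly string[],
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of allow) {
    const value = env[name];
    if (value !== undefined) out[name] = value;
  }
  return out;
}
```

A sketch like this only works on complete strings. If child output is streamed in chunks, a secret split across a chunk boundary would slip through, so the redaction would need to run per complete line (or on the full buffer) before anything is written to the evidence log or the result JSON.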

🔎 Worth checking

  • Source-of-truth review needed: Probe registry fallback: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: phase.ts returns `status: "skipped"` for probes; the assertion registry wires security suites to probes.
  • Source-of-truth review needed: Expected-failure side-effect placeholder: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: runtime/run-scenario.sh was stubbed; registry adds only a pending side-effect check; preflight expected-failed assertion requires a negative log that TS does not create.
  • Source-of-truth review needed: Workflow runner routing source of truth: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: ROUTES omits registry IDs such as custom-policies and negative scenarios; previous plan-only registry validation was removed.
  • Source-of-truth review needed: --emit-matrix compatibility with the linked #4359 (feat(e2e): generate scenario fan-out matrix from typed registry): The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: run.ts emitMatrix builds an object with include entries containing only id and description.
  • Source-of-truth review needed: Installer trust policy for live scenario runs: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: public-curl.sh defaults to raw GitHub main with optional SHA; ollama.sh pipes `https://ollama.ai/install.sh` to bash.
  • Source-of-truth review needed: Shell action/assertion path resolution: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: phase.ts uses `path.isAbsolute` branches without realpath containment.
  • Source-of-truth review needed: Stable alias path for legacy onboard.log: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: e2e-phase-orchestrators.test.ts checks the alias; compiler.ts and phase.ts implement the compatibility copy.
  • Shell refs and alias paths lack repo-containment guards (test/e2e-scenario/scenarios/orchestrators/phase.ts:83): The orchestrator accepts absolute action/script refs and absolute alias paths, resolving them without realpath containment. Current scenario refs are repository-defined, but this primitive is broad for future scenario, manifest, or registry inputs and could turn a metadata mistake into execution or write outside the intended E2E tree.
    • Recommendation: Reject absolute refs unless explicitly allowlisted, realpath-check repo-relative scripts under the repository root, and constrain alias paths under the context directory. Add negative tests for absolute script refs, `..` traversal, symlink escapes, and absolute alias paths.
    • Evidence: phase.ts uses `path.isAbsolute(action.scriptRef) ? action.scriptRef : path.resolve(REPO_ROOT, action.scriptRef)` and similarly resolves shell step refs. On success it copies evidence to `path.isAbsolute(action.aliasPath) ? action.aliasPath : path.join(ctx.contextDir, action.aliasPath)`.
  • Workflow invokes npx without a no-install/local guard (.github/workflows/e2e-scenarios.yaml:134): The workflow installs dependencies with lifecycle scripts disabled, but then invokes `npx tsx` directly. If dependency state drifts or tsx is absent, this can become a network dependency path in a trusted workflow that also receives provider secrets.
    • Recommendation: Invoke the repository-local `./node_modules/.bin/tsx` after `npm ci`, or use `npx --no-install tsx`, and add a workflow-boundary test for the no-network execution contract.
    • Evidence: Both Linux and WSL workflow paths run `npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"` after `npm ci --ignore-scripts`.
  • Legacy onboard.log alias is a compatibility workaround without a removal contract (test/e2e-scenario/scenarios/compiler.ts:126): The compiler emits `aliasPath: "onboard.log"` so legacy assertions can keep reading a fixed filename, and the orchestrator copies evidence there best-effort. The test covers the alias behavior, but the source-of-truth review does not identify when legacy assertions will be migrated to action evidence paths or when the alias can be removed.
    • Recommendation: Document the invalid legacy contract, keep the regression test, and add an explicit migration/removal condition. Prefer updating assertions to consume typed action evidence directly once that source is available.
    • Evidence: compiler.ts assigns `aliasPath: "onboard.log"` for onboarding actions; phase.ts copies the evidence log to the alias best-effort. e2e-phase-orchestrators.test.ts asserts the alias exists for legacy shell assertions.
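
The repo-containment guard recommended above for shell refs might be sketched as follows. `resolveContained` is a hypothetical helper, not the existing phase.ts logic. It assumes the referenced script already exists on disk, so that `fs.realpathSync` can resolve symlinks before the containment check.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical helper: resolve a repo-relative ref and refuse anything that
// ends up outside `root` after symlinks are resolved.
function resolveContained(root: string, ref: string): string {
  if (path.isAbsolute(ref)) {
    throw new Error(`absolute ref rejected: ${ref}`);
  }
  const realRoot = fs.realpathSync(root);
  // realpathSync throws if the target does not exist, which also fails closed.
  const realTarget = fs.realpathSync(path.resolve(realRoot, ref));
  const rel = path.relative(realRoot, realTarget);
  const escapes =
    rel === "" || // the root itself is not a script
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||
    path.isAbsolute(rel); // different drive on Windows
  if (escapes) {
    throw new Error(`ref escapes root: ${ref}`);
  }
  return realTarget;
}
```

The same check could be applied to alias paths with the context directory as `root`. Because an alias target may not exist yet, that case would need to realpath the parent directory instead of the file itself.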

🌱 Nice ideas

  • None.
Since last review details

Current findings:

  • Source-of-truth review needed: Probe registry fallback: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: phase.ts returns `status: "skipped"` for probes; the assertion registry wires security suites to probes.
  • Source-of-truth review needed: Expected-failure side-effect placeholder: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: runtime/run-scenario.sh was stubbed; registry adds only a pending side-effect check; preflight expected-failed assertion requires a negative log that TS does not create.
  • Source-of-truth review needed: Workflow runner routing source of truth: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: ROUTES omits registry IDs such as custom-policies and negative scenarios; previous plan-only registry validation was removed.
  • Source-of-truth review needed: --emit-matrix compatibility with linked feat(e2e): generate scenario fan-out matrix from typed registry #4359: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: run.ts emitMatrix builds an object with include entries containing only id and description.
  • Source-of-truth review needed: Installer trust policy for live scenario runs: The advisor marked localized patch analysis as missing.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: public-curl.sh defaults to raw GitHub main with optional SHA; ollama.sh pipes `https://ollama.ai/install.sh` to bash.
  • Source-of-truth review needed: Shell action/assertion path resolution: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: phase.ts uses `path.isAbsolute` branches without realpath containment.
  • Source-of-truth review needed: Stable alias path for legacy onboard.log: The advisor marked localized patch analysis as needs_followup.
    • Recommendation: Identify the invalid state, source boundary, source-fix constraint, regression test, and removal condition before merging the localized behavior.
    • Evidence: e2e-phase-orchestrators.test.ts checks the alias; compiler.ts and phase.ts implement the compatibility copy.
  • Secret-bearing child process output is persisted and uploaded without redaction (test/e2e-scenario/scenarios/orchestrators/phase.ts:112): The live runner now executes install, onboarding, and assertion subprocesses with the full workflow environment, including provider secrets, then writes raw stdout/stderr to evidence logs and copies stderr tails into phase result JSON. The workflow uploads the entire .e2e tree with hidden files included, so any installer, onboarding command, provider response, or CLI error that prints a token can persist that secret as an artifact.
    • Recommendation: Use a minimal allowlisted subprocess environment, centrally redact known secret values and secret-shaped strings before writing logs/result JSON, and restrict artifact upload paths to sanitized outputs. Add a regression test where a child prints a secret-shaped value and no .e2e log/result artifact contains it.
    • Evidence: phase.ts builds child envs with `...process.env`, pipes child stdout/stderr to createWriteStream logs, and includes `stderrTail` in failure messages. .github/workflows/e2e-scenarios.yaml passes `NVIDIA_API_KEY` to the runner and uploads `.e2e/` with `include-hidden-files: true`.
  • Required security probes and expected-failure checks still skip without failing live runs (test/e2e-scenario/scenarios/orchestrators/phase.ts:266): Security-sensitive suites for shields, policy enforcement, and injection blocking are represented as probe steps, and negative expected-failure side-effect validation is represented as a pending step. The orchestrator marks both kinds as skipped, while the top-level runner only exits nonzero for failed phase results. A live run can therefore omit required security and negative checks while still appearing non-failing if other assertions pass.
    • Recommendation: Fail closed for required/security probe steps and expected-failure side-effect checks, or exclude those suites/scenarios from live workflow selection until implemented. Add tests proving skipped security probes and pending expected-failure checks make run.ts exit nonzero.
    • Evidence: phase.ts returns `status: "skipped"` for `kind === "probe"` and `kind === "pending"`. The assertion registry maps `security-shields`, `security-policy`, and `security-injection` to probe steps and maps `runtime.expected-failure.no-side-effects` to a pending step. run.ts sets `process.exitCode` only when a phase status is `failed`.
  • Negative scenarios no longer force or verify their expected-failure contracts (test/e2e-scenario/runtime/run-scenario.sh:1): Stubbing the bash runner removed the previous negative orchestration that forced Docker-missing, invalid-key, and gateway-port-conflict failures, captured negative logs, matched the expected failure, and checked forbidden side effects. The TS path has expectedFailure metadata but only a pending side-effect placeholder, so negative scenarios do not prove the declared failure mode or side-effect contract.
    • Recommendation: Port expected-failure orchestration into the TS phase model before relying on these live runs: force the declared failure mode, write the negative log, derive observed phase/error/log/side effects, invoke the matcher, and fail on forbidden side effects. Add tests for Docker-missing, invalid NVIDIA key, and gateway-port-conflict scenarios.
    • Evidence: runtime/run-scenario.sh is now a fail-fast stub. baseline.ts declares expectedFailure for `ubuntu-no-docker-preflight-negative`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`; the TS assertion registry only adds `runtime.expected-failure.no-side-effects` as a pending step. `onboarding_assertions/preflight/00-preflight-expected-failed.sh` requires `negative-preflight.log`, but the TS runner/orchestrators do not create it.
  • Live installer paths execute mutable network scripts without mandatory verification (test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh:19): Removing dry-run guards makes installer helpers consequential in a secret-bearing workflow. The public installer defaults to a mutable raw GitHub main-branch URL and verifies SHA256 only when an optional environment variable is provided. The Ollama helper still pipes a remote installer directly into bash.
    • Recommendation: Pin installer sources to immutable refs or require expected digests in CI. Avoid curl|bash by downloading to a file and verifying it before execution, or use a trusted package source. Add tests that CI installer profiles fail when no pin/digest is configured.
    • Evidence: public-curl.sh defaults `E2E_INSTALLER_URL` to `https://raw.githubusercontent.com/NVIDIA/NemoClaw/main/scripts/install.sh` and only checks `E2E_INSTALLER_SHA256` when non-empty. ollama.sh runs `curl -fsSL --retry 3 --retry-delay 2 "${ollama_url}" | bash`.
  • Residual E2E_DRY_RUN branch contradicts the one-live-mode contract (test/e2e-scenario/validation_suites/messaging/slack/00-slack-provider-state.sh:15): The PR states that no flag, environment variable, helper, or branch in production paths bypasses real assertion execution, but the Slack provider assertion still honors E2E_DRY_RUN and emits pass markers without reading the runtime Slack config or querying OpenClaw runtime discovery.
    • Recommendation: Remove the E2E_DRY_RUN branch or prove this script is outside every production scenario path. Add a static regression test that fails if `E2E_DRY_RUN`, `e2e_env_is_dry_run`, dry-run pass markers, or other dry-run bypasses appear in executable scenario/assertion paths.
    • Evidence: 00-slack-provider-state.sh branches on `[[ -n "${E2E_DRY_RUN:-}" ]]` and emits dry-run `e2e_pass` markers. The assertion registry wires `messaging-slack` to this script for Slack scenarios.
  • Skeleton pending refs remain in production scenario plans (test/e2e-scenario/scenarios/assertions/environment.ts:17): The PR body lists `phase-1-skeleton` as audited absent from production paths, but the environment baseline still defines a pending step with that ref and scenario plan construction includes that baseline for every scenario. Pending steps are skipped, creating non-failing gaps in live runs.
    • Recommendation: Remove skeleton assertion modules from production scenario plans, or fail closed when any `phase-1-skeleton` or pending production step is encountered. Add a static test for the audited-absent list.
    • Evidence: environment.ts defines `implementation: { kind: "pending", ref: "phase-1-skeleton" }`, and `assertionGroupsForScenario()` includes `environmentBaseline()` for scenario plans. runtime.ts also still defines a `phase-1-skeleton` placeholder module.
  • Workflow runner routing still drifts from the typed scenario registry (.github/workflows/e2e-scenarios.yaml:58): The workflow still uses a hardcoded ROUTES map while the typed registry contains canonical scenarios that are not routed here. The previous registry validation before route lookup was removed, so registry-accepted scenarios can be rejected by the executable workflow path. This contradicts the linked matrix-generation goal that adding a scenario to baseline.ts should be sufficient.
    • Recommendation: Make workflow runner selection consume the same typed routing source as the registry, or add a static contract test that every `listScenarios()` ID has a workflow route. Until then, add the missing routes or exclude unsupported scenarios from workflow dispatch.
    • Evidence: e2e-scenarios.yaml defines ROUTES and omits registry IDs such as `ubuntu-repo-cloud-openclaw-custom-policies`, `ubuntu-invalid-nvidia-key-negative`, and `ubuntu-gateway-port-conflict-negative`. The diff removed the prior `npx tsx ... --plan-only` validation before route lookup.
  • --emit-matrix does not match the linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 matrix contract (test/e2e-scenario/scenarios/run.ts:56): This PR adds --emit-matrix for the dynamic-matrix workflow, but it emits an object with include entries containing only id and description. Linked feat(e2e): generate scenario fan-out matrix from typed registry #4359 describes a single-line JSON array whose entries include id, runner, label, platform, and suites. The two contracts are incompatible.
  • Shell refs and alias paths lack repo-containment guards (test/e2e-scenario/scenarios/orchestrators/phase.ts:83): The orchestrator accepts absolute action/script refs and absolute alias paths, resolving them without realpath containment. Current scenario refs are repository-defined, but this primitive is broad for future scenario, manifest, or registry inputs and could turn a metadata mistake into execution or write outside the intended E2E tree.
    • Recommendation: Reject absolute refs unless explicitly allowlisted, realpath-check repo-relative scripts under the repository root, and constrain alias paths under the context directory. Add negative tests for absolute script refs, `..` traversal, symlink escapes, and absolute alias paths.
    • Evidence: phase.ts uses `path.isAbsolute(action.scriptRef) ? action.scriptRef : path.resolve(REPO_ROOT, action.scriptRef)` and similarly resolves shell step refs. On success it copies evidence to `path.isAbsolute(action.aliasPath) ? action.aliasPath : path.join(ctx.contextDir, action.aliasPath)`.
  • Workflow invokes npx without a no-install/local guard (.github/workflows/e2e-scenarios.yaml:134): The workflow installs dependencies with lifecycle scripts disabled, but then invokes `npx tsx` directly. If dependency state drifts or tsx is absent, this can become a network dependency path in a trusted workflow that also receives provider secrets.
    • Recommendation: Invoke the repository-local `./node_modules/.bin/tsx` after `npm ci`, or use `npx --no-install tsx`, and add a workflow-boundary test for the no-network execution contract.
    • Evidence: Both Linux and WSL workflow paths run `npx tsx test/e2e-scenario/scenarios/run.ts --scenarios "${SCENARIOS}"` after `npm ci --ignore-scripts`.
  • Legacy onboard.log alias is a compatibility workaround without a removal contract (test/e2e-scenario/scenarios/compiler.ts:126): The compiler emits `aliasPath: "onboard.log"` so legacy assertions can keep reading a fixed filename, and the orchestrator copies evidence there best-effort. The test covers the alias behavior, but the source-of-truth review does not identify when legacy assertions will be migrated to action evidence paths or when the alias can be removed.
    • Recommendation: Document the invalid legacy contract, keep the regression test, and add an explicit migration/removal condition. Prefer updating assertions to consume typed action evidence directly once that source is available.
    • Evidence: compiler.ts assigns `aliasPath: "onboard.log"` for onboarding actions; phase.ts copies the evidence log to the alias best-effort. e2e-phase-orchestrators.test.ts asserts the alias exists for legacy shell assertions.
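
The repo-containment guard recommended for shell refs above could look roughly like this. It is a sketch under stated assumptions, not the PR's code: `resolveContained` is a hypothetical name, absolute refs are rejected outright (no allowlist), and the referenced script is assumed to exist so that `realpathSync` can resolve symlinks.

```typescript
import { mkdirSync, mkdtempSync, realpathSync, symlinkSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { isAbsolute, join, relative, resolve, sep } from "node:path";

// Hypothetical guard: resolve `ref` under `root` and refuse anything whose
// real path (after following symlinks) lands outside the real root.
function resolveContained(root: string, ref: string): string {
  if (isAbsolute(ref)) {
    throw new Error(`absolute ref rejected: ${ref}`);
  }
  const realRoot = realpathSync(root);
  // realpathSync throws if the target does not exist, which also fails closed.
  const realTarget = realpathSync(resolve(realRoot, ref));
  const rel = relative(realRoot, realTarget);
  const escapes = rel === "" || rel === ".." || rel.startsWith(".." + sep) || isAbsolute(rel);
  if (escapes) {
    throw new Error(`ref escapes root: ${ref}`);
  }
  return realTarget;
}

// Throwaway fixture: a fake repo root, one script inside it, one outside it,
// and a symlink inside the root that points at the outside script.
const base = mkdtempSync(join(tmpdir(), "e2e-contain-"));
const root = join(base, "repo");
mkdirSync(root);
writeFileSync(join(root, "step.sh"), "#!/usr/bin/env bash\n");
writeFileSync(join(base, "outside.sh"), "#!/usr/bin/env bash\n");
symlinkSync(join(base, "outside.sh"), join(root, "link.sh"));

console.log(resolveContained(root, "step.sh"));
```

Comparing real paths rather than string prefixes is what catches the symlink case; a plain `startsWith(root)` check would accept `link.sh` here.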

Workflow run details

This is an automated advisory review. A human maintainer must make the final merge decision.

jyaunches added 3 commits May 27, 2026 21:21
…shape client check

- PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish
  before resolving so callers (and tests) reading evidence synchronously
  see the actual stdout/stderr instead of an empty file. Race exposed by
  e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'.
- e2e-phase-orchestrators: replace client-source toMatch regex (1
  source-shape test, budget=0) with a runtime-shape behavior assertion
  on the HostCliClient observation. Still enforces 'clients do not
  encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E
  architecture spec, without violating source-shape budget.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
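
The log-flush race described in this commit can be illustrated with a small sketch. It is not the PR's `runShellStep`; `writeEvidence` is a hypothetical helper that shows only the idea of resolving after the stream's `finish` event instead of right after `end()`.

```typescript
import { createWriteStream, mkdtempSync, readFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical helper: write chunks to a log file and resolve only once the
// stream has emitted "finish", i.e. after end() was called and all buffered
// data has been handed to the underlying file.
function writeEvidence(logPath: string, chunks: string[]): Promise<void> {
  return new Promise((resolvePromise, rejectPromise) => {
    const stream = createWriteStream(logPath);
    stream.once("error", rejectPromise);
    stream.once("finish", () => resolvePromise());
    for (const chunk of chunks) {
      stream.write(chunk);
    }
    stream.end();
  });
}

// Usage: a synchronous read after the promise resolves sees the full content.
const demoLog = join(mkdtempSync(join(tmpdir(), "e2e-evidence-")), "step.log");
writeEvidence(demoLog, ["stdout line\n", "stderr line\n"]).then(() => {
  console.log(readFileSync(demoLog, "utf8"));
});
```

Resolving immediately after `stream.end()` would let a caller read the file while writes are still buffered, which matches the empty-file symptom the commit message describes.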
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Closes the spec's reopened Phase 6 gap. The new typed runner now
executes the install and onboarding work that the deleted bash runner
used to perform, but inside EnvironmentOrchestrator and
OnboardingOrchestrator instead of in workflow YAML or a resurrected
bash runner. All canonical scenarios now reach a real, live SUT before
their assertions run.

Architecture (per hybrid-scenario-e2e-architecture spec):

* types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef,
  fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace
  the prior actions: string[] free-form labels with PhaseAction[]. Add
  actions[] to PhaseResult so failure-layer attribution stays clear:
  setup failure is recorded distinctly from assertion failure.

* compiler.ts: phaseActions() now emits typed actions for environment
  (context.emit + install.<id>) and onboarding (profile.<id>). Stable
  action ids: environment.context.emit,
  environment.install.<install-id>, onboarding.profile.<profile-id>.
  All install/onboard actions point at the existing dispatcher scripts
  (install/dispatch.sh, onboard/dispatch.sh) - shell remains the
  implementation per spec, invocation is centralized.

* orchestrators/phase.ts: PhaseOrchestrator.run() executes actions
  before assertions. Action failure short-circuits the phase so
  assertions never run against an environment that was never set up.
  Action runner reuses the same spawn/timeout/process-group/log-flush
  machinery as runShellStep. Per-action timeout, no retry (install and
  onboarding must fail loudly).

* nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only
  new shell file). The install/onboard dispatchers are intentionally
  library-style (function definitions only); this launcher gives them
  a deterministic executable entrypoint that sources runtime/lib/env.sh
  + runtime/lib/context.sh, applies non-interactive env, sources the
  requested dispatcher, and invokes the named function with one arg.
  Replaces the orchestration that the deleted run-scenario.sh used to
  do, but called from the typed orchestrator instead.

* plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>'
  per phase, before assertion groups. Maintainers can preview the full
  setup+onboard+assert sequence before dispatch.

* framework-tests/e2e-phase-orchestrators.test.ts: add five behavior
  tests covering action-runs-before-assertions, action-failure short-
  circuits-assertions, action timeout via orchestrator policy,
  evidence-log flushed-before-resolve, and compiler emits typed
  install/onboard actions for all 7 canonical scenarios.

What stays out:

* No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still
  invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios
  ...'. Workflow YAML stays innocent of install/onboard plumbing.
* No client edits. HostCliClient et al. remain pass/fail/policy free.
* No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter
  remain unsupported.

Validation gate (Phase 6 reopen note) is the next step: after this
push goes green on PR CI, dispatch e2e-scenarios-all.yaml against
feat/e2e-real-execution and confirm canonical scenarios produce real
phase results with action evidence under .e2e/actions/<id>.log,
instead of <1s 'failed=34 skipped=5'.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
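
The "actions before assertions, action failure short-circuits the phase" rule from this commit can be sketched as below. This is a simplified, synchronous illustration under assumed types (`Outcome`, `PhaseSummary`, `runPhase` are hypothetical names); the real orchestrator spawns processes and applies timeouts, which is omitted here.

```typescript
type Status = "passed" | "failed";

interface Outcome {
  id: string;
  status: Status;
}

interface PhaseSummary {
  actions: Outcome[];
  assertions: Outcome[];
  status: Status;
}

// Hypothetical phase runner: run setup actions in order, stop at the first
// failure, and only run assertions when every action passed.
function runPhase(
  actions: Array<() => Outcome>,
  assertions: Array<() => Outcome>,
): PhaseSummary {
  const actionResults: Outcome[] = [];
  for (const action of actions) {
    const result = action();
    actionResults.push(result);
    if (result.status === "failed") {
      // Setup failed: record it distinctly and never invoke assertions.
      return { actions: actionResults, assertions: [], status: "failed" };
    }
  }
  const assertionResults = assertions.map((assertion) => assertion());
  const status: Status = assertionResults.some((r) => r.status === "failed")
    ? "failed"
    : "passed";
  return { actions: actionResults, assertions: assertionResults, status };
}

// Usage: a failing install action leaves the assertion list empty.
const summary = runPhase(
  [() => ({ id: "environment.install.repo-current", status: "failed" })],
  [() => ({ id: "onboarding.base.cli-installed", status: "passed" })],
);
console.log(summary);
```

Keeping `actions` and `assertions` as separate arrays in the result is what lets a reader tell a setup failure from an assertion failure.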

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/types.ts (1)

110-131: ⚡ Quick win

Strengthen PhaseAction typing (discriminated union) to enforce per-kind required fields.

Current runtime already fails loudly if a "shell-fn" action is missing fn (the runner passes action.fn ?? "", and dispatch-action.sh errors when the function name isn’t found). Also, current PhaseAction objects are produced in test/e2e-scenario/scenarios/compiler.ts with fn populated for "shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.

Suggested refactor
-export interface PhaseAction {
-  id: string;
-  phase: PhaseName;
-  description?: string;
-  // "shell-fn" sources the bash dispatcher and invokes the named function.
-  // "shell"    runs an executable script (used for context-emit helper).
-  kind: "shell-fn" | "shell";
-  // Repo-relative path to the script.
-  scriptRef: string;
-  // For "shell-fn": the bash function to invoke after sourcing scriptRef.
-  fn?: string;
-  // Single positional arg passed to the function/script (install method or
-  // onboarding profile id today). Kept as a single string to keep stable
-  // ids predictable; multi-arg variants can extend this later.
-  arg?: string;
-  // Per-action timeout. No retry by default - install/onboard must fail
-  // loudly so the regression is visible. Retry stays a property of
-  // assertion steps, not actions.
-  timeoutSeconds?: number;
-  // Repo-relative evidence log path.
-  evidencePath?: string;
-}
+interface PhaseActionBase {
+  id: string;
+  phase: PhaseName;
+  description?: string;
+  scriptRef: string;
+  timeoutSeconds?: number;
+  evidencePath?: string;
+}
+
+export type PhaseAction =
+  | (PhaseActionBase & {
+      kind: "shell-fn";
+      fn: string;
+      arg?: string;
+    })
+  | (PhaseActionBase & {
+      kind: "shell";
+      arg?: string;
+    });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change
PhaseAction from a single broad interface into a discriminated union so
TypeScript enforces per-kind fields: define one variant for kind: "shell-fn"
that requires fn: string (plus shared fields like id, phase, scriptRef, arg?,
timeoutSeconds?, evidencePath?, description?) and another variant for kind:
"shell" that omits/marks fn as disallowed/undefined; update any usages (e.g.,
the action objects created in test/e2e-scenario/scenarios/compiler.ts and the
runner that reads action.fn) to satisfy the new types so compilation ensures
"shell-fn" always has fn and "shell" never does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).

In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.

---

Nitpick comments:
In `@test/e2e-scenario/scenarios/types.ts`:
- Around line 110-131: Change PhaseAction from a single broad interface into a
discriminated union so TypeScript enforces per-kind fields: define one variant
for kind: "shell-fn" that requires fn: string (plus shared fields like id,
phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and
another variant for kind: "shell" that omits/marks fn as disallowed/undefined;
update any usages (e.g., the action objects created in
test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to
satisfy the new types so compilation ensures "shell-fn" always has fn and
"shell" never does.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1a237354-36d1-4e41-8990-3b50d8f973f0

📥 Commits

Reviewing files that changed from the base of the PR and between 903f90b and 628870c.

📒 Files selected for processing (5)
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
  • test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh
  • test/e2e-scenario/scenarios/compiler.ts
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/scenarios/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/e2e-scenario/scenarios/orchestrators/phase.ts
  • test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts

Comment thread test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh Outdated
Comment thread test/e2e-scenario/scenarios/compiler.ts Outdated
jyaunches added 5 commits May 27, 2026 22:19
The first live dispatch of the Phase 6 wiring (run 26550310438) gave us
real action evidence and surfaced three real bugs. All three are fixed
inside the spec's prescribed layers - no workflow YAML, no client, no
old-resolver path.

1. environment.context.emit was a shell action that called the legacy
   emit-context-from-plan.sh helper. That helper expects the OLD
   YAML-resolver plan.json shape (dimensions.platform.profile.os...),
   which the typed compiler does not produce. Drop the shell action;
   add scenarios/orchestrators/context.ts that derives a normalized
   context.env directly from the typed RunPlan and writes it from
   ScenarioRunner.run() before any phase. Spec: context emission is
   framework infrastructure, not a phase action.

2. PhaseOrchestrator.runShellStep was reading context.env from
   ${ctx.contextDir}/.e2e/context.env, but the shell helper writes
   to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so
   shell assertions see seeded keys.

3. ScenarioRunner did not short-circuit across phase boundaries: a
   failed environment ACTION (real setup work) still let onboarding
   and runtime run, producing a misleading 34-failure cascade.
   Runner now consults prior phase results: if any prior action
   failed, downstream phases are synthesized as skipped with a
   message naming the blocking phase+action+message. Assertion-only
   failures still propagate as failures.

Tests added (8 new, 292/292 scenario framework tests green).

Validation gate next: dispatch e2e-scenarios-all.yaml again.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
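
Item 3 of this commit (cross-phase short-circuit) can be sketched as follows. The types and names (`PhaseResult`, `findBlocker`, `runPhases`) are assumptions made for illustration and do not come from the PR; the sketch only shows that a failed action blocks later phases while an assertion-only failure does not.

```typescript
type PhaseStatus = "passed" | "failed" | "skipped";

interface ActionResult {
  id: string;
  status: "passed" | "failed";
  message?: string;
}

interface PhaseResult {
  phase: string;
  status: PhaseStatus;
  actions: ActionResult[];
  message?: string;
}

interface Phase {
  name: string;
  run: () => PhaseResult;
}

// Returns a description of the first failed action in any prior phase.
function findBlocker(prior: PhaseResult[]): string | undefined {
  for (const result of prior) {
    const failed = result.actions.find((action) => action.status === "failed");
    if (failed !== undefined) {
      return `${result.phase}/${failed.id}: ${failed.message ?? "no message"}`;
    }
  }
  return undefined;
}

// Runs phases in order; once an action has failed, later phases are
// synthesized as skipped instead of being executed.
function runPhases(phases: Phase[]): PhaseResult[] {
  const results: PhaseResult[] = [];
  for (const phase of phases) {
    const blocker = findBlocker(results);
    if (blocker === undefined) {
      results.push(phase.run());
    } else {
      const skipped: PhaseResult = {
        phase: phase.name,
        status: "skipped",
        actions: [],
        message: `blocked by ${blocker}`,
      };
      results.push(skipped);
    }
  }
  return results;
}
```

A phase whose status is `failed` only because an assertion failed has no failed entry in `actions`, so `findBlocker` ignores it and downstream phases still run.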
Run 26550936453 surfaced two more real bugs after Phase 6 wiring went
live. Both fixed inside the spec's prescribed layers; nothing leaks
into workflow YAML or clients.

1. dispatch-action.sh called e2e_context_init unconditionally before
   sourcing the install/onboard dispatcher. e2e_context_init opens
   context.env with `: > ctx`, which truncated the file the
   ScenarioRunner had just seeded. All runtime assertions then failed
   with 'e2e context: missing required key(s): E2E_SCENARIO ...'.
   Fix: dispatch-action.sh no longer calls e2e_context_init. The TS
   framework owns context.env initialization; workers may still extend
   it via e2e_context_set.

2. The legacy onboarding.preflight.passed assertion expects an
   onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash
   runner used to redirect onboarding output there; the typed
   orchestrator captured it under .e2e/actions/<action-id>.log. Fix:
   add optional aliasPath to PhaseAction; compiler sets aliasPath to
   'onboard.log' for the onboarding profile action; orchestrator
   copies the action evidence log to the alias on success. Best-
   effort - alias copy failures do not fail the action.

Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):
- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1

Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294
scenario framework. shellcheck clean.

Validation gate next: redispatch e2e-scenarios-all and confirm
runtime steps actually exercise the SUT (real pass/fail, not key
errors).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
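
The best-effort alias copy from item 2 of this commit might be sketched like this. `copyEvidenceToAlias` is a hypothetical helper, and the sketch assumes a context-relative `aliasPath` only; the review notes elsewhere that phase.ts also accepts absolute alias paths, which is deliberately not reproduced here.

```typescript
import { copyFileSync, mkdirSync, mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical helper: copy an action's evidence log to a stable alias under
// the context directory. Failures are swallowed and reported as `false`, so a
// failed copy never turns a passing action into a failing one.
function copyEvidenceToAlias(
  evidenceLog: string,
  contextDir: string,
  aliasPath: string,
): boolean {
  const target = join(contextDir, aliasPath);
  try {
    mkdirSync(dirname(target), { recursive: true });
    copyFileSync(evidenceLog, target);
    return true;
  } catch {
    return false;
  }
}

// Usage with a throwaway context directory.
const contextDir = mkdtempSync(join(tmpdir(), "e2e-context-"));
const evidence = join(contextDir, "actions", "onboarding.profile.cloud-openclaw.log");
mkdirSync(dirname(evidence), { recursive: true });
writeFileSync(evidence, "onboarding output\n");

const copied = copyEvidenceToAlias(evidence, contextDir, "onboard.log");
const missing = copyEvidenceToAlias(join(contextDir, "absent.log"), contextDir, "other.log");
console.log({ copied, missing });
```

Returning a boolean rather than throwing keeps the copy observable (it can be logged) without making it part of the action's pass/fail outcome.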
The first canonical-scenario run that reached the onboarding-assertion
phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the
legacy onboarding.preflight.passed assertion fails on every successful
run because its regex matches any mention of 'docker' / 'container' /
'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding
mentions all of those many times.

The action itself succeeded (exit 0, 263s of real onboarding work);
the assertion is meant to confirm onboard.log does not contain
explicit preflight FAILURE markers. Tighten the regex accordingly:
match phrases like 'preflight failed/error', 'cannot connect to the
docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker
daemon' - not bare topic words.
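
The difference between matching topic words and matching explicit failure markers can be shown with a short sketch. The real assertion is a shell script whose exact pattern is not shown on this page; the regular expressions below are assembled from the phrases quoted in this commit message, and the case-insensitive flag is an assumption.

```typescript
// Old behavior as described above: any mention of a topic word counts.
const TOPIC_WORDS = /docker|container|daemon|socket/i;

// Tightened behavior: only explicit failure phrases count (assumed pattern).
const PREFLIGHT_FAILURE =
  /preflight (failed|error)|cannot connect to the docker daemon|onboarding aborted|FATAL: docker|ERROR: docker daemon/i;

function hasPreflightFailure(logText: string): boolean {
  return PREFLIGHT_FAILURE.test(logText);
}

// Illustrative log lines, not taken from a real onboard.log.
const healthyLog = "Starting container via docker daemon socket\nOnboarding complete\n";
const brokenLog = "Cannot connect to the Docker daemon at unix:///var/run/docker.sock\n";

console.log(TOPIC_WORDS.test(healthyLog));
console.log(hasPreflightFailure(healthyLog));
console.log(hasPreflightFailure(brokenLog));
```

The healthy log trips the topic-word pattern but not the failure-phrase pattern, which is the false failure this commit removes.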

Verified: shellcheck passes; bash -n passes.

Why we stop here on this PR:

This commit lands the last small framework-level fix produced by live
action evidence. The Phase 6 wiring is now fully validated end-to-end:

  Install:         passed (~12s)
  Onboarding:      action passed (~263s real onboarding)
                   base.cli-installed passed
                   preflight.passed will now pass
  Runtime:         9 passed / 25 failed / 5 skipped against live SUT

The remaining 25 runtime failures are real product/test bugs surfaced
by finally executing the suite against a live SUT (sandbox-shell
timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations
exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after
retries). They are pre-existing and out of scope for 'execute real
shell assertions; delete dry-run, --validate-only, and the bash
runner'. They become productive follow-up issues.

The 5 skipped runtime steps are 'probe not registered' - known per
spec; probe registry lands in a follow-up.

Negative scenarios (ubuntu-no-docker-preflight-negative,
invalid-key-negative, gateway-port-conflict-negative) need expected-
failure semantics and a way to actually simulate docker-missing on
the runner. Out of scope here; tracked as follow-up.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
All three findings are valid mechanical bugs introduced/touched by this
PR. Batched per the safe-batch policy: same-risk, independently
obvious, testable together.

1. orchestrators/phase.ts classifierForRef:
   Outer guard is /i (case-insensitive), but the inner branch used
   case-sensitive ref.includes("tunnel") / ref.includes("cloudflared")
   - mixed-case refs would fall through and misclassify as
   provider-transient. Replace with /tunnel|cloudflared/i.test(ref).

2. scenarios/compiler.ts phaseActions:
   Inline comment said "the scenario is malformed; surface it as a
   hard error" but the code returned []. Hard-fail instead, with a
   message that names the missing dimension. Empty environment is
   still tolerated (skeleton scenarios can carry no setup yet).

3. validation_suites/lib/sandbox_lifecycle.sh:
   Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely
   match sb10. Use awk with a whole-token equality check on column
   one of `nemoclaw list` output.

Tests: 294/294 scenario framework still green. shellcheck + shfmt
clean. No behavior change for canonical scenarios; affected paths
were either dormant (case-mixed classifier) or returning a slightly
stricter outcome (compiler hard-fail, sandbox exact match).

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
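
The `sb1` versus `sb10` problem from item 3 can be reproduced in a few lines. The actual fix uses awk in a shell library; the sketch below restates the same whole-token comparison in TypeScript, and the sample `nemoclaw list` output format (whitespace-separated columns, name first) is an assumption.

```typescript
// Substring match: the behavior the commit replaces.
function sandboxListedLoose(listOutput: string, sandboxName: string): boolean {
  return listOutput.includes(sandboxName);
}

// Whole-token match on column one of each line.
function sandboxListedExact(listOutput: string, sandboxName: string): boolean {
  return listOutput
    .split("\n")
    .some((line) => line.trim().split(/\s+/)[0] === sandboxName);
}

// Assumed output shape; only sb10 exists.
const listOutput = "sb10   running\nother  stopped\n";

console.log(sandboxListedLoose(listOutput, "sb1"));
console.log(sandboxListedExact(listOutput, "sb1"));
console.log(sandboxListedExact(listOutput, "sb10"));
```

Only the exact variant reports `sb1` as absent when the list contains `sb10` alone.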
Labels

v0.0.55 Release target
