test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner#4380
test(e2e): execute real shell assertions; delete dry-run, --validate-only, and the bash runner#4380jyaunches wants to merge 10 commits into
Conversation
…only, and the bash runner
The merged hybrid scenario architecture shipped scaffolding that looked like it
ran E2E tests but did not. Two layers were producing fake green:
1. phase.ts:executeStep had no child_process.spawn anywhere. Every real
shell/probe step fell through to a hardcoded
{ status: "failed", message: "unsupported live step" }, and a handful of
fake-pass refs (fake-pass, fake-retry-once-pass, fake-always-transient,
phase-1-skeleton) were the only paths that reported "passed". CI green
meant the plan compiled, not that any assertion executed.
2. ~30 shell scripts under validation_suites/, onboarding_assertions/,
nemoclaw_scenarios/install/, and nemoclaw_scenarios/onboard/ began with
"if e2e_env_is_dry_run; then echo [dry-run] ...; exit 0; fi". Once the
dry-run flag flowed in (which workflows did pass), every script silently
exited 0 before its real assertion ran.
This change rips out both layers in one shot. The TS runner has one execution
mode: live. There is no flag, env var, helper, branch, or comment in any
production path that can produce a fake pass.
Orchestrator (TypeScript)
- phase.ts: executeStep now spawns shell steps via child_process.spawn,
with detached process groups so timeouts kill bash + sleep cleanly. Probe
steps return skipped "probe not registered" until the registry lands.
Pending steps return skipped "pending: <ref>". Unknown kinds throw.
Real evidence is captured to .e2e/logs/<step-id>.log. Step-level
reliability.timeoutSeconds and retry.{attempts,on} policy are enforced
here, not in clients.
- run.ts: --dry-run, --validate-only deleted. Default invocation is live
execution. --list and --plan-only (local debug) survive read-only.
--emit-matrix added for the dynamic-matrix workflow (PR #4359).
- types.ts: RunContext.dryRun deleted. AssertionResult already supported
"skipped" status, now actually used.
Workflow
- e2e-scenarios.yaml: the resolve-runner --plan-only warmup, and both
--dry-run invocations (Linux + WSL), are gone. Workflows execute live.
- workflow-boundary.mts validator now requires --dry-run, --plan-only,
and --validate-only to NOT appear in the workflow.
Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
- runtime/run-scenario.sh: 483 lines of duplicated install/onboard/
gateway-check/suite-execution -> 5-line fail-fast stub pointing at
run.ts. The TS phase orchestrators own this work now.
- runtime/run-suites.sh: same treatment. PhaseOrchestrator.runShellStep
walks typed assertionGroups directly; nothing in TS calls a YAML-walking
bash runner.
Shell scripts (the leaves stay, the dry-run skip blocks die)
- validation_suites/**, onboarding_assertions/**, nemoclaw_scenarios/**:
every "if e2e_env_is_dry_run; then ... exit 0; fi" and every
"[[ ${E2E_DRY_RUN:-0} == 1 ]]" short-circuit removed. The real assertion
logic that was hiding underneath now runs unconditionally.
- runtime/lib/env.sh: e2e_env_is_dry_run helper deleted.
- inference_routing.sh: dead _e2e_inference_plan helper (only callable
from the deleted dry-run paths) deleted.
Tests
- DELETED (validated dead code paths):
e2e-suite-runner.test.ts (run-suites.sh behavior)
e2e-scenario-first-migration.test.ts (run-scenario.sh dry-run plan)
e2e-expected-state-validator.test.ts (--validate-only mode)
- REWRITTEN:
e2e-phase-orchestrators.test.ts: now exercises real shell spawning
via temp scripts (pass/fail/timeout/retry/missing-ref), real probe
skipping with visible reason, and real pending skipping. The
previous fake-pass refs in this test were the canonical example of
the problem.
- TRIMMED:
e2e-lib-helpers.test.ts: dry-run-mode unit tests deleted; tests of
real bash semantics survive.
e2e-scenario-additional-families.test.ts: planOnly-via-bash tests
deleted; resolveScenario-direct tests survive.
e2e-scenario-resolver.test.ts: run-scenario.sh --plan-only spawn
tests deleted; resolver unit tests survive.
e2e-context-helper.test.ts: dry-run trace test deleted.
Docs
- docs/README.md: updated to state one runner, one mode (live), no
dry-run, no validate-only. Bash entrypoints documented as deprecated
fail-fast stubs.
Verification
- run.ts --list : prints the typed registry (intact)
- run.ts --emit-matrix : emits JSON matrix for the dynamic-matrix
workflow
- run.ts --scenarios <id>: spawns real shell scripts, real exit codes,
real failures with real evidence logs. Phase
results show passed/failed/skipped honestly.
- All 274 e2e-scenario framework tests pass.
- Audited: no surviving --dry-run, dryRun, E2E_DRY_RUN, e2e_env_is_dry_run,
fake-pass, fake-retry-once-pass, fake-always-transient, phase-1-skeleton,
unsupported-live-step, --validate-only, or RunContext.dryRun in any
production path.
CI for this PR will go red on environments where nemoclaw is not actually
installed and onboarded. That is the point. Red is the first honest signal
in months. Subsequent PRs (probe registry, OnboardingOrchestrator wiring
into the real install/onboard dispatchers, old YAML resolver deletion)
fix the real failures rather than hide them.
Spec gates addressed: Phase 6 (orchestrators execute live shell steps),
Phase 7 (single TS runtime entrypoint, bash runners deprecated), and
the workflow side of Phase 9 (--dry-run / --validate-only / suite_filter
gone from active paths). The old YAML resolver source under
runtime/resolver/ stays for now; its deletion is the next PR.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRemoves plan-only/dry-run modes, makes the TypeScript runner the canonical live executor (adds matrix emission), deprecates bash runners, updates the Phase orchestrator for real shell execution, removes dry-run guards from validation/install scripts, and updates tests and workflow checks. ChangesLive E2E Execution Framework
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
# Conflicts: # test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e-scenario/scenarios/orchestrators/phase.ts`:
- Around line 32-34: The branch uses a case-insensitive regex test on ref but
then calls case-sensitive ref.includes("tunnel") / ref.includes("cloudflared"),
causing mixed-case refs to misclassify; update the branching in the same block
(the if handling /provider|inference|chat-completion|cloudflared|tunnel/i) to
compare in a case-insensitive way—e.g., normalize ref to lowerCase once and use
that for the subsequent includes checks (or use case-insensitive regex matches)
so tunnel/cloudflared variants are correctly detected and return
"external-tunnel".
In `@test/e2e-scenario/validation_suites/lib/sandbox_lifecycle.sh`:
- Line 62: The current substring check [[ "${SANDBOX_LIFECYCLE_LAST_OUTPUT}" ==
*"${E2E_SANDBOX_NAME}"* ]] can produce false positives (e.g., sb1 matching
sb10); change it to perform an exact token/line match of the sandbox name
instead: test SANDBOX_LIFECYCLE_LAST_OUTPUT for the exact E2E_SANDBOX_NAME token
(for example by piping SANDBOX_LIFECYCLE_LAST_OUTPUT to grep -w/-x or using a
word-boundary regex with [[ ... =~ ... ]]) and fail if not found, so the check
around SANDBOX_LIFECYCLE_LAST_OUTPUT and E2E_SANDBOX_NAME only succeeds for
exact sandbox-name matches.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 401b8a23-1471-429d-9905-b413e6bae24c
📒 Files selected for processing (44)
.github/workflows/e2e-scenarios.yamltest/e2e-scenario/docs/README.mdtest/e2e-scenario/framework-tests/e2e-context-helper.test.tstest/e2e-scenario/framework-tests/e2e-expected-state-validator.test.tstest/e2e-scenario/framework-tests/e2e-lib-helpers.test.tstest/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.tstest/e2e-scenario/framework-tests/e2e-scenario-additional-families.test.tstest/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.tstest/e2e-scenario/framework-tests/e2e-scenario-resolver.test.tstest/e2e-scenario/framework-tests/e2e-suite-runner.test.tstest/e2e-scenario/nemoclaw_scenarios/fixtures/older-base-image.shtest/e2e-scenario/nemoclaw_scenarios/install/dispatch.shtest/e2e-scenario/nemoclaw_scenarios/install/launchable.shtest/e2e-scenario/nemoclaw_scenarios/install/ollama.shtest/e2e-scenario/nemoclaw_scenarios/install/public-curl.shtest/e2e-scenario/nemoclaw_scenarios/install/repo-current.shtest/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.shtest/e2e-scenario/runtime/lib/env.shtest/e2e-scenario/runtime/run-scenario.shtest/e2e-scenario/runtime/run-suites.shtest/e2e-scenario/scenarios/orchestrators/phase.tstest/e2e-scenario/scenarios/run.tstest/e2e-scenario/scenarios/types.tstest/e2e-scenario/validation_suites/assert/gateway-alive.shtest/e2e-scenario/validation_suites/assert/sandbox-alive.shtest/e2e-scenario/validation_suites/hermes/00-hermes-health.shtest/e2e-scenario/validation_suites/inference/cloud/00-models-health.shtest/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.shtest/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.shtest/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.shtest/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.shtest/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.shtest/e2e-scenario/validation_suites/lib/inference_routing.shtest/e2e-scenario/validation_suites/lib/messaging_providers.shtest/e2e-scenario/validation_suites/lib/rebuild_upgrade.shtest/e2e-scenario/validation_suites/lib/sandbox_lifecycle.shtest/e2e-scenario/validation_suites/lib/security_policy_credentials.shtest/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.shtest/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.shtest/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.shtest/e2e-scenario/validation_suites/sandbox-exec.shtest/e2e-scenario/validation_suites/smoke/00-cli-available.shtest/e2e-scenario/validation_suites/smoke/03-sandbox-shell.shtools/e2e-scenarios/workflow-boundary.mts
💤 Files with no reviewable changes (30)
- test/e2e-scenario/validation_suites/inference/cloud/01-chat-completion.sh
- test/e2e-scenario/nemoclaw_scenarios/install/launchable.sh
- test/e2e-scenario/validation_suites/messaging/common/03-bridge-reachable.sh
- test/e2e-scenario/validation_suites/inference/ollama-gpu/00-ollama-models-health.sh
- test/e2e-scenario/validation_suites/assert/sandbox-alive.sh
- test/e2e-scenario/validation_suites/smoke/03-sandbox-shell.sh
- test/e2e-scenario/validation_suites/inference/ollama-auth-proxy/00-proxy-reachable.sh
- test/e2e-scenario/validation_suites/inference/cloud/02-inference-local-from-sandbox.sh
- test/e2e-scenario/validation_suites/platform/wsl/00-wsl-smoke.sh
- test/e2e-scenario/validation_suites/lib/messaging_providers.sh
- test/e2e-scenario/nemoclaw_scenarios/install/repo-current.sh
- test/e2e-scenario/scenarios/types.ts
- test/e2e-scenario/validation_suites/platform/macos/00-macos-smoke.sh
- test/e2e-scenario/framework-tests/e2e-expected-state-validator.test.ts
- test/e2e-scenario/validation_suites/inference/ollama-gpu/01-ollama-chat-completion.sh
- test/e2e-scenario/nemoclaw_scenarios/install/ollama.sh
- test/e2e-scenario/validation_suites/hermes/00-hermes-health.sh
- test/e2e-scenario/validation_suites/lib/rebuild_upgrade.sh
- test/e2e-scenario/validation_suites/smoke/00-cli-available.sh
- test/e2e-scenario/validation_suites/lib/inference_routing.sh
- test/e2e-scenario/validation_suites/assert/gateway-alive.sh
- test/e2e-scenario/framework-tests/e2e-suite-runner.test.ts
- test/e2e-scenario/validation_suites/inference/cloud/00-models-health.sh
- test/e2e-scenario/validation_suites/sandbox-exec.sh
- test/e2e-scenario/validation_suites/lib/security_policy_credentials.sh
- test/e2e-scenario/nemoclaw_scenarios/install/public-curl.sh
- test/e2e-scenario/nemoclaw_scenarios/onboard/dispatch.sh
- test/e2e-scenario/framework-tests/e2e-scenario-first-migration.test.ts
- test/e2e-scenario/runtime/lib/env.sh
- test/e2e-scenario/framework-tests/e2e-context-helper.test.ts
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
E2E Scenario Advisor RecommendationRequired scenario E2E: Dispatch required scenario E2E:
Full scenario advisor summaryE2E Scenario AdvisorBase: Required scenario E2E
Optional scenario E2E
Relevant changed files
|
PR Review AdvisorFindings: 8 needs attention, 10 worth checking, 0 nice ideas Review findings🛠️ Needs attention
🔎 Worth checking
🌱 Nice ideas
Since last review detailsCurrent findings:
This is an automated advisory review. A human maintainer must make the final merge decision. |
…shape client check - PhaseOrchestrator.runShellStep: wait for the log WriteStream to finish before resolving so callers (and tests) reading evidence synchronously see the actual stdout/stderr instead of an empty file. Race exposed by e2e-phase-orchestrators 'shell_step_passes_when_script_exits_zero'. - e2e-phase-orchestrators: replace client-source toMatch regex (1 source-shape test, budget=0) with a runtime-shape behavior assertion on the HostCliClient observation. Still enforces 'clients do not encode pass/fail or retry/timeout semantics' per hybrid-scenario E2E architecture spec, without violating source-shape budget. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Closes the spec's reopened Phase 6 gap. The new typed runner now executes the install and onboarding work that the deleted bash runner used to perform, but inside EnvironmentOrchestrator and OnboardingOrchestrator instead of in workflow YAML or a resurrected bash runner. All canonical scenarios now reach a real, live SUT before their assertions run. Architecture (per hybrid-scenario-e2e-architecture spec): * types.ts: introduce typed PhaseAction (kind=shell-fn|shell, scriptRef, fn, arg, timeoutSeconds, evidencePath) and PhaseActionResult. Replace the prior actions: string[] free-form labels with PhaseAction[]. Add actions[] to PhaseResult so failure-layer attribution stays clear: setup failure is recorded distinctly from assertion failure. * compiler.ts: phaseActions() now emits typed actions for environment (context.emit + install.<id>) and onboarding (profile.<id>). Stable action ids: environment.context.emit, environment.install.<install-id>, onboarding.profile.<profile-id>. All install/onboard actions point at the existing dispatcher scripts (install/dispatch.sh, onboard/dispatch.sh) - shell remains the implementation per spec, invocation is centralized. * orchestrators/phase.ts: PhaseOrchestrator.run() executes actions before assertions. Action failure short-circuits the phase so assertions never run against an environment that was never set up. Action runner reuses the same spawn/timeout/process-group/log-flush machinery as runShellStep. Per-action timeout, no retry (install and onboarding must fail loudly). * nemoclaw_scenarios/dispatch-action.sh: new bash launcher (the only new shell file). The install/onboard dispatchers are intentionally library-style (function definitions only); this launcher gives them a deterministic executable entrypoint that sources runtime/lib/env.sh + runtime/lib/context.sh, applies non-interactive env, sources the requested dispatcher, and invokes the named function with one arg. Replaces the orchestration that the deleted run-scenario.sh used to do, but called from the typed orchestrator instead. * plan-only output: now shows 'Action: <id> (timeout=...) -> <fn> <arg>' per phase, before assertion groups. Maintainers can preview the full setup+onboard+assert sequence before dispatch. * framework-tests/e2e-phase-orchestrators.test.ts: add five behavior tests covering action-runs-before-assertions, action-failure short- circuits-assertions, action timeout via orchestrator policy, evidence-log flushed-before-resolve, and compiler emits typed install/onboard actions for all 7 canonical scenarios. What stays out: * No workflow YAML edits. .github/workflows/e2e-scenarios.yaml still invokes only 'npx tsx test/e2e-scenario/scenarios/run.ts --scenarios ...'. Workflow YAML stays innocent of install/onboard plumbing. * No client edits. HostCliClient et al. remain pass/fail/policy free. * No resolver/YAML-first revival. setup_scenarios/test_plans/suite_filter remain unsupported. Validation gate (Phase 6 reopen note) is the next step: after this push goes green on PR CI, dispatch e2e-scenarios-all.yaml against feat/e2e-real-execution and confirm canonical scenarios produce real phase results with action evidence under .e2e/actions/<id>.log, instead of <1s 'failed=34 skipped=5'. Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
test/e2e-scenario/scenarios/types.ts (1)
110-131: ⚡ Quick winStrengthen
PhaseActiontyping (discriminated union) to enforce per-kindrequired fields.Current runtime already fails loudly if a
"shell-fn"action is missingfn(the runner passesaction.fn ?? "", anddispatch-action.sherrors when the function name isn’t found). Also, currentPhaseActionobjects are produced intest/e2e-scenario/scenarios/compiler.tswithfnpopulated for"shell-fn". Still, the type allows invalid combinations to compile, so a discriminated union would prevent that drift for any future producers.Suggested refactor
-export interface PhaseAction { - id: string; - phase: PhaseName; - description?: string; - // "shell-fn" sources the bash dispatcher and invokes the named function. - // "shell" runs an executable script (used for context-emit helper). - kind: "shell-fn" | "shell"; - // Repo-relative path to the script. - scriptRef: string; - // For "shell-fn": the bash function to invoke after sourcing scriptRef. - fn?: string; - // Single positional arg passed to the function/script (install method or - // onboarding profile id today). Kept as a single string to keep stable - // ids predictable; multi-arg variants can extend this later. - arg?: string; - // Per-action timeout. No retry by default - install/onboard must fail - // loudly so the regression is visible. Retry stays a property of - // assertion steps, not actions. - timeoutSeconds?: number; - // Repo-relative evidence log path. - evidencePath?: string; -} +interface PhaseActionBase { + id: string; + phase: PhaseName; + description?: string; + scriptRef: string; + timeoutSeconds?: number; + evidencePath?: string; +} + +export type PhaseAction = + | (PhaseActionBase & { + kind: "shell-fn"; + fn: string; + arg?: string; + }) + | (PhaseActionBase & { + kind: "shell"; + arg?: string; + });🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/e2e-scenario/scenarios/types.ts` around lines 110 - 131, Change PhaseAction from a single broad interface into a discriminated union so TypeScript enforces per-kind fields: define one variant for kind: "shell-fn" that requires fn: string (plus shared fields like id, phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and another variant for kind: "shell" that omits/marks fn as disallowed/undefined; update any usages (e.g., the action objects created in test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to satisfy the new types so compilation ensures "shell-fn" always has fn and "shell" never does.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/e2e-scenario/nemoclaw_scenarios/dispatch-action.sh`:
- Line 60: Remove the "|| true" suppression so failures in e2e_context_init
surface during test runs: in dispatch-action.sh replace the line that calls the
initializer (the call to e2e_context_init) by invoking e2e_context_init without
"|| true" so that any mkdir or file-write errors creating
${E2E_CONTEXT_DIR}/context.env propagate (this ensures e2e_context_require will
fail immediately instead of masking the error and misattributing missing keys).
In `@test/e2e-scenario/scenarios/compiler.ts`:
- Around line 94-99: The code currently silently returns [] when required phase
action dimensions like installId (from scenario.environment?.install) are
missing; instead throw a hard error to fail-fast: replace the early returns that
yield [] with throwing a descriptive Error (include context such as scenario.id
or scenario.name and which dimension is missing) in the function that generates
phase actions (the branch checking installId / scenario.environment?.install),
and make the identical change in the other similar branch (the second check at
the same pattern) so malformed scenarios surface as hard failures rather than
emitting empty action lists.
---
Nitpick comments:
In `@test/e2e-scenario/scenarios/types.ts`:
- Around line 110-131: Change PhaseAction from a single broad interface into a
discriminated union so TypeScript enforces per-kind fields: define one variant
for kind: "shell-fn" that requires fn: string (plus shared fields like id,
phase, scriptRef, arg?, timeoutSeconds?, evidencePath?, description?) and
another variant for kind: "shell" that omits/marks fn as disallowed/undefined;
update any usages (e.g., the action objects created in
test/e2e-scenario/scenarios/compiler.ts and the runner that reads action.fn) to
satisfy the new types so compilation ensures "shell-fn" always has fn and
"shell" never does.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1a237354-36d1-4e41-8990-3b50d8f973f0
📒 Files selected for processing (5)
test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.tstest/e2e-scenario/nemoclaw_scenarios/dispatch-action.shtest/e2e-scenario/scenarios/compiler.tstest/e2e-scenario/scenarios/orchestrators/phase.tstest/e2e-scenario/scenarios/types.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- test/e2e-scenario/scenarios/orchestrators/phase.ts
- test/e2e-scenario/framework-tests/e2e-phase-orchestrators.test.ts
The first live dispatch of the Phase 6 wiring (run 26550310438) gave us
real action evidence and surfaced three real bugs. All three are fixed
inside the spec's prescribed layers - no workflow YAML, no client, no
old-resolver path.
1. environment.context.emit was a shell action that called the legacy
emit-context-from-plan.sh helper. That helper expects the OLD
YAML-resolver plan.json shape (dimensions.platform.profile.os...),
which the typed compiler does not produce. Drop the shell action;
add scenarios/orchestrators/context.ts that derives a normalized
context.env directly from the typed RunPlan and writes it from
ScenarioRunner.run() before any phase. Spec: context emission is
framework infrastructure, not a phase action.
2. PhaseOrchestrator.runShellStep was reading context.env from
${ctx.contextDir}/.e2e/context.env, but the shell helper writes
to ${E2E_CONTEXT_DIR}/context.env (top-level). Fix the path so
shell assertions see seeded keys.
3. ScenarioRunner did not short-circuit across phase boundaries: a
failed environment ACTION (real setup work) still let onboarding
and runtime run, producing a misleading 34-failure cascade.
Runner now consults prior phase results: if any prior action
failed, downstream phases are synthesized as skipped with a
message naming the blocking phase+action+message. Assertion-only
failures still propagate as failures.
Tests added (8 new, 292/292 scenario framework tests green).
Validation gate next: dispatch e2e-scenarios-all.yaml again.
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Run 26550936453 surfaced two more real bugs after Phase 6 wiring went
live. Both fixed inside the spec's prescribed layers; nothing leaks
into workflow YAML or clients.
1. dispatch-action.sh called e2e_context_init unconditionally before
sourcing the install/onboard dispatcher. e2e_context_init opens
context.env with `: > ctx`, which truncated the file the
ScenarioRunner had just seeded. All runtime assertions then failed
with 'e2e context: missing required key(s): E2E_SCENARIO ...'.
Fix: dispatch-action.sh no longer calls e2e_context_init. The TS
framework owns context.env initialization; workers may still extend
it via e2e_context_set.
2. The legacy onboarding.preflight.passed assertion expects an
onboard.log file at ${E2E_CONTEXT_DIR}/onboard.log. The old bash
runner used to redirect onboarding output there; the typed
orchestrator captured it under .e2e/actions/<action-id>.log. Fix:
add optional aliasPath to PhaseAction; compiler sets aliasPath to
'onboard.log' for the onboarding profile action; orchestrator
copies the action evidence log to the alias on success. Best-
effort - alias copy failures do not fail the action.
Live evidence from run 26550936453 (canonical ubuntu-repo-cloud-openclaw):
- environment.install.repo-current: passed in 14.2s
- onboarding.profile.cloud-openclaw: passed in 302s (real onboarding!)
- onboarding.base.cli-installed: passed
- onboarding.preflight.passed: failed (onboard.log not found) <- fixed
- runtime.* (10 steps): all 'missing key(s)' <- fixed by #1
Tests: 38/38 phase-orchestrator (was 36; +2 alias tests), 294/294
scenario framework. shellcheck clean.
Validation gate next: redispatch e2e-scenarios-all and confirm
runtime steps actually exercise the SUT (real pass/fail, not key
errors).
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
The first canonical-scenario run that reached the onboarding-assertion
phase (run 26552550140 / ubuntu-repo-cloud-openclaw) showed that the
legacy onboarding.preflight.passed assertion fails on every successful
run because its regex matches any mention of 'docker' / 'container' /
'daemon' / 'socket' in onboard.log - and a normal nemoclaw onboarding
mentions all of those many times.
The action itself succeeded (exit 0, 263s of real onboarding work);
the assertion is meant to confirm onboard.log does not contain
explicit preflight FAILURE markers. Tighten the regex accordingly:
match phrases like 'preflight failed/error', 'cannot connect to the
docker daemon', 'onboarding aborted', 'FATAL: docker', 'ERROR: docker
daemon' - not bare topic words.
Verified: shellcheck passes; bash -n passes.
Why we stop here on this PR:
This commit lands the last small framework-level fix produced by live
action evidence. The Phase 6 wiring is now fully validated end-to-end:
Install: passed (~12s)
Onboarding: action passed (~263s real onboarding)
base.cli-installed passed
preflight.passed will now pass
Runtime: 9 passed / 25 failed / 5 skipped against live SUT
The remaining 25 runtime failures are real product/test bugs surfaced
by finally executing the suite against a live SUT (sandbox-shell
timeouts, inference 30-60s timeouts, lifecycle.sandbox_operations
exit-1 mismatches, lifecycle.rebuild/upgrade 120s timeouts even after
retries). They are pre-existing and out of scope for 'execute real
shell assertions; delete dry-run, --validate-only, and the bash
runner'. They become productive follow-up issues.
The 5 skipped runtime steps are 'probe not registered' - known per
spec; probe registry lands in a follow-up.
Negative scenarios (ubuntu-no-docker-preflight-negative,
invalid-key-negative, gateway-port-conflict-negative) need expected-
failure semantics and a way to actually simulate docker-missing on
the runner. Out of scope here; tracked as follow-up.
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
All three findings are valid mechanical bugs introduced/touched by this
PR. Batched per the safe-batch policy: same-risk, independently
obvious, testable together.
1. orchestrators/phase.ts classifierForRef:
Outer guard is /i (case-insensitive), but the inner branch used
case-sensitive ref.includes("tunnel") / ref.includes("cloudflared")
- mixed-case refs would fall through and misclassify as
provider-transient. Replace with /tunnel|cloudflared/i.test(ref).
2. scenarios/compiler.ts phaseActions:
Inline comment said "the scenario is malformed; surface it as a
hard error" but the code returned []. Hard-fail instead, with a
message that names the missing dimension. Empty environment is
still tolerated (skeleton scenarios can carry no setup yet).
3. validation_suites/lib/sandbox_lifecycle.sh:
Substring match `*${E2E_SANDBOX_NAME}*` would let sb1 falsely
match sb10. Use awk with a whole-token equality check on column
one of `nemoclaw list` output.
Tests: 294/294 scenario framework still green. shellcheck + shfmt
clean. No behavior change for canonical scenarios; affected paths
were either dormant (case-mixed classifier) or returning a slightly
stricter outcome (compiler hard-fail, sandbox exact match).
Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
Why
Removing
--dry-runand plan-mode execution from the E2E runner: they let a coding agent ship work behind false-green test runs. The TS runner now has one execution mode: live. No flag, env var, helper, or branch in any production path bypasses real assertion execution.Two seams needed closing:
phase.ts:executeStephad nochild_process.spawn— every real shell/probe step fell through to a hardcodedfailed: unsupported live step, and four placeholder refs (fake-pass,fake-retry-once-pass,fake-always-transient,phase-1-skeleton) were the only paths that reportedpassed.validation_suites/,onboarding_assertions/, andnemoclaw_scenarios/began withif e2e_env_is_dry_run; then exit 0; fi, so when the workflow passed--dry-runthe scripts exited 0 before running.What changed
Orchestrator (TypeScript)
phase.ts:executeStepnow spawns shell steps viachild_process.spawn, with detached process groups so timeouts killbash + sleepcleanly via negative-pid signal. Probe steps returnskipped: "probe not registered"until the registry lands. Pending steps returnskipped: "pending: <ref>". Unknown kinds throw. Real evidence captured to.e2e/logs/<step-id>.log. Step-levelreliability.timeoutSecondsandretry.{attempts,on}enforced here, not in clients.run.ts:--dry-run,--validate-onlydeleted. Default invocation is live execution.--listand--plan-only(local debug) survive read-only.--emit-matrixadded for the dynamic-matrix workflow (feat(e2e): generate scenario fan-out matrix from typed registry #4359).types.ts:RunContext.dryRundeleted.Workflow
e2e-scenarios.yaml: the resolve-runner--plan-onlywarmup, and both--dry-runinvocations (Linux + WSL), are gone. Workflows execute live.tools/e2e-scenarios/workflow-boundary.mtsvalidator now rejects--dry-run,--plan-only,--validate-onlyin the workflow.Bash entrypoints (PR collapses what was originally going to be PR 1 + PR 2)
runtime/run-scenario.sh: 483 lines of duplicated install/onboard/gateway-check/suite-execution → 5-line fail-fast stub pointing atrun.ts.runtime/run-suites.sh: same treatment.PhaseOrchestrator.runShellStepwalks typedassertionGroupsdirectly; nothing in TS calls a YAML-walking bash runner.Shell scripts (the leaves stay, the dry-run skip blocks die)
validation_suites/**,onboarding_assertions/**,nemoclaw_scenarios/**: everyif e2e_env_is_dry_run; then ... exit 0; fiand every[[ "${E2E_DRY_RUN:-0}" == "1" ]]short-circuit removed. The real assertion logic that was hiding underneath now runs unconditionally.runtime/lib/env.sh:e2e_env_is_dry_runhelper deleted.inference_routing.sh: dead_e2e_inference_planhelper deleted.Tests
e2e-suite-runner.test.ts,e2e-scenario-first-migration.test.ts,e2e-expected-state-validator.test.ts.e2e-phase-orchestrators.test.ts: now exercises real shell spawning via temp scripts (pass / fail-with-stderr-tail / timeout / retry-on-classified-transient / missing-ref), real probe skipping with visible reason, and real pending skipping. Replaces the prior placeholder refs with assertions that observe actual subprocess behavior.e2e-lib-helpers.test.ts,e2e-scenario-additional-families.test.ts,e2e-scenario-resolver.test.ts,e2e-context-helper.test.ts: dry-run-mode /run-scenario.sh-spawning tests deleted; tests of real bash semantics survive.Docs
test/e2e-scenario/docs/README.md: one runner, one mode (live), no dry-run, no validate-only.Verification
Audited absent in production paths:
--dry-run,dryRun,E2E_DRY_RUN,e2e_env_is_dry_run,fake-pass,fake-retry-once-pass,fake-always-transient,phase-1-skeleton,unsupported-live-step,--validate-only,RunContext.dryRun.Expected CI behavior
This PR's CI will surface real failures wherever the orchestrator does not yet hand the runtime phase enough state (e.g. a seeded
context.env). That is intentional: this change exposes the remaining wiring gaps so subsequent PRs (probe registry,OnboardingOrchestratorwiring into the real install/onboard dispatchers, old YAML resolver deletion) can address them directly.Spec gates addressed
--dry-run.--dry-run,--validate-only, andsuite_filtergone from active paths.The old YAML resolver source under
runtime/resolver/is left intact; its deletion is the next PR.Stat
Net deletion of ~1400 lines, primarily dry-run scaffolding and the duplicated bash orchestration layer.
Summary by CodeRabbit
New Features
Chores
Documentation
Tests