Skip to content

fix(onboard): escape Ollama loop on daemon/runner failure (#4365)#4401

Merged
cv merged 4 commits into
NVIDIA:mainfrom
yimoj:fix/4365-ollama-fallback-loop
May 29, 2026
Merged

fix(onboard): escape Ollama loop on daemon/runner failure (#4365)#4401
cv merged 4 commits into
NVIDIA:mainfrom
yimoj:fix/4365-ollama-fallback-loop

Conversation

@yimoj
Copy link
Copy Markdown
Contributor

@yimoj yimoj commented May 28, 2026

Summary

When Ollama is stopped before nemoclaw onboard runs or its model runner crashes mid-onboarding, the wizard kept the user inside Ollama-only loops (issue #4365):

  1. The interactive default autostart-timeout path silently returned to the provider menu with no steering hint, so the user re-picked Local Ollama and hit the same timeout.
  2. A model runner has unexpectedly stopped probe error inside selectAndValidateOllamaModel re-prompted for another Ollama tag that would crash the same broken daemon.

This change detects Ollama runner crashes during the local probe (isOllamaRunnerCrash + new daemonFailure flag on ValidationResult) and:

  • Escapes to provider selection when the daemon is broken, instead of looping inside the Ollama-model prompt.
  • Exits with a clear message when NEMOCLAW_PROVIDER pins any Ollama-using key (ollama, install-ollama, install-windows-ollama, start-windows-ollama, normalized for case/whitespace via the new isOllamaProviderPinned helper, mirroring getNonInteractiveProvider).
  • Aborts cleanly via abortNonInteractive for non-interactive runs.
  • Prints a Pick a non-Ollama provider in the next menu — re-selecting Local Ollama would hit the same timeout hint after an autostart timeout in interactive default mode.

The --no-ollama-autostart fallback behavior is preserved.

Test plan

  • npx vitest run src/lib/inference/local.test.ts src/lib/onboard/ollama-startup.test.ts test/onboard-ollama-autostart.test.ts — 77/77 pass, including:
    • New isOllamaRunnerCrash matcher / daemonFailure flag unit tests in local.test.ts.
    • New src/lib/onboard/ollama-startup.test.ts covering autostart-timeout steer hint, pinned-provider exit, ready short-circuit, and isOllamaProviderPinned normalization (casing/whitespace + all four pinned keys).
    • Scenario G in test/onboard-ollama-autostart.test.ts driving the runner-crash escape end-to-end through the wizard (and Scenario H exercising the casing-variant pin).
  • cd nemoclaw && npm test — plugin tests green.
  • codex review --uncommitted — clean (two earlier P2 findings fixed: provider normalization and broader Ollama key set).
  • Live E2E onboarding repro is blocked on this host: the wizard's systemd loopback override auto-recovers Ollama before reaching the bug path. Reported and verified via hermetic tests.

🤖 Generated with Claude Code

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • New Features

    • Onboarding now differentiates Ollama runner/daemon crashes from model incompatibility via a daemon-failure flag, offers safer recovery (including returning to provider/model selection), and refines pinned-provider timeout/exit behavior to avoid startup loops.
    • Centralized probe-failure handling for consistent interactive vs non-interactive responses.
  • Tests

    • Expanded coverage for runner-crash detection, probe-failure branches, pinned-provider variants, startup/timeout flows, and end-to-end onboarding scenarios.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Caution

Review failed

Pull request was closed or merged during review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3be4b2b9-36a3-4ab2-bb49-afb854571ba0

📥 Commits

Reviewing files that changed from the base of the PR and between 02fffaa and d9b5f59.

📒 Files selected for processing (2)
  • src/lib/inference/local.ts
  • src/lib/onboard.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lib/onboard.ts

📝 Walkthrough

Walkthrough

Classifies Ollama runner/daemon crashes at inference time (isOllamaRunnerCrash), surfaces a daemonFailure flag through validation and proxy types, detects pinned Ollama providers, and centralizes onboarding probe-failure handling to exit, abort, or return to provider selection.

Changes

Ollama Daemon Failure Detection & Onboarding Escape

Layer / File(s) Summary
Daemon-failure classification and type propagation
src/lib/inference/local.ts, src/lib/inference/ollama/proxy.ts, src/lib/inference/local.test.ts
Adds isOllamaRunnerCrash() to detect runner/daemon termination patterns, adds daemonFailure?: boolean to ValidationResult, validateOllamaModel() sets the flag when matched, and prepareOllamaModel return type permits daemonFailure. Tests cover crash and non-crash cases.
Centralized onboarding probe-failure handling
src/lib/onboard/ollama-probe-failure.ts, src/lib/onboard/ollama-probe-failure.test.ts
Introduces handleOllamaProbeFailure() and OllamaProbeFailureInput/action types; branches on daemonFailure and interactive/non-interactive/pinned-provider state, returning "back-to-selection" or "continue" or exiting as appropriate.
Provider-pinning detection for startup gate
src/lib/onboard/ollama-startup.ts, src/lib/onboard/ollama-startup.test.ts
Adds isOllamaProviderPinned() which normalizes NEMOCLAW_PROVIDER and checks known Ollama keys; startup gate uses it and prints updated steer hint or exits when pinned. Tests cover variants and fast-path readiness.
Ollama startup gate tests
src/lib/onboard/ollama-startup.ts, src/lib/onboard/ollama-startup.test.ts
Unit tests for runOllamaStartupOrGate covering timeout path, pinned-provider exit, provider-pinned detection variants, and fast-path when Ollama is already ready.
Integration test harness for runner-crash scenarios
test/onboard-ollama-autostart.test.ts
Extends autostart test harness with scenario-driven /api/generate payloads, per-scenario subprocess timeouts, provider env overrides, conditional proxy/sentinel behavior, and scenarios validating pinned-provider + runner-crash exit behavior.
selectAndValidateOllamaModel delegation
src/lib/onboard.ts
Delegates probe-failure handling to handleOllamaProbeFailure() and returns { outcome: "back-to-selection" } when the handler requests re-selection.

Sequence Diagram(s)

sequenceDiagram
  participant validateOllamaModel
  participant isOllamaRunnerCrash
  participant ValidationResult
  participant prepareOllamaModel
  validateOllamaModel->>isOllamaRunnerCrash: check errText
  isOllamaRunnerCrash-->>validateOllamaModel: true if runner/daemon crash pattern
  validateOllamaModel->>ValidationResult: set daemonFailure: true
  ValidationResult-->>prepareOllamaModel: propagate daemonFailure
Loading
sequenceDiagram
  participant prepareOllamaModel as Prepare
  participant handleOllamaProbeFailure as Handler
  participant isOllamaProviderPinned as Pinned
  participant selectAndValidateOllamaModel as Caller
  Prepare->>Handler: probe (ok, message, daemonFailure?)
  Handler->>Pinned: check pinned state
  alt pinned
    Handler-->>Caller: process.exit(1)
  else not pinned
    alt non-interactive
      Handler-->>Caller: abort (process.exit)
    else interactive
      Handler-->>Caller: "back-to-selection" or "continue"
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#4219: Modifies Ollama probe/validation failure handling inside selectAndValidateOllamaModel and non-interactive abort messaging.
  • NVIDIA/NemoClaw#4108: Related changes to validateOllamaModel and validation behavior at the same entry point.
  • NVIDIA/NemoClaw#4289: Threads additional flags through Ollama model validation, touching similar plumbing.

Suggested labels

Provider: Ollama

Suggested reviewers

  • ericksoa
  • cv
  • cjagwani
  • jyaunches

Poem

🐰 I sniffed the logs where the runner lay,
A daemon flag twitched, then hopped away.
If Ollama's pinned and the daemon's gone,
We exit, abort, or send you back anon.
Hop—pick another model and bound along.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title directly matches the main objective: detecting and escaping Ollama loop when daemon/runner fails. The title is concise, specific, and clearly summarizes the primary change.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard/ollama-startup.ts (1)

94-95: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pinned-provider error text is overly specific to ollama.

On Line 94, the message always reports NEMOCLAW_PROVIDER=ollama, but the pinned set also includes other Ollama-routing keys (Lines 23-28). This can mislead debugging when a different pinned key is set.

Suggested fix
-          "  NEMOCLAW_PROVIDER=ollama is pinned but Ollama is unreachable; refusing to loop on provider selection.",
+          "  NEMOCLAW_PROVIDER pins an Ollama onboarding path but Ollama is unreachable; refusing to loop on provider selection.",
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/ollama-startup.ts` around lines 94 - 95, The error message in
the pinned-provider check is hardcoded to "NEMOCLAW_PROVIDER=ollama" which is
misleading; change the message to interpolate the actual pinned key/value used
(read from NEMOCLAW_PROVIDER or the local variable holding the pinned provider,
e.g., pinnedProvider/pinnedKey) so it reports the real pinned setting when
Ollama is unreachable rather than always saying "ollama". Ensure the updated
text is used where the pinned-provider check occurs (the startup/pinning logic
in ollama-startup.ts).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/onboard.ts`:
- Around line 3927-3952: Extract the daemon-failure branch that checks
probe.daemonFailure into a new helper module (e.g.,
src/lib/onboard/ollama-startup.ts) that exports a function like
handleOllamaDaemonFailure(probe, selectedModel, context) which contains the
isOllamaProviderPinned() check, the console.error/process.exit call, the
isNonInteractive()/abortNonInteractive call, the console messages, and returns {
outcome: "back-to-selection" } as before; in src/lib/onboard.ts replace the full
branch with a compact call site that invokes handleOllamaDaemonFailure(probe,
selectedModel, /* pass any needed helpers/context */) and forwards its result,
preserving behavior and imports but reducing onboard.ts line growth.

In `@src/lib/onboard/ollama-startup.test.ts`:
- Around line 117-144: The test currently calls restore() at the end and can
leak environment/module state if an assertion throws; wrap the body that mutates
process.env.NEMOCLAW_PROVIDER and calls isOllamaProviderPinned() inside a
try/finally and move restore() into the finally block so restore() always runs;
specifically update the test case that iterates over the cases (where
process.env.NEMOCLAW_PROVIDER is set/deleted) to ensure restore() is invoked in
the finally, leaving the loop and expect calls unchanged.

---

Outside diff comments:
In `@src/lib/onboard/ollama-startup.ts`:
- Around line 94-95: The error message in the pinned-provider check is hardcoded
to "NEMOCLAW_PROVIDER=ollama" which is misleading; change the message to
interpolate the actual pinned key/value used (read from NEMOCLAW_PROVIDER or the
local variable holding the pinned provider, e.g., pinnedProvider/pinnedKey) so
it reports the real pinned setting when Ollama is unreachable rather than always
saying "ollama". Ensure the updated text is used where the pinned-provider check
occurs (the startup/pinning logic in ollama-startup.ts).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fbb6e725-f0c2-4869-9298-3235d82ef7a4

📥 Commits

Reviewing files that changed from the base of the PR and between 1daf081 and 41e0657.

📒 Files selected for processing (7)
  • src/lib/inference/local.test.ts
  • src/lib/inference/local.ts
  • src/lib/inference/ollama/proxy.ts
  • src/lib/onboard.ts
  • src/lib/onboard/ollama-startup.test.ts
  • src/lib/onboard/ollama-startup.ts
  • test/onboard-ollama-autostart.test.ts

Comment thread src/lib/onboard.ts Outdated
Comment on lines +117 to +144
it("isOllamaProviderPinned recognises every Ollama-using provider key (#4365)", () => {
// Mirror the matching logic in providers.getNonInteractiveProvider so a
// user setting NEMOCLAW_PROVIDER to any of the Ollama-using keys still
// triggers the pinned-provider escape paths. Without this, a casing
// variant or an install-* pin would let the wizard return to the
// selection menu and immediately re-pin to the same Ollama action,
// reintroducing the #4365 loop.
const cases: Array<[string | undefined, boolean]> = [
["ollama", true],
["OLLAMA", true],
[" Ollama ", true],
[" ollama\n", true],
["install-ollama", true],
["INSTALL-OLLAMA", true],
["install-windows-ollama", true],
["start-windows-ollama", true],
["build", false],
["openai", false],
["", false],
[undefined, false],
];
for (const [value, expected] of cases) {
if (value === undefined) delete process.env.NEMOCLAW_PROVIDER;
else process.env.NEMOCLAW_PROVIDER = value;
expect(isOllamaProviderPinned(), `pin=${JSON.stringify(value)}`).toBe(expected);
}
restore();
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use try/finally for cleanup in this test case.

If an assertion fails in this block, restore() is skipped and can leak env/module state into subsequent tests.

Proposed fix
   it("isOllamaProviderPinned recognises every Ollama-using provider key (`#4365`)", () => {
@@
-    for (const [value, expected] of cases) {
-      if (value === undefined) delete process.env.NEMOCLAW_PROVIDER;
-      else process.env.NEMOCLAW_PROVIDER = value;
-      expect(isOllamaProviderPinned(), `pin=${JSON.stringify(value)}`).toBe(expected);
-    }
-    restore();
+    try {
+      for (const [value, expected] of cases) {
+        if (value === undefined) delete process.env.NEMOCLAW_PROVIDER;
+        else process.env.NEMOCLAW_PROVIDER = value;
+        expect(isOllamaProviderPinned(), `pin=${JSON.stringify(value)}`).toBe(expected);
+      }
+    } finally {
+      restore();
+    }
   });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
it("isOllamaProviderPinned recognises every Ollama-using provider key (#4365)", () => {
// Mirror the matching logic in providers.getNonInteractiveProvider so a
// user setting NEMOCLAW_PROVIDER to any of the Ollama-using keys still
// triggers the pinned-provider escape paths. Without this, a casing
// variant or an install-* pin would let the wizard return to the
// selection menu and immediately re-pin to the same Ollama action,
// reintroducing the #4365 loop.
const cases: Array<[string | undefined, boolean]> = [
["ollama", true],
["OLLAMA", true],
[" Ollama ", true],
[" ollama\n", true],
["install-ollama", true],
["INSTALL-OLLAMA", true],
["install-windows-ollama", true],
["start-windows-ollama", true],
["build", false],
["openai", false],
["", false],
[undefined, false],
];
for (const [value, expected] of cases) {
if (value === undefined) delete process.env.NEMOCLAW_PROVIDER;
else process.env.NEMOCLAW_PROVIDER = value;
expect(isOllamaProviderPinned(), `pin=${JSON.stringify(value)}`).toBe(expected);
}
restore();
});
it("isOllamaProviderPinned recognises every Ollama-using provider key (`#4365`)", () => {
// Mirror the matching logic in providers.getNonInteractiveProvider so a
// user setting NEMOCLAW_PROVIDER to any of the Ollama-using keys still
// triggers the pinned-provider escape paths. Without this, a casing
// variant or an install-* pin would let the wizard return to the
// selection menu and immediately re-pin to the same Ollama action,
// reintroducing the `#4365` loop.
const cases: Array<[string | undefined, boolean]> = [
["ollama", true],
["OLLAMA", true],
[" Ollama ", true],
[" ollama\n", true],
["install-ollama", true],
["INSTALL-OLLAMA", true],
["install-windows-ollama", true],
["start-windows-ollama", true],
["build", false],
["openai", false],
["", false],
[undefined, false],
];
try {
for (const [value, expected] of cases) {
if (value === undefined) delete process.env.NEMOCLAW_PROVIDER;
else process.env.NEMOCLAW_PROVIDER = value;
expect(isOllamaProviderPinned(), `pin=${JSON.stringify(value)}`).toBe(expected);
}
} finally {
restore();
}
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/ollama-startup.test.ts` around lines 117 - 144, The test
currently calls restore() at the end and can leak environment/module state if an
assertion throws; wrap the body that mutates process.env.NEMOCLAW_PROVIDER and
calls isOllamaProviderPinned() inside a try/finally and move restore() into the
finally block so restore() always runs; specifically update the test case that
iterates over the cases (where process.env.NEMOCLAW_PROVIDER is set/deleted) to
ensure restore() is invoked in the finally, leaving the loop and expect calls
unchanged.

@wscurran
Copy link
Copy Markdown
Contributor

yimoj and others added 3 commits May 29, 2026 00:20
When Ollama is unreachable or its model runner crashes mid-onboarding, the
wizard kept the user inside Ollama-only loops. The interactive default
autostart-timeout path silently returned to provider selection, and a
"model runner has unexpectedly stopped" probe error re-prompted for another
Ollama tag that would hit the same daemon failure.

Detect Ollama runner crashes during the local probe via a new `daemonFailure`
flag on ValidationResult and isOllamaRunnerCrash matcher. selectAndValidateOllamaModel
escapes to provider selection on daemonFailure; pinned-provider (NEMOCLAW_PROVIDER
of any Ollama key, normalized for case/whitespace) and non-interactive runs
exit instead of looping. The autostart-timeout path now prints a clear
"pick a non-Ollama provider" hint before returning to selection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
The codebase-growth-guardrails CI check requires src/lib/onboard.ts to stay
net-neutral or smaller per PR. The initial fix added 26 lines of daemon-failure
handling inline. Move that logic into the focused
src/lib/onboard/ollama-probe-failure.ts module (allowed under EXTRACTION_DIR)
behind a single dispatcher, handleOllamaProbeFailure, that the
selectAndValidateOllamaModel inner loop calls.

The dispatcher consolidates both the existing model-level failure path and
the new (NVIDIA#4365) daemon-failure escape — pinned-provider exit, non-interactive
abort, and interactive back-to-selection — so the onboard.ts diff is now
net negative versus main. Behavior is unchanged from the previous commit;
added a focused unit-test surface for each of the four branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
CodeRabbit flagged that the pinned-provider error string hard-coded
"NEMOCLAW_PROVIDER=ollama" even though isOllamaProviderPinned also matches
install-ollama, install-windows-ollama, and start-windows-ollama (plus case/
whitespace variants of any of those). When the wizard exits on a pinned-
Ollama unreachable / runner-crash failure, the literal message could disagree
with the user's actual env value and mislead debugging.

Replace the hard-coded "NEMOCLAW_PROVIDER=ollama is pinned" prefix with
"NEMOCLAW_PROVIDER pins onboarding to Ollama" so the message stays accurate
across every Ollama-routing pin variant. Test assertions updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/4365-ollama-fallback-loop branch from 47c1c4b to 02fffaa Compare May 29, 2026 00:21
@jyaunches jyaunches added R2 v0.0.56 Release target and removed v0.0.55 labels May 29, 2026
@cv cv enabled auto-merge (squash) May 29, 2026 23:54
@cv cv merged commit 2d0a78b into NVIDIA:main May 29, 2026
20 of 21 checks passed
@wscurran wscurran added area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression and removed Local Models labels Jun 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: local-models Local model providers, downloads, launch, or connectivity area: providers Inference provider integrations and provider behavior bug-fix PR fixes a bug or regression v0.0.56 Release target

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants