Feat/Provider Fallback Chain — Design Document (#2574) #2581

Open
idling11 wants to merge 2 commits into Hmbown:main from idling11:feat/provider-fallback-chain


@idling11 idling11 commented Jun 2, 2026

Summary

Add an automatic provider fallback chain so that when the active provider
returns a fallback-eligible error (429, 502/503/504, connection timeout, or
DNS failure), CodeWhale switches to the next configured provider without
interrupting the user's workflow.

Motivation

Currently, users must manually run /provider to switch when their
primary provider fails. This is especially disruptive during long-running
agentic tasks. A fallback chain keeps the agent working without user
intervention.

Design

Configuration

[providers]
active = "nvidia-nim"
fallback = ["deepseek", "openrouter"]

[providers.nvidia-nim]
api_key = "nvapi-..."
base_url = "https://integrate.api.nvidia.com/v1"
model = "meta/llama-4"

[providers.deepseek]
api_key = "$DEEPSEEK_API_KEY"
model = "deepseek-v4-pro"

[providers.openrouter]
api_key = "$OPENROUTER_API_KEY"
model = "deepseek/deepseek-v4-0324"
  • fallback — ordered list of provider names to try
  • active — the primary provider (existing provider key, renamed for clarity)

Fallback triggers

| Error | Fallback? | Rationale |
|---|---|---|
| 429 (rate limit) | Yes | Quota exhausted; swap key/provider |
| 502 / 503 / 504 | Yes | Provider infrastructure issue |
| Connection timeout / DNS failure | Yes | Network path broken |
| 401 / 403 | No | Auth issue; no other provider will help |
| 400 (bad request) | No | Client error; not provider-specific |
| Stream interrupted mid-content | No | Already consumed partial response |
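The classification above could be expressed as a small pure function. The following is a minimal sketch, not the PR's implementation: the `ProviderError` enum and its variants are hypothetical stand-ins for whatever error type the client actually uses, and only the name `is_fallback_eligible` comes from the implementation plan. Status codes the table does not list (for example 500) are treated as not eligible here, which is an assumption.

```rust
/// Hypothetical error type for illustration; the real client error may differ.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ProviderError {
    /// The provider answered with a non-success HTTP status.
    Http { status: u16 },
    /// The connection timed out before any response arrived.
    ConnectTimeout,
    /// DNS resolution for the provider host failed.
    DnsFailure,
    /// The stream broke after part of the response was already delivered.
    StreamInterrupted,
}

/// Returns true when the error should move the engine to the next provider.
pub fn is_fallback_eligible(error: &ProviderError) -> bool {
    match error {
        // Rate limiting and gateway-style failures are provider-side problems.
        ProviderError::Http { status } => matches!(*status, 429 | 502 | 503 | 504),
        // The network path to this provider is broken.
        ProviderError::ConnectTimeout | ProviderError::DnsFailure => true,
        // Partial content was already consumed, so the turn is not replayed.
        ProviderError::StreamInterrupted => false,
    }
}
```

Keeping the classifier free of I/O makes every row of the table directly unit-testable.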

Sequence

1. Try primary provider (nvidia-nim)
2. On fallback-eligible error → try fallback[0] (deepseek)
3. On fallback-eligible error → try fallback[1] (openrouter)
4. All exhausted → surface clear error to user
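The sequence above might be sketched as a loop over the primary provider followed by the fallback list. This is an illustrative, synchronous sketch: `ChainSuccess`, `ChainError`, and the closure-based signature are assumptions made so the example is self-contained, and the real `try_with_fallback()` in `client.rs` may look quite different (it is likely async and tied to the client type).

```rust
/// The provider that finally answered, together with its response.
#[derive(Debug, PartialEq, Eq)]
pub struct ChainSuccess<T> {
    pub provider: String,
    pub response: T,
}

/// Why the walk over the chain stopped without a response.
#[derive(Debug, PartialEq, Eq)]
pub enum ChainError<E> {
    /// A provider failed with an error that must not trigger fallback.
    NotEligible { provider: String, error: E },
    /// Every provider failed with a fallback-eligible error.
    Exhausted { attempts: Vec<(String, E)> },
}

/// Tries `active` first, then each entry of `fallback` in order.
/// `send` performs one request; `is_eligible` classifies its errors.
pub fn try_with_fallback<T, E>(
    active: &str,
    fallback: &[String],
    mut send: impl FnMut(&str) -> Result<T, E>,
    is_eligible: impl Fn(&E) -> bool,
) -> Result<ChainSuccess<T>, ChainError<E>> {
    let mut attempts = Vec::new();
    let chain = std::iter::once(active).chain(fallback.iter().map(String::as_str));
    for provider in chain {
        match send(provider) {
            Ok(response) => {
                return Ok(ChainSuccess { provider: provider.to_string(), response });
            }
            // Eligible error: record it and move on to the next provider.
            Err(error) if is_eligible(&error) => {
                attempts.push((provider.to_string(), error));
            }
            // Non-eligible error: stop immediately and report it.
            Err(error) => {
                return Err(ChainError::NotEligible { provider: provider.to_string(), error });
            }
        }
    }
    // Step 4: every provider was tried and failed.
    Err(ChainError::Exhausted { attempts })
}
```

Returning the per-provider errors in `Exhausted` gives the caller enough detail to build the clear error message that step 4 calls for.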

Transcript / UI

  • Status toast: NVIDIA NIM unavailable — switched to DeepSeek
  • Transcript marker: [provider: nvidia-nim → deepseek]
  • /provider command shows current chain position: deepseek (fallback #1)
  • The original (active) provider is remembered so the user can run /provider reset to return to it
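One way to back the UI behaviour listed above is a small tracker that remembers the primary provider and the current chain position. The sketch below borrows the name `ActiveProviderTracker` from the implementation plan, but its fields, methods, and the `(primary)` label are assumptions for illustration only.

```rust
/// Tracks which provider in the chain is currently in use.
pub struct ActiveProviderTracker {
    active: String,
    fallback: Vec<String>,
    /// `None` while on the primary, `Some(i)` while on `fallback[i]`.
    position: Option<usize>,
}

impl ActiveProviderTracker {
    pub fn new(active: &str, fallback: &[&str]) -> Self {
        Self {
            active: active.to_string(),
            fallback: fallback.iter().map(|s| s.to_string()).collect(),
            position: None,
        }
    }

    /// Name of the provider currently in use.
    pub fn current(&self) -> &str {
        match self.position {
            None => &self.active,
            Some(i) => &self.fallback[i],
        }
    }

    /// Moves to the next fallback and returns the transcript marker,
    /// or `None` (leaving the position unchanged) when the chain is exhausted.
    pub fn advance(&mut self) -> Option<String> {
        let next = self.position.map_or(0, |i| i + 1);
        let to = self.fallback.get(next)?.clone();
        let marker = format!("[provider: {} → {}]", self.current(), to);
        self.position = Some(next);
        Some(marker)
    }

    /// Text that `/provider` could show for the current chain position.
    pub fn status(&self) -> String {
        match self.position {
            None => format!("{} (primary)", self.active),
            Some(i) => format!("{} (fallback #{})", self.fallback[i], i + 1),
        }
    }

    /// Backs `/provider reset`: return to the primary provider.
    pub fn reset(&mut self) {
        self.position = None;
    }
}
```

Because the primary is never overwritten, resetting is a single assignment and cannot lose the user's original choice.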

Capability awareness

Before switching, the engine checks that the fallback provider supports
the current turn's needs:

| Capability | Check |
|---|---|
| Tools / function calling | Fallback provider must support tools |
| Reasoning effort | Must support the same reasoning levels |
| Context length | Model's context window must be ≥ the current turn's token count |
| Vision | Must support image inputs if the turn has images |

If no fallback provider meets capabilities, the error is surfaced directly.
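The capability gate described above might look like the following sketch. All type and field names (`Capabilities`, `TurnNeeds`, `ReasoningEffort`, `first_capable`) are hypothetical, and the reasoning check is simplified to "the level requested for this turn is supported", which is one possible reading of "same reasoning levels".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReasoningEffort {
    Low,
    Medium,
    High,
}

/// What a provider/model combination can do (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Capabilities {
    pub tools: bool,
    pub vision: bool,
    pub reasoning_levels: Vec<ReasoningEffort>,
    pub context_window: usize,
}

/// What the current turn requires (hypothetical shape).
#[derive(Debug, Clone)]
pub struct TurnNeeds {
    pub uses_tools: bool,
    pub has_images: bool,
    pub reasoning_effort: Option<ReasoningEffort>,
    pub token_count: usize,
}

/// True when the provider satisfies every row of the capability table.
pub fn supports_turn(caps: &Capabilities, turn: &TurnNeeds) -> bool {
    let tools_ok = !turn.uses_tools || caps.tools;
    let vision_ok = !turn.has_images || caps.vision;
    let reasoning_ok = turn
        .reasoning_effort
        .map_or(true, |level| caps.reasoning_levels.contains(&level));
    let context_ok = caps.context_window >= turn.token_count;
    tools_ok && vision_ok && reasoning_ok && context_ok
}

/// Picks the first candidate, in chain order, that can serve the turn.
/// Returns `None` when no candidate qualifies, so the caller can surface
/// the original error directly.
pub fn first_capable<'a>(
    candidates: &'a [(String, Capabilities)],
    turn: &TurnNeeds,
) -> Option<&'a str> {
    candidates
        .iter()
        .find(|(_, caps)| supports_turn(caps, turn))
        .map(|(name, _)| name.as_str())
}
```

Skipping an incapable provider rather than aborting keeps the chain order deterministic while still honouring the rule that an error surfaces only when nothing in the chain qualifies.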

Retry integration

Existing [retry] settings apply per-provider before fallback triggers.
A provider gets max_retries attempts with retry_delay between them.
Only after retry exhaustion does fallback move to the next provider.
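The ordering described above, retries first and fallback second, can be sketched as two nested loops. This example is illustrative only: `RetryConfig` and both function names are hypothetical, `max_retries` is read as the total number of attempts per provider (following the wording above), the sleep is blocking where the real engine is likely async, and for brevity every error is treated as both retryable and fallback-eligible, which the real error classifier would not do.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical mirror of the existing [retry] settings.
pub struct RetryConfig {
    pub max_retries: u32,
    pub retry_delay: Duration,
}

/// Gives one provider up to `max_retries` attempts (at least one),
/// sleeping `retry_delay` between failed attempts.
pub fn attempt_with_retries<T, E>(
    retry: &RetryConfig,
    mut send: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let attempts = retry.max_retries.max(1);
    let mut attempt = 1;
    loop {
        match send() {
            Ok(value) => return Ok(value),
            // Out of attempts: hand the last error back to the caller.
            Err(error) if attempt >= attempts => return Err(error),
            Err(_) => {
                thread::sleep(retry.retry_delay);
                attempt += 1;
            }
        }
    }
}

/// Walks the chain in order; a provider is abandoned only after its
/// retries are exhausted. On total failure, returns each provider's
/// final error in chain order.
pub fn send_with_retry_then_fallback<T, E>(
    chain: &[&str],
    retry: &RetryConfig,
    mut send: impl FnMut(&str) -> Result<T, E>,
) -> Result<(String, T), Vec<E>> {
    let mut errors = Vec::new();
    for &provider in chain {
        match attempt_with_retries(retry, || send(provider)) {
            Ok(value) => return Ok((provider.to_string(), value)),
            Err(error) => errors.push(error),
        }
    }
    Err(errors)
}
```

With `max_retries = 2` and a primary that always fails, the call order is primary, primary, then the first fallback.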

Config schema validation

On startup, validate:

  • Each fallback entry is a known provider
  • No duplicate providers in chain
  • Fallback entry is not the same as active provider
  • Warn if fallback model has different capability profile
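The three hard validation rules above might be checked as in this sketch. `FallbackConfigError` and `validate_fallback` are hypothetical names, the set of known providers is passed in explicitly rather than read from the real config type, and the capability-profile warning is left out because it needs capability metadata this example does not model.

```rust
use std::collections::HashSet;

/// Hypothetical validation error for the fallback chain.
#[derive(Debug, PartialEq, Eq)]
pub enum FallbackConfigError {
    UnknownProvider(String),
    DuplicateProvider(String),
    SameAsActive(String),
}

/// Validates the `fallback` list against the configured providers.
/// Returns the first violation found, in list order.
pub fn validate_fallback(
    active: &str,
    fallback: &[String],
    known: &HashSet<String>,
) -> Result<(), FallbackConfigError> {
    let mut seen: HashSet<&str> = HashSet::new();
    for name in fallback {
        // Each entry must refer to a configured provider.
        if !known.contains(name) {
            return Err(FallbackConfigError::UnknownProvider(name.clone()));
        }
        // The primary provider must not appear in its own fallback chain.
        if name.as_str() == active {
            return Err(FallbackConfigError::SameAsActive(name.clone()));
        }
        // `insert` returns false when the name was already present.
        if !seen.insert(name.as_str()) {
            return Err(FallbackConfigError::DuplicateProvider(name.clone()));
        }
    }
    Ok(())
}
```

An empty `fallback` list passes validation, which matches the backward-compatibility goal of treating a missing field as "no fallback".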

Implementation Plan (3 Draft PRs)

Phase 1: Config schema + validation

Branch: feat/provider-fallback-chain-phase1
Files: crates/tui/src/config.rs

  • Add fallback: Option<Vec<String>> field to ProvidersConfig
  • Add #[serde(default)] for backward compatibility
  • Add validation in Config::validate(): known provider, no duplicates, not same as active
  • Add fallback merge logic in merge_provider_config()
  • Unit tests: valid chain, invalid provider, duplicates

Phase 2: Engine fallback logic

Branch: feat/provider-fallback-chain-phase2
Files: crates/tui/src/client.rs, crates/tui/src/core/engine/turn_loop.rs

  • Add ActiveProviderTracker to remember original provider and current position
  • Error classifier: is_fallback_eligible(error) -> bool
  • try_with_fallback() in client.rs: iterate fallback chain on eligible errors
  • Save original provider before first fallback, restore via /provider reset
  • Event emission: ProviderFallback { from, to, reason }

Phase 3: UI feedback

Branch: feat/provider-fallback-chain-phase3
Files: crates/tui/src/tui/ui.rs, crates/tui/src/commands/provider.rs

  • Status toast on fallback switch
  • Transcript marker for provider transitions
  • /provider shows fallback position and chain
  • /provider reset to return to primary provider

Rejected alternatives

  • Per-request model routing: Too fine-grained; turns have state (system prompt, tools) that shouldn't change mid-turn
  • Weighted random selection: Unpredictable billing; users need deterministic behavior
  • Sub-agent-level fallback: Complicates sub-agent lifecycle for marginal gain

Open questions

  1. Should fallback persist across sessions or reset each launch?
    Proposed: reset each launch (avoids silently staying on a fallback provider indefinitely).
  2. Should /compact reset to the primary provider?
    Proposed: no. Compaction changes context, not the provider.
  3. Tool call mid-turn: if a tool call succeeds but the next API call fails, do we fall back?
    Proposed: yes. The same turn can span providers as long as capabilities match.


greptile-apps Bot commented Jun 2, 2026

No reviewable files after applying ignore patterns.


github-actions Bot commented Jun 2, 2026

Thanks @idling11 for taking the time to contribute.

This repository is currently observing a maintainer-managed contribution gate in dry-run mode, so this pull request is staying open. When enforcement is enabled, pull requests from contributors who are not listed in .github/APPROVED_CONTRIBUTORS will be closed automatically.

Please read CONTRIBUTING.md for the expected contribution shape. A maintainer can grant PR access by commenting /lgtm on a pull request.

@gemini-code-assist

Note

Gemini is unable to generate a review for this pull request due to the file types involved not being currently supported.
