Feat/Provider Fallback Chain — Design Document (#2574) #2581
Open
idling11 wants to merge 2 commits into
Summary
Add an automatic provider fallback chain so that when the active provider
keeps failing with a fallback-eligible error (HTTP 429, selected 5xx
responses, or a connection timeout) after its retries are exhausted,
CodeWhale switches to the next configured provider without interrupting
the user's workflow.
Motivation
Currently, users must manually run /provider to switch when their
primary provider fails. This is especially disruptive during long-running
agentic tasks. A fallback chain keeps the agent working without user
intervention.
Design
Configuration
- fallback: ordered list of provider names to try
- active: the primary provider (the existing provider key, renamed for clarity)
Fallback triggers
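The Summary lists the errors that trigger a fallback: HTTP 429, selected 5xx responses, and connection timeouts. A minimal sketch of the `is_fallback_eligible(error) -> bool` check planned for Phase 2 follows. The `ProviderError` enum is hypothetical, and the 5xx subset used here (500, 502, 503, 504) is an assumption, since the document only says "selected 5xx".

```rust
/// Hypothetical error type; the real client error type in
/// crates/tui/src/client.rs may look different.
#[derive(Debug)]
pub enum ProviderError {
    /// The provider answered with a non-success HTTP status.
    Http { status: u16 },
    /// The connection attempt or request timed out.
    ConnectionTimeout,
    /// Anything else (bad request, auth failure, malformed response, ...).
    Other(String),
}

/// Assumed subset of 5xx statuses that count as "provider is down".
/// The design only says "selected 5xx"; this list is illustrative.
const FALLBACK_5XX: [u16; 4] = [500, 502, 503, 504];

/// Returns true when the error may move the turn to the next provider
/// in the fallback chain (once retries on the current one are exhausted).
pub fn is_fallback_eligible(error: &ProviderError) -> bool {
    match error {
        // Rate limited by the provider.
        ProviderError::Http { status: 429 } => true,
        // Server-side failures from the assumed 5xx subset.
        ProviderError::Http { status } => FALLBACK_5XX.contains(status),
        // Timeouts are treated as the provider being unreachable.
        ProviderError::ConnectionTimeout => true,
        // Other errors are not fallback triggers in this sketch and
        // are surfaced to the user directly.
        ProviderError::Other(_) => false,
    }
}
```

Keeping the predicate separate from the retry loop means the trigger list can change without touching the code that walks the chain.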
Sequence
Transcript / UI
- "NVIDIA NIM unavailable — switched to DeepSeek"
- [provider: nvidia-nim → deepseek]
- The /provider command shows the current chain position: deepseek (fallback #1)
- The original (active) provider is remembered so the user can run /provider reset to go back
Capability awareness
Before switching, the engine checks that the fallback provider supports
the current turn's needs:
If no fallback provider has the required capabilities, the error is surfaced directly.
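The list of checked capabilities did not survive in this copy of the document, so the sketch below assumes hypothetical ones (tool calls, image input, and a context-window size) only to illustrate the selection rule: walk the chain in order after the current position, pick the first provider that covers the turn's needs, and return `None` so the caller surfaces the error. `Capabilities`, `TurnNeeds`, and `next_capable_provider` are illustrative names, not the project's API.

```rust
/// Hypothetical capability description for one provider.
#[derive(Debug, Clone)]
pub struct Capabilities {
    pub tool_calls: bool,
    pub image_input: bool,
    pub context_window: u32,
}

/// Hypothetical description of what the current turn needs.
#[derive(Debug, Clone)]
pub struct TurnNeeds {
    pub tool_calls: bool,
    pub image_input: bool,
    pub context_tokens: u32,
}

impl Capabilities {
    /// True when this provider can serve a turn with the given needs.
    pub fn supports(&self, needs: &TurnNeeds) -> bool {
        (!needs.tool_calls || self.tool_calls)
            && (!needs.image_input || self.image_input)
            && self.context_window >= needs.context_tokens
    }
}

/// `chain[0]` is the active provider; later entries are the configured
/// fallbacks in order. Returns the index and name of the first provider
/// after `current_pos` that satisfies `needs`, or None if none qualifies.
pub fn next_capable_provider<'a>(
    chain: &'a [(String, Capabilities)],
    current_pos: usize,
    needs: &TurnNeeds,
) -> Option<(usize, &'a str)> {
    chain
        .iter()
        .enumerate()
        .skip(current_pos + 1)
        .find(|(_, (_, caps))| caps.supports(needs))
        .map(|(i, (name, _))| (i, name.as_str()))
}
```

Skipping incapable providers rather than stopping at the first one lets a later, more capable fallback still serve the turn.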
Retry integration
Existing [retry] settings apply per provider before fallback triggers:
- A provider gets max_retries attempts with retry_delay between them.
- Only after retries are exhausted does fallback move to the next provider.
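The retry-then-fallback ordering can be sketched as a loop: each provider gets `max_retries` attempts with `retry_delay` between them, and only an exhausted, fallback-eligible error moves on to the next provider. This is a minimal synchronous sketch. The real `try_with_fallback()` in `client.rs` is presumably async, and the `send` closure, the `CallError` and `Served` types, and treating a non-eligible error as an immediate failure are assumptions for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error shape: a message plus whether it may trigger fallback.
#[derive(Debug, Clone, PartialEq)]
pub struct CallError {
    pub message: String,
    pub fallback_eligible: bool,
}

/// Result of a call, tagged with the provider that actually served it.
#[derive(Debug, PartialEq)]
pub struct Served<T> {
    pub provider: String,
    pub value: T,
}

/// Tries each provider in `chain` (active first, then fallbacks in order).
/// Each provider gets up to `max_retries` attempts (at least one), sleeping
/// `retry_delay` between attempts. A non-eligible error is returned at once;
/// an eligible error moves to the next provider only after the attempts for
/// the current provider are used up.
pub fn try_with_fallback<T, F>(
    chain: &[&str],
    max_retries: u32,
    retry_delay: Duration,
    mut send: F,
) -> Result<Served<T>, CallError>
where
    F: FnMut(&str) -> Result<T, CallError>,
{
    let mut last_err = CallError {
        message: "empty provider chain".to_string(),
        fallback_eligible: false,
    };
    for provider in chain {
        for attempt in 1..=max_retries.max(1) {
            match send(provider) {
                Ok(value) => {
                    return Ok(Served { provider: provider.to_string(), value });
                }
                // Not a fallback trigger: surface it directly.
                Err(e) if !e.fallback_eligible => return Err(e),
                Err(e) => {
                    last_err = e;
                    // Wait only between attempts, not after the last one.
                    if attempt < max_retries {
                        sleep(retry_delay);
                    }
                }
            }
        }
        // Retries exhausted on this provider: continue with the next one.
    }
    // Every provider in the chain failed; report the last eligible error.
    Err(last_err)
}
```

A fuller version would also run the capability check before moving to a fallback and emit a `ProviderFallback` event on each switch; both are left out to keep the loop readable.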
Config schema validation
On startup, validate:
- Each fallback entry is a known provider
- No provider appears twice in fallback
- No fallback entry is the same as active
Implementation Plan (3 Draft PRs)
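The startup checks (each `fallback` entry is a known provider, with no duplicates and none equal to `active`, as listed for `Config::validate()` in Phase 1) can be sketched as follows. This is a std-only approximation: the real `ProvidersConfig` would carry the `fallback: Option<Vec<String>>` field with `#[serde(default)]`, whereas the struct below, `validate_fallback`, and `FallbackConfigError` are hypothetical.

```rust
use std::collections::HashSet;

/// Simplified stand-in for ProvidersConfig; only the fields relevant to
/// fallback validation are modeled (serde attributes omitted).
pub struct ProvidersConfig {
    /// The primary provider.
    pub active: String,
    /// Ordered fallback chain; None when the key is absent from the config.
    pub fallback: Option<Vec<String>>,
}

/// Hypothetical validation errors, one per rule.
#[derive(Debug, PartialEq)]
pub enum FallbackConfigError {
    UnknownProvider(String),
    Duplicate(String),
    SameAsActive(String),
}

impl ProvidersConfig {
    /// Checks the fallback chain against the set of known provider names.
    pub fn validate_fallback(&self, known: &[&str]) -> Result<(), FallbackConfigError> {
        let Some(chain) = &self.fallback else {
            return Ok(()); // No fallback configured: nothing to check.
        };
        let mut seen = HashSet::new();
        for name in chain {
            if !known.contains(&name.as_str()) {
                return Err(FallbackConfigError::UnknownProvider(name.clone()));
            }
            // insert() returns false when the name was already present.
            if !seen.insert(name.as_str()) {
                return Err(FallbackConfigError::Duplicate(name.clone()));
            }
            if *name == self.active {
                return Err(FallbackConfigError::SameAsActive(name.clone()));
            }
        }
        Ok(())
    }
}
```

Failing fast at startup keeps a typo in the chain from surfacing only later, in the middle of a turn, when a fallback is actually needed.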
Phase 1: Config schema + validation
Branch: feat/provider-fallback-chain-phase1
Files: crates/tui/src/config.rs
- Add a fallback: Option<Vec<String>> field to ProvidersConfig
- #[serde(default)] for backward compatibility
- Config::validate(): known provider, no duplicates, not same as active
- fallback merge logic in merge_provider_config()
Phase 2: Engine fallback logic
Branch: feat/provider-fallback-chain-phase2
Files: crates/tui/src/client.rs, crates/tui/src/core/engine/turn_loop.rs
- ActiveProviderTracker to remember the original provider and current position
- is_fallback_eligible(error) -> bool
- try_with_fallback() in client.rs: iterate the fallback chain on eligible errors
- /provider reset
- ProviderFallback { from, to, reason }
Phase 3: UI feedback
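The Phase 2 `ActiveProviderTracker` holds the state that the Phase 3 `/provider` output reads. A minimal in-memory sketch, assuming the chain is the active provider followed by the fallback list; the method names (`advance`, `status_line`, `reset`, `marker`) are hypothetical, while the `deepseek (fallback #1)` and `[provider: nvidia-nim → deepseek]` formats come from the Transcript / UI section.

```rust
/// Event recorded when a turn moves to another provider
/// (named ProviderFallback { from, to, reason } in the Phase 2 plan).
#[derive(Debug, Clone, PartialEq)]
pub struct ProviderFallback {
    pub from: String,
    pub to: String,
    pub reason: String,
}

impl ProviderFallback {
    /// Marker text, e.g. "[provider: nvidia-nim → deepseek]".
    pub fn marker(&self) -> String {
        format!("[provider: {} → {}]", self.from, self.to)
    }
}

/// Remembers the original provider and the current chain position.
/// Position 0 is the active (primary) provider; 1.. are fallbacks.
pub struct ActiveProviderTracker {
    chain: Vec<String>,
    position: usize,
}

impl ActiveProviderTracker {
    pub fn new(active: &str, fallback: &[&str]) -> Self {
        let mut chain = vec![active.to_string()];
        chain.extend(fallback.iter().map(|s| s.to_string()));
        Self { chain, position: 0 }
    }

    /// Name of the provider currently serving requests.
    pub fn current(&self) -> &str {
        &self.chain[self.position]
    }

    /// Moves to the next provider and returns the event to record,
    /// or None when the chain is already exhausted.
    pub fn advance(&mut self, reason: &str) -> Option<ProviderFallback> {
        if self.position + 1 >= self.chain.len() {
            return None;
        }
        let from = self.current().to_string();
        self.position += 1;
        Some(ProviderFallback {
            from,
            to: self.current().to_string(),
            reason: reason.to_string(),
        })
    }

    /// Text for `/provider`, e.g. "deepseek (fallback #1)".
    pub fn status_line(&self) -> String {
        if self.position == 0 {
            self.current().to_string()
        } else {
            format!("{} (fallback #{})", self.current(), self.position)
        }
    }

    /// `/provider reset`: return to the original provider.
    pub fn reset(&mut self) {
        self.position = 0;
    }
}
```

Because this tracker lives only in memory, a fresh launch starts back at the primary provider, which matches the "Reset each launch" answer under Open questions.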
Branch: feat/provider-fallback-chain-phase3
Files: crates/tui/src/tui/ui.rs, crates/tui/src/commands/provider.rs
- /provider shows fallback position and chain
- /provider reset to return to the primary provider
Rejected alternatives
Open questions
→ Reset each launch (avoids silently staying on fallback forever)
Should /compact reset to the primary provider? → No: compaction changes context, not provider
→ Yes, the same turn can span providers as long as capabilities match