From 53d25787cd1aa638f0a13d617029b31d5575d3c7 Mon Sep 17 00:00:00 2001
From: Sameer Kankute
Date: Fri, 15 May 2026 12:00:46 +0530
Subject: [PATCH 1/3] docs(proxy): document enable_weighted_failover router setting

Co-authored-by: Cursor
---
 docs/proxy/config_settings.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/proxy/config_settings.md b/docs/proxy/config_settings.md
index 901c330a..7dfea79c 100644
--- a/docs/proxy/config_settings.md
+++ b/docs/proxy/config_settings.md
@@ -378,6 +378,7 @@ router_settings:
 | content_policy_fallbacks | array of objects | Specifies fallback models for content policy violations. [More information here](reliability) |
 | fallbacks | array of objects | Specifies fallback models for all types of errors. [More information here](reliability) |
 | enable_tag_filtering | boolean | If true, uses tag based routing for requests [Tag Based Routing](tag_routing) |
+| enable_weighted_failover | boolean | If true and `routing_strategy` is `simple-shuffle`, a retryable failure on one deployment re-picks (weighted) across other deployments in the same model group before cross-group fallbacks. Default: false. |
 | tag_filtering_match_any | boolean | Tag matching behavior (only when enable_tag_filtering=true). `true`: match if deployment has ANY requested tag; `false`: match only if deployment has ALL requested tags |
 | cooldown_time | integer | The duration (in seconds) to cooldown a model if it exceeds the allowed failures. |
 | disable_cooldowns | boolean | If true, disables cooldowns for all models. [More information here](reliability) |

From de98ad68340a279029e794d28f7985bfacdcc9fc Mon Sep 17 00:00:00 2001
From: Sameer Kankute
Date: Fri, 15 May 2026 16:49:04 +0530
Subject: [PATCH 2/3] docs(routing): document weighted failover (intra-group retry)

Explains enable_weighted_failover behavior, scope (simple-shuffle only, async-only, skipped for context-window/content-policy errors), interaction with `order`, and a worked walkthrough. Adds an SDK + proxy config example covering the common multi-region Azure setup.

Co-authored-by: Cursor
---
 docs/routing.md | 99 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/docs/routing.md b/docs/routing.md
index 0a42f792..5269dbe3 100644
--- a/docs/routing.md
+++ b/docs/routing.md
@@ -1056,6 +1056,105 @@ model_list:
+### Weighted Failover
+
+By default, when a deployment in a model group fails, the router moves on to the next entry in `fallbacks` (a different model group). With `enable_weighted_failover`, the router first retries **inside the same model group** by re-picking a different deployment using the existing weights, and only escalates to cross-group fallbacks once every deployment in the group has been tried.
+
+This is useful when you have multiple regional copies of the same model (e.g. Azure `eastus2` + `swedencentral`) and want a failed region to fail over to a healthy peer with the same `model_name`, instead of immediately switching to a different model.
+
+**Behavior**
+
+- Only active when `routing_strategy="simple-shuffle"` (the default).
+- On a retryable failure, the failing deployment ID is excluded and a new deployment is picked from the remaining peers in the same model group, respecting `weight` / `rpm` / `tpm`.
+- Exclusions accumulate across hops: each retry adds the previous failure to the exclusion set, so a deployment that just failed is never picked again in the same request chain.
+- Capped by `max_fallbacks` (default `5`).
+- Not triggered for `ContextWindowExceededError` or `ContentPolicyViolationError` — those keep their dedicated fallback paths.
+- Async-only: honored by `router.acompletion()` and other async entrypoints. The sync `router.completion()` path falls through to regular fallbacks.
+- Cooldowns still apply: a deployment that crosses `allowed_fails` is cooled down independently of weighted failover.
+
+**Order vs. weight**
+
+If the same group also uses `order`, the order filter runs **before** the weighted pick. So weighted failover re-picks only among the deployments in the current minimum-order tier. Promotion to the next order tier happens through the existing order-based fallback path.
+
+**Config**
+
+
+
+```python
+from litellm import Router
+
+model_list = [
+    {
+        "model_name": "gpt-4.1-mini",
+        "litellm_params": {
+            "model": "azure/gpt-4.1-mini",
+            "api_base": "https://eastus2.example.azure.com",
+            "api_key": os.getenv("AZURE_EASTUS2_KEY"),
+            "weight": 1,
+        },
+    },
+    {
+        "model_name": "gpt-4.1-mini",
+        "litellm_params": {
+            "model": "azure/gpt-4.1-mini",
+            "api_base": "https://swedencentral.example.azure.com",
+            "api_key": os.getenv("AZURE_SWEDEN_KEY"),
+            "weight": 1,
+        },
+    },
+]
+
+router = Router(
+    model_list=model_list,
+    routing_strategy="simple-shuffle",
+    enable_weighted_failover=True,  # 👈 retry within the same model group on failure
+)
+
+response = await router.acompletion(
+    model="gpt-4.1-mini",
+    messages=[{"role": "user", "content": "Hey"}],
+)
+```
+
+
+
+```yaml
+model_list:
+  - model_name: gpt-4.1-mini
+    litellm_params:
+      model: azure/gpt-4.1-mini
+      api_base: https://eastus2.example.azure.com
+      api_key: os.environ/AZURE_EASTUS2_KEY
+      weight: 1
+  - model_name: gpt-4.1-mini
+    litellm_params:
+      model: azure/gpt-4.1-mini
+      api_base: https://swedencentral.example.azure.com
+      api_key: os.environ/AZURE_SWEDEN_KEY
+      weight: 1
+
+router_settings:
+  routing_strategy: simple-shuffle
+  enable_weighted_failover: true  # 👈 retry within the same model group on failure
+```
+
+
+
+**Walkthrough**
+
+With the config above and a request to `gpt-4.1-mini`:
+
+1. `simple-shuffle` picks one of the two deployments using `weight`.
+2. If the picked deployment raises a provider error (e.g. `RateLimitError`, `InternalServerError`), its deployment ID is added to `metadata._failover_excluded_ids`.
+3. The router re-enters `simple-shuffle` with the failed deployment excluded and weights renormalized over what's left.
+4. Steps 2–3 repeat until a deployment succeeds, every peer has been excluded, or `max_fallbacks` is reached.
+5. Only after all peers are exhausted does the router fall through to any `fallbacks` configured for the group.
+
+See [`enable_weighted_failover`](./proxy/config_settings#router_settings---reference) in the router settings reference for the flag.
+
 ### Max Parallel Requests (ASYNC)
 
 Used in semaphore for async requests on router. Limit the max concurrent calls made to a deployment. Useful in high-traffic scenarios.

From 572dd403bb6f49158878424f49e879f06020421a Mon Sep 17 00:00:00 2001
From: Sameer Kankute
Date: Mon, 18 May 2026 14:03:40 +0530
Subject: [PATCH 3/3] docs(interactions): document LITELLM_USE_LEGACY_INTERACTIONS_SCHEMA env var

Add documentation for the new env var that controls the Google Interactions API schema version (new steps-based vs legacy outputs-based) to the environment variables reference table.

Co-authored-by: Cursor
---
 docs/proxy/config_settings.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/proxy/config_settings.md b/docs/proxy/config_settings.md
index 7dfea79c..11f9fb36 100644
--- a/docs/proxy/config_settings.md
+++ b/docs/proxy/config_settings.md
@@ -910,6 +910,7 @@ router_settings:
 | LITELLM_TOKEN | Access token for LiteLLM integration
 | LITELLM_USE_CHAT_COMPLETIONS_URL_FOR_ANTHROPIC_MESSAGES | When set to "true", routes OpenAI /v1/messages requests through chat/completions instead of the Responses API for Anthropic models. Can also be set via `litellm_settings.use_chat_completions_url_for_anthropic_messages`
 | LITELLM_ROUTE_ALL_CHAT_OPENAI_TO_RESPONSES | When set to "true", routes all OpenAI /chat/completions requests through the Responses API bridge. Recommended for OpenAI models. Can also be set via `litellm_settings.route_all_chat_openai_to_responses`
+| LITELLM_USE_LEGACY_INTERACTIONS_SCHEMA | When set to "true", uses the legacy Google Interactions API schema (`outputs` array, `2026-05-07` revision) instead of the new schema (`steps` array, `2026-05-20` revision). The legacy schema will be sunset on June 8, 2026. Can also be set via `litellm_settings.use_legacy_interactions_schema`
 | LITELLM_USER_AGENT | Custom user agent string for LiteLLM API requests. Used for partner telemetry attribution
 | LITELLM_WORKER_STARTUP_HOOKS | Comma-separated list of `module.path:function_name` callables to run in each worker process during startup. Runs early in the worker lifecycle (before config/DB loading). Useful for re-initializing per-process state like [gflags](https://github.com/google/python-gflags). See [Worker Startup Hooks](/proxy/worker_startup_hooks) for details
 | LITELLM_PRINT_STANDARD_LOGGING_PAYLOAD | If true, prints the standard logging payload to the console - useful for debugging