Merged
2 changes: 2 additions & 0 deletions docs/proxy/config_settings.md
@@ -378,6 +378,7 @@ router_settings:
| content_policy_fallbacks | array of objects | Specifies fallback models for content policy violations. [More information here](reliability) |
| fallbacks | array of objects | Specifies fallback models for all types of errors. [More information here](reliability) |
| enable_tag_filtering | boolean | If true, uses tag based routing for requests [Tag Based Routing](tag_routing) |
| enable_weighted_failover | boolean | If true and `routing_strategy` is `simple-shuffle`, a retryable failure on one deployment re-picks (weighted) across other deployments in the same model group before cross-group fallbacks. Default: false. |
| tag_filtering_match_any | boolean | Tag matching behavior (only when enable_tag_filtering=true). `true`: match if deployment has ANY requested tag; `false`: match only if deployment has ALL requested tags |
| cooldown_time | integer | The duration (in seconds) to cooldown a model if it exceeds the allowed failures. |
| disable_cooldowns | boolean | If true, disables cooldowns for all models. [More information here](reliability) |
@@ -909,6 +910,7 @@ router_settings:
| LITELLM_TOKEN | Access token for LiteLLM integration
| LITELLM_USE_CHAT_COMPLETIONS_URL_FOR_ANTHROPIC_MESSAGES | When set to "true", routes OpenAI /v1/messages requests through chat/completions instead of the Responses API for Anthropic models. Can also be set via `litellm_settings.use_chat_completions_url_for_anthropic_messages`
| LITELLM_ROUTE_ALL_CHAT_OPENAI_TO_RESPONSES | When set to "true", routes all OpenAI /chat/completions requests through the Responses API bridge. Recommended for OpenAI models. Can also be set via `litellm_settings.route_all_chat_openai_to_responses`
| LITELLM_USE_LEGACY_INTERACTIONS_SCHEMA | When set to "true", uses the legacy Google Interactions API schema (`outputs` array, `2026-05-07` revision) instead of the new schema (`steps` array, `2026-05-20` revision). The legacy schema will be sunset on June 8, 2026. Can also be set via `litellm_settings.use_legacy_interactions_schema`
| LITELLM_USER_AGENT | Custom user agent string for LiteLLM API requests. Used for partner telemetry attribution
| LITELLM_WORKER_STARTUP_HOOKS | Comma-separated list of `module.path:function_name` callables to run in each worker process during startup. Runs early in the worker lifecycle (before config/DB loading). Useful for re-initializing per-process state like [gflags](https://github.com/google/python-gflags). See [Worker Startup Hooks](/proxy/worker_startup_hooks) for details
| LITELLM_PRINT_STANDARD_LOGGING_PAYLOAD | If true, prints the standard logging payload to the console - useful for debugging
99 changes: 99 additions & 0 deletions docs/routing.md
@@ -1056,6 +1056,105 @@ model_list:
</TabItem>
</Tabs>

### Weighted Failover

By default, when a deployment in a model group fails, the router moves on to the next entry in `fallbacks` (a different model group). With `enable_weighted_failover`, the router first retries **inside the same model group** by re-picking a different deployment using the existing weights, and only escalates to cross-group fallbacks once every deployment in the group has been tried.

This is useful when you have multiple regional copies of the same model (e.g. Azure `eastus2` + `swedencentral`) and want a failed region to fail over to a healthy peer with the same `model_name`, instead of immediately switching to a different model.

**Behavior**

- Only active when `routing_strategy="simple-shuffle"` (the default).
- On a retryable failure, the failing deployment ID is excluded and a new deployment is picked from the remaining peers in the same model group, respecting `weight` / `rpm` / `tpm`.
- Exclusions accumulate across hops: each retry adds the previous failure to the exclusion set, so a deployment that just failed is never picked again in the same request chain.
- Capped by `max_fallbacks` (default `5`).
- Not triggered for `ContextWindowExceededError` or `ContentPolicyViolationError` — those keep their dedicated fallback paths.
- Async-only: honored by `router.acompletion()` and other async entrypoints. The sync `router.completion()` path falls through to regular fallbacks.
- Cooldowns still apply: a deployment that crosses `allowed_fails` is cooled down independently of weighted failover.
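
To make the exclusion and re-weighting behavior concrete, here is a minimal standalone sketch. It is not LiteLLM's implementation: the function name `renormalized_weights`, the flat `id` / `weight` dictionaries, and the third `westeurope` deployment are assumptions made for illustration, and `rpm` / `tpm` are ignored for brevity.

```python
def renormalized_weights(deployments, excluded_ids):
    """Return {deployment_id: pick probability} over the non-excluded peers.

    Hypothetical helper for illustration only; not part of LiteLLM.
    """
    remaining = [d for d in deployments if d["id"] not in excluded_ids]
    total = sum(d["weight"] for d in remaining)
    if total == 0:
        # Nothing left to pick from: every peer is excluded.
        return {}
    return {d["id"]: d["weight"] / total for d in remaining}


deployments = [
    {"id": "eastus2", "weight": 3},
    {"id": "swedencentral", "weight": 1},
    {"id": "westeurope", "weight": 1},
]

# First pick: all three peers are candidates.
print(renormalized_weights(deployments, excluded_ids=set()))

# After eastus2 fails, its weight is dropped and the rest is renormalized.
print(renormalized_weights(deployments, excluded_ids={"eastus2"}))
```

Under these assumptions, the remaining two peers each end up with an equal share once `eastus2` is excluded, because their weights are equal.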

**Order vs. weight**

If the same group also uses `order`, the order filter runs **before** the weighted pick. So weighted failover re-picks only among the deployments in the current minimum-order tier. Promotion to the next order tier happens through the existing order-based fallback path.
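
The interaction between `order` and weighted failover can be sketched as a two-stage filter. This is an illustrative model only, not the router's code: `failover_candidates` is a hypothetical helper, deployments are flat dictionaries, and every deployment is assumed to set `order`.

```python
def failover_candidates(deployments, excluded_ids):
    """Return the deployments a weighted re-pick may choose from.

    Hypothetical helper: applies the order filter first, then drops
    deployments that already failed in this request chain.
    """
    # Stage 1: keep only the current minimum-order tier.
    lowest = min(d["order"] for d in deployments)
    tier = [d for d in deployments if d["order"] == lowest]
    # Stage 2: remove deployments excluded by earlier failures.
    return [d for d in tier if d["id"] not in excluded_ids]


deployments = [
    {"id": "eastus2", "order": 1, "weight": 1},
    {"id": "swedencentral", "order": 1, "weight": 1},
    {"id": "westeurope", "order": 2, "weight": 1},
]

# eastus2 failed: only its order-1 peer is eligible, not the order-2 deployment.
print([d["id"] for d in failover_candidates(deployments, {"eastus2"})])
```

In this model, `westeurope` is never a weighted-failover candidate while an order-1 tier exists; reaching it is left to the order-based fallback path described above.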

**Config**

<Tabs>
<TabItem value="sdk" label="SDK">

```python
import asyncio
import os

from litellm import Router

model_list = [
    {
        "model_name": "gpt-4.1-mini",
        "litellm_params": {
            "model": "azure/gpt-4.1-mini",
            "api_base": "https://eastus2.example.azure.com",
            "api_key": os.getenv("AZURE_EASTUS2_KEY"),
            "weight": 1,
        },
    },
    {
        "model_name": "gpt-4.1-mini",
        "litellm_params": {
            "model": "azure/gpt-4.1-mini",
            "api_base": "https://swedencentral.example.azure.com",
            "api_key": os.getenv("AZURE_SWEDEN_KEY"),
            "weight": 1,
        },
    },
]

router = Router(
    model_list=model_list,
    routing_strategy="simple-shuffle",
    enable_weighted_failover=True,  # 👈 retry within the same model group on failure
)


async def main():
    # Weighted failover is async-only, so use the async entrypoint.
    response = await router.acompletion(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": "Hey"}],
    )
    print(response)


asyncio.run(main())
```

</TabItem>
<TabItem value="proxy" label="PROXY">

```yaml
model_list:
  - model_name: gpt-4.1-mini
    litellm_params:
      model: azure/gpt-4.1-mini
      api_base: https://eastus2.example.azure.com
      api_key: os.environ/AZURE_EASTUS2_KEY
      weight: 1
  - model_name: gpt-4.1-mini
    litellm_params:
      model: azure/gpt-4.1-mini
      api_base: https://swedencentral.example.azure.com
      api_key: os.environ/AZURE_SWEDEN_KEY
      weight: 1

router_settings:
  routing_strategy: simple-shuffle
  enable_weighted_failover: true # 👈 retry within the same model group on failure
```

</TabItem>
</Tabs>

**Walkthrough**

With the config above and a request to `gpt-4.1-mini`:

1. `simple-shuffle` picks one of the two deployments using `weight`.
2. If the picked deployment raises a provider error (e.g. `RateLimitError`, `InternalServerError`), its deployment ID is added to `metadata._failover_excluded_ids`.
3. The router re-enters `simple-shuffle` with the failed deployment excluded and weights renormalized over what's left.
4. Steps 2–3 repeat until a deployment succeeds, every peer has been excluded, or `max_fallbacks` is reached.
5. Only after all peers are exhausted does the router fall through to any `fallbacks` configured for the group.
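
The steps above can be simulated without the router. The sketch below is a simplified stand-in, not LiteLLM internals: `RetryableError`, `call_with_weighted_failover`, and `fake_call` are hypothetical names, and it assumes `max_fallbacks` caps the number of re-picks after the initial attempt.

```python
import random


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable provider error."""


def call_with_weighted_failover(deployments, call, max_fallbacks=5, rng=None):
    """Try deployments in one model group, excluding each one that fails."""
    rng = rng or random.Random()
    excluded = set()  # plays the role of the accumulated exclusion set
    attempts = []
    # Assumption: one initial attempt plus up to max_fallbacks re-picks.
    for _ in range(max_fallbacks + 1):
        remaining = [d for d in deployments if d["id"] not in excluded]
        if not remaining:
            break  # every peer has been excluded
        weights = [d["weight"] for d in remaining]
        picked = rng.choices(remaining, weights=weights, k=1)[0]
        attempts.append(picked["id"])
        try:
            return call(picked), attempts
        except RetryableError:
            excluded.add(picked["id"])
    # At this point a real router would move on to cross-group fallbacks.
    raise RuntimeError(f"same-group failover exhausted after {attempts}")


deployments = [
    {"id": "eastus2", "weight": 1},
    {"id": "swedencentral", "weight": 1},
]


def fake_call(deployment):
    # Simulate an outage in one region.
    if deployment["id"] == "eastus2":
        raise RetryableError("simulated regional outage")
    return f"ok from {deployment['id']}"


result, attempts = call_with_weighted_failover(
    deployments, fake_call, rng=random.Random(0)
)
print(result)
print(attempts)
```

Whichever deployment is picked first, the request ends on `swedencentral`, and no deployment ID appears twice in `attempts`.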

See [`enable_weighted_failover`](./proxy/config_settings#router_settings---reference) in the router settings reference for the flag.

### Max Parallel Requests (ASYNC)

Sets the size of the semaphore the router uses for async requests, which limits the maximum number of concurrent calls made to a deployment. Useful in high-traffic scenarios.
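
As a rough illustration of the semaphore idea, the following sketch caps concurrency with `asyncio.Semaphore`. It does not use the router: `run_with_limit` and `fake_call` are hypothetical names, and the sleep stands in for a provider call.

```python
import asyncio


async def run_with_limit(n_calls, max_parallel_requests):
    """Run n_calls fake requests, never more than the limit at once."""
    semaphore = asyncio.Semaphore(max_parallel_requests)
    in_flight = 0
    peak = 0

    async def fake_call(i):
        nonlocal in_flight, peak
        async with semaphore:  # waits here once the limit is reached
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the provider call
            in_flight -= 1
            return i

    results = await asyncio.gather(*(fake_call(i) for i in range(n_calls)))
    return results, peak


results, peak = asyncio.run(run_with_limit(n_calls=10, max_parallel_requests=3))
print(peak)  # highest number of calls observed in flight at once
```

All ten calls complete, but the observed peak never exceeds the configured limit.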