From 53d25787cd1aa638f0a13d617029b31d5575d3c7 Mon Sep 17 00:00:00 2001
From: Sameer Kankute <sameer@berri.ai>
Date: Fri, 15 May 2026 12:00:46 +0530
Subject: [PATCH 1/3] docs(proxy): document enable_weighted_failover router
 setting

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 docs/proxy/config_settings.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/proxy/config_settings.md b/docs/proxy/config_settings.md
index 901c330a..7dfea79c 100644
--- a/docs/proxy/config_settings.md
+++ b/docs/proxy/config_settings.md
@@ -378,6 +378,7 @@ router_settings:
 | content_policy_fallbacks | array of objects | Specifies fallback models for content policy violations. [More information here](reliability) |
 | fallbacks | array of objects | Specifies fallback models for all types of errors. [More information here](reliability) |
 | enable_tag_filtering | boolean | If true, uses tag based routing for requests [Tag Based Routing](tag_routing) |
+| enable_weighted_failover | boolean | If true and `routing_strategy` is `simple-shuffle`, a retryable failure on one deployment triggers a weighted re-pick among the other deployments in the same model group before any cross-group fallbacks are tried. Default: false. |
 | tag_filtering_match_any | boolean | Tag matching behavior (only when enable_tag_filtering=true). `true`: match if deployment has ANY requested tag; `false`: match only if deployment has ALL requested tags |
 | cooldown_time | integer | The duration (in seconds) to cooldown a model if it exceeds the allowed failures. |
 | disable_cooldowns | boolean | If true, disables cooldowns for all models. [More information here](reliability) |

From de98ad68340a279029e794d28f7985bfacdcc9fc Mon Sep 17 00:00:00 2001
From: Sameer Kankute <sameer@berri.ai>
Date: Fri, 15 May 2026 16:49:04 +0530
Subject: [PATCH 2/3] docs(routing): document weighted failover (intra-group
 retry)

Explains enable_weighted_failover behavior, its scope (simple-shuffle only,
async-only, skipped for context-window/content-policy errors), and its
interaction with `order`. Adds a worked walkthrough and SDK + proxy config
examples covering the common multi-region Azure setup.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 docs/routing.md | 99 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 99 insertions(+)

diff --git a/docs/routing.md b/docs/routing.md
index 0a42f792..5269dbe3 100644
--- a/docs/routing.md
+++ b/docs/routing.md
@@ -1056,6 +1056,105 @@ model_list:
 </TabItem>
 </Tabs>
 
+### Weighted Failover
+
+By default, when a deployment in a model group fails, the router moves on to the next entry in `fallbacks` (a different model group). With `enable_weighted_failover`, the router first retries **inside the same model group** by re-picking a different deployment using the existing weights, and only escalates to cross-group fallbacks once every deployment in the group has been tried.
+
+This is useful when you have multiple regional copies of the same model (e.g. Azure `eastus2` + `swedencentral`) and want a failed region to fail over to a healthy peer with the same `model_name`, instead of immediately switching to a different model.
+
+**Behavior**
+
+- Only active when `routing_strategy="simple-shuffle"` (the default).
+- On a retryable failure, the failing deployment ID is excluded and a new deployment is picked from the remaining peers in the same model group, respecting `weight` / `rpm` / `tpm`.
+- Exclusions accumulate across hops: each retry adds the previous failure to the exclusion set, so a deployment that just failed is never picked again in the same request chain.
+- Capped by `max_fallbacks` (default `5`).
+- Not triggered for `ContextWindowExceededError` or `ContentPolicyViolationError` — those keep their dedicated fallback paths.
+- Async-only: honored by `router.acompletion()` and other async entrypoints. The sync `router.completion()` path falls through to regular fallbacks.
+- Cooldowns still apply: a deployment that crosses `allowed_fails` is cooled down independently of weighted failover.
+
+**Order vs. weight**
+
+If the same group also uses `order`, the order filter runs **before** the weighted pick, so weighted failover re-picks only among the deployments in the current minimum-order tier. Promotion to the next order tier happens through the existing order-based fallback path.
+
+**Config**
+
+<Tabs>
+<TabItem value="sdk" label="SDK">
+
+```python
+import os  # needed for the os.getenv() calls below
+from litellm import Router
+model_list = [
+    {
+        "model_name": "gpt-4.1-mini",
+        "litellm_params": {
+            "model": "azure/gpt-4.1-mini",
+            "api_base": "https://eastus2.example.azure.com",
+            "api_key": os.getenv("AZURE_EASTUS2_KEY"),
+            "weight": 1,
+        },
+    },
+    {
+        "model_name": "gpt-4.1-mini",
+        "litellm_params": {
+            "model": "azure/gpt-4.1-mini",
+            "api_base": "https://swedencentral.example.azure.com",
+            "api_key": os.getenv("AZURE_SWEDEN_KEY"),
+            "weight": 1,
+        },
+    },
+]
+
+router = Router(
+    model_list=model_list,
+    routing_strategy="simple-shuffle",
+    enable_weighted_failover=True,  # 👈 retry within the same model group on failure
+)
+# acompletion is async: await it from inside an async function (or a notebook cell).
+response = await router.acompletion(
+    model="gpt-4.1-mini",
+    messages=[{"role": "user", "content": "Hey"}],
+)
+```
+
+</TabItem>
+<TabItem value="proxy" label="PROXY">
+
+```yaml
+model_list:
+  - model_name: gpt-4.1-mini
+    litellm_params:
+      model: azure/gpt-4.1-mini
+      api_base: https://eastus2.example.azure.com
+      api_key: os.environ/AZURE_EASTUS2_KEY
+      weight: 1
+  - model_name: gpt-4.1-mini
+    litellm_params:
+      model: azure/gpt-4.1-mini
+      api_base: https://swedencentral.example.azure.com
+      api_key: os.environ/AZURE_SWEDEN_KEY
+      weight: 1
+
+router_settings:
+  routing_strategy: simple-shuffle
+  enable_weighted_failover: true  # 👈 retry within the same model group on failure
+```
+
+</TabItem>
+</Tabs>
+
+**Walkthrough**
+
+With the config above and a request to `gpt-4.1-mini`:
+
+1. `simple-shuffle` picks one of the two deployments using `weight`.
+2. If the picked deployment raises a provider error (e.g. `RateLimitError`, `InternalServerError`), its deployment ID is added to `metadata._failover_excluded_ids`.
+3. The router re-enters `simple-shuffle` with the failed deployment excluded and weights renormalized over what's left.
+4. Steps 2–3 repeat until a deployment succeeds, every peer has been excluded, or `max_fallbacks` is reached.
+5. Only after all peers are exhausted does the router fall through to any `fallbacks` configured for the group.
+
+The flag is also listed in the router settings reference: [`enable_weighted_failover`](./proxy/config_settings#router_settings---reference).
+
 ### Max Parallel Requests (ASYNC)
 
 Used in semaphore for async requests on router. Limit the max concurrent calls made to a deployment. Useful in high-traffic scenarios. 

From 572dd403bb6f49158878424f49e879f06020421a Mon Sep 17 00:00:00 2001
From: Sameer Kankute <sameer@berri.ai>
Date: Mon, 18 May 2026 14:03:40 +0530
Subject: [PATCH 3/3] docs(interactions): document
 LITELLM_USE_LEGACY_INTERACTIONS_SCHEMA env var

Add documentation for the new env var that controls the Google Interactions
API schema version (new steps-based vs legacy outputs-based) to the
environment variables reference table.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 docs/proxy/config_settings.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/proxy/config_settings.md b/docs/proxy/config_settings.md
index 7dfea79c..11f9fb36 100644
--- a/docs/proxy/config_settings.md
+++ b/docs/proxy/config_settings.md
@@ -910,6 +910,7 @@ router_settings:
 | LITELLM_TOKEN | Access token for LiteLLM integration
 | LITELLM_USE_CHAT_COMPLETIONS_URL_FOR_ANTHROPIC_MESSAGES | When set to "true", routes OpenAI /v1/messages requests through chat/completions instead of the Responses API for Anthropic models. Can also be set via `litellm_settings.use_chat_completions_url_for_anthropic_messages`
 | LITELLM_ROUTE_ALL_CHAT_OPENAI_TO_RESPONSES | When set to "true", routes all OpenAI /chat/completions requests through the Responses API bridge. Recommended for OpenAI models. Can also be set via `litellm_settings.route_all_chat_openai_to_responses`
+| LITELLM_USE_LEGACY_INTERACTIONS_SCHEMA | When set to "true", uses the legacy Google Interactions API schema (`outputs` array, `2026-05-07` revision) instead of the new schema (`steps` array, `2026-05-20` revision). The legacy schema will be sunset on June 8, 2026. Can also be set via `litellm_settings.use_legacy_interactions_schema`
 | LITELLM_USER_AGENT | Custom user agent string for LiteLLM API requests. Used for partner telemetry attribution
 | LITELLM_WORKER_STARTUP_HOOKS | Comma-separated list of `module.path:function_name` callables to run in each worker process during startup. Runs early in the worker lifecycle (before config/DB loading). Useful for re-initializing per-process state like [gflags](https://github.com/google/python-gflags). See [Worker Startup Hooks](/proxy/worker_startup_hooks) for details
 | LITELLM_PRINT_STANDARD_LOGGING_PAYLOAD | If true, prints the standard logging payload to the console - useful for debugging