### Describe the problem

When a LiveKit server degrades under load in a large room (650+ participants), the `DefaultReconnectPolicy` causes all clients to reconnect simultaneously — a thundering herd that collapses the room instead of allowing recovery.
The root cause is in `DefaultReconnectPolicy.ts`:
```ts
const DEFAULT_RETRY_DELAYS_IN_MS = [
  0,    // attempt 0: immediate — zero delay
  300,  // attempt 1: 300ms — zero jitter
  1200, // attempt 2+: jitter added, but only Math.random() * 1_000
  2700,
  4800,
  7000,
  7000,
  7000,
  7000,
  7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  if (context.retryCount <= 1) return retryDelay; // NO jitter on first 2 attempts
  return retryDelay + Math.random() * 1_000; // only 0-1000ms jitter after that
}
```
The problems:
- Attempt 0 has 0ms delay — all clients reconnect instantly and simultaneously
- Attempt 1 has 300ms delay with no jitter — all clients retry again at exactly the same time
- Jitter is only 0-1000ms — for 650 clients, that's ~0.65 clients per millisecond hitting the server, which is not meaningful spread
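To put numbers on that last point, here is a quick standalone sketch (a hypothetical simulation, not SDK code) of how 650 clients distribute under the current attempt-2 jitter:

```typescript
// Hypothetical simulation of the current policy's attempt-2 spread:
// base 1200ms plus only Math.random() * 1_000 of jitter.
const CLIENTS = 650;

function attempt2Delay(): number {
  return 1200 + Math.random() * 1_000; // current policy, attempt 2
}

const delays = Array.from({ length: CLIENTS }, attempt2Delay);

// All 650 reconnects land inside a single 1-second window.
const windowMs = Math.max(...delays) - Math.min(...delays);
const clientsPerMs = CLIENTS / 1_000; // 0.65 clients every millisecond on average

console.log(`spread window: ~${Math.round(windowMs)} ms`);
console.log(`average load: ${clientsPerMs} clients/ms`);
```

At that density the "jitter" is cosmetic: the server still absorbs the whole room's reconnect load in about one second.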
We run self-hosted LiveKit (v1.9.12) with rooms of ~650 participants. We've had 5 production outages in 6 weeks where this pattern plays out:
1. All ~650 clients reconnect at 0ms (attempt 0) — 650 connections/second
2. Server buckles under the reconnection load, disconnecting more participants
3. Attempt 1 fires at 300ms — another 650 simultaneous reconnects
4. Positive feedback loop → room collapses from 658 → 6 participants in ~30 seconds
5. Participants that get pushed to the other node via Redis trigger the same cascade there
The reconnection storm is the amplifier that turns a degraded-but-recoverable state into a total collapse.
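The amplification effect can be illustrated with a toy model. The capacity number below is an illustrative assumption, not a measured LiveKit limit; the point is the difference between everyone arriving at once and arriving jittered:

```typescript
// Toy model: the server absorbs at most `capacityPerTick` reconnects per
// 100ms tick; everything beyond that is rejected (and would retry later,
// feeding the cascade). Capacity of 30/tick is an assumed illustrative value.
function survivors(delaysMs: number[], capacityPerTick: number): number {
  const buckets = new Map<number, number>();
  for (const d of delaysMs) {
    const tick = Math.floor(d / 100);
    buckets.set(tick, (buckets.get(tick) ?? 0) + 1);
  }
  let ok = 0;
  for (const count of buckets.values()) ok += Math.min(count, capacityPerTick);
  return ok;
}

const herd = Array(650).fill(0); // current policy, attempt 0: everyone at 0ms
const jittered = Array.from({ length: 650 }, () => 1000 + Math.random() * 2000);

console.log(survivors(herd, 30));     // only one tick's worth (30) get through
console.log(survivors(jittered, 30)); // spread over 2s, most clients get through
```

Same client count, same server capacity; the only variable is whether the arrivals are spread out.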
### Describe the proposed solution
Add meaningful jitter starting from the very first reconnect attempt. When a server is struggling, spreading reconnections over 5-15 seconds instead of 0ms gives it time to recover.
Suggested change to `DefaultReconnectPolicy`:
```ts
const DEFAULT_RETRY_DELAYS_IN_MS = [
  2000, // attempt 0: 2s base (was 0)
  3000, // attempt 1: 3s base (was 300)
  5000, // attempt 2: 5s base (was 1200)
  7000, // attempt 3+: same as current
  7000,
  7000,
  7000,
  7000,
  7000,
  7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  // Jitter on ALL attempts, proportional to delay (±50%)
  const jitter = retryDelay * (Math.random() - 0.5);
  return Math.max(0, Math.round(retryDelay + jitter));
}
```
This would spread 650 clients' first reconnect attempt over a 1-3 second window (~220-650 clients/sec) instead of all at 0ms, and subsequent retries over proportionally wider windows.
The exact values are less important than the principles:
- Every attempt should have jitter, including the first
- Jitter should scale with the delay, not be a fixed 0-1000ms
- The first attempt should not be 0ms — even 1-3 seconds of spread prevents the thundering herd
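As a sanity check on the claimed window, a standalone sketch of the proposed attempt-0 delay (base 2000ms, ±50% proportional jitter) drawn for 650 clients:

```typescript
// Proposed attempt-0 delay in isolation: 2000ms base with ±50% jitter,
// so every delay lands in [1000, 3000] ms.
function proposedAttempt0Delay(): number {
  const base = 2000;
  const jitter = base * (Math.random() - 0.5); // proportional, per the suggested policy
  return Math.max(0, Math.round(base + jitter));
}

const firstAttempts = Array.from({ length: 650 }, proposedAttempt0Delay);
const spreadMs = Math.max(...firstAttempts) - Math.min(...firstAttempts);

// ~650 clients over a ~2s window: roughly 325 reconnects/second on average,
// instead of 650 in the same instant.
console.log(`attempt-0 spread: ~${spreadMs} ms`);
```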
### Alternatives considered
**1. Custom `reconnectPolicy` in application code (our current workaround)**
We can pass a custom `reconnectPolicy` in `RoomOptions` to add jitter ourselves. This works, but:
- The unsafe defaults still affect every other LiveKit deployment
- Most self-hosters won't know they need this until they have their first large-room outage
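For reference, the workaround looks roughly like the sketch below. The `ReconnectContext` shape is reproduced locally as an assumption so the sketch is self-contained; in real application code the interface comes from livekit-client:

```typescript
// Assumed minimal shape of livekit-client's ReconnectContext
// (only retryCount is used here).
interface ReconnectContext {
  retryCount: number;
  elapsedMs: number;
}

// Custom policy: same base delays as the proposal above, with ±50%
// proportional jitter on every attempt, including the first.
class JitteredReconnectPolicy {
  private delays = [2000, 3000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000];

  nextRetryDelayInMs(context: ReconnectContext): number | null {
    if (context.retryCount >= this.delays.length) return null; // give up
    const base = this.delays[context.retryCount];
    return Math.max(0, Math.round(base * (0.5 + Math.random())));
  }
}

// In application code this would be wired up roughly as:
//   const room = new Room({ reconnectPolicy: new JitteredReconnectPolicy() });
const policy = new JitteredReconnectPolicy();
console.log(policy.nextRetryDelayInMs({ retryCount: 0, elapsedMs: 0 }));
```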
**2. Server-side reconnection backoff (livekit/livekit)**

The server could stagger disconnect signals or add backoff to `could not restart participant` rejections. This would help but doesn't address the client-side thundering herd from network-level disconnects (where the server isn't choosing to disconnect clients).
**3. Larger fixed jitter (e.g., `Math.random() * 10_000`)**

Simpler but less elegant — a fixed 0-10s jitter would work for large rooms but adds unnecessary latency for small rooms or transient network blips where instant reconnection is appropriate.
### Importance
serious, but I can work around it
### Additional Information
SDK version: livekit-client 2.17.3
Server version: LiveKit 1.9.12 (self-hosted, bare metal, 2x 128-core nodes)