DefaultReconnectPolicy causes thundering herd in large rooms (650+ participants) #1852

@theramjad

Description

Describe the problem

When a LiveKit server degrades under load in a large room (650+ participants), the DefaultReconnectPolicy causes all clients to reconnect simultaneously — a thundering herd that collapses the room instead of allowing recovery.

The root cause is in DefaultReconnectPolicy.ts:

const DEFAULT_RETRY_DELAYS_IN_MS = [
  0,     // attempt 0: immediate — zero delay
  300,   // attempt 1: 300ms — zero jitter
  1200,  // attempt 2+: jitter added, but only Math.random() * 1_000
  2700, 4800, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  if (context.retryCount <= 1) return retryDelay; // NO jitter on first 2 attempts
  return retryDelay + Math.random() * 1_000;       // only 0-1000ms jitter after that
}

The problems:

  1. Attempt 0 has 0ms delay — all clients reconnect instantly and simultaneously
  2. Attempt 1 has 300ms delay with no jitter — all clients retry again at exactly the same time
  3. Jitter is only 0-1000ms — for 650 clients, that's ~0.65 clients per millisecond hitting the server, which is not meaningful spread
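To make the clustering concrete, here is a standalone sketch (mirroring the policy's delay math above, not the SDK source) that replays the delay calculation for 650 clients and measures how tightly each attempt clusters:

```typescript
// Replay the current policy's delay math for 650 clients.
const DELAYS_MS = [0, 300, 1200, 2700, 4800, 7000, 7000, 7000, 7000, 7000];

function currentDelayMs(retryCount: number): number {
  const retryDelay = DELAYS_MS[retryCount];
  if (retryCount <= 1) return retryDelay; // no jitter on attempts 0 and 1
  return retryDelay + Math.random() * 1_000; // fixed 0-1000ms jitter afterwards
}

const CLIENTS = 650;
for (const attempt of [0, 1, 2]) {
  const delays = Array.from({ length: CLIENTS }, () => currentDelayMs(attempt));
  const spreadMs = Math.max(...delays) - Math.min(...delays);
  // Attempts 0 and 1: spread is 0ms -- every client fires at the same instant.
  // Attempt 2: spread is under 1000ms, i.e. more than 0.65 clients/ms.
  console.log(`attempt ${attempt}: spread ${spreadMs.toFixed(0)}ms`);
}
```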

We run self-hosted LiveKit (v1.9.12) with rooms of ~650 participants. We've had 5 production outages in 6 weeks where this pattern plays out:

  1. Server hits internal pressure (e.g., SDP renegotiation backlog, or the subscription binding timeout described in livekit#4112, "reconnecting when others pull track in the room")
  2. Server starts disconnecting participants
  3. All ~650 clients reconnect at 0ms (attempt 0): ~650 simultaneous connection attempts
  4. Server buckles under the reconnection load, disconnecting more participants
  5. Attempt 1 fires at 300ms — another 650 simultaneous reconnects
  6. Positive feedback loop → room collapses from 658 → 6 participants in ~30 seconds
  7. Participants that get pushed to the other node via Redis trigger the same cascade there

The reconnection storm is the amplifier that turns a degraded-but-recoverable state into a total collapse.

Describe the proposed solution

Add meaningful jitter starting from the very first reconnect attempt. When a server is struggling, spreading reconnections over 5-15 seconds instead of 0ms gives it time to recover.

Suggested change to DefaultReconnectPolicy:

const DEFAULT_RETRY_DELAYS_IN_MS = [
  2000,   // attempt 0: 2s base (was 0)
  3000,   // attempt 1: 3s base (was 300)
  5000,   // attempt 2: 5s base (was 1200)
  7000,   // attempt 3+: same as current
  7000, 7000, 7000, 7000, 7000, 7000,
];

nextRetryDelayInMs(context) {
  if (context.retryCount >= this._retryDelays.length) return null;
  const retryDelay = this._retryDelays[context.retryCount];
  // Jitter on ALL attempts, proportional to delay (±50%)
  const jitter = retryDelay * (Math.random() - 0.5);
  return Math.max(0, Math.round(retryDelay + jitter));
}

This would spread 650 clients' first reconnect attempt over a 1-3 second window (~325 clients/sec on average, instead of all 650 at 0ms), and subsequent retries over proportionally wider windows.

The exact values are less important than the principles:

  • Every attempt should have jitter, including the first
  • Jitter should scale with the delay, not be a fixed 0-1000ms
  • The first attempt should not be 0ms — even 1-3 seconds of spread prevents the thundering herd
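As a sanity check, the windows the suggested change produces can be computed directly. This is a standalone sketch using the base delays from the proposed code above:

```typescript
// Delay windows under the proposed base-delay ±50% jitter.
const PROPOSED_MS = [2000, 3000, 5000, 7000];

function proposedDelayMs(retryCount: number): number {
  const base = PROPOSED_MS[Math.min(retryCount, PROPOSED_MS.length - 1)];
  const jitter = base * (Math.random() - 0.5); // uniform in [-base/2, +base/2]
  return Math.max(0, Math.round(base + jitter));
}

// Attempt 0 lands uniformly in [1000, 3000]ms: 650 clients spread over a
// 2-second window instead of a single instant. Later attempts widen further.
for (const attempt of [0, 1, 2, 3]) {
  const base = PROPOSED_MS[attempt];
  console.log(`attempt ${attempt}: window [${base / 2}, ${base * 1.5}]ms`);
}
```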

Alternatives considered

1. Custom reconnectPolicy in application code (our current workaround)

We can pass a custom reconnectPolicy in RoomOptions to add jitter ourselves. This works, but:

  • The unsafe defaults still affect every other LiveKit deployment
  • Most self-hosters won't know they need this until they have their first large-room outage
  • The SDK should be safe by default at scale
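For anyone hitting this before a fix lands, here is a sketch of the workaround. The interface shape below mirrors livekit-client's ReconnectPolicy (nextRetryDelayInMs returning a delay, or null to give up); verify it against your SDK version before relying on it:

```typescript
// Minimal shape of the SDK's reconnect context (assumed; check your SDK version).
interface ReconnectContext {
  retryCount: number;
}

// A drop-in policy with ±50% jitter on every attempt, including the first.
class JitteredReconnectPolicy {
  private readonly delays = [2000, 3000, 5000, 7000, 7000, 7000, 7000, 7000, 7000, 7000];

  nextRetryDelayInMs(context: ReconnectContext): number | null {
    if (context.retryCount >= this.delays.length) return null; // stop retrying
    const base = this.delays[context.retryCount];
    return Math.max(0, Math.round(base + base * (Math.random() - 0.5)));
  }
}

// Usage (assuming livekit-client's RoomOptions accepts a reconnectPolicy):
// const room = new Room({ reconnectPolicy: new JitteredReconnectPolicy() });
```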

2. Server-side reconnection backoff (livekit/livekit)

The server could stagger disconnect signals or add backoff to "could not restart participant" rejections. This would help, but it doesn't address the client-side thundering herd after network-level disconnects, where the server isn't the one choosing to disconnect clients.

3. Larger fixed jitter (e.g., Math.random() * 10_000)

Simpler but less elegant — a fixed 0-10s jitter would work for large rooms but adds unnecessary latency for small rooms or transient network blips where instant reconnection is appropriate.

Importance

serious, but I can work around it
