
Add configurable affinity_mode for egress pod selection#1209

Open
abhisheksingh-R41 wants to merge 2 commits into
livekit:mainfrom
recruit41:upstream-affinity-mode

@abhisheksingh-R41 abhisheksingh-R41 commented May 6, 2026

Problem

The current StartEgressAffinity scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this causes a staircase distribution: as soon as a pod accepts its first job it scores 1.0 and wins all subsequent jobs via the short-circuit, until its CPU budget is exhausted. The next pod then starts filling, and so on.

The existing source comment acknowledges this is intentional:

"if this instance is idle and another is already handling some, the request will go to that server. This avoids having many instances with one track request each, taking availability from room composite."

This is the right behaviour for mixed fleets (Track + RoomComposite). However for a TrackEgress-only fleet the packing strategy provides no benefit and actively hurts KEDA/HPA scale-out — newly provisioned pods start idle at 0.5 and always lose to already-busy pods until those pods saturate.
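The staircase can be reproduced with a small simulation. This is a simplified model, not psrpc's actual selection code: the `pod` type, the job-count capacity, and the lowest-index tie-break are assumptions made for illustration.

```go
package main

import "fmt"

// pod is a hypothetical stand-in for an egress instance.
// capacity models its CPU budget as a maximum number of jobs.
type pod struct {
	jobs, capacity int
}

// packScore mirrors the scoring described above: idle pods bid 0.5,
// busy pods bid 1.0. A full pod bids 0 to model it declining the
// request (an assumption of this sketch).
func packScore(p pod) float32 {
	switch {
	case p.jobs >= p.capacity:
		return 0
	case p.jobs == 0:
		return 0.5
	default:
		return 1.0
	}
}

// assign places n jobs one at a time on the highest-scoring pod.
// Ties go to the lowest index, standing in for "first replier wins".
func assign(pods []pod, n int, score func(pod) float32) []int {
	for i := 0; i < n; i++ {
		best, bestScore := -1, float32(0)
		for idx, p := range pods {
			if s := score(p); s > bestScore {
				best, bestScore = idx, s
			}
		}
		if best < 0 {
			break // every pod is full
		}
		pods[best].jobs++
	}
	counts := make([]int, len(pods))
	for i, p := range pods {
		counts[i] = p.jobs
	}
	return counts
}

func main() {
	pods := []pod{{capacity: 4}, {capacity: 4}, {capacity: 4}}
	fmt.Println(assign(pods, 6, packScore)) // prints [4 2 0]
}
```

With three pods of capacity 4 and six jobs, the first pod fills completely before the second receives anything, and the third stays idle.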

Solution

Add a configurable affinity_mode field to ServiceConfig. The default (pack) preserves existing behaviour exactly — zero change for current deployments.

| Mode | Affinity scoring | Best for |
|---|---|---|
| pack (default) | idle=0.5, busy=1.0 | Mixed fleet (Track + RoomComposite) |
| spread | CPU-proportional; idle=1.0 → wins immediately | Single-type fleet (TrackEgress only) |
| type_aware | RoomComposite/Web prefer idle pods; Track/Participant spread by CPU load | Mixed fleet with smarter routing |
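A minimal sketch of how the three modes could map to a score. The function name `affinity`, its parameters, and the constant names are hypothetical; the actual switch in pkg/server/server_rpc.go may be structured differently. The 1.0 idle / 0.5 busy values for heavy requests under type_aware come from the commit message.

```go
package main

import "fmt"

// Mode values match the table above; the identifiers are hypothetical.
const (
	modePack      = "pack"
	modeSpread    = "spread"
	modeTypeAware = "type_aware"
)

// affinity returns the bid a pod would make for a request.
// heavy marks RoomComposite/Web requests, idle marks a pod with no
// running jobs, and availableCPU is the pod's free CPU fraction in [0, 1].
func affinity(mode string, heavy, idle bool, availableCPU float32) float32 {
	switch mode {
	case modeSpread:
		return availableCPU
	case modeTypeAware:
		if !heavy {
			return availableCPU // Track/Participant: spread by CPU load
		}
		if idle {
			return 1.0 // heavy requests prefer idle pods
		}
		return 0.5
	default: // "pack", the empty string, and anything unrecognised
		if idle {
			return 0.5
		}
		return 1.0
	}
}

func main() {
	fmt.Println(affinity(modePack, false, true, 1.0))      // idle pod under pack
	fmt.Println(affinity(modeSpread, false, false, 0.6))   // busy pod under spread
	fmt.Println(affinity(modeTypeAware, true, false, 0.6)) // busy pod, heavy request
}
```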

How spread works

  • Idle pods return AvailableCPUFraction() == 1.0 → hits MaximumAffinity=1 and wins immediately (same speed as current busy-pod short-circuit).
  • Busy pods return their remaining CPU fraction (e.g. 0.6) → client waits ShortCircuitTimeout=500ms and picks the least-loaded pod.
  • Result: jobs distribute evenly across all pods rather than packing sequentially.
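The bullets above can be sketched as follows. The `monitor` struct and its fields are hypothetical stand-ins for the real type in pkg/stats/monitor.go, which wraps getCPUUsageLocked; only the idle=1.0 and available/total behaviour is taken from this description.

```go
package main

import (
	"fmt"
	"sync"
)

// monitor is a hypothetical, trimmed-down stand-in for the stats monitor.
type monitor struct {
	mu       sync.Mutex
	totalCPU float64 // cores available to the pod
	usedCPU  float64 // cores currently in use
	requests int     // number of running egress jobs
}

// AvailableCPUFraction returns 1.0 for an idle pod and
// available/total for a busy one, clamped to [0, 1].
func (m *monitor) AvailableCPUFraction() float32 {
	m.mu.Lock()
	defer m.mu.Unlock()

	if m.requests == 0 {
		return 1.0
	}
	if m.totalCPU <= 0 {
		return 0
	}
	frac := (m.totalCPU - m.usedCPU) / m.totalCPU
	if frac < 0 {
		frac = 0
	}
	if frac > 1 {
		frac = 1
	}
	return float32(frac)
}

func main() {
	idle := &monitor{totalCPU: 4}
	busy := &monitor{totalCPU: 4, usedCPU: 2, requests: 1}
	fmt.Println(idle.AvailableCPUFraction(), busy.AvailableCPUFraction())
}
```

Returning exactly 1.0 for idle pods, rather than a measured value, is what lets them hit the MaximumAffinity short-circuit even when background CPU noise is present.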

Changes

  • pkg/config/service.go: adds AffinityMode string `yaml:"affinity_mode"` to ServiceConfig
  • pkg/stats/monitor.go — adds AvailableCPUFraction() float32 (wraps existing getCPUUsageLocked; idle returns 1.0, busy returns available/total)
  • pkg/server/server_rpc.go — replaces StartEgressAffinity with mode switch; adds isHeavyEgressRequest helper
  • pkg/server/server_rpc_test.go — table-driven unit tests for isHeavyEgressRequest covering all 5 request types

Backwards compatibility

  • affinity_mode defaults to empty string which falls through to default: case — identical to current pack behaviour.
  • No changes to existing config parsing, prometheus metrics, or admission logic.
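A sketch of the fallback described above, assuming a trimmed-down ServiceConfig that holds only the new field. The `effectiveMode` helper is hypothetical and exists only to make the default-case behaviour explicit.

```go
package main

import "fmt"

// ServiceConfig here is a hypothetical stand-in containing only the new field;
// the real struct in pkg/config/service.go has many more.
type ServiceConfig struct {
	AffinityMode string `yaml:"affinity_mode"`
}

// effectiveMode is a hypothetical helper: any value other than the two
// new modes resolves to "pack", mirroring a switch with a default case.
func (c ServiceConfig) effectiveMode() string {
	switch c.AffinityMode {
	case "spread", "type_aware":
		return c.AffinityMode
	default:
		return "pack"
	}
}

func main() {
	configs := []ServiceConfig{{}, {AffinityMode: "spread"}, {AffinityMode: "type_aware"}}
	for _, c := range configs {
		fmt.Printf("%q -> %s\n", c.AffinityMode, c.effectiveMode())
	}
}
```

One consequence of routing everything through the default case is that a misspelled mode silently behaves as pack, so a deployment that wants spread should confirm the value it set.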

The current StartEgressAffinity always scores idle pods at 0.5 and busy
pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this
means the first pod to accept any job wins all subsequent jobs until its
CPU budget is exhausted — a staircase pattern rather than even spread.

The existing code comment acknowledges this is intentional for mixed
fleets ("avoids having many instances with one track request each, taking
availability from room composite"). However for a TrackEgress-only fleet
the packing strategy provides no benefit and causes sequential scale-out
delays.

This commit adds a configurable affinity_mode field to ServiceConfig:

  pack (default)  — current behaviour, unchanged
  spread          — CPU-proportional scoring; idle pods score 1.0 and
                    win immediately via MaximumAffinity short-circuit,
                    busy pods score proportionally so the least-loaded
                    pod wins after ShortCircuitTimeout. Best for
                    single-type (TrackEgress-only) fleets.
  type_aware      — RoomComposite/Web requests prefer idle pods (1.0
                    idle / 0.5 busy); Track/Participant requests spread
                    by CPU load. Best of both worlds for mixed fleets.

Default is "pack" so all existing deployments are unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@abhisheksingh-R41 abhisheksingh-R41 requested a review from a team as a code owner May 6, 2026 10:29
@abhisheksingh-R41
Author

@frostbyte73 @milos-lk — would appreciate a review when you get a chance. This adds a configurable affinity_mode to address staircase distribution in single-type (TrackEgress-only) fleets. Default is pack so existing deployments are unaffected.

@biglittlebigben
Contributor

Could you provide more information about the motivation for handling track/participant requests differently? The main motivation for the current scheme is related to autoscaling, particularly downscaling: since draining an instance can take a long time, we want to make sure that the instance most likely to get terminated on a downscale event is the one with the fewest requests (ideally 0).

If we were to take a patch to adjust the behavior, more extensive unit tests would be needed to ensure no regression over time.

…e_aware modes

Addresses both root causes of the 24/51-job-on-one-pod skew observed in
the 2026-05-07 load test:

Cause A (strict > tie-break): psrpc's ShortCircuitTimeout means the first
replier wins when all idle pods return the same score. Fixed by subtracting
rand.Float32()*0.001 jitter so idle pods produce distinct scores, making
the strict-> comparison effectively random among equally-idle peers.
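A minimal sketch of the jitter for Cause A. The function name `withJitter` and the constant name are hypothetical; only the rand.Float32()*0.001 subtraction comes from this commit message.

```go
package main

import (
	"fmt"
	"math/rand"
)

// maxJitter is the upper bound on the amount subtracted from a score.
const maxJitter = 0.001

// withJitter lowers score by a random amount in [0, maxJitter) so that
// equally idle pods almost never report identical scores.
func withJitter(score float32) float32 {
	return score - rand.Float32()*maxJitter
}

func main() {
	// Three idle pods that would otherwise all bid exactly 1.0.
	for i := 0; i < 3; i++ {
		fmt.Println(withJitter(1.0))
	}
}
```

The jitter is kept small so that it only reorders pods whose scores would otherwise tie, without changing the ranking between pods with meaningfully different load.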

Cause B (m.requests.Inc lag): StartEgressAffinity is called before
StartEgress, so the winning pod's m.requests counter stays 0 across an
entire 200ms burst window and all callers see score 1.0. Fixed by a
pendingClaims atomic.Int32 that increments at affinity time and
decrements at StartEgress accept (consumePendingClaim). A 2s self-decay
timer guards against claims that are never fulfilled. A CAS loop in
consumePendingClaim ensures exactly one decrement fires per increment
even when StartEgress and the timer race.
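The pending-claim mechanism for Cause B could look roughly like this. The `claimTracker` type and its method names are hypothetical, and this sketch only guarantees that the counter never drops below zero; it does not track which individual claim a decrement belongs to.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// claimTracker is a hypothetical holder for the in-flight claim counter.
type claimTracker struct {
	pending atomic.Int32
}

// claim records one affinity bid and schedules a self-decay after ttl,
// in case the matching StartEgress never arrives.
func (c *claimTracker) claim(ttl time.Duration) {
	c.pending.Add(1)
	time.AfterFunc(ttl, func() { c.consume() })
}

// consume decrements the counter by one unless it is already zero.
// The CAS loop retries on contention, so concurrent callers (the
// StartEgress path and the decay timer) cannot push it negative.
func (c *claimTracker) consume() bool {
	for {
		cur := c.pending.Load()
		if cur <= 0 {
			return false
		}
		if c.pending.CompareAndSwap(cur, cur-1) {
			return true
		}
	}
}

func main() {
	c := &claimTracker{}
	c.claim(20 * time.Millisecond)
	fmt.Println("after claim:", c.pending.Load())

	c.consume() // StartEgress accepted the job
	time.Sleep(50 * time.Millisecond)
	fmt.Println("after timer:", c.pending.Load()) // timer found nothing left to consume
}
```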

New monitor helper AvailableCPUFractionWithPending deducts
pendingSlots*TrackCpuCost from the available budget so the score
decreases with each in-flight claim.
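As a rough illustration of the deduction, assuming CPU is measured in cores and passed in explicitly. The free-function form and parameter names are hypothetical; the real helper is described as a monitor method.

```go
package main

import "fmt"

// availableCPUFractionWithPending subtracts pendingSlots*trackCPUCost
// from the free CPU before dividing by the total, clamping at zero.
func availableCPUFractionWithPending(total, used, trackCPUCost float64, pendingSlots int32) float32 {
	if total <= 0 {
		return 0
	}
	available := total - used - float64(pendingSlots)*trackCPUCost
	if available < 0 {
		available = 0
	}
	return float32(available / total)
}

func main() {
	// A 4-core pod using 1 core, with each track job assumed to cost 0.5 cores.
	for pending := int32(0); pending <= 3; pending++ {
		fmt.Println(pending, availableCPUFractionWithPending(4, 1, 0.5, pending))
	}
}
```

Each additional in-flight claim lowers the score by trackCPUCost/total, so pods that have just bid become less attractive before their request counter catches up.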

Image: asia-south1-docker.pkg.dev/avian-pulsar-430509-f6/r41-livekit/egress:v1.12.0-r41.2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
