Add configurable affinity_mode for egress pod selection by abhisheksingh-R41 · Pull Request #1209 · livekit/egress

abhisheksingh-R41 · 2026-05-06T10:29:04Z

Problem

The current StartEgressAffinity scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this causes a staircase distribution: as soon as a pod accepts its first job it scores 1.0 and wins all subsequent jobs via the short-circuit, until its CPU budget is exhausted. The next pod then starts filling, and so on.

The existing source comment acknowledges this is intentional:

"if this instance is idle and another is already handling some, the request will go to that server. This avoids having many instances with one track request each, taking availability from room composite."

This is the right behaviour for mixed fleets (Track + RoomComposite). However for a TrackEgress-only fleet the packing strategy provides no benefit and actively hurts KEDA/HPA scale-out — newly provisioned pods start idle at 0.5 and always lose to already-busy pods until those pods saturate.
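The staircase effect can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the egress code: `packScore` and `dispatch` are hypothetical names, capacity is counted in whole jobs instead of CPU, a saturated pod is modelled as bidding 0, and ties go to the lowest index as a stand-in for "first replier wins".

```go
package main

import "fmt"

// packScore mirrors the scoring described above: idle pods bid 0.5 and busy
// pods bid 1.0. A pod with no capacity left is modelled as bidding 0
// (assumption: the real service declines the request instead).
func packScore(jobs, capacity int) float32 {
	switch {
	case jobs >= capacity:
		return 0
	case jobs == 0:
		return 0.5
	default:
		return 1.0
	}
}

// dispatch assigns n jobs one at a time to the highest-scoring pod.
// Ties go to the lowest index, standing in for "first replier wins".
func dispatch(n, pods, capacity int) []int {
	load := make([]int, pods)
	for j := 0; j < n; j++ {
		best, bestScore := -1, float32(0)
		for i := range load {
			if s := packScore(load[i], capacity); s > bestScore {
				best, bestScore = i, s
			}
		}
		if best < 0 {
			break // every pod is saturated
		}
		load[best]++
	}
	return load
}

func main() {
	// Six jobs, three pods, four jobs per pod.
	fmt.Println(dispatch(6, 3, 4)) // [4 2 0]
}
```

The first pod fills completely before the second receives anything, and the third stays idle, which is the pattern a freshly scaled-out pod runs into.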

Solution

Add a configurable affinity_mode field to ServiceConfig. The default (pack) preserves existing behaviour exactly — zero change for current deployments.

| Mode | Affinity scoring | Best for |
| --- | --- | --- |
| `pack` (default) | idle=0.5, busy=1.0 | Mixed fleet (Track + RoomComposite) |
| `spread` | CPU-proportional; idle=1.0 → wins immediately | Single-type fleet (TrackEgress only) |
| `type_aware` | RoomComposite/Web prefer idle pods; Track/Participant spread by CPU load | Mixed fleet with smarter routing |
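The three modes in the table can be read as one scoring function. The following is a sketch under assumptions, not the PR's diff: `affinity` and `spreadScore` are hypothetical names, `heavy` stands for a RoomComposite or Web request, and `availableCPU` is the remaining CPU as a fraction of the total.

```go
package main

import "fmt"

const (
	modePack      = "pack"
	modeSpread    = "spread"
	modeTypeAware = "type_aware"
)

// spreadScore gives an idle pod the maximum score and a busy pod its
// remaining CPU fraction, so the least-loaded busy pod scores highest.
func spreadScore(idle bool, availableCPU float32) float32 {
	if idle {
		return 1.0
	}
	return availableCPU
}

// affinity returns a score in [0, 1] for the given mode.
func affinity(mode string, idle, heavy bool, availableCPU float32) float32 {
	switch mode {
	case modeSpread:
		return spreadScore(idle, availableCPU)
	case modeTypeAware:
		if !heavy {
			return spreadScore(idle, availableCPU)
		}
		// Heavy requests prefer idle pods: 1.0 idle, 0.5 busy.
		if idle {
			return 1.0
		}
		return 0.5
	default:
		// "pack" and the empty string: idle 0.5, busy 1.0.
		if idle {
			return 0.5
		}
		return 1.0
	}
}

func main() {
	for _, mode := range []string{modePack, modeSpread, modeTypeAware} {
		fmt.Printf("%-10s idle=%.1f busy=%.1f\n", mode,
			affinity(mode, true, false, 1.0),
			affinity(mode, false, false, 0.6))
	}
}
```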

How `spread` works

Idle pods return AvailableCPUFraction() == 1.0 → hits MaximumAffinity=1 and wins immediately (same speed as current busy-pod short-circuit).
Busy pods return their remaining CPU fraction (e.g. 0.6) → client waits ShortCircuitTimeout=500ms and picks the least-loaded pod.
Result: jobs distribute evenly across all pods rather than packing sequentially.

Changes

pkg/config/service.go: adds AffinityMode string `yaml:"affinity_mode"` to ServiceConfig
pkg/stats/monitor.go: adds AvailableCPUFraction() float32 (wraps existing getCPUUsageLocked; idle returns 1.0, busy returns available/total)
pkg/server/server_rpc.go: replaces StartEgressAffinity with mode switch; adds isHeavyEgressRequest helper
pkg/server/server_rpc_test.go: table-driven unit tests for isHeavyEgressRequest covering all 5 request types
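A table-driven check of the heavy/light classification might look roughly like this. It is a sketch with hypothetical names: `requestKind` stands in for the real request types, `isHeavy` stands in for isHeavyEgressRequest, and treating TrackComposite as light is an assumption, since the description above only names RoomComposite/Web as heavy and Track/Participant as light.

```go
package main

import "fmt"

// requestKind is a hypothetical stand-in for the five egress request types.
type requestKind int

const (
	RoomComposite requestKind = iota
	Web
	Participant
	TrackComposite
	Track
)

// isHeavy reports whether a request should prefer an idle pod.
// Assumption: only RoomComposite and Web count as heavy.
func isHeavy(k requestKind) bool {
	switch k {
	case RoomComposite, Web:
		return true
	default:
		return false
	}
}

func main() {
	cases := []struct {
		name string
		kind requestKind
		want bool
	}{
		{"RoomComposite", RoomComposite, true},
		{"Web", Web, true},
		{"Participant", Participant, false},
		{"TrackComposite", TrackComposite, false}, // assumed light
		{"Track", Track, false},
	}
	for _, tc := range cases {
		got := isHeavy(tc.kind)
		fmt.Printf("%-14s heavy=%v ok=%v\n", tc.name, got, got == tc.want)
	}
}
```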

Backwards compatibility

affinity_mode defaults to empty string which falls through to default: case — identical to current pack behaviour.
No changes to existing config parsing, prometheus metrics, or admission logic.
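The fallback can be illustrated with a trimmed-down config type. This is a sketch, not the real ServiceConfig: only the new field is shown, `effectiveMode` is a hypothetical helper, and mapping unrecognised values to pack is an assumption that follows from a `default:` branch catching everything not listed.

```go
package main

import "fmt"

// ServiceConfig is a trimmed stand-in that carries only the new field.
type ServiceConfig struct {
	AffinityMode string `yaml:"affinity_mode"`
}

// effectiveMode resolves the configured mode. An unset field is the empty
// string, which lands in the default branch and behaves like "pack".
func (c ServiceConfig) effectiveMode() string {
	switch c.AffinityMode {
	case "spread", "type_aware":
		return c.AffinityMode
	default:
		return "pack"
	}
}

func main() {
	unset := ServiceConfig{}
	spread := ServiceConfig{AffinityMode: "spread"}
	fmt.Println(unset.effectiveMode())  // pack
	fmt.Println(spread.effectiveMode()) // spread
}
```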

The current StartEgressAffinity always scores idle pods at 0.5 and busy pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this means the first pod to accept any job wins all subsequent jobs until its CPU budget is exhausted: a staircase pattern rather than even spread. The existing code comment acknowledges this is intentional for mixed fleets ("avoids having many instances with one track request each, taking availability from room composite"). However, for a TrackEgress-only fleet the packing strategy provides no benefit and causes sequential scale-out delays.

This commit adds a configurable affinity_mode field to ServiceConfig:

pack (default): current behaviour, unchanged.
spread: CPU-proportional scoring; idle pods score 1.0 and win immediately via MaximumAffinity short-circuit, busy pods score proportionally so the least-loaded pod wins after ShortCircuitTimeout. Best for single-type (TrackEgress-only) fleets.
type_aware: RoomComposite/Web requests prefer idle pods (1.0 idle / 0.5 busy); Track/Participant requests spread by CPU load. Best of both worlds for mixed fleets.

Default is "pack" so all existing deployments are unaffected.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

abhisheksingh-R41 · 2026-05-06T10:51:51Z

@frostbyte73 @milos-lk — would appreciate a review when you get a chance. This adds a configurable affinity_mode to address staircase distribution in single-type (TrackEgress-only) fleets. Default is pack so existing deployments are unaffected.

biglittlebigben · 2026-05-06T18:56:21Z

Could you provide more information about the motivation for handling track/participant requests differently? The main motivation for the current scheme is related to autoscaling, particularly downscaling: since draining an instance can take a long time, we want to make sure that the instance most likely to get terminated on a downscale event is the one with the fewest requests (ideally 0).

If we were to take a patch to adjust the behavior, more extensive unit tests would be needed to ensure no regression over time.

…e_aware modes

Addresses both root causes of the 24/51-job-on-one-pod skew observed in the 2026-05-07 load test:

Cause A (strict > tie-break): psrpc's ShortCircuitTimeout means the first replier wins when all idle pods return the same score. Fixed by subtracting rand.Float32()*0.001 jitter so idle pods produce distinct scores, making the strict-> comparison effectively random among equally-idle peers.

Cause B (m.requests.Inc lag): StartEgressAffinity is called before StartEgress, so the winning pod's m.requests counter stays 0 across an entire 200ms burst window and all callers see score 1.0. Fixed by a pendingClaims atomic.Int32 that increments at affinity time and decrements at StartEgress accept (consumePendingClaim). A 2s self-decay timer guards against claims that are never fulfilled. A CAS loop in consumePendingClaim ensures exactly one decrement fires per increment even when StartEgress and the timer race.

New monitor helper AvailableCPUFractionWithPending deducts pendingSlots*TrackCpuCost from the available budget so the score decreases with each in-flight claim.

Image: asia-south1-docker.pkg.dev/avian-pulsar-430509-f6/r41-livekit/egress:v1.12.0-r41.2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

abhisheksingh-R41 requested a review from a team as a code owner May 6, 2026 10:29

abhisheksingh-R41 wants to merge 2 commits into livekit:main from recruit41:upstream-affinity-mode
