Add configurable affinity_mode for egress pod selection #1209
abhisheksingh-R41 wants to merge 2 commits into
The current StartEgressAffinity always scores idle pods at 0.5 and busy
pods at 1.0. Combined with MaximumAffinity=1 in the psrpc client, this
means the first pod to accept any job wins all subsequent jobs until its
CPU budget is exhausted — a staircase pattern rather than even spread.
The existing code comment acknowledges this is intentional for mixed
fleets ("avoids having many instances with one track request each, taking
availability from room composite"). However, for a TrackEgress-only fleet
the packing strategy provides no benefit and causes sequential scale-out
delays.
This commit adds a configurable affinity_mode field to ServiceConfig:
  pack (default): current behaviour, unchanged.
  spread: CPU-proportional scoring; idle pods score 1.0 and win
    immediately via the MaximumAffinity short-circuit, busy pods score
    proportionally so the least-loaded pod wins after
    ShortCircuitTimeout. Best for single-type (TrackEgress-only) fleets.
  type_aware: RoomComposite/Web requests prefer idle pods (1.0 idle /
    0.5 busy); Track/Participant requests spread by CPU load. Best of
    both worlds for mixed fleets.
Default is "pack" so all existing deployments are unaffected.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@frostbyte73 @milos-lk — would appreciate a review when you get a chance. This adds a configurable
Could you provide more information about the motivation for handling track/participant requests differently? The main motivation for the current scheme is related to autoscaling, particularly downscaling: since draining an instance can take a long time, we want to make sure that the instance most likely to get terminated on a downscale event is the one with the fewest requests (ideally 0). If we were to take a patch to adjust the behavior, more extensive unit tests would be needed to ensure no regression over time.
…e_aware modes

Addresses both root causes of the 24/51-job-on-one-pod skew observed in the 2026-05-07 load test:

Cause A (strict > tie-break): psrpc's ShortCircuitTimeout means the first replier wins when all idle pods return the same score. Fixed by subtracting rand.Float32()*0.001 jitter so idle pods produce distinct scores, making the strict-> comparison effectively random among equally-idle peers.

Cause B (m.requests.Inc lag): StartEgressAffinity is called before StartEgress, so the winning pod's m.requests counter stays 0 across an entire 200ms burst window and all callers see score 1.0. Fixed by a pendingClaims atomic.Int32 that increments at affinity time and decrements at StartEgress accept (consumePendingClaim). A 2s self-decay timer guards against claims that are never fulfilled. A CAS loop in consumePendingClaim ensures exactly one decrement fires per increment even when StartEgress and the timer race.

New monitor helper AvailableCPUFractionWithPending deducts pendingSlots*TrackCpuCost from the available budget so the score decreases with each in-flight claim.

Image: asia-south1-docker.pkg.dev/avian-pulsar-430509-f6/r41-livekit/egress:v1.12.0-r41.2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
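The pending-claim accounting in the commit message above could look roughly like the following. This is a minimal sketch, not the code in the patch: `claimCounter`, `add`, `consume` and `availableFractionWithPending` are hypothetical names, the CPU figures are made up, and the 2s decay timer and the jitter are left out. The CAS loop here only guarantees that the counter never drops below zero; how the patch pairs each decrement with its own increment is not visible in this thread.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// claimCounter is a hypothetical stand-in for the pendingClaims counter.
type claimCounter struct {
	pending atomic.Int32
}

// add records one claim at affinity time.
func (c *claimCounter) add() { c.pending.Add(1) }

// consume removes one outstanding claim using a compare-and-swap loop.
// It returns false when no claim is left, so concurrent callers can
// never push the counter below zero.
func (c *claimCounter) consume() bool {
	for {
		cur := c.pending.Load()
		if cur <= 0 {
			return false
		}
		if c.pending.CompareAndSwap(cur, cur-1) {
			return true
		}
	}
}

// availableFractionWithPending deducts pending*trackCost from the free
// CPU budget and returns the remainder as a fraction of total (0 if
// nothing is left).
func availableFractionWithPending(total, used, trackCost float64, pending int32) float64 {
	free := total - used - float64(pending)*trackCost
	if total <= 0 || free <= 0 {
		return 0
	}
	return free / total
}

func main() {
	var c claimCounter
	c.add()
	c.add()
	// 4 CPUs, 1 in use, two pending claims at 0.5 CPU each: (4-1-1)/4.
	fmt.Println(availableFractionWithPending(4, 1, 0.5, c.pending.Load())) // 0.5
	fmt.Println(c.consume(), c.consume(), c.consume())                     // true true false
}
```

Because the score falls with every claim that has not yet turned into a running request, a burst of callers inside the same window no longer all see the same value from one pod.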
Problem
The current `StartEgressAffinity` scores idle pods at 0.5 and busy pods at 1.0. Combined with `MaximumAffinity=1` in the psrpc client, this causes a staircase distribution: as soon as a pod accepts its first job it scores 1.0 and wins all subsequent jobs via the short-circuit, until its CPU budget is exhausted. The next pod then starts filling, and so on.

The existing source comment acknowledges this is intentional:

> avoids having many instances with one track request each, taking availability from room composite

This is the right behaviour for mixed fleets (Track + RoomComposite). However, for a TrackEgress-only fleet the packing strategy provides no benefit and actively hurts KEDA/HPA scale-out: newly provisioned pods start idle at 0.5 and always lose to already-busy pods until those pods saturate.
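The staircase can be reproduced with a toy model. The sketch below is illustrative only: `packScore` and `assign` are hypothetical helpers, each pod is given a fixed job capacity in place of a real CPU budget, a full pod is assumed to score 0, and ties go to the first pod rather than the first replier.

```go
package main

import "fmt"

// packScore mirrors the scoring described above: 0.5 for an idle pod,
// 1.0 for a busy one. Scoring a full pod at 0 is an assumption.
func packScore(jobs, capacity int) float32 {
	switch {
	case jobs >= capacity:
		return 0
	case jobs > 0:
		return 1.0
	default:
		return 0.5
	}
}

// assign places n jobs one at a time on the highest-scoring pod and
// returns the resulting job count per pod.
func assign(pods, capacity, n int) []int {
	load := make([]int, pods)
	for j := 0; j < n; j++ {
		best, bestScore := -1, float32(0)
		for i, l := range load {
			if s := packScore(l, capacity); s > bestScore {
				best, bestScore = i, s
			}
		}
		if best < 0 {
			break // every pod is full
		}
		load[best]++
	}
	return load
}

func main() {
	// 4 pods, 5 jobs each, 12 jobs submitted.
	fmt.Println(assign(4, 5, 12)) // [5 5 2 0]
}
```

Two pods fill completely before the third receives anything, and the fourth stays idle, which is the pattern described above.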
Solution
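Add a configurable `affinity_mode` field to `ServiceConfig`. The default (`pack`) preserves existing behaviour exactly, with zero change for current deployments.

| Mode | Behaviour |
|---|---|
| `pack` (default) | Current behaviour, unchanged. |
| `spread` | CPU-proportional scoring: idle pods score 1.0, busy pods score proportionally. Best for single-type (TrackEgress-only) fleets. |
| `type_aware` | RoomComposite/Web requests prefer idle pods (1.0 idle / 0.5 busy); Track/Participant requests spread by CPU load. Intended for mixed fleets. |

How `spread` works

- Idle pod: `AvailableCPUFraction() == 1.0` → hits `MaximumAffinity=1` and wins immediately (same speed as current busy-pod short-circuit).
- Busy pods: each scores in proportion to its available CPU; the client waits `ShortCircuitTimeout=500ms` and picks the least-loaded pod.

Changes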
- `pkg/config/service.go`: adds ``AffinityMode string `yaml:"affinity_mode"` `` to `ServiceConfig`
- `pkg/stats/monitor.go`: adds `AvailableCPUFraction() float32` (wraps existing `getCPUUsageLocked`; idle returns 1.0, busy returns available/total)
- `pkg/server/server_rpc.go`: replaces `StartEgressAffinity` with mode switch; adds `isHeavyEgressRequest` helper
- `pkg/server/server_rpc_test.go`: table-driven unit tests for `isHeavyEgressRequest` covering all 5 request types

Backwards compatibility

`affinity_mode` defaults to an empty string, which falls through to the `default:` case and is identical to the current `pack` behaviour.
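For reviewers, the scoring rules of the three modes can be condensed into one function. This is a sketch under stated assumptions rather than the code in `server_rpc.go`: `affinityMode` and `affinity` are hypothetical names, the `busy`, `heavy` and `availableFraction` inputs stand in for what the real handler would derive from the monitor and the request, and the case where a pod cannot accept the request at all is omitted.

```go
package main

import "fmt"

// affinityMode is a hypothetical typed form of the affinity_mode setting.
type affinityMode string

const (
	modePack      affinityMode = "pack"
	modeSpread    affinityMode = "spread"
	modeTypeAware affinityMode = "type_aware"
)

// affinity returns the score a pod would report for a request.
// busy: the pod already runs at least one request.
// heavy: the request is a RoomComposite/Web request.
// availableFraction: free CPU as a fraction of total, 1.0 when idle.
func affinity(mode affinityMode, busy, heavy bool, availableFraction float32) float32 {
	switch mode {
	case modeSpread:
		// Idle pods report 1.0, busy pods report their free fraction.
		return availableFraction
	case modeTypeAware:
		if heavy {
			// Heavy requests prefer idle pods.
			if busy {
				return 0.5
			}
			return 1.0
		}
		return availableFraction
	default:
		// "pack" and the empty string: busy pods win.
		if busy {
			return 1.0
		}
		return 0.5
	}
}

func main() {
	fmt.Println(affinity("", true, false, 0.25))           // 1
	fmt.Println(affinity(modeSpread, true, false, 0.25))   // 0.25
	fmt.Println(affinity(modeTypeAware, true, true, 0.25)) // 0.5
}
```

Keeping `pack` in the `default:` branch is what makes an unset `affinity_mode` behave exactly like today.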