fix(planner): make SLA-mode scale-down consolidation-aware again#9294
The flat ``load_scaling_down_sensitivity * SLA`` threshold did not account for the consolidation factor: when scaling N -> N-1, each survivor absorbs N/(N-1) as much load. Under the default sensitivity=80%, the planner authorised 2 -> 1 scale-downs whenever each worker was under 80% of SLA, but the surviving worker would immediately exceed SLA after consolidation -- causing the 2 <-> 1 oscillation seen in QA.

Per-component fixes:

- prefill / agg-prefill: re-predict TTFT with ``queued_prefill_tokens * N/(N-1)`` (and ``current_decode_kv * N/(N-1)`` for agg). Only the queue input is scaled; the regression's internal ``avg_isl`` (the new request's own compute) stays put, so we do not inflate the forward-pass time that does not actually shrink with more workers.
- decode / agg-decode: switch to input-side check on per-worker KV utilisation. Refuse scale-down if ``util * N/(N-1) >= sensitivity``.

The ``load_scaling_down_sensitivity`` knob remains as the operator-tunable safety margin on top of the consolidation prediction.

8 new unit tests cover the production scenario (N=2 at 50% util refuses scale-down) plus the high-N regime (N=10 permits at 70% but refuses at 78%). All 352 planner unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
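The decode-side gate described in the commit can be sketched as follows. This is a minimal illustration under assumed names (``can_scale_down_decode`` and its parameters are invented for the sketch), not the planner's actual code; only the ``util * N/(N-1) >= sensitivity`` refusal rule comes from the commit text.

```python
def can_scale_down_decode(num_workers: int, util: float, sensitivity: float) -> bool:
    """Refuse N -> N-1 if the survivors' projected utilisation crosses the margin."""
    if num_workers <= 1:
        return False  # nothing to consolidate
    # Each of the N-1 survivors absorbs N/(N-1) of its current per-worker load.
    projected_util = util * num_workers / (num_workers - 1)
    return projected_util < sensitivity

# Production scenario: N=2 at 50% util -> projected 100% >= 80% -> refuse.
print(can_scale_down_decode(2, 0.50, 0.80))   # False
# High-N regime: N=10 at 70% -> projected ~77.8% < 80% -> permit.
print(can_scale_down_decode(10, 0.70, 0.80))  # True
# N=10 at 78% -> projected ~86.7% >= 80% -> refuse.
print(can_scale_down_decode(10, 0.78, 0.80))  # False
```

Note how the high-N cases fall out of the same formula: as N grows, N/(N-1) approaches 1, so the gate converges to the plain sensitivity threshold.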
Walkthrough: This PR introduces consolidation-aware scale-down gating across four scaling decision functions.

Changes: Consolidation-Aware Load Scaling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/planner/core/load_scaling.py`:
- Around line 398-401: Collapse the multi-line ternary expressions into
single-line ternaries for all occurrences (e.g., the consolidation assignment in
load_scaling.py where consolidation is set, and the can_scale_down assignments)
so they conform to Ruff's formatting rules, and replace any ambiguous
multiplication sign `×` (U+00D7) in comments/docstrings with a plain asterisk
`*` (notably around the comments/docstrings near the can_scale_down logic and
later docstrings). After making these edits, run `ruff format` and `ruff check
--fix` locally to ensure all remaining style fixes are applied before
committing.
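The suggested fix can be illustrated with a minimal before/after, assuming the ``consolidation`` assignment the comment names computes the N/(N-1) factor (the body shown is illustrative):

```python
num_workers = 2

# Before (multi-line ternary that `ruff format` would collapse):
# consolidation = (
#     num_workers / (num_workers - 1)
#     if num_workers > 1
#     else 1.0
# )

# After: single-line ternary; the comment uses a plain '*' rather than
# the ambiguous U+00D7 sign that RUF002 flags.
# Survivor absorbs N/(N-1) * its current load.
consolidation = num_workers / (num_workers - 1) if num_workers > 1 else 1.0
print(consolidation)  # 2.0
```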
In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 367-369: The test function signature for _tick is currently broken
across multiple lines and Black expects a single-line signature, and several
docstrings contain the Unicode multiplication sign “×” which Ruff flags as
RUF002; update both _tick definitions (function name _tick) to use a single-line
signature (e.g., def _tick(self, *, num_workers: int, sched_kv_per_worker: int)
-> TickInput:) and replace every “×” in the nearby docstrings (the occurrences
you flagged) with the ASCII letter "x" (or an explicit "times" if clearer), then
run ruff format and ruff check --fix (or the repo pre-commit) on the touched
files to apply and verify formatting fixes.
📒 Files selected for processing (2)
- components/src/dynamo/planner/core/load_scaling.py
- components/src/dynamo/planner/tests/unit/test_state_machine.py
Refines the prefill / agg-prefill scale-down check so the ``load_scaling_down_sensitivity`` margin applies to the queue-induced portion of TTFT, not the new request's own forward-pass time. Without this, sensitivity eats into the unavoidable own-compute budget and over-penalises scale-down when ``T_own`` is a meaningful fraction of the SLA.

Replaces:

    TTFT(queued * N/(N-1)) < SLA * sensitivity

with:

    TTFT(queued * N/(N-1)) - T_own_post < (SLA - T_own_post) * sensitivity

where ``T_own_post`` is the regression's prediction at queue=0 (and ``decode_kv * N/(N-1)`` for agg, since the survivor absorbs decode state too). When ``T_own_post >= SLA`` the queue budget is non-positive and scale-down is refused -- own compute alone already misses SLA, so losing a worker can only worsen contention.

Adds two tests: one where the queue-budget check permits a scale-down that the previous output-side check would have refused, and one where own compute alone exceeds SLA. Existing tests updated so per-tick FPM ``wall_time`` matches the trained regression's slope (otherwise the refit on each tick rejects the model with negative-coefficient warnings, masking the new code path).

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
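The refined inequality can be sketched with a linear stand-in for the trained TTFT regression. The coefficients and helper names below are assumptions for illustration; only the queue-budget inequality itself comes from the commit.

```python
def predict_ttft(queued_tokens: float, decode_kv: float) -> float:
    # Stand-in for the trained regression: a queue term, a decode-KV term,
    # and the new request's own forward-pass time (the constant).
    return 2e-4 * queued_tokens + 1e-5 * decode_kv + 120.0  # ms

def prefill_can_scale_down(n: int, queued: float, decode_kv: float,
                           sla_ms: float, sensitivity: float) -> bool:
    if n <= 1:
        return False
    scale = n / (n - 1)
    # T_own_post: prediction at queue=0, with the survivor's consolidated KV.
    t_own_post = predict_ttft(0.0, decode_kv * scale)
    if t_own_post >= sla_ms:
        return False  # queue budget non-positive: own compute alone misses SLA
    queue_budget = (sla_ms - t_own_post) * sensitivity
    queue_cost = predict_ttft(queued * scale, decode_kv * scale) - t_own_post
    return queue_cost < queue_budget

# Modest queue at N=2: queue-induced TTFT stays well inside the scaled budget.
print(prefill_can_scale_down(2, 100_000, 0.0, 500.0, 0.8))    # True
# Heavy queue: the survivor's queue-induced TTFT blows the budget.
print(prefill_can_scale_down(2, 2_000_000, 0.0, 500.0, 0.8))  # False
```

Because sensitivity multiplies only ``SLA - T_own_post``, the margin no longer eats into the own-compute portion of the budget, which is the over-penalisation the commit removes.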
Replace U+00D7 MULTIPLICATION SIGN (x), U+2248 ALMOST EQUAL TO (~=), U+2192 RIGHTWARDS ARROW (->), and U+2264 LESS-THAN OR EQUAL (<=) in comments and docstrings with their ASCII equivalents on lines I introduced. Leaves pre-existing occurrences elsewhere alone. No behaviour change. Ruff RUF002/RUF003 clean. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Closes a hole in the agg dispatcher where ``_agg_prefill_scaling`` returning ``None`` because of an active consolidation safety refusal was conflated with "no prefill signal", causing ``_advance_load_agg`` to fall through to its line-327 fallback and grant decode-only scale-down. Same shape as the original 2 <-> 1 oscillation, just on the prefill-saturation side instead of decode.

Concrete trigger: 2 agg workers with decode_kv * N/(N-1) pushing ``T_own_post`` over the TTFT SLA, while decode util * N/(N-1) is still under sensitivity. Pre-fix, the dispatcher silently overrides the prefill veto.

Fix:

- ``_agg_prefill_scaling`` and ``_agg_decode_scaling`` now return ``num_workers`` (not ``None``) when the consolidation check actively refused scale-down, so the dispatcher distinguishes "stay at current count" from "no signal".
- ``_advance_load_agg`` captures ``p_refused`` immediately after the prefill sub-call (before the decode call can overwrite the shared diag field) and surfaces ``scale_down_refused_consolidation`` as the aggregate reason instead of the generic "no_change".
- All four sub-decisions (prefill, decode, agg-prefill, agg-decode) stamp ``_diag_load_reason = "scale_down_refused_consolidation"`` when they refuse, giving operators a distinct signal in diagnostics.

Adds two regression tests in ``TestAggConsolidationAwareScaleDown``: one that fails without the fix (decode-only scale-down would have broken prefill SLA), and one sanity test confirming agg still scales down when both sides are safe. Also asserts the new diagnostic reason surfaces correctly. All 371 planner unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Previous version of the consolidation-aware decode check used cache
utilisation ``(sched_kv + queued_kv) / max_kv_tokens`` as the saturation
proxy. That under-protects the common case where the engine becomes
SLA-bound (latency > target) at a KV level far below cache capacity --
e.g. the customer's regression hits SLA at ~309K KV while their cache
holds 400K+. With ``max_kv`` as the denominator and default sensitivity
0.8, scale-down would still fire at a state that breaches SLA after
consolidation.
Replace the input-side util check with two safety gates evaluated at
the survivor's post-consolidation kv (``(sched + queued) * N/(N-1)``):
1. **Cache feasibility** (when ``max_kv_tokens`` is advertised):
refuse if ``post_kv >= max_kv``. Block eviction / queueing past
the cache is non-linear and outside the regression's training
domain, so we hard-fail.
2. **SLA check** via regression: predict ITL at post_kv via
``estimate_next_itl`` and refuse if the prediction crosses
``ITL_SLA * sensitivity``. The regression carries the SLA-bound
capacity directly; this works regardless of cache size.
Mirrored in ``_agg_decode_scaling`` against combined cache pressure
(decode_kv + queued_prefill, since queued prefill becomes decode KV).
Customer scenario verification (their regression: ITL = 7.15e-5*kv +
17.89, SLA=40, sensitivity=0.8):
* T1 transient (each worker 70K, total 140K): post_kv=140K, ITL=28ms
< 32ms threshold -> ALLOW (correct: 1 worker handles 140K at SLA).
* Steady state at 2 workers under post-growth load (each 177K, total
355K): post_kv=354K, ITL=43.2ms >= 32ms -> REFUSE. Cycle breaks.
Tests in ``TestDecodeConsolidationAwareScaleDown`` updated to exercise
the new semantics: post-consolidation within/breaches SLA, exceeds
max_kv, missing max_kv falls through to SLA check. All 370 planner
unit tests pass.
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
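The two gates from this commit can be sketched as follows. The ITL regression below uses the customer coefficients quoted in the commit (ITL = 7.15e-5*kv + 17.89 ms); the function names and the per-worker KV inputs are illustrative assumptions, not the planner's real signatures.

```python
from typing import Optional

def estimate_next_itl(kv_tokens: float) -> float:
    # Customer regression quoted above; slope/intercept taken from the commit.
    return 7.15e-5 * kv_tokens + 17.89  # ms

def decode_can_scale_down(n: int, sched_kv: float, queued_kv: float,
                          max_kv: Optional[int],
                          itl_sla_ms: float, sensitivity: float) -> bool:
    if n <= 1:
        return False
    # Survivor's post-consolidation KV load (per-worker inputs).
    post_kv = (sched_kv + queued_kv) * n / (n - 1)
    # Gate 1: cache feasibility. Block eviction / queueing past the cache is
    # non-linear and outside the regression's training domain -> hard-fail.
    if max_kv is not None and post_kv >= max_kv:
        return False
    # Gate 2: SLA check via the regression at post_kv.
    return estimate_next_itl(post_kv) < itl_sla_ms * sensitivity

# T1 transient: 2 workers at 70K each -> post_kv=140K, ITL~=27.9ms < 32ms -> ALLOW
print(decode_can_scale_down(2, 70_000, 0, 400_000, 40.0, 0.8))   # True
# Steady state: 2 workers at ~177K each -> post_kv~=354K, ITL~=43.2ms >= 32ms -> REFUSE
print(decode_can_scale_down(2, 177_000, 0, 400_000, 40.0, 0.8))  # False
```

With these numbers the SLA gate refuses well before the 400K cache fills, which is exactly the under-protected case the old ``util / max_kv`` check missed.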
Summary

The flat ``load_scaling_down_sensitivity × SLA`` scale-down threshold did not account for consolidation: when scaling N → N-1, each survivor absorbs ``N/(N-1)`` as much load. With the default 80% sensitivity, the planner authorised 2 → 1 scale-downs whenever each worker was under 80% of SLA -- but the surviving worker would immediately blow SLA, causing 2 ↔ 1 oscillation under constant load.

Per-component fix (SLA mode only; easy mode unchanged):

- prefill / agg-prefill: re-predict TTFT with ``queued_prefill_tokens × N/(N-1)`` (and ``current_decode_kv × N/(N-1)`` for agg). Only the queue input is scaled; the regression's internal ``avg_isl`` (the new request's own compute) is left alone, so we don't inflate forward-pass time that doesn't shrink with more workers.
- decode / agg-decode: input-side check on per-worker KV utilisation; refuse scale-down if ``util × N/(N-1) >= sensitivity``.
- ``load_scaling_down_sensitivity`` remains as the operator-tunable safety margin on top of the consolidation prediction. Default behaviour for high N is essentially unchanged (``(N-1)/N → 1`` as N grows); the fix targets the low-N regime where the production oscillation occurs.

Behaviour at production T1 (2 workers, decode kv util ≈ 0.7 each)
Before: per-worker ITL 23 ms < threshold (32 ms) → scale down → survivor blows SLA → scale up.
After: util ``0.7 × 2 = 1.4 >= 0.8`` → refuse scale-down. Oscillation gone.

Test plan

- test_state_machine.py:
  - TestDecodeConsolidationAwareScaleDown: production scenario (N=2 at 50% util refuses), N=2 at 30% permits, N=10 at 70%/78%, missing-``max_kv`` fallback
  - TestPrefillConsolidationAwareScaleDown: high queue at N=2 refuses, empty queue permits, queue-only inflation lets N=10 still scale down
- ``pytest-marker-report`` failure is pre-existing on origin/main due to an unconditional ``kvbm.trtllm_integration`` import

🤖 Generated with Claude Code