
fix(planner): make SLA-mode scale-down consolidation-aware again #9294

Open

tedzhouhk wants to merge 6 commits into main from hzhou/planner-consolidation-aware-scale-down

Conversation

@tedzhouhk (Contributor) commented May 7, 2026

Summary

The flat load_scaling_down_sensitivity × SLA scale-down threshold did not account for consolidation: when scaling N → N-1, each survivor absorbs N/(N-1) more load. With the default 80% sensitivity, the planner authorised 2 → 1 scale-downs whenever each worker was under 80% of SLA — but the surviving worker would immediately blow SLA, causing 2 ↔ 1 oscillation under constant load.

Per-component fix (SLA mode only; easy mode unchanged):

  • prefill / agg-prefill — re-predict TTFT with queued_prefill_tokens × N/(N-1) (and current_decode_kv × N/(N-1) for agg). Only the queue input is scaled; the regression's internal avg_isl (the new request's own compute) is left alone, so we don't inflate forward-pass time that doesn't shrink with more workers.
  • decode / agg-decode — switch to input-side check on per-worker KV utilisation. Refuse scale-down if util × N/(N-1) >= sensitivity.

load_scaling_down_sensitivity remains as the operator-tunable safety margin on top of the consolidation prediction. Default behaviour at high N is essentially unchanged, since N/(N-1) → 1 as N grows; the fix targets the low-N regime where the production oscillation occurs.
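
A minimal sketch of the two checks as described (the function names and the `predict_ttft` regression hook are assumptions for illustration, not the real load_scaling.py API):

```python
def prefill_can_scale_down(queued_prefill_tokens: float, n: int,
                           ttft_sla_ms: float, sensitivity: float,
                           predict_ttft) -> bool:
    """Re-predict TTFT with only the queue term inflated by N/(N-1);
    the regression's own-compute term (avg_isl) is deliberately untouched."""
    if n <= 1:
        return False
    post_queue = queued_prefill_tokens * n / (n - 1)
    return predict_ttft(post_queue) < ttft_sla_ms * sensitivity


def decode_can_scale_down(kv_util_per_worker: float, n: int,
                          sensitivity: float = 0.8) -> bool:
    """Input-side check: refuse when the survivor's projected KV utilisation
    after N -> N-1 consolidation crosses the sensitivity margin."""
    if n <= 1:
        return False
    return kv_util_per_worker * n / (n - 1) < sensitivity


# Production T1 below: 0.7 * 2/(2-1) = 1.4 >= 0.8 -> refuse scale-down.
assert decode_can_scale_down(0.7, n=2) is False
```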

Behaviour at production T1 (2 workers, decode kv util ≈ 0.7 each)

Before: per-worker ITL 23 ms < threshold (32 ms) → scale down → survivor blows SLA → scale up.
After: util 0.7 × 2 = 1.4 >= 0.8 → refuse scale-down. Oscillation gone.

Test plan

  • 8 new unit tests in test_state_machine.py:
    • TestDecodeConsolidationAwareScaleDown: production scenario (N=2 at 50% util refuses), N=2 at 30% permits, N=10 at 70%/78%, missing-max_kv fallback
    • TestPrefillConsolidationAwareScaleDown: high queue at N=2 refuses, empty queue permits, queue-only inflation lets N=10 still scale down
  • All 352 existing planner unit tests pass
  • Pre-commit clean on changed files (the unrelated pytest-marker-report failure is pre-existing on origin/main due to an unconditional kvbm.trtllm_integration import)
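
For reference, the numeric cases in the test list above all reduce to the same inequality; a standalone sanity check (a sketch, not the actual test file, which trains regressions through the planner's tick helpers):

```python
def refuses(util: float, n: int, sensitivity: float = 0.8) -> bool:
    """Input-side decode gate: refuse when util * N/(N-1) >= sensitivity."""
    return util * n / (n - 1) >= sensitivity

assert refuses(0.50, n=2)        # production scenario: 1.000 >= 0.8 -> refuse
assert not refuses(0.30, n=2)    # 0.600 < 0.8 -> permit
assert not refuses(0.70, n=10)   # 0.778 < 0.8 -> permit at high N
assert refuses(0.78, n=10)       # 0.867 >= 0.8 -> refuse
```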

🤖 Generated with Claude Code



Summary by CodeRabbit

  • Bug Fixes

    • Improved load-based scaling decisions to prevent SLA violations when consolidating workloads; scale-down is now blocked when post-consolidation latency predictions would exceed sensitivity thresholds.
    • Enhanced latency prediction logic to account for consolidation effects across prefill and decode operations.
  • Tests

    • Added comprehensive test coverage for consolidation-aware scale-down behavior.

The flat ``load_scaling_down_sensitivity * SLA`` threshold did not
account for the consolidation factor: when scaling N -> N-1, each
survivor absorbs N/(N-1) more load. Under the default sensitivity=80%,
the planner authorised 2 -> 1 scale-downs whenever each worker was
under 80% of SLA, but the surviving worker would immediately exceed
SLA after consolidation -- causing the 2 <-> 1 oscillation seen in QA.

Per-component fixes:

- prefill / agg-prefill: re-predict TTFT with ``queued_prefill_tokens *
  N/(N-1)`` (and ``current_decode_kv * N/(N-1)`` for agg). Only the
  queue input is scaled; the regression's internal ``avg_isl`` (the new
  request's own compute) stays put, so we do not inflate the forward-
  pass time that does not actually shrink with more workers.
- decode / agg-decode: switch to input-side check on per-worker KV
  utilisation. Refuse scale-down if ``util * N/(N-1) >= sensitivity``.

The ``load_scaling_down_sensitivity`` knob remains as the operator-
tunable safety margin on top of the consolidation prediction.

8 new unit tests cover the production scenario (N=2 at 50% util refuses
scale-down) plus the high-N regime (N=10 permits at 70% but refuses at
78%). All 352 planner unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
@tedzhouhk requested review from a team as code owners May 7, 2026 23:50
@tedzhouhk changed the title from fix(planner): make SLA-mode scale-down consolidation-aware to fix(planner): make SLA-mode scale-down consolidation-aware again on May 7, 2026
@coderabbitai (Bot, Contributor) commented May 7, 2026


Walkthrough

This PR introduces consolidation-aware scale-down gating across four scaling decision functions. The _scale_decision helper is refactored to accept an explicit can_scale_down flag. Each decision function (prefill, decode, agg-prefill, agg-decode) computes whether scale-down is safe by applying a consolidation factor N/(N-1) to post-merger estimates and checking whether they exceed sensitivity-adjusted thresholds.

Changes

Consolidation-Aware Load Scaling

  • Scale-decision Helper Refactor (components/src/dynamo/planner/core/load_scaling.py): _scale_decision now accepts a can_scale_down keyword-only boolean parameter. Scale-down is returned only when both num_workers > 1 and can_scale_down are true; otherwise, no_change is returned after scale-up handling.
  • Consolidation-Aware Prefill Scaling (components/src/dynamo/planner/core/load_scaling.py): the prefill decision computes the consolidation factor, re-predicts TTFT with queued tokens inflated by N/(N-1), and gates scale-down when post-consolidation TTFT would violate the sensitivity-adjusted SLA.
  • Consolidation-Aware Decode Scaling (components/src/dynamo/planner/core/load_scaling.py): the decode decision computes per-engine KV utilization from (scheduled + queued) KV over max_kv, inflates it by the consolidation factor, and blocks scale-down when utilization would exceed the sensitivity threshold.
  • Consolidation-Aware Agg-Prefill Scaling (components/src/dynamo/planner/core/load_scaling.py): the agg-prefill decision combines queued prefill and decode KV, inflates by the consolidation factor, and disables scale-down when any engine's post-consolidation TTFT breaks the sensitivity-adjusted bound.
  • Consolidation-Aware Agg-Decode Scaling (components/src/dynamo/planner/core/load_scaling.py): the agg-decode decision combines scheduled/queued decode KV and queued prefill KV, normalizes by max_kv, inflates by the consolidation factor, and blocks scale-down when pressure exceeds sensitivity.
  • Consolidation-Aware Scale-Down Tests (components/src/dynamo/planner/tests/unit/test_state_machine.py): adds decode consolidation regression training and multi-worker KV utilization assertions; adds prefill TTFT regression training and validates post-consolidation TTFT vs SLA bounds across queue sizes, verifying consolidation affects only queue terms.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 5 passed

  • Docstring Coverage: Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.
  • Linked Issues check: Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Passed. Check skipped because no linked issues were found for this pull request.
  • Title check: Passed. The title accurately describes the main fix: making SLA-mode scale-down consolidation-aware. It is concise, clear, and directly reflects the core change in the PR.
  • Description check: Passed. The description is comprehensive and well-structured: it explains the problem, the per-component fixes, and the behaviour change, and confirms the test plan. The template's required sections (Overview, Details, Where to start, Related Issues) are not explicitly matched, however.




@coderabbitai (Bot) left a comment

Actionable comments posted: 2


Inline comments:
In `@components/src/dynamo/planner/core/load_scaling.py`:
- Around line 398-401: Collapse the multi-line ternary expressions into
single-line ternaries for all occurrences (e.g., the consolidation assignment in
load_scaling.py where consolidation is set, and the can_scale_down assignments)
so they conform to Ruff's formatting rules, and replace any ambiguous
multiplication sign `×` (U+00D7) in comments/docstrings with a plain asterisk
`*` (notably around the comments/docstrings near the can_scale_down logic and
later docstrings). After making these edits, run `ruff format` and `ruff check
--fix` locally to ensure all remaining style fixes are applied before
committing.
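
For illustration, the kind of collapse that comment asks for (a hypothetical snippet, not the actual diff):

```python
num_workers = 2

# Before: multi-line ternary flagged by the review.
consolidation = (
    num_workers / (num_workers - 1)
    if num_workers > 1
    else 1.0
)

# After: single-line ternary, and ASCII `*` (not U+00D7) in nearby comments.
consolidation = num_workers / (num_workers - 1) if num_workers > 1 else 1.0
```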

In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 367-369: The test function signature for _tick is currently broken
across multiple lines and Black expects a single-line signature, and several
docstrings contain the Unicode multiplication sign “×” which Ruff flags as
RUF002; update both _tick definitions (function name _tick) to use a single-line
signature (e.g., def _tick(self, *, num_workers: int, sched_kv_per_worker: int)
-> TickInput:) and replace every “×” in the nearby docstrings (the occurrences
you flagged) with the ASCII letter "x" (or an explicit "times" if clearer), then
run ruff format and ruff check --fix (or the repo pre-commit) on the touched
files to apply and verify formatting fixes.

ℹ️ Review info

Configuration used: .coderabbit.yaml (review profile: CHILL)
📥 Commits

Reviewing files that changed from the base of the PR and between b8f85c2 and 19abff5.

📒 Files selected for processing (2)
  • components/src/dynamo/planner/core/load_scaling.py
  • components/src/dynamo/planner/tests/unit/test_state_machine.py

Refines the prefill / agg-prefill scale-down check so the
``load_scaling_down_sensitivity`` margin applies to the queue-induced
portion of TTFT, not the new request's own forward-pass time. Without
this, sensitivity eats into the unavoidable own-compute budget and
over-penalises scale-down when ``T_own`` is a meaningful fraction of
the SLA.

Replaces:
  TTFT(queued × N/(N-1))                  < SLA × sensitivity
with:
  TTFT(queued × N/(N-1)) - T_own_post     < (SLA - T_own_post) × sensitivity

where ``T_own_post`` is the regression's prediction at queue=0 (and
``decode_kv × N/(N-1)`` for agg, since the survivor absorbs decode
state too). When ``T_own_post >= SLA`` the queue budget is non-positive
and scale-down is refused -- own compute alone already misses SLA, so
losing a worker can only worsen contention.
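
In code form, the new gate looks roughly like this (a sketch; ``predict_ttft``
stands in for the trained TTFT regression, and for agg its queue=0 prediction
would already reflect decode_kv * N/(N-1)):

```python
def prefill_can_scale_down(queued_tokens: float, n: int, ttft_sla_ms: float,
                           sensitivity: float, predict_ttft) -> bool:
    """Queue-budget form: sensitivity applies only to the queue-induced
    share of TTFT, never to the request's own forward-pass time."""
    if n <= 1:
        return False
    t_own_post = predict_ttft(0.0)            # prediction at queue = 0
    queue_budget = ttft_sla_ms - t_own_post   # SLA headroom left for queueing
    if queue_budget <= 0:
        return False                          # own compute alone misses SLA
    queue_delay = predict_ttft(queued_tokens * n / (n - 1)) - t_own_post
    return queue_delay < queue_budget * sensitivity
```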

Adds two tests: one where the queue-budget check permits a scale-down
that the previous output-side check would have refused, and one where
own compute alone exceeds SLA. Existing tests updated so per-tick FPM
``wall_time`` matches the trained regression's slope (otherwise the
refit on each tick rejects the model with negative-coefficient
warnings, masking the new code path).

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Replace U+00D7 MULTIPLICATION SIGN (x), U+2248 ALMOST EQUAL TO (~=),
U+2192 RIGHTWARDS ARROW (->), and U+2264 LESS-THAN OR EQUAL (<=) in
comments and docstrings with their ASCII equivalents on lines I
introduced. Leaves pre-existing occurrences elsewhere alone.

No behaviour change. Ruff RUF002/RUF003 clean.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Closes a hole in the agg dispatcher where ``_agg_prefill_scaling``
returning ``None`` because of an active consolidation safety refusal
was conflated with "no prefill signal", causing ``_advance_load_agg``
to fall through to its line-327 fallback and grant decode-only
scale-down. Same shape as the original 2 <-> 1 oscillation, just on
the prefill-saturation side instead of decode.

Concrete trigger: 2 agg workers with decode_kv * N/(N-1) pushing
``T_own_post`` over the TTFT SLA, while decode util * N/(N-1) is
still under sensitivity. Pre-fix dispatcher silently overrides the
prefill veto.

Fix:

- ``_agg_prefill_scaling`` and ``_agg_decode_scaling`` now return
  ``num_workers`` (not ``None``) when the consolidation check
  actively refused scale-down, so the dispatcher distinguishes
  "stay at current count" from "no signal".
- ``_advance_load_agg`` captures ``p_refused`` immediately after
  the prefill sub-call (before the decode call can overwrite the
  shared diag field) and surfaces ``scale_down_refused_consolidation``
  as the aggregate reason instead of the generic "no_change".
- All four sub-decisions (prefill, decode, agg-prefill, agg-decode)
  stamp ``_diag_load_reason = "scale_down_refused_consolidation"``
  when they refuse, giving operators a distinct signal in
  diagnostics.
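
A toy sketch of the contract change (names and signatures are illustrative,
not the real dispatcher):

```python
from typing import Optional

def _agg_prefill_scaling_toy(num_workers: int, wants_scale_down: bool,
                             consolidation_safe: bool) -> Optional[int]:
    """None means 'no prefill signal'; the current count means 'active veto'."""
    if not wants_scale_down:
        return None
    if not consolidation_safe:
        return num_workers           # pin current count instead of None
    return num_workers - 1

# Dispatcher side: capture the refusal immediately after the prefill
# sub-call, before any decode sub-call can overwrite shared diag state.
p = _agg_prefill_scaling_toy(2, wants_scale_down=True, consolidation_safe=False)
p_refused = p == 2
reason = "scale_down_refused_consolidation" if p_refused else "no_change"
assert reason == "scale_down_refused_consolidation"
```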

Adds two regression tests in ``TestAggConsolidationAwareScaleDown``:
one that fails without the fix (decode-only scale-down would have
broken prefill SLA), and one sanity test confirming agg still
scales down when both sides are safe. Also asserts the new
diagnostic reason surfaces correctly. All 371 planner unit tests
pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Previous version of the consolidation-aware decode check used cache
utilisation ``(sched_kv + queued_kv) / max_kv_tokens`` as the saturation
proxy. That under-protects the common case where the engine becomes
SLA-bound (latency > target) at a KV level far below cache capacity --
e.g. the customer's regression hits SLA at ~309K KV while their cache
holds 400K+. With ``max_kv`` as the denominator and default sensitivity
0.8, scale-down would still fire at a state that breaches SLA after
consolidation.

Replace the input-side util check with two safety gates evaluated at
the survivor's post-consolidation kv (``(sched + queued) * N/(N-1)``):

  1. **Cache feasibility** (when ``max_kv_tokens`` is advertised):
     refuse if ``post_kv >= max_kv``. Block eviction / queueing past
     the cache is non-linear and outside the regression's training
     domain, so we hard-fail.
  2. **SLA check** via regression: predict ITL at post_kv via
     ``estimate_next_itl`` and refuse if the prediction crosses
     ``ITL_SLA * sensitivity``. The regression carries the SLA-bound
     capacity directly; this works regardless of cache size.

Mirrored in ``_agg_decode_scaling`` against combined cache pressure
(decode_kv + queued_prefill, since queued prefill becomes decode KV).
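
A sketch of the two gates under those assumptions (``estimate_next_itl``
stands in for the trained regression, kv tokens to milliseconds; the
signature is illustrative):

```python
from typing import Callable, Optional

def decode_can_scale_down(sched_kv: int, queued_kv: int, n: int,
                          itl_sla_ms: float, sensitivity: float,
                          estimate_next_itl: Callable[[float], float],
                          max_kv_tokens: Optional[int] = None) -> bool:
    if n <= 1:
        return False
    post_kv = (sched_kv + queued_kv) * n / (n - 1)   # survivor's kv load
    # Gate 1: cache feasibility. Past max_kv the engine evicts or queues,
    # which is non-linear and outside the regression's training domain.
    if max_kv_tokens is not None and post_kv >= max_kv_tokens:
        return False
    # Gate 2: SLA check via the regression, with the operator margin.
    return estimate_next_itl(post_kv) < itl_sla_ms * sensitivity
```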

Customer scenario verification (their regression: ITL = 7.15e-5*kv +
17.89, SLA=40, sensitivity=0.8):
  * T1 transient (each worker 70K, total 140K): post_kv=140K, ITL=28ms
    < 32ms threshold -> ALLOW (correct: 1 worker handles 140K at SLA).
  * Steady state at 2 workers under post-growth load (each 177K, total
    355K): post_kv=354K, ITL=43.2ms >= 32ms -> REFUSE. Cycle breaks.
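
Both data points check out against the stated regression (quick arithmetic
only, not planner code):

```python
def itl(kv: float) -> float:
    """Customer's fitted ITL regression, milliseconds."""
    return 7.15e-5 * kv + 17.89

threshold = 40 * 0.8                  # ITL_SLA * sensitivity = 32 ms
assert itl(2 * 70_000) < threshold    # post_kv = 140K -> ~27.9 ms -> ALLOW
assert itl(2 * 177_000) >= threshold  # post_kv = 354K -> ~43.2 ms -> REFUSE
```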

Tests in ``TestDecodeConsolidationAwareScaleDown`` updated to exercise
the new semantics: post-consolidation within/breaches SLA, exceeds
max_kv, missing max_kv falls through to SLA check. All 370 planner
unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>