fix(planner): make SLA-mode scale-down consolidation-aware again#9294
The flat ``load_scaling_down_sensitivity * SLA`` threshold did not account for the consolidation factor: when scaling N -> N-1, each survivor absorbs N/(N-1) as much load. Under the default sensitivity=80%, the planner authorised 2 -> 1 scale-downs whenever each worker was under 80% of SLA, but the surviving worker would immediately exceed SLA after consolidation -- causing the 2 <-> 1 oscillation seen in QA.

Per-component fixes:

- prefill / agg-prefill: re-predict TTFT with ``queued_prefill_tokens * N/(N-1)`` (and ``current_decode_kv * N/(N-1)`` for agg). Only the queue input is scaled; the regression's internal ``avg_isl`` (the new request's own compute) stays put, so we do not inflate the forward-pass time that does not actually shrink with more workers.
- decode / agg-decode: switch to input-side check on per-worker KV utilisation. Refuse scale-down if ``util * N/(N-1) >= sensitivity``.

The ``load_scaling_down_sensitivity`` knob remains as the operator-tunable safety margin on top of the consolidation prediction.

8 new unit tests cover the production scenario (N=2 at 50% util refuses scale-down) plus the high-N regime (N=10 permits at 70% but refuses at 78%). All 352 planner unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
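The decode-side gate described in the commit can be sketched as follows. This is a minimal illustration under assumed names (``can_scale_down_decode`` and its parameters are invented for the sketch), not the planner's actual code; only the ``util * N/(N-1) >= sensitivity`` refusal rule comes from the commit text.

```python
def can_scale_down_decode(num_workers: int, util: float, sensitivity: float) -> bool:
    """Refuse N -> N-1 if the survivors' projected utilisation crosses the margin."""
    if num_workers <= 1:
        return False  # nothing to consolidate
    # Each of the N-1 survivors absorbs N/(N-1) of its current per-worker load.
    projected_util = util * num_workers / (num_workers - 1)
    return projected_util < sensitivity

# Production scenario: N=2 at 50% util -> projected 100% >= 80% -> refuse.
print(can_scale_down_decode(2, 0.50, 0.80))   # False
# High-N regime: N=10 at 70% -> projected ~77.8% < 80% -> permit.
print(can_scale_down_decode(10, 0.70, 0.80))  # True
# N=10 at 78% -> projected ~86.7% >= 80% -> refuse.
print(can_scale_down_decode(10, 0.78, 0.80))  # False
```

Note how the high-N cases fall out of the same formula: as N grows, N/(N-1) approaches 1, so the gate converges to the plain sensitivity threshold.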
Walkthrough: This PR introduces consolidation-aware scale-down gating across four scaling decision functions.

Changes: Consolidation-Aware Load Scaling
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ✅ 5 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/src/dynamo/planner/core/load_scaling.py`:
- Around line 398-401: Collapse the multi-line ternary expressions into
single-line ternaries for all occurrences (e.g., the consolidation assignment in
load_scaling.py where consolidation is set, and the can_scale_down assignments)
so they conform to Ruff's formatting rules, and replace any ambiguous
multiplication sign `×` (U+00D7) in comments/docstrings with a plain asterisk
`*` (notably around the comments/docstrings near the can_scale_down logic and
later docstrings). After making these edits, run `ruff format` and `ruff check
--fix` locally to ensure all remaining style fixes are applied before
committing.
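The suggested fix can be illustrated with a minimal before/after, assuming the ``consolidation`` assignment the comment names computes the N/(N-1) factor (the body shown is illustrative):

```python
num_workers = 2

# Before (multi-line ternary that `ruff format` would collapse):
# consolidation = (
#     num_workers / (num_workers - 1)
#     if num_workers > 1
#     else 1.0
# )

# After: single-line ternary; the comment uses a plain '*' rather than
# the ambiguous U+00D7 sign that RUF002 flags.
# Survivor absorbs N/(N-1) * its current load.
consolidation = num_workers / (num_workers - 1) if num_workers > 1 else 1.0
print(consolidation)  # 2.0
```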
In `@components/src/dynamo/planner/tests/unit/test_state_machine.py`:
- Around line 367-369: The test function signature for _tick is currently broken
across multiple lines and Black expects a single-line signature, and several
docstrings contain the Unicode multiplication sign “×” which Ruff flags as
RUF002; update both _tick definitions (function name _tick) to use a single-line
signature (e.g., def _tick(self, *, num_workers: int, sched_kv_per_worker: int)
-> TickInput:) and replace every “×” in the nearby docstrings (the occurrences
you flagged) with the ASCII letter "x" (or an explicit "times" if clearer), then
run ruff format and ruff check --fix (or the repo pre-commit) on the touched
files to apply and verify formatting fixes.
📒 Files selected for processing (2)
- components/src/dynamo/planner/core/load_scaling.py
- components/src/dynamo/planner/tests/unit/test_state_machine.py
Refines the prefill / agg-prefill scale-down check so the ``load_scaling_down_sensitivity`` margin applies to the queue-induced portion of TTFT, not the new request's own forward-pass time. Without this, sensitivity eats into the unavoidable own-compute budget and over-penalises scale-down when ``T_own`` is a meaningful fraction of the SLA.

Replaces:

    TTFT(queued * N/(N-1)) < SLA * sensitivity

with:

    TTFT(queued * N/(N-1)) - T_own_post < (SLA - T_own_post) * sensitivity

where ``T_own_post`` is the regression's prediction at queue=0 (and ``decode_kv * N/(N-1)`` for agg, since the survivor absorbs decode state too). When ``T_own_post >= SLA`` the queue budget is non-positive and scale-down is refused -- own compute alone already misses SLA, so losing a worker can only worsen contention.

Adds two tests: one where the queue-budget check permits a scale-down that the previous output-side check would have refused, and one where own compute alone exceeds SLA. Existing tests updated so per-tick FPM ``wall_time`` matches the trained regression's slope (otherwise the refit on each tick rejects the model with negative-coefficient warnings, masking the new code path).

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
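The refined inequality can be sketched with a linear stand-in for the trained TTFT regression. The coefficients and helper names below are assumptions for illustration; only the queue-budget inequality itself comes from the commit.

```python
def predict_ttft(queued_tokens: float, decode_kv: float) -> float:
    # Stand-in for the trained regression: a queue term, a decode-KV term,
    # and the new request's own forward-pass time (the constant).
    return 2e-4 * queued_tokens + 1e-5 * decode_kv + 120.0  # ms

def prefill_can_scale_down(n: int, queued: float, decode_kv: float,
                           sla_ms: float, sensitivity: float) -> bool:
    if n <= 1:
        return False
    scale = n / (n - 1)
    # T_own_post: prediction at queue=0, with the survivor's consolidated KV.
    t_own_post = predict_ttft(0.0, decode_kv * scale)
    if t_own_post >= sla_ms:
        return False  # queue budget non-positive: own compute alone misses SLA
    queue_budget = (sla_ms - t_own_post) * sensitivity
    queue_cost = predict_ttft(queued * scale, decode_kv * scale) - t_own_post
    return queue_cost < queue_budget

# Modest queue at N=2: queue-induced TTFT stays well inside the scaled budget.
print(prefill_can_scale_down(2, 100_000, 0.0, 500.0, 0.8))    # True
# Heavy queue: the survivor's queue-induced TTFT blows the budget.
print(prefill_can_scale_down(2, 2_000_000, 0.0, 500.0, 0.8))  # False
```

Because sensitivity multiplies only ``SLA - T_own_post``, the margin no longer eats into the own-compute portion of the budget, which is the over-penalisation the commit removes.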
Replace U+00D7 MULTIPLICATION SIGN (x), U+2248 ALMOST EQUAL TO (~=), U+2192 RIGHTWARDS ARROW (->), and U+2264 LESS-THAN OR EQUAL (<=) in comments and docstrings with their ASCII equivalents on lines I introduced. Leaves pre-existing occurrences elsewhere alone. No behaviour change. Ruff RUF002/RUF003 clean. Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Closes a hole in the agg dispatcher where ``_agg_prefill_scaling`` returning ``None`` because of an active consolidation safety refusal was conflated with "no prefill signal", causing ``_advance_load_agg`` to fall through to its line-327 fallback and grant decode-only scale-down. Same shape as the original 2 <-> 1 oscillation, just on the prefill-saturation side instead of decode.

Concrete trigger: 2 agg workers with decode_kv * N/(N-1) pushing ``T_own_post`` over the TTFT SLA, while decode util * N/(N-1) is still under sensitivity. Pre-fix, the dispatcher silently overrides the prefill veto.

Fix:

- ``_agg_prefill_scaling`` and ``_agg_decode_scaling`` now return ``num_workers`` (not ``None``) when the consolidation check actively refused scale-down, so the dispatcher distinguishes "stay at current count" from "no signal".
- ``_advance_load_agg`` captures ``p_refused`` immediately after the prefill sub-call (before the decode call can overwrite the shared diag field) and surfaces ``scale_down_refused_consolidation`` as the aggregate reason instead of the generic "no_change".
- All four sub-decisions (prefill, decode, agg-prefill, agg-decode) stamp ``_diag_load_reason = "scale_down_refused_consolidation"`` when they refuse, giving operators a distinct signal in diagnostics.

Adds two regression tests in ``TestAggConsolidationAwareScaleDown``: one that fails without the fix (decode-only scale-down would have broken prefill SLA), and one sanity test confirming agg still scales down when both sides are safe. Also asserts the new diagnostic reason surfaces correctly. All 371 planner unit tests pass.

Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
Previous version of the consolidation-aware decode check used cache
utilisation ``(sched_kv + queued_kv) / max_kv_tokens`` as the saturation
proxy. That under-protects the common case where the engine becomes
SLA-bound (latency > target) at a KV level far below cache capacity --
e.g. the customer's regression hits SLA at ~309K KV while their cache
holds 400K+. With ``max_kv`` as the denominator and default sensitivity
0.8, scale-down would still fire at a state that breaches SLA after
consolidation.
Replace the input-side util check with two safety gates evaluated at
the survivor's post-consolidation kv (``(sched + queued) * N/(N-1)``):
1. **Cache feasibility** (when ``max_kv_tokens`` is advertised):
refuse if ``post_kv >= max_kv``. Block eviction / queueing past
the cache is non-linear and outside the regression's training
domain, so we hard-fail.
2. **SLA check** via regression: predict ITL at post_kv via
``estimate_next_itl`` and refuse if the prediction crosses
``ITL_SLA * sensitivity``. The regression carries the SLA-bound
capacity directly; this works regardless of cache size.
Mirrored in ``_agg_decode_scaling`` against combined cache pressure
(decode_kv + queued_prefill, since queued prefill becomes decode KV).
Customer scenario verification (their regression: ITL = 7.15e-5*kv +
17.89, SLA=40, sensitivity=0.8):
* T1 transient (each worker 70K, total 140K): post_kv=140K, ITL=28ms
< 32ms threshold -> ALLOW (correct: 1 worker handles 140K at SLA).
* Steady state at 2 workers under post-growth load (each 177K, total
355K): post_kv=354K, ITL=43.2ms >= 32ms -> REFUSE. Cycle breaks.
Tests in ``TestDecodeConsolidationAwareScaleDown`` updated to exercise
the new semantics: post-consolidation within/breaches SLA, exceeds
max_kv, missing max_kv falls through to SLA check. All 370 planner
unit tests pass.
Signed-off-by: hongkuanz <hongkuanz@nvidia.com>
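The two gates from this commit can be sketched as follows. The ITL regression below uses the customer coefficients quoted in the commit (ITL = 7.15e-5*kv + 17.89 ms); the function names and the per-worker KV inputs are illustrative assumptions, not the planner's real signatures.

```python
from typing import Optional

def estimate_next_itl(kv_tokens: float) -> float:
    # Customer regression quoted above; slope/intercept taken from the commit.
    return 7.15e-5 * kv_tokens + 17.89  # ms

def decode_can_scale_down(n: int, sched_kv: float, queued_kv: float,
                          max_kv: Optional[int],
                          itl_sla_ms: float, sensitivity: float) -> bool:
    if n <= 1:
        return False
    # Survivor's post-consolidation KV load (per-worker inputs).
    post_kv = (sched_kv + queued_kv) * n / (n - 1)
    # Gate 1: cache feasibility. Block eviction / queueing past the cache is
    # non-linear and outside the regression's training domain -> hard-fail.
    if max_kv is not None and post_kv >= max_kv:
        return False
    # Gate 2: SLA check via the regression at post_kv.
    return estimate_next_itl(post_kv) < itl_sla_ms * sensitivity

# T1 transient: 2 workers at 70K each -> post_kv=140K, ITL~=27.9ms < 32ms -> ALLOW
print(decode_can_scale_down(2, 70_000, 0, 400_000, 40.0, 0.8))   # True
# Steady state: 2 workers at ~177K each -> post_kv~=354K, ITL~=43.2ms >= 32ms -> REFUSE
print(decode_can_scale_down(2, 177_000, 0, 400_000, 40.0, 0.8))  # False
```

With these numbers the SLA gate refuses well before the 400K cache fills, which is exactly the under-protected case the old ``util / max_kv`` check missed.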
Summary

The flat ``load_scaling_down_sensitivity × SLA`` scale-down threshold did not account for consolidation: when scaling N → N-1, each survivor absorbs ``N/(N-1)`` as much load. With the default 80% sensitivity, the planner authorised 2 → 1 scale-downs whenever each worker was under 80% of SLA -- but the surviving worker would immediately blow SLA, causing 2 ↔ 1 oscillation under constant load.

Per-component fix (SLA mode only; easy mode unchanged):

- prefill / agg-prefill: re-predict TTFT with ``queued_prefill_tokens × N/(N-1)`` (and ``current_decode_kv × N/(N-1)`` for agg). Only the queue input is scaled; the regression's internal ``avg_isl`` (the new request's own compute) is left alone, so we don't inflate forward-pass time that doesn't shrink with more workers.
- decode / agg-decode: input-side check on per-worker KV utilisation; refuse scale-down if ``util × N/(N-1) >= sensitivity``.
- ``load_scaling_down_sensitivity`` remains as the operator-tunable safety margin on top of the consolidation prediction. Default behaviour for high N is essentially unchanged (``(N-1)/N → 1`` as N grows); the fix targets the low-N regime where the production oscillation occurs.

Behaviour at production T1 (2 workers, decode kv util ≈ 0.7 each)
Before: per-worker ITL 23 ms < threshold (32 ms) → scale down → survivor blows SLA → scale up.
After: util ``0.7 × 2 = 1.4 >= 0.8`` → refuse scale-down. Oscillation gone.

Test plan

- test_state_machine.py:
  - TestDecodeConsolidationAwareScaleDown: production scenario (N=2 at 50% util refuses), N=2 at 30% permits, N=10 at 70%/78%, missing-``max_kv`` fallback
  - TestPrefillConsolidationAwareScaleDown: high queue at N=2 refuses, empty queue permits, queue-only inflation lets N=10 still scale down
- ``pytest-marker-report`` failure is pre-existing on origin/main due to an unconditional ``kvbm.trtllm_integration`` import

🤖 Generated with Claude Code