
Conversation

@LukeAVanDrie (Contributor) commented Dec 4, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR enables Scale-from-Zero support in the Endpoint Picker (EPP).

Previously, the Director eagerly resolved candidate pods before Admission. If the pool was scaled to zero (or no subsets matched), the request was immediately rejected with a 503 Service Unavailable, preventing the Flow Control layer from queuing the request while the autoscaler reacted.

Key Changes

  1. Inverted Control Flow:
    The Director now attempts Admission (Queueing) before Resolution (Finding Pods).

    • Before: Resolve Pods → Check Admission → Schedule.
    • After: Check Admission → Resolve Pods → Schedule.
  2. Lazy Resolution (Flow Control):
    The FlowControlAdmissionController no longer requires a list of candidate pods. It enqueues the request (carrying only Metadata). The ShardProcessor then uses the PodLocator to resolve candidate pods Just-In-Time during the dispatch loop (see the sketch after this list).

    • If 0 pods are found during dispatch, the system is considered "Saturated", enforcing Head-of-Line (HoL) blocking until pods appear or the request TTL expires.
  3. Legacy Path Preservation:
    The LegacyAdmissionController (used when Flow Control is disabled) retains the need for eager resolution to perform immediate shedding. It has been updated to use the PodLocator internally to resolve pods within the Admit call.

  4. Safety Checks:
    The Director retains an explicit check after resolution: if a request passes admission but still resolves to 0 pods (e.g., non-queued traffic, or race conditions), it fails with 503 Service Unavailable.
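
A minimal Go sketch of the inverted ordering described in the list above (all type names, method signatures, and the error handling here are illustrative stand-ins, not the actual interfaces in this repo):

```go
package directorsketch

import (
	"context"
	"errors"
	"fmt"
)

// Pod and Request are simplified stand-ins for the real EPP types.
type Pod struct{ Name string }

type Request struct {
	ID       string
	Metadata map[string]any
}

// PodLocator resolves candidate pods from opaque request metadata. This is the
// role the ShardProcessor relies on for just-in-time resolution.
type PodLocator interface {
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

// AdmissionController decides whether a request may proceed, possibly after
// queueing it. Note that it no longer receives a pre-resolved pod list.
type AdmissionController interface {
	Admit(ctx context.Context, req *Request) error
}

type Director struct {
	admission AdmissionController
	locator   PodLocator
}

// HandleRequest shows the inverted flow: admission first, resolution second.
func (d *Director) HandleRequest(ctx context.Context, req *Request) ([]Pod, error) {
	// 1. Admission. With Flow Control enabled this may block while the request
	//    sits in a queue waiting for capacity.
	if err := d.admission.Admit(ctx, req); err != nil {
		return nil, fmt.Errorf("request rejected at admission: %w", err)
	}

	// 2. Resolution, after admission. The safety check still rejects requests
	//    that were admitted but resolve to zero pods (non-queued traffic, races).
	pods := d.locator.Locate(ctx, req.Metadata)
	if len(pods) == 0 {
		return nil, errors.New("503 Service Unavailable: no candidate pods after admission")
	}

	// 3. Scheduling against the resolved candidates would happen here.
	return pods, nil
}
```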

Which issue(s) this PR fixes:
Tracks #1800 -- not marking as fixed until we have sufficient validation of this use case (@lionelvillard FYI).

Does this PR introduce a user-facing change?:

The Endpoint Picker now supports Scale-from-Zero with Flow Control enabled. Requests targeting a pool with no available backends will be queued in the Flow Control layer (up to their timeout) instead of being immediately rejected, allowing time for backends to scale up.

@k8s-ci-robot added the do-not-merge/work-in-progress and kind/feature labels Dec 4, 2025
netlify bot commented Dec 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 72c3ae1
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/693773634a790d00089dfe7d
😎 Deploy Preview: https://deploy-preview-1952--gateway-api-inference-extension.netlify.app

@k8s-ci-robot added the needs-ok-to-test label Dec 4, 2025
@k8s-ci-robot (Contributor) commented:

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XXL and cncf-cla: yes labels Dec 4, 2025
@LukeAVanDrie (Contributor Author):

/hold

Awaiting diffbase to be merged.

@k8s-ci-robot added the do-not-merge/hold label Dec 4, 2025
// GetMetadata returns the opaque metadata associated with the request (e.g., header-derived context, subset filters).
// This data is passed transparently to components like the contracts.PodLocator to resolve resources (candidate pods)
// lazily during the dispatch cycle.
GetMetadata() map[string]any
@LukeAVanDrie (Contributor Author) commented on the snippet above:

@lioraron This incidentally provides the plumbing necessary for #1863. Let me know if this is what you were expecting. This is populated from reqCtx.Request.Metadata.

IntraFlowDispatchPolicy.SelectItem gets access to QueueItemAccessor which exposes OriginalRequest() FlowControlRequest. Now your custom plugin impl can extract whatever it wants from here.

Is this sufficient to resolve #1863 or are you also seeking an extension point to intercept and augment reqCtx.Request.Metadata with additional key-value pairs?
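
As a purely illustrative example, a custom policy could read the metadata roughly like this (the interface shapes below are my paraphrase of the thread, not the exact flowcontrol package signatures, and the "x-tenant-tier" key is hypothetical):

```go
package policysketch

// FlowControlRequest is a trimmed stand-in for the real interface; per this PR
// it exposes the opaque, header-derived request metadata.
type FlowControlRequest interface {
	GetMetadata() map[string]any
}

// QueueItemAccessor gives dispatch policies access to the queued request.
type QueueItemAccessor interface {
	OriginalRequest() FlowControlRequest
}

// selectItem is a toy selection rule: prefer the first queued item whose
// metadata marks it as "premium"; otherwise fall back to the head of the queue.
func selectItem(items []QueueItemAccessor) QueueItemAccessor {
	for _, item := range items {
		md := item.OriginalRequest().GetMetadata()
		if tier, ok := md["x-tenant-tier"].(string); ok && tier == "premium" {
			return item
		}
	}
	if len(items) > 0 {
		return items[0]
	}
	return nil
}
```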

@LukeAVanDrie (Contributor Author), in a follow-up comment on the same snippet:

FYI @kfswain as this is possibly relevant to our offline discussion this morning regarding request TTL sources.

@dumb0002 commented Dec 5, 2025

This PR only enables scale-from-zero for one use case: when flow control is enabled.

@LukeAVanDrie (Contributor Author) replied:

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@dumb0002 commented Dec 5, 2025

> > This PR only enables scale-from-zero for one use case: when flow control is enabled.

> Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@LukeAVanDrie, we have a proposal to also support scale from zero when flow control is disabled. It is described in detail here. Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This new admission plugin and new metric would be included as part of the llm-d inference-scheduler set of existing plugins and metrics.
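
For concreteness, a rough Go sketch of the plugin being proposed (everything here is hypothetical: the plugin name, the pod-count helper, the interface shape, and the un-prefixed metric name; it is not the repo's actual AdmissionController contract):

```go
package admissionsketch

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// scaleZeroWaiting is the hypothetical gauge an autoscaler could watch.
// Registration with a prometheus.Registerer is omitted for brevity.
var scaleZeroWaiting = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "scale_zero_waiting_request_count",
	Help: "Requests held while waiting for backends to scale from zero.",
})

// podCounter abstracts "how many candidate pods exist right now".
type podCounter func(ctx context.Context) int

// scaleFromZeroAdmission holds a request until capacity appears or a deadline passes.
type scaleFromZeroAdmission struct {
	countPods podCounter
	maxWait   time.Duration
	poll      time.Duration
}

// Admit blocks the caller while the pool is empty, emitting the waiting gauge,
// and releases the request as soon as at least one pod shows up.
func (a *scaleFromZeroAdmission) Admit(ctx context.Context) error {
	if a.countPods(ctx) > 0 {
		return nil // capacity already online, admit immediately
	}
	scaleZeroWaiting.Inc()
	defer scaleZeroWaiting.Dec()

	deadline := time.After(a.maxWait)
	ticker := time.NewTicker(a.poll)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-deadline:
			return errors.New("no backends became available before the wait timeout")
		case <-ticker.C:
			if a.countPods(ctx) > 0 {
				return nil // pods appeared; let the request proceed downstream
			}
		}
	}
}
```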

@LukeAVanDrie (Contributor Author) commented Dec 5, 2025

> Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

@dumb0002 commented Dec 5, 2025

> > Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

> This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

> This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled but scale-from-zero is still supported.

@LukeAVanDrie (Contributor Author) commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?


Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

@dumb0002 commented Dec 5, 2025

> > I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

> Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?

> Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

You raised a very important question: do we need to support scale from zero if FC is disabled? I started with the assumption of also providing scale-from-zero support even if FC is disabled. However, I agree that more evaluation of this scenario is indeed needed. Sounds good, updating the release note to highlight the FC focus will help avoid any confusion. Thanks!

@k8s-ci-robot added the needs-rebase label Dec 6, 2025
@LukeAVanDrie force-pushed the feat/scale-from-zero-support branch from 8191661 to 55180b3 on December 8, 2025 at 18:46
@k8s-ci-robot added the size/L label and removed the needs-rebase and size/XXL labels Dec 8, 2025
@LukeAVanDrie (Contributor Author):

/remove-hold

@LukeAVanDrie marked this pull request as ready for review on December 8, 2025 at 18:50
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 8, 2025
@LukeAVanDrie changed the title from "feat: Enable Scale-from-Zero" to "feat: Enable Scale-from-Zero with Flow Control enabled" Dec 8, 2025
@LukeAVanDrie (Contributor Author):

/assign @ahg-g

@ahg-g (Contributor) commented Dec 8, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Dec 8, 2025
This defines the contract for resolving candidate pods based on request metadata, decoupling the resolution logic from the storage layer.

Refactors the Director to use the injected PodLocator interface instead of the private getCandidatePodsForScheduling method. This prepares the Director for lazy resolution without changing current behavior.

Updates the FlowControlRequest interface to carry request metadata instead of a pre-resolved list of candidate pods. This prepares the system for lazy pod resolution.

- Adds GetMetadata() to FlowControlRequest.
- Removes CandidatePodsForScheduling() from FlowControlRequest.
- Updates mocks in flowcontrol/types and contracts.

Refactors the request processing flow to support queuing when no backends are available.

- Inverts Director flow: Admission is now called before Pod Resolution.
- Updates AdmissionController interface to remove eager pod list.
- LegacyAdmissionController now resolves pods internally via PodLocator.
- ShardProcessor (Flow Control) now resolves pods lazily via PodLocator during the dispatch cycle.
- Updates Runner wiring to inject PodLocator where needed.
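
To make the last commit message above concrete, a simplified sketch of just-in-time resolution inside a dispatch cycle, including the head-of-line blocking behavior when zero pods resolve (ShardProcessor, PodLocator, and the queue shape here are stand-ins, not the actual flowcontrol types):

```go
package dispatchsketch

import "context"

type Pod struct{ Name string }

// PodLocator mirrors the role described in the commits: resolve candidate pods
// from request metadata at dispatch time rather than at enqueue time.
type PodLocator interface {
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

type queuedRequest struct {
	Metadata map[string]any
	Expired  func() bool // true once the request TTL has elapsed
}

type shardProcessor struct {
	locator PodLocator
	queue   []*queuedRequest
}

// dispatchOnce drains as much of the queue as current capacity allows.
func (p *shardProcessor) dispatchOnce(ctx context.Context, schedule func(*queuedRequest, []Pod)) {
	for len(p.queue) > 0 {
		head := p.queue[0]
		if head.Expired() {
			p.queue = p.queue[1:] // drop requests whose TTL expired while queued
			continue
		}
		pods := p.locator.Locate(ctx, head.Metadata)
		if len(pods) == 0 {
			// Zero pods means the shard is treated as saturated: keep the head
			// queued and stop dispatching (head-of-line blocking) until pods
			// appear or the TTL expires.
			return
		}
		p.queue = p.queue[1:]
		schedule(head, pods)
	}
}
```
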
@LukeAVanDrie force-pushed the feat/scale-from-zero-support branch from 55180b3 to 72c3ae1 on December 9, 2025 at 00:54
@ahg-g (Contributor) commented Dec 9, 2025

/lgtm
/approve

@k8s-ci-robot added the lgtm label Dec 9, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 9, 2025
@LukeAVanDrie (Contributor Author):

/unhold

@k8s-ci-robot removed the do-not-merge/hold label Dec 9, 2025
@k8s-ci-robot merged commit 4738bee into kubernetes-sigs:main Dec 9, 2025 (12 checks passed)