
Conversation

@LukeAVanDrie (Contributor) commented Dec 4, 2025

What type of PR is this?
/kind feature

What this PR does / why we need it:
This PR enables Scale-from-Zero support in the Endpoint Picker (EPP).

Previously, the Director eagerly resolved candidate pods before Admission. If the pool was scaled to zero (or no subsets matched), the request was immediately rejected with a 503 Service Unavailable, preventing the Flow Control layer from queuing the request while the autoscaler reacted.

Key Changes

  1. Inverted Control Flow:
    The Director now attempts Admission (Queueing) before Resolution (Finding Pods).

    • Before: Resolve Pods → Check Admission → Schedule.
    • After: Check Admission → Resolve Pods → Schedule.
  2. Lazy Resolution (Flow Control):
    The FlowControlAdmissionController no longer requires a list of candidate pods. It enqueues the request (carrying only Metadata). The ShardProcessor then uses the PodLocator to resolve candidate pods Just-In-Time during the dispatch loop (see the sketch after this list).

    • If 0 pods are found during dispatch, the system is considered "Saturated", enforcing Head-of-Line (HoL) blocking until pods appear or the request TTL expires.
  3. Legacy Path Preservation:
    The LegacyAdmissionController (used when Flow Control is disabled) retains the need for eager resolution to perform immediate shedding. It has been updated to use the PodLocator internally to resolve pods within the Admit call.

  4. Safety Checks:
    The Director retains an explicit check after resolution: if a request passes admission but still resolves to 0 pods (e.g., non-queued traffic, or race conditions), it fails with 503 Service Unavailable.
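
A minimal Go sketch of the inverted ordering described in the list above (all type names, method signatures, and the error handling here are illustrative stand-ins, not the actual interfaces in this repo):

```go
package directorsketch

import (
	"context"
	"errors"
	"fmt"
)

// Pod and Request are simplified stand-ins for the real EPP types.
type Pod struct{ Name string }

type Request struct {
	ID       string
	Metadata map[string]any
}

// PodLocator resolves candidate pods from opaque request metadata. This is the
// role the ShardProcessor relies on for just-in-time resolution.
type PodLocator interface {
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

// AdmissionController decides whether a request may proceed, possibly after
// queueing it. Note that it no longer receives a pre-resolved pod list.
type AdmissionController interface {
	Admit(ctx context.Context, req *Request) error
}

type Director struct {
	admission AdmissionController
	locator   PodLocator
}

// HandleRequest shows the inverted flow: admission first, resolution second.
func (d *Director) HandleRequest(ctx context.Context, req *Request) ([]Pod, error) {
	// 1. Admission. With Flow Control enabled this may block while the request
	//    sits in a queue waiting for capacity.
	if err := d.admission.Admit(ctx, req); err != nil {
		return nil, fmt.Errorf("request rejected at admission: %w", err)
	}

	// 2. Resolution, after admission. The safety check still rejects requests
	//    that were admitted but resolve to zero pods (non-queued traffic, races).
	pods := d.locator.Locate(ctx, req.Metadata)
	if len(pods) == 0 {
		return nil, errors.New("503 Service Unavailable: no candidate pods after admission")
	}

	// 3. Scheduling against the resolved candidates would happen here.
	return pods, nil
}
```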

Which issue(s) this PR fixes:
Tracks #1800 -- not marking as fixed until we have sufficient validation of this use case (@lionelvillard FYI).

Does this PR introduce a user-facing change?:

The Endpoint Picker now supports Scale-from-Zero with Flow Control enabled. Requests targeting a pool with no available backends will be queued in the Flow Control layer (up to their timeout) instead of being immediately rejected, allowing time for backends to scale up.

@k8s-ci-robot added the do-not-merge/work-in-progress and kind/feature labels Dec 4, 2025
netlify bot commented Dec 4, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 72c3ae1
🔍 Latest deploy log: https://app.netlify.com/projects/gateway-api-inference-extension/deploys/693773634a790d00089dfe7d
😎 Deploy Preview: https://deploy-preview-1952--gateway-api-inference-extension.netlify.app

@k8s-ci-robot added the needs-ok-to-test label Dec 4, 2025
@k8s-ci-robot (Contributor) commented:

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XXL and cncf-cla: yes labels Dec 4, 2025
@LukeAVanDrie (Contributor Author):

/hold

Awaiting diffbase to be merged.

@k8s-ci-robot added the do-not-merge/hold label Dec 4, 2025
// GetMetadata returns the opaque metadata associated with the request (e.g., header-derived context, subset filters).
// This data is passed transparently to components like the contracts.PodLocator to resolve resources (candidate pods)
// lazily during the dispatch cycle.
GetMetadata() map[string]any
@LukeAVanDrie (Contributor Author) commented on the snippet above:

@lioraron This incidentally provides the plumbing necessary for #1863. Let me know if this is what you were expecting. This is populated from reqCtx.Request.Metadata.

IntraFlowDispatchPolicy.SelectItem gets access to QueueItemAccessor which exposes OriginalRequest() FlowControlRequest. Now your custom plugin impl can extract whatever it wants from here.

Is this sufficient to resolve #1863 or are you also seeking an extension point to intercept and augment reqCtx.Request.Metadata with additional key-value pairs?
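
As a purely illustrative example, a custom policy could read the metadata roughly like this (the interface shapes below are my paraphrase of the thread, not the exact flowcontrol package signatures, and the "x-tenant-tier" key is hypothetical):

```go
package policysketch

// FlowControlRequest is a trimmed stand-in for the real interface; per this PR
// it exposes the opaque, header-derived request metadata.
type FlowControlRequest interface {
	GetMetadata() map[string]any
}

// QueueItemAccessor gives dispatch policies access to the queued request.
type QueueItemAccessor interface {
	OriginalRequest() FlowControlRequest
}

// selectItem is a toy selection rule: prefer the first queued item whose
// metadata marks it as "premium"; otherwise fall back to the head of the queue.
func selectItem(items []QueueItemAccessor) QueueItemAccessor {
	for _, item := range items {
		md := item.OriginalRequest().GetMetadata()
		if tier, ok := md["x-tenant-tier"].(string); ok && tier == "premium" {
			return item
		}
	}
	if len(items) > 0 {
		return items[0]
	}
	return nil
}
```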

@LukeAVanDrie (Contributor Author), in a follow-up comment on the same snippet:

FYI @kfswain as this is possibly relevant to our offline discussion this morning regarding request TTL sources.

@dumb0002 commented Dec 5, 2025

This PR only enables scale-from-zero for one use case: when flow control is enabled.

@LukeAVanDrie (Contributor Author) replied:

> This PR only enables scale-from-zero for one use case: when flow control is enabled.

Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@dumb0002 commented Dec 5, 2025

> > This PR only enables scale-from-zero for one use case: when flow control is enabled.

> Yes, I can update the title to be clearer about this requirement. Without a buffer (flow control), there is no way to hold requests until capacity is online.

@LukeAVanDrie, we have a proposal to also support scale from zero when flow control is disabled. It is described in detail here. Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This new admission plugin and new metric would be included as part of the llm-d inference-scheduler set of existing plugins and metrics.
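
For concreteness, a rough Go sketch of the plugin being proposed (everything here is hypothetical: the plugin name, the pod-count helper, the interface shape, and the un-prefixed metric name; it is not the repo's actual AdmissionController contract):

```go
package admissionsketch

import (
	"context"
	"errors"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// scaleZeroWaiting is the hypothetical gauge an autoscaler could watch.
// Registration with a prometheus.Registerer is omitted for brevity.
var scaleZeroWaiting = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "scale_zero_waiting_request_count",
	Help: "Requests held while waiting for backends to scale from zero.",
})

// podCounter abstracts "how many candidate pods exist right now".
type podCounter func(ctx context.Context) int

// scaleFromZeroAdmission holds a request until capacity appears or a deadline passes.
type scaleFromZeroAdmission struct {
	countPods podCounter
	maxWait   time.Duration
	poll      time.Duration
}

// Admit blocks the caller while the pool is empty, emitting the waiting gauge,
// and releases the request as soon as at least one pod shows up.
func (a *scaleFromZeroAdmission) Admit(ctx context.Context) error {
	if a.countPods(ctx) > 0 {
		return nil // capacity already online, admit immediately
	}
	scaleZeroWaiting.Inc()
	defer scaleZeroWaiting.Dec()

	deadline := time.After(a.maxWait)
	ticker := time.NewTicker(a.poll)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-deadline:
			return errors.New("no backends became available before the wait timeout")
		case <-ticker.C:
			if a.countPods(ctx) > 0 {
				return nil // pods appeared; let the request proceed downstream
			}
		}
	}
}
```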

@LukeAVanDrie (Contributor Author) commented Dec 5, 2025

> Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

@dumb0002 commented Dec 5, 2025

> > Basically, the idea is to create an admission plugin that holds the request and emits a Prometheus metric (e.g., _scale_zero_waiting_request_count) as a signal to the autoscaler. The plugin then releases the request once capacity is online, or drops it after a pre-defined timeout. However, this proposal would require the admission plugin to be called prior to the computation of the list of candidate pods, or not dropping the request when that list is empty; the latter would probably require another check for an empty pod list in or before the scheduling layer. What are your thoughts on this?

> This is exactly what Flow Control does already. Flow Control is an Admission Control plugin with some more bells and whistles (priority and fairness). I guess I am not understanding why we need to bifurcate here.

> This series of PRs moves candidate resolution to after Admission Control (Flow Control or otherwise).

I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled but scale-from-zero is still supported.

@LukeAVanDrie (Contributor Author) commented Dec 5, 2025

> I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?


Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

@dumb0002 commented Dec 5, 2025

> > I am referring to the admission plugins that run in this part of the code: https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/requestcontrol/director.go#L177-L180. The proposal described above is to handle the scenario where flow control is not enabled.

> Yes, I guess my question is why create a separate plugin that handles request buffering only for the FC disabled path rather than simply enabling FC? What is the value in duplicating here? If FC has gaps that don't satisfy your use case, doesn't it make more sense to prioritize closing those?

> Your approach will work, and this PR already sets the stage for the prerequisites you need. You will just need to implement your plugin externally. I will update the release note to clarify this is only for FC. I am just trying to better understand your use case.

You raised a very important question: do we need to support scale from zero if FC is disabled? I started with the assumption of also providing scale-from-zero support even if FC is disabled. However, I agree that more evaluation of this scenario is indeed needed. Sounds good, updating the release note to highlight the FC focus will help avoid any confusion. Thanks!

@k8s-ci-robot added the needs-rebase label Dec 6, 2025
@LukeAVanDrie force-pushed the feat/scale-from-zero-support branch from 8191661 to 55180b3 on December 8, 2025 at 18:46
@k8s-ci-robot added the size/L label and removed the needs-rebase and size/XXL labels Dec 8, 2025
@LukeAVanDrie (Contributor Author):

/remove-hold

@LukeAVanDrie marked this pull request as ready for review on December 8, 2025 at 18:50
@k8s-ci-robot removed the do-not-merge/work-in-progress label Dec 8, 2025
@LukeAVanDrie changed the title from "feat: Enable Scale-from-Zero" to "feat: Enable Scale-from-Zero with Flow Control enabled" Dec 8, 2025
@LukeAVanDrie (Contributor Author):

/assign @ahg-g

@ahg-g (Contributor) commented Dec 8, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label and removed the needs-ok-to-test label Dec 8, 2025
This defines the contract for resolving candidate pods based on request metadata, decoupling the resolution logic from the storage layer.

Refactors the Director to use the injected PodLocator interface instead of the private getCandidatePodsForScheduling method. This prepares the Director for lazy resolution without changing current behavior.

Updates the FlowControlRequest interface to carry request metadata instead of a pre-resolved list of candidate pods. This prepares the system for lazy pod resolution.

- Adds GetMetadata() to FlowControlRequest.
- Removes CandidatePodsForScheduling() from FlowControlRequest.
- Updates mocks in flowcontrol/types and contracts.

Refactors the request processing flow to support queuing when no backends are available.

- Inverts Director flow: Admission is now called before Pod Resolution.
- Updates AdmissionController interface to remove eager pod list.
- LegacyAdmissionController now resolves pods internally via PodLocator.
- ShardProcessor (Flow Control) now resolves pods lazily via PodLocator during the dispatch cycle.
- Updates Runner wiring to inject PodLocator where needed.
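
To make the last commit message above concrete, a simplified sketch of just-in-time resolution inside a dispatch cycle, including the head-of-line blocking behavior when zero pods resolve (ShardProcessor, PodLocator, and the queue shape here are stand-ins, not the actual flowcontrol types):

```go
package dispatchsketch

import "context"

type Pod struct{ Name string }

// PodLocator mirrors the role described in the commits: resolve candidate pods
// from request metadata at dispatch time rather than at enqueue time.
type PodLocator interface {
	Locate(ctx context.Context, metadata map[string]any) []Pod
}

type queuedRequest struct {
	Metadata map[string]any
	Expired  func() bool // true once the request TTL has elapsed
}

type shardProcessor struct {
	locator PodLocator
	queue   []*queuedRequest
}

// dispatchOnce drains as much of the queue as current capacity allows.
func (p *shardProcessor) dispatchOnce(ctx context.Context, schedule func(*queuedRequest, []Pod)) {
	for len(p.queue) > 0 {
		head := p.queue[0]
		if head.Expired() {
			p.queue = p.queue[1:] // drop requests whose TTL expired while queued
			continue
		}
		pods := p.locator.Locate(ctx, head.Metadata)
		if len(pods) == 0 {
			// Zero pods means the shard is treated as saturated: keep the head
			// queued and stop dispatching (head-of-line blocking) until pods
			// appear or the TTL expires.
			return
		}
		p.queue = p.queue[1:]
		schedule(head, pods)
	}
}
```
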
@LukeAVanDrie force-pushed the feat/scale-from-zero-support branch from 55180b3 to 72c3ae1 on December 9, 2025 at 00:54
@ahg-g (Contributor) commented Dec 9, 2025

/lgtm
/approve

@k8s-ci-robot added the lgtm label Dec 9, 2025
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, LukeAVanDrie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label Dec 9, 2025
@LukeAVanDrie (Contributor Author):

/unhold

@k8s-ci-robot removed the do-not-merge/hold label Dec 9, 2025
@k8s-ci-robot merged commit 4738bee into kubernetes-sigs:main Dec 9, 2025 (12 checks passed)