feat: add HPA autoscaling for InferenceService#260
Merged
Conversation
Add native Kubernetes HPA (autoscaling/v2) support to InferenceService. When an autoscaling block is configured in the CRD spec, the controller creates and manages an HPA targeting the inference Deployment.

CRD changes:
- Add AutoscalingSpec with minReplicas, maxReplicas, and a metrics array
- Add MetricSpec supporting Pods (custom metrics) and Resource types
- Default metric: llamacpp:requests_processing with a target average of 2

Controller changes:
- reconcileHPA() creates/updates/deletes the HPA following existing patterns
- constructHPA() builds the HPA v2 spec with custom or default metrics
- reconcileDeployment() sets replicas to nil when the HPA manages scaling
- SetupWithManager() owns HPA resources so their changes trigger reconciliation
- Metal accelerator workloads skip HPA (no Deployment to scale)
- RBAC updated for the autoscaling API group

CLI changes:
- --min-replicas, --max-replicas, --autoscale-metric deploy flags
- Deploy summary shows the autoscale range when configured

Includes 11 new tests (unit + envtest) covering HPA creation, deletion, Metal skip, default metrics, custom metrics, and replica conflict avoidance. Controller coverage increased to 80.3%.

Fixes #240

Signed-off-by: Christopher Maher <chris@mahercode.io>
Summary
Adds native Kubernetes HPA (autoscaling/v2) support to InferenceService. This was the single biggest functional gap vs every competitor (KubeAI, KAITO, KServe, llm-d, AIBrix). When an autoscaling block is configured, the controller creates and manages an HPA targeting the inference Deployment.
Usage:
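A minimal sketch of the autoscaling block, assuming the field casing follows the Go spec names (AutoscalingSpec, MetricSpec) described above; the apiVersion and resource names here are placeholders, not the project's real group/version:

```yaml
# Hypothetical group/version and names; only the autoscaling block
# reflects the fields added in this PR.
apiVersion: inference.example.io/v1alpha1
kind: InferenceService
metadata:
  name: llama-demo
spec:
  autoscaling:
    minReplicas: 1
    maxReplicas: 3
    # Optional; omitting metrics falls back to the default
    # llamacpp:requests_processing metric with a target average of 2.
    metrics:
      - type: Pods
        pods:
          metric:
            name: llamacpp:requests_processing
          target:
            type: AverageValue
            averageValue: "2"
```

The metric shape shown mirrors autoscaling/v2's MetricSpec; the CRD's own schema may differ in detail.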
CLI:
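The equivalent deploy invocation, using the flags added in this PR; the binary and service names are placeholders:

```shell
# Hypothetical CLI name; --min-replicas, --max-replicas, and
# --autoscale-metric are the deploy flags introduced here.
inferctl deploy llama-demo \
  --min-replicas 1 \
  --max-replicas 3 \
  --autoscale-metric llamacpp:requests_processing
```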
How it works:
Scales on llamacpp:requests_processing with a target average of 2 per pod by default.

Design decisions:
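The HPA object the controller would generate for the default metric looks roughly like the following; the object and target names are illustrative, but the shape follows the standard autoscaling/v2 API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-demo
spec:
  # Targets the inference Deployment the controller already manages.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-demo
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Pods
      pods:
        metric:
          name: llamacpp:requests_processing
        target:
          type: AverageValue
          averageValue: "2"
```

Serving a Pods metric like this requires a custom metrics adapter (e.g. prometheus-adapter) exposing llamacpp:requests_processing through the custom metrics API.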
Changes
Test plan
- make test passes (186 tests, 11 new)
- Manual: deploy with --max-replicas 3, verify the HPA is created

Fixes #240