feat: add HPA autoscaling for InferenceService #260

Merged
Defilan merged 1 commit into main from feat/hpa-autoscaling on Apr 3, 2026

Conversation

@Defilan (Member) commented Apr 2, 2026

Summary

Adds native Kubernetes HPA (autoscaling/v2) support to InferenceService. This closes the single biggest functional gap relative to every competitor (KubeAI, KAITO, KServe, llm-d, AIBrix). When an autoscaling block is configured, the controller creates and manages an HPA targeting the inference Deployment.

Usage:

spec:
  modelRef: my-model
  replicas: 1
  autoscaling:
    minReplicas: 1
    maxReplicas: 5
    metrics:
      - type: Pods
        name: "llamacpp:requests_processing"
        targetAverageValue: "2"

CLI:

llmkube deploy llama-3.1-8b --gpu --max-replicas 5

How it works:

  • Controller creates an HPA v2 resource owned by the InferenceService (garbage collected on delete)
  • Default metric: llamacpp:requests_processing with target average of 2 per pod
  • When HPA is active, controller sets Deployment replicas to nil (lets HPA manage)
  • Metal accelerator workloads skip HPA (no Deployment to scale)
  • Requires Prometheus Adapter for custom metrics (documented)
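Putting the points above together, the generated resource would look roughly like the following. This is a hedged sketch: the object name (`my-service`) and the InferenceService apiVersion (`llmkube.dev/v1alpha1`) are illustrative assumptions, not taken from this PR; the min/max replicas and metric values mirror the usage example above.

```yaml
# Sketch of the HPA the controller creates for the usage spec above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                        # assumed: named after the InferenceService
  ownerReferences:
    - apiVersion: llmkube.dev/v1alpha1    # assumed group/version
      kind: InferenceService
      name: my-service
      controller: true                    # garbage collected when the owner is deleted
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service                      # the inference Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: "llamacpp:requests_processing"
        target:
          type: AverageValue
          averageValue: "2"               # default: 2 in-flight requests per pod
```

Because the HPA carries a controller owner reference to the InferenceService, removing the autoscaling block (or deleting the service) cleans it up without extra finalizer logic.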

Design decisions:

  • Native HPA v2, not KEDA (CNCF-aligned, no external dependencies)
  • Structured autoscaling block (extensible for future behavior policies, scale-to-zero)
  • MinReplicas minimum is 1 (no scale-to-zero in v1)
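Since the default Pods metric is served through the Prometheus Adapter, a rule along these lines would be needed in the adapter's config. This is a hedged sketch: the underlying series name (`llamacpp_requests_processing`) and its label set are assumptions about how the llama.cpp exporter names its metrics, not something stated in this PR.

```yaml
# prometheus-adapter rule exposing the llama.cpp in-flight request count
# as the custom metric "llamacpp:requests_processing".
# Series name and labels are assumptions about the exporter.
rules:
  - seriesQuery: 'llamacpp_requests_processing{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^llamacpp_requests_processing$"
      as: "llamacpp:requests_processing"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```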

Changes

  • 967 lines added across 8 files
  • 11 new tests (unit + envtest), controller coverage 78.9% -> 80.3%
  • CRD: AutoscalingSpec, MetricSpec types
  • Controller: reconcileHPA(), constructHPA(), replica conflict avoidance
  • CLI: --min-replicas, --max-replicas, --autoscale-metric flags
  • Helm: autoscaling RBAC in ClusterRole

Test plan

  • make test passes (186 tests, 11 new)
  • Deploy with --max-replicas 3, verify HPA created
  • Deploy without autoscaling, verify no HPA
  • Remove autoscaling from spec (unit test), verify HPA deleted
  • Metal deployment (unit test), verify HPA skipped
  • Verify Deployment replicas (unit test) set to nil when HPA active

Fixes #240

…rics

Add native Kubernetes HPA (autoscaling/v2) support to InferenceService.
When an autoscaling block is configured in the CRD spec, the controller
creates and manages an HPA targeting the inference Deployment.

CRD changes:
- Add AutoscalingSpec with minReplicas, maxReplicas, and metrics array
- Add MetricSpec supporting Pods (custom metrics) and Resource types
- Default metric: llamacpp:requests_processing with target average of 2

Controller changes:
- reconcileHPA() creates/updates/deletes HPA following existing patterns
- constructHPA() builds HPA v2 spec with custom or default metrics
- reconcileDeployment() sets replicas to nil when HPA manages scaling
- SetupWithManager() owns HPA resources for reconciliation triggers
- Metal accelerator workloads skip HPA (no Deployment to scale)
- RBAC updated for autoscaling API group

CLI changes:
- --min-replicas, --max-replicas, --autoscale-metric deploy flags
- Deploy summary shows autoscale range when configured

Includes 11 new tests (unit + envtest) covering HPA creation, deletion,
Metal skip, default metrics, custom metrics, and replica conflict
avoidance. Controller coverage increased to 80.3%.

Fixes #240

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 2d16502 into main Apr 3, 2026
16 checks passed
@Defilan Defilan deleted the feat/hpa-autoscaling branch April 3, 2026 02:41
@github-actions github-actions bot mentioned this pull request Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

HPA autoscaling for InferenceService based on llama.cpp metrics