feat: add HPA autoscaling for InferenceService #260

Merged
Defilan merged 1 commit into main from feat/hpa-autoscaling on Apr 3, 2026

Conversation

@Defilan (Member) commented Apr 2, 2026

Summary

Adds native Kubernetes HPA (autoscaling/v2) support to InferenceService. This closes the single biggest functional gap relative to every competitor (KubeAI, KAITO, KServe, llm-d, AIBrix). When an autoscaling block is configured, the controller creates and manages an HPA targeting the inference Deployment.

Usage:

spec:
  modelRef: my-model
  replicas: 1
  autoscaling:
    minReplicas: 1
    maxReplicas: 5
    metrics:
      - type: Pods
        name: "llamacpp:requests_processing"
        targetAverageValue: "2"

CLI:

llmkube deploy llama-3.1-8b --gpu --max-replicas 5

How it works:

  • Controller creates an HPA v2 resource owned by the InferenceService (garbage collected on delete)
  • Default metric: llamacpp:requests_processing with target average of 2 per pod
  • When HPA is active, controller sets Deployment replicas to nil (lets HPA manage)
  • Metal accelerator workloads skip HPA (no Deployment to scale)
  • Requires Prometheus Adapter for custom metrics (documented)
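Putting the points above together, the generated resource would look roughly like the following. This is a hedged sketch: the object name (`my-service`) and the InferenceService apiVersion (`llmkube.dev/v1alpha1`) are illustrative assumptions, not taken from this PR; the min/max replicas and metric values mirror the usage example above.

```yaml
# Sketch of the HPA the controller creates for the usage spec above.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service                        # assumed: named after the InferenceService
  ownerReferences:
    - apiVersion: llmkube.dev/v1alpha1    # assumed group/version
      kind: InferenceService
      name: my-service
      controller: true                    # garbage collected when the owner is deleted
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service                      # the inference Deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: "llamacpp:requests_processing"
        target:
          type: AverageValue
          averageValue: "2"               # default: 2 in-flight requests per pod
```

Because the HPA carries a controller owner reference to the InferenceService, removing the autoscaling block (or deleting the service) cleans it up without extra finalizer logic.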

Design decisions:

  • Native HPA v2, not KEDA (CNCF-aligned, no external dependencies)
  • Structured autoscaling block (extensible for future behavior policies, scale-to-zero)
  • MinReplicas minimum is 1 (no scale-to-zero in v1)
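Since the default Pods metric is served through the Prometheus Adapter, a rule along these lines would be needed in the adapter's config. This is a hedged sketch: the underlying series name (`llamacpp_requests_processing`) and its label set are assumptions about how the llama.cpp exporter names its metrics, not something stated in this PR.

```yaml
# prometheus-adapter rule exposing the llama.cpp in-flight request count
# as the custom metric "llamacpp:requests_processing".
# Series name and labels are assumptions about the exporter.
rules:
  - seriesQuery: 'llamacpp_requests_processing{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^llamacpp_requests_processing$"
      as: "llamacpp:requests_processing"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```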

Changes

  • 967 lines added across 8 files
  • 11 new tests (unit + envtest), controller coverage 78.9% -> 80.3%
  • CRD: AutoscalingSpec, MetricSpec types
  • Controller: reconcileHPA(), constructHPA(), replica conflict avoidance
  • CLI: --min-replicas, --max-replicas, --autoscale-metric flags
  • Helm: autoscaling RBAC in ClusterRole

Test plan

  • make test passes (186 tests, 11 new)
  • Deploy with --max-replicas 3, verify HPA created
  • Deploy without autoscaling, verify no HPA
  • Remove autoscaling from spec (unit test), verify HPA deleted
  • Metal deployment (unit test), verify HPA skipped
  • Verify Deployment replicas (unit test) set to nil when HPA active

Fixes #240

…rics

Add native Kubernetes HPA (autoscaling/v2) support to InferenceService.
When an autoscaling block is configured in the CRD spec, the controller
creates and manages an HPA targeting the inference Deployment.

CRD changes:
- Add AutoscalingSpec with minReplicas, maxReplicas, and metrics array
- Add MetricSpec supporting Pods (custom metrics) and Resource types
- Default metric: llamacpp:requests_processing with target average of 2

Controller changes:
- reconcileHPA() creates/updates/deletes HPA following existing patterns
- constructHPA() builds HPA v2 spec with custom or default metrics
- reconcileDeployment() sets replicas to nil when HPA manages scaling
- SetupWithManager() owns HPA resources for reconciliation triggers
- Metal accelerator workloads skip HPA (no Deployment to scale)
- RBAC updated for autoscaling API group

CLI changes:
- --min-replicas, --max-replicas, --autoscale-metric deploy flags
- Deploy summary shows autoscale range when configured

Includes 11 new tests (unit + envtest) covering HPA creation, deletion,
Metal skip, default metrics, custom metrics, and replica conflict
avoidance. Controller coverage increased to 80.3%.

Fixes #240

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan merged commit 2d16502 into main Apr 3, 2026
16 checks passed
@Defilan Defilan deleted the feat/hpa-autoscaling branch April 3, 2026 02:41
@github-actions github-actions bot mentioned this pull request Apr 3, 2026


Development

Successfully merging this pull request may close these issues.

HPA autoscaling for InferenceService based on llama.cpp metrics