
Autoscaling

Evan Nemerson edited this page Jan 6, 2026 · 6 revisions

Experimental Feature: The autoscaling capabilities described in this document are experimental, introduced in version 1.2.9. While we are not aware of any problems with them, be aware that behavior may change in future releases. If you choose to enable autoscaling, please let CloudZero know so we can assist with any issues as they arise.

This document explains horizontal pod autoscaling (HPA) for CloudZero Agent components. Autoscaling allows the agent to automatically scale up to handle high-demand clusters and scale down to minimize resource usage in development environments or during quiet periods.

Background: The Scaling Challenge

The CloudZero Agent has four primary components:

  1. Webhook Server - Processes admission requests for label and resource identifier capture
  2. CloudZero Agent (Prometheus) - Scrapes and collects metrics
  3. Aggregator - Processes and ships collected data
  4. KSM (Kube State Metrics) - Exposes Kubernetes object state as metrics

Of these, the webhook server and KSM are the most likely to experience resource pressure in large or high-activity clusters.

Historical Scaling Challenges

The CloudZero Agent (in its default configuration) uses Prometheus as its metrics collection engine. Historically, scaling the agent meant manually tuning resource requests and limits on a per-cluster basis. This works, but every cluster is different - the defaults are relatively high for most clusters yet may be insufficient for extremely large ones.

When scaling up has been needed:

  • Large clusters with many nodes and/or pods
  • High-churn environments where pods are frequently created and destroyed

When scaling down would help:

  • Development and staging clusters that mirror production configuration but carry a fraction of the workload
  • Ephemeral clusters used for testing or CI/CD

A key limitation is that Prometheus doesn't scale horizontally by simply adding replicas. Setting replicas: 3 doesn't distribute the work - it causes each replica to do all the work independently, effectively tripling resource consumption with no benefit. The supported path to horizontal Prometheus scaling is sharding, where different Prometheus instances scrape different targets. This isn't feasible for CloudZero because it would require per-cluster configuration that we cannot automate.

Enter Alloy: True Horizontal Scaling

Version 1.2.9 introduces experimental support for Alloy as an alternative to Prometheus. One of Alloy's key advantages is native horizontal scalability - replicas can dynamically join and leave a cluster, with work automatically distributed among them. As demand increases or decreases, the HPA can start up or shut down replicas as needed, and the workload rebalances automatically.

Scaling Down: The Overlooked Benefit

While scaling up to meet high demand is the problem most people think about, scaling down to meet low demand is equally valuable. The default CloudZero Agent configuration runs 3 replicas of the webhook server. This default provides redundancy (surviving node failures), spreads load across availability zones, and handles typical production workloads - but it's overkill for development environments, staging clusters, or any cluster with low activity.

With autoscaling enabled, the HPA can scale components down to a single replica during quiet periods. This means:

  • Development clusters don't waste resources running 3 replicas that sit mostly idle
  • Off-hours scaling automatically reduces footprint when cluster activity drops
  • One configuration works everywhere - the same Helm values can deploy to both large production clusters and small dev clusters, with the HPA adjusting replica counts appropriately

Because Alloy supports automatic scaling, we've been able to significantly reduce Alloy's default resource allocation compared to Prometheus:

| Resource       | Prometheus Defaults | Alloy Defaults |
| -------------- | ------------------- | -------------- |
| Memory request | 512Mi               | 256Mi          |
| Memory limit   | 1024Mi              | 512Mi          |
| CPU request    | 250m                | 50m            |

Alloy's defaults are half the memory and one-fifth the CPU of Prometheus. If demand increases, the HPA automatically adds replicas. This approach provides better resource efficiency across the full range of cluster sizes.

Components That Support Autoscaling

Webhook Server

The webhook server processes admission requests for every resource create, update, and delete operation in your cluster. In high-churn environments with frequent pod turnover, the webhook server can experience significant load spikes.

Status: Experimental (introduced in 1.2.9)

How it works: The HPA monitors CPU and memory utilization, targeting 80% for both. When utilization exceeds this threshold, additional replicas are created. When utilization drops, replicas are removed.

Scaling behavior: This is the component CloudZero controls directly, so scaling is straightforward. Work is distributed across replicas, so horizontal scaling provides real benefits.
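Under the hood, enabling autoscaling produces a standard `autoscaling/v2` HorizontalPodAutoscaler targeting the webhook server Deployment. A sketch of what the rendered object might look like with the default settings (the resource names here are illustrative, not the chart's literal output):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cloudzero-agent-webhook-server # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cloudzero-agent-webhook-server # illustrative name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    # Scale when average CPU utilization across replicas exceeds 80%
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    # Scale when average memory utilization across replicas exceeds 80%
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```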

Alloy (Experimental)

Alloy is an alternative to Prometheus for metrics collection, available when components.agent.mode is set to clustered.

Status: Experimental (may change in future releases)

How it works: Same as the webhook server - the HPA targets 80% CPU and memory utilization. Unlike Prometheus, Alloy replicas form a cluster and automatically distribute scraping work among themselves.

Scaling behavior: Replicas can join and leave the cluster as needed. When a new replica starts, it takes over a portion of the scraping work; when a replica shuts down, its work is redistributed to the remaining replicas.

Enabling Autoscaling

Enable Both at Once

To enable autoscaling for both the webhook server and Alloy with a single configuration:

defaults:
  autoscaling:
    enabled: true
components:
  agent:
    mode: clustered

This enables the global autoscaling default (which applies to the webhook server) and switches to Alloy (which also picks up the autoscaling default).

Webhook Server Autoscaling

Enable autoscaling for the webhook server with minimal configuration:

components:
  webhookServer:
    autoscaling:
      enabled: true

This uses the default autoscaling settings:

  • minReplicas: 1
  • maxReplicas: 10
  • targetCPUUtilizationPercentage: 80
  • targetMemoryUtilizationPercentage: 80

Alloy Autoscaling

To use Alloy with autoscaling, enable clustered mode:

components:
  agent:
    mode: clustered
    autoscaling:
      enabled: true

Note: Clustered mode is experimental. Test thoroughly in non-production environments before deploying to production clusters.

Advanced Configuration

Global Defaults

You can configure default autoscaling settings that apply to all components:

defaults:
  autoscaling:
    enabled: true
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
    targetMemoryUtilizationPercentage: 80

Individual components can override these settings:

defaults:
  autoscaling:
    enabled: true
    minReplicas: 1

components:
  webhookServer:
    autoscaling:
      minReplicas: 2 # Override: never scale below 2
      maxReplicas: 5 # Override: cap at 5 replicas

Full Configuration Options

components:
  webhookServer:
    autoscaling:
      # Whether to enable horizontal pod autoscaling.
      enabled: true

      # Minimum number of replicas.
      # With minReplicas: 1, the HPA can scale down to a single instance
      # during quiet periods.
      minReplicas: 1

      # Maximum number of replicas.
      # Prevents runaway scaling during extreme load spikes.
      maxReplicas: 10

      # Target CPU utilization percentage for autoscaling.
      # When average CPU exceeds this threshold, the HPA adds replicas.
      # Set to null to disable CPU-based scaling.
      targetCPUUtilizationPercentage: 80

      # Target memory utilization percentage for autoscaling.
      # When average memory exceeds this threshold, the HPA adds replicas.
      # Set to null to disable memory-based scaling.
      targetMemoryUtilizationPercentage: 80

Scaling Behavior

The HPA uses carefully tuned scaling policies to prevent thrashing:

Scale Up:

  • Stabilization window: 0 seconds (responds immediately to increased load)
  • Scales by up to 100% of current replicas or 4 pods per 30 seconds
  • Uses maximum of the two policies for fast response

Scale Down:

  • Stabilization window: 300 seconds (5 minutes) to prevent flapping
  • Scales down by 50% of current replicas or 2 pods per 60 seconds
  • Uses minimum of the two policies for gradual reduction

These policies are built into the HPA template and cannot be customized via values.yaml.
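Expressed as an `autoscaling/v2` `behavior` block, the policies described above would look roughly like the following. This is an illustrative sketch of what the built-in template emits, shown only to make the policies concrete; it is not something you can set in values.yaml:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0 # respond immediately to increased load
    selectPolicy: Max # take the more aggressive of the two policies
    policies:
      - type: Percent
        value: 100 # up to double the current replica count...
        periodSeconds: 30
      - type: Pods
        value: 4 # ...or add up to 4 pods, per 30-second period
        periodSeconds: 30
  scaleDown:
    stabilizationWindowSeconds: 300 # wait 5 minutes to prevent flapping
    selectPolicy: Min # take the more conservative of the two policies
    policies:
      - type: Percent
        value: 50 # remove up to half the current replicas...
        periodSeconds: 60
      - type: Pods
        value: 2 # ...or up to 2 pods, per 60-second period
        periodSeconds: 60
```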

Combining Autoscaling with Resource Tuning

Even with autoscaling enabled, you can (and sometimes should) still tune resource requests and limits. The HPA scales the number of replicas, while resource settings control the size of each replica.

When to Tune Resources

Small/dev clusters: The defaults may be more than you need. Even if the default request is 256 MB of memory, a quiet development cluster might only need 64 MB. In this case, lower the requests so the single replica isn't oversized:

components:
  webhookServer:
    resources:
      requests:
        memory: "64Mi"
        cpu: "25m"
      limits:
        memory: "128Mi"
        cpu: "50m"
    autoscaling:
      enabled: true
      minReplicas: 1

Very large clusters: The defaults may be insufficient. Rather than increasing replica count, consider increasing per-replica resources first (see "Vertical vs Horizontal Scaling" below).

Webhook Server Resources

components:
  webhookServer:
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "250m"
    autoscaling:
      enabled: true

Alloy Resources (Clustered Mode)

components:
  agent:
    mode: clustered
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1024Mi"
        cpu: "1000m"
    autoscaling:
      enabled: true

Vertical vs Horizontal Scaling

For Alloy: Prefer Vertical Scaling

For Alloy, vertical scaling (larger single instance) is more efficient than horizontal scaling when possible. Running a single larger instance avoids duplicating resources like caches and connection pools that each replica maintains. The HPA should be thought of as a safety net that kicks in when vertical limits are reached.

Recommendation: Start with appropriately-sized resource requests for your cluster, and let the HPA handle unexpected spikes:

components:
  agent:
    mode: clustered
    resources:
      requests:
        memory: "1Gi" # Sized for your cluster
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 5 # Safety net for spikes

For Webhook Server: Horizontal Scaling is Fine

The webhook server is simpler - you can scale vertically or horizontally as needed. Just enabling autoscaling is probably enough for most clusters. The HPA will add replicas during activity spikes and remove them during quiet periods.

Recommendation: Don't increase resource requests to handle peak load. Instead, let the HPA add replicas:

components:
  webhookServer:
    # Use reasonable defaults, don't oversize
    resources:
      requests:
        memory: "256Mi"
        cpu: "100m"
      limits:
        memory: "512Mi"
        cpu: "250m"
    autoscaling:
      enabled: true # Let HPA handle scaling

The only scenario where you hit a wall is at a single replica: if that one replica is still oversized for your needs, you'll need to lower its resource requests.

Trade-offs and Considerations

Single Replica Implications

When the HPA scales down to a single replica (minReplicas: 1):

Availability during node maintenance: With a single replica, node draining requires either accepting brief downtime or using a Pod Disruption Budget that blocks draining entirely. See Pod Disruption Budgets for details.
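To see why a Pod Disruption Budget can block draining, consider a PDB like the following (the name and label selector are illustrative). With only one replica running, evicting that pod would violate minAvailable, so `kubectl drain` waits indefinitely:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: webhook-server-pdb # illustrative name
spec:
  # At least one pod must remain available at all times.
  # With a single replica, this makes every eviction a violation.
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: webhook-server # illustrative label
```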

Webhook server latency: A single webhook server replica must handle all admission requests. During restart or upgrade, requests may queue briefly. The webhook's failurePolicy: Ignore ensures cluster operations continue even if the webhook is temporarily unavailable, but some metadata may be missed.

Recommendations:

  • For production clusters requiring high availability, consider minReplicas: 2
  • For development/staging clusters where brief gaps are acceptable, minReplicas: 1 is fine
  • Monitor the apiserver_admission_webhook_admission_duration_seconds metric to detect latency issues
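As a sketch, a PromQL query for the 99th-percentile admission latency over the last five minutes might look like this (the metric is a standard API server histogram; additional labels such as `name` or `operation` can be used to narrow it to the CloudZero webhook):

```promql
histogram_quantile(
  0.99,
  sum by (le) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
  )
)
```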

Metrics Server Requirement

Horizontal Pod Autoscalers require the Kubernetes Metrics Server to function. Most managed Kubernetes distributions (EKS, GKE, AKS) include this by default.

Verify Metrics Server is running:

kubectl get deployment metrics-server -n kube-system

If it is not installed, the HPA cannot read utilization metrics: replica counts will not change, and `kubectl describe hpa` will show FailedGetResourceMetric events.
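If the Deployment is missing, checking whether the resource-metrics API itself is registered and serving is another quick diagnostic (these assume a standard Metrics Server installation):

```shell
# Is the resource-metrics API registered and available?
kubectl get apiservice v1beta1.metrics.k8s.io

# Does the API actually return data?
kubectl top nodes
```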

Monitoring Autoscaling Behavior

Check HPA Status

# See current replicas and targets
kubectl get hpa -n cloudzero-agent

# Detailed HPA status including events
kubectl describe hpa -n cloudzero-agent

Example output:

NAME                              REFERENCE                                    TARGETS           MINPODS   MAXPODS   REPLICAS
cz-agent-cloudzero-agent-ws       Deployment/cz-agent-cloudzero-agent-ws       45%/80%, 30%/80%   1         10        2

Watch Scaling Events

kubectl get events -n cloudzero-agent --watch --field-selector reason=SuccessfulRescale

Prometheus Metrics

# Current replica count
kube_horizontalpodautoscaler_status_current_replicas{horizontalpodautoscaler=~".*cloudzero.*"}

# Desired replica count (what HPA wants)
kube_horizontalpodautoscaler_status_desired_replicas{horizontalpodautoscaler=~".*cloudzero.*"}

# Max replicas configured
kube_horizontalpodautoscaler_spec_max_replicas{horizontalpodautoscaler=~".*cloudzero.*"}

Example Configurations

Development/Staging Cluster

Minimize resource usage with lower requests and single-replica scaling:

components:
  webhookServer:
    resources:
      requests:
        memory: "128Mi"
        cpu: "50m"
      limits:
        memory: "256Mi"
        cpu: "100m"
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 3

Production Cluster with High Availability

Maintain redundancy while allowing scaling:

components:
  webhookServer:
    autoscaling:
      enabled: true
      minReplicas: 2 # Always maintain at least 2 replicas
      maxReplicas: 10

High-Churn Cluster with Alloy

For clusters with high pod turnover and large metric volumes:

components:
  agent:
    mode: clustered
    resources:
      requests:
        memory: "1Gi"
        cpu: "500m"
      limits:
        memory: "2Gi"
        cpu: "1000m"
    autoscaling:
      enabled: true
      minReplicas: 1
      maxReplicas: 10

  webhookServer:
    autoscaling:
      enabled: true
      minReplicas: 2
      maxReplicas: 10

Conclusion

Autoscaling in version 1.2.9 addresses a long-standing challenge: Prometheus cannot scale horizontally by simply adding replicas. The introduction of Alloy (which natively supports horizontal scaling) combined with HPAs for both Alloy and the webhook server provides automatic scaling that adapts to your cluster's actual workload.

The HPAs can scale from 1 replica up to 10 (by default), automatically adjusting to match actual demand. This means scaling down is just as important as scaling up - development clusters and quiet periods no longer require running oversized deployments.

Key takeaways:

  • Scaling works both directions - the HPA scales up to meet high demand AND scales down during quiet periods
  • Webhook server autoscaling is straightforward - just enable it and let the HPA handle scaling
  • Alloy enables lower defaults - because it can scale horizontally, Alloy's defaults are half the memory and one-fifth the CPU of Prometheus
  • One config fits all - the same Helm values can work across production and development clusters
  • Prefer vertical scaling for Alloy when possible - it's more efficient than running multiple replicas
  • Lower resource requests for small/dev clusters if even the scaled-down single replica is oversized
  • Both components target 80% CPU and memory utilization, scaling up and down to maintain this target
