Sizing

Scaling CloudZero Agent

This document explains how to scale the CloudZero Agent, the limitations of horizontal scaling, and when to use federated mode for distributed deployments.

Overview

Can the CloudZero Agent scale horizontally?

The short answer is no - you cannot simply increase the replica count and expect multiple instances to work together.

The longer, more accurate answer is that it's complicated.

The CloudZero Agent server component is essentially Prometheus running in agent mode, and Prometheus can scale horizontally, but not in the way most applications do. Prometheus requires explicit, manual configuration to distribute workload across instances through techniques like federation, functional sharding, or hashmod-based target distribution. There's no automatic load balancing or work distribution.

For CloudZero Agent, this means:

Traditional horizontal scaling doesn't work: You cannot run multiple replicas of the same agent configuration
Custom sharding is possible but impractical: You could manually configure multiple agents with different scrape targets, but this requires deep knowledge of your specific environment and would need to be customized per deployment
Federated mode is available: CloudZero Agent offers a federated deployment mode that implements Prometheus-style horizontal scaling using a DaemonSet approach

Key Principle: Unless you actually need to scale horizontally, a single instance will use resources more efficiently than multiple agents. Vertical scaling (increasing memory and CPU) should always be your first approach.

Why Multiple Replicas Don't Work

Because the CloudZero Agent server component is essentially Prometheus running in agent mode, it inherits the Prometheus scaling model. Prometheus was not designed to scale horizontally in the traditional sense - replicas do not coordinate with each other, so adding more of them only produces more instances scraping the same targets.

The Prometheus Scaling Problem

Horizontal scaling in Prometheus is complicated in a way that cannot be easily automated:

No built-in sharding: Prometheus doesn't automatically split work between multiple instances
Duplicate metrics: Multiple instances would collect the same metrics, causing data duplication
Configuration complexity: You must manually configure which metrics each instance collects
Customer-specific requirements: Workload distribution depends heavily on each customer's unique environment

How Prometheus Intends You to Scale

Prometheus was designed with a specific architecture for horizontal scaling[^1]:

Single centralized server: One Prometheus instance that contains the database but does NOT perform any scraping
Multiple agent instances: Multiple Prometheus servers running in agent mode that perform the actual scraping
Manual workload distribution: Manually split up the scraping jobs across agents

For example, you might manually configure the scraping so that:

One agent scrapes only pods with specific labels
Agents shard pods so each agent scrapes only a subset
Different agents handle different metric types
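
The "each agent scrapes only a subset" option is usually done by hashing each target and keeping the ones whose hash, modulo the number of agents, equals that agent's index. In Prometheus this is normally expressed with the `hashmod` relabel action. The Python below is only a minimal sketch of the idea: the function names and target addresses are hypothetical, and it is not guaranteed to assign targets to the same shards that Prometheus would.

```python
import hashlib


def shard_for(target: str, num_shards: int) -> int:
    """Map a scrape target to a shard index by hashing its address."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Treat the last 8 bytes of the digest as an unsigned integer,
    # then reduce it modulo the number of shards.
    return int.from_bytes(digest[8:], "big") % num_shards


def targets_for_shard(targets: list[str], shard: int, num_shards: int) -> list[str]:
    """Return the targets that the agent with index `shard` would keep."""
    return [t for t in targets if shard_for(t, num_shards) == shard]


if __name__ == "__main__":
    # Hypothetical pod addresses, for illustration only.
    pods = [f"10.0.0.{i}:8080" for i in range(1, 11)]
    for shard in range(3):
        print(shard, targets_for_shard(pods, shard, 3))
```

Every target lands on exactly one shard, but each agent must be configured with its own shard index and the shard count must be updated by hand whenever agents are added or removed. That manual bookkeeping is why this approach is hard to offer generically.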

Problem: This approach doesn't work generically for CloudZero because there's no one-size-fits-all way to split the scraping that makes sense for all customers. Each customer's environment is unique, and the optimal distribution strategy would be highly customer-specific.

[^1]: For more details on Prometheus scaling strategies including federation, functional sharding, and hashmod-based distribution, see Scaling Prometheus: Handling Large-Scale Deployments

Vertical Scaling: The Recommended Approach

For most performance issues, vertical scaling (increasing resources for a single instance) is the better solution:

More efficient resource usage than multiple small instances
Simpler to manage and troubleshoot
No risk of data duplication or gaps
Easier to monitor and understand

See the sizing guide for detailed recommendations on resource allocation.

When to Scale Vertically

You should consider increasing memory and CPU when:

The agent is running out of memory
Metrics collection is slow or falling behind
The pod restarts with OOMKilled errors
High CPU usage is causing throttling
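
One way to confirm the OOMKilled symptom is to inspect the pod's container statuses, where Kubernetes records the reason for the last termination under `lastState.terminated.reason`. The sketch below assumes you already have the pod object as a dictionary (for example, parsed from `kubectl get pod <name> -o json`). The container names in the example are hypothetical.

```python
def oom_killed_containers(pod: dict) -> list[str]:
    """Return names of containers whose last termination reason was OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState is empty for containers that have never been restarted.
        terminated = status.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            names.append(status.get("name", "<unknown>"))
    return names


if __name__ == "__main__":
    # Hypothetical, trimmed-down pod object for illustration only.
    example_pod = {
        "status": {
            "containerStatuses": [
                {
                    "name": "agent-server",
                    "restartCount": 3,
                    "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
                },
                {"name": "sidecar", "restartCount": 0, "lastState": {}},
            ]
        }
    }
    print(oom_killed_containers(example_pod))
```

A non-empty result is a strong hint that the memory limit is too low for the current workload and that vertical scaling is the right first step.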

Federated Mode: Horizontal Scaling Alternative

If vertical scaling is not sufficient (e.g., you've hit resource limits or need better fault tolerance), CloudZero Agent offers federated mode - a horizontally scalable deployment option.

What is Federated Mode?

Federated mode implements the Prometheus-recommended architecture for horizontal scaling:

Runs Prometheus in agent mode as a DaemonSet
One instance runs on every node in your cluster
Each instance scrapes only the pods on its node
Significantly smaller resource footprint per instance

Benefits of Federated Mode

Horizontal scalability: Scales automatically with your cluster
Smaller resource footprint: Instead of one agent using 10GB of memory, you have many agents each using a few MB to a few hundred MB
Better fault tolerance: Failure of a single node's agent is less catastrophic than losing a single centralized instance
Automatic distribution: Work is naturally distributed by node topology

Drawbacks of Federated Mode

Higher total resource usage: Per-instance memory multiplies across nodes (e.g., 100MB × 100 nodes = 10GB of RAM total), and every instance also carries its own fixed overhead
More complex: More moving parts means more to monitor and troubleshoot
Redundancy challenges: Running multiple instances per node for redundancy quickly becomes infeasible due to resource multiplication
Less optimal efficiency: A single well-tuned instance is generally more efficient than many small instances

When to Use Federated Mode

Consider federated mode when:

You've exhausted vertical scaling options (memory/CPU limits reached)
You have a very large cluster (many nodes)
You need better fault tolerance and can accept higher total resource usage
You want automatic scaling as your cluster grows

Do NOT use federated mode when:

You just want "high availability" without actual scaling needs
You can solve the problem by increasing memory/CPU
You're trying to optimize resource usage (single instance is more efficient)

Enabling Federated Mode

Important: Before enabling federated mode, carefully consider whether you actually need it. In most cases, vertical scaling (increasing memory/CPU) is simpler, more efficient, and easier to manage. Only enable federated mode if you have exhausted vertical scaling options or have specific architectural requirements that demand it.

Understanding the Architecture Change

Federated mode fundamentally changes how cAdvisor metrics are collected:

Standard Mode (Default):

Single centralized Prometheus agent instance
Scrapes all metrics from across the entire cluster:
- cAdvisor metrics (container resource usage)
- Kube-state-metrics (Kubernetes object state)
- CloudZero observability metrics
Centralized resource usage (e.g., 10GB memory in one pod)
cAdvisor scraping is typically the most resource-intensive operation

Federated Mode:

Centralized Prometheus agent still exists
Per-node Prometheus instances deployed as a DaemonSet
Work is divided:
- Per-node agents: Scrape cAdvisor metrics only for pods on their local node
- Centralized agent: Scrapes kube-state-metrics and CloudZero observability metrics
Distributed resource usage for the heavy cAdvisor scraping (e.g., 100MB memory × 100 nodes = 10GB total)
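
The division of work described above can be summarized as a small lookup. This is an illustrative sketch only: the source identifiers and the function name are hypothetical labels for the three metric sources named in this document, not configuration keys used by the agent.

```python
# Hypothetical identifiers for the metric sources named in this document.
SOURCES = ("cadvisor", "kube-state-metrics", "cloudzero-observability")


def scraper_for(source: str, federated: bool) -> str:
    """Return which agent scrapes `source` in standard or federated mode."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source}")
    # Only the cAdvisor scraping moves to the per-node agents.
    if federated and source == "cadvisor":
        return "per-node agent (DaemonSet)"
    return "centralized agent"


if __name__ == "__main__":
    for mode in (False, True):
        for source in SOURCES:
            print(f"federated={mode} {source}: {scraper_for(source, mode)}")
```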

Standard Mode (Default) data flow:

graph TB
    subgraph "Kubernetes Cluster"
        KSM[kube-state-metrics]
        CAD[cAdvisor]
        K8S_API[Kubernetes API]

        subgraph "CloudZero Agent"
            PROM[CloudZero Agent<br/>Prometheus in Agent Mode]
        end

        subgraph "Aggregator"
            COLLECTOR[Collector]
            SHIPPER[Shipper]
            SHARED_STORAGE[(Shared Storage)]
        end

        subgraph "Webhook Server"
            WEBHOOK[Webhook Server]
        end
    end

    API[CloudZero API]
    S3[(AWS S3)]

    %% Data flow from sources to agent
    KSM -->|metrics| PROM
    CAD -->|metrics| PROM
    COLLECTOR -->|metrics| PROM
    SHIPPER -->|metrics| PROM

    %% Agent to aggregator
    PROM -->|remote write| COLLECTOR

    %% Internal aggregator flow
    COLLECTOR -->|writes files| SHARED_STORAGE
    SHARED_STORAGE -->|files| SHIPPER

    %% To CloudZero platform
    SHIPPER -->|requests pre-signed URLs| API
    SHIPPER -->|uploads files| S3

    %% Webhook flow
    K8S_API -->|admission webhook calls| WEBHOOK
    WEBHOOK -->|remote write| COLLECTOR

Federated Mode data flow:

graph TB
    subgraph "Kubernetes Cluster"
        KSM[kube-state-metrics]
        K8S_API[Kubernetes API]

        subgraph "Node 1"
            CAD1[cAdvisor]
            PROM_FED1[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD1 -->|metrics| PROM_FED1
        end

        subgraph "Node 2"
            CAD2[cAdvisor]
            PROM_FED2[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD2 -->|metrics| PROM_FED2
        end

        subgraph "Node N"
            CAD_N[cAdvisor]
            PROM_FED_N[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD_N -->|metrics| PROM_FED_N
        end

        subgraph "CloudZero Agent Central"
            PROM_CENTRAL[CloudZero Agent<br/>Prometheus in Agent Mode<br/>Deployment]
        end

        subgraph "Aggregator"
            COLLECTOR[Collector]
            SHIPPER[Shipper]
            SHARED_STORAGE[(Shared Storage)]
        end

        subgraph "Webhook Server"
            WEBHOOK[Webhook Server]
        end
    end

    API[CloudZero API]
    S3[(AWS S3)]

    %% Data flow from sources to central agent
    KSM -->|metrics| PROM_CENTRAL
    COLLECTOR -->|metrics| PROM_CENTRAL
    SHIPPER -->|metrics| PROM_CENTRAL

    %% Federated agents to aggregator
    PROM_FED1 -->|remote write| COLLECTOR
    PROM_FED2 -->|remote write| COLLECTOR
    PROM_FED_N -->|remote write| COLLECTOR

    %% Central agent to aggregator
    PROM_CENTRAL -->|remote write| COLLECTOR

    %% Internal aggregator flow
    COLLECTOR -->|writes files| SHARED_STORAGE
    SHARED_STORAGE -->|files| SHIPPER

    %% To CloudZero platform
    SHIPPER -->|requests pre-signed URLs| API
    SHIPPER -->|uploads files| S3

    %% Webhook flow
    K8S_API -->|admission webhook calls| WEBHOOK
    WEBHOOK -->|remote write| COLLECTOR

Why Federated Mode is Generally Not Recommended

While federated mode provides horizontal scalability, it comes with significant drawbacks:

Higher total resource usage with substantial overhead: Consider a 100-node cluster:
- Single instance mode: Might use 10GB total (one instance handling all scraping)
- Federated mode: Default configuration requests 512Mi per node, resulting in 50GB total across a 100-node cluster
- Critical overhead consideration: Each Prometheus instance has base overhead (runtime, storage, networking) regardless of how much data it collects. Running 100 Prometheus instances incurs this overhead 100 times, whereas adding incremental scraping load to a single instance is much more efficient
- The actual per-node memory usage varies significantly based on pod count and metrics cardinality per node - the 512Mi default may be too high or too low for your specific environment
Increased operational complexity: Instead of managing one pod, you're managing N pods (one per node)
More points of failure: Each node's agent can fail independently, requiring more monitoring
Resource multiplication: Every resource allocation multiplies across all nodes
Less visibility: Troubleshooting requires checking logs across many pods instead of one

The trade-off: You get automatic horizontal scaling and better fault isolation, but at the cost of efficiency and operational simplicity. The overhead of running many Prometheus instances almost always exceeds the cost of scaling a single instance vertically.
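
The overhead argument can be made concrete with a deliberately simple model. It assumes each instance pays a fixed base cost plus a share of the scraping load, that the load splits evenly across nodes, and that the total load is the same in both modes. It also ignores the centralized agent that still runs in federated mode. The function names and all numbers in the example are hypothetical, not measurements.

```python
def single_instance_mib(base_mib: float, load_mib: float) -> float:
    """One agent pays the fixed overhead once, plus the whole scraping load."""
    return base_mib + load_mib


def federated_mib(base_mib: float, load_mib: float, nodes: int) -> float:
    """Each per-node agent pays the fixed overhead; the load is split among them."""
    return nodes * base_mib + load_mib


if __name__ == "__main__":
    # Hypothetical figures: 80 MiB base overhead, 9000 MiB of scraping load, 100 nodes.
    print(single_instance_mib(80, 9000))
    print(federated_mib(80, 9000, 100))
```

Under this model the federated total always exceeds the single-instance total by (nodes - 1) × base overhead, so the penalty grows linearly with cluster size.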

How to Enable Federated Mode

Only enable federated mode after determining it's truly necessary. To enable:

# Enable federated mode
defaults:
  federation:
    enabled: true

What Happens When Federated Mode is Enabled

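Per-node agents are deployed as a DaemonSet alongside the centralized agent, which continues to scrape kube-state-metrics and CloudZero observability metrics
One per-node instance automatically runs on every node in your cluster
Each per-node instance scrapes cAdvisor metrics only for pods on its local node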
Resources scale automatically as nodes are added/removed
Each agent operates independently

Sizing Recommendations

Single Instance (Default Mode)

Refer to the sizing guide for detailed memory and CPU recommendations based on:

Cluster size (number of nodes)
Pod count
Metrics cardinality
Scrape intervals

Federated Mode

In federated mode, each DaemonSet instance has default resource allocations:

Memory request: 512Mi per node (default)
Memory limit: 1024Mi (1Gi) per node (default)
CPU request: 250m per node (default)
CPU limit: 1000m per node (default)

Important: These defaults are generic and unlikely to be optimal for your specific environment. The number of pods running on a given node is the primary factor in determining agent memory usage, and Kubernetes nodes can vary wildly in size and capacity:

Small nodes: May have only 1-2GB of memory total with a handful of pods
Large nodes: May have hundreds of CPU cores and terabytes of RAM running hundreds of pods
Mixed environments: Many clusters use different node types for different workload classes

Because of this variability, there is no one-size-fits-all resource allocation that works across all environments. You should:

Monitor actual usage after enabling federated mode
Adjust resource requests based on observed memory and CPU consumption per node type
Consider node heterogeneity: Use node selectors or taints/tolerations to apply different resource allocations to different node types if needed

Remember: Total resource usage = (per-instance allocation) × (number of nodes). This multiplication can make federated mode extremely expensive for large clusters, particularly if the resource requests/limits are not carefully selected.
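
The multiplication above is easy to check before enabling federated mode. The sketch below sums per-instance requests across node pools. The `NodePool` type and its field names are hypothetical; only the 512Mi and 250m figures come from the defaults listed in this document.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    nodes: int
    memory_request_mib: int
    cpu_request_millicores: int


def total_requests(pools: list[NodePool]) -> tuple[int, int]:
    """Sum per-instance requests over all nodes: (memory MiB, CPU millicores)."""
    memory = sum(p.nodes * p.memory_request_mib for p in pools)
    cpu = sum(p.nodes * p.cpu_request_millicores for p in pools)
    return memory, cpu


if __name__ == "__main__":
    # Default requests from this document (512Mi, 250m) on a 100-node cluster.
    memory_mib, cpu_m = total_requests([NodePool("default", 100, 512, 250)])
    print(f"{memory_mib / 1024:.0f} GiB memory, {cpu_m / 1000:.0f} CPU cores requested")
```

With the defaults, a 100-node cluster reserves 50 GiB of memory and 25 CPU cores for the per-node agents alone. Adding more `NodePool` entries with different values shows how tuning requests per node type changes the total.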

To customize resource allocations:

defaults:
  federation:
    enabled: true

components:
  agent:
    federatedNode:
      resources:
        requests:
          memory: "256Mi" # Adjust based on your needs
          cpu: "100m" # Adjust based on your needs
        limits:
          memory: "512Mi" # Adjust based on your needs
          cpu: "500m" # Adjust based on your needs

Best Practices

Start with vertical scaling: Always try increasing memory/CPU before considering federated mode
Use sizing guide: Follow the official sizing recommendations for your cluster size
Monitor resource usage: Track memory and CPU to identify actual bottlenecks
Plan for growth: Consider future cluster growth when sizing resources
Test thoroughly: If using federated mode, test thoroughly before deploying to production

Conclusion

The CloudZero Agent cannot be scaled horizontally using traditional replica count increases. The recommended approach is:

First: Scale vertically by increasing memory and CPU (see sizing guide)
If needed: Enable federated mode for true horizontal scaling across nodes
Never: Simply increase replica count expecting better performance

Unless you have specific requirements that demand federated mode, a single well-resourced instance will provide the best performance and efficiency.

For more information on Prometheus scaling architecture, see the Prometheus Agent Mode documentation and this overview of scaling Prometheus for large-scale deployments.
