Sizing
This document explains how to scale the CloudZero Agent, the limitations of horizontal scaling, and when to use federated mode for distributed deployments.
Can the CloudZero Agent scale horizontally?
The short answer is no - you cannot simply increase the replica count and expect multiple instances to work together.
The longer, more accurate answer is that it's complicated.
The CloudZero Agent server component is essentially Prometheus running in agent mode, and Prometheus can scale horizontally, but not in the way most applications do. Prometheus requires explicit, manual configuration to distribute workload across instances through techniques like federation, functional sharding, or hashmod-based target distribution. There's no automatic load balancing or work distribution.
For CloudZero Agent, this means:
- Traditional horizontal scaling doesn't work: You cannot run multiple replicas of the same agent configuration
- Custom sharding is possible but impractical: You could manually configure multiple agents with different scrape targets, but this requires deep knowledge of your specific environment and would need to be customized per deployment
- Federated mode is available: CloudZero Agent offers a federated deployment mode that implements Prometheus-style horizontal scaling using a DaemonSet approach
Key Principle: Unless you actually need to scale horizontally, you will see more efficient resource usage from a single instance than multiple agents. Vertical scaling (increasing memory and CPU) should always be your first approach.
The CloudZero Agent server component is essentially Prometheus running in agent mode. Prometheus was not designed to scale horizontally in the traditional sense - you cannot simply add more replicas and expect them to work together.
Horizontal scaling in Prometheus is complicated in a way that cannot be easily automated:
- No built-in sharding: Prometheus doesn't automatically split work between multiple instances
- Duplicate metrics: Multiple instances would collect the same metrics, causing data duplication
- Configuration complexity: You must manually configure which metrics each instance collects
- Customer-specific requirements: Workload distribution depends heavily on each customer's unique environment
Prometheus was designed with a specific architecture for horizontal scaling[^1]:
- Single centralized server: One Prometheus instance that contains the database but does NOT perform any scraping
- Multiple agent instances: Multiple Prometheus servers running in agent mode that perform the actual scraping
- Manual workload distribution: Manually split up the scraping jobs across agents
For example, you might manually configure the scraping so that:
- One agent scrapes only pods with specific labels
- Agents shard pods so each agent scrapes only a subset
- Different agents handle different metric types
Problem: This approach doesn't work generically for CloudZero because there's no one-size-fits-all way to split the scraping that makes sense for all customers. Each customer's environment is unique, and the optimal distribution strategy would be highly customer-specific.
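To make the manual effort concrete, here is a minimal sketch of hashmod-based sharding in a plain Prometheus scrape configuration. It is generic Prometheus configuration, not CloudZero Agent chart values. The job name, the `__tmp_hash` label, and the shard count of 3 are illustrative assumptions.

```yaml
# Illustrative scrape config for shard 0 of 3 (generic Prometheus, not CloudZero chart values).
scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 3 buckets (0, 1, or 2).
      - source_labels: [__address__]
        modulus: 3
        target_label: __tmp_hash
        action: hashmod
      # Keep only the targets in bucket 0; the other two agents would keep "1" and "2".
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
```

Each shard needs its own copy of this configuration with a different `regex` value, and changing the shard count means updating every agent at once. That coordination burden is why this approach is hard to ship as a generic default.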
[^1]: For more details on Prometheus scaling strategies including federation, functional sharding, and hashmod-based distribution, see Scaling Prometheus: Handling Large-Scale Deployments
For most performance issues, vertical scaling (increasing resources for a single instance) is the better solution:
- More efficient resource usage than multiple small instances
- Simpler to manage and troubleshoot
- No risk of data duplication or gaps
- Easier to monitor and understand
See the sizing guide for detailed recommendations on resource allocation.
You should consider increasing memory and CPU when:
- The agent is running out of memory
- Metrics collection is slow or falling behind
- Pods are restarting due to OOMKilled errors
- High CPU usage is causing throttling
If vertical scaling is not sufficient (e.g., you've hit resource limits or need better fault tolerance), CloudZero Agent offers federated mode - a horizontally scalable deployment option.
Federated mode implements the Prometheus-recommended architecture for horizontal scaling:
- Runs Prometheus in agent mode as a DaemonSet
- One instance runs on every node in your cluster
- Each instance scrapes only the pods on its node
- Significantly smaller resource footprint per instance
Advantages:
- Horizontal scalability: Scales automatically with your cluster
- Smaller resource footprint: Instead of one agent using 10GB of memory, you have many agents each using a few MB to a few hundred MB
- Better fault tolerance: Failure of a single node's agent is less catastrophic than losing a single centralized instance
- Automatic distribution: Work is naturally distributed by node topology
Disadvantages:
- Higher total resource usage: e.g., 100MB multiplied across 100 nodes = 10GB of RAM total
- More complex: More moving parts means more to monitor and troubleshoot
- Redundancy challenges: Running multiple instances per node for redundancy quickly becomes infeasible due to resource multiplication
- Lower efficiency: A single well-tuned instance is generally more efficient than many small instances
Consider federated mode when:
- You've exhausted vertical scaling options (memory/CPU limits reached)
- You have a very large cluster (many nodes)
- You need better fault tolerance and can accept higher total resource usage
- You want automatic scaling as your cluster grows
Do NOT use federated mode when:
- You just want "high availability" without actual scaling needs
- You can solve the problem by increasing memory/CPU
- You're trying to optimize resource usage (single instance is more efficient)
Important: Before enabling federated mode, carefully consider whether you actually need it. In most cases, vertical scaling (increasing memory/CPU) is simpler, more efficient, and easier to manage. Only enable federated mode if you have exhausted vertical scaling options or have specific architectural requirements that demand it.
Federated mode fundamentally changes how cAdvisor metrics are collected:
Standard Mode (Default):
- Single centralized Prometheus agent instance
- Scrapes all metrics from across the entire cluster:
  - cAdvisor metrics (container resource usage)
  - Kube-state-metrics (Kubernetes object state)
  - CloudZero observability metrics
- Centralized resource usage (e.g., 10GB memory in one pod)
- cAdvisor scraping is typically the most resource-intensive operation
Federated Mode:
- Centralized Prometheus agent still exists
- Per-node Prometheus instances deployed as a DaemonSet
- Work is divided:
  - Per-node agents: Scrape cAdvisor metrics only for pods on their local node
  - Centralized agent: Scrapes kube-state-metrics and CloudZero observability metrics
- Distributed resource usage for the heavy cAdvisor scraping (e.g., 100MB memory × 100 nodes = 10GB total)
Standard Mode (Default):
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        KSM[kube-state-metrics]
        CAD[cAdvisor]
        K8S_API[Kubernetes API]
        subgraph "CloudZero Agent"
            PROM[CloudZero Agent<br/>Prometheus in Agent Mode]
        end
        subgraph "Aggregator"
            COLLECTOR[Collector]
            SHIPPER[Shipper]
            SHARED_STORAGE[(Shared Storage)]
        end
        subgraph "Webhook Server"
            WEBHOOK[Webhook Server]
        end
    end
    API[CloudZero API]
    S3[(AWS S3)]
    %% Data flow from sources to agent
    KSM -->|metrics| PROM
    CAD -->|metrics| PROM
    COLLECTOR -->|metrics| PROM
    SHIPPER -->|metrics| PROM
    %% Agent to aggregator
    PROM -->|remote write| COLLECTOR
    %% Internal aggregator flow
    COLLECTOR -->|writes files| SHARED_STORAGE
    SHARED_STORAGE -->|files| SHIPPER
    %% To CloudZero platform
    SHIPPER -->|requests pre-signed URLs| API
    SHIPPER -->|uploads files| S3
    %% Webhook flow
    K8S_API -->|admission webhook calls| WEBHOOK
    WEBHOOK -->|remote write| COLLECTOR
```
Federated Mode:
```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        KSM[kube-state-metrics]
        K8S_API[Kubernetes API]
        subgraph "Node 1"
            CAD1[cAdvisor]
            PROM_FED1[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD1 -->|metrics| PROM_FED1
        end
        subgraph "Node 2"
            CAD2[cAdvisor]
            PROM_FED2[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD2 -->|metrics| PROM_FED2
        end
        subgraph "Node N"
            CAD_N[cAdvisor]
            PROM_FED_N[CloudZero Agent<br/>Prometheus in Agent Mode<br/>DaemonSet Instance]
            CAD_N -->|metrics| PROM_FED_N
        end
        subgraph "CloudZero Agent Central"
            PROM_CENTRAL[CloudZero Agent<br/>Prometheus in Agent Mode<br/>Deployment]
        end
        subgraph "Aggregator"
            COLLECTOR[Collector]
            SHIPPER[Shipper]
            SHARED_STORAGE[(Shared Storage)]
        end
        subgraph "Webhook Server"
            WEBHOOK[Webhook Server]
        end
    end
    API[CloudZero API]
    S3[(AWS S3)]
    %% Data flow from sources to central agent
    KSM -->|metrics| PROM_CENTRAL
    COLLECTOR -->|metrics| PROM_CENTRAL
    SHIPPER -->|metrics| PROM_CENTRAL
    %% Federated agents to aggregator
    PROM_FED1 -->|remote write| COLLECTOR
    PROM_FED2 -->|remote write| COLLECTOR
    PROM_FED_N -->|remote write| COLLECTOR
    %% Central agent to aggregator
    PROM_CENTRAL -->|remote write| COLLECTOR
    %% Internal aggregator flow
    COLLECTOR -->|writes files| SHARED_STORAGE
    SHARED_STORAGE -->|files| SHIPPER
    %% To CloudZero platform
    SHIPPER -->|requests pre-signed URLs| API
    SHIPPER -->|uploads files| S3
    %% Webhook flow
    K8S_API -->|admission webhook calls| WEBHOOK
    WEBHOOK -->|remote write| COLLECTOR
```
While federated mode provides horizontal scalability, it comes with significant drawbacks:
- Higher total resource usage with substantial overhead: Consider a 100-node cluster:
  - Single instance mode: Might use 10GB total (one instance handling all scraping)
  - Federated mode: Default configuration requests 512Mi per node, resulting in 50GB total across a 100-node cluster
  - Critical overhead consideration: Each Prometheus instance has base overhead (runtime, storage, networking) regardless of how much data it collects. Running 100 Prometheus instances incurs this overhead 100 times, whereas adding incremental scraping load to a single instance is much more efficient
  - The actual per-node memory usage varies significantly based on pod count and metrics cardinality per node, so the 512Mi default may be too high or too low for your specific environment
- Increased operational complexity: Instead of managing one pod, you're managing N pods (one per node)
- More points of failure: Each node's agent can fail independently, requiring more monitoring
- Resource multiplication: Every resource allocation multiplies across all nodes
- Less visibility: Troubleshooting requires checking logs across many pods instead of one
The trade-off: You get automatic horizontal scaling and better fault isolation, but at the cost of efficiency and operational simplicity. The overhead of running many Prometheus instances almost always exceeds the cost of scaling a single instance vertically.
Only enable federated mode after determining it's truly necessary. To enable:
```yaml
# Enable federated mode
defaults:
  federation:
    enabled: true
```

With federated mode enabled:
- Per-node agents are deployed as a DaemonSet, alongside the centralized agent Deployment
- One instance automatically runs on every node in your cluster
- Each instance scrapes metrics only from pods on its local node
- Resources scale automatically as nodes are added/removed
- Each agent operates independently
Refer to the sizing guide for detailed memory and CPU recommendations based on:
- Cluster size (number of nodes)
- Pod count
- Metrics cardinality
- Scrape intervals
In federated mode, each DaemonSet instance has default resource allocations:
- Memory request: 512Mi per node (default)
- Memory limit: 1024Mi (1GiB) per node (default)
- CPU request: 250m per node (default)
- CPU limit: 1000m per node (default)
Important: These defaults are generic and unlikely to be optimal for your specific environment. The number of pods running on a given node is the primary factor in determining agent memory usage, and Kubernetes nodes can vary wildly in size and capacity:
- Small nodes: May have only 1-2GB of memory total with a handful of pods
- Large nodes: May have hundreds of CPU cores and terabytes of RAM running hundreds of pods
- Mixed environments: Many clusters use different node types for different workload classes
Because of this variability, there is no one-size-fits-all resource allocation that works across all environments. You should:
- Monitor actual usage after enabling federated mode
- Adjust resource requests based on observed memory and CPU consumption per node type
- Consider node heterogeneity: Use node selectors or taints/tolerations to apply different resource allocations to different node types if needed
Remember: Total resource usage = (per-instance allocation) × (number of nodes). This multiplication can make federated mode extremely expensive for large clusters, particularly if the resource requests/limits are not carefully selected.
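As a worked example of this multiplication, the following applies the default allocations listed above to a 100-node cluster. The node count is an assumption chosen to match the earlier example. Real usage depends on what each node actually consumes, not only on what is requested.

```math
\begin{aligned}
\text{memory requests} &= 512\,\text{Mi} \times 100 = 51{,}200\,\text{Mi} = 50\,\text{Gi} \\
\text{memory limits} &= 1024\,\text{Mi} \times 100 = 102{,}400\,\text{Mi} = 100\,\text{Gi} \\
\text{CPU requests} &= 250\,\text{m} \times 100 = 25{,}000\,\text{m} = 25\ \text{cores} \\
\text{CPU limits} &= 1000\,\text{m} \times 100 = 100{,}000\,\text{m} = 100\ \text{cores}
\end{aligned}
```

The requests are what the scheduler reserves on every node, so halving the per-node request on a cluster of this size frees roughly 25Gi of schedulable memory.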
To customize resource allocations:
```yaml
defaults:
  federation:
    enabled: true
components:
  agent:
    federatedNode:
      resources:
        requests:
          memory: "256Mi" # Adjust based on your needs
          cpu: "100m" # Adjust based on your needs
        limits:
          memory: "512Mi" # Adjust based on your needs
          cpu: "500m" # Adjust based on your needs
```

Recommended practices:
- Start with vertical scaling: Always try increasing memory/CPU before considering federated mode
- Use sizing guide: Follow the official sizing recommendations for your cluster size
- Monitor resource usage: Track memory and CPU to identify actual bottlenecks
- Plan for growth: Consider future cluster growth when sizing resources
- Test thoroughly: If using federated mode, validate it before deploying to production
The CloudZero Agent cannot be scaled horizontally using traditional replica count increases. The recommended approach is:
- First: Scale vertically by increasing memory and CPU (see sizing guide)
- If needed: Enable federated mode for true horizontal scaling across nodes
- Never: Simply increase replica count expecting better performance
Unless you have specific requirements that demand federated mode, a single well-resourced instance will provide the best performance and efficiency.
For more information on Prometheus scaling architecture, see the Prometheus Agent Mode documentation and this overview of scaling Prometheus for large-scale deployments.