# Pod Disruption Budgets (PDBs) in CloudZero Agent

This document explains why CloudZero Agent uses Pod Disruption Budgets (PDBs) and how to handle them during cluster maintenance operations.

## What are Pod Disruption Budgets?

Pod Disruption Budgets are Kubernetes resources that limit the number of pods that can be voluntarily disrupted during cluster maintenance operations. They help ensure application availability during:

- Node maintenance
- Cluster upgrades
- Pod evictions
- Voluntary disruptions

For more information, see the [Kubernetes documentation on Pod Disruption Budgets](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/).

## Why CloudZero Agent Uses PDBs

The CloudZero Agent sets PDBs with `minAvailable: 1` on **all components**. Without them, these components could be accidentally scaled down to 0 replicas during cluster operations, causing:

- Complete loss of metrics collection
- Loss of cost attribution data
- Service unavailability

## The Single-Replica "Problem"

You may have noticed that some CloudZero Agent components (like the main agent and kube-state-metrics) run with only 1 replica but still have PDBs with `minAvailable: 1`. This is intentional and follows Kubernetes best practices for single-instance stateful applications.

As noted in the [Kubernetes documentation on single-instance stateful applications](https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions), single-instance stateful applications may want to set `minAvailable: 1` in the PDB to prevent the pod from being evicted, but this also prevents the node from being drained.
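Concretely, the chart renders a standard `policy/v1` PodDisruptionBudget for each component. A minimal sketch of such an object follows; the name and selector labels here are illustrative placeholders, not the chart's actual values:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cloudzero-agent          # illustrative name
spec:
  minAvailable: 1                # never allow voluntary eviction of the last pod
  selector:
    matchLabels:
      app.kubernetes.io/name: cloudzero-agent   # illustrative label
```

With a single replica and `minAvailable: 1`, the eviction API will refuse voluntary disruptions of that pod, which is exactly what blocks node drains.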
## Why This Matters

The PDB prevents:

- **Accidental scaling to 0**: Prevents the component from being accidentally scaled down during cluster operations
- **Data loss**: Ensures continuous metrics collection and cost attribution
- **Service disruption**: Maintains critical CloudZero functionality

However, this also means:

- **Node draining is blocked**: The PDB prevents pods from being evicted during node maintenance
- **Cluster upgrades may be delayed**: The autoscaler cannot drain nodes with these pods

## Autoscaling: A Partial Solution

If you're using [autoscaling](Autoscaling.md) with clustered mode (`components.agent.mode: clustered`), the metrics collection component (Alloy) can scale horizontally and handle pod disruptions gracefully. This eliminates the PDB concern for metrics collection.

However, Kube State Metrics (KSM) currently cannot scale horizontally in the same way, so the KSM PDB remains a consideration. We are actively working on improving this situation.

## Choosing Your Approach

There are two valid approaches to handling PDBs, depending on your environment and tolerance for brief data gaps:

### Option 1: Keep PDBs Enabled, Disable Temporarily for Maintenance (Default)

This is the default configuration and is appropriate when:

- You require uninterrupted metrics collection
- Brief gaps in data are unacceptable for your use case
- You can coordinate maintenance windows with CloudZero Agent updates

With this approach, temporarily disable PDBs before planned maintenance, then re-enable them afterwards. The [Kubernetes documentation](https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions) recommends this pattern: establish an understanding that the cluster operator needs to consult you before termination; when contacted, prepare for downtime and delete the PDB to indicate readiness for disruption, then recreate it afterwards.
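The delete-and-recreate pattern described above can be sketched with plain `kubectl`; the PDB name and namespace below are placeholders that depend on your Helm release:

```bash
# Save the current PDB so it can be recreated after maintenance.
kubectl get pdb <pdb-name> -n <namespace> -o yaml > pdb-backup.yaml

# Delete the PDB to signal readiness for disruption.
kubectl delete pdb <pdb-name> -n <namespace>

# ... perform the node drain / upgrade ...

# Recreate the PDB once maintenance is complete.
# Note: you may need to strip server-populated fields
# (resourceVersion, uid, status) from the backup before re-applying.
kubectl apply -f pdb-backup.yaml
```

The Helm-values approach in the steps below achieves the same result declaratively and is easier to audit.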
#### Step 1: Disable PDBs

You can disable PDBs globally or individually:

**Disable all PDBs globally:**

```yaml
defaults:
  podDisruptionBudget:
    enabled: false
kubeStateMetrics:
  podDisruptionBudget: null
```

**Or disable PDBs individually:**

```yaml
components:
  agent:
    podDisruptionBudget:
      enabled: false
  aggregator:
    podDisruptionBudget:
      enabled: false
  webhookServer:
    podDisruptionBudget:
      enabled: false
kubeStateMetrics:
  podDisruptionBudget: null
```

#### Step 2: Apply and perform maintenance

```bash
helm upgrade <release-name> ./helm -f your-values.yaml
```

Then perform your maintenance operation (cluster upgrades, node patching, node draining, etc.).

#### Step 3: Re-enable PDBs

Remove or comment out the PDB overrides from your values file, then apply again:

```bash
helm upgrade <release-name> ./helm -f your-values.yaml
```

### Option 2: Permanently Disable PDBs

Permanently disabling some or all PDBs is a **valid choice** for many environments. This is particularly appropriate when:

- You use node autoscalers like **Karpenter**, where nodes are frequently added and removed
- You prefer automated cluster operations over manual coordination
- **You can tolerate brief gaps in data collection** (typically a few minutes during pod migrations)

**Understanding the trade-offs:**

When a pod is evicted without a PDB, there will be a brief window where that component is unavailable. The impact varies by component:

- **KSM metrics**: KSM collects point-in-time Kubernetes object state. Brief gaps are generally less impactful because this data represents current state rather than sampled measurements. The main concern is short-lived pods that might be missed entirely during the gap.
- **cAdvisor metrics**: These are time-series samples collected at regular intervals. Missing samples can affect the accuracy of resource utilization calculations. If you enable clustered mode with autoscaling, this concern is eliminated since Alloy handles disruptions gracefully.
**Recommended configuration for environments with frequent node churn:**

```yaml
# Use clustered mode with autoscaling for metrics collection
components:
  agent:
    mode: clustered
    autoscaling:
      enabled: true

# Disable the KSM PDB to allow automated node draining
kubeStateMetrics:
  podDisruptionBudget: null
```

This configuration gives you the best of both worlds: horizontally-scalable metrics collection that handles disruptions gracefully, combined with automated KSM pod migration at the cost of occasional brief gaps in KSM data.

## Troubleshooting

### PDB Preventing Node Draining

If you encounter issues where PDBs prevent node draining:

1. **Check which PDBs are blocking**:

   ```bash
   kubectl get pdb
   ```

2. **Temporarily disable the blocking PDB**:

   ```bash
   kubectl patch pdb <pdb-name> -p '{"spec":{"minAvailable":0}}'
   ```

3. **Perform your maintenance operation**

4. **Re-enable the PDB**:

   ```bash
   kubectl patch pdb <pdb-name> -p '{"spec":{"minAvailable":1}}'
   ```

## Best Practices

**If using Option 1 (temporary PDB disabling):**

1. **Plan maintenance windows**: Coordinate with your team before disabling PDBs
2. **Re-enable promptly**: Always re-enable PDBs after maintenance to prevent accidental data gaps
3. **Monitor during maintenance**: Ensure CloudZero functionality is restored after PDBs are re-enabled
4. **Document the process**: Keep track of when PDBs are disabled and re-enabled

**If using Option 2 (permanent PDB disabling):**

1. **Use clustered mode**: Enable `components.agent.mode: clustered` with autoscaling to minimize the impact of pod disruptions on metrics collection
2. **Monitor for data gaps**: Be aware that KSM data may have brief gaps during node migrations
3. **Consider your workload**: If you have many short-lived pods, the brief KSM gaps may cause some pods to be missed entirely

## References

- [Kubernetes Pod Disruption Budgets Documentation](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
- [Single-instance Stateful Applications](https://kubernetes.io/docs/tasks/run-application/configure-pdb/#think-about-how-your-application-reacts-to-disruptions)
- [Autoscaling](Autoscaling.md) - Horizontal pod autoscaling for CloudZero Agent components