Summary
Multiple GCE instances have reached or are approaching critical disk thresholds with no automated alerting in place. The 2026-03-15 mainnet validator-1 incident (auto-shutdown at 97% disk) and the current testnet-validator-3 situation (96% disk as of 2026-03-17) demonstrate an urgent need for proactive disk monitoring.
Current State (2026-03-17)
| Instance | Disk | Used | Use% | Status |
| --- | --- | --- | --- | --- |
| numbers-mainnet-validator-1 | 3.4T | 2.8T | 84% | Warning |
| numbers-mainnet-validator-a1 | 1.9T | 1.1T | 57% | OK |
| numbers-mainnet-validator-a2 | 2.0T | 970G | 49% | OK |
| numbers-testnet-validator-3 | 497G | 476G | 96% | CRITICAL |
| testnet-explorer | 29G | 25G | 84% | Warning |
| mainnet-explorer | 47G | 33G | 72% | OK |
Proposed Implementation
- GCP Cloud Monitoring alerting policies: Create alerts on disk utilization metrics that fire at 80% used (warning) and 90% used (critical)
- Notification channels: Configure email and/or Slack notifications for disk alerts
- Runbook documentation: Add a disk cleanup/expansion runbook to the repository covering:
  - How to expand GCE persistent disks (online resize)
  - Avalanchego chain data pruning options
  - Blockscout/explorer database cleanup procedures
- Monitoring script: Add a cron-based disk check script that can be deployed to each instance as a fallback
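The fallback monitoring script could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names (`classify`, `check`, `main`) are hypothetical, the 80%/90% thresholds are taken from the proposal above, and reporting is done through stdout and a return code rather than any particular notification channel. Note that `shutil.disk_usage` computes usage against the total filesystem size, so its percentage can differ slightly from the Use% column that `df` reports.

```python
#!/usr/bin/env python3
"""Fallback disk check, intended to be run from cron on each instance (sketch)."""
import shutil

WARN_PCT = 80.0  # warning threshold from the proposal
CRIT_PCT = 90.0  # critical threshold from the proposal

# Hypothetical mapping from status to a process exit code.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def classify(used_pct, warn=WARN_PCT, crit=CRIT_PCT):
    """Map a used-space percentage to OK / WARNING / CRITICAL."""
    if used_pct >= crit:
        return "CRITICAL"
    if used_pct >= warn:
        return "WARNING"
    return "OK"


def check(path="/"):
    """Return (used_pct, status) for the filesystem containing path."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return used_pct, classify(used_pct)


def main(argv):
    """Check every path given on the command line (default: /).

    Prints one line per path and returns the worst severity seen,
    so the caller can use it as an exit code.
    """
    paths = argv[1:] or ["/"]
    worst = 0
    for path in paths:
        used_pct, status = check(path)
        print(f"{status} {path} {used_pct:.1f}% used")
        worst = max(worst, SEVERITY[status])
    return worst
```

To deploy it as a script, one option is to end the file with `import sys; sys.exit(main(sys.argv))` and schedule it from cron, with the cron wrapper forwarding any non-zero exit code to whichever notification channel is eventually chosen.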
Immediate Actions Needed
- `numbers-testnet-validator-3` at 96% needs immediate disk expansion or cleanup
- `testnet-explorer` and `numbers-mainnet-validator-1` at 84% should be monitored closely
Impact
High. Without disk monitoring, validators silently auto-shutdown once free disk space falls below 3%, causing chain downtime and a transaction mempool backlog (as seen in the 2026-03-15 incident).
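To make the urgency concrete, the remaining headroom before the 3% free-space shutdown point can be estimated from the Current State figures. The sketch below is a rough calculation, not a measurement: `headroom_gb` is a hypothetical helper, T values are converted at 1T = 1024G as an assumption, and the rounded `df`-style numbers from the table mean the results are approximate.

```python
SHUTDOWN_FREE_PCT = 3.0  # auto-shutdown threshold quoted in this issue


def headroom_gb(size_gb, used_gb, min_free_pct=SHUTDOWN_FREE_PCT):
    """Space (in G) that can still be written before free space drops to min_free_pct."""
    free_gb = size_gb - used_gb
    reserve_gb = size_gb * min_free_pct / 100.0
    return free_gb - reserve_gb


T = 1024  # assumption: 1T = 1024G when converting the table values

# (size, used) in G for the validators, copied from the Current State table.
validators = {
    "numbers-mainnet-validator-1": (3.4 * T, 2.8 * T),
    "numbers-mainnet-validator-a1": (1.9 * T, 1.1 * T),
    "numbers-mainnet-validator-a2": (2.0 * T, 970),
    "numbers-testnet-validator-3": (497, 476),
}

for name, (size_gb, used_gb) in validators.items():
    print(f"{name}: about {headroom_gb(size_gb, used_gb):.1f}G of headroom")
```

For `numbers-testnet-validator-3` this works out to roughly 6G of headroom (21G free minus a 14.9G reserve), which is why it is listed under immediate actions.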
Generated by Health Monitor with Omni