[Feature][High] Add disk usage monitoring and auto-alerting for all GCE instances #138

@numbers-official

Summary

Multiple GCE instances have reached or are approaching critical disk usage thresholds, and no automated alerting is in place. The 2026-03-15 mainnet validator-1 incident (auto-shutdown at 97% disk usage) and the current testnet-validator-3 situation (96% disk usage as of 2026-03-17) demonstrate an urgent need for proactive disk monitoring.

Current State (2026-03-17)

| Instance | Disk size | Used | Use% | Status |
|---|---|---|---|---|
| numbers-mainnet-validator-1 | 3.4T | 2.8T | 84% | Warning |
| numbers-mainnet-validator-a1 | 1.9T | 1.1T | 57% | OK |
| numbers-mainnet-validator-a2 | 2.0T | 970G | 49% | OK |
| numbers-testnet-validator-3 | 497G | 476G | 96% | CRITICAL |
| testnet-explorer | 29G | 25G | 84% | Warning |
| mainnet-explorer | 47G | 33G | 72% | OK |

Proposed Implementation

  1. GCP Cloud Monitoring alerting policies: Create uptime/disk metric alerts that fire at 80% (warning) and 90% (critical) thresholds
  2. Notification channels: Configure email and/or Slack notifications for disk alerts
  3. Runbook documentation: Add a disk cleanup/expansion runbook to the repository covering:
    • How to expand GCE persistent disks (online resize)
    • Avalanchego chain data pruning options
    • Blockscout/explorer database cleanup procedures
  4. Monitoring script: Add a cron-based disk check script that can be deployed to each instance as a fallback
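
As a starting point for item 4, here is a minimal sketch of what the fallback check could look like. It is not an existing script in this repository. The 80% and 90% thresholds come from item 1; the `CHECK_PATHS` list and the function names are assumptions made for illustration. The script prints only when a path is at WARNING or above, so a cron daemon configured to mail job output (for example via `MAILTO`) would send a message only when something needs attention.

```python
#!/usr/bin/env python3
"""Fallback disk check intended to run from cron (illustrative sketch)."""
import shutil

WARNING_PCT = 80.0    # warning threshold proposed in this issue
CRITICAL_PCT = 90.0   # critical threshold proposed in this issue
CHECK_PATHS = ["/"]   # assumption: adjust to the mount points that matter


def classify(used_pct: float) -> str:
    """Map a usage percentage to OK, WARNING, or CRITICAL."""
    if used_pct >= CRITICAL_PCT:
        return "CRITICAL"
    if used_pct >= WARNING_PCT:
        return "WARNING"
    return "OK"


def check_path(path: str):
    """Return (used percentage, status) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    used_pct = 100.0 * usage.used / usage.total
    return used_pct, classify(used_pct)


def report(paths):
    """Print one line per path at WARNING or above and return those lines."""
    lines = []
    for path in paths:
        used_pct, status = check_path(path)
        if status != "OK":
            lines.append(f"{status}: {path} is {used_pct:.1f}% full")
    for line in lines:
        print(line)
    return lines


if __name__ == "__main__":
    report(CHECK_PATHS)
```

Note that `shutil.disk_usage` computes used/total, which can differ by a few points from the Use% column that `df` reports, because `df` excludes blocks reserved for root. Delivery to Slack or email is deliberately left out; that part depends on the notification channels chosen in item 2.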

Immediate Actions Needed

  • numbers-testnet-validator-3 at 96% needs immediate disk expansion or cleanup
  • testnet-explorer and numbers-mainnet-validator-1 at 84% should be monitored closely
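
To size an expansion, a rough calculation helps: the smallest capacity that brings current usage back under the 80% warning threshold is used / 0.80. The sketch below applies this to the figures from the table; the function name is hypothetical, and the result is a lower bound that ignores filesystem overhead and future growth.

```python
import math


def min_disk_size(used: float, target_pct: float = 80.0) -> int:
    """Smallest whole-unit capacity at which `used` is at most target_pct of it."""
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    return math.ceil(used * 100.0 / target_pct)


# Figures taken from the Current State table (same units as reported there).
print(min_disk_size(476))  # numbers-testnet-validator-3 -> 595
print(min_disk_size(25))   # testnet-explorer -> 32
```

In other words, numbers-testnet-validator-3 would need to grow from 497G to at least roughly 595G just to drop below the warning level, before allowing any headroom for chain data growth.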

Impact

High: without disk monitoring, validators will silently shut down automatically when free disk space drops below 3%, causing chain downtime and a transaction mempool backlog (as seen in the 2026-03-15 incident).

Generated by Health Monitor with Omni
