
Add validator update rollback, health checks, backup verification, and disaster recovery runbook#105

Draft

Copilot wants to merge 2 commits into main from copilot/add-validator-update-rollback

Conversation


Copilot AI commented Mar 1, 2026

Validator update scripts had no rollback path, no post-deploy health gate, no pre-flight guards, and backups were never verified — any failed update required manual recovery under pressure with no documented procedure.

Update Scripts (update-validator-mainnet.sh, update-validator-testnet.sh)

  • Pre-flight checks — validates required tools (wget, tar, sha256sum, tree, jq, curl) and ≥2 GB free disk space before touching anything
  • Versioned backup — snapshots current binary dir + subnet-evm plugin to ~/validator-backups/validator-backup-pre-<VERSION>-<timestamp>.tar.gz with immediate tar -tzf integrity check before overwriting
  • --rollback flag — restores the most recent pre-update backup (with integrity verification) and prompts manual restart
  • --dry-run flag — runs all checks and reports intended actions without downloading or modifying files
  • Automated health check — polls 127.0.0.1:9650/ext/health every 5 s for up to 60 s post-update using jq; triggers automatic rollback and non-zero exit on timeout
# Rollback a failed update
./chains/update-validator-mainnet.sh --rollback

# Dry-run to validate environment before maintenance window
./chains/update-validator-mainnet.sh --dry-run
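The pre-flight guard described above could be sketched roughly as follows. Function and variable names here are illustrative, not the scripts' actual internals; the disk-space threshold is passed in so the same check works for any minimum:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the pre-flight guard; names are illustrative.
# Usage: preflight <min_free_kb> <tool>...
preflight() {
  min_kb="$1"; shift
  for tool in "$@"; do
    # Abort early if any required tool is missing from PATH.
    command -v "$tool" >/dev/null 2>&1 || {
      echo "pre-flight: missing required tool: $tool" >&2; return 1;
    }
  done
  # POSIX df -Pk reports available space in 1K blocks (column 4).
  free_kb=$(df -Pk "${HOME:-/}" | awk 'NR==2 {print $4}')
  [ "${free_kb:-0}" -ge "$min_kb" ] || {
    echo "pre-flight: need at least ${min_kb} KB free" >&2; return 1;
  }
}

# Example: require ~2 GB free plus the tools the update scripts use.
# preflight 2097152 wget tar sha256sum tree jq curl || exit 1
```

Running the guard before any download or file modification is what keeps a failed check side-effect free.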

Backup Script (backup-validator.sh)

Added tar -tzf integrity verification immediately after archive creation to surface silent corruption at backup time rather than recovery time.
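The verification step amounts to listing the archive immediately after writing it. A self-contained sketch (the paths below are examples only; the real script writes to ~/validator-backups/):

```shell
#!/usr/bin/env sh
set -eu
# Example paths only; the real script uses the node's actual data dirs.
work="${TMPDIR:-/tmp}/backup-verify-demo"
mkdir -p "$work/node-data"
echo '{"state":"example"}' > "$work/node-data/state.json"

archive="$work/validator-backup-demo.tar.gz"
tar -czf "$archive" -C "$work" node-data

# tar -tzf decompresses and lists every entry, so a truncated or
# corrupted gzip stream fails here -- at backup time, not recovery time.
if tar -tzf "$archive" >/dev/null 2>&1; then
  echo "backup verified: $archive"
else
  echo "backup corrupt: $archive" >&2
  exit 1
fi
```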

DISASTER_RECOVERY.md (new)

Runbook covering:

  • Restoring validator state from backup
  • Manual rollback path when automated rollback fails
  • Staking key compromise: isolation, key rotation, Node ID replacement
  • Emergency subnet governance: removing/adding validators via subnet-cli
  • Incident communication templates and post-incident review process
Original prompt

This section details the original issue you should resolve

<issue_title>[Feature][High] Add validator update rollback, post-deployment health checks, and disaster recovery runbook</issue_title>
<issue_description>## Summary

The validator update workflow lacks critical operational resilience features that could lead to prolonged outages during failed updates.

Findings

1. No Rollback Mechanism in Update Scripts

  • Files: chains/update-validator-mainnet.sh, chains/update-validator-testnet.sh
  • Issue: Scripts copy old run.sh for reference but provide no automated rollback if the new version fails to start or sync. The old binary is overwritten in place.
  • Impact: Manual recovery required if an update breaks the validator; potential prolonged downtime.
  • Recommendation: Before updating, snapshot the current binary and configs. Add a --rollback flag that restores the previous version. Verify node health after update before discarding old version.
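The snapshot-then-restore flow recommended here could be sketched as follows; the directory layout and file names are hypothetical stand-ins for the node's real binary and backup locations:

```shell
#!/usr/bin/env sh
set -eu
# Hypothetical layout; the real scripts use ~/validator-backups/ and
# the node's actual binary directory.
bin_dir="${TMPDIR:-/tmp}/demo-node/bin"
backup_dir="${TMPDIR:-/tmp}/demo-node/backups"
mkdir -p "$bin_dir" "$backup_dir"
echo "old-binary-v1" > "$bin_dir/avalanchego"

# Snapshot before update: versioned, timestamped, verified.
snap="$backup_dir/pre-update-$(date +%Y%m%d%H%M%S).tar.gz"
tar -czf "$snap" -C "$(dirname "$bin_dir")" "$(basename "$bin_dir")"
tar -tzf "$snap" >/dev/null  # refuse to proceed on a corrupt snapshot

# A failed update overwrites the binary in place...
echo "broken-binary-v2" > "$bin_dir/avalanchego"

# ...and rollback restores the newest verified snapshot over it.
latest=$(ls -t "$backup_dir"/pre-update-*.tar.gz | head -n 1)
tar -xzf "$latest" -C "$(dirname "$bin_dir")"
grep -q "old-binary-v1" "$bin_dir/avalanchego" && echo "rollback ok"
```

Keeping snapshots timestamped means the rollback path can always pick the most recent pre-update state without extra bookkeeping.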

2. No Automated Health Check After Validator Update

  • Files: chains/update-validator-mainnet.sh (lines 46-56), chains/update-validator-testnet.sh (lines 46-56)
  • Issue: Scripts print a reminder to check health manually but do not programmatically verify the node started successfully (e.g., calling health.health.sh and checking the response).
  • Impact: Broken validators may go undetected until external monitoring catches it.
  • Recommendation: Add a post-deployment health check loop that polls the local health endpoint with a timeout (e.g., 60s). Exit with error if health check fails, triggering rollback.

3. No Disaster Recovery Documentation

  • Issue: No documented procedures for restoring a validator from backup, handling key compromise, or performing emergency subnet governance actions.
  • Impact: During a real incident, operators must improvise recovery steps under pressure.
  • Recommendation: Add a DISASTER_RECOVERY.md runbook covering: backup restoration, key rotation, emergency governance, and communication procedures.

4. No Backup Verification

  • File: chains/backup-validator.sh (lines 1-11)
  • Issue: Creates tar.gz backup but never verifies archive integrity with tar -tzf.
  • Impact: Silent backup corruption could lead to unrecoverable validator state.

5. Missing Pre-flight Checks in Update Scripts

  • Files: Both update scripts
  • Issue: No checks for sufficient disk space, running processes that need stopping, or required tools (tree, jq, sha256sum).
  • Impact: Updates fail mid-process leaving the node in an inconsistent state.

Suggested Approach

  1. Refactor update scripts to create a versioned backup before overwriting
  2. Add --rollback and --dry-run flags
  3. Integrate health.health.sh as a post-deployment gate
  4. Add pre-flight dependency and disk space checks
  5. Write a comprehensive disaster recovery runbook
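Step 2's flag handling can stay minimal. A sketch (the merged scripts may structure this differently):

```shell
#!/usr/bin/env sh
# Illustrative flag parsing for the update scripts: one mode variable
# drives whether the run updates, rolls back, or only reports.
MODE="update"
for arg in "$@"; do
  case "$arg" in
    --rollback) MODE="rollback" ;;
    --dry-run)  MODE="dry-run" ;;
    *) echo "unknown option: $arg" >&2; exit 2 ;;
  esac
done
echo "mode: $MODE"
```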

Generated by Health Monitor with Omni</issue_description>

Comments on the Issue (you are @copilot in this section)




Co-authored-by: numbers-official <181934381+numbers-official@users.noreply.github.com>
Copilot AI changed the title [WIP] Add rollback mechanism and health checks for validator updates Add validator update rollback, health checks, backup verification, and disaster recovery runbook Mar 1, 2026
