
Add validator update rollback, health checks, backup verification, and disaster recovery runbook#105

Draft

Copilot wants to merge 2 commits into main from copilot/add-validator-update-rollback

Conversation


Copilot AI commented Mar 1, 2026

Validator update scripts had no rollback path, no post-deploy health gate, no pre-flight guards, and backups were never verified — any failed update required manual recovery under pressure with no documented procedure.

Update Scripts (update-validator-mainnet.sh, update-validator-testnet.sh)

  • Pre-flight checks — validates required tools (wget, tar, sha256sum, tree, jq, curl) and ≥2 GB free disk space before touching anything
  • Versioned backup — snapshots current binary dir + subnet-evm plugin to ~/validator-backups/validator-backup-pre-<VERSION>-<timestamp>.tar.gz with immediate tar -tzf integrity check before overwriting
  • --rollback flag — restores the most recent pre-update backup (with integrity verification) and prompts manual restart
  • --dry-run flag — runs all checks and reports intended actions without downloading or modifying files
  • Automated health check — polls 127.0.0.1:9650/ext/health every 5 s for up to 60 s post-update using jq; triggers automatic rollback and non-zero exit on timeout
# Rollback a failed update
./chains/update-validator-mainnet.sh --rollback

# Dry-run to validate environment before maintenance window
./chains/update-validator-mainnet.sh --dry-run
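The pre-flight guard described above could be sketched roughly as follows. Function and variable names here are illustrative, not the scripts' actual internals; the disk-space threshold is passed in so the same check works for any minimum:

```shell
#!/usr/bin/env sh
# Hypothetical sketch of the pre-flight guard; names are illustrative.
# Usage: preflight <min_free_kb> <tool>...
preflight() {
  min_kb="$1"; shift
  for tool in "$@"; do
    # Abort early if any required tool is missing from PATH.
    command -v "$tool" >/dev/null 2>&1 || {
      echo "pre-flight: missing required tool: $tool" >&2; return 1;
    }
  done
  # POSIX df -Pk reports available space in 1K blocks (column 4).
  free_kb=$(df -Pk "${HOME:-/}" | awk 'NR==2 {print $4}')
  [ "${free_kb:-0}" -ge "$min_kb" ] || {
    echo "pre-flight: need at least ${min_kb} KB free" >&2; return 1;
  }
}

# Example: require ~2 GB free plus the tools the update scripts use.
# preflight 2097152 wget tar sha256sum tree jq curl || exit 1
```

Running the guard before any download or file modification is what keeps a failed check side-effect free.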

Backup Script (backup-validator.sh)

Added tar -tzf integrity verification immediately after archive creation to surface silent corruption at backup time rather than recovery time.
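The verification step amounts to listing the archive immediately after writing it. A self-contained sketch (the paths below are examples only; the real script writes to ~/validator-backups/):

```shell
#!/usr/bin/env sh
set -eu
# Example paths only; the real script uses the node's actual data dirs.
work="${TMPDIR:-/tmp}/backup-verify-demo"
mkdir -p "$work/node-data"
echo '{"state":"example"}' > "$work/node-data/state.json"

archive="$work/validator-backup-demo.tar.gz"
tar -czf "$archive" -C "$work" node-data

# tar -tzf decompresses and lists every entry, so a truncated or
# corrupted gzip stream fails here -- at backup time, not recovery time.
if tar -tzf "$archive" >/dev/null 2>&1; then
  echo "backup verified: $archive"
else
  echo "backup corrupt: $archive" >&2
  exit 1
fi
```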

DISASTER_RECOVERY.md (new)

Runbook covering:

  • Restoring validator state from backup
  • Manual rollback path when automated rollback fails
  • Staking key compromise: isolation, key rotation, Node ID replacement
  • Emergency subnet governance: removing/adding validators via subnet-cli
  • Incident communication templates and post-incident review process
Original prompt

This section details the original issue you should resolve

<issue_title>[Feature][High] Add validator update rollback, post-deployment health checks, and disaster recovery runbook</issue_title>
<issue_description>## Summary

The validator update workflow lacks critical operational resilience features that could lead to prolonged outages during failed updates.

Findings

1. No Rollback Mechanism in Update Scripts

  • Files: chains/update-validator-mainnet.sh, chains/update-validator-testnet.sh
  • Issue: Scripts copy old run.sh for reference but provide no automated rollback if the new version fails to start or sync. The old binary is overwritten in place.
  • Impact: Manual recovery required if an update breaks the validator; potential prolonged downtime.
  • Recommendation: Before updating, snapshot the current binary and configs. Add a --rollback flag that restores the previous version. Verify node health after update before discarding old version.
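The snapshot-then-restore flow recommended here could be sketched as follows; the directory layout and file names are hypothetical stand-ins for the node's real binary and backup locations:

```shell
#!/usr/bin/env sh
set -eu
# Hypothetical layout; the real scripts use ~/validator-backups/ and
# the node's actual binary directory.
bin_dir="${TMPDIR:-/tmp}/demo-node/bin"
backup_dir="${TMPDIR:-/tmp}/demo-node/backups"
mkdir -p "$bin_dir" "$backup_dir"
echo "old-binary-v1" > "$bin_dir/avalanchego"

# Snapshot before update: versioned, timestamped, verified.
snap="$backup_dir/pre-update-$(date +%Y%m%d%H%M%S).tar.gz"
tar -czf "$snap" -C "$(dirname "$bin_dir")" "$(basename "$bin_dir")"
tar -tzf "$snap" >/dev/null  # refuse to proceed on a corrupt snapshot

# A failed update overwrites the binary in place...
echo "broken-binary-v2" > "$bin_dir/avalanchego"

# ...and rollback restores the newest verified snapshot over it.
latest=$(ls -t "$backup_dir"/pre-update-*.tar.gz | head -n 1)
tar -xzf "$latest" -C "$(dirname "$bin_dir")"
grep -q "old-binary-v1" "$bin_dir/avalanchego" && echo "rollback ok"
```

Keeping snapshots timestamped means the rollback path can always pick the most recent pre-update state without extra bookkeeping.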

2. No Automated Health Check After Validator Update

  • Files: chains/update-validator-mainnet.sh (lines 46-56), chains/update-validator-testnet.sh (lines 46-56)
  • Issue: Scripts print a reminder to check health manually but do not programmatically verify the node started successfully (e.g., calling health.health.sh and checking the response).
  • Impact: Broken validators may go undetected until external monitoring catches it.
  • Recommendation: Add a post-deployment health check loop that polls the local health endpoint with a timeout (e.g., 60s). Exit with error if health check fails, triggering rollback.

3. No Disaster Recovery Documentation

  • Issue: No documented procedures for restoring a validator from backup, handling key compromise, or performing emergency subnet governance actions.
  • Impact: During a real incident, operators must improvise recovery steps under pressure.
  • Recommendation: Add a DISASTER_RECOVERY.md runbook covering: backup restoration, key rotation, emergency governance, and communication procedures.

4. No Backup Verification

  • File: chains/backup-validator.sh (lines 1-11)
  • Issue: Creates tar.gz backup but never verifies archive integrity with tar -tzf.
  • Impact: Silent backup corruption could lead to unrecoverable validator state.

5. Missing Pre-flight Checks in Update Scripts

  • Files: Both update scripts
  • Issue: No checks for sufficient disk space, running processes that need stopping, or required tools (tree, jq, sha256sum).
  • Impact: Updates fail mid-process leaving the node in an inconsistent state.

Suggested Approach

  1. Refactor update scripts to create a versioned backup before overwriting
  2. Add --rollback and --dry-run flags
  3. Integrate health.health.sh as a post-deployment gate
  4. Add pre-flight dependency and disk space checks
  5. Write a comprehensive disaster recovery runbook
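Step 2's flag handling can stay minimal. A sketch (the merged scripts may structure this differently):

```shell
#!/usr/bin/env sh
# Illustrative flag parsing for the update scripts: one mode variable
# drives whether the run updates, rolls back, or only reports.
MODE="update"
for arg in "$@"; do
  case "$arg" in
    --rollback) MODE="rollback" ;;
    --dry-run)  MODE="dry-run" ;;
    *) echo "unknown option: $arg" >&2; exit 2 ;;
  esac
done
echo "mode: $MODE"
```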

Generated by Health Monitor with Omni</issue_description>

Comments on the Issue (you are @copilot in this section)




Co-authored-by: numbers-official <181934381+numbers-official@users.noreply.github.com>
Copilot AI changed the title [WIP] Add rollback mechanism and health checks for validator updates Add validator update rollback, health checks, backup verification, and disaster recovery runbook Mar 1, 2026
