4 changes: 4 additions & 0 deletions .gitignore
@@ -67,3 +67,7 @@ target/
.claude/
.claude_consciousness.m8
.opencode/
.firecrawl/

# skill-creator eval/optimization artifacts
skills/*-workspace/
6 changes: 6 additions & 0 deletions README.md
@@ -9,6 +9,8 @@ AI agent skills for the [NodeOps](https://nodeops.network) ecosystem. Works with
| **createos** | Deploy anything to production on CreateOS cloud platform | `npx skills add https://github.com/NodeOps-app/skills --skill createos` |
| **vercel-to-createos** | Migrate Next.js, Vite, React, Vue, Svelte apps from Vercel to CreateOS | `npx skills add https://github.com/NodeOps-app/skills --skill vercel-to-createos` |
| **claude-code-to-codex** | Migrate Claude Code CLI hooks, MCP servers, plugins, instructions, and sessions to Codex CLI | `npx skills add https://github.com/NodeOps-app/skills --skill claude-code-to-codex` |
| **avail-validator-setup** | Stand up and activate an Avail DA validator (Docker-first) — day-0 provisioning through day-1 staking and going active, on Mainnet or Turing testnet | `npx skills add https://github.com/NodeOps-app/skills --skill avail-validator-setup` |
| **avail-validator-operate** | Day-2 ops for a live Avail DA validator — monitoring, slash-safe upgrades, key backup, chill/unbond, disaster recovery without equivocation | `npx skills add https://github.com/NodeOps-app/skills --skill avail-validator-operate` |

### Migration skills

@@ -18,6 +20,10 @@ AI agent skills for the [NodeOps](https://nodeops.network) ecosystem. Works with

`claude-code-to-codex` migrates Claude Code CLI setups to Codex CLI, with focused coverage for hooks, Claude Code CLI MCP servers, plugins, and session handoff.

### Avail validator skills

`avail-validator-setup` and `avail-validator-operate` cover the full lifecycle of an [Avail DA](https://docs.availproject.org/docs/da/operate/become-a-validator) validator, Docker-first and network-parameterized (Mainnet / Turing testnet). `avail-validator-setup` handles day-0 provisioning through day-1 session keys, bonding, and going active; `avail-validator-operate` handles day-2 monitoring, slash-safe upgrades, encrypted key backup, chill/unbond, and disaster recovery — every procedure built around avoiding equivocation/double-signing.

## CreateOS Authentication

The `createos` skill can be used in two modes:
111 changes: 111 additions & 0 deletions skills/avail-validator-operate/SKILL.md
@@ -0,0 +1,111 @@
---
name: avail-validator-operate
description: >-
Run, maintain, and protect an already-active Avail DA validator — day-2 operations.
Use this whenever the user needs to monitor an Avail validator (telemetry, Prometheus,
Grafana, alerting on missed blocks/peers/sync/era points), upgrade the node to a new
availj/avail image or release without getting slashed, back up the keystore/node key,
restore or migrate a validator after a server loss WITHOUT double-signing, chill /
stop validating cleanly (staking.chill), unbond, or handle equivocation/slashing risk
and disaster recovery. Triggers on phrases like "monitor my avail validator", "set up
grafana for avail", "upgrade avail node safely", "avail validator slashed", "back up
avail keystore", "migrate avail validator to new server", "stop validating avail",
"chill my avail validator", "avail node equivocation", "restore avail validator".
For first-time setup, session-key generation, bonding and going active, use the
avail-validator-setup skill instead.
---

# Avail Validator — Operate (Day 2)

Keep an active Avail validator healthy and **avoid the one class of mistake that gets
you slashed: equivocation (double-signing)**. Equivocation slashes the validator *and*
its nominators, so every procedure here is shaped around the rule:

> **The same session keystore must never be active on two running nodes at once.**

Docker-first. One parameterized path covers Mainnet and Turing. Network-specific URLs
and economics (era ≈ 24 h, 28-day unbond, reward lag) are in
`avail-validator-setup/references/networks.md` — reuse it; don't restate values.

## What day-2 covers

| Task | Read |
|---|---|
| Monitoring & alerting | `references/monitoring.md` |
| Node upgrade (safe vs fast) | `references/upgrade.md` |
| Backup of secrets | `references/backup-recovery.md` |
| Disaster recovery / server migration | `references/backup-recovery.md` |
| Chill / unbond / withdraw | `references/chill-unbond.md` |
| Slashing & equivocation model | this file + `references/chill-unbond.md` |

Always identify the network and the running container first:

```bash
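CID=$(docker ps -lq)                       # most recently created container, running or not
docker ps --filter "id=$CID"               # verify it is the validator and is actually running
docker exec "$CID" ls /da/node-data/chains # confirms chain dir / network
docker logs --tail 30 "$CID"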
```

## Monitoring (do this on day 1 of day-2)

A validator you can't observe is a validator you can't protect. Stand up the metrics
stack and alerts before anything else. Full configs (telemetry flag, `prometheus.yml`,
Grafana install, the official dashboard JSON) and the alert thresholds that actually
matter are in `references/monitoring.md`.

Alert, at minimum, on: node down / not on telemetry, **finalized height not
advancing**, **peer count low**, sync falling behind tip, **missed blocks / era points
dropping**, and version drift from the latest release. A full session unresponsive →
involuntary chill; >10 % of validators offline together in an epoch → all slashed.

## Upgrades — the equivocation trap

`docker pull` + recreate is fine for a **full/RPC node**. For an **active validator**
it risks: (a) DB corruption → prolonged downtime → ejection from the active set, and
(b) — if you "just spin up the new one alongside the old" — **double-signing**.

Two procedures, in `references/upgrade.md`:

- **Fast (acceptable downtime, single box):** stop container → recreate on the new
pinned tag with the same volume → verify it resumes authoring. Brief downtime, no
equivocation because the old node is stopped first.
- **Slow & safe (zero downtime, two boxes):** stand up Node B on the new version,
`author_rotateKeys` on **B**, submit the new keys via **Set Session Key**, wait for
block production to move to B (confirm by **logs**, not the UI), *then* and only then
stop Node A. Never have both authoring with the same keys.

`scripts/safe-upgrade.sh` walks the fast path with the stop-before-start ordering
enforced. Read `references/upgrade.md` before using it.
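
The two guards that matter on the fast path, a pinned tag and a confirmed-stopped old container, can be sketched as follows. This is an illustration of the ordering, not the contents of `scripts/safe-upgrade.sh`. The function names are hypothetical, and the container and image names are supplied by the caller.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and do not come from the repo's scripts.

# True when the image reference has an explicit tag other than "latest", or a digest.
# Only the last path component is inspected, so "registry:5000/img" (a port, not a
# tag) counts as unpinned.
is_pinned_image() {
  local ref="$1" last
  case "$ref" in *@sha256:*) return 0 ;; esac
  last="${ref##*/}"
  case "$last" in
    *:latest) return 1 ;;
    *:?*)     return 0 ;;
    *)        return 1 ;;
  esac
}

# True when the named container is not running (or does not exist).
container_is_stopped() {
  [ "$(docker inspect -f '{{.State.Running}}' "$1" 2>/dev/null)" != "true" ]
}

# Stop the old container and refuse to continue until it is confirmed down.
stop_and_confirm() {
  local name="$1"
  docker stop "$name" >/dev/null || return 1
  container_is_stopped "$name" || { echo "still running: $name" >&2; return 1; }
}
```

A caller would run `is_pinned_image "$NEW_IMAGE" && stop_and_confirm "$OLD_NAME"` and start the replacement only if both succeed, so a failed stop can never lead to two nodes sharing a keystore.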

## Backups

`db/` is re-syncable and holds no secret, so don't fixate on it. The only irreplaceable
on-box material is `keystore/` (session keys) and `network/` (node key). Back them up
**encrypted and off-box**, immediately and after any key rotation.
`scripts/backup-keys.sh` produces an encrypted archive. Procedure + restore in
`references/backup-recovery.md`.

## Disaster recovery — without slashing yourself

Losing the server is survivable; **restoring keys onto a new box while the old one
might still be running is not** — that double-signs. The safe recovery paths
(old-node-definitively-dead vs rotate-to-new-keys) are in
`references/backup-recovery.md`. When in doubt, rotate to **new** session keys via
`setKeys` rather than restoring the old keystore — new keys can't equivocate against
the old.

## Chill / unbond / exit

Stopping cleanly is `staking.chill` (UI or extrinsic), **signed by the controller**,
effective **next era**; funds stay bonded. Unbond → **28-day** lock → withdraw.
Step-by-step, plus the difference between voluntary and involuntary chill and the
slashing conditions, in `references/chill-unbond.md`.

## Slashing facts to act on

- Equivocation (two blocks in the same slot, or conflicting GRANDPA votes) → slash for
validator **and** nominators. Usually self-inflicted by running duplicate keys.
- Slash shows immediately on the staking UI's slashes page, but the **financial
deduction is delayed by several days** (governance can reverse it). "Not deducted yet" ≠ "safe".
- Involuntary chill (offline, <10 % of set) → no slash; ≥10 % offline together →
slash. Uptime monitoring is a slashing-prevention control, not a nicety.
35 changes: 35 additions & 0 deletions skills/avail-validator-operate/evals/evals.json
@@ -0,0 +1,35 @@
{
"skill_name": "avail-validator-operate",
"evals": [
{
"id": 0,
"name": "safe-upgrade-no-equivocation",
"prompt": "My Avail validator on mainnet is active and producing blocks (availj/avail in Docker). A new release just dropped and I need to upgrade. I'm paranoid about getting slashed for double-signing. What's the safe upgrade procedure, and how is it different from just pulling the new image and restarting?",
"expected_output": "Equivocation explanation, fast stop-before-start path, slow two-box rotate-keys path, anti-patterns, high-stake recommendation, tag verification.",
"files": [],
"assertions": [
"Explicitly explains that double-signing/equivocation slashes the validator and its nominators, and why naive pull+restart is risky",
"Fast single-box path stops and confirms the old container is down BEFORE starting the new one, reusing the same volume and node name",
"Slow zero-downtime path: Node B on new version, author_rotateKeys for new keys, setKeys via controller, migrate confirmed via logs not UI, then stop Node A",
"Names the anti-pattern of running the old and new nodes simultaneously with the same/copied keystore",
"Recommends the slow two-box path for a high-stake active mainnet validator",
"Says to pin/verify the new image tag and not use :latest"
]
},
{
"id": 1,
"name": "disaster-recovery-no-double-sign",
"prompt": "Disaster: the server running my active Avail mainnet validator just died (cloud instance gone). I have an encrypted backup of the keystore and network folders. How do I get back to validating WITHOUT equivocating? I'm not 100% sure the old instance is truly dead.",
"expected_output": "Equivocation rule, rotate-to-new-keys Path B due to uncertain old node, Path A only if old definitively dead, db re-syncable vs keystore/network secrets, stash/controller from seed.",
"files": [],
"assertions": [
"States that restoring the keystore while the old node may still be running causes double-signing = slashing",
"Because old-node status is uncertain, prescribes rotating to NEW session keys (Path B) rather than restoring the old keystore",
"Says restoring the old keystore (Path A) is acceptable only if the old instance is definitively destroyed",
"Notes db is re-syncable and only keystore + network are the irreplaceable secrets",
"States stash/controller are wallet keys recovered from seed/hardware, not from the server backup",
"New keys are registered on-chain via setKeys signed by the controller, with activation confirmed via logs"
]
}
]
}
80 changes: 80 additions & 0 deletions skills/avail-validator-operate/references/backup-recovery.md
@@ -0,0 +1,80 @@
# Avail validator backup & disaster recovery

## What to back up (and what not to)

| Path (under `<base>/chains/<chainid>/`) | Back up? | Why |
|---|---|---|
| `keystore/` | **Yes, encrypted** | Session keys — irreplaceable, equivocation-critical |
| `network/` | **Yes** | Node key / libp2p identity |
| `db/` | No | Re-syncable from genesis or snapshot; contains no secret |

In Docker the base is `/da/node-data`; discover the chain dir
(`docker exec <CID> ls /da/node-data/chains`) — its name varies by node version.

## Backup procedure

Take a backup right after going active and after **every** key rotation. It must be
**encrypted** and stored **off the validator box**. `scripts/backup-keys.sh` does this:
it `tar`s `keystore/` + `network/` and encrypts with `age` (or `gpg` fallback).

Manual equivalent:

```bash
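CID=$(docker ps -lq)   # most recently created container; confirm it is the validator
CHAIN_DIR=$(docker exec "$CID" sh -c 'ls -d /da/node-data/chains/*' | head -1)
# AGE_RECIPIENT must hold your age public key; a literal <placeholder> here would be
# parsed by the shell as a redirection and fail
: "${AGE_RECIPIENT:?set AGE_RECIPIENT to your age public key}"
docker exec "$CID" tar -C "$CHAIN_DIR" -czf - keystore network \
  | age -r "$AGE_RECIPIENT" > "avail-keys-$(date +%F).tar.gz.age"
# move the .age file off-box (it is the validator's identity, so guard it)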
```

Never store the archive unencrypted, and never store it on the same machine only.
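
A backup that has never been test-decrypted is only a hope. The sketch below checks that an archive decrypts and contains both directories without writing plaintext to disk. It assumes the archive was made with `age` as in the manual procedure. The archive and identity-file paths are placeholders supplied by the operator, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Reads a tar listing on stdin; succeeds only if both keystore/ and network/
# entries are present.
listing_has_secrets() {
  local line has_keystore=0 has_network=0
  while IFS= read -r line; do
    case "${line#./}" in
      keystore|keystore/*) has_keystore=1 ;;
      network|network/*)   has_network=1 ;;
    esac
  done
  [ "$has_keystore" -eq 1 ] && [ "$has_network" -eq 1 ]
}

# Decrypt to a pipe (no plaintext on disk) and inspect the listing.
# $1 = encrypted archive, $2 = age identity (private key) file.
verify_backup() {
  local archive="$1" identity="$2"
  age -d -i "$identity" "$archive" | tar -tzf - | listing_has_secrets
}
```

Run `verify_backup <archive> <identity-file>` on a machine other than the validator, so the check also proves the off-box copy and the decryption key are both usable.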

## Re-sync the DB (no secrets involved)

If only the DB is bad (corruption, disk), you do **not** need keys back — keep the
keystore in place and rebuild state:

```bash
# stop node, then purge chain data and let it re-sync
avail purge-chain # binary form; in Docker: stop container, delete db/ in the volume, restart
```

Or restore from a trusted DB snapshot to skip a long genesis sync (warp sync is not
available). Trust the snapshot source.

## Disaster recovery — the rule that prevents self-slashing

> Restoring the keystore onto a new node **while the old node is or might still be
> running** double-signs → equivocation → slash (validator **and** nominators).

Choose the safe path:

### Path A — old node is definitively dead
Use only when you are *certain* the old machine can never produce blocks again
(destroyed/wiped, disk pulled, account access revoked — not merely "I think it's off").

1. Provision a fresh node (setup skill), same `--chain`/`--name`, let it sync.
2. Stop it. Restore `keystore/` + `network/` from the encrypted backup into the
volume's `chains/<chainid>/`.
3. Start it. It resumes the **same** validator identity. Confirm authoring via logs.

### Path B — old node status uncertain (preferred default)
If there is *any* doubt the old node is gone, do **not** restore the old keystore.
Instead rotate to **new** keys — new keys cannot equivocate against the old:

1. Provision a fresh node, sync it.
2. `author_rotateKeys` on the new node (new session keys).
3. **Set Session Key** to the new hex (controller-signed) via the staking UI.
4. Wait for authoring to move to the new node — confirm by **logs**, not the UI.
5. The old node, even if it later comes back, is signing with keys no longer
registered on-chain → it cannot equivocate. Decommission it when reachable.

Path B trades nothing meaningful (the validator account/stake is unchanged — only the
session keys rotate) for complete equivocation safety. Default to it.
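
Step 2 of Path B can be sketched as a plain JSON-RPC call. This assumes the new node's RPC endpoint is reachable at `http://127.0.0.1:9944` and permits `author_rotateKeys`; Substrate-based nodes typically restrict that method to local callers. The endpoint and the helper names are assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: RPC_URL is an assumption; point it at the NEW node, never the old one.
RPC_URL="${RPC_URL:-http://127.0.0.1:9944}"

# Build a parameterless JSON-RPC 2.0 request body for the given method.
rpc_payload() {
  printf '{"jsonrpc":"2.0","id":1,"method":"%s","params":[]}' "$1"
}

# Ask the node to generate fresh session keys. Prints the raw JSON response;
# its "result" field is the hex string to submit via Set Session Key.
rotate_keys() {
  curl -sf --max-time 10 -H 'Content-Type: application/json' \
    -d "$(rpc_payload author_rotateKeys)" "$RPC_URL"
}
```

Each call to `rotate_keys` generates a new key set in that node's keystore, so call it once, record the result, and take a fresh encrypted backup afterwards.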

## Stash / controller recovery

The stash and controller are **wallet** keys, never on the box — recover them from the
operator's seed/hardware wallet, not from server backups. If the controller seed is
compromised, the stash funds are still safe (separation), but rotate the controller and
re-`setKeys`/`validate` from the new controller promptly.
60 changes: 60 additions & 0 deletions skills/avail-validator-operate/references/chill-unbond.md
@@ -0,0 +1,60 @@
# Chill, unbond, exit — Avail validator

Stopping validation cleanly is a staking action, not a server action. Killing the
container alone does **not** chill you — you'd be an offline validator (involuntary
chill, possible slash if many are offline together). Always chill on-chain *first*,
then it's safe to stop the node.

## Chill (stop validating, keep funds bonded)

`staking.chill` removes you from the active/waiting set without unbonding.

- **Where:** staking actions UI (network URL in
`avail-validator-setup/references/networks.md`) → your account → **Stop**, or submit
the `staking.chill` extrinsic directly.
- **Signed by:** the **controller** account (not the stash).
- **Effective:** next era (~24 h). Funds remain bonded; you simply stop being
selectable for new/revised nominations.
- After chill takes effect (confirm you're out of the active set on the dashboard and
logs no longer show `🎁 Prepared block for proposing`), it is safe to stop/decommission
the node.
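
The log half of that confirmation can be sketched as a count of authoring lines over a recent window. The marker string comes from the bullet above. The two-hour default window and the function names are assumptions. A quiet window alone is not proof, because an active validator can also go a while without a slot, so pair it with the dashboard check.

```shell
#!/usr/bin/env bash
# Counts stdin lines containing the block-authoring marker.
# grep -c exits non-zero on zero matches, so the status is normalised.
count_authored() {
  grep -c 'Prepared block for proposing' || true
}

# True when the container's recent logs show no authoring.
# $1 = container ID or name, $2 = look-back window (assumed default: 2h).
authoring_has_stopped() {
  local cid="$1" window="${2:-2h}"
  [ "$(docker logs --since "$window" "$cid" 2>&1 | count_authored)" -eq 0 ]
}
```

For example, `authoring_has_stopped "$CID" 6h && echo "no authoring in the last 6h"` gives a stronger signal than the default window.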

### Voluntary vs involuntary chill
- **Voluntary:** you called `chill`. Clean. No slash.
- **Involuntary:** the network chilled you for being unresponsive a full session. No
slash by itself — but if ≥10% of validators are offline together in an epoch, that
whole group is slashed. So "I'll just turn it off" is risky; chill explicitly.

## Unbond (start releasing the stake)

After chilling, to free the bonded funds:

1. `staking.unbond` the amount (controller-signed).
2. **28-day** unbonding lock — funds are non-transferable during this period.
3. After 28 days, `staking.withdrawUnbonded` to make them transferable.

You can chill without unbonding (pause validating, keep stake) or unbond a partial
amount and keep validating with the rest (as long as you stay above the waiting-list
floor — see networks.md economics).
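
To put the end of the lock on a calendar, the 28-day figure can be turned into a date. This is a rough sketch: it treats the lock as exactly 28 × 24 h from the submission date, whereas the chain counts eras, so read the result as an approximate earliest date. It relies on GNU `date`; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Approximate earliest withdrawal date for an unbond submitted on the given
# UTC date (YYYY-MM-DD). Requires GNU date.
UNBOND_DAYS=28

unbond_ready_date() {
  local start_epoch
  start_epoch=$(date -u -d "$1" +%s) || return 1
  date -u -d "@$((start_epoch + UNBOND_DAYS * 86400))" +%F
}
```

For example, `unbond_ready_date "$(date -u +%F)"` prints the date to set a reminder for when unbonding today.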

## Full exit checklist

1. `staking.chill` (controller) → wait one era, confirm out of active set via logs +
dashboard.
2. Stop & decommission the node container.
3. `staking.unbond` the full bonded amount (controller).
4. Wait 28 days.
5. `staking.withdrawUnbonded` (controller). Funds now transferable from the stash.
6. Securely destroy the on-box `keystore/` only after you're certain you won't rejoin
with the same identity (otherwise keep the encrypted backup).

## Slashing context (why the order matters)

- Equivocation slashes regardless of chill status — it's about duplicate signing, so
don't run the old node again with live keys after migrating.
- Slash appears immediately on the staking UI slashes page; the **financial deduction
is delayed several days** and governance can reverse it. Don't assume safety from
"balance not changed yet."
- Chilling promptly when you know you'll be offline (maintenance, migration) converts a
potential slash scenario into a clean no-penalty exit. Treat chill as the standard
pre-maintenance step for anything that risks a full session of downtime.