NodeOps-app · pratikbin · May 19, 2026
diff --git a/.gitignore b/.gitignore
@@ -67,3 +67,7 @@ target/
 .claude/
 .claude_consciousness.m8
 .opencode/
+.firecrawl/
+
+# skill-creator eval/optimization artifacts
+skills/*-workspace/
diff --git a/README.md b/README.md
@@ -9,6 +9,8 @@ AI agent skills for the [NodeOps](https://nodeops.network) ecosystem. Works with
 | **createos** | Deploy anything to production on CreateOS cloud platform | `npx skills add https://github.com/NodeOps-app/skills --skill createos` |
 | **vercel-to-createos** | Migrate Next.js, Vite, React, Vue, Svelte apps from Vercel to CreateOS | `npx skills add https://github.com/NodeOps-app/skills --skill vercel-to-createos` |
 | **claude-code-to-codex** | Migrate Claude Code CLI hooks, MCP servers, plugins, instructions, and sessions to Codex CLI | `npx skills add https://github.com/NodeOps-app/skills --skill claude-code-to-codex` |
+| **avail-validator-setup** | Stand up and activate an Avail DA validator (Docker-first) — day-0 provisioning through day-1 staking and going active, on Mainnet or Turing testnet | `npx skills add https://github.com/NodeOps-app/skills --skill avail-validator-setup` |
+| **avail-validator-operate** | Day-2 ops for a live Avail DA validator — monitoring, slash-safe upgrades, key backup, chill/unbond, disaster recovery without equivocation | `npx skills add https://github.com/NodeOps-app/skills --skill avail-validator-operate` |
 
 ### Migration skills
 
@@ -18,6 +20,10 @@ AI agent skills for the [NodeOps](https://nodeops.network) ecosystem. Works with
 
 `claude-code-to-codex` migrates Claude Code CLI setups to Codex CLI, with focused coverage for hooks, Claude Code CLI MCP servers, plugins, and session handoff.
 
+### Avail validator skills
+
+`avail-validator-setup` and `avail-validator-operate` cover the full lifecycle of an [Avail DA](https://docs.availproject.org/docs/da/operate/become-a-validator) validator, Docker-first and network-parameterized (Mainnet / Turing testnet). `avail-validator-setup` handles day-0 provisioning through day-1 session keys, bonding, and going active; `avail-validator-operate` handles day-2 monitoring, slash-safe upgrades, encrypted key backup, chill/unbond, and disaster recovery — every procedure built around avoiding equivocation/double-signing.
+
 ## CreateOS Authentication
 
 The `createos` skill can be used in two modes:

diff --git a/skills/avail-validator-operate/SKILL.md b/skills/avail-validator-operate/SKILL.md
@@ -0,0 +1,111 @@
+---
+name: avail-validator-operate
+description: >-
+  Run, maintain, and protect an already-active Avail DA validator — day-2 operations.
+  Use this whenever the user needs to monitor an Avail validator (telemetry, Prometheus,
+  Grafana, alerting on missed blocks/peers/sync/era points), upgrade the node to a new
+  availj/avail image or release without getting slashed, back up the keystore/node key,
+  restore or migrate a validator after a server loss WITHOUT double-signing, chill /
+  stop validating cleanly (staking.chill), unbond, or handle equivocation/slashing risk
+  and disaster recovery. Triggers on phrases like "monitor my avail validator", "set up
+  grafana for avail", "upgrade avail node safely", "avail validator slashed", "back up
+  avail keystore", "migrate avail validator to new server", "stop validating avail",
+  "chill my avail validator", "avail node equivocation", "restore avail validator".
+  For first-time setup, session-key generation, bonding and going active, use the
+  avail-validator-setup skill instead.
+---
+
+# Avail Validator — Operate (Day 2)
+
+Keep an active Avail validator healthy and **avoid the one class of mistake that gets
+you slashed: equivocation (double-signing)**. Equivocation slashes the validator *and*
+its nominators, so every procedure here is shaped around the rule:
+
+> **The same session keystore must never be active on two running nodes at once.**
+
+Docker-first. One parameterized path covers Mainnet and Turing. Network-specific URLs
+and economics (era ≈ 24 h, 28-day unbond, reward lag) are in
+`avail-validator-setup/references/networks.md` — reuse it; don't restate values.
+
+## What day-2 covers
+
+| Task | Read |
+|---|---|
+| Monitoring & alerting | `references/monitoring.md` |
+| Node upgrade (safe vs fast) | `references/upgrade.md` |
+| Backup of secrets | `references/backup-recovery.md` |
+| Disaster recovery / server migration | `references/backup-recovery.md` |
+| Chill / unbond / withdraw | `references/chill-unbond.md` |
+| Slashing & equivocation model | this file + `references/chill-unbond.md` |
+
+Always identify the network and the running container first:
+
+```bash
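+# -l picks the most recently created container; if other containers exist on the
+# host, set CID to the validator container's name or ID instead.
+CID=$(docker ps -lq)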
+docker exec "$CID" ls /da/node-data/chains   # confirms chain dir / network
+docker logs --tail 30 "$CID"
+```
+
+## Monitoring (do this on day 1 of day-2)
+
+A validator you can't observe is a validator you can't protect. Stand up the metrics
+stack and alerts before anything else. Full configs (telemetry flag, `prometheus.yml`,
+Grafana install, the official dashboard JSON) and the alert thresholds that actually
+matter are in `references/monitoring.md`.
+
+Alert, at minimum, on: node down / not on telemetry, **finalized height not
+advancing**, **peer count low**, sync falling behind tip, **missed blocks / era points
+dropping**, and version drift from the latest release. A full session unresponsive →
+involuntary chill; ≥10 % of validators offline together in an epoch → all slashed.
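One way to check the "finalized height not advancing" condition is to compare two samples of the finalized block number taken some interval apart. The sketch below is a minimal illustration, not part of this skill's scripts: the helper names `hex_to_dec` and `finality_status` are hypothetical, and it assumes the samples arrive as hex strings, which is how Substrate-style RPC headers report the block `number`.

```bash
# Hypothetical helpers (not from the skill's scripts).

# Convert a hex block number such as 0x1a2b to decimal.
hex_to_dec() {
  printf '%d\n' "$1"
}

# Compare two finalized-height samples (decimal) taken some interval apart.
# Prints "advancing" if the second sample is higher, otherwise "stalled".
finality_status() {
  prev=$1
  cur=$2
  if [ "$cur" -gt "$prev" ]; then
    echo advancing
  else
    echo stalled
  fi
}
```

For example, `finality_status 100 105` prints `advancing`, while two equal samples print `stalled`, which is the condition worth alerting on.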
+
+## Upgrades — the equivocation trap
+
+`docker pull` + recreate is fine for a **full/RPC node**. For an **active validator**
+it carries two risks: (a) DB corruption → prolonged downtime → ejection from the active
+set, and (b) **double-signing**, if you "just spin up the new one alongside the old".
+
+Two procedures, in `references/upgrade.md`:
+
+- **Fast (acceptable downtime, single box):** stop container → recreate on the new
+  pinned tag with the same volume → verify it resumes authoring. Brief downtime, no
+  equivocation because the old node is stopped first.
+- **Slow & safe (zero downtime, two boxes):** stand up Node B on the new version,
+  `author_rotateKeys` on **B**, submit the new keys via **Set Session Key**, wait for
+  block production to move to B (confirm by **logs**, not the UI), *then* and only then
+  stop Node A. Never have both authoring with the same keys.
+
+`scripts/safe-upgrade.sh` walks the fast path with the stop-before-start ordering
+enforced. Read `references/upgrade.md` before using it.
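The stop-before-start ordering can be illustrated with a small guard. This is a sketch under stated assumptions, not the contents of `scripts/safe-upgrade.sh`: the function name `old_node_is_stopped` and the `OLD` variable are hypothetical, and the guard takes the text printed by `docker inspect -f '{{.State.Running}}'` as its input.

```bash
# Hypothetical guard (not the contents of scripts/safe-upgrade.sh).
# Input: the output of `docker inspect -f '{{.State.Running}}' <old-container>`.
# Succeeds only when Docker explicitly reports the old container as not running;
# anything else (true, empty output, an error message) is treated as unsafe.
old_node_is_stopped() {
  [ "$1" = "false" ]
}

# Usage sketch (OLD is an assumed variable holding the old container's name or ID):
#   docker stop "$OLD"
#   state=$(docker inspect -f '{{.State.Running}}' "$OLD" 2>/dev/null)
#   old_node_is_stopped "$state" || { echo "old node not confirmed stopped" >&2; exit 1; }
#   ...only now create the container on the new pinned tag...
```

Treating every answer other than an explicit `false` as unsafe is a deliberately conservative choice: an unknown state blocks the upgrade instead of risking two nodes authoring with the same keys.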
+
+## Backups
+
+`db` is re-syncable and holds no secret — don't fixate on it. The only irreplaceable
+on-box material is `keystore/` (session keys) and `network/` (node key). Back them up
+**encrypted and off-box**, immediately and after any key rotation.
+`scripts/backup-keys.sh` produces an encrypted archive. Procedure + restore in
+`references/backup-recovery.md`.
+
+## Disaster recovery — without slashing yourself
+
+Losing the server is survivable; **restoring keys onto a new box while the old one
+might still be running is not** — that double-signs. The safe recovery paths
+(old-node-definitively-dead vs rotate-to-new-keys) are in
+`references/backup-recovery.md`. When in doubt, generate **new** session keys with
+`author_rotateKeys` and register them via `setKeys` rather than restoring the old
+keystore: new keys can't equivocate against the old.
+
+## Chill / unbond / exit
+
+Stopping cleanly is `staking.chill` (UI or extrinsic), **signed by the controller**,
+effective **next era**; funds stay bonded. Unbond → **28-day** lock → withdraw.
+Step-by-step, plus the difference between voluntary and involuntary chill and the
+slashing conditions, in `references/chill-unbond.md`.
+
+## Slashing facts to act on
+
+- Equivocation (two blocks same slot, or conflicting GRANDPA votes) → slash for
+  validator **and** nominators. Usually self-inflicted by running duplicate keys.
+- Slash shows immediately on the staking UI's slashes page, but the **financial
+  deduction is delayed by several days** (governance can reverse it). "Not deducted yet" ≠ "safe".
+- Involuntary chill (offline, <10 % of set) → no slash; ≥10 % offline together →
+  slash. Uptime monitoring is a slashing-prevention control, not a nicety.
diff --git a/skills/avail-validator-operate/evals/evals.json b/skills/avail-validator-operate/evals/evals.json
@@ -0,0 +1,35 @@
+{
+  "skill_name": "avail-validator-operate",
+  "evals": [
+    {
+      "id": 0,
+      "name": "safe-upgrade-no-equivocation",
+      "prompt": "My Avail validator on mainnet is active and producing blocks (availj/avail in Docker). A new release just dropped and I need to upgrade. I'm paranoid about getting slashed for double-signing. What's the safe upgrade procedure, and how is it different from just pulling the new image and restarting?",
+      "expected_output": "Equivocation explanation, fast stop-before-start path, slow two-box rotate-keys path, anti-patterns, high-stake recommendation, tag verification.",
+      "files": [],
+      "assertions": [
+        "Explicitly explains that double-signing/equivocation slashes the validator and its nominators, and why naive pull+restart is risky",
+        "Fast single-box path stops and confirms the old container is down BEFORE starting the new one, reusing the same volume and node name",
+        "Slow zero-downtime path: Node B on new version, author_rotateKeys for new keys, setKeys via controller, migrate confirmed via logs not UI, then stop Node A",
+        "Names the anti-pattern of running the old and new nodes simultaneously with the same/copied keystore",
+        "Recommends the slow two-box path for a high-stake active mainnet validator",
+        "Says to pin/verify the new image tag and not use :latest"
+      ]
+    },
+    {
+      "id": 1,
+      "name": "disaster-recovery-no-double-sign",
+      "prompt": "Disaster: the server running my active Avail mainnet validator just died (cloud instance gone). I have an encrypted backup of the keystore and network folders. How do I get back to validating WITHOUT equivocating? I'm not 100% sure the old instance is truly dead.",
+      "expected_output": "Equivocation rule, rotate-to-new-keys Path B due to uncertain old node, Path A only if old definitively dead, db re-syncable vs keystore/network secrets, stash/controller from seed.",
+      "files": [],
+      "assertions": [
+        "States that restoring the keystore while the old node may still be running causes double-signing = slashing",
+        "Because old-node status is uncertain, prescribes rotating to NEW session keys (Path B) rather than restoring the old keystore",
+        "Says restoring the old keystore (Path A) is acceptable only if the old instance is definitively destroyed",
+        "Notes db is re-syncable and only keystore + network are the irreplaceable secrets",
+        "States stash/controller are wallet keys recovered from seed/hardware, not from the server backup",
+        "New keys are registered on-chain via setKeys signed by the controller, with activation confirmed via logs"
+      ]
+    }
+  ]
+}
diff --git a/skills/avail-validator-operate/references/backup-recovery.md b/skills/avail-validator-operate/references/backup-recovery.md
@@ -0,0 +1,80 @@
+# Avail validator backup & disaster recovery
+
+## What to back up (and what not to)
+
+| Path (under `<base>/chains/<chainid>/`) | Back up? | Why |
+|---|---|---|
+| `keystore/` | **Yes, encrypted** | Session keys — irreplaceable, equivocation-critical |
+| `network/` | **Yes** | Node key / libp2p identity |
+| `db/` | No | Re-syncable from genesis or snapshot; contains no secret |
+
+In Docker the base is `/da/node-data`; discover the chain dir
+(`docker exec <CID> ls /da/node-data/chains`) — its name varies by node version.
+
+## Backup procedure
+
+Take a backup right after going active and after **every** key rotation. It must be
+**encrypted** and stored **off the validator box**. `scripts/backup-keys.sh` does this:
+it `tar`s `keystore/` + `network/` and encrypts with `age` (or `gpg` fallback).
+
+Manual equivalent:
+
+```bash
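+CID=$(docker ps -lq)   # most recently created container; set explicitly if that is not the validator
+CHAIN_DIR=$(docker exec "$CID" sh -c 'ls -d /da/node-data/chains/*' | head -1)
+# AGE_RECIPIENT must hold your age recipient (public key); abort if it is unset
+: "${AGE_RECIPIENT:?set AGE_RECIPIENT to your age recipient}"
+docker exec "$CID" tar -C "$CHAIN_DIR" -czf - keystore network \
+  | age -r "$AGE_RECIPIENT" > "avail-keys-$(date +%F).tar.gz.age"
+# move the .age file off-box (it is the validator's identity, so guard it)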
+```
+
+Never store the archive unencrypted, and never keep the only copy on the validator machine.
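A backup is only useful if it actually contains both directories, so it can help to inspect the archive listing after creating it. The following is a minimal sketch, not part of `scripts/backup-keys.sh`: the function name `archive_has_secrets` and the `identity.txt` path are hypothetical, and it assumes the archive was created with `keystore` and `network` as top-level entries, as in the manual command above.

```bash
# Hypothetical check (not part of scripts/backup-keys.sh): read a tar listing on
# stdin and succeed only if it contains both keystore/ and network/ entries.
archive_has_secrets() {
  listing=$(cat)
  printf '%s\n' "$listing" | grep -q '^keystore/' &&
    printf '%s\n' "$listing" | grep -q '^network/'
}

# Usage sketch (identity.txt is an assumed path to your age identity file):
#   age -d -i identity.txt avail-keys-<date>.tar.gz.age | tar -tzf - | archive_has_secrets \
#     && echo "archive looks complete"
```

Run the decryption on a trusted machine, not on the validator, so that the plaintext keys never sit on a second live host.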
+
+## Re-sync the DB (no secrets involved)
+
+If only the DB is bad (corruption, disk), you do **not** need keys back — keep the
+keystore in place and rebuild state:
+
+```bash
+# Stop the node, then purge chain data and let it re-sync.
+# Binary form:
+avail purge-chain
+# Docker form: stop the container, delete db/ in the volume, then restart.
+```
+
+Or restore from a DB snapshot to skip a long genesis sync (warp sync is not
+available). Use only a snapshot from a source you trust.
+
+## Disaster recovery — the rule that prevents self-slashing
+
+> Restoring the keystore onto a new node **while the old node is or might still be
+> running** double-signs → equivocation → slash (validator **and** nominators).
+
+Choose the safe path:
+
+### Path A — old node is definitively dead
+Use only when you are *certain* the old machine can never produce blocks again
+(destroyed/wiped, disk pulled, account access revoked — not merely "I think it's off").
+
+1. Provision a fresh node (setup skill), same `--chain`/`--name`, let it sync.
+2. Stop it. Restore `keystore/` + `network/` from the encrypted backup into the
+   volume's `chains/<chainid>/`.
+3. Start it. It resumes the **same** validator identity. Confirm authoring via logs.
+
+### Path B — old node status uncertain (preferred default)
+If there is *any* doubt the old node is gone, do **not** restore the old keystore.
+Instead rotate to **new** keys — new keys cannot equivocate against the old:
+
+1. Provision a fresh node, sync it.
+2. `author_rotateKeys` on the new node (new session keys).
+3. **Set Session Key** to the new hex (controller-signed) via the staking UI.
+4. Wait for authoring to move to the new node — confirm by **logs**, not the UI.
+5. The old node, even if it later comes back, is signing with keys no longer
+   registered on-chain → it cannot equivocate. Decommission it when reachable.
+
+Path B costs nothing meaningful: the validator account and stake are unchanged, and
+only the session keys rotate. In return it removes the equivocation risk. Default to it.
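Step 2 of Path B is a JSON-RPC call against the new node. The sketch below shows one way to build the request body; the helper name `rpc_payload` is hypothetical, and the `127.0.0.1:9944` endpoint in the usage comment is an assumption, so substitute whatever RPC address your node actually exposes locally.

```bash
# Hypothetical helper: build a JSON-RPC 2.0 request body for a no-argument method.
rpc_payload() {
  printf '{"id":1,"jsonrpc":"2.0","method":"%s","params":[]}\n' "$1"
}

# Usage sketch, run against the NEW node only (endpoint is an assumption):
#   curl -s -H 'Content-Type: application/json' \
#     -d "$(rpc_payload author_rotateKeys)" http://127.0.0.1:9944
# The "result" field of the response is the hex string to submit via Set Session Key.
```

Take a fresh encrypted backup of `keystore/` after this call, since the rotation writes new session keys into it.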
+
+## Stash / controller recovery
+
+The stash and controller are **wallet** keys, never on the box. Recover them from the
+operator's seed/hardware wallet, not from server backups. If the controller seed is
+compromised, the stash funds are still safe because the two accounts are separate, but
+rotate the controller and re-`setKeys`/`validate` from the new controller promptly.
diff --git a/skills/avail-validator-operate/references/chill-unbond.md b/skills/avail-validator-operate/references/chill-unbond.md
@@ -0,0 +1,60 @@
+# Chill, unbond, exit — Avail validator
+
+Stopping validation cleanly is a staking action, not a server action. Killing the
+container alone does **not** chill you: you'd be an offline validator (involuntary
+chill, possible slash if many are offline together). Always chill on-chain *first* and
+wait for it to take effect; only then is it safe to stop the node.
+
+## Chill (stop validating, keep funds bonded)
+
+`staking.chill` removes you from the active/waiting set without unbonding.
+
+- **Where:** staking actions UI (network URL in
+  `avail-validator-setup/references/networks.md`) → your account → **Stop**, or submit
+  the `staking.chill` extrinsic directly.
+- **Signed by:** the **controller** account (not the stash).
+- **Effective:** next era (~24 h). Funds remain bonded; you simply stop being
+  selectable for new/revised nominations.
+- After chill takes effect (confirm you're out of the active set on the dashboard and
+  logs no longer show `🎁 Prepared block for proposing`), it is safe to stop/decommission
+  the node.
+
+### Voluntary vs involuntary chill
+- **Voluntary:** you called `chill`. Clean. No slash.
+- **Involuntary:** the network chilled you for being unresponsive a full session. No
+  slash by itself — but if ≥10% of validators are offline together in an epoch, that
+  whole group is slashed. So "I'll just turn it off" is risky; chill explicitly.
+
+## Unbond (start releasing the stake)
+
+After chilling, to free the bonded funds:
+
+1. `staking.unbond` the amount (controller-signed).
+2. **28-day** unbonding lock — funds are non-transferable during this period.
+3. After 28 days, `staking.withdrawUnbonded` to make them transferable.
+
+You can chill without unbonding (pause validating, keep stake) or unbond a partial
+amount and keep validating with the rest (as long as you stay above the waiting-list
+floor — see networks.md economics).
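To plan around the 28-day lock, it can help to compute the earliest calendar date on which the withdrawal becomes possible. This is a rough sketch: the function name `earliest_withdraw` is hypothetical, it requires GNU `date`, and it simply adds 28 calendar days to the unbond date. The chain counts the lock in eras (era ≈ 24 h per this skill), so treat the result as an estimate and confirm on the staking UI.

```bash
# Hypothetical helper: estimate the earliest date for withdrawUnbonded, given the
# date (YYYY-MM-DD) on which staking.unbond was submitted. Requires GNU date.
earliest_withdraw() {
  date -u -d "$1 + 28 days" +%F
}
```

For example, `earliest_withdraw 2026-05-19` prints `2026-06-16`.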
+
+## Full exit checklist
+
+1. `staking.chill` (controller) → wait one era, confirm out of active set via logs +
+   dashboard.
+2. Stop & decommission the node container.
+3. `staking.unbond` the full bonded amount (controller).
+4. Wait 28 days.
+5. `staking.withdrawUnbonded` (controller). Funds now transferable from the stash.
+6. Securely destroy the on-box `keystore/` only after you're certain you won't rejoin
+   with the same identity (otherwise keep the encrypted backup).
+
+## Slashing context (why the order matters)
+
+- Equivocation slashes regardless of chill status — it's about duplicate signing, so
+  don't run the old node again with live keys after migrating.
+- Slash appears immediately on the staking UI slashes page; the **financial deduction
+  is delayed several days** and governance can reverse it. Don't assume safety from
+  "balance not changed yet."
+- Chilling promptly when you know you'll be offline (maintenance, migration) converts a
+  potential slash scenario into a clean no-penalty exit. Treat chill as the standard
+  pre-maintenance step for anything that risks a full session of downtime.