5 changes: 4 additions & 1 deletion .github/workflows/ci.yml
@@ -33,7 +33,7 @@ jobs:
libslirp-dev

- name: Install Rust toolchain
-uses: dtolnay/rust-toolchain@stable
+uses: dtolnay/rust-toolchain@1.91.1
with:
toolchain: ${{ env.RUST_TOOLCHAIN }}
components: rustfmt, clippy
@@ -55,6 +55,9 @@ jobs:
- name: Clippy
run: cargo clippy --workspace --all-targets --all-features --locked -- -D warnings

+- name: Watchdog Lua tests
+run: lua watchdog/tests/run.lua
+
- name: Test
timeout-minutes: 15
run: cargo test --workspace --all-targets --all-features --locked
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,10 +1,14 @@
/target
.deps/
watchdog-e2e-*/
.env
.env.fish
sequencer.db
sequencer.db-shm
sequencer.db-wal
/out/
examples/canonical-app/out/
/.DS_Store
.vscode/
soljson-latest.js
**/states/
2 changes: 2 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

175 changes: 175 additions & 0 deletions docs/watchdog/README.md
@@ -0,0 +1,175 @@
# Watchdog

The watchdog is an off-chain safety process that compares sequencer API state
against state produced by the canonical Cartesi Machine at an L1 safe block.

## V1 Shape

The implementation lives in `watchdog/` and is intentionally split into small
Lua modules:

- `http.lua`: HTTP adapter (`lua-curl` / `lcurl` when installed, otherwise `curl` CLI via `new_auto()`).
- `jsonrpc.lua`: JSON-RPC request/response validation.
- `l1.lua`: partitioned `eth_getLogs` scanning and strict L1 log ordering.
- `abi.lua`: decoding for the `InputAdded` / `EvmAdvance` envelope.
- `machine.lua`: narrow adapter boundary for Cartesi Machine bindings.
- `machine_cli.lua`: `cartesi-machine` CLI adapter for loading snapshot
directories, writing raw input files, advancing, inspecting, and saving snapshots.
- `compare.lua`: raw byte comparison.
- `checkpoint.lua`: manifest-backed checkpoint persistence.
- `alarm.lua`: webhook alarm delivery.
- `retry.lua`: bounded retry helper used by the runtime.
- `runner.lua`: one-shot orchestration across checkpoint load, sequencer poll,
L1 fetch, CM replay, raw compare, alarm, and checkpoint write.
- `main.lua`: compare or advance loop (daemon or `WATCHDOG_ONCE=1`).

The L1 reader follows the Rust partition strategy from
`sequencer/src/partition.rs`: if an RPC provider rejects a large range, the
range is split recursively and retried. Lua decodes and validates input
envelopes, but it does not classify payload tags. Direct input vs batch
submission remains scheduler logic inside the canonical machine.
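
The recursive split described above can be sketched roughly as follows. This is an illustration in Rust, not the code in `sequencer/src/partition.rs` or `l1.lua`: `scan_range` and the `fetch` callback are hypothetical names, `fetch` stands in for an `eth_getLogs` call, and the sketch splits on any error, whereas a real reader would presumably split only on "range too large" rejections.

```rust
/// Scans the inclusive block range `from..=to`, halving the range whenever
/// the (hypothetical) `fetch` callback rejects it.
fn scan_range<T, F>(from: u64, to: u64, fetch: &mut F) -> Result<Vec<T>, String>
where
    F: FnMut(u64, u64) -> Result<Vec<T>, String>,
{
    match fetch(from, to) {
        Ok(logs) => Ok(logs),
        // A single block cannot be split further, so surface the error.
        Err(err) if from >= to => Err(err),
        Err(_) => {
            let mid = from + (to - from) / 2;
            // Lower half first, so results stay in ascending block order.
            let mut logs = scan_range(from, mid, &mut *fetch)?;
            logs.extend(scan_range(mid + 1, to, &mut *fetch)?);
            Ok(logs)
        }
    }
}
```

Scanning the lower half before the upper half keeps the concatenated logs in block order, which matters because the watchdog relies on strict L1 log ordering.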

`l1.lua` has the `InputAdded(address,uint256,bytes)` event topic baked in and
filters logs by `topic0 = InputAdded` and `topic1 = app address`, matching the
Rust reader's app-filtered InputBox scan.
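
For orientation, an indexed `address` parameter appears in a log topic as a 32-byte word, so the app address has to be left-padded with zeros before it can be used as `topic1`. A minimal sketch of that padding, where `address_topic` is a hypothetical helper and not a function from `l1.lua` or the Rust reader:

```rust
/// Left-pads a 20-byte hex address to the 32-byte topic form used for
/// indexed `address` parameters. Returns `None` for malformed input.
fn address_topic(address: &str) -> Option<String> {
    let hex = address.strip_prefix("0x").unwrap_or(address);
    if hex.len() != 40 || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // 64 hex characters = 32 bytes; the fill character is '0'.
    Some(format!("0x{:0>64}", hex.to_ascii_lowercase()))
}
```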

## Runtime Contract

The sequencer exposes `GET /get_state` for byte-exact state comparison. The
endpoint is generic over app state bytes, even though the toy wallet app
currently returns deterministic JSON:

```json
{
"safe_block": 123,
"state": "{\"balances\":{},\"nonces\":{}}"
}
```

`state` must be the exact bytes produced by the bare-metal app serializer
for the app state anchored at `safe_block`. The watchdog compares those raw
bytes with the bytes returned by CM inspect. It must not canonicalize either
value before deciding pass/fail.
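
A byte-exact comparison of this kind might look like the sketch below. It is written in Rust for illustration only; `first_mismatch` is a hypothetical name, and treating the reported offset as zero-based is an assumption rather than something `compare.lua` is known to do.

```rust
/// Returns `None` when the byte strings are identical, otherwise the
/// zero-based offset of the first differing byte. When one input is a
/// strict prefix of the other, the offset is the shorter length.
fn first_mismatch(sequencer: &[u8], machine: &[u8]) -> Option<usize> {
    if sequencer == machine {
        return None;
    }
    let common = sequencer
        .iter()
        .zip(machine.iter())
        .take_while(|(a, b)| a == b)
        .count();
    Some(common)
}
```

Under this scheme `{"a":1}` and `{"a": 1}` differ at offset 5 even though they are equivalent JSON, which is the intended consequence of skipping canonicalization.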

`get_state` reconstructs a safe-only app state by replaying the persisted
scheduler-accepted safe batch prefix into a fresh app instance. It intentionally
excludes the current soft-confirmed Tip and any valid closed batches that have
not been accepted by the L1 scheduler view yet.

The canonical scheduler answers `RollupRequest::Inspect` with query `state` by
calling `Application::export_state()` (see `examples/canonical-app`).

## Checkpoints

V1 persists only the resulting Cartesi Machine checkpoint, not the fetched L1
inputs.

```text
checkpoint_dir/
current.json
checkpoints/
00000000000001234567/
snapshot/
manifest.json
```

`manifest.json` records `safe_block`, timestamp, and optionally the CM image
hash. A new checkpoint directory is written first, then `current.json` is
atomically replaced to point at it.
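
The write-then-repoint order can be sketched as follows, in Rust for illustration. The function names, the `current.json.tmp` file, and the contents written to `current.json` are assumptions; the 20-digit zero-padded directory name matches the layout shown above. The CM snapshot itself is left to the machine adapter and is not written here.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Directory name for a checkpoint: the safe block zero-padded to 20 digits.
fn checkpoint_name(safe_block: u64) -> String {
    format!("{:020}", safe_block)
}

/// Writes the checkpoint manifest first, then repoints `current.json`.
fn publish_checkpoint(root: &Path, safe_block: u64, manifest: &str) -> io::Result<PathBuf> {
    let name = checkpoint_name(safe_block);
    let dir = root.join("checkpoints").join(&name);
    fs::create_dir_all(&dir)?;
    fs::write(dir.join("manifest.json"), manifest)?;

    // Hypothetical pointer format. Write it to a temporary file, then rename
    // it over `current.json` so readers never see a half-written pointer.
    let tmp = root.join("current.json.tmp");
    fs::write(&tmp, format!("{{\"checkpoint\":\"{}\"}}\n", name))?;
    fs::rename(&tmp, root.join("current.json"))?;
    Ok(dir)
}
```

The rename is atomic on POSIX filesystems when both paths are on the same filesystem, which is why the temporary file sits next to `current.json`. If the process dies before the rename, the previous pointer still refers to a complete checkpoint.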

When bootstrapping without an existing checkpoint, the operator provides both:

- `WATCHDOG_CM_SNAPSHOT_DIR`
- `WATCHDOG_CM_SNAPSHOT_SAFE_BLOCK`

## Modes

The default `WATCHDOG_MODE` is `advance`. In this mode the watchdog does not
poll the sequencer. It:

1. Loads the latest checkpoint, or the bootstrap snapshot directory.
2. Reads the L1 safe block from the RPC (or `WATCHDOG_TARGET_SAFE_BLOCK` when
provided for tests/manual runs).
3. Fetches and decodes `InputAdded` logs for the block range.
4. Feeds the raw InputBox input bytes into the CM adapter.
5. Saves a new snapshot directory and advances `current.json`.

`WATCHDOG_MODE=compare` replays safe L1 inputs into the CM, calls
`--cmio-inspect-state` with the `state` query, and compares the returned report
bytes against `GET /get_state`.

Useful runtime knobs:

- `WATCHDOG_CM_EXECUTABLE`: Cartesi Machine executable, default `cartesi-machine`.
- `WATCHDOG_CM_WORK_DIR`: temporary directory for staged input files, default `/tmp`.
- `WATCHDOG_RETRY_ATTEMPTS`: bounded retry attempts per run, default `3`.
- `WATCHDOG_RETRY_DELAY_SEC`: delay between retry attempts, default `5`.
- `WATCHDOG_TARGET_SAFE_BLOCK`: manual/test override for the target safe block.
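
As a rough picture of what the two retry knobs control, here is a bounded retry helper in Rust. It is a sketch, not the contents of `retry.lua`: the `retry` signature is hypothetical, and a fixed delay with no backoff is an assumption based on the single `WATCHDOG_RETRY_DELAY_SEC` setting.

```rust
use std::thread;
use std::time::Duration;

/// Runs `op` up to `attempts` times, sleeping `delay` after each failure.
/// `op` receives the 1-based attempt number and always runs at least once.
fn retry<T, E, F>(attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: return the last error to the caller.
            Err(err) if attempt >= attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

With the documented defaults this would correspond to `retry(3, Duration::from_secs(5), ...)`: at most three attempts with two five-second pauses between them.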

## Local Tests

| Command | What it exercises |
|---------|-------------------|
| `just test-watchdog` | Lua unit tests (fake HTTP/RPC/CM; no live chain) |
| `just test-watchdog-e2e` | Real CM: advance, inspect; optional live compare if `WATCHDOG_E2E_SEQUENCER_URL` is set |
| `just test-watchdog-compare-harness` | **Full E2E**: Anvil + devnet sequencer + `GET /get_state` + CM inspect + Lua compare |
| `just test-watchdog-webhook-drill` | Webhook delivery smoke (`WATCHDOG_WEBHOOK_URL` required) |

Prerequisites for CM-backed tests:

```bash
just canonical-build-machine-image # once, if out/ image is missing
just watchdog-lua-deps # lua-cjson into .deps/lua (system pkg or gcc)
```

`cartesi-machine`, `lua`, and `curl` must be on PATH. `lua-curl` is optional; without it, the watchdog falls back to the `curl` CLI.

### Lua unit tests

```bash
just test-watchdog
```

Covers raw comparison, golden InputAdded ABI decoding, L1 ordering, recursive
range partitioning, config, checkpoints, advance/compare runner (fakes), CM CLI
staging, retry, and alarm webhook encoding.

### Lua CM end-to-end

```bash
just test-watchdog-e2e
```

Scenarios (verbose `step NN/NN` logging):

- `prerequisites` — `cartesi-machine` on PATH and machine image present.
- `advance-empty-range` — real CM advance + checkpoint write with zero new inputs.
- `cm-inspect-state-query` — real `--cmio-inspect-state` with query `state`.
- `compare-runner-with-sequencer` — skipped unless `WATCHDOG_E2E_SEQUENCER_URL` is set.

Rebuild the machine image after changing the canonical scheduler/dapp. A stale
image makes `cm-inspect-state-query` skip with `inspect endpoint not implemented`.

### Rust compare harness (most complete integration test)

```bash
just test-watchdog-compare-harness
```

Spawns Anvil + rollups devnet + `sequencer-devnet`, verifies the CM inspect JSON
at genesis, then runs `watchdog/tests/run_compare_once.lua` in compare mode with
matching `WATCHDOG_*` addresses. Requires `RUN_WATCHDOG_E2E=1` (set by the recipe).

### Staging / operator drills

See [`staging-drills.md`](staging-drills.md) for webhook smoke, synthetic
divergence POST, and manual compare env vars.

## Related sequencer tests

```bash
cargo test -p sequencer get_state -- --test-threads=1
```

HTTP integration for `GET /get_state` lives in `sequencer/tests/e2e_sequencer.rs`.
Storage/replay semantics are covered in `sequencer/src/egress/app_state.rs` unit tests.
95 changes: 95 additions & 0 deletions docs/watchdog/staging-drills.md
@@ -0,0 +1,95 @@
# Watchdog Staging Drills

Operator drills for webhook delivery and divergence detection. Local harness
steps live in [`README.md`](README.md); this document covers staging and manual
verification.

## Prerequisites

- Built canonical machine image: `just canonical-build-machine-image`
- `cartesi-machine`, `lua`, and `curl` on PATH
- `lua-cjson` (system package, or `just watchdog-lua-deps` copies/builds `.deps/lua/cjson.so` via `gcc` — no `make`)
- `lua-curl` optional — drills and compare harness fall back to `curl` CLI when absent
- Staging or local sequencer reachable at `WATCHDOG_SEQUENCER_URL`
- L1 RPC + InputBox + app addresses matching that deployment
- Webhook receiver URL (Slack incoming webhook, PagerDuty, or `https://httpbin.org/post` for smoke tests)

## Drill 1 — Webhook delivery (no sequencer)

Verifies the alarm transport reaches your receiver.

```bash
just watchdog-lua-deps
export WATCHDOG_WEBHOOK_URL="https://your-receiver.example/hook"
WATCHDOG_LUA_DEPS=.deps/lua lua watchdog/tests/drill_webhook.lua
# or: just test-watchdog-webhook-drill
```

Expected: HTTP 2xx for both the `state_mismatch` and `safe_block_regressed` sample payloads.
Check that the receiver shows JSON with `"kind"` and `"run_id"` fields.

## Drill 2 — Divergence webhook (synthetic mismatch, no CM)

Verifies the receiver gets a realistic `state_mismatch` payload without running compare mode:

```bash
export WATCHDOG_WEBHOOK_URL="https://your-receiver.example/hook"
WATCHDOG_LUA_DEPS=.deps/lua lua watchdog/tests/drill_divergence.lua
```

Expected: HTTP 2xx, receiver shows `kind=state_mismatch` and a non-zero `mismatch_offset`.

Unit coverage: `just test-watchdog` (`runner alarms on raw state mismatch`).

## Drill 3 — Happy compare (local Anvil harness)

Full stack: Anvil + devnet rollups + sequencer + CM inspect + `GET /get_state`.

```bash
just test-watchdog-compare-harness
# equivalent:
# just setup && just watchdog-lua-deps && just ensure-machine-image
# cargo build -p sequencer --bin sequencer-devnet -p rollups-e2e
# RUN_WATCHDOG_E2E=1 cargo run -p rollups-e2e -- watchdog_genesis_compare_test --exact
```

Or run the Lua compare pass manually after starting a devnet sequencer yourself:

```bash
export WATCHDOG_MODE=compare
export WATCHDOG_SEQUENCER_URL=http://127.0.0.1:<port>
export WATCHDOG_L1_RPC_URL=http://127.0.0.1:8545
export WATCHDOG_INPUTBOX_ADDRESS=<from Anvil deployments>
export WATCHDOG_APP_ADDRESS=<deployed app>
export WATCHDOG_CHECKPOINT_DIR=/tmp/watchdog-checkpoints
export WATCHDOG_CM_SNAPSHOT_DIR=examples/canonical-app/out/canonical-machine-image
export WATCHDOG_CM_SNAPSHOT_SAFE_BLOCK=0
export WATCHDOG_LUA_DEPS=.deps/lua
lua watchdog/tests/run_compare_once.lua
```

Expected: exit 0, stdout `watchdog compare ok: safe_block=... input_count=...`, and genesis wallet state `{"balances":{},"nonces":{}}` on both sides.

## Drill 4 — Production compare daemon

Run the watchdog in compare mode against staging (daemon or cron):

```bash
export WATCHDOG_MODE=compare
export WATCHDOG_ONCE=1 # or 0 for daemon
export WATCHDOG_WEBHOOK_URL=...
# ... all WATCHDOG_* vars from config.lua ...
lua watchdog/main.lua
```

On mismatch: non-zero exit, webhook fired, logs show `state_mismatch` and byte offset.

## Triage checklist

| Symptom | Likely cause |
|---------|----------------|
| `inspect endpoint not implemented` | Stale CM image — rebuild |
| `state_mismatch` at genesis | Checkpoint not aligned with sequencer history |
| Webhook 4xx | Wrong URL or auth on receiver |
| Compare skipped in Lua e2e | Set `WATCHDOG_E2E_SEQUENCER_URL` to a live sequencer |
| Compare harness skipped | Set `RUN_WATCHDOG_E2E=1` (see `just test-watchdog-compare-harness`) |