Skip to content

Fix audit findings: reconnect watchdog, DO alarm contention, state-machine guards, registry bounding (v1.2.0)#9

Merged
adewale merged 4 commits into
mainfrom
claude/vibrant-hawking-6ubyus
Jun 10, 2026
Merged

Fix audit findings: reconnect watchdog, DO alarm contention, state-machine guards, registry bounding (v1.2.0)#9
adewale merged 4 commits into
mainfrom
claude/vibrant-hawking-6ubyus

Conversation

@adewale

@adewale adewale commented Jun 10, 2026

Copy link
Copy Markdown
Owner

What

Fixes the 16 defects surfaced by a full-system audit — a client-side reconnect watchdog that ejected players after any network blip, four Durable Object lifecycle bugs (alarm clobbering, room leaks, a match-reset exploit, unbounded registry growth), and a follow-up pass aligning the worker with Cloudflare's published DO best practices (RPC, alarm error boundary, hibernation-friendly heartbeats, waitUntil, region threading).

Why

No filed issue — these came from an internal audit of the system and the assumptions beneath it ("audit the seams, not the cores"). The bugs clustered at the boundaries the unit tests mock away: the WebSocket lifecycle, the single DO alarm, and the matchmaker's storage backing. Full audit trail in Lessons_learned.md §21–22 and CHANGELOG.md 1.2.0. The two worst, with reproduction sketches:

  1. Every mid-game reconnect ended in permanent ejection. useGameConnection's watchdog measured staleness from the last pong only; onopen never refreshed it. Repro: play, drop the network >5s, let the client reconnect → 30s later the watchdog closes the healthy socket (clean close → server removes the player → rejoin token now invalid → fresh join rejected with game_in_progress). Affects both frontends (TUI re-exports the hook).
  2. Any WS upgrade froze a live room for 5s. A DO has one alarm and setAlarm overwrites; ensureUnauthenticatedSocketAlarm set now+5s on every upgrade, clobbering the pending 33ms tick. Repro: reconnect (or open wss://…/room/CODE/ws?rejoin=1 with just a room code) against a playing room → the whole room's loop stalls ~5s; repeat every <5s for an indefinite freeze.

How

Test-first (red → green) throughout, with correctness-by-construction over spot patches:

Finding Fix
Watchdog kills reconnects Liveness is an explicit mark set on both socket-open and pong (client-core/src/connection/liveness.ts) — the reset can't be forgotten
Alarm clobber / freeze DoS scheduleWakeNoLaterThan min-merges via getAlarm() (Cloudflare's "Multiple Events, Single Alarm" pattern)
game_over room leak PLAYER_LEAVE permitted out of game_over; rooms drain to 0 and cleanup() unregisters
ready mid-match reset exploit ready/unready/checkStartConditions status-guarded to the lobby (matches the PBT hardening start_solo/join already had)
Registry grows to the 128 KiB KV ceiling /find sweeps all stale rooms (not just openRooms), game_over registrations drop, MAX_TRACKED_ROOMS cap refuses new rooms (503)
createRoom ignored failures Register-first with init-failure compensation; honest 503s, no phantom/orphan rooms
Cloudflare-idioms pass Matchmaker exposes typed RPC (fetch kept only for the WS upgrade, which needs a 101); alarm() error boundary with backoff + bounded give-up; setWebSocketAutoResponse answers heartbeat pings without waking the DO (reap reconciles liveness from getWebSocketAutoResponseTimestamp since auto-responses bypass webSocketMessage); fireAndForget protected by ctx.waitUntil; region threaded explicitly (RPC context + x-vaders-region header) replacing the isolate-invisible CF_REGION global
CI build failures (2) (a) The one error-level biome diagnostic (lint/a11y/useSemanticElements) fixed: the departure toast's <div role="status"> is now a semantic <output>, with getByRole('status') pinned in its test. (b) A latent flaky PBT the lint fix unmasked: the LaunchScreen menu-sound oracle counted keys linearly and missed that Enter on JOIN ROOM enters join mode (where menu keys are silent) — CI's seed found ['2','ArrowDown','Enter','ArrowUp']. The component was correct; the oracle now models the handler (joinMode + selectedIndex), asserts full call-sequence identity, and the counterexample is pinned as a deterministic regression case (model verified at 1500 fresh-seed runs)
Smaller fixes Symmetric player bounds (PLAYER_MAX_X 112→116; left-edge formula on a center coordinate left 4 dead columns); WS hook warns on malformed/unknown messages instead of swallowing; buildWsUrl uses the URL parser; TUI startup probe and MusicManager share one audio-player resolver; TUI default server URL matches the launcher; rejoin_sessions cleared on cleanup

Deliberately deferred (captured in docs/TODO.md with rationale): sharding the global-singleton Matchmaker and moving the registry to per-room SQLite rows — both need a storage/topology migration; the cap + sweep bounds the blast radius until then.

Testing

  • New/modified tests pass — ~40 new tests across the five packages (unit, integration-style against the real hook/DO classes, and PBT harness updates)
  • Tests fail when fix is reverted (regression guard) — verified explicitly for the headline fix: removed the one-line markAlive on onopen, connection-reconnect.test.tsx went red, restored it, green. The reducer/GameRoom guards were written red-first against the unfixed code, and the flaky-PBT fix was reproduced red on CI's exact seed/path before the oracle was corrected
  • Full test suite passes (no regressions) — shared 259, client-core 348, client 796, worker 555, web 650 (+1 pre-existing skip); typechecks clean in all packages
  • Manual testing completed — the alarm-clobber and reconnect-flap sequences were reproduced step-by-step in the GameRoom/hook test harnesses (mock WS + fake timers) before fixing; both-directions tests confirm the watchdog still closes genuinely dead sockets and known messages don't warn

CI is green on the head commit (cf659ad): Lint/Type Check/Test ✅, Build ✅, Spritesheet ✅ (E2E runs on push-to-main only, by design).

Risk

  • Breadth: 41 files, +2k/−350. It's one concern (audit-findings hardening) but it touches the connection hook, the DO alarm path, and the matchmaker — the multiplayer-critical core. Mitigation: every change is individually regression-guarded, and the cross-frontend contract tests + PBT invariant suites all pass unchanged.
  • Behavior changes to note: ready is now ignored outside the lobby (clients only ever sent it there); pong.serverTime is optional (verified no client reads it); registry capacity can return a new matchmaker_full 503 (saturation PBT updated to accept it); PLAYER_MAX_X widens reachable play by 4 columns; the departure toast renders as <output> instead of <div role="status"> (identical semantics and layout).
  • Lint: the repo's 600+ biome warnings remain (out of scope); the single error-level diagnostic that failed CI is fixed in this PR, so bun run lint exits 0.
  • Merge state: rebased onto main (f408fcc, the co-op balance/doc fixes from Balance co-op lives, wipe timings, and grid sizes #8) — linear history, no merge commits. The README version-badge conflict was resolved to 1.2.0 to match this branch's package.json.

https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78

@adewale adewale changed the title Deep-audit hardening pass: fix reconnect watchdog, registry bounding, and state-machine gaps Fix audit findings: reconnect watchdog, DO alarm contention, state-machine guards, registry bounding (v1.2.0) Jun 10, 2026
claude added 2 commits June 10, 2026 17:33
Deep-audit hardening pass. Each fix shipped red→green with a regression
guard, applying correctness-by-construction, PBT, and Cloudflare DO best
practices. See Lessons_learned.md §21 and CHANGELOG 1.2.0.

Critical:
- Reconnect watchdog killed every recovery (both frontends): liveness is
  now an explicit mark set on socket-open AND pong (client-core/liveness.ts),
  with a real reconnect integration test that fails without the wiring.
- Single-alarm clobber froze live rooms 5s: ensureUnauthenticatedSocketAlarm
  now min-merges via getAlarm() (Cloudflare "Multiple Events / Single Alarm").
- game_over leaked rooms forever: PLAYER_LEAVE is permitted out of game_over
  so disconnects drain the room and cleanup() runs.
- ready bypassed the state machine (match-reset exploit): ready/unready/
  checkStartConditions are status-guarded to the lobby.
- Matchmaker registry grew unbounded toward the 128 KiB KV-value ceiling:
  /find sweeps ALL rooms by staleness, game_over registrations are dropped,
  and a MAX_TRACKED_ROOMS cap (sized for 128 KiB) refuses new rooms (503).

Also: createRoom now register-first with init-failure compensation (no
orphan rooms, honest 503s); cleanup() clears rejoin_sessions; PLAYER_MAX_X
corrected to 116 (symmetric bounds); WS hook warns on malformed/unknown
messages; buildWsUrl uses the URL parser; TUI audio backend + default
server URL unified. Version 1.2.0; README/CHANGELOG/Lessons updated.

https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
Five fixes shipped test-first (red→green), per the published Durable
Object guidance; see Lessons_learned.md §22 and CHANGELOG 1.2.0.

- Matchmaker RPC: typed register/unregister/find/getRoomInfo methods
  replace hand-rolled internal fetch routing + JSON parsing in Worker
  and GameRoom. Thin fetch adapter kept for tests; WS upgrade stays a
  fetch (RPC cannot return a 101).
- alarm() error boundary: a throwing tick is caught, logged
  (alarm_error), re-armed with 1s backoff, and gives up into game_over
  after 10 consecutive failures — no platform retry storm.
- Hibernation-friendly heartbeat: setWebSocketAutoResponse answers
  {type:'ping'} without waking the DO; the phantom-reap reconciles
  liveness from getWebSocketAutoResponseTimestamp since auto-responded
  pings bypass webSocketMessage. pong.serverTime now optional.
- ctx.waitUntil protects fireAndForget tasks so evictions cannot drop
  registry updates or cleanup-alarm scheduling.
- Region threaded explicitly (RPC log context + x-vaders-region header
  on WS upgrade) replacing the CF_REGION global that DO isolates never
  saw and concurrent requests clobbered; DO logs now carry region.

Deferred scaling items captured in docs/TODO.md: shard the
global-singleton Matchmaker; move the registry to per-room SQLite rows.

https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
@adewale adewale force-pushed the claude/vibrant-hawking-6ubyus branch from 1d50c9d to d877ce2 Compare June 10, 2026 17:34
claude added 2 commits June 10, 2026 18:26
…toast

The "Lint, Type Check & Test" job failed on biome's one error-level
diagnostic (lint/a11y/useSemanticElements) in PlayerDepartureNotice:
a <div role="status" aria-live="polite"> where the semantic <output>
element belongs. <output> carries both implicitly, and the inline
display:flex overrides its default inline display, so rendering and
screen-reader behavior are unchanged.

The component test now also pins the accessible role
(getByRole('status') resolves to the toast), so a revert to a bare div
fails a test as well as lint.

Verified locally: bun run lint exits 0 (607 warnings remain, biome only
fails on errors); web suite 650; web+client builds and the spritesheet
generator all pass.

https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
… keys

With lint fixed, CI's fast-check seed -366879321 exposed a latent oracle
bug in the LaunchScreen menu-sound property: it counted every non-repeat
"positive" key as one sound, special-casing only the literal '3' hotkey.
But Enter while JOIN ROOM (index 2) is selected ALSO enters join mode,
where every menu key is intentionally silent until Escape. Shrunk
counterexample: ['2', 'ArrowDown', 'Enter', 'ArrowUp'] — expected 4
sounds, component correctly plays 3. The component was right; the test
model was wrong.

Replaced the linear count with a faithful model of the keydown handler
(joinMode + selectedIndex, Escape exit, m/n toggles) and strengthened
the assertion from call COUNT to full call-sequence identity. The CI
counterexample is pinned as a deterministic REGRESSION case so it runs
on every seed. '3' is back in the generated alphabet (the model now
covers join-mode entry).

Red→green on CI's exact seed/path, then verified at 1500 fresh-seed
runs; committed budget stays at 40 runs to fit the 5s timeout. Web suite
651 tests, typecheck and lint exit 0.

https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
@adewale adewale merged commit 7c57ba4 into main Jun 10, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants