Fix audit findings: reconnect watchdog, DO alarm contention, state-machine guards, registry bounding (v1.2.0)#9
Merged
Conversation
Deep-audit hardening pass. Each fix shipped red→green with a regression guard, applying correctness-by-construction, PBT, and Cloudflare DO best practices. See Lessons_learned.md §21 and CHANGELOG 1.2.0. Critical: - Reconnect watchdog killed every recovery (both frontends): liveness is now an explicit mark set on socket-open AND pong (client-core/liveness.ts), with a real reconnect integration test that fails without the wiring. - Single-alarm clobber froze live rooms 5s: ensureUnauthenticatedSocketAlarm now min-merges via getAlarm() (Cloudflare "Multiple Events / Single Alarm"). - game_over leaked rooms forever: PLAYER_LEAVE is permitted out of game_over so disconnects drain the room and cleanup() runs. - ready bypassed the state machine (match-reset exploit): ready/unready/ checkStartConditions are status-guarded to the lobby. - Matchmaker registry grew unbounded toward the 128 KiB KV-value ceiling: /find sweeps ALL rooms by staleness, game_over registrations are dropped, and a MAX_TRACKED_ROOMS cap (sized for 128 KiB) refuses new rooms (503). Also: createRoom now register-first with init-failure compensation (no orphan rooms, honest 503s); cleanup() clears rejoin_sessions; PLAYER_MAX_X corrected to 116 (symmetric bounds); WS hook warns on malformed/unknown messages; buildWsUrl uses the URL parser; TUI audio backend + default server URL unified. Version 1.2.0; README/CHANGELOG/Lessons updated. https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
Five fixes shipped test-first (red→green), per the published Durable
Object guidance; see Lessons_learned.md §22 and CHANGELOG 1.2.0.
- Matchmaker RPC: typed register/unregister/find/getRoomInfo methods
replace hand-rolled internal fetch routing + JSON parsing in Worker
and GameRoom. Thin fetch adapter kept for tests; WS upgrade stays a
fetch (RPC cannot return a 101).
- alarm() error boundary: a throwing tick is caught, logged
(alarm_error), re-armed with 1s backoff, and gives up into game_over
after 10 consecutive failures — no platform retry storm.
- Hibernation-friendly heartbeat: setWebSocketAutoResponse answers
{type:'ping'} without waking the DO; the phantom-reap reconciles
liveness from getWebSocketAutoResponseTimestamp since auto-responded
pings bypass webSocketMessage. pong.serverTime now optional.
- ctx.waitUntil protects fireAndForget tasks so evictions cannot drop
registry updates or cleanup-alarm scheduling.
- Region threaded explicitly (RPC log context + x-vaders-region header
on WS upgrade) replacing the CF_REGION global that DO isolates never
saw and concurrent requests clobbered; DO logs now carry region.
Deferred scaling items captured in docs/TODO.md: shard the
global-singleton Matchmaker; move the registry to per-room SQLite rows.
https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
1d50c9d to
d877ce2
Compare
…toast
The "Lint, Type Check & Test" job failed on biome's one error-level
diagnostic (lint/a11y/useSemanticElements) in PlayerDepartureNotice:
a <div role="status" aria-live="polite"> where the semantic <output>
element belongs. <output> carries both implicitly, and the inline
display:flex overrides its default inline display, so rendering and
screen-reader behavior are unchanged.
The component test now also pins the accessible role
(getByRole('status') resolves to the toast), so a revert to a bare div
fails a test as well as lint.
Verified locally: bun run lint exits 0 (607 warnings remain, biome only
fails on errors); web suite 650; web+client builds and the spritesheet
generator all pass.
https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
… keys With lint fixed, CI's fast-check seed -366879321 exposed a latent oracle bug in the LaunchScreen menu-sound property: it counted every non-repeat "positive" key as one sound, special-casing only the literal '3' hotkey. But Enter while JOIN ROOM (index 2) is selected ALSO enters join mode, where every menu key is intentionally silent until Escape. Shrunk counterexample: ['2', 'ArrowDown', 'Enter', 'ArrowUp'] — expected 4 sounds, component correctly plays 3. The component was right; the test model was wrong. Replaced the linear count with a faithful model of the keydown handler (joinMode + selectedIndex, Escape exit, m/n toggles) and strengthened the assertion from call COUNT to full call-sequence identity. The CI counterexample is pinned as a deterministic REGRESSION case so it runs on every seed. '3' is back in the generated alphabet (the model now covers join-mode entry). Red→green on CI's exact seed/path, then verified at 1500 fresh-seed runs; committed budget stays at 40 runs to fit the 5s timeout. Web suite 651 tests, typecheck and lint exit 0. https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes the 16 defects surfaced by a full-system audit — a client-side reconnect watchdog that ejected players after any network blip, four Durable Object lifecycle bugs (alarm clobbering, room leaks, a match-reset exploit, unbounded registry growth), and a follow-up pass aligning the worker with Cloudflare's published DO best practices (RPC, alarm error boundary, hibernation-friendly heartbeats,
waitUntil, region threading).Why
No filed issue — these came from an internal audit of the system and the assumptions beneath it ("audit the seams, not the cores"). The bugs clustered at the boundaries the unit tests mock away: the WebSocket lifecycle, the single DO alarm, and the matchmaker's storage backing. Full audit trail in
Lessons_learned.md§21–22 andCHANGELOG.md1.2.0. The two worst, with reproduction sketches:useGameConnection's watchdog measured staleness from the lastpongonly;onopennever refreshed it. Repro: play, drop the network >5s, let the client reconnect → 30s later the watchdog closes the healthy socket (clean close → server removes the player → rejoin token now invalid → fresh join rejected withgame_in_progress). Affects both frontends (TUI re-exports the hook).setAlarmoverwrites;ensureUnauthenticatedSocketAlarmsetnow+5son every upgrade, clobbering the pending 33ms tick. Repro: reconnect (or openwss://…/room/CODE/ws?rejoin=1with just a room code) against aplayingroom → the whole room's loop stalls ~5s; repeat every <5s for an indefinite freeze.How
Test-first (red → green) throughout, with correctness-by-construction over spot patches:
client-core/src/connection/liveness.ts) — the reset can't be forgottenscheduleWakeNoLaterThanmin-merges viagetAlarm()(Cloudflare's "Multiple Events, Single Alarm" pattern)game_overroom leakPLAYER_LEAVEpermitted out ofgame_over; rooms drain to 0 andcleanup()unregistersreadymid-match reset exploitready/unready/checkStartConditionsstatus-guarded to the lobby (matches the PBT hardeningstart_solo/joinalready had)/findsweeps all stale rooms (not justopenRooms),game_overregistrations drop,MAX_TRACKED_ROOMScap refuses new rooms (503)createRoomignored failuresalarm()error boundary with backoff + bounded give-up;setWebSocketAutoResponseanswers heartbeat pings without waking the DO (reap reconciles liveness fromgetWebSocketAutoResponseTimestampsince auto-responses bypasswebSocketMessage);fireAndForgetprotected byctx.waitUntil; region threaded explicitly (RPC context +x-vaders-regionheader) replacing the isolate-invisibleCF_REGIONgloballint/a11y/useSemanticElements) fixed: the departure toast's<div role="status">is now a semantic<output>, withgetByRole('status')pinned in its test. (b) A latent flaky PBT the lint fix unmasked: the LaunchScreen menu-sound oracle counted keys linearly and missed that Enter on JOIN ROOM enters join mode (where menu keys are silent) — CI's seed found['2','ArrowDown','Enter','ArrowUp']. The component was correct; the oracle now models the handler (joinMode+selectedIndex), asserts full call-sequence identity, and the counterexample is pinned as a deterministic regression case (model verified at 1500 fresh-seed runs)PLAYER_MAX_X112→116; left-edge formula on a center coordinate left 4 dead columns); WS hook warns on malformed/unknown messages instead of swallowing;buildWsUrluses the URL parser; TUI startup probe andMusicManagershare one audio-player resolver; TUI default server URL matches the launcher;rejoin_sessionscleared on cleanupDeliberately deferred (captured in
docs/TODO.mdwith rationale): sharding the global-singleton Matchmaker and moving the registry to per-room SQLite rows — both need a storage/topology migration; the cap + sweep bounds the blast radius until then.Testing
markAliveononopen,connection-reconnect.test.tsxwent red, restored it, green. The reducer/GameRoom guards were written red-first against the unfixed code, and the flaky-PBT fix was reproduced red on CI's exact seed/path before the oracle was correctedCI is green on the head commit (
cf659ad): Lint/Type Check/Test ✅, Build ✅, Spritesheet ✅ (E2E runs on push-to-main only, by design).Risk
readyis now ignored outside the lobby (clients only ever sent it there);pong.serverTimeis optional (verified no client reads it); registry capacity can return a newmatchmaker_full503 (saturation PBT updated to accept it);PLAYER_MAX_Xwidens reachable play by 4 columns; the departure toast renders as<output>instead of<div role="status">(identical semantics and layout).bun run lintexits 0.main(f408fcc, the co-op balance/doc fixes from Balance co-op lives, wipe timings, and grid sizes #8) — linear history, no merge commits. The README version-badge conflict was resolved to 1.2.0 to match this branch'spackage.json.https://claude.ai/code/session_01GS6cvaEYkJxvKXgZsmeD78