diff --git a/CHANGELOG.md b/CHANGELOG.md index 1236437..4a53f43 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -2,6 +2,36 @@ All notable changes to Vaders are documented in this file. +## [1.2.0] — 2026-06-10 + +A deep-audit hardening pass. Every fix below shipped test-first (red → green) with +a regression guard; see `Lessons_learned.md` §21 for the full postmortem. + +### Fixed + +- **Reconnection watchdog killed every recovery** (`client-core`, both frontends) — the heartbeat watchdog measured liveness from the last `pong` only, and the `onopen` handler never refreshed it, so a freshly reconnected socket inherited the dead socket's stale timestamp and was force-closed ~30 s later, every time. Liveness is now an explicit mark set on **both** socket-open and pong (`client-core/src/connection/liveness.ts`), so the reset cannot be forgotten. Regression: `web/src/connection-reconnect.test.tsx` drives the real hook through a reconnect. +- **Single-alarm clobber froze live rooms for 5 s** (`worker`) — a Durable Object has one alarm and `setAlarm` overwrites; `ensureUnauthenticatedSocketAlarm` fired on every WS upgrade and set `now + 5 s` unconditionally, so a mid-match reconnect pushed the pending 33 ms game-tick alarm out by 5 s. Now min-merges via `getAlarm()` (Cloudflare's documented "Multiple Events / Single Alarm" pattern). +- **`game_over` leaked rooms forever** (`worker`) — `PLAYER_LEAVE` was blocked in `game_over`, so a disconnect at the game-over screen never removed the player; `playerCount` never reached 0, `cleanup()` never ran, and the room + its matchmaker registry entry persisted indefinitely. `PLAYER_LEAVE` is now permitted out of `game_over`. +- **`ready` bypassed the state machine** (`worker`) — the `ready` handler had no status guard, so a scripted client could send `ready` mid-match and, once every live player's id was collected, force a countdown whose completion wiped the live match (tick/score/wave reset). `ready`/`unready`/`checkStartConditions` are now status-guarded to the lobby. +- **Matchmaker registry grew without bound** (`worker`) — `/find` only swept rooms in `openRooms`, so created-but-never-joined rooms (and other non-open entries) lived in the single-value `rooms` blob forever, heading for the **128 KiB KV-value ceiling** (this DO is KV-backed) that would break matchmaking for everyone. `/find` now sweeps **all** rooms by staleness, `game_over` registrations are dropped, and a `MAX_TRACKED_ROOMS` cap (sized to fit 128 KiB) refuses new rooms past the limit (the Worker surfaces 503). +- **Room creation ignored downstream failures** (`worker`) — `createRoom` ignored the `/init` and `/register` responses, so `POST /room` could hand a client a roomCode for a room the registry had rejected. It now checks both and propagates the failure. +- **Rejoin-token table leaked per room** (`worker`) — `cleanup()` dropped `game_state` but left `rejoin_sessions` behind, so each dead room's tokens persisted in SQLite. Cleanup now clears them too. +- **Asymmetric player bounds** (`shared`) — `PLAYER_MAX_X` was computed with the left-edge formula (`120 − 7 − 1 = 112`) on a center-based coordinate, leaving the rightmost 4 columns unreachable while the ship could touch the left wall. Corrected to `116` so the reachable margins mirror. +- **Malformed / unknown server messages were swallowed silently** (`client-core`) — the WebSocket hook now emits a dev-visible `console.warn` instead of dropping unparseable frames and unknown message types with no signal. +- **`buildWsUrl` used substring surgery** (`web`) — scheme derivation now uses the URL parser (`deriveWsUrl`), which handles ports, path prefixes, and scheme-case correctly. +- **TUI audio backend mismatch** (`client`) — the startup probe and `MusicManager` disagreed on the Linux audio binary (probe accepted `aplay`; music hardcoded `mpv`), so music silently failed on `aplay`-only hosts. Both now resolve through one source of truth (`client/src/audio/audioPlayers.ts`) with graceful fallback. +- **TUI default server URL** (`client`) — running the client directly defaulted to `localhost:8787` while the launcher defaulted to production, silently targeting a dead server. Both now agree on the production default. + +### Changed + +Platform-idiom hardening pass against Cloudflare's published Durable Object best practices (see `Lessons_learned.md` §22). Deferred scaling items (sharding the global-singleton Matchmaker; per-room SQLite registry rows) are captured in `docs/TODO.md`. + +- **Matchmaker exposes typed RPC** (`register` / `unregister` / `find` / `getRoomInfo`) — Worker and GameRoom call methods on the stub instead of hand-rolled `fetch(new Request('https://internal/…'))` routing and JSON parsing. A thin `fetch` adapter remains for tests. The WebSocket upgrade stays a fetch (RPC cannot return a 101). +- **`alarm()` error boundary** — a throwing tick no longer triggers Cloudflare's blind alarm retry (a 30 Hz retry storm against poisoned state). Failures are caught, logged as `alarm_error` wide events, re-armed with a 1 s backoff, and after 10 consecutive failures the room ends the game (`alarm_error_giving_up`) instead of spinning forever. +- **Heartbeat pings answered by the runtime** — `setWebSocketAutoResponse` now answers the client's `{type:'ping'}` without waking the DO, so idle lobbies hibernate through keepalives (Cloudflare: ping/pong does not interrupt hibernation). The phantom-reap reconciles per-socket liveness from `getWebSocketAutoResponseTimestamp` since auto-responded pings bypass `webSocketMessage`; `pong.serverTime` is now optional (no client read it). +- **Background tasks protected by `ctx.waitUntil`** — `fireAndForget` registry updates and cleanup-alarm scheduling now extend the DO's lifetime until they settle, so an eviction can't silently drop a registry update. +- **Region threaded explicitly into logs** — the edge colo travels as an RPC log-context argument and a `x-vaders-region` header on the WS upgrade, replacing the `globalThis.CF_REGION` global that was invisible inside DO isolates (DO logs had no region) and clobbered across concurrent requests. DO wide events now carry `region`. + ## [1.1.1] — 2026-04-13 ### Fixed diff --git a/Lessons_learned.md b/Lessons_learned.md index 3633f34..fcefc76 100644 --- a/Lessons_learned.md +++ b/Lessons_learned.md @@ -1437,3 +1437,101 @@ The first B tests initially used natural solo-game flow and failed — aliens ki 3. **Matchmakers need a progress signal, not just an update signal.** `updatedAt` refreshes on any registration — including churn. `lastStatusChangeAt` refreshes only on productive transitions. They measure different things, and you need both. 4. **Wide events made the undebuggable debuggable.** With per-message `console.log` the phantom state would have been invisible. 5. **Write the reproducer before the fix, but pick the right harness.** A 2400-tick natural gameplay sim is too brittle; direct mutation is more honest. + +--- + +## 21. The Deep-Audit Hardening Pass (v1.2.0) + +A whole-system audit — "audit this entire system and the assumptions beneath it" — surfaced a cluster of defects that share a root theme: **the parts of the system with the strongest correctness story (the pure reducer, the PBT harness) were not the parts where correctness actually broke.** The bugs lived at the boundaries the unit tests mock away — the single Durable Object alarm, the WebSocket lifecycle, the matchmaker registry's storage backing. Every fix shipped test-first (red → green) with a regression guard, leaning on the `testing-best-practices` skill (correctness by construction, PBT, both-directions, reproducer-before-fix) and Cloudflare's published Durable Object best practices. + +### Liveness is a mark, not a timestamp comparison + +The highest-impact bug was three lines: the reconnect watchdog (`useGameConnection`) compared `Date.now()` against `lastPongRef`, and `onopen` never refreshed `lastPongRef`. After any reconnect, the new socket inherited the dead socket's pong timestamp and the watchdog force-closed it ~30 s later — *every* reconnect, forever, on both frontends. The clean close then denied the player rejoin grace server-side, so one transient blip ejected them for good. + +The patch ("set `lastPongRef` in `onopen`") works, but the **correct-by-construction** fix is to model the thing that was implicit. Two events prove a connection is alive — a pong *and* a socket opening — so "alive" became an explicit mark (`markAlive`) set on both, with staleness measured from the last mark (`client-core/src/connection/liveness.ts`). You cannot forget the reset because open is one of the two callers, and the unit test asserts it. + +**The lesson: when a bug is "X forgot to update Y", ask what Y *represents*. Usually Y is an implicit model of some real-world fact ("the connection is alive"). Make the model explicit and the forgetting becomes structurally impossible.** + +### A pure-model test cannot catch a wiring bug + +We had `liveness.ts` unit-tested both directions in `bun:test` — but a pure model *cannot* have the bug, because the bug was whether the *hook* calls `markAlive` on open. The honest regression had to drive the real wiring: `web/src/connection-reconnect.test.tsx` renders the actual hook against a mock WebSocket and fake timers, simulates drop → reconnect with a stale clock, and asserts the watchdog does not self-close the healthy socket. It goes red if you delete the one-line reset. A second test asserts the watchdog *still* closes a genuinely silent socket — the both-directions guard against "fixing" the bug by disabling the feature. + +**The lesson: unit-test the model for the math, integration-test the wiring for the behavior. The skill's "smallest useful test tier" is not always the smallest — for a wiring bug it's the boundary.** + +### One Durable Object, one alarm — it is a shared resource + +A DO has a single alarm and `setAlarm` overwrites. Four schedulers shared it (33 ms game tick, 1 s countdown, 5 min cleanup, 5 s unauthenticated-socket timeout), and `ensureUnauthenticatedSocketAlarm` — called on *every* WS upgrade, including mid-match reconnects — set `now + 5 s` unconditionally, clobbering the pending tick and freezing the whole room's loop for 5 s. Cloudflare's docs name this exact trap ("One alarm limit → use the event-queue pattern") and prescribe the fix: **min-merge via `getAlarm()`** — never replace a sooner pending alarm. The soft deadlines (unauth timeout, cleanup) now schedule through `scheduleWakeNoLaterThan`; the hard-cadence tick/countdown stay direct because they are always the soonest in their phase and re-arm themselves. + +**The lesson: a singleton timer is a shared mutable resource. Centralise writes through a min-merge helper; never let a soft deadline overwrite a hard cadence. Read the platform's gotchas doc — this one was a named, documented footgun.** + +### Terminal states still need an exit edge + +`game_over` was modelled as fully terminal (`TRANSITIONS.game_over = {}`), which silently blocked `PLAYER_LEAVE`. A disconnect at the game-over screen never removed the player, so `playerCount` never hit 0, `cleanup()` never ran, and the room + matchmaker entry leaked forever — the *third* home of the phantom-player disease (§20), reached by a path A/B/C didn't cover. "Terminal" means *no progression back into gameplay*, not *no transitions at all*. Players must always be able to leave. + +The test that encoded this bug was worse than no test: `expect(canTransition('game_over', 'PLAYER_LEAVE')).toBe(false)` asserted the buggy behavior as if it were a spec (the §8 "tests that document bugs mask problems" anti-pattern, again). Flipping it to `toBe(true)` was step one of the fix. + +### Every state-changing handler needs the state-machine's permission + +The `ready` handler mutated `readyPlayerIds` and called `checkStartConditions` with no status guard, bypassing the reducer's `TRANSITIONS` table entirely. A scripted client (the protocol is plain JSON over an unauthenticated WS) could send `ready` mid-match; once every live player's id was collected, a countdown fired from `playing` and its completion called `startGame()`, wiping the live match. The PBT harness had already hardened `start_solo` and `join` against this exact class — `ready` was the one that slipped through. Fixed with a `waiting`-only guard on the handler *and* defense-in-depth in `checkStartConditions`. + +**The lesson: if you have a state machine, every mutation must pass through it or replicate its guard. An allowlist of transitions is worthless if handlers can mutate around it. Audit *all* handlers against the table, not just the ones a bug report named.** + +### Registries need a structural bound, not opportunistic pruning + +The matchmaker pruned stale rooms only by iterating `openRooms` — so a created-but-never-joined room (`playerCount 0`, never in `openRooms`) lived in the single-value `rooms` blob forever. Opportunistic pruning of a *subset* is not a bound. The fix sweeps **all** rooms by staleness on `/find`, drops `game_over` registrations eagerly, and adds a hard `MAX_TRACKED_ROOMS` cap that refuses new rooms past the limit (surfaced as a 503). + +And the cap's *size* is where reading the docs paid off twice. The first draft set it to 5000 "comfortably under the 2 MB limit." But the Cloudflare storage gotchas spell out that **KV-backed DO values cap at 128 KiB, not 2 MB** — and this DO is KV-backed (migration `new_classes`, async `storage.get/put`). At ~130 bytes/room, 5000 rooms is ~550 KB, which would *itself* blow the 128 KiB ceiling the cap was meant to defend. Corrected to 500 (~65 KB). The re-audit caught a bug in the fix. + +**The lesson: a "structural bound" is only as correct as the limit it's sized against. Read the exact limit for the exact backend you're on — SQLite row (2 MB) vs KV value (128 KiB) is a 16× difference, and the migration tag (`new_sqlite_classes` vs `new_classes`) silently decides which one applies.** The deeper fix is per-room SQLite rows (no single-value ceiling, no whole-blob rewrite per register), which Cloudflare's gotchas recommend directly ("store records as rows… batch writes") — deferred because it needs a storage-backend migration. + +**The lesson: any registry an adversary (or just churn) can grow needs a structural cap and a sweep that covers the whole keyspace — not just the slice your happy path reads.** + +### Coordinate-convention drift is a recurring class, not an incident + +`PLAYER_MAX_X` was `120 − 7 − 1 = 112` — the *left-edge* formula — applied to a *center*-based coordinate, leaving the rightmost 4 columns unreachable while the ship could touch the left wall. This is the same center-vs-left-edge confusion §8 and the original Lessons catalogued for bullets and barriers, resurfacing in a movement bound. The contract test even encoded the wrong invariant (`PLAYER_MAX_X + PLAYER_WIDTH ≤ WIDTH`, the left-edge rule). The fix corrected the constant to `116` (symmetric margins) and rewrote the invariant in terms of `PLAYER_HALF_WIDTH` and the *right edge*. + +**The lesson: a bug class you've fixed before will reappear in a surface you didn't think to check. The asymmetry was invisible until a test asserted the property directly — `leftMargin === rightMargin` — rather than a one-sided bound. Prefer property assertions ("the playfield is symmetric") over incidental ones ("max is 112").** + +### Don't swallow what you can't parse + +The WebSocket hook dropped malformed JSON (`catch {}`) and unknown message types (fall-through) with zero signal — exactly what makes a server-rollout protocol mismatch undebuggable. A `console.warn` at both points is the lightest honest fix; the both-directions test asserts a *known* message (pong) does **not** warn, so the diagnostic can't degrade into noise. + +### Single source of truth, again (TUI audio) + +The startup audio probe accepted `aplay` while `MusicManager` hardcoded `mpv`, so on an `aplay`-only Linux box startup reported "audio OK" and music silently failed. The fix routes both the probe and playback through one resolver (`audioPlayers.ts`) — the same "greppability / one source of truth" discipline §15 applied to cross-surface contracts, here applied to a capability check and the code that depends on it. **A startup check that doesn't share its resolution logic with the runtime is theater.** + +### The meta-lesson + +Audit the seams, not the cores. The reducer had 270 tests and zero of these bugs; the alarm scheduler, the socket lifecycle, the registry's storage backing, and the client/launcher URL defaults had thin coverage and all of the bugs. **A green core is not a green system. Point the next audit at whatever the unit tests mock — that mock is a list of the assumptions nobody is checking.** + +--- + +## 22. Aligning With the Platform's Idioms (the Cloudflare best-practices round) + +The §21 re-audit, run against Cloudflare's published Durable Object best practices, surfaced a second tier of issues — not correctness bugs but *idiom and cost* gaps where the code fought the platform instead of using it. Five were fixed test-first; two (sharding the global-singleton Matchmaker, moving the registry to SQLite rows) were captured in `docs/TODO.md` as deferred scaling work. The throughline: **the platform already solved most of these; the fix is to stop hand-rolling around it.** + +### RPC over fetch — and keep fetch only where the platform forces it + +Every Worker→DO and DO→DO call was `stub.fetch(new Request('https://internal/…'))` with hand-rolled path routing and JSON parsing — the legacy pattern. With `compatibility_date >= 2024-04-03` (this project: 2026-04-29), DOs expose typed RPC methods directly. The Matchmaker became `register/unregister/find/getRoomInfo` RPC methods; callers invoke them on the stub with full type-checking and no JSON surgery. The one call that *stayed* fetch is the WebSocket upgrade — RPC can't return a 101 — so GameRoom keeps a fetch handler for that and only that. + +**Migration tactic that bounded the blast radius:** keep a thin `fetch` adapter on the Matchmaker that delegates to the RPC methods. Production switched to RPC; the 30 existing unit tests and the PBT harness helpers kept driving the same logic over a Request, unchanged. The only test churn was the *binding stubs* — `env.MATCHMAKER.get()` had to return the RPC surface instead of `{ fetch }` — because those simulate the platform boundary the production code now crosses differently. **When you change how a seam is crossed, the mocks of that seam are exactly what breaks; nothing else needs to.** + +### A throwing alarm is a platform retry-storm waiting to happen + +Cloudflare *retries* an alarm whose handler throws. With a 30Hz alarm, a single reducer exception becomes a retry storm against poisoned state. The fix wraps the alarm body in an error boundary that logs a wide event, re-arms *deliberately* with a 1s backoff (not the 33ms cadence, not the platform's blind retry), and after 10 consecutive failures gives up and ends the game. **Bounded failure beats both an unhandled throw and an infinite hot loop. If the platform's default on error is "retry," catching is how you choose the recovery policy instead of inheriting it.** + +### Auto-response is free hibernation — but it bypasses your handler + +The app sent a `{type:'ping'}` *data* message every 30s, and the code comment admitted it "intentionally wakes a hibernated DO." Cloudflare's `setWebSocketAutoResponse` answers a fixed ping with a fixed pong *in the runtime, without waking the DO* — so idle lobbies hibernate through keepalives. Switching to it was easy; the trap was the **cross-feature interaction**: auto-responded pings never reach `webSocketMessage`, so they stopped bumping the phantom-reap's `lastActiveTick` (Lesson §20, Option B). An idle-but-alive player in active play would have been wrongly reaped. The runtime stamps each socket's last auto-response (`getWebSocketAutoResponseTimestamp`), so the reap now reconciles liveness from that before culling. **A hibernation optimization silently changed which code path observes liveness. When you let the platform handle something your code used to see, find everything that depended on seeing it.** (This is why the pong's now-unused `serverTime` could be dropped — confirming a field is dead before relying on a static response.) + +### `waitUntil` is the difference between fire-and-forget and fire-and-hope + +`fireAndForget` caught rejections but didn't extend the DO's lifetime, so an eviction between the handler returning and the task settling would silently drop it — a lost registry update reintroducing the very matchmaker/reality drift §20 fought. `this.ctx.waitUntil(task)` keeps the DO alive until it completes. **On a platform that evicts aggressively, an un-awaited promise with no `waitUntil` is not "background work" — it's "work that probably won't happen."** + +### Per-request context belongs in the request, never in a module global + +Region (the edge colo) was stashed in `globalThis.CF_REGION` at the Worker entry. Two bugs in one line: a DO runs in its *own isolate*, so it never saw the Worker isolate's global (every DO log lacked region); and within one isolate, concurrent requests clobber a shared global (mis-attributed region). The fix threads region explicitly — as an RPC argument and as a WS-upgrade header — and the logger reads it only from the per-call data. **A module-level global is per-isolate mutable state; it is never a safe place for per-request values, and across an isolate boundary it is invisible. Thread the context; don't stash it.** + +### The mock is still the list of untested assumptions + +Every one of these five fixes required widening the hand-rolled `cloudflare:workers` mock toward the real platform: `getAlarm` (min-merge), `setWebSocketAutoResponse` + `getWebSocketAutoResponseTimestamp`, `waitUntil`, and RPC method stubs on the bindings. That list — the methods the mock *didn't* have — is precisely the set of platform behaviours nothing was exercising. It reinforces §21's standing recommendation: a `vitest-pool-workers` (workerd) smoke suite would have made every one of these fixes testable against the real runtime instead of a mock we keep teaching, one incident at a time, what the platform already does. diff --git a/README.md b/README.md index 14a5d82..b917bfe 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ Multiplayer Space Invaders clone (1–4 players) built with OpenTUI and Cloudflare Durable Objects. **Play in your terminal or your browser** — one authoritative server, two frontends, same game. -![Version: 1.1.1](https://img.shields.io/badge/version-1.1.1-blue) +![Version: 1.2.0](https://img.shields.io/badge/version-1.2.0-blue) ![Terminal Size: 120x36](https://img.shields.io/badge/terminal-120x36-blue) ![License: MIT](https://img.shields.io/badge/license-MIT-green) diff --git a/client-core/src/connection/liveness.test.ts b/client-core/src/connection/liveness.test.ts new file mode 100644 index 0000000..129b1dd --- /dev/null +++ b/client-core/src/connection/liveness.test.ts @@ -0,0 +1,78 @@ +// client-core/src/connection/liveness.test.ts +// Unit tests for the connection liveness model. +// +// Regression home for the "ping watchdog kills every reconnection" bug: +// the watchdog measured staleness from the last pong only, and the onopen +// handler never refreshed it — so a freshly reconnected socket inherited a +// stale pong timestamp from the dead socket and was closed ~30s later, every +// time, forever. The fix makes "alive" a first-class mark set on BOTH events +// that prove liveness (socket open AND pong), so the reset cannot be forgotten. + +import { describe, test, expect } from 'bun:test' +import { + PING_INTERVAL, + PONG_TIMEOUT, + LIVENESS_TIMEOUT, + createLiveness, + markAlive, + isConnectionStale, +} from './liveness' + +describe('liveness constants', () => { + test('PING_INTERVAL is 30s and PONG_TIMEOUT is 5s', () => { + expect(PING_INTERVAL).toBe(30000) + expect(PONG_TIMEOUT).toBe(5000) + }) + + test('LIVENESS_TIMEOUT is the sum (35s)', () => { + expect(LIVENESS_TIMEOUT).toBe(35000) + expect(LIVENESS_TIMEOUT).toBe(PING_INTERVAL + PONG_TIMEOUT) + }) +}) + +describe('isConnectionStale — both directions', () => { + test('fresh liveness is not stale at the first ping check', () => { + const l = createLiveness(1000) + // The watchdog runs one PING_INTERVAL after the mark; that is NOT stale. + expect(isConnectionStale(l, 1000 + PING_INTERVAL)).toBe(false) + }) + + test('not stale exactly at the timeout boundary', () => { + const l = createLiveness(0) + expect(isConnectionStale(l, LIVENESS_TIMEOUT)).toBe(false) + }) + + test('stale once strictly past the timeout', () => { + const l = createLiveness(0) + expect(isConnectionStale(l, LIVENESS_TIMEOUT + 1)).toBe(true) + }) +}) + +describe('markAlive resets the staleness clock', () => { + test('a pong refreshes liveness so an old socket stays open', () => { + const l = createLiveness(0) + expect(isConnectionStale(l, LIVENESS_TIMEOUT + 1)).toBe(true) // would close... + markAlive(l, LIVENESS_TIMEOUT + 1) // ...but a pong just arrived + expect(isConnectionStale(l, LIVENESS_TIMEOUT + 1)).toBe(false) + expect(isConnectionStale(l, 2 * LIVENESS_TIMEOUT)).toBe(false) + }) + + test('REGRESSION: reopening marks alive, so a stale prior pong cannot close the new socket', () => { + // Simulate the exact bug: a pong from a previous socket at t=1000, then a + // long outage, then a reconnect 5 minutes later. + const priorPong = 1000 + const reconnectAt = priorPong + 5 * 60_000 + + const l = createLiveness(priorPong) + // Without a reset, the watchdog would see the new socket as long-dead: + expect(isConnectionStale(l, reconnectAt)).toBe(true) + + // onopen marks the connection alive — this is the fix. + markAlive(l, reconnectAt) + + // The first watchdog tick is one ping interval after open: must NOT close. + expect(isConnectionStale(l, reconnectAt + PING_INTERVAL)).toBe(false) + // Only after a genuine silence past the timeout does it go stale again. + expect(isConnectionStale(l, reconnectAt + LIVENESS_TIMEOUT + 1)).toBe(true) + }) +}) diff --git a/client-core/src/connection/liveness.ts b/client-core/src/connection/liveness.ts new file mode 100644 index 0000000..085b895 --- /dev/null +++ b/client-core/src/connection/liveness.ts @@ -0,0 +1,47 @@ +// client-core/src/connection/liveness.ts +// Connection liveness model shared by the WebSocket hook's heartbeat watchdog. +// +// The watchdog must answer one question: "has this connection gone silent long +// enough that we should give up and reconnect?" The subtlety that caused a +// production bug is that *two* distinct events prove a connection is alive: +// 1. a `pong` arriving, and +// 2. the socket `open`ing (a brand-new socket has never received a pong, so +// the previous socket's pong timestamp says nothing about it). +// +// Modelling "alive" as an explicit mark set on BOTH events makes the reset +// impossible to forget: staleness is always measured from the last proof of +// life, never from a timestamp belonging to a socket that is already gone. + +/** Application-level heartbeat interval (browsers can't send protocol pings). */ +export const PING_INTERVAL = 30000 +/** Grace period after a missed ping before the connection is presumed dead. */ +export const PONG_TIMEOUT = 5000 +/** Total silence tolerated before the watchdog closes the socket. */ +export const LIVENESS_TIMEOUT = PING_INTERVAL + PONG_TIMEOUT + +/** Mutable liveness handle. `lastSeen` is the timestamp of the last proof of life. */ +export interface Liveness { + lastSeen: number +} + +/** Create a liveness handle marked alive as of `now` (call when a socket opens). */ +export function createLiveness(now: number): Liveness { + return { lastSeen: now } +} + +/** + * Record a proof of life. Call on socket `open` AND on every `pong`. Both are + * evidence the connection is reachable; either one resets the staleness clock. + */ +export function markAlive(liveness: Liveness, now: number): void { + liveness.lastSeen = now +} + +/** + * True when the connection has been silent strictly longer than LIVENESS_TIMEOUT. + * The boundary is inclusive-alive: exactly LIVENESS_TIMEOUT of silence is still + * considered alive (matches the pre-existing watchdog threshold). + */ +export function isConnectionStale(liveness: Liveness, now: number): boolean { + return now - liveness.lastSeen > LIVENESS_TIMEOUT +} diff --git a/client-core/src/connection/useGameConnection.ts b/client-core/src/connection/useGameConnection.ts index 8ea3a3b..6b4a26d 100644 --- a/client-core/src/connection/useGameConnection.ts +++ b/client-core/src/connection/useGameConnection.ts @@ -11,13 +11,14 @@ import type { InputState, } from '../../../shared/types' import { applyPlayerInput } from '../../../shared/types' +import { type Liveness, createLiveness, markAlive, isConnectionStale, PING_INTERVAL } from './liveness' // Application-level heartbeat. Browsers cannot send WebSocket protocol ping // frames, so this data message intentionally wakes a hibernated Durable Object // about every 30s. For Vaders' low traffic, reliable phantom-player detection // and user-visible reconnect behavior are worth that small idle-room cost. -const PING_INTERVAL = 30000 -const PONG_TIMEOUT = 5000 +// PING_INTERVAL / PONG_TIMEOUT / staleness live in ./liveness so the heartbeat +// reset on (re)open cannot drift from the staleness check that consumes it. const SYNC_INTERVAL_MS = 33 // Expected sync rate for lerp calculation function rejoinStorageKey(roomUrl: string): string { @@ -86,7 +87,10 @@ export function useGameConnection(roomUrl: string, playerName: string) { }) const wsRef = useRef(null) - const lastPongRef = useRef(Date.now()) + // Liveness is marked alive on BOTH socket-open and pong (see ./liveness). + // Resetting on open is what stops a stale pong from a previous socket from + // tearing down a freshly reconnected one. + const livenessRef = useRef(createLiveness(Date.now())) const pingIntervalRef = useRef | null>(null) const localInputRef = useRef({ left: false, right: false }) const rejoinTokenRef = useRef(null) @@ -113,6 +117,9 @@ export function useGameConnection(roomUrl: string, playerName: string) { ws.onopen = () => { // Successful connection (or reconnection) - reset attempt counter reconnectAttemptRef.current = 0 + // Mark alive on open: a brand-new socket has never received a pong, + // so the previous socket's timestamp must not count against it. + markAlive(livenessRef.current, Date.now()) setState((s) => ({ ...s, connected: true, reconnecting: false, error: null })) // Rejoin if we have a room-scoped token; otherwise perform a fresh join. @@ -129,7 +136,7 @@ export function useGameConnection(roomUrl: string, playerName: string) { clearInterval(pingIntervalRef.current) } pingIntervalRef.current = setInterval(() => { - if (Date.now() - lastPongRef.current > PING_INTERVAL + PONG_TIMEOUT) { + if (isConnectionStale(livenessRef.current, Date.now())) { ws.close() return } @@ -142,7 +149,7 @@ export function useGameConnection(roomUrl: string, playerName: string) { const msg: ServerMessage = JSON.parse(event.data) if (msg.type === 'pong') { - lastPongRef.current = Date.now() + markAlive(livenessRef.current, Date.now()) return } @@ -192,8 +199,15 @@ export function useGameConnection(roomUrl: string, playerName: string) { }) return } + + // Unknown message type: don't silently drop it — surface a dev + // signal so protocol drift (e.g. a server sending a new type before + // the client is updated) is visible instead of vanishing. + console.warn('[vaders] ignoring unknown server message type:', (msg as { type?: unknown })?.type) } catch { - // Invalid JSON + // Malformed/unparseable frame. Warn rather than swallow silently — + // corruption or a non-JSON payload would otherwise go unnoticed. + console.warn('[vaders] dropped unparseable server message') } } diff --git a/client/src/App.serverUrl.test.ts b/client/src/App.serverUrl.test.ts new file mode 100644 index 0000000..91ce7ca --- /dev/null +++ b/client/src/App.serverUrl.test.ts @@ -0,0 +1,39 @@ +// client/src/App.serverUrl.test.ts +// Tests for server URL resolution. +// +// Bug context: App.tsx defaulted SERVER_URL to 'http://localhost:8787', but the +// launcher bin/vaders.ts defaults to the production URL. Running the client +// directly (not via the launcher) silently targeted a dead localhost. The fix +// makes the in-app default coherent with the launcher's production default, +// while keeping the VADERS_SERVER override intact. Resolution is extracted into +// a pure helper so it can be tested in isolation (correctness by construction). +// +// The helper lives in its own module (./serverUrl) rather than App.tsx because +// App.tsx imports @opentui/react, which is not importable in the test runtime. + +import { describe, test, expect } from 'bun:test' +import { resolveServerUrl, PRODUCTION_SERVER_URL } from './serverUrl' + +describe('resolveServerUrl', () => { + test('PRODUCTION_SERVER_URL matches the launcher default (bin/vaders.ts)', () => { + expect(PRODUCTION_SERVER_URL).toBe('https://vaders.adewale-883.workers.dev') + }) + + test('falls back to the production URL when VADERS_SERVER is unset', () => { + expect(resolveServerUrl(undefined)).toBe(PRODUCTION_SERVER_URL) + }) + + test('falls back to the production URL when VADERS_SERVER is empty string', () => { + // An empty env var is effectively "unset" and must not become the URL. + expect(resolveServerUrl('')).toBe(PRODUCTION_SERVER_URL) + }) + + test('does NOT silently default to localhost', () => { + expect(resolveServerUrl(undefined)).not.toContain('localhost') + }) + + test('VADERS_SERVER override wins when set', () => { + expect(resolveServerUrl('http://localhost:8787')).toBe('http://localhost:8787') + expect(resolveServerUrl('https://example.com')).toBe('https://example.com') + }) +}) diff --git a/client/src/App.tsx b/client/src/App.tsx index 9c34be6..2d33b0f 100644 --- a/client/src/App.tsx +++ b/client/src/App.tsx @@ -16,9 +16,12 @@ import { debugLog, clearDebugLog } from './debug' import { useTerminalSize, STANDARD_WIDTH, STANDARD_HEIGHT } from './hooks/useTerminalSize' import { useGameAudio, playShootSound, playMenuNavigateSound, playMenuSelectSound } from './hooks/useGameAudio' import { AudioManager, MusicManager } from './audio' +import { resolveServerUrl } from './serverUrl' const VERSION = '1.0.0' -const SERVER_URL = process.env.VADERS_SERVER ?? 'http://localhost:8787' +// Default to the production server (matching the launcher in bin/vaders.ts) so +// running the client directly doesn't silently target a dead localhost. +const SERVER_URL = resolveServerUrl(process.env.VADERS_SERVER) const LOG_PATH = process.env.VADERS_LOG_PATH ?? '' /** Construct WebSocket URL for a room from the HTTP server URL */ diff --git a/client/src/audio/MusicManager.ts b/client/src/audio/MusicManager.ts index 44791a6..f78f8ca 100644 --- a/client/src/audio/MusicManager.ts +++ b/client/src/audio/MusicManager.ts @@ -6,6 +6,7 @@ import { join, dirname } from 'path' import { existsSync } from 'fs' import { fileURLToPath } from 'url' import { getUserConfig, setUserConfig } from '../config/userConfig' +import { resolveMusicPlayer, isPlayerAvailable } from './audioPlayers' // Get the directory of this file (works in both Bun and Node ESM) const __filename = fileURLToPath(import.meta.url) @@ -14,7 +15,8 @@ const MUSIC_PATH = join(__dirname, '../../sounds/background-music.mp3') /** * MusicManager - Handles background music playback with looping - * Uses system audio player (afplay on macOS, mpv/aplay on Linux) + * Resolves the platform audio player via resolveMusicPlayer() (the same source + * of truth the startup check uses), so detection and playback never disagree. */ class MusicManager { private static instance: MusicManager @@ -52,10 +54,12 @@ class MusicManager { return } - // Pre-flight: verify audio player binary exists - const player = process.platform === 'darwin' ? 'afplay' : 'mpv' - if (!MusicManager.isPlayerAvailable(player)) { - this.lastError_ = `Audio player not found: ${player}` + // Pre-flight: resolve an audio player that is ACTUALLY present on this + // system (same resolver the startup check uses). If none exist, degrade + // gracefully — set an error and no-op, exactly as SFX fall back to the bell. + const resolved = resolveMusicPlayer() + if (!resolved) { + this.lastError_ = 'Audio player not found for background music (tried mpv/ffplay/...)' return } @@ -69,12 +73,17 @@ class MusicManager { private async playLoop(): Promise { while (this.shouldLoop && this.isPlaying && !this.muted) { try { - // Use afplay on macOS, mpv or aplay on Linux - const player = process.platform === 'darwin' ? 'afplay' : 'mpv' - const args = process.platform === 'darwin' ? [MUSIC_PATH] : ['--no-video', '--really-quiet', MUSIC_PATH] + // Resolve the player+args from the shared source of truth every loop + // (cheap, and tolerant of a player being installed/removed mid-session). + const resolved = resolveMusicPlayer() + if (!resolved) { + this.lastError_ = 'Audio player not found for background music (tried mpv/ffplay/...)' + break + } + const { player, args } = resolved this.process = spawn({ - cmd: [player, ...args], + cmd: [player, ...args, MUSIC_PATH], stdout: 'ignore', stderr: 'pipe', }) @@ -172,17 +181,15 @@ class MusicManager { /** * Check whether an audio player binary is available on the system. - * Uses `which` to verify the binary exists in PATH. - * @param player - Binary name to check, defaults to platform default (afplay/mpv) + * Delegates to the shared isPlayerAvailable so detection logic lives in one + * place. When no binary is given, reports whether ANY music player resolves. + * @param player - Binary name to check; defaults to "any resolvable player". */ static isPlayerAvailable(player?: string): boolean { - const cmd = player ?? (process.platform === 'darwin' ? 'afplay' : 'mpv') - try { - const result = Bun.spawnSync({ cmd: ['which', cmd] }) - return result.exitCode === 0 - } catch { - return false + if (player !== undefined) { + return isPlayerAvailable(player) } + return resolveMusicPlayer() !== null } } diff --git a/client/src/audio/audioPlayers.test.ts b/client/src/audio/audioPlayers.test.ts new file mode 100644 index 0000000..2ccf0d9 --- /dev/null +++ b/client/src/audio/audioPlayers.test.ts @@ -0,0 +1,114 @@ +// client/src/audio/audioPlayers.test.ts +// Tests for the single source of truth that resolves the platform audio player(s) +// actually available at runtime. This module is shared by BOTH the startup probe +// and MusicManager so they can never disagree about which binary plays music. +// +// Bug context: startup probed `aplay` (via getAudioPlayer) but MusicManager +// hardcoded `mpv`. On a Linux box with aplay-but-not-mpv, startup passed yet +// music silently failed. The fix funnels both through resolveMusicPlayer(). + +import { describe, test, expect } from 'bun:test' +import { + isPlayerAvailable, + resolveMusicPlayer, + MUSIC_PLAYER_CANDIDATES, + type ResolvedPlayer, +} from './audioPlayers' + +// ─── isPlayerAvailable (shared `which` check) ──────────────────────────────── + +describe('isPlayerAvailable', () => { + test('returns true for a known binary (echo)', () => { + expect(isPlayerAvailable('echo')).toBe(true) + }) + + test('returns false for a nonexistent binary', () => { + expect(isPlayerAvailable('__nonexistent_player_xyz_9999__')).toBe(false) + }) + + test('returns false for empty string', () => { + expect(isPlayerAvailable('')).toBe(false) + }) +}) + +// ─── Candidate table ───────────────────────────────────────────────────────── + +describe('MUSIC_PLAYER_CANDIDATES', () => { + test('provides a non-empty candidate list for darwin and linux', () => { + expect(MUSIC_PLAYER_CANDIDATES.darwin.length).toBeGreaterThan(0) + expect(MUSIC_PLAYER_CANDIDATES.linux.length).toBeGreaterThan(0) + }) + + test('macOS candidates start with afplay', () => { + expect(MUSIC_PLAYER_CANDIDATES.darwin[0].player).toBe('afplay') + }) + + test('every candidate carries the args used to play an mp3', () => { + for (const list of Object.values(MUSIC_PLAYER_CANDIDATES)) { + for (const c of list) { + expect(typeof c.player).toBe('string') + expect(c.player.length).toBeGreaterThan(0) + expect(Array.isArray(c.args)).toBe(true) + } + } + }) +}) + +// ─── resolveMusicPlayer: present → used ────────────────────────────────────── + +describe('resolveMusicPlayer (present → used)', () => { + test('returns the first candidate whose binary is present in PATH', () => { + // Inject a candidate list whose first entry is missing and second is present. + const candidates: ResolvedPlayer[] = [ + { player: '__nope_aaa__', args: ['x'] }, + { player: 'echo', args: ['--played'] }, + ] + const resolved = resolveMusicPlayer(candidates) + expect(resolved).not.toBeNull() + expect(resolved!.player).toBe('echo') + expect(resolved!.args).toEqual(['--played']) + }) + + test('prefers the earliest available candidate (ordering is honored)', () => { + const candidates: ResolvedPlayer[] = [ + { player: 'echo', args: ['first'] }, + { player: 'echo', args: ['second'] }, + ] + const resolved = resolveMusicPlayer(candidates) + expect(resolved!.args).toEqual(['first']) + }) +}) + +// ─── resolveMusicPlayer: absent → fallback / degrade ───────────────────────── + +describe('resolveMusicPlayer (absent → fallback / degrade)', () => { + test('falls back past a missing preferred binary to an available one', () => { + // Models the real bug: preferred player (mpv) missing, fallback present. + const candidates: ResolvedPlayer[] = [ + { player: '__missing_mpv__', args: ['--no-video'] }, + { player: '__missing_ffplay__', args: ['-nodisp'] }, + { player: 'echo', args: ['fallback'] }, + ] + const resolved = resolveMusicPlayer(candidates) + expect(resolved!.player).toBe('echo') + }) + + test('returns null when NONE of the candidates exist (graceful no-op)', () => { + const candidates: ResolvedPlayer[] = [ + { player: '__missing_a__', args: [] }, + { player: '__missing_b__', args: [] }, + ] + expect(resolveMusicPlayer(candidates)).toBeNull() + }) + + test('default (no arg) resolution returns null or a present binary, never a phantom', () => { + // Whatever it resolves to on THIS machine must actually be present. + const resolved = resolveMusicPlayer() + if (resolved !== null) { + expect(isPlayerAvailable(resolved.player)).toBe(true) + } else { + // Null is acceptable (no player installed); that's the graceful path. + expect(resolved).toBeNull() + } + }) +}) diff --git a/client/src/audio/audioPlayers.ts b/client/src/audio/audioPlayers.ts new file mode 100644 index 0000000..8477e22 --- /dev/null +++ b/client/src/audio/audioPlayers.ts @@ -0,0 +1,85 @@ +// client/src/audio/audioPlayers.ts +// Single source of truth for resolving the platform audio player used to play +// BACKGROUND MUSIC (mp3). Both the startup probe (startup.ts) and the runtime +// player (MusicManager.ts) resolve through here so they can never disagree +// about which binary will actually play music. +// +// Why this exists: previously startup probed the SFX player (`aplay` on Linux, +// from terminal/getAudioPlayer) while MusicManager hardcoded `mpv`. On a Linux +// box that had `aplay` but not `mpv`, startup reported "music available" yet +// playback silently failed. Funnelling both paths through one resolver makes +// that class of mismatch structurally impossible. + +import { spawnSync } from 'bun' + +/** A music player binary plus the args needed to play an mp3 file with it. */ +export interface ResolvedPlayer { + player: string + args: string[] +} + +/** + * Check whether a binary is available on PATH. + * Shared by both the startup probe and MusicManager so they use one impl. + */ +export function isPlayerAvailable(player: string): boolean { + if (!player) return false + try { + const result = spawnSync({ cmd: ['which', player], stdout: 'ignore', stderr: 'ignore' }) + return result.exitCode === 0 + } catch { + return false + } +} + +/** + * Candidate music players per platform, in preference order. + * + * Ordering rationale (Linux): list mp3-capable players first. `aplay` is kept + * last as a best-effort entry — it only decodes WAV, but if it is the only + * thing installed we still resolve to *something present* rather than a phantom + * binary, and MusicManager surfaces any playback error via getLastError(). + * + * The args are tuned to keep playback quiet/headless (no video window, no + * verbose logging) where the player supports it. + */ +export const MUSIC_PLAYER_CANDIDATES: Record<'darwin' | 'linux', ResolvedPlayer[]> = { + darwin: [{ player: 'afplay', args: [] }], + linux: [ + { player: 'mpv', args: ['--no-video', '--really-quiet'] }, + { player: 'ffplay', args: ['-nodisp', '-autoexit', '-loglevel', 'quiet'] }, + { player: 'mpg123', args: ['-q'] }, + { player: 'cvlc', args: ['--play-and-exit', '--quiet'] }, + { player: 'paplay', args: [] }, + { player: 'aplay', args: ['-q'] }, + ], +} + +/** + * Get the ordered candidate list for the current platform. + * Returns an empty list on unsupported platforms (e.g. Windows). + */ +export function getMusicPlayerCandidates(): ResolvedPlayer[] { + if (process.platform === 'darwin') return MUSIC_PLAYER_CANDIDATES.darwin + if (process.platform === 'linux') return MUSIC_PLAYER_CANDIDATES.linux + return [] +} + +/** + * Resolve the music player to actually use: the first candidate whose binary is + * present on PATH. Returns null when none are installed (graceful no-op / the + * "degrade" path, mirroring how SFX fall back to the terminal bell). + * + * @param candidates - candidate list to resolve against. Defaults to the + * current platform's list; injectable for testing. + */ +export function resolveMusicPlayer( + candidates: ResolvedPlayer[] = getMusicPlayerCandidates(), +): ResolvedPlayer | null { + for (const candidate of candidates) { + if (isPlayerAvailable(candidate.player)) { + return candidate + } + } + return null +} diff --git a/client/src/serverUrl.ts b/client/src/serverUrl.ts new file mode 100644 index 0000000..4a35682 --- /dev/null +++ b/client/src/serverUrl.ts @@ -0,0 +1,29 @@ +// client/src/serverUrl.ts +// Single source of truth for resolving the game server URL. +// +// Bug context: App.tsx defaulted to 'http://localhost:8787' while the launcher +// bin/vaders.ts defaults to the production URL. Running the client directly +// (not via the launcher) silently targeted a dead localhost with no clear +// signal. The in-app default is now coherent with the launcher's production +// default. This lives in its own (OpenTUI-free) module so the resolution is +// unit-testable in isolation. + +/** + * Production server URL. MUST match the launcher default in bin/vaders.ts so a + * direct `bun run src/index.tsx` and `bun run vaders` point at the same server. + */ +export const PRODUCTION_SERVER_URL = 'https://vaders.adewale-883.workers.dev' + +/** + * Resolve the server URL from the VADERS_SERVER override, falling back to the + * production default. An unset OR empty override falls back (empty would + * otherwise yield a useless empty URL). + * + * @param override - typically process.env.VADERS_SERVER + */ +export function resolveServerUrl(override: string | undefined): string { + if (override && override.length > 0) { + return override + } + return PRODUCTION_SERVER_URL +} diff --git a/client/src/startup.test.ts b/client/src/startup.test.ts new file mode 100644 index 0000000..d2ccf6a --- /dev/null +++ b/client/src/startup.test.ts @@ -0,0 +1,48 @@ +// client/src/startup.test.ts +// Verifies the startup audio probe and the music subsystem agree on which +// binary plays background music — a single source of truth. +// +// Bug context: startup probed the SFX player (`aplay` on Linux via +// getAudioPlayer) while MusicManager hardcoded `mpv`. Startup could report +// "music available" while the binary music actually uses (mpv) was absent. +// The fix routes both through resolveMusicPlayer(). + +import { describe, test, expect } from 'bun:test' +import { resolveMusicPlayer, isPlayerAvailable } from './audio/audioPlayers' +import { runStartupChecks } from './startup' + +describe('startup music check uses the shared music-player resolver', () => { + test('reports musicAvailable=true only if the resolved music player is actually present', async () => { + const report = await runStartupChecks() + const resolved = resolveMusicPlayer() + + // The contract: startup must not claim music is available unless the + // binary music will actually use is present on this machine. + if (report.musicAvailable) { + expect(resolved).not.toBeNull() + expect(isPlayerAvailable(resolved!.player)).toBe(true) + } + }) + + test('musicAvailable is false when no music player binary exists', async () => { + const report = await runStartupChecks() + const resolved = resolveMusicPlayer() + if (resolved === null) { + // No usable player → must not advertise music as available. + expect(report.musicAvailable).toBe(false) + } + }) + + test('report includes a dedicated Music Player check naming the resolved binary (or "none")', async () => { + const report = await runStartupChecks() + const musicPlayerCheck = report.checks.find((c) => c.name === 'Music Player') + expect(musicPlayerCheck).toBeDefined() + + const resolved = resolveMusicPlayer() + // passed reflects presence; message names the binary so the two can't drift. + expect(musicPlayerCheck!.passed).toBe(resolved !== null) + if (resolved !== null) { + expect(musicPlayerCheck!.message).toContain(resolved.player) + } + }) +}) diff --git a/client/src/startup.ts b/client/src/startup.ts index 7958deb..55bf399 100644 --- a/client/src/startup.ts +++ b/client/src/startup.ts @@ -14,6 +14,7 @@ import { getTerminalRecommendation, shouldShowTerminalRecommendation, } from './terminal' +import { resolveMusicPlayer } from './audio/audioPlayers' const __filename = fileURLToPath(import.meta.url) const __dirname = dirname(__filename) @@ -115,6 +116,16 @@ export async function runStartupChecks(): Promise { message: musicExists ? 'background-music.mp3 found' : 'background-music.mp3 missing', }) + // Check 7b: Music player binary present. + // Use the SAME resolver MusicManager uses, so the startup report can never + // claim music works while the binary it will actually use is absent. + const resolvedMusicPlayer = resolveMusicPlayer() + checks.push({ + name: 'Music Player', + passed: resolvedMusicPlayer !== null, + message: resolvedMusicPlayer ? `${resolvedMusicPlayer.player} found` : 'no music player found (mpv/ffplay/...)', + }) + // Check 8: Audio playback test (actually play a short sound) let audioPlaybackWorks = false if (audioPlayerAvailable) { @@ -157,7 +168,9 @@ export async function runStartupChecks(): Promise { const allPassed = checks.every((c) => c.passed) const audioAvailable = audioPlayerAvailable && audioPlaybackWorks && missingSounds.length === 0 - const musicAvailable = musicExists && audioPlayerAvailable + // Music needs the file AND a music player binary that actually exists — gauged + // by the same resolver MusicManager plays through (not the SFX player probe). + const musicAvailable = musicExists && resolvedMusicPlayer !== null return { checks, diff --git a/docs/TODO.md b/docs/TODO.md index 35fb16b..808df12 100644 --- a/docs/TODO.md +++ b/docs/TODO.md @@ -257,3 +257,42 @@ Real hibernation has an alarm queue and can interleave alarm / ws-message / fetch tasks; the harness doesn't stress those races. Matters more as the game adds features that mutate DO state from multiple entry points. + +### Matchmaker is a global-singleton DO (scaling) + +`Matchmaker` is addressed by `idFromName('global')`, so every +`register` / `unregister` / `find` / `getRoomInfo` call across the whole +fleet funnels through one Durable Object. Cloudflare's DO guidance lists +this as an explicit anti-pattern ("Global singleton handling all traffic +→ Shard across multiple DOs"; "Chatty microservice (every request) → +Reconsider architecture"): a global singleton **never hibernates** +(continuous duration billing) and **bottlenecks at ~1000 req/s**. With +every active GameRoom re-registering ~every 60s during play (plus +join/leave/matchmake churn), a few hundred concurrent rooms approaches +that ceiling. + +**Recommended fix.** Shard the registry — e.g. N hash-bucketed +Matchmaker instances with a fan-in on `find()`, or colo-aware shards +(`request.cf.colo`) — OR move the registry to KV / D1, which are built +for read-heavy global lookup and don't hold a compute instance hot. +Deferred because current traffic is far below the bottleneck; the cap + +sweep (shipped in 1.2.0) bounds the blast radius until then. + +### Matchmaker registry write amplification (scaling) + +The registry persists as a SINGLE storage value: `register` / `find` / +`unregister` each rewrite the entire `rooms` object via +`storage.put('rooms', …)`. Cloudflare's storage guidance ("store records +as rows", "batch writes") makes the per-room approach the idiomatic one. +The `MAX_TRACKED_ROOMS` cap bounds the value *size* (against the 128 KiB +KV-value ceiling) but not the *write cost*, which is O(rooms) per +registration. + +**Recommended fix.** Move the registry to per-room SQLite rows +(`INSERT OR REPLACE INTO rooms …`, `DELETE …`, indexed `SELECT` for +`find`). This removes both the value-size ceiling and the whole-blob +rewrite, and pairs naturally with the sharding item above. Requires a +storage-backend migration (the DO is currently KV-backed via +`new_classes`); a new DO class with `new_sqlite_classes` plus a one-time +data migration is the clean path. Deferred for the same reason as +sharding. diff --git a/package.json b/package.json index a7b0585..58ce7ad 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "vaders", - "version": "1.1.1", + "version": "1.2.0", "description": "Multiplayer TUI Space Invaders clone (1-4 players) using OpenTUI and Cloudflare Durable Objects", "type": "module", "bin": { diff --git a/shared/protocol.ts b/shared/protocol.ts index a053a3b..2cd2f4b 100644 --- a/shared/protocol.ts +++ b/shared/protocol.ts @@ -51,7 +51,7 @@ export type ServerEvent = export type ServerMessage = | { type: 'sync'; state: GameState; playerId?: string; rejoinToken?: string; config?: GameConfig } | ServerEvent - | { type: 'pong'; serverTime: number } + | { type: 'pong'; serverTime?: number } // serverTime omitted by the runtime auto-response; no client reads it | { type: 'error'; code: ErrorCode; message: string } // Sync optimization: diff --git a/shared/tui-compat.contract.test.ts b/shared/tui-compat.contract.test.ts index 590e723..d3ea276 100644 --- a/shared/tui-compat.contract.test.ts +++ b/shared/tui-compat.contract.test.ts @@ -217,8 +217,10 @@ describe('Contract · layout boundaries', () => { expect(LAYOUT.PLAYER_MIN_X).toBeLessThanOrEqual(LAYOUT.PLAYER_MAX_X) const range = LAYOUT.PLAYER_MAX_X - LAYOUT.PLAYER_MIN_X expect(range).toBeGreaterThanOrEqual(100) - // Range must also leave room for the sprite — left + width <= screen - expect(LAYOUT.PLAYER_MAX_X + LAYOUT.PLAYER_WIDTH).toBeLessThanOrEqual(STANDARD_WIDTH) + // player.x is the sprite CENTER, so the right edge is PLAYER_MAX_X + half + // width. That edge must stay on screen (≤ last column), not PLAYER_MAX_X + + // full width (the old left-edge formula that left 4 columns unreachable). + expect(LAYOUT.PLAYER_MAX_X + HITBOX.PLAYER_HALF_WIDTH).toBeLessThanOrEqual(STANDARD_WIDTH - 1) }) test('LAYOUT player dimensions mirror SPRITE_SIZE.player', () => { diff --git a/shared/types.test.ts b/shared/types.test.ts index 71fdca3..0c5ffeb 100644 --- a/shared/types.test.ts +++ b/shared/types.test.ts @@ -5,6 +5,7 @@ import { describe, test, expect } from 'bun:test' import fc from 'fast-check' import { LAYOUT, + HITBOX, STANDARD_WIDTH, ALIEN_REGISTRY, constrainPlayerX, @@ -235,6 +236,23 @@ describe('LAYOUT Constants', () => { const range = LAYOUT.PLAYER_MAX_X - LAYOUT.PLAYER_MIN_X expect(range).toBeGreaterThan(50) // Should have decent movement range }) + + test('player bounds are symmetric — the playfield is not lopsided', () => { + // player.x is the CENTER of a PLAYER_WIDTH sprite (half-width 3). The + // reachable left margin (min center → left wall) must equal the right + // margin (right wall → max center), or the rightmost columns become + // unreachable. PLAYER_MAX_X was computed with the LEFT-edge formula + // (120 - 7 - 1 = 112), which left 4 columns dead on the right. + const half = HITBOX.PLAYER_HALF_WIDTH + const leftEdgeAtMin = LAYOUT.PLAYER_MIN_X - half + const rightEdgeAtMax = LAYOUT.PLAYER_MAX_X + half + expect(leftEdgeAtMin).toBe(0) // sprite can touch the left wall + expect(rightEdgeAtMax).toBe(STANDARD_WIDTH - 1) // ...and the right wall + // Symmetric margins: + const leftMargin = LAYOUT.PLAYER_MIN_X - 0 + const rightMargin = STANDARD_WIDTH - 1 - LAYOUT.PLAYER_MAX_X + expect(rightMargin).toBe(leftMargin) + }) }) // ─── seededRandom Property-Based Tests ────────────────────────────────────── diff --git a/shared/types.ts b/shared/types.ts index 2c1e568..fca3ad0 100644 --- a/shared/types.ts +++ b/shared/types.ts @@ -23,7 +23,7 @@ export const STANDARD_HEIGHT = 36 export const LAYOUT = { PLAYER_Y: 31, // Y position for player ships (5 rows from bottom) PLAYER_MIN_X: 3, // Left boundary for player movement (center-based: left edge = 3-3 = 0) - PLAYER_MAX_X: 112, // Right boundary for player movement (120 - 7 - 1) + PLAYER_MAX_X: 116, // Right boundary (center-based: right edge = 116+3 = 119 = last column). Mirrors PLAYER_MIN_X. PLAYER_WIDTH: 7, // Width of player sprite (2-line braille sprite) PLAYER_HEIGHT: 2, // Height of player sprite BULLET_SPAWN_OFFSET: 2, // Bullet spawns this far above player diff --git a/web/src/api/roomApi.test.ts b/web/src/api/roomApi.test.ts index ce451c9..94263a8 100644 --- a/web/src/api/roomApi.test.ts +++ b/web/src/api/roomApi.test.ts @@ -1,5 +1,5 @@ import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest' -import { createRoom, createSoloRoom, matchmake, getRoomInfo, buildWsUrl } from './roomApi' +import { createRoom, createSoloRoom, matchmake, getRoomInfo, buildWsUrl, deriveWsUrl } from './roomApi' // ─── Helpers ──────────────────────────────────────────────────────────────── @@ -126,6 +126,35 @@ describe('getRoomInfo', () => { // ─── 6. buildWsUrl converts https to wss ──────────────────────────────────── +describe('deriveWsUrl (scheme derivation, correctness by construction)', () => { + it('maps https→wss and http→ws and appends the room route', () => { + expect(deriveWsUrl('https://example.com', 'ROOM01')).toBe('wss://example.com/room/ROOM01/ws') + expect(deriveWsUrl('http://localhost:8787', 'ROOM01')).toBe('ws://localhost:8787/room/ROOM01/ws') + }) + + it('preserves the port', () => { + expect(deriveWsUrl('http://localhost:8787', 'ABC123')).toContain(':8787/') + }) + + it('preserves a path prefix on the server URL', () => { + expect(deriveWsUrl('https://example.com/api', 'ROOM01')).toBe('wss://example.com/api/room/ROOM01/ws') + }) + + it('normalizes an upper-case scheme (the string-replace approach did not)', () => { + // 'HTTPS://…'.replace('https://', …) is a no-op (case-sensitive), so the + // old implementation returned a non-ws URL. The URL parser normalizes it. + expect(deriveWsUrl('HTTPS://example.com', 'ROOM01')).toBe('wss://example.com/room/ROOM01/ws') + }) + + it('always yields a ws/wss URL ending in the room route', () => { + for (const base of ['https://a.b', 'http://a.b:1', 'https://a.b/x/y']) { + const url = deriveWsUrl(base, 'ZZ0099') + expect(url).toMatch(/^wss?:\/\//) + expect(url).toMatch(/\/room\/ZZ0099\/ws$/) + } + }) +}) + describe('buildWsUrl', () => { it('converts http(s) to ws(s)', () => { // SERVER_URL defaults to same-origin (window.location.origin in browsers). diff --git a/web/src/api/roomApi.ts b/web/src/api/roomApi.ts index 61beacd..4f97fdc 100644 --- a/web/src/api/roomApi.ts +++ b/web/src/api/roomApi.ts @@ -30,9 +30,23 @@ export async function getRoomInfo(code: string): Promise<{ status: string; playe return res.json() } +/** + * Derive the room WebSocket URL from an HTTP(S) server URL using the platform + * URL parser rather than substring surgery. `SERVER_URL.replace('http://', …)` + * is fragile: it is case-sensitive (an upper-case scheme silently fails) and + * would rewrite any later `http://` in the string. Parsing handles ports, + * path prefixes, and scheme normalization correctly. + */ +export function deriveWsUrl(serverUrl: string, roomCode: string): string { + const url = new URL(serverUrl) + url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:' + const prefix = url.pathname.replace(/\/$/, '') + url.pathname = `${prefix}/room/${roomCode}/ws` + return url.toString() +} + export function buildWsUrl(roomCode: string): string { - const base = SERVER_URL.replace('https://', 'wss://').replace('http://', 'ws://') - return `${base}/room/${roomCode}/ws` + return deriveWsUrl(SERVER_URL, roomCode) } /** diff --git a/web/src/components/LaunchScreen.test.tsx b/web/src/components/LaunchScreen.test.tsx index 41d69f8..6e7c1cc 100644 --- a/web/src/components/LaunchScreen.test.tsx +++ b/web/src/components/LaunchScreen.test.tsx @@ -294,7 +294,101 @@ describe('LaunchScreen menu sounds — property-based', () => { const POSITIVE_KEYS = ['ArrowUp', 'ArrowDown', 'Enter', '1', '2', '3', '4'] as const const NEGATIVE_KEYS = ['m', 'n', '?', 'Escape', 'a', 'z', 'Tab', 'Shift', 'Backspace'] as const - it('call count equals number of non-repeat positive key presses (PBT)', async () => { + /** + * Model of LaunchScreen's keydown handler, covering exactly the state that + * decides whether a key produces a menu sound: `joinMode` and + * `selectedIndex`. Join mode is entered by the '3' hotkey OR by Enter while + * JOIN ROOM (index 2) is selected; inside join mode NO key fires a sound, + * and Escape returns to the menu. + * + * The previous oracle was a linear count that special-cased only the + * literal key '3'. CI's fast-check seed -366879321 found the gap: + * ['2', 'ArrowDown', 'Enter', 'ArrowUp'] — '2' selects index 1, ArrowDown + * moves to index 2, Enter activates JOIN ROOM (entering join mode), and + * ArrowUp is then correctly silent. The component was right; the model was + * wrong. This oracle simulates the handler instead of counting keys. + */ + function expectedMenuSounds(events: ReadonlyArray<{ key: string; repeat: boolean }>): Array<'navigate' | 'select'> { + const expected: Array<'navigate' | 'select'> = [] + let joinMode = false + let selectedIndex = 0 + for (const e of events) { + if (e.repeat) continue + if (joinMode) { + // Room-code entry: every key is silent; Escape exits back to the menu. + if (e.key === 'Escape') joinMode = false + continue + } + const lowered = e.key.length === 1 ? e.key.toLowerCase() : e.key + if (lowered === 'm' || lowered === 'n') continue // audio toggles, no menu sound + switch (e.key) { + case 'ArrowDown': + expected.push('navigate') + selectedIndex = (selectedIndex + 1) % 4 + break + case 'ArrowUp': + expected.push('navigate') + selectedIndex = (selectedIndex + 3) % 4 + break + case 'Enter': + expected.push('select') + if (selectedIndex === 2) joinMode = true // JOIN ROOM + break + case '1': + expected.push('select') + selectedIndex = 0 + break + case '2': + expected.push('select') + selectedIndex = 1 + break + case '3': + expected.push('select') + selectedIndex = 2 + joinMode = true + break + case '4': + expected.push('select') + selectedIndex = 3 + break + // every other key: no sound + } + } + return expected + } + + function playSequence(events: ReadonlyArray<{ key: string; repeat: boolean }>): Array<'navigate' | 'select'> { + const calls: Array<'navigate' | 'select'> = [] + const { unmount } = render( + calls.push(kind)} + />, + ) + for (const e of events) { + fireEvent.keyDown(window, { key: e.key, repeat: e.repeat }) + } + unmount() + return calls + } + + it('REGRESSION: Enter on JOIN ROOM enters join mode and silences later menu keys (CI seed -366879321)', () => { + // The shrunk counterexample from the CI failure, pinned as a + // deterministic example so the case is exercised on every run + // regardless of the PBT's seed. + const calls = playSequence([ + { key: '2', repeat: false }, // select (CREATE ROOM hotkey), index -> 1 + { key: 'ArrowDown', repeat: false }, // navigate, index -> 2 (JOIN ROOM) + { key: 'Enter', repeat: false }, // select, activates JOIN ROOM -> join mode + { key: 'ArrowUp', repeat: false }, // join mode: silent + ]) + expect(calls).toEqual(['select', 'navigate', 'select']) + }) + + it('menu sound sequence matches the handler model for any key sequence (PBT)', async () => { const fc = await import('fast-check') await fc.assert( fc.asyncProperty( @@ -306,31 +400,16 @@ describe('LaunchScreen menu sounds — property-based', () => { { maxLength: 40 }, ), async (events) => { - const calls: Array<'navigate' | 'select'> = [] - const { unmount } = render( - calls.push(kind)} - />, - ) - // Pressing '3' opens join mode, which changes focus/handling — - // exclude it from the oracle's positive set for this property and - // skip generated events after join-mode opens. Simplest approach: - // filter out '3' entirely for this PBT (covered by explicit tests). - const filtered = events.filter((e) => e.key !== '3') - const expected = filtered.filter( - (e) => !e.repeat && (POSITIVE_KEYS as readonly string[]).includes(e.key), - ).length - for (const e of filtered) { - fireEvent.keyDown(window, { key: e.key, repeat: e.repeat }) - } - unmount() - return calls.length === expected + const calls = playSequence(events) + const expected = expectedMenuSounds(events) + // Full-sequence identity (kinds in order), not just a count — + // a navigate misreported as select would fail here. + return JSON.stringify(calls) === JSON.stringify(expected) }, ), + // 40 runs fits the 5s test timeout (each case renders + replays up to + // 40 events). Verified once at 1500 runs across fresh seeds; the known + // CI counterexample is pinned as the deterministic REGRESSION case above. { numRuns: 40 }, ) }) diff --git a/web/src/components/PlayerDepartureNotice.test.tsx b/web/src/components/PlayerDepartureNotice.test.tsx index 83e08f2..80bafcf 100644 --- a/web/src/components/PlayerDepartureNotice.test.tsx +++ b/web/src/components/PlayerDepartureNotice.test.tsx @@ -34,6 +34,11 @@ describe('PlayerDepartureNotice', () => { expect(toast).not.toBeNull() // Identify the departing slot/name in the message expect(toast.textContent).toMatch(/P2|Player 2|slot 2|left|disconnected/i) + // The toast is an accessible live status region: rendered as , + // which carries an implicit role="status" / aria-live="polite". Pins the + // semantics so a revert to a bare
fails here (and the biome + // lint/a11y/useSemanticElements error doesn't regress). + expect(screen.getByRole('status')).toBe(toast) }) it('auto-dismisses after 3 seconds', () => { diff --git a/web/src/components/PlayerDepartureNotice.tsx b/web/src/components/PlayerDepartureNotice.tsx index 59a9cc1..3cc50f1 100644 --- a/web/src/components/PlayerDepartureNotice.tsx +++ b/web/src/components/PlayerDepartureNotice.tsx @@ -78,10 +78,13 @@ export function PlayerDepartureNotice({ prevState, state }: PlayerDepartureNotic : `${departures.map(describeDeparture).join(', ')} left the game` return ( -
is the semantic element for a status region: it carries an + // implicit role="status" and aria-live="polite", so screen readers + // announce departures without an explicit ARIA role (biome + // lint/a11y/useSemanticElements). Layout is unaffected — the inline + // style sets display:flex. + {message} -
+ ) } diff --git a/web/src/connection-message-diagnostics.test.tsx b/web/src/connection-message-diagnostics.test.tsx new file mode 100644 index 0000000..dcd32f6 --- /dev/null +++ b/web/src/connection-message-diagnostics.test.tsx @@ -0,0 +1,77 @@ +// web/src/connection-message-diagnostics.test.tsx +// Regression: the WebSocket hook used to swallow malformed JSON and unknown +// server message types with an empty catch / no-op fall-through, giving zero +// signal when the protocol drifted (e.g. a server rollout sending a new type +// before the client understood it). The hook must not crash on bad input, and +// must emit a dev-visible warning rather than silently dropping it. + +import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest' +import { renderHook, act } from '@testing-library/react' +import { useGameConnection } from '../../client-core/src/connection/useGameConnection' + +class MockWebSocket { + static OPEN = 1 + static CLOSED = 3 + static instances: MockWebSocket[] = [] + readyState = 0 + onopen: ((e?: unknown) => void) | null = null + onclose: ((e?: unknown) => void) | null = null + onmessage: ((e: { data: string }) => void) | null = null + onerror: ((e?: unknown) => void) | null = null + send = vi.fn() + close = vi.fn() + constructor(public url: string) { + MockWebSocket.instances.push(this) + } + open() { + this.readyState = MockWebSocket.OPEN + this.onopen?.({}) + } + raw(data: string) { + this.onmessage?.({ data }) + } +} + +describe('useGameConnection message diagnostics', () => { + let warn: ReturnType + + beforeEach(() => { + MockWebSocket.instances = [] + vi.stubGlobal('WebSocket', MockWebSocket as unknown as typeof WebSocket) + warn = vi.spyOn(console, 'warn').mockImplementation(() => {}) + }) + + afterEach(() => { + warn.mockRestore() + vi.unstubAllGlobals() + }) + + it('warns (and does not throw) on malformed JSON', () => { + renderHook(() => useGameConnection('ws://localhost/room/ABC123/ws', 'Alice')) + const ws = MockWebSocket.instances[0] + act(() => ws.open()) + + expect(() => act(() => ws.raw('this is not json {{{'))).not.toThrow() + expect(warn).toHaveBeenCalled() + }) + + it('warns (and does not throw) on an unknown message type', () => { + renderHook(() => useGameConnection('ws://localhost/room/ABC123/ws', 'Bob')) + const ws = MockWebSocket.instances[0] + act(() => ws.open()) + warn.mockClear() + + expect(() => act(() => ws.raw(JSON.stringify({ type: 'totally_new_server_type', payload: 1 })))).not.toThrow() + expect(warn).toHaveBeenCalled() + }) + + it('does NOT warn on a well-formed known message (pong)', () => { + renderHook(() => useGameConnection('ws://localhost/room/ABC123/ws', 'Carol')) + const ws = MockWebSocket.instances[0] + act(() => ws.open()) + warn.mockClear() + + act(() => ws.raw(JSON.stringify({ type: 'pong', serverTime: 123 }))) + expect(warn).not.toHaveBeenCalled() + }) +}) diff --git a/web/src/connection-reconnect.test.tsx b/web/src/connection-reconnect.test.tsx new file mode 100644 index 0000000..1bc90df --- /dev/null +++ b/web/src/connection-reconnect.test.tsx @@ -0,0 +1,109 @@ +// web/src/connection-reconnect.test.tsx +// Integration regression for the "ping watchdog kills every reconnection" bug. +// +// This renders the REAL useGameConnection hook (from client-core) against a +// controllable mock WebSocket and fake timers, and asserts the hook does not +// inflict a spurious close on a freshly reconnected socket. A pure-math unit +// test (see client-core/src/connection/liveness.test.ts) cannot catch this — +// the bug was in the wiring (onopen forgetting to refresh liveness), so the +// regression must drive the wiring. + +import { describe, it, expect, vi, beforeEach, afterEach } from 'vitest' +import { renderHook, act } from '@testing-library/react' +import { useGameConnection } from '../../client-core/src/connection/useGameConnection' + +class MockWebSocket { + static CONNECTING = 0 + static OPEN = 1 + static CLOSING = 2 + static CLOSED = 3 + static instances: MockWebSocket[] = [] + + url: string + readyState = MockWebSocket.CONNECTING + onopen: ((e?: unknown) => void) | null = null + onclose: ((e?: unknown) => void) | null = null + onmessage: ((e: { data: string }) => void) | null = null + onerror: ((e?: unknown) => void) | null = null + send = vi.fn() + close = vi.fn(() => { + if (this.readyState === MockWebSocket.CLOSED) return + this.readyState = MockWebSocket.CLOSED + this.onclose?.({}) + }) + + constructor(url: string) { + this.url = url + MockWebSocket.instances.push(this) + } + + // Test drivers + open() { + this.readyState = MockWebSocket.OPEN + this.onopen?.({}) + } + message(data: unknown) { + this.onmessage?.({ data: JSON.stringify(data) }) + } +} + +describe('useGameConnection reconnect watchdog', () => { + beforeEach(() => { + MockWebSocket.instances = [] + vi.stubGlobal('WebSocket', MockWebSocket as unknown as typeof WebSocket) + vi.useFakeTimers() + }) + + afterEach(() => { + vi.useRealTimers() + vi.unstubAllGlobals() + }) + + it('does not self-close a freshly reconnected socket whose prior pong is stale', () => { + renderHook(() => useGameConnection('ws://localhost/room/ABC123/ws', 'Alice')) + + const first = MockWebSocket.instances[0] + expect(first).toBeDefined() + act(() => first.open()) + // A pong establishes the liveness baseline on the first socket. + act(() => first.message({ type: 'pong', serverTime: Date.now() })) + + // Outage: the socket drops after a long silence (well past the 35s window). + act(() => { + vi.advanceTimersByTime(60_000) + first.close() + }) + + // Reconnect backoff elapses and a second socket is created. + act(() => vi.advanceTimersByTime(2_000)) + const second = MockWebSocket.instances[1] + expect(second).toBeDefined() + + // The reconnect opens. With the fix, onopen refreshes liveness here; the + // stale baseline from `first` (now ~62s old) must NOT count against `second`. + act(() => second.open()) + second.close.mockClear() + + // One full ping interval passes. The watchdog must send a heartbeat ping, + // NOT close the healthy socket. + act(() => vi.advanceTimersByTime(30_000)) + + expect(second.close).not.toHaveBeenCalled() + expect(second.send).toHaveBeenCalledWith(JSON.stringify({ type: 'ping' })) + }) + + it('still closes a genuinely silent socket after the timeout (watchdog not disabled)', () => { + renderHook(() => useGameConnection('ws://localhost/room/ABC123/ws', 'Bob')) + + const ws = MockWebSocket.instances[0] + act(() => ws.open()) + ws.close.mockClear() + + // No pong ever arrives. After PING_INTERVAL the watchdog sends a ping; + // by the next interval (>35s of total silence) it must close. + act(() => vi.advanceTimersByTime(30_000)) // first tick: sends ping + expect(ws.close).not.toHaveBeenCalled() + act(() => vi.advanceTimersByTime(30_000)) // second tick: 60s silent -> close + expect(ws.close).toHaveBeenCalled() + }) +}) diff --git a/worker/src/GameRoom.test.ts b/worker/src/GameRoom.test.ts index 4410d86..90bae33 100644 --- a/worker/src/GameRoom.test.ts +++ b/worker/src/GameRoom.test.ts @@ -90,7 +90,13 @@ function createMockDurableObjectContext() { return { toArray: () => [] } } if (query.includes('DELETE FROM rejoin_sessions')) { - delete rejoinSessions[params[0] as string] + if (params.length > 0) { + // Targeted delete by token (consume/expire path). + delete rejoinSessions[params[0] as string] + } else { + // Bulk clear (cleanup path) — drop all tokens for the dead room. + for (const token of Object.keys(rejoinSessions)) delete rejoinSessions[token] + } return { toArray: () => [] } } if (query.includes('DELETE')) { @@ -103,6 +109,7 @@ function createMockDurableObjectContext() { setAlarm: vi.fn(async (time: number) => { alarm = time }), + getAlarm: vi.fn(async () => alarm), deleteAlarm: vi.fn(async () => { alarm = null }), @@ -116,6 +123,9 @@ function createMockDurableObjectContext() { webSockets.push(ws) }), getWebSockets: vi.fn(() => webSockets), + setWebSocketAutoResponse: vi.fn(), + getWebSocketAutoResponseTimestamp: vi.fn((_ws: MockWebSocket): Date | null => null), + waitUntil: vi.fn(), // Test helpers _sqlData: sqlData, _rejoinSessions: rejoinSessions, @@ -125,7 +135,14 @@ function createMockDurableObjectContext() { } function createMockEnv(): Env { - const matchmakerFetch = vi.fn(async () => new Response('OK')) + // GameRoom calls the Matchmaker over RPC (register / unregister). A stable + // stub object lets tests assert which RPC method was invoked. + const matchmakerStub = { + register: vi.fn(async () => ({ ok: true as const })), + unregister: vi.fn(async () => {}), + find: vi.fn(async () => null), + getRoomInfo: vi.fn(async () => null), + } return { GAME_ROOM: { @@ -134,9 +151,7 @@ function createMockEnv(): Env { } as any, MATCHMAKER: { idFromName: vi.fn((name: string) => ({ toString: () => `matchmaker-${name}` })), - get: vi.fn(() => ({ - fetch: matchmakerFetch, - })), + get: vi.fn(() => matchmakerStub), } as any, } } @@ -536,6 +551,46 @@ describe('WebSocket Message Handling', () => { // Alarm should be set expect(ctx.storage.setAlarm).toHaveBeenCalled() }) + + it('ignores ready during active play — cannot force a countdown that resets a live match', async () => { + // Security/state-machine regression: the ready handler had no status + // guard and did not route through the reducer's TRANSITIONS table. A + // scripted client could send `ready` mid-match; once every live player's + // id landed in readyPlayerIds, checkStartConditions() fired a countdown + // from `playing`, which on completion calls startGame() and wipes the + // match (tick/score/wave reset). Ready is only meaningful in `waiting`. + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws1 = createMockWebSocket() + const ws2 = createMockWebSocket() + ctx._webSockets.push(ws1, ws2) + + await joinPlayer(gameRoom, ws1, 'Player1') + await joinPlayer(gameRoom, ws2, 'Player2') + + // A match is in progress. + const game = (gameRoom as unknown as { game: GameState }).game + game.status = 'playing' + game.tick = 500 + game.score = 1234 + game.readyPlayerIds = [] + ws1.send.mockClear() + ws2.send.mockClear() + + // Attempt the exploit: both live players send `ready`. + await gameRoom.webSocketMessage(ws1 as any, JSON.stringify({ type: 'ready' })) + await gameRoom.webSocketMessage(ws2 as any, JSON.stringify({ type: 'ready' })) + + const after = (gameRoom as unknown as { game: GameState }).game + expect(after.status).toBe('playing') // not flipped to countdown + expect(after.countdownRemaining).toBeNull() + expect(after.tick).toBe(500) // match progress untouched + expect(after.score).toBe(1234) // score not wiped + const startedCountdown = [...ws1.send.mock.calls, ...ws2.send.mock.calls].some((call: unknown[]) => { + const msg = JSON.parse(call[0] as string) + return msg.type === 'event' && msg.name === 'countdown_tick' + }) + expect(startedCountdown).toBe(false) + }) }) describe('unready message', () => { @@ -735,6 +790,206 @@ describe('WebSocket Close Handling', () => { expect(cancelCall).toBeDefined() }) + it('drains and cleans up the room when players disconnect at game_over (no leak)', async () => { + // Regression for the "game_over roach motel": disconnecting at the + // game-over screen must remove the player so the room reaches 0 players + // and cleanup() runs, unregistering from the matchmaker. Previously + // PLAYER_LEAVE was blocked in game_over, so the room + registry leaked. + const { gameRoom, ctx, env } = await createInitializedGameRoom() + const ws1 = createMockWebSocket() + const ws2 = createMockWebSocket() + ctx._webSockets.push(ws1, ws2) + + await joinPlayer(gameRoom, ws1, 'Player1') + await joinPlayer(gameRoom, ws2, 'Player2') + + // Force the room into game_over (terminal screen both players are viewing). + const game = (gameRoom as unknown as { game: GameState }).game + game.status = 'game_over' + + // First disconnect at game_over removes exactly one player. + await gameRoom.webSocketClose(ws1 as any, 1006, 'closed', false) + expect(Object.keys((gameRoom as unknown as { game: GameState }).game.players)).toHaveLength(1) + + // Second disconnect empties the room. + await gameRoom.webSocketClose(ws2 as any, 1006, 'closed', false) + expect(Object.keys((gameRoom as unknown as { game: GameState }).game.players)).toHaveLength(0) + + // The empty-room alarm fires cleanup, which unregisters from the matchmaker + // (via RPC) and tears down state — the room no longer leaks. + await gameRoom.alarm() + const matchmakerStub = (env.MATCHMAKER as any).get() + expect(matchmakerStub.unregister).toHaveBeenCalled() + expect((gameRoom as unknown as { game: GameState | null }).game).toBeNull() + }) + + it('clears rejoin tokens when the room is cleaned up (no cross-room token leak)', async () => { + // rejoin_sessions rows were only deleted on expiry/consume, and cleanup() + // dropped game_state but left the token table behind. Each dead room's + // SQLite kept its tokens forever. Cleanup must clear them too. + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws = createMockWebSocket() + ctx._webSockets.push(ws) + + await joinPlayer(gameRoom, ws, 'Player1') // join mints a rejoin token + expect(Object.keys(ctx._rejoinSessions).length).toBeGreaterThan(0) + + // Player leaves → room empties → cleanup is scheduled; the alarm fires it. + await gameRoom.webSocketClose(ws as any, 1000, 'left', true) + await gameRoom.alarm() + + expect(Object.keys(ctx._rejoinSessions)).toHaveLength(0) + expect((gameRoom as unknown as { game: GameState | null }).game).toBeNull() + }) + + it('a mid-game reconnect does not clobber the game-tick alarm (no 5s freeze)', async () => { + // Regression for the single-alarm clobber: a Durable Object has ONE alarm + // and setAlarm overwrites. ensureUnauthenticatedSocketAlarm fired on every + // WS upgrade and set now+5s unconditionally — so a player reconnecting + // mid-match (the ?rejoin upgrade is allowed during `playing`) pushed the + // pending ~33ms game-tick alarm out to +5s, freezing the whole room's loop + // for 5 seconds. Cloudflare's documented "Multiple Events (Single Alarm)" + // pattern is to min-merge via getAlarm() — never replace a sooner alarm. + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws = createMockWebSocket() + ctx._webSockets.push(ws) + await joinPlayer(gameRoom, ws, 'Player1') + await gameRoom.webSocketMessage(ws as any, JSON.stringify({ type: 'start_solo' })) + await completeWipePhases(gameRoom) + expect((gameRoom as unknown as { game: GameState }).game.status).toBe('playing') + + // A game-tick alarm is pending ~33ms out. + const tickDeadline = ctx._alarm() + expect(tickDeadline).not.toBeNull() + expect(tickDeadline! - Date.now()).toBeLessThan(1000) + + // A fresh, not-yet-authenticated socket upgrades mid-game (reconnect that + // hasn't sent rejoin/join yet), triggering the unauthenticated-socket alarm. + const pending = createMockWebSocket() // no attachment -> unauthenticated + ctx._webSockets.push(pending) + await (gameRoom as unknown as { ensureUnauthenticatedSocketAlarm(): Promise }).ensureUnauthenticatedSocketAlarm() + + // The sooner game-tick alarm must survive — not be pushed out to +5s. + const after = ctx._alarm() + expect(after).toBe(tickDeadline) + expect(after! - Date.now()).toBeLessThan(1000) + }) + + it('keeps the DO alive for background registry updates via ctx.waitUntil', async () => { + // fireAndForget tasks (registry refresh, cleanup-alarm scheduling) used to + // run unprotected — an eviction between handler return and task completion + // would silently drop them (e.g. a lost registry update → matchmaker drift). + // They must be registered with ctx.waitUntil so the runtime keeps the DO + // alive until they settle. + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws = createMockWebSocket() + ctx._webSockets.push(ws) + await joinPlayer(gameRoom, ws, 'Player1') + await gameRoom.webSocketMessage(ws as any, JSON.stringify({ type: 'start_solo' })) + await completeWipePhases(gameRoom) + ;(ctx.waitUntil as Mock).mockClear() + + // forfeit → endGame → fireAndForget(registry update + cleanup alarm) + await gameRoom.webSocketMessage(ws as any, JSON.stringify({ type: 'forfeit' })) + + expect(ctx.waitUntil).toHaveBeenCalled() + }) + + describe('WebSocket auto-response (hibernation-friendly heartbeat)', () => { + it('registers a ping→pong auto-response so idle lobbies hibernate through keepalives', async () => { + const { gameRoom, ctx } = await createInitializedGameRoom() + void gameRoom + expect(ctx.setWebSocketAutoResponse).toHaveBeenCalled() + const pair = (ctx.setWebSocketAutoResponse as Mock).mock.calls[0][0] as { request: string; response: string } + expect(pair.request).toContain('ping') + expect(pair.response).toContain('pong') + }) + + it('reconciles heartbeat liveness from the auto-response timestamp so a pinging player is not reaped', async () => { + // Auto-responded pings never invoke webSocketMessage, so they cannot bump + // lastActiveTick directly. A player in active play who only pings (no + // input) for > IDLE_STALE_TICKS would be wrongly reaped unless the tick + // reconciles liveness from the runtime's per-socket auto-response stamp. + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws = createMockWebSocket() + ctx._webSockets.push(ws) + await joinPlayer(gameRoom, ws, 'Player1') + await gameRoom.webSocketMessage(ws as any, JSON.stringify({ type: 'start_solo' })) + await completeWipePhases(gameRoom) + + const game = (gameRoom as unknown as { game: GameState }).game + const playerId = Object.keys(game.players)[0] + // Make the player look long-idle (well past the 2400-tick reap threshold)… + game.players[playerId].lastActiveTick = game.tick - 5000 + // …but their socket pinged just now via the auto-response path. + ;(ctx.getWebSocketAutoResponseTimestamp as Mock).mockReturnValue(new Date()) + + ;(gameRoom as unknown as { tick(): void }).tick() + + const after = (gameRoom as unknown as { game: GameState }).game + expect(after.players[playerId]).toBeDefined() // not reaped + // Reconciliation bumped it out of the stale zone (it ran before the + // TICK reducer incremented, so it trails `after.tick` by one). + expect(after.tick - after.players[playerId].lastActiveTick!).toBeLessThanOrEqual(1) + }) + }) + + describe('alarm() error boundary', () => { + async function reachPlaying() { + const { gameRoom, ctx } = await createInitializedGameRoom() + const ws = createMockWebSocket() + ctx._webSockets.push(ws) + await joinPlayer(gameRoom, ws, 'Player1') + await gameRoom.webSocketMessage(ws as any, JSON.stringify({ type: 'start_solo' })) + await completeWipePhases(gameRoom) + return { gameRoom, ctx } + } + + it('catches a fault inside a tick, logs it, and re-arms with backoff instead of rejecting', async () => { + // Cloudflare retries a throwing alarm() — a reducer exception would + // otherwise retry-storm against a poisoned room. The handler must catch, + // emit a wide event, and re-arm deliberately (backed off), never reject. + const { gameRoom, ctx } = await reachPlaying() + expect((gameRoom as unknown as { game: GameState }).game.status).toBe('playing') + ;(gameRoom as unknown as { tick: () => void }).tick = () => { + throw new Error('boom') + } + const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {}) + + // Must resolve (not reject) even though the tick throws. + await expect(gameRoom.alarm()).resolves.toBeUndefined() + + const emittedAlarmError = logSpy.mock.calls.some( + (c) => typeof c[0] === 'string' && c[0].includes('"event":"alarm_error"'), + ) + expect(emittedAlarmError).toBe(true) + // Re-armed with a backoff (~1s), not the 33ms tick cadence and not abandoned. + const next = ctx._alarm() + expect(next).not.toBeNull() + expect(next! - Date.now()).toBeGreaterThan(500) + logSpy.mockRestore() + }) + + it('gives up and ends the game after repeated consecutive faults (bounded failure)', async () => { + const { gameRoom } = await reachPlaying() + ;(gameRoom as unknown as { tick: () => void }).tick = () => { + throw new Error('persistent fault') + } + const logSpy = vi.spyOn(console, 'log').mockImplementation(() => {}) + + // Drive enough failing alarms to exceed the give-up threshold. + for (let i = 0; i < 12; i++) await gameRoom.alarm() + + // The room halts deterministically instead of spinning forever. + expect((gameRoom as unknown as { game: GameState }).game.status).toBe('game_over') + const gaveUp = logSpy.mock.calls.some( + (c) => typeof c[0] === 'string' && c[0].includes('"event":"alarm_error_giving_up"'), + ) + expect(gaveUp).toBe(true) + logSpy.mockRestore() + }) + }) + it('schedules room cleanup when last player leaves', async () => { const { gameRoom, ctx } = await createInitializedGameRoom() const ws = createMockWebSocket() diff --git a/worker/src/GameRoom.ts b/worker/src/GameRoom.ts index 426f1e1..4cbca3c 100644 --- a/worker/src/GameRoom.ts +++ b/worker/src/GameRoom.ts @@ -45,7 +45,8 @@ function debugLog(tag: string, data: Record): void { console.log(tag, data) } import { createDefaultGameState, migrateGameState } from '../../shared/state-defaults' -import type { Env } from './env' +import type { Env, MatchmakerStub } from './env' +import type { MatchmakerLogContext } from './Matchmaker' import { logEvent } from './logger' export type { Env } @@ -56,6 +57,13 @@ export type { Env } * the constant in index.ts. */ const REQUEST_ID_HEADER = 'x-vaders-request-id' +const REGION_HEADER = 'x-vaders-region' + +/** Extract the originating edge region (colo) threaded from the Worker entry on + * the WS-upgrade request, if present. */ +function getRegion(request?: Request): string | undefined { + return request?.headers.get(REGION_HEADER) ?? undefined +} /** * Extract the requestId from the inbound request header, or generate one if @@ -85,6 +93,21 @@ const RATE_LIMIT_WINDOW_MS = 1000 const RATE_LIMIT_MAX_MESSAGES = 60 const UNAUTHENTICATED_SOCKET_TIMEOUT_MS = 5000 +// App-level heartbeat. The client sends exactly JSON.stringify({type:'ping'}). +// The runtime auto-responds with the pong below WITHOUT waking the DO, so idle +// lobbies hibernate through keepalives (Cloudflare: "Ping/pong handling does +// not interrupt hibernation"). serverTime is omitted — no client reads it. +const PING_REQUEST_BODY = JSON.stringify({ type: 'ping' }) +const PONG_RESPONSE_BODY = JSON.stringify({ type: 'pong' }) +// Liveness window mirroring the client watchdog (PING_INTERVAL + PONG_TIMEOUT). +const HEARTBEAT_LIVENESS_MS = 35000 + +// alarm() error boundary: a faulting alarm is caught and re-armed deliberately +// (backed off) rather than left to Cloudflare's blind retry. After this many +// consecutive failures the room gives up and ends the game (bounded failure). +const ALARM_ERROR_BACKOFF_MS = 1000 +const ALARM_MAX_CONSECUTIVE_ERRORS = 10 + // Per-connection rate limiting state (not serialized into attachment — lives in memory only) interface RateLimitState { count: number @@ -154,23 +177,44 @@ export class GameRoom extends DurableObject { // a stable requestId. This is a "contextual" field rather than passing // requestId through every private method signature, which would be noisy. private currentRequestId: string | null = null + // Originating edge region (colo), threaded in on the WS-upgrade request and + // reused for subsequent ws-message / alarm logs from this DO instance. The + // DO runs in its own isolate, so this is the only way its logs can carry the + // user's region — the Worker-isolate global the logger used to read is never + // visible here. + private currentRegion: string | null = null + // Consecutive alarm() failures (in-memory; resets on success or eviction). + private alarmErrorCount = 0 /** * Emit a wide-event log line with the room-scoped envelope: roomCode, - * requestId, and whatever caller-supplied fields. All meaningful state - * changes inside the DO should go through this so logs are consistent. + * requestId, region, and whatever caller-supplied fields. All meaningful + * state changes inside the DO should go through this so logs are consistent. */ private log(eventName: string, data: Record = {}): void { logEvent(eventName, { roomCode: this.game?.roomCode, requestId: this.currentRequestId ?? undefined, + region: this.currentRegion ?? undefined, ...data, }) } + /** Correlation context for RPC calls into the Matchmaker. */ + private logContext(): MatchmakerLogContext { + return { requestId: this.currentRequestId ?? undefined, region: this.currentRegion ?? undefined } + } + constructor(ctx: DurableObjectState, env: Env) { super(ctx, env) + // Auto-respond to the app-level heartbeat in the runtime so idle lobbies + // stay hibernated through pings instead of waking the DO every 30s per + // connection. Re-set on every constructor run (i.e. every wake/eviction). + if (typeof WebSocketRequestResponsePair !== 'undefined') { + ctx.setWebSocketAutoResponse(new WebSocketRequestResponsePair(PING_REQUEST_BODY, PONG_RESPONSE_BODY)) + } + // Load state from SQLite on wake (hibernation-aware) ctx.blockConcurrencyWhile(async () => { // Initialize SQLite schema if needed @@ -280,13 +324,33 @@ export class GameRoom extends DurableObject { return session.player_id } + /** + * Schedule a wake no later than `when`, without clobbering a sooner pending + * alarm. A Durable Object has a SINGLE alarm and setAlarm() overwrites it — + * the game tick (33ms), countdown (1s), cleanup (5min), and the + * unauthenticated-socket timeout (5s) all share it. Soft deadlines must + * never push out a sooner hard-cadence alarm (notably the game tick), or the + * loop freezes until the later alarm fires. This is Cloudflare's documented + * "Multiple Events (Single Alarm)" pattern: min-merge via getAlarm(). + */ + private async scheduleWakeNoLaterThan(when: number): Promise { + const current = await this.ctx.storage.getAlarm() + if (current === null || current === undefined || when < current) { + await this.ctx.storage.setAlarm(when) + } + } + private async ensureUnauthenticatedSocketAlarm() { const hasUnauthenticatedSocket = this.ctx.getWebSockets().some((ws) => { const attachment = ws.deserializeAttachment() as WebSocketAttachment | null return !attachment?.playerId }) if (!hasUnauthenticatedSocket) return - await this.ctx.storage.setAlarm(Date.now() + UNAUTHENTICATED_SOCKET_TIMEOUT_MS) + // Min-merge: a reconnect mid-match must not push the pending game-tick + // alarm out to +5s (which froze the room). closeStaleUnauthenticatedSockets + // runs at the top of every alarm() anyway, so a sooner tick alarm still + // reaps the unauthenticated socket promptly. + await this.scheduleWakeNoLaterThan(Date.now() + UNAUTHENTICATED_SOCKET_TIMEOUT_MS) } private closeStaleUnauthenticatedSockets(now = Date.now()): number { @@ -309,9 +373,12 @@ export class GameRoom extends DurableObject { * HTTP fetch handler for non-WebSocket requests */ async fetch(request: Request): Promise { - // Capture the per-request id (threaded from Worker entry via header) so - // any logEvent() call reached during this fetch carries it. + // Capture the per-request id + region (threaded from Worker entry via + // headers) so any logEvent() call reached during this fetch carries them. + // The WS upgrade carries the region; keep the prior value for header-less + // internal fetches so a DO instance remembers its connections' region. this.currentRequestId = getRequestId(request) + this.currentRegion = getRegion(request) ?? this.currentRegion const url = new URL(request.url) // POST /init - Initialize room with code @@ -600,6 +667,13 @@ export class GameRoom extends DurableObject { } case 'ready': { + // Ready is only meaningful in the lobby. Without this guard the + // handler bypassed the state machine: a scripted client could send + // `ready` mid-match and, once every live player's id was collected, + // checkStartConditions() would fire a countdown from `playing` and + // startGame() would wipe the live match. See state-machine.pbt and + // GameRoom.test "ignores ready during active play". + if (this.game.status !== 'waiting') break if (playerId && this.game.players[playerId] && !this.game.readyPlayerIds.includes(playerId)) { this.game.readyPlayerIds.push(playerId) debugLog('[READY]', { @@ -625,6 +699,10 @@ export class GameRoom extends DurableObject { } case 'unready': { + // Unready only matters in the lobby (`waiting`) or to cancel a + // pending `countdown`. Ignore it during active play / game_over so + // it can't churn state or broadcasts outside the lobby flow. + if (this.game.status !== 'waiting' && this.game.status !== 'countdown') break if (playerId && this.game.players[playerId]) { const wasReady = this.game.readyPlayerIds.includes(playerId) this.game.readyPlayerIds = this.game.readyPlayerIds.filter((id) => id !== playerId) @@ -768,6 +846,10 @@ export class GameRoom extends DurableObject { private async checkStartConditions() { if (!this.game) return + // Defense in depth: a coop countdown may only be kicked off from the + // lobby. Mirrors the status guard on the `ready` handler so no caller can + // start a countdown (and thus a match-resetting startGame) mid-game. + if (this.game.status !== 'waiting') return const playerCount = Object.keys(this.game.players).length const readyCount = this.game.readyPlayerIds.length const willStart = playerCount >= 2 && readyCount === playerCount @@ -824,34 +906,33 @@ export class GameRoom extends DurableObject { this.persistState() } - private internalRequest(url: string, init?: RequestInit): Request { - const headers = new Headers(init?.headers) - if (this.currentRequestId) headers.set(REQUEST_ID_HEADER, this.currentRequestId) - return new Request(url, { ...init, headers }) + private matchmakerStub(): MatchmakerStub { + return this.env.MATCHMAKER.get(this.env.MATCHMAKER.idFromName('global')) as unknown as MatchmakerStub } private async updateRoomRegistry() { if (!this.game) return - const matchmaker = this.env.MATCHMAKER.get(this.env.MATCHMAKER.idFromName('global')) - await matchmaker.fetch( - this.internalRequest('https://internal/register', { - method: 'POST', - body: JSON.stringify({ - roomCode: this.game.roomCode, - playerCount: Object.keys(this.game.players).length, - status: this.game.status, - }), - }), + await this.matchmakerStub().register( + this.game.roomCode, + Object.keys(this.game.players).length, + this.game.status, + this.logContext(), ) } private fireAndForget(label: string, task: Promise): void { - task.catch((err) => { + const guarded = task.catch((err) => { this.log('async_task_failed', { task: label, message: err instanceof Error ? err.message : String(err), }) }) + // Extend the DO's lifetime until the background task settles. Without this, + // an eviction between the handler returning and the task completing would + // silently drop it — e.g. a registry update lost, reintroducing + // matchmaker/reality drift. ctx.waitUntil is absent in some test harnesses; + // degrade gracefully there. + this.ctx.waitUntil?.(guarded) } private async startGame() { @@ -922,7 +1003,41 @@ export class GameRoom extends DurableObject { // Alarm-driven wakes don't carry an HTTP request, so mint a requestId // for any logEvent() reached during this alarm pass. this.currentRequestId = getRequestId() + try { + await this.runAlarm() + this.alarmErrorCount = 0 + } catch (err) { + this.alarmErrorCount++ + this.log('alarm_error', { + message: err instanceof Error ? err.message : String(err), + stack: err instanceof Error ? err.stack : undefined, + status: this.game?.status, + tick: this.game?.tick, + consecutiveErrors: this.alarmErrorCount, + }) + const active = + !!this.game && ['playing', 'wipe_exit', 'wipe_hold', 'wipe_reveal', 'countdown'].includes(this.game.status) + if (active && this.alarmErrorCount < ALARM_MAX_CONSECUTIVE_ERRORS) { + // Re-arm deliberately with a backoff instead of letting Cloudflare + // retry the alarm blindly at the original 30Hz cadence (an error storm). + await this.ctx.storage.setAlarm(Date.now() + ALARM_ERROR_BACKOFF_MS) + } else if (active) { + // Bounded failure: a persistently faulting room halts rather than + // spinning forever. + this.log('alarm_error_giving_up', { consecutiveErrors: this.alarmErrorCount, status: this.game?.status }) + try { + this.endGame('defeat') + } catch {} + } + } + } + /** + * The actual alarm work: game tick, countdown ticks, and empty-room cleanup. + * Wrapped by alarm() in an error boundary so a fault can't trigger a platform + * retry storm against poisoned state. + */ + private async runAlarm() { const closedUnauthenticated = this.closeStaleUnauthenticatedSockets() if (closedUnauthenticated > 0) { this.log('ws_unauth_timeout', { closedCount: closedUnauthenticated }) @@ -995,6 +1110,24 @@ export class GameRoom extends DurableObject { // Threshold 2400 ticks = 80s at 30Hz ≈ 2 × (PING_INTERVAL + PONG_TIMEOUT) // so we allow two missed pings before treating a player as gone. const IDLE_STALE_TICKS = 2400 + + // 0a. Reconcile liveness from auto-responded pings. With + // setWebSocketAutoResponse the runtime answers heartbeat pings while the DO + // sleeps, so they never reach webSocketMessage and can't bump lastActiveTick + // directly. The runtime stamps each socket's last auto-response, which still + // proves the connection is alive — fold that in before reaping so an idle + // player who is only pinging (no input) is not wrongly reaped. + const nowMs = Date.now() + for (const ws of this.ctx.getWebSockets()) { + const att = ws.deserializeAttachment() as WebSocketAttachment | null + const pid = att?.playerId + if (!pid || !this.game.players[pid]) continue + const lastPing = this.ctx.getWebSocketAutoResponseTimestamp(ws) + if (lastPing && nowMs - lastPing.getTime() < HEARTBEAT_LIVENESS_MS) { + this.game.players[pid].lastActiveTick = this.game.tick + } + } + let reaped = 0 for (const id of Object.keys(this.game.players)) { const p = this.game.players[id] @@ -1243,16 +1376,15 @@ export class GameRoom extends DurableObject { private async cleanup() { if (this.game) { - const matchmaker = this.env.MATCHMAKER.get(this.env.MATCHMAKER.idFromName('global')) - await matchmaker.fetch( - this.internalRequest('https://internal/unregister', { - method: 'POST', - body: JSON.stringify({ roomCode: this.game.roomCode }), - }), - ) + await this.matchmakerStub().unregister(this.game.roomCode, this.logContext()) } await this.ctx.storage.deleteAlarm() this.ctx.storage.sql.exec('DELETE FROM game_state') + // Drop the room's rejoin tokens too. They were only ever deleted on + // expiry/consume, so a cleaned-up room left its token rows in SQLite + // forever (a per-room leak across the fleet). The table itself is + // recreated by the constructor's CREATE TABLE IF NOT EXISTS on next use. + this.ctx.storage.sql.exec('DELETE FROM rejoin_sessions') this.game = null } } diff --git a/worker/src/Matchmaker.test.ts b/worker/src/Matchmaker.test.ts index 0d32c4e..d0da679 100644 --- a/worker/src/Matchmaker.test.ts +++ b/worker/src/Matchmaker.test.ts @@ -2,7 +2,7 @@ // Unit tests for the Matchmaker Durable Object import { describe, it, expect, vi } from 'vitest' -import { Matchmaker } from './Matchmaker' +import { Matchmaker, MAX_TRACKED_ROOMS } from './Matchmaker' // ============================================================================ // Mock DurableObjectState @@ -590,3 +590,78 @@ describe('concurrent operations', () => { expect(Object.keys(storedRooms).length).toBe(3) }) }) + +// ============================================================================ +// Registry bounding (storage-ceiling safety) +// ============================================================================ + +describe('registry bounding', () => { + const FRESH = () => Date.now() + const STALE = () => Date.now() - 6 * 60 * 1000 // older than the 5-min stale threshold + + it('sweeps a created-but-never-joined room from the registry once it goes stale', async () => { + // The big leak path: POST /room registers a room with playerCount 0. If + // nobody ever joins, it is never in openRooms and the old /find loop (which + // only iterated openRooms) never pruned it — so it lived in storage forever. + const { matchmaker, mockState } = await createMatchmaker({ + GHOST1: { playerCount: 0, status: 'waiting', updatedAt: STALE() }, + } as any) + + await matchmaker.fetch(createRequest('GET', '/find')) + + const rooms = (mockState._storage.get('rooms') ?? {}) as Record + expect(rooms.GHOST1).toBeUndefined() + expect(Object.keys(rooms)).toHaveLength(0) + }) + + it('does NOT sweep a recently-updated room (only stale ones are removed)', async () => { + const { matchmaker, mockState } = await createMatchmaker({ + LIVE01: { playerCount: 0, status: 'waiting', updatedAt: FRESH() }, + } as any) + + await matchmaker.fetch(createRequest('GET', '/find')) + + const rooms = (mockState._storage.get('rooms') ?? {}) as Record + expect(rooms.LIVE01).toBeDefined() + }) + + it('drops a game_over room from the registry on register (terminal games are not tracked)', async () => { + const { matchmaker, mockState } = await createMatchmaker({ + DONE01: { playerCount: 2, status: 'playing', updatedAt: FRESH() }, + } as any) + + const res = await matchmaker.fetch( + createRequest('POST', '/register', { roomCode: 'DONE01', playerCount: 2, status: 'game_over' }), + ) + expect(res.status).toBe(200) + + const rooms = (mockState._storage.get('rooms') ?? {}) as Record + expect(rooms.DONE01).toBeUndefined() + }) + + it('rejects a NEW room registration when the registry is at capacity', async () => { + // Structural guard against the single-value storage ceiling: even a burst + // of room creation cannot grow the registry past MAX_TRACKED_ROOMS. + const rooms: Record = {} + for (let i = 0; i < MAX_TRACKED_ROOMS; i++) { + const code = `R${i.toString().padStart(5, '0')}` + // Fresh + playing so the at-capacity sweep cannot free any slot. + rooms[code] = { playerCount: 1, status: 'playing', updatedAt: FRESH(), lastStatusChangeAt: FRESH() } + } + const { matchmaker, mockState } = await createMatchmaker(rooms as any) + + const rejected = await matchmaker.fetch( + createRequest('POST', '/register', { roomCode: 'NEWBIE', playerCount: 0, status: 'waiting' }), + ) + expect(rejected.status).toBe(503) + const stored = mockState._storage.get('rooms') as Record + expect(stored.NEWBIE).toBeUndefined() + expect(Object.keys(stored).length).toBe(MAX_TRACKED_ROOMS) + + // An EXISTING room can still update even at capacity (not a new entry). + const updated = await matchmaker.fetch( + createRequest('POST', '/register', { roomCode: 'R00000', playerCount: 2, status: 'waiting' }), + ) + expect(updated.status).toBe(200) + }) +}) diff --git a/worker/src/Matchmaker.ts b/worker/src/Matchmaker.ts index 2798e99..ffc1fc4 100644 --- a/worker/src/Matchmaker.ts +++ b/worker/src/Matchmaker.ts @@ -1,18 +1,57 @@ // worker/src/Matchmaker.ts -// Matchmaker Durable Object - in-memory room registry +// Matchmaker Durable Object - in-memory room registry. +// +// Exposes typed RPC methods (register / unregister / find / getRoomInfo) as the +// primary API — callers (the Worker entry and each GameRoom) invoke these +// directly on the stub, with no manual HTTP routing or JSON parsing. A thin +// `fetch` adapter is retained only so the existing unit tests and PBT harnesses +// can drive the same logic over a Request; production never uses it. -import type { DurableObjectState } from '@cloudflare/workers-types' +import { DurableObject } from 'cloudflare:workers' import { logEvent } from './logger' const REQUEST_ID_HEADER = 'x-vaders-request-id' +/** A room untouched for this long is presumed dead and swept from the registry. */ +const STALE_THRESHOLD = 5 * 60 * 1000 // 5 minutes +/** A `waiting` room that hasn't made status progress for this long is phantom-trapped. */ +const PROGRESS_STALE_THRESHOLD = 10 * 60 * 1000 // 10 minutes + +/** + * Hard ceiling on tracked rooms. The registry persists as a SINGLE storage + * value (`put('rooms', …)`). This DO is KV-backed (migration `new_classes`, + * async `storage.get/put`), so that value is subject to Cloudflare's **128 KiB + * KV value limit** — NOT the 2 MB SQLite row limit. Each room record is + * ~130 bytes worst case, so 500 rooms (~65 KB) stays well under 128 KiB with + * margin. The cap makes storage structurally bounded even under a burst of + * room creation — past this, new-room registration is refused (the Worker + * surfaces a 503) rather than letting the value grow until `put` throws and + * matchmaking breaks for everyone. + * + * Longer term, the registry should move to per-room SQLite rows (one + * `INSERT OR REPLACE` per room) to remove both the value-size ceiling and the + * whole-blob rewrite on every register. That needs a storage-backend + * migration; the cap + sweep is the correct bounded fix until then. + * See docs/TODO.md "Matchmaker scaling". + */ +export const MAX_TRACKED_ROOMS = 500 + +/** Correlation context threaded from the caller for wide-event logging. */ +export interface MatchmakerLogContext { + requestId?: string + region?: string +} + +/** Result of a register attempt — `matchmaker_full` when the cap is hit. */ +export type RegisterResult = { ok: true } | { ok: false; code: 'matchmaker_full' } + function getRequestId(request?: Request): string | undefined { return request?.headers.get(REQUEST_ID_HEADER) ?? undefined } // lastStatusChangeAt is the timestamp of the most recent status // transition (waiting → countdown, waiting → playing, etc.). Unlike -// updatedAt — which refreshes on every /register, including pure +// updatedAt — which refreshes on every register, including pure // playerCount churn — it only moves when productive progress happens. // A room that stays in `waiting` forever (because phantoms trap each // new victim in an endless "0/N ready" cycle) will have a frozen @@ -24,14 +63,15 @@ type RoomInfo = { lastStatusChangeAt: number } -export class Matchmaker { +export class Matchmaker extends DurableObject { private rooms: Record = {} private openRooms: Set = new Set() - constructor(private state: DurableObjectState) { + constructor(ctx: DurableObjectState, env: unknown) { + super(ctx, env as never) // Restore from storage on cold start - this.state.blockConcurrencyWhile(async () => { - const stored = await this.state.storage.get>('rooms') + this.ctx.blockConcurrencyWhile(async () => { + const stored = await this.ctx.storage.get>('rooms') let totalRooms = 0 let openRooms = 0 if (stored) { @@ -39,12 +79,11 @@ export class Matchmaker { // Rebuild openRooms set. Require playerCount > 0 so stranded // empty rooms (creator abandoned, room never unregistered) // don't become matchmaking targets and trap later joiners. - // PBT finding — see state-machine.pbt.test.ts FOUND BUG (LOW). for (const [roomCode, info] of Object.entries(stored)) { totalRooms++ - // Backfill lastStatusChangeAt on old persisted records — - // added alongside Option C. Use updatedAt as a conservative - // proxy; legacy entries aren't aged out immediately. + // Backfill lastStatusChangeAt on old persisted records — use + // updatedAt as a conservative proxy so legacy entries aren't aged + // out immediately. if (typeof info.lastStatusChangeAt !== 'number') { info.lastStatusChangeAt = info.updatedAt ?? Date.now() } @@ -54,10 +93,6 @@ export class Matchmaker { } } } - // Wide event on DO cold-start rehydration so we can see how the - // matchmaker warmed up — useful for diagnosing "nobody found my - // room" cases where storage hydration might have stripped stale - // entries. logEvent('mm_rehydrate', { totalRoomsStored: totalRooms, openRoomsRebuilt: openRooms, @@ -65,197 +100,269 @@ export class Matchmaker { }) } - async fetch(request: Request): Promise { - const url = new URL(request.url) - const requestId = getRequestId(request) - - // POST /register - Room registers/updates itself - if (url.pathname === '/register' && request.method === 'POST') { - const { roomCode, playerCount, status } = (await request.json()) as { - roomCode: string - playerCount: number - status: string - } - const wasOpen = this.openRooms.has(roomCode) - const prev = this.rooms[roomCode] - const statusChanged = !prev || prev.status !== status - const now = Date.now() - this.rooms[roomCode] = { - playerCount, - status, - updatedAt: now, - // Refresh only on status transitions (Option C). Plain - // playerCount churn — the signature of phantom-trapped rooms - // where new victims join/leave without ever readying — must - // NOT refresh this, or the progress-stale prune can't fire. - lastStatusChangeAt: statusChanged ? now : prev.lastStatusChangeAt, - } - - // Update openRooms set. Require playerCount > 0 to avoid - // returning empty rooms from /find — a player who creates a - // room then abandons it before anyone joins would otherwise - // strand the next matchmaker alone in a dead room for the - // full STALE_THRESHOLD (5 min). Once someone joins, - // playerCount > 0 and the room becomes matchable. - const nowOpen = status === 'waiting' && playerCount > 0 && playerCount < 4 - if (nowOpen) { - this.openRooms.add(roomCode) - } else { + /** + * Remove every room whose last update is older than STALE_THRESHOLD from + * both the registry and the open-rooms set. This is the structural bound on + * registry size: any room that stops registering — created-and-abandoned, + * crashed, or evicted without cleanup — disappears within STALE_THRESHOLD of + * its last touch, instead of living in the single-value `rooms` blob forever. + * Does not persist; callers persist once after their own writes. + */ + private sweepStaleRooms(now: number): number { + let swept = 0 + for (const [roomCode, info] of Object.entries(this.rooms)) { + if (now - info.updatedAt > STALE_THRESHOLD) { + delete this.rooms[roomCode] this.openRooms.delete(roomCode) + swept++ } + } + return swept + } - await this.state.storage.put('rooms', this.rooms) - // Wide event on every registration. Includes the transition so - // diagnostic queries like "which rooms flipped open ↔ closed - // around t?" are one filter. openRoomsCount is the post-update - // size of the matchable pool. + // ─── RPC methods (primary API) ──────────────────────────────────────────── + + /** Register or update a room. Returns `matchmaker_full` if the cap is hit. */ + async register( + roomCode: string, + playerCount: number, + status: string, + log: MatchmakerLogContext = {}, + ): Promise { + const { requestId, region } = log + const now = Date.now() + + // Terminal games are not matchable and only consume space. Treat a + // game_over registration as an unregister so finished rooms don't linger + // in the registry (one of the unbounded-growth paths). + if (status === 'game_over') { + const wasKnown = roomCode in this.rooms + delete this.rooms[roomCode] + this.openRooms.delete(roomCode) + await this.ctx.storage.put('rooms', this.rooms) logEvent('mm_register', { requestId, + region, roomCode, playerCount, status, - openTransition: wasOpen === nowOpen ? 'no-change' : wasOpen ? 'opened→closed' : 'closed→opened', + openTransition: wasKnown ? 'closed→closed' : 'no-change', openRoomsCount: this.openRooms.size, }) - return new Response('OK') + return { ok: true } } - // POST /unregister - Room removes itself - if (url.pathname === '/unregister' && request.method === 'POST') { - const { roomCode } = (await request.json()) as { roomCode: string } - const wasKnown = roomCode in this.rooms - delete this.rooms[roomCode] + // Capacity guard: never let the single-value registry grow past the cap. + // For a brand-new room, try to free space by sweeping stale entries first; + // if still full, refuse so `put('rooms', …)` can't exceed the 128 KiB limit. + const isNewRoom = !(roomCode in this.rooms) + if (isNewRoom && Object.keys(this.rooms).length >= MAX_TRACKED_ROOMS) { + this.sweepStaleRooms(now) + if (Object.keys(this.rooms).length >= MAX_TRACKED_ROOMS) { + await this.ctx.storage.put('rooms', this.rooms) + logEvent('mm_register_rejected_at_capacity', { + requestId, + region, + roomCode, + trackedRooms: Object.keys(this.rooms).length, + cap: MAX_TRACKED_ROOMS, + }) + return { ok: false, code: 'matchmaker_full' } + } + } + + const wasOpen = this.openRooms.has(roomCode) + const prev = this.rooms[roomCode] + const statusChanged = !prev || prev.status !== status + this.rooms[roomCode] = { + playerCount, + status, + updatedAt: now, + // Refresh only on status transitions (Option C). Plain playerCount churn + // — the signature of phantom-trapped rooms where new victims join/leave + // without ever readying — must NOT refresh this, or the progress-stale + // prune can't fire. + lastStatusChangeAt: statusChanged ? now : prev.lastStatusChangeAt, + } + + // Require playerCount > 0 to avoid returning empty rooms from find() — a + // creator who abandons before anyone joins would otherwise strand the next + // matchmaker alone for the full STALE_THRESHOLD. + const nowOpen = status === 'waiting' && playerCount > 0 && playerCount < 4 + if (nowOpen) { + this.openRooms.add(roomCode) + } else { this.openRooms.delete(roomCode) - await this.state.storage.put('rooms', this.rooms) - logEvent('mm_unregister', { - requestId, - roomCode, - wasKnown, - openRoomsCount: this.openRooms.size, - }) - return new Response('OK') } - // GET /find - Find an open room. - // - // Read-through verification: openRooms is maintained on every - // /register, but it's possible (for instance, on cold-start race with - // an in-flight update, or a register/unregister out-of-order) for - // the set to contain a room whose current info no longer satisfies - // the open criteria. Rather than trust set-membership, we re-verify - // status + playerCount against this.rooms before returning the - // roomCode. Stale entries are pruned on-the-fly. - if (url.pathname === '/find') { - const STALE_THRESHOLD = 5 * 60 * 1000 // 5 minutes - // Option C: a room that's been in `waiting` for more than this - // threshold without any status transition is presumed stuck - // (e.g. phantom-trapped: new victims cycle through without - // readying, keeping updatedAt fresh but lastStatusChangeAt - // frozen). Force-prune it from the matchmaker so the next - // matchmaker sees a fresh pool. - const PROGRESS_STALE_THRESHOLD = 10 * 60 * 1000 // 10 minutes - const now = Date.now() - - // Track pruning reasons so the wide event can explain WHY /find - // returned null in any given call. Helpful for the "I matchmaked - // but ended up alone" report — shows whether the pool was empty, - // populated-but-stale, or populated-but-all-filtered. - const scanned = this.openRooms.size - let prunedMissing = 0 - let prunedStale = 0 - let prunedProgressStale = 0 - let prunedFiltered = 0 - - for (const roomCode of this.openRooms) { - const info = this.rooms[roomCode] - if (!info) { - this.openRooms.delete(roomCode) - prunedMissing++ - continue - } - if (now - info.updatedAt > STALE_THRESHOLD) { - delete this.rooms[roomCode] - this.openRooms.delete(roomCode) - prunedStale++ - continue - } - // Progress-stale: the room IS active (updatedAt recent) but - // hasn't made any status progress for >10 min. Phantom-trapped - // rooms look exactly like this. Prune the registry entry - // entirely — the GameRoom DO remains, but matchmakers stop - // sending new victims. Next DO wake will fire Option A's - // reconciliation, cleaning the phantoms. Meanwhile, the next - // matchmaker gets a fresh room rather than joining the trap. - if (info.status === 'waiting' && now - info.lastStatusChangeAt > PROGRESS_STALE_THRESHOLD) { - logEvent('mm_prune_stale_by_progress', { - requestId, - roomCode, - playerCount: info.playerCount, - status: info.status, - msSinceStatusChange: now - info.lastStatusChangeAt, - msSinceLastUpdate: now - info.updatedAt, - progressThresholdMs: PROGRESS_STALE_THRESHOLD, - }) - delete this.rooms[roomCode] - this.openRooms.delete(roomCode) - prunedProgressStale++ - continue - } - // Read-through guard — defends against `openRooms` drifting out - // of sync with `rooms`. A room that started in `waiting` and - // flipped to `countdown`/`playing` via a re-register that came - // in between set updates would otherwise be briefly findable. - if (info.status !== 'waiting' || info.playerCount <= 0 || info.playerCount >= 4) { - this.openRooms.delete(roomCode) - prunedFiltered++ - continue - } - logEvent('mm_find_result', { + await this.ctx.storage.put('rooms', this.rooms) + logEvent('mm_register', { + requestId, + region, + roomCode, + playerCount, + status, + openTransition: wasOpen === nowOpen ? 'no-change' : wasOpen ? 'opened→closed' : 'closed→opened', + openRoomsCount: this.openRooms.size, + }) + return { ok: true } + } + + /** Remove a room from the registry. */ + async unregister(roomCode: string, log: MatchmakerLogContext = {}): Promise { + const wasKnown = roomCode in this.rooms + delete this.rooms[roomCode] + this.openRooms.delete(roomCode) + await this.ctx.storage.put('rooms', this.rooms) + logEvent('mm_unregister', { + requestId: log.requestId, + region: log.region, + roomCode, + wasKnown, + openRoomsCount: this.openRooms.size, + }) + } + + /** Find an open room to join, or null if none. Sweeps stale entries first. */ + async find(log: MatchmakerLogContext = {}): Promise { + const { requestId, region } = log + const now = Date.now() + + // Bound the registry: sweep EVERY room whose last update is older than + // STALE_THRESHOLD, not just the ones in openRooms. Created-but-never-joined + // rooms (playerCount 0) and any other entry that fell out of openRooms + // previously lived forever because the loop below only scans openRooms. + // Sweeping all of this.rooms here makes the registry size O(rooms active in + // the last 5 minutes). Persist immediately so the sweep is durable even + // when find() returns a hit early. + if (this.sweepStaleRooms(now) > 0) { + await this.ctx.storage.put('rooms', this.rooms) + } + + const scanned = this.openRooms.size + let prunedMissing = 0 + let prunedStale = 0 + let prunedProgressStale = 0 + let prunedFiltered = 0 + + for (const roomCode of this.openRooms) { + const info = this.rooms[roomCode] + if (!info) { + this.openRooms.delete(roomCode) + prunedMissing++ + continue + } + if (now - info.updatedAt > STALE_THRESHOLD) { + delete this.rooms[roomCode] + this.openRooms.delete(roomCode) + prunedStale++ + continue + } + // Progress-stale: the room IS active (updatedAt recent) but hasn't made + // any status progress for >10 min. Phantom-trapped rooms look exactly + // like this. Prune the registry entry so matchmakers stop sending victims. + if (info.status === 'waiting' && now - info.lastStatusChangeAt > PROGRESS_STALE_THRESHOLD) { + logEvent('mm_prune_stale_by_progress', { requestId, - result: 'hit', + region, roomCode, playerCount: info.playerCount, status: info.status, - openRoomsScanned: scanned, - prunedMissing, - prunedStale, - prunedProgressStale, - prunedFiltered, - }) - return new Response(JSON.stringify({ roomCode }), { - headers: { 'Content-Type': 'application/json' }, + msSinceStatusChange: now - info.lastStatusChangeAt, + msSinceLastUpdate: now - info.updatedAt, + progressThresholdMs: PROGRESS_STALE_THRESHOLD, }) + delete this.rooms[roomCode] + this.openRooms.delete(roomCode) + prunedProgressStale++ + continue + } + // Read-through guard — defends against openRooms drifting out of sync. + if (info.status !== 'waiting' || info.playerCount <= 0 || info.playerCount >= 4) { + this.openRooms.delete(roomCode) + prunedFiltered++ + continue } - - await this.state.storage.put('rooms', this.rooms) - logEvent('mm_find_result', { requestId, - result: 'miss', - roomCode: null, + region, + result: 'hit', + roomCode, + playerCount: info.playerCount, + status: info.status, openRoomsScanned: scanned, prunedMissing, prunedStale, prunedProgressStale, prunedFiltered, - openRoomsRemaining: this.openRooms.size, - }) - return new Response(JSON.stringify({ roomCode: null }), { - headers: { 'Content-Type': 'application/json' }, }) + return roomCode + } + + await this.ctx.storage.put('rooms', this.rooms) + logEvent('mm_find_result', { + requestId, + region, + result: 'miss', + roomCode: null, + openRoomsScanned: scanned, + prunedMissing, + prunedStale, + prunedProgressStale, + prunedFiltered, + openRoomsRemaining: this.openRooms.size, + }) + return null + } + + /** Look up a single room's registry entry, or null if unknown. */ + async getRoomInfo(roomCode: string): Promise<({ roomCode: string } & RoomInfo) | null> { + const info = this.rooms[roomCode] + if (!info) return null + return { roomCode, ...info } + } + + // ─── fetch adapter (test/harness compatibility only) ────────────────────── + // Thin shim delegating to the RPC methods above so existing tests that drive + // the Matchmaker over a Request keep working. Production code uses RPC. + + async fetch(request: Request): Promise { + const url = new URL(request.url) + const log: MatchmakerLogContext = { requestId: getRequestId(request) } + const json = { 'Content-Type': 'application/json' } + + if (url.pathname === '/register' && request.method === 'POST') { + const { roomCode, playerCount, status } = (await request.json()) as { + roomCode: string + playerCount: number + status: string + } + const result = await this.register(roomCode, playerCount, status, log) + if (!result.ok) { + return new Response(JSON.stringify({ code: result.code, message: 'Too many active rooms' }), { + status: 503, + headers: json, + }) + } + return new Response('OK') + } + + if (url.pathname === '/unregister' && request.method === 'POST') { + const { roomCode } = (await request.json()) as { roomCode: string } + await this.unregister(roomCode, log) + return new Response('OK') + } + + if (url.pathname === '/find') { + const roomCode = await this.find(log) + return new Response(JSON.stringify({ roomCode }), { headers: json }) } - // GET /info/:roomCode - Get room info const infoMatch = url.pathname.match(/^\/info\/([A-Z0-9]{6})$/) if (infoMatch) { - const roomCode = infoMatch[1] - const info = this.rooms[roomCode] - if (!info) { - return new Response('Not found', { status: 404 }) - } - return new Response(JSON.stringify({ roomCode, ...info }), { - headers: { 'Content-Type': 'application/json' }, - }) + const info = await this.getRoomInfo(infoMatch[1]) + if (!info) return new Response('Not found', { status: 404 }) + return new Response(JSON.stringify(info), { headers: json }) } return new Response('Not found', { status: 404 }) diff --git a/worker/src/env.ts b/worker/src/env.ts index 48259f7..557cb14 100644 --- a/worker/src/env.ts +++ b/worker/src/env.ts @@ -1,6 +1,18 @@ // worker/src/env.ts // Shared Env interface for Cloudflare Worker bindings +import type { Matchmaker } from './Matchmaker' + +/** + * The RPC surface of the Matchmaker DO as seen by callers. The binding's stub + * is cast to this at call sites: `env.MATCHMAKER.get(id) as unknown as + * MatchmakerStub`. (We can't type the namespace as + * `DurableObjectNamespace` directly because the worker package + * typechecks against the hand-rolled `cloudflare:workers` mock, whose + * DurableObject base isn't RPC-branded.) + */ +export type MatchmakerStub = Pick + export interface Env { GAME_ROOM: DurableObjectNamespace MATCHMAKER: DurableObjectNamespace diff --git a/worker/src/game/reducer.test.ts b/worker/src/game/reducer.test.ts index d7961d8..702044b 100644 --- a/worker/src/game/reducer.test.ts +++ b/worker/src/game/reducer.test.ts @@ -6,6 +6,7 @@ import { gameReducer, canTransition } from './reducer' import type { GameState, Player, BarrierEntity } from '../../../shared/types' import { LAYOUT, + STANDARD_WIDTH, DEFAULT_CONFIG, WIPE_TIMING, HITBOX, @@ -126,13 +127,21 @@ describe('canTransition', () => { }) describe('from game_over status', () => { - it('returns false for all actions (terminal state)', () => { + it('is terminal for gameplay actions (no progression out of game_over)', () => { expect(canTransition('game_over', 'TICK')).toBe(false) expect(canTransition('game_over', 'PLAYER_JOIN')).toBe(false) - expect(canTransition('game_over', 'PLAYER_LEAVE')).toBe(false) expect(canTransition('game_over', 'PLAYER_INPUT')).toBe(false) expect(canTransition('game_over', 'START_SOLO')).toBe(false) }) + + it('allows PLAYER_LEAVE so disconnecting players are removed and the room can drain', () => { + // Regression: game_over previously blocked PLAYER_LEAVE, so a player who + // disconnected at the game-over screen was never removed. playerCount + // never reached 0, cleanup() never ran, and the room + its matchmaker + // registry entry leaked forever (the same disease as the phantom-player + // bug, in its third home). Leaving must always be permitted. + expect(canTransition('game_over', 'PLAYER_LEAVE')).toBe(true) + }) }) }) @@ -215,6 +224,23 @@ describe('PLAYER_LEAVE action', () => { expect(Object.keys(result.state.players).length).toBe(0) }) + it('removes a player who leaves at the game_over screen (room can drain)', () => { + // Behavioural companion to the canTransition gate: leaving from game_over + // must actually remove the player and emit player_left, so the room can + // reach playerCount 0 and be cleaned up instead of leaking. + const { state, players } = createTestGameStateWithPlayers(2, { status: 'game_over' }) + const leaving = players[0].id + const remaining = players[1].id + + const result = gameReducer(state, { type: 'PLAYER_LEAVE', playerId: leaving }) + + expect(result.state.players[leaving]).toBeUndefined() + expect(result.state.players[remaining]).toBeDefined() + expect(Object.keys(result.state.players).length).toBe(1) + expect(result.events.some((e) => e.type === 'event' && e.name === 'player_left')).toBe(true) + expect(result.persist).toBe(true) + }) + it('removes player from readyPlayerIds', () => { const { state, player } = createTestGameStateWithPlayer({ id: 'p1' }) state.readyPlayerIds = ['p1'] @@ -3593,7 +3619,7 @@ describe('boundary conditions', () => { expect(result.state.players[player.id].x).toBe(LAYOUT.PLAYER_MIN_X) }) - it('player at x=PLAYER_MAX_X (112) trying to move right stays at PLAYER_MAX_X', () => { + it('player at x=PLAYER_MAX_X (116) trying to move right stays at PLAYER_MAX_X', () => { const { state, players } = createTestPlayingState(1) const player = players[0] player.x = LAYOUT.PLAYER_MAX_X @@ -3969,24 +3995,29 @@ describe('boundary conditions', () => { expect(result.state.players[player.id].alive).toBe(false) }) - it('player hitbox at PLAYER_MAX_X does not extend past screen width', () => { - // Player at x=PLAYER_MAX_X (114) - // Hitbox: [114-2, 114+3) = [112, 117) - // x=117 would be out of hitbox + it('player hitbox at PLAYER_MAX_X reaches the right wall without extending past it', () => { + // With symmetric bounds, x=PLAYER_MAX_X (116) puts the sprite CENTER such + // that its right edge sits on the last column (119). Hitbox is + // [116-3, 116+3+1) = [113, 120): it covers the final on-screen column and + // stops exactly at the screen width — no cell beyond it. The old test + // relied on a 4-column dead zone to the right (the bug being fixed), so + // there is no longer an on-screen column "just outside" on the right. const { state, players } = createTestPlayingState(1) const player = players[0] - player.x = LAYOUT.PLAYER_MAX_X // x=114 + player.x = LAYOUT.PLAYER_MAX_X // x=116 player.lives = 3 state.players[player.id] = player - // Alien bullet just outside hitbox - const alienBullet = createTestBullet('ab1', 117, LAYOUT.PLAYER_Y, null, 1) - state.entities = [alienBullet, createTestAlien('a1', 20, 5)] + // A bullet in the last on-screen column hits the player defending the wall. + const atWall = createTestBullet('ab1', STANDARD_WIDTH - 1, LAYOUT.PLAYER_Y, null, 1) + state.entities = [atWall, createTestAlien('a1', 20, 5)] const result = gameReducer(state, { type: 'TICK' }) - // checkPlayerHit: 117 >= 112 && 117 < 117 -> false (117 not < 117) - expect(result.state.players[player.id].alive).toBe(true) // Misses + // checkPlayerHit: 119 >= 113 && 119 < 120 -> true -> hit + expect(result.state.players[player.id].alive).toBe(false) + // The hitbox upper bound is exactly the screen width — it does not extend past it. + expect(LAYOUT.PLAYER_MAX_X + HITBOX.PLAYER_HALF_WIDTH + 1).toBe(STANDARD_WIDTH) }) it('UFO off-screen removal uses generous bounds (x < -3 or x > width + 3)', () => { diff --git a/worker/src/game/reducer.ts b/worker/src/game/reducer.ts index 722bc49..50e3145 100644 --- a/worker/src/game/reducer.ts +++ b/worker/src/game/reducer.ts @@ -101,7 +101,12 @@ const TRANSITIONS: Record { +async function generateUniqueRoomCode(matchmaker: MatchmakerStub): Promise { for (let attempt = 0; attempt < MAX_ROOM_GENERATION_ATTEMPTS; attempt++) { const roomCode = generateRoomCode() - const check = await matchmaker.fetch(new Request(`https://internal/info/${roomCode}`)) - if (check.status === 404) { + const existing = await matchmaker.getRoomInfo(roomCode) + if (existing === null) { return roomCode } } return null } +type CreateRoomResult = { ok: true } | { ok: false; status: number; body: unknown } + /** * Create and initialize a new game room. - * Returns the room code or null on failure. + * + * Both downstream calls are checked: a `register` rejection (the matchmaker is + * at capacity) must propagate so the caller can return an honest error instead + * of handing the client a roomCode for a room the registry never accepted, and + * a failed GameRoom init must be compensated by un-registering the room. */ -async function createRoom(env: Env, matchmaker: DurableObjectStub, roomCode: string): Promise { - const id = env.GAME_ROOM.idFromName(roomCode) - const stub = env.GAME_ROOM.get(id) +async function createRoom( + env: Env, + matchmaker: MatchmakerStub, + roomCode: string, + log: MatchmakerLogContext, +): Promise { + // Register FIRST. If the matchmaker is at capacity it rejects here, before we + // initialize a GameRoom Durable Object — otherwise a full matchmaker would + // leave behind orphan initialized rooms (a GameRoom with state but no + // registry entry) on every rejected create. A playerCount-0 `waiting` room + // is never returned by find(), so registering before init is safe. + const registered = await matchmaker.register(roomCode, 0, 'waiting', log) + if (!registered.ok) { + return { ok: false, status: 503, body: { code: registered.code, message: 'Too many active rooms' } } + } - await stub.fetch( + // GameRoom keeps a fetch handler (it serves WebSocket upgrades, which RPC + // cannot), so init goes over fetch. + const stub = env.GAME_ROOM.get(env.GAME_ROOM.idFromName(roomCode)) + const initRes = await stub.fetch( new Request('https://internal/init', { method: 'POST', body: JSON.stringify({ roomCode }), }), ) + if (!initRes.ok) { + // Compensate: roll back the registry entry so we never advertise a room + // whose GameRoom failed to initialize. + await Promise.resolve(matchmaker.unregister(roomCode, log)).catch(() => {}) + return { ok: false, status: 503, body: { code: 'room_init_failed', message: 'Could not initialize room' } } + } - await matchmaker.fetch( - new Request('https://internal/register', { - method: 'POST', - body: JSON.stringify({ roomCode, playerCount: 0, status: 'waiting' }), - }), - ) + return { ok: true } } export default { @@ -96,11 +124,13 @@ export default { // Worker entry log to the matching DO-side logs. const requestId = crypto.randomUUID() - // Capture the Cloudflare colo (region) from request.cf if present so - // subsequent logEvent() calls within this request can include it. - // request.cf is undefined in tests/Node; we guard for that. - const colo = (request as Request & { cf?: { colo?: string } }).cf?.colo - ;(globalThis as { CF_REGION?: string | undefined }).CF_REGION = colo ?? undefined + // Capture the Cloudflare colo (region) from request.cf and thread it + // EXPLICITLY into every logEvent and DO call below. Previously this was + // stashed in a module-level global that (a) is never visible inside a DO's + // own isolate, so DO logs had no region, and (b) is clobbered across + // concurrent requests in one isolate. request.cf is undefined in tests/Node. + const region = (request as Request & { cf?: { colo?: string } }).cf?.colo + const log: MatchmakerLogContext = { requestId, region } // CORS headers for all responses const corsHeaders = { @@ -116,6 +146,7 @@ export default { method: request.method, path: url.pathname, requestId, + region, }) // Handle CORS preflight @@ -125,11 +156,11 @@ export default { // POST /room - Create new room if (url.pathname === '/room' && request.method === 'POST') { - const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) + const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) as unknown as MatchmakerStub const roomCode = await generateUniqueRoomCode(matchmaker) if (!roomCode) { - logEvent('http_room_create', { requestId, outcome: 'generation_failed' }) + logEvent('http_room_create', { requestId, region, outcome: 'generation_failed' }) return new Response( JSON.stringify({ code: 'room_generation_failed', @@ -142,9 +173,16 @@ export default { ) } - await createRoom(env, matchmaker, roomCode) + const created = await createRoom(env, matchmaker, roomCode, log) + if (!created.ok) { + logEvent('http_room_create', { requestId, region, outcome: 'create_failed', roomCode }) + return new Response(JSON.stringify(created.body), { + status: created.status, + headers: { 'Content-Type': 'application/json', ...corsHeaders }, + }) + } - logEvent('http_room_create', { requestId, outcome: 'created', roomCode }) + logEvent('http_room_create', { requestId, region, outcome: 'created', roomCode }) return new Response(JSON.stringify({ roomCode }), { headers: { 'Content-Type': 'application/json', ...corsHeaders }, }) @@ -164,18 +202,18 @@ export default { // upgrade handshake; this line records that a client reached // the edge with the intent to connect. Pair via requestId with // any downstream room_join / game_in_progress / ws_error events. - logEvent('ws_upgrade_attempt', { requestId, roomCode }) + logEvent('ws_upgrade_attempt', { requestId, region, roomCode }) const id = env.GAME_ROOM.idFromName(roomCode) const stub = env.GAME_ROOM.get(id) - // Thread requestId to the DO so its logs correlate with this request. - return stub.fetch(withRequestId(request, requestId)) + // Thread requestId + region to the DO so its logs correlate with this + // request and carry the originating edge region. + return stub.fetch(withCorrelation(request, requestId, region)) } // GET /matchmake - Find or create open room if (url.pathname === '/matchmake') { - const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) - const result = await matchmaker.fetch(new Request('https://internal/find')) - const { roomCode: existingRoom } = (await result.json()) as { roomCode: string | null } + const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) as unknown as MatchmakerStub + const existingRoom = await matchmaker.find(log) if (existingRoom) { // Wide event: matchmaker returned an existing room. Pair with @@ -184,6 +222,7 @@ export default { // by requestId. logEvent('http_matchmake', { requestId, + region, outcome: 'joined_existing', roomCode: existingRoom, }) @@ -198,6 +237,7 @@ export default { if (!newRoomCode) { logEvent('http_matchmake', { requestId, + region, outcome: 'generation_failed', }) return new Response( @@ -212,13 +252,21 @@ export default { ) } - await createRoom(env, matchmaker, newRoomCode) + const createdForMatch = await createRoom(env, matchmaker, newRoomCode, log) + if (!createdForMatch.ok) { + logEvent('http_matchmake', { requestId, region, outcome: 'create_failed', roomCode: newRoomCode }) + return new Response(JSON.stringify(createdForMatch.body), { + status: createdForMatch.status, + headers: { 'Content-Type': 'application/json', ...corsHeaders }, + }) + } // "created_fresh" is the interesting case for diagnosis — this is // the path a "stranded matchmaker" hits (see docs/TODO.md). Pair // with the Lobby-side UX that shows them they seeded a new room. logEvent('http_matchmake', { requestId, + region, outcome: 'created_fresh', roomCode: newRoomCode, }) @@ -231,18 +279,17 @@ export default { const infoMatch = url.pathname.match(/^\/room\/([A-Z0-9]{6})$/) if (infoMatch) { const roomCode = infoMatch[1] - const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) - const result = await matchmaker.fetch(new Request(`https://internal/info/${roomCode}`)) + const matchmaker = env.MATCHMAKER.get(env.MATCHMAKER.idFromName('global')) as unknown as MatchmakerStub + const info = await matchmaker.getRoomInfo(roomCode) - if (result.status === 404) { + if (info === null) { return new Response(JSON.stringify({ error: 'Room not found' }), { status: 404, headers: { 'Content-Type': 'application/json', ...corsHeaders }, }) } - const data = await result.json() - return new Response(JSON.stringify(data), { + return new Response(JSON.stringify(info), { headers: { 'Content-Type': 'application/json', ...corsHeaders }, }) } diff --git a/worker/src/integration.test.ts b/worker/src/integration.test.ts index 0a77db5..368cf57 100644 --- a/worker/src/integration.test.ts +++ b/worker/src/integration.test.ts @@ -72,6 +72,7 @@ function createMockDurableObjectContext() { setAlarm: vi.fn(async (time: number) => { alarm = time }), + getAlarm: vi.fn(async () => alarm), deleteAlarm: vi.fn(async () => { alarm = null }), @@ -85,6 +86,8 @@ function createMockDurableObjectContext() { webSockets.push(ws) }), getWebSockets: vi.fn(() => webSockets), + setWebSocketAutoResponse: vi.fn(), + getWebSocketAutoResponseTimestamp: vi.fn((_ws: unknown): Date | null => null), _sqlData: sqlData, _webSockets: webSockets, _alarm: () => alarm, @@ -114,6 +117,35 @@ function createMockMatchmakerState() { } } +// env.MATCHMAKER binding that delegates the RPC surface to a real Matchmaker +// instance (and keeps fetch for any direct-fetch helpers). GameRoom/Worker call +// register/unregister/find/getRoomInfo over RPC now, not fetch. +function matchmakerBindingFor(matchmaker: Matchmaker) { + return { + idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), + get: vi.fn(() => ({ + fetch: (r: Request) => matchmaker.fetch(r), + register: matchmaker.register.bind(matchmaker), + unregister: matchmaker.unregister.bind(matchmaker), + find: matchmaker.find.bind(matchmaker), + getRoomInfo: matchmaker.getRoomInfo.bind(matchmaker), + })), + } +} + +// No-op RPC binding for scenarios that don't wire a real matchmaker. +function noopMatchmakerBinding() { + return { + idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), + get: vi.fn(() => ({ + register: vi.fn(async () => ({ ok: true })), + unregister: vi.fn(async () => {}), + find: vi.fn(async () => null), + getRoomInfo: vi.fn(async () => null), + })), + } +} + // Helper to extract messages of a specific type from WebSocket send calls function getMessages(ws: MockWebSocket, type?: string): ServerMessage[] { return ws.send.mock.calls @@ -183,10 +215,7 @@ describe('Integration: Player Creates Room, Another Player Joins', () => { idFromName: vi.fn((name: string) => ({ toString: () => name })), get: vi.fn(), } as any, - MATCHMAKER: { - idFromName: vi.fn((name: string) => ({ toString: () => `matchmaker-${name}` })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), - } as any, + MATCHMAKER: matchmakerBindingFor(matchmaker) as any, } gameRoom = new GameRoom(gameRoomCtx as any, env) @@ -490,17 +519,13 @@ describe('Integration: Two Players Invoke Matchmaking', () => { } const ctx = createMockDurableObjectContext() - const matchmakerFetch = vi.fn(async (request: Request) => matchmaker.fetch(request)) const env: Env = { GAME_ROOM: { idFromName: vi.fn((name: string) => ({ toString: () => name })), get: vi.fn(), } as any, - MATCHMAKER: { - idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), - } as any, + MATCHMAKER: matchmakerBindingFor(matchmaker) as any, } const room = new GameRoom(ctx as any, env) @@ -719,17 +744,13 @@ describe('Integration: Complete 4-Player Game Flow', () => { const roomCode = 'FULL4P' const ctx = createMockDurableObjectContext() - const matchmakerFetch = vi.fn(async (request: Request) => matchmaker.fetch(request)) const env: Env = { GAME_ROOM: { idFromName: vi.fn((name: string) => ({ toString: () => name })), get: vi.fn(), } as any, - MATCHMAKER: { - idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), - } as any, + MATCHMAKER: matchmakerBindingFor(matchmaker) as any, } const gameRoom = new GameRoom(ctx as any, env) @@ -823,17 +844,13 @@ describe('Integration: Edge Cases', () => { beforeEach(async () => { ctx = createMockDurableObjectContext() - const matchmakerFetch = vi.fn(async () => new Response('OK')) const env: Env = { GAME_ROOM: { idFromName: vi.fn((name: string) => ({ toString: () => name })), get: vi.fn(), } as any, - MATCHMAKER: { - idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), - } as any, + MATCHMAKER: noopMatchmakerBinding() as any, } gameRoom = new GameRoom(ctx as any, env) @@ -972,7 +989,10 @@ describe('Worker: HTTP Endpoints', () => { MATCHMAKER: { idFromName: vi.fn((name: string) => ({ toString: () => `matchmaker-${name}` })), get: vi.fn(() => ({ - fetch: vi.fn(async () => new Response(JSON.stringify({ roomCode: null }))), + register: vi.fn(async () => ({ ok: true })), + unregister: vi.fn(async () => {}), + find: vi.fn(async () => null), + getRoomInfo: vi.fn(async () => null), })), } as any, } @@ -1030,7 +1050,7 @@ describe('Worker: HTTP Endpoints', () => { it('returns 404 for non-existent room', async () => { const env = createMockEnv() ;(env.MATCHMAKER.get as Mock).mockReturnValue({ - fetch: vi.fn(async () => new Response('Not found', { status: 404 })), + getRoomInfo: vi.fn(async () => null), }) const request = new Request('http://localhost/room/NOROOM') @@ -1043,6 +1063,49 @@ describe('Worker: HTTP Endpoints', () => { }) }) + describe('Room creation robustness', () => { + it('returns 503 (not a phantom room) when the matchmaker is at capacity', async () => { + // createRoom used to ignore the /register response, so POST /room would + // return 200 with a roomCode for a room the registry rejected — a room + // nobody could matchmake into. The Worker must surface the failure. + const env = createMockEnv() + ;(env.MATCHMAKER.get as Mock).mockReturnValue({ + getRoomInfo: vi.fn(async () => null), // generated code is unique + register: vi.fn(async () => ({ ok: false, code: 'matchmaker_full' })), + }) + + const response = await worker.fetch(new Request('http://localhost/room', { method: 'POST' }), env) + + expect(response.status).toBe(503) + const data = (await response.json()) as { code: string; roomCode?: string } + expect(data.code).toBe('matchmaker_full') + expect(data.roomCode).toBeUndefined() + }) + + it('returns 503 when room initialization conflicts (rolls back the registry, no broken room)', async () => { + const env = createMockEnv() + ;(env.GAME_ROOM.get as Mock).mockReturnValue({ + fetch: vi.fn(async () => new Response('Already initialized', { status: 409 })), + }) + const registerSpy = vi.fn(async () => ({ ok: true })) + const unregisterSpy = vi.fn(async () => {}) + ;(env.MATCHMAKER.get as Mock).mockReturnValue({ + getRoomInfo: vi.fn(async () => null), + register: registerSpy, + unregister: unregisterSpy, + }) + + const response = await worker.fetch(new Request('http://localhost/room', { method: 'POST' }), env) + + expect(response.status).toBe(503) + // Register-first ordering: the room WAS registered (via RPC), then the + // failed init must be compensated by an unregister so no broken room is + // advertised. + expect(registerSpy).toHaveBeenCalled() + expect(unregisterSpy).toHaveBeenCalled() + }) + }) + describe('404 handling', () => { it('returns 404 for unknown routes', async () => { const env = createMockEnv() diff --git a/worker/src/logger.test.ts b/worker/src/logger.test.ts index 3c74eeb..cc07384 100644 --- a/worker/src/logger.test.ts +++ b/worker/src/logger.test.ts @@ -82,19 +82,17 @@ describe('logEvent', () => { expect(parsed.roomCode).toBe('ABC123') }) - it('omits region when globalThis.CF_REGION is unset', () => { - // Default state (afterEach deletes CF_REGION) + it('omits region when the caller does not supply it', () => { logEvent('test_event', { roomCode: 'ABC123' }) const parsed = JSON.parse(consoleLogSpy.mock.calls[0][0] as string) expect('region' in parsed).toBe(false) }) - it('includes region from globalThis.CF_REGION when set (simulating request.cf?.colo)', () => { - // Simulate the HTTP middleware setting CF_REGION from request.cf.colo - ;(globalThis as { CF_REGION?: string }).CF_REGION = 'LHR' - - logEvent('request_received', { method: 'POST', path: '/room' }) + it('includes region when the caller supplies it (threaded from request.cf?.colo)', () => { + // Region is now an explicit caller-supplied field — threaded from the + // Worker entry through RPC contexts / the WS-upgrade header — not a global. + logEvent('request_received', { method: 'POST', path: '/room', region: 'LHR' }) const parsed = JSON.parse(consoleLogSpy.mock.calls[0][0] as string) expect(parsed.region).toBe('LHR') diff --git a/worker/src/logger.ts b/worker/src/logger.ts index ef31e6e..f04e4e4 100644 --- a/worker/src/logger.ts +++ b/worker/src/logger.ts @@ -12,8 +12,12 @@ // in GameRoom.ts for per-message breadcrumbs. // - `undefined` fields are stripped before emit so JSON output stays clean // and queryable in log aggregators (undefined → missing key, not "null"). -// - Region is read lazily from `globalThis.CF_REGION` so a per-request middleware -// can set it from `request.cf?.colo` without threading the value everywhere. +// - `region` (the edge colo) is a caller-supplied field passed explicitly in +// `data`, threaded from the Worker entry's `request.cf?.colo` through RPC +// contexts and the WS-upgrade header. It is NOT read from a global: a global +// set in the Worker isolate is never visible inside a Durable Object's own +// isolate (so DO logs would lack region) and is clobbered across concurrent +// requests in one isolate. import { BUILD_INFO } from './buildInfo' @@ -24,23 +28,18 @@ import { BUILD_INFO } from './buildInfo' * - `event` — the eventName * - `version`, `commitHash`, `buildTime` — from ./buildInfo * - `timestamp` — ISO-8601 at emit time - * - `region` — from globalThis.CF_REGION (may be undefined → omitted) * - * Caller-supplied fields in `data` override nothing in the envelope; they are - * merged alongside. `undefined` values (from either envelope or data) are - * stripped before serialization so downstream log queries don't hit "null" - * surprises. + * Caller-supplied fields in `data` (including `region`, `roomCode`, `requestId`) + * are merged alongside. `undefined` values are stripped before serialization so + * downstream log queries don't hit "null" surprises. */ export function logEvent(eventName: string, data: Record): void { - const region = (globalThis as { CF_REGION?: string }).CF_REGION - const envelope: Record = { event: eventName, version: BUILD_INFO.version, commitHash: BUILD_INFO.commitHash, buildTime: BUILD_INFO.buildTime, timestamp: new Date().toISOString(), - region, ...data, } diff --git a/worker/src/mocks/cloudflare-workers.ts b/worker/src/mocks/cloudflare-workers.ts index 474644e..ef0adf2 100644 --- a/worker/src/mocks/cloudflare-workers.ts +++ b/worker/src/mocks/cloudflare-workers.ts @@ -16,6 +16,20 @@ export interface DurableObjectState { blockConcurrencyWhile(callback: () => Promise): Promise acceptWebSocket(ws: WebSocket): void getWebSockets(): WebSocket[] + setWebSocketAutoResponse(maybeReqResp?: unknown): void + getWebSocketAutoResponseTimestamp(ws: WebSocket): Date | null + waitUntil?(promise: Promise): void +} + +// Provide the WebSocketRequestResponsePair runtime global for the test +// environment — it exists natively in the Workers runtime but not in Node. +if (typeof (globalThis as Record).WebSocketRequestResponsePair === 'undefined') { + ;(globalThis as Record).WebSocketRequestResponsePair = class { + constructor( + public request: string, + public response: string, + ) {} + } } export interface DurableObjectStorage { @@ -26,6 +40,7 @@ export interface DurableObjectStorage { put(key: string, value: unknown): Promise delete(key: string): Promise setAlarm(time: number): Promise + getAlarm(): Promise deleteAlarm(): Promise } diff --git a/worker/src/room-creation.pbt.test.ts b/worker/src/room-creation.pbt.test.ts index e809bdb..847c410 100644 --- a/worker/src/room-creation.pbt.test.ts +++ b/worker/src/room-creation.pbt.test.ts @@ -83,6 +83,7 @@ function createMockDurableObjectContext() { setAlarm: vi.fn(async (time: number) => { alarm = time }), + getAlarm: vi.fn(async () => alarm), deleteAlarm: vi.fn(async () => { alarm = null }), @@ -94,6 +95,8 @@ function createMockDurableObjectContext() { webSockets.push(ws) }), getWebSockets: vi.fn(() => webSockets.filter((ws) => !ws._closed)), + setWebSocketAutoResponse: vi.fn(), + getWebSocketAutoResponseTimestamp: vi.fn((_ws: unknown): Date | null => null), _sqlData: sqlData, _webSockets: webSockets, _alarm: () => alarm, @@ -156,7 +159,13 @@ class WorkerHarness { } as unknown as Env['GAME_ROOM'], MATCHMAKER: { idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), + get: vi.fn(() => ({ + fetch: matchmakerFetch, + register: this.matchmaker.register.bind(this.matchmaker), + unregister: this.matchmaker.unregister.bind(this.matchmaker), + find: this.matchmaker.find.bind(this.matchmaker), + getRoomInfo: this.matchmaker.getRoomInfo.bind(this.matchmaker), + })), } as unknown as Env['MATCHMAKER'], // eslint-disable-next-line @typescript-eslint/no-explicit-any ASSETS: undefined as any, @@ -422,9 +431,14 @@ describe('PBT Room Creation: saturation safety', () => { ) } } else if (result.status === 503) { - if (result.code !== 'room_generation_failed') { + // Two legitimate saturation reasons: the code generator exhausted its + // attempts (room_generation_failed), or the matchmaker registry hit + // its MAX_TRACKED_ROOMS cap (matchmaker_full). Both are valid; what + // must never happen is a duplicate or an unexpected status. + const validSaturationCodes = ['room_generation_failed', 'matchmaker_full'] + if (!validSaturationCodes.includes(result.code as string)) { throw new Error( - `Saturated POST /room returned 503 with code=${result.code}, expected 'room_generation_failed'`, + `Saturated POST /room returned 503 with code=${result.code}, expected one of ${validSaturationCodes.join(' | ')}`, ) } } else { diff --git a/worker/src/room-joining.pbt.test.ts b/worker/src/room-joining.pbt.test.ts index 84ead0c..1ad8cbe 100644 --- a/worker/src/room-joining.pbt.test.ts +++ b/worker/src/room-joining.pbt.test.ts @@ -92,6 +92,7 @@ function createMockDurableObjectContext() { setAlarm: vi.fn(async (time: number) => { alarm = time }), + getAlarm: vi.fn(async () => alarm), deleteAlarm: vi.fn(async () => { alarm = null }), @@ -103,6 +104,8 @@ function createMockDurableObjectContext() { webSockets.push(ws) }), getWebSockets: vi.fn(() => webSockets.filter((ws) => !ws._closed)), + setWebSocketAutoResponse: vi.fn(), + getWebSocketAutoResponseTimestamp: vi.fn((_ws: unknown): Date | null => null), _sqlData: sqlData, _webSockets: webSockets, _alarm: () => alarm, @@ -165,7 +168,13 @@ class RoomJoiningHarness { } as unknown as Env['GAME_ROOM'], MATCHMAKER: { idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), + get: vi.fn(() => ({ + fetch: matchmakerFetch, + register: this.matchmaker.register.bind(this.matchmaker), + unregister: this.matchmaker.unregister.bind(this.matchmaker), + find: this.matchmaker.find.bind(this.matchmaker), + getRoomInfo: this.matchmaker.getRoomInfo.bind(this.matchmaker), + })), } as unknown as Env['MATCHMAKER'], // eslint-disable-next-line @typescript-eslint/no-explicit-any ASSETS: undefined as any, diff --git a/worker/src/state-machine.pbt.test.ts b/worker/src/state-machine.pbt.test.ts index abc779b..c6a910c 100644 --- a/worker/src/state-machine.pbt.test.ts +++ b/worker/src/state-machine.pbt.test.ts @@ -97,6 +97,7 @@ function createMockDurableObjectContext() { setAlarm: vi.fn(async (time: number) => { alarm = time }), + getAlarm: vi.fn(async () => alarm), deleteAlarm: vi.fn(async () => { alarm = null }), @@ -108,6 +109,8 @@ function createMockDurableObjectContext() { webSockets.push(ws) }), getWebSockets: vi.fn(() => webSockets.filter((ws) => !ws._closed)), + setWebSocketAutoResponse: vi.fn(), + getWebSocketAutoResponseTimestamp: vi.fn((_ws: unknown): Date | null => null), _sqlData: sqlData, _webSockets: webSockets, _alarm: () => alarm, @@ -198,7 +201,13 @@ class RealSystem { } as unknown as Env['GAME_ROOM'], MATCHMAKER: { idFromName: vi.fn(() => ({ toString: () => 'matchmaker-global' })), - get: vi.fn(() => ({ fetch: matchmakerFetch })), + get: vi.fn(() => ({ + fetch: matchmakerFetch, + register: this.matchmaker.register.bind(this.matchmaker), + unregister: this.matchmaker.unregister.bind(this.matchmaker), + find: this.matchmaker.find.bind(this.matchmaker), + getRoomInfo: this.matchmaker.getRoomInfo.bind(this.matchmaker), + })), } as unknown as Env['MATCHMAKER'], } @@ -1304,7 +1313,12 @@ function makePhantomEnv(): Env { GAME_ROOM: { idFromName: vi.fn(), get: vi.fn() } as unknown as Env['GAME_ROOM'], MATCHMAKER: { idFromName: vi.fn(), - get: vi.fn(() => ({ fetch: vi.fn(async () => new Response('OK')) })), + get: vi.fn(() => ({ + register: vi.fn(async () => ({ ok: true })), + unregister: vi.fn(async () => {}), + find: vi.fn(async () => null), + getRoomInfo: vi.fn(async () => null), + })), } as unknown as Env['MATCHMAKER'], } }