Real DuckDB Quack infrastructure for the "Five LLMs, One Browser" Werewolf post.
The post's default demo runs entirely in one browser tab, so it uses a
postMessage shim for the server side of Quack. This repo is the companion lab
for the real transport: native DuckDB processes serving Quack over HTTP, queried
by a gateway DuckDB client, with player-local agent actions and a small browser
runner.
- Config-driven native DuckDB player nodes, each running `quack_serve(...)`.
- One gateway container that queries the player nodes with real `quack_query(...)`.
- Generated player services from JSON config, plus a browser UI that can generate a 3 to 12 player roster and choose each role.
- Per-player private tables: `self`, `knowledge`, `suspicions`, `intents`, `votes`.
- Container-local agent actions through `container/agent-act.sh`, written into the running DuckDB process through a FIFO to avoid file-lock conflicts.
- Scripted, oMLX, OpenAI-compatible, and OpenAI provider modes. OpenAI defaults to `gpt-4o-mini`; oMLX discovers local models through `/v1/models`.
- Model output normalization before writes: phase actions are constrained, wolf targets are forced to valid non-partners, and wolf actions cannot publish text.
- Public views that expose only safe columns during play.
- A wolf-channel view that row-filters locally based on the player's own role.
- Optional post-game audit view through `POST_GAME=true`.
- Browser controls for `Play Game`, `Audit Log`, `Download JSON`, `Stop`, plus collapsible manual commands.
- A lightweight auto-play referee that starts the lab, runs discussion, vote, and wolf phases, tracks alive players, applies plurality eliminations, and declares village, wolves, or undecided after the max round limit.
- Server-side Quack authentication and authorization callbacks.
- Quack protocol logging from the gateway side.
This is now more than a transport smoke test, but it is still a lab. The current referee is intentionally small: it handles votes, wolf kills, alive-player filtering, and win checks, but it is not a complete Werewolf rules engine with rich role powers, persistent long-term agent memory, or production session security.
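The referee's two core decisions, plurality elimination and the win check, can be sketched in a few lines of TypeScript. This is a minimal illustration, not the lab's implementation: the `Player` shape, the tie rule (a tie eliminates nobody), and the parity win rule for wolves are all assumptions made for the example.

```typescript
// Hypothetical shapes; the lab's referee may track state differently.
type Role = "villager" | "wolf";
interface Player { id: string; role: Role; alive: boolean; }
type Outcome = "village" | "wolves" | "undecided";

// Return the single most-voted living target, or null on a tie or empty ballot.
// Treating a tie as "no elimination" is an assumption of this sketch.
function pluralityTarget(votes: Record<string, string>, players: Player[]): string | null {
  const alive = players.filter((p) => p.alive).map((p) => p.id);
  const tally: Record<string, number> = {};
  for (const voter of Object.keys(votes)) {
    const target = votes[voter];
    // Ignore ballots cast by or against players who are no longer alive.
    if (alive.indexOf(voter) === -1 || alive.indexOf(target) === -1) continue;
    tally[target] = (tally[target] || 0) + 1;
  }
  let best: string | null = null;
  let bestCount = 0;
  let tied = false;
  for (const target of Object.keys(tally)) {
    if (tally[target] > bestCount) {
      best = target;
      bestCount = tally[target];
      tied = false;
    } else if (tally[target] === bestCount) {
      tied = true;
    }
  }
  return tied ? null : best;
}

// Assumed rule: village wins when no wolves remain, wolves win at parity,
// and the game is "undecided" once the round limit is reached.
function checkWin(players: Player[], round: number, maxRounds: number): Outcome | null {
  const alive = players.filter((p) => p.alive);
  const wolves = alive.filter((p) => p.role === "wolf").length;
  const others = alive.length - wolves;
  if (wolves === 0) return "village";
  if (wolves >= others) return "wolves";
  return round >= maxRounds ? "undecided" : null;
}
```

A `null` result from `checkWin` means play continues into the next round.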
- Docker with Compose v2.
- `jq` on the host.
- Node 18 or newer for the browser runner.
- Network access during image build so Docker can download the DuckDB CLI and the Quack extension from DuckDB's extension repository.
Quack is currently a DuckDB 1.5 beta feature. The Dockerfile defaults to
v1.5.2; change DUCKDB_VERSION if the Quack extension requires a newer release.
The Compose file pins linux/amd64 because DuckDB's CLI release assets are most
predictable there. Docker Desktop will emulate it on Apple Silicon.
Run the deterministic Quack smoke test:
./bin/labctl smoke

The smoke test:
- generates `.generated/docker-compose.yml` from `config/game.sample.json`
- starts one gateway and one container per configured player
- asks every player container to take one day action
- asks every wolf container to take one wolf action
- verifies `whoami` returns one row per player node
- verifies `public_log` returns public statements and no rationale
- verifies `wolf_channel` queries every player, but only wolf nodes return rows
- verifies `post_game_intents` is closed while `POST_GAME=false`
- verifies `denied_private_table` fails because player authorization rejects direct access to the private `intents` table
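The `wolf_channel` check is the most distinctive of these: the gateway asks every node, and each node decides locally whether to return rows. The TypeScript below models that behaviour in memory as a rough sketch. The `NodeState` shape and function names are hypothetical; in the lab the predicate lives in a DuckDB view on each player node, not in application code.

```typescript
// Hypothetical in-memory stand-in for a player node.
interface NodeState {
  id: string;
  role: "villager" | "wolf";
  wolfMessages: string[]; // rows the local wolf_channel view could expose
}
interface ChannelRow { node: string; message: string; }

// Each node evaluates the role predicate against its own state,
// so a non-wolf node returns nothing regardless of what it stores.
function localWolfChannel(node: NodeState): ChannelRow[] {
  if (node.role !== "wolf") return [];
  return node.wolfMessages.map((message) => ({ node: node.id, message }));
}

// The gateway queries every node and concatenates whatever each one returns.
function fanOutWolfChannel(nodes: NodeState[]): ChannelRow[] {
  const rows: ChannelRow[] = [];
  for (const node of nodes) {
    for (const row of localWolfChannel(node)) rows.push(row);
  }
  return rows;
}

// Smoke-style assertion: every returned row must originate from a wolf node.
function onlyWolvesAnswered(nodes: NodeState[], rows: ChannelRow[]): boolean {
  const wolfIds = nodes.filter((n) => n.role === "wolf").map((n) => n.id);
  return rows.every((row) => wolfIds.indexOf(row.node) !== -1);
}
```

The point of the design is that the gateway never needs to know who the wolves are; the filter runs where the data lives.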
Start the browser runner:
make web

Then open http://localhost:5174.
During development, use the hot-reload runner:
make web-dev

It polls `bin`, `lib`, `container`, `eval`, `web`, `sql`, `config`, `Dockerfile`,
`docker-compose.yml`, and `Makefile`. When one of those files changes, it
restarts the Node web server and the browser reloads after the reconnect.
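Polling-based reload usually reduces to comparing two snapshots of modification times. The TypeScript below sketches only that comparison step. The `Snapshot` type and `changedPaths` name are hypothetical, and the actual runner may detect changes differently.

```typescript
// Map of watched path -> last modification time in milliseconds.
type Snapshot = Record<string, number>;

// Return every path that was added, removed, or modified between two polls.
function changedPaths(previous: Snapshot, current: Snapshot): string[] {
  const all = Object.keys(previous).concat(Object.keys(current));
  const unique = all.filter((path, index) => all.indexOf(path) === index);
  // A missing entry reads as undefined, so additions and removals also differ.
  return unique.filter((path) => previous[path] !== current[path]).sort();
}
```

A poll loop would take a fresh snapshot on each tick and restart the server whenever `changedPaths` returns a non-empty list.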
The browser runner is the easiest way to exercise the lab. It wraps the same
labctl commands and writes a runtime config to .generated/web-game.json.
Main controls:
- `Players`: choose 3 to 12 generated players and set each role.
- `Provider`: choose `Scripted`, `oMLX`, `Compatible`, or `OpenAI`.
- `Play Game`: runs the lightweight referee through discussion, vote, and wolf phases until village wins, wolves win, or the max round limit is reached.
- `Audit Log`: queries `full_log`, which returns private rationale only when post-game audit is enabled.
- `Download JSON`: exports nodes, public log, wolf channel, audit log, roster, provider, model, round, and export warnings.
- `Stop`: tears down the generated lab and removes volumes.
- `Manual steps`: exposes lower-level controls such as start, day, wolf, public log, wolf channel, denied scope, whoami, smoke, and one full manual round.
For OpenAI, set LLM_API_KEY in the shell that starts make web, select
OpenAI in the UI, and leave the API key field blank. The UI will default the
model to gpt-4o-mini.
For oMLX, start the local server on the host, choose oMLX, click Discover,
then run the game controls. Player containers reach the host through
http://host.docker.internal:8000/v1.
./bin/labctl generate
./bin/labctl up
./bin/labctl run-day
./bin/labctl run-wolf
./bin/labctl query public_log
./bin/labctl query wolf_channel
./bin/labctl query denied_private_table

Run a single player action:

./bin/labctl run-agent agent-a vote
./bin/labctl run-agent agent-d wolf

Clean up:

./bin/labctl down

To change the default CLI roster, edit `config/game.sample.json`. The generated
Compose file is not source controlled.
oMLX exposes an OpenAI-compatible API at
http://localhost:8000/v1, including /v1/chat/completions and /v1/models.
Because the player agents run inside Docker, they reach the host server through
host.docker.internal.
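The host-to-container address translation can be expressed as a small URL rewrite. The helper below is a hypothetical sketch of one way a runner could derive the container-side base URL from the host-side one; the lab may instead configure the two URLs separately.

```typescript
// Rewrite a loopback base URL so that a process inside a Docker container
// can reach a server listening on the host. Non-loopback URLs pass through.
function containerBaseUrl(hostUrl: string): string {
  const url = new URL(hostUrl);
  if (url.hostname === "localhost" || url.hostname === "127.0.0.1") {
    url.hostname = "host.docker.internal";
  }
  return url.toString();
}
```

For example, `containerBaseUrl("http://localhost:8000/v1")` yields `http://host.docker.internal:8000/v1`, matching the address the player containers use.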
The local oMLX path is the default engine for live evals. For repeatable local
development, `.env.example` documents:

OMLX_API_KEY=1234
OMLX_BASE_URL=http://localhost:8000/v1

`1234` is only a local development key for your own oMLX server. Do not reuse it
as a production secret.
Start oMLX on the host:
brew services start omlx

or run the CLI server directly:

omlx serve --model-dir ~/models

Verify that the server is reachable and has a model:

curl -sS -H "Authorization: Bearer ${OMLX_API_KEY}" http://localhost:8000/v1/models

Then run the canonical local oMLX smoke and daily eval:
make eval-omlx-smoke
make eval-mini

The make targets preflight `/v1/models` with `OMLX_API_KEY` before starting a
live run. The smoke target chooses the first model unless OMLX_MODEL is set,
generates a three-player config, asks each container-local agent to act through
oMLX, and then runs the same Quack gateway assertions as the deterministic
smoke. Profile eval targets such as make eval-mini, make eval-hot, and
make eval-all-omlx preflight the model configured in the selected profile.
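The model-selection rule in the preflight ("first model unless `OMLX_MODEL` is set") can be sketched as a pure function over the `/v1/models` response. This is an illustration only. The `ModelList` interface covers just the `data[].id` fields of an OpenAI-compatible response, `pickModel` is a hypothetical name, and rejecting an override that the server does not list is an assumption about how a preflight might behave.

```typescript
// Subset of an OpenAI-compatible GET /v1/models response used here.
interface ModelList {
  data: { id: string }[];
}

// Choose the model for a run: an explicit override wins, otherwise the
// first model the server reports. Throws when the preflight cannot pass.
function pickModel(list: ModelList, override?: string): string {
  const ids = list.data.map((model) => model.id);
  if (ids.length === 0) {
    throw new Error("preflight failed: server reports no models");
  }
  if (override) {
    // Assumption: a configured model that is not served fails the preflight.
    if (ids.indexOf(override) === -1) {
      throw new Error("preflight failed: model not served: " + override);
    }
    return override;
  }
  return ids[0];
}
```

In a live run the `list` argument would come from the same authenticated `/v1/models` request shown in the `curl` check, and `override` from the `OMLX_MODEL` environment variable.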
The model response is treated as a proposal, not as trusted SQL input. Before
container/agent-act.sh writes to DuckDB, it normalizes the action for the
current phase, retargets invalid wolf choices, and strips public text from
wolf-phase actions.
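The three normalization rules can be modelled as one function that turns an untrusted proposal into a safe action. The TypeScript below is a sketch under stated assumptions, not a port of `container/agent-act.sh`: the field names, the action kinds (`speak`, `vote`, `kill`), and the fallback of retargeting to the first valid candidate are all hypothetical.

```typescript
type Phase = "day" | "vote" | "wolf";
interface Proposal { action?: string; target?: string; text?: string; }
interface Context { phase: Phase; self: string; alive: string[]; wolves: string[]; }
interface Action { action: string; target: string | null; text: string | null; }

// Assumed mapping from phase to the single action kind allowed in it.
const ALLOWED: Record<Phase, string> = { day: "speak", vote: "vote", wolf: "kill" };

function normalizeAction(proposal: Proposal, ctx: Context): Action {
  // Rule 1: the phase decides the action; the model's own choice is ignored.
  const action = ALLOWED[ctx.phase];
  if (ctx.phase === "day") {
    return { action, target: null, text: (proposal.text || "").trim() };
  }
  // Rule 2: targets must be alive, not the actor, and (for wolves) not a partner.
  const excluded = ctx.phase === "wolf" ? ctx.wolves.concat([ctx.self]) : [ctx.self];
  const valid = ctx.alive.filter((id) => excluded.indexOf(id) === -1);
  const proposed = proposal.target;
  const target =
    proposed !== undefined && valid.indexOf(proposed) !== -1
      ? proposed
      : valid.length > 0 ? valid[0] : null; // assumed fallback: first valid candidate
  // Rule 3: wolf-phase actions never carry public text.
  const text = ctx.phase === "wolf" ? null : proposal.text || null;
  return { action, target, text };
}
```

Because the output is rebuilt field by field from validated values, nothing the model writes reaches DuckDB unchecked.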
| Browser post | Real Quack lab |
|---|---|
| Player Web Worker | Native DuckDB process in a container |
| Worker-owned DuckDB-WASM DB | Container-owned DuckDB database file |
| `postMessage` request/response | Quack HTTP protocol |
| JS token/policy shim | Quack auth/authz callbacks |
| Gateway worker fan-out | Gateway DuckDB running quack_query(...) |
| Browser orchestrator asking an agent to act | labctl run-agent, run-day, and run-wolf invoking container/agent-act.sh inside each player container |
| Browser game loop | Browser runner plus server-side lightweight referee |
| `wolf-team-read` row predicate | `wolf_channel` view checks local `self.role` |
The default post still matters because it runs for anyone in one browser tab. This lab is for readers who want the real distributed version.
labctl accepts these commands:
generate
up
down
run-agent <id> <day|vote|wolf>
run-day
run-wolf
query <whoami|public_log|wolf_channel|full_log|denied_private_table>
smoke
config
Useful make targets:
make web Start the browser runner on http://localhost:5174
make web-dev Start the browser runner with file watching and browser reload
make web-test Run the orchestrator unit checks (tests/lab-web.ts)
make eval-test Run the eval framework unit checks
make eval-run PROFILE=eval/profiles/<name>.json Run a batch eval profile
make eval-report Compare eval/runs into Markdown and JSON
make eval-matrix Run the promptfoo matrix (requires npm install + running web server)
make eval-matrix-node24 Run the promptfoo matrix through one-off Node 24
make eval-inspect-test Syntax-check the Inspect wrapper through uv
make eval-omlx-smoke Preflight local OMLX, then run the keyed smoke
make eval-large Run the 50-game omlx variance profile
make eval-mini Run the 5-game / 3-player omlx daily smoke
make eval-nothink Run the thinking_budget=0 omlx counterfactual
make eval-7p Run the 7-player omlx profile
make eval-hot Run the temperature=0.7 omlx variance probe
make eval-all-omlx Run every omlx profile back-to-back (short-circuits on gate failure)
make eval-anthropic Run the Claude Haiku 4.5 profile (needs ANTHROPIC_API_KEY)
make baseline-refresh Regenerate eval/baselines/fixtures.json
make baseline-check Verify the committed baseline matches the aggregator
make test Run agent, generator, web, eval, and real Quack smoke checks
make down Stop the generated lab
Research-grade eval details live in docs/research-eval-plan.md. The local
omlx path remains the default: run a profile with make eval-mini or
make eval-run PROFILE=..., then use eval/report.ts to compare scorecards.
Each run directory contains manifest.json, scorecard.json, gates.json, and
copied durable game logs.
Promptfoo is available as a comparison orchestrator with stable provider labels
for stub, omlx-mini, omlx-default, omlx-nothink, omlx-hot, and
omlx-7p. Use make eval-matrix with the installed Node runtime. If promptfoo's
SQLite dependency fails on a newer Node version, use make eval-matrix-node24
or npm run eval:matrix:node24.
full_log reads from post_game_intents. By default the player nodes start with
POST_GAME=false, so the view returns no rows. Set POST_GAME=true before
starting the nodes, or check Post-game audit in the browser runner, to expose
private rationale through that view. This is an explicit post-game audit surface,
not hidden chain-of-thought.
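The difference between the public view and the audit view comes down to a column projection and an environment gate. The sketch below models both in TypeScript. The `IntentRow` columns and function names are hypothetical, and treating only the exact string `"true"` as enabled is an assumption; in the lab these are DuckDB views on each player node.

```typescript
// Hypothetical row shape for a player's recorded intent.
interface IntentRow { player: string; statement: string; rationale: string; }

// Public view: project only the safe columns, so rationale never leaves.
function publicLog(rows: IntentRow[]): { player: string; statement: string }[] {
  return rows.map(({ player, statement }) => ({ player, statement }));
}

// Audit view: return full rows only when the node was started with POST_GAME=true.
function fullLog(rows: IntentRow[], env: Record<string, string | undefined>): IntentRow[] {
  return env.POST_GAME === "true" ? rows : [];
}
```

With `POST_GAME` unset or `false`, `fullLog` returns an empty list, which mirrors the view returning no rows.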
This is a local lab, not a production deployment.
- Quack binds inside a Docker network.
- Each player has its own token.
- Authentication is backed by a `quack_tokens` table.
- Authorization is a SQL macro that allowlists read-only queries against exposed views and rejects raw private tables.
- Agent actions write through a local FIFO into the DuckDB process already running inside the player container. That avoids DuckDB file-lock conflicts and keeps action writes local to the player node rather than granting the gateway a write path.
- API keys can be passed through environment variables or the browser form. For repeatable local testing, prefer setting `LLM_API_KEY` in the shell that starts the browser runner and leaving the UI field blank.
- There is no TLS in the lab. Put a reverse proxy in front of Quack for anything beyond local development.
The important property is architectural: each player server owns the private data, and policy is evaluated on that player's DuckDB side before rows leave the node.
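The allowlist idea behind the authorization callback can be modelled in a few lines. The TypeScript below is a deliberately strict sketch, not the lab's SQL macro: it accepts only a bare `SELECT <columns> FROM <view>` statement, the `EXPOSED_VIEWS` list is an assumption drawn from the view names mentioned in this README, and a real policy may permit more query shapes.

```typescript
// Assumed set of views a gateway may read; private tables are absent by design.
const EXPOSED_VIEWS = ["whoami", "public_log", "wolf_channel", "post_game_intents"];

// Accept only `SELECT col[, col...] FROM view` with plain column names or `*`.
// Anything else (writes, joins, subqueries, WHERE clauses) is rejected.
function authorize(sql: string): boolean {
  const text = sql.trim().replace(/;$/, "").trim().toLowerCase();
  const match = /^select\s+([a-z0-9_*,\s]+?)\s+from\s+([a-z_][a-z0-9_]*)$/.exec(text);
  if (match === null) return false;
  const columns = match[1].split(",").map((column) => column.trim());
  const plainColumns = columns.every(
    (column) => column === "*" || /^[a-z_][a-z0-9_]*$/.test(column)
  );
  return plainColumns && EXPOSED_VIEWS.indexOf(match[2]) !== -1;
}
```

Under this model, `authorize("SELECT * FROM public_log")` passes while `authorize("SELECT * FROM intents")` is rejected, which is the behaviour the `denied_private_table` smoke check expects from the real callback.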
bin/ user-callable entry points (labctl, web servers, smoke runner)
container/ scripts that run INSIDE Docker player / gateway containers
lib/ importable modules (orchestrator helpers, lab-span, mint-token,
generate-compose)
eval/ eval framework (aggregate.ts, run.ts, profiles/)
tests/ every test suite
sql/ DuckDB SQL fragments (player init, gateway init, scopes)
web/ browser runner static assets
docs/ architecture, roadmap, eval plan, implementation status
In the container image, the host directories container/ and lib/ are
copied to /app/container/ and /app/lib/ respectively. Callers
(labctl, gateway-query.sh, the generated compose) use those container
paths.
- `docs/architecture.md`: implementation architecture and current boundaries.
- `docs/roadmap.md`: completed work and next milestones.
- `docs/eval-plan.md`: historical eval framework design and metric taxonomy.
- `docs/research-eval-plan.md`: current research eval layer and workflow.
- `docs/implementation-status.md`: per-feature status, including the eval framework and the omlx + Qwen3.5 reasoning-mode notes.