Source: heuristic discovery from docker-compose.yml, app/main.py, app/core/config.py, HANDOFF.md, and .claude/rules/. Operational scope is intentionally small: this is a single-host portfolio system.
Symptoms: Pages mount but every TanStack Query hook stays in pending state.
Likely causes:
- VITE_API_BASE_URL in frontend/.env points at a LAN host the browser can't reach.
- Backend not running on :8123.
- CORS rejected the origin.
Diagnosis:
curl -s http://localhost:8123/health   # backend reachable?
grep VITE_API_BASE_URL frontend/.env   # what URL is the SPA calling?
Resolution:
- Edit frontend/.env → VITE_API_BASE_URL=http://localhost:8123.
- Restart Vite: cd frontend && ./node_modules/.bin/vite --host 0.0.0.0.
- If the browser console shows a CORS error, add the origin to allow_origins in app/main.py (the dev-only LAN regex already covers 10.x / 192.168.x / 172.16-31.x).
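The LAN ranges named in the last resolution step can be checked with a small pattern test before touching allow_origins. This is a minimal sketch: the name LAN_ORIGIN_RE and the exact expression are assumptions made for illustration, and the regex that actually lives in app/main.py may be written differently.

```python
import re

# Hypothetical pattern (assumption): http(s) origins on private IPv4 ranges,
# with an optional port. The real regex in app/main.py may differ.
LAN_ORIGIN_RE = re.compile(
    r"^https?://"
    r"(?:10\.\d{1,3}\.\d{1,3}\.\d{1,3}"
    r"|192\.168\.\d{1,3}\.\d{1,3}"
    r"|172\.(?:1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3})"
    r"(?::\d+)?$"
)


def is_lan_origin(origin: str) -> bool:
    """Return True if the origin is on 10.x, 192.168.x, or 172.16-31.x."""
    return LAN_ORIGIN_RE.fullmatch(origin) is not None
```

If is_lan_origin returns False for the origin shown in the browser console, that origin probably needs an explicit allow_origins entry.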
Symptoms: make docker-up exits non-zero, or docker compose ps shows one of postgres / backend / frontend as Health: unhealthy (or starting past the start_period).
Diagnosis:
docker compose ps --format json | python3 -c "import sys,json; [print(json.loads(l)['Service'],'=>',json.loads(l)['Health']) for l in sys.stdin if l.strip()]"
docker compose logs <unhealthy-svc> --tail 50
Resolution: one of the following, by failure mode.
- Backend logs show getaddrinfo failed on postgres. Postgres wasn't healthy when the backend started despite the depends_on: service_healthy gate (rare; usually a cold-start race after a docker system prune). Restart: docker compose restart backend. If it persists, run docker compose down && make docker-up from a clean state.
- curl http://localhost:8123/health from the host returns ECONNREFUSED, but docker compose ps says backend healthy. Port 8123 is held by a host-side process (a stale uv run uvicorn from an earlier session). Resolve: run lsof -iTCP:8123 -sTCP:LISTEN to find the PID and kill it, or stop the container's port publish and try again.
- Frontend reachable from the host, but curl http://frontend:5173 from the backend fails. This is expected: the browser is the consumer of the frontend, never the backend. No backend → frontend hop exists; if you have a feature that needs one, it belongs in a follow-up PRP (CORS + a container-DNS origin).
- Frontend page shows Loading... after make docker-up. A stale frontend/.env is overriding the container's environment: block with a LAN IP. Remove or reset the bind-mounted frontend/.env; the container ships VITE_API_BASE_URL=http://localhost:8123, which is the browser-correct value.
- make docker-up-gpu brought everything up but Ollama is unhealthy or nvidia-smi inside the container errors. The host is missing nvidia-container-runtime. Verify with docker info | grep -i runtime (must show nvidia) and nvidia-smi on the host. If it isn't installed, install the NVIDIA Container Toolkit per https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html, then run sudo systemctl restart docker and try again.
- make docker-up returns 0 but a service is still starting 60 s later. Inspect via docker compose logs <svc> --tail 100. The most common culprit is alembic taking longer than start_period: 30s to apply migrations on a slower host; bump the healthcheck start_period in docker-compose.yml if this is reproducible.
- Migration error inside backend logs after the schema changed. The backend container ran alembic upgrade head at entrypoint. If the migration is failing, check docker compose logs backend --tail 80 for the SQLAlchemy / alembic stack trace, fix the migration on a branch, rebuild the backend image (docker compose build backend), and bring the stack back up.
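The health one-liner in the diagnosis step can be expanded into a small helper that is easier to reuse. This is a sketch under stated assumptions: the function names are hypothetical, the Service and Health keys are the ones the one-liner already reads, and the array branch is there because some Compose releases emit a single JSON array rather than one object per line.

```python
import json


def parse_compose_health(raw: str) -> dict[str, str]:
    """Map service name -> health from `docker compose ps --format json` output.

    Accepts either one JSON object per line or a single JSON array.
    """
    text = raw.strip()
    if not text:
        return {}
    if text.startswith("["):
        rows = json.loads(text)
    else:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Assumption: an empty or missing Health value means no healthcheck is defined.
    return {row["Service"]: row.get("Health") or "none" for row in rows}


def unhealthy(health: dict[str, str]) -> list[str]:
    """Services whose health is neither healthy nor absent, sorted by name."""
    return sorted(svc for svc, h in health.items() if h not in ("healthy", "none"))
```

Feeding the command's stdout to parse_compose_health and then to unhealthy gives the list of services whose logs are worth reading first.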
Symptoms: asyncpg.exceptions.CannotConnectNowError or ConnectionRefusedError on first request.
Diagnosis:
docker compose ps # postgres container running?
docker compose logs postgres | tail -50
Resolution:
docker compose up -d # bring it up
uv run alembic upgrade head # apply migrations
uv run python scripts/check_db.py   # confirm connectivity
Symptoms: Integration tests pass on the dev host (which has stale seeded data) and fail in CI.
Diagnosis: Integration tests must be idempotent; they may not assume pre-existing rows.
Resolution:
docker compose down -v && docker compose up -d
uv run alembic upgrade head
uv run pytest -v -m integration
Symptoms: pnpm 11 depsStatusCheck reinstalls and blocks the esbuild postinstall script.
Workaround: run ./node_modules/.bin/vite --host 0.0.0.0 directly.
Permanent fix: add pnpm.onlyBuiltDependencies: ["esbuild"] to frontend/package.json.
Symptoms: app/core/tests/test_config.py::test_settings_has_defaults and a few agents/tests/test_config_validation.py cases fail when .env exists.
Root cause: Settings() reads .env via SettingsConfigDict(env_file=".env").
Fix: Use Settings(_env_file=None) in those tests to bypass .env.
Symptoms: python / tsc are reported as IntxLNK data blobs; uv run / tsc --noEmit fails with cannot execute binary file.
Resolution:
rm -rf .venv && uv sync --extra dev
rm -rf frontend/node_modules && corepack enable pnpm && cd frontend && pnpm install && pnpm rebuild esbuild
Symptoms: scripts/run_demo.py prints ❌ Step N/11: <name> -- ... and exits 1 (step failure) or 2 (precondition).
Diagnosis flow:
- Precheck failed (exit 2): backend isn't reachable on the URL the script is hitting. Check:
  curl -s http://localhost:8123/health   # should print {"status":"ok"}
  docker compose ps                      # confirm Postgres is up on :5433
  Fix: start uvicorn (uv run uvicorn app.main:app --port 8123) and/or docker compose up -d. The Makefile targets demo and demo-clean invoke docker compose up -d for you; demo-quick does not.
- Seed step failed: production-guard or scenario mismatch. The script POSTs demo_minimal to /seeder/generate; check app_env != "production" (or set seeder_allow_production=true if you really mean it). The scenario must exist in app/shared/seeder/config.py:ScenarioPreset (added by PRP-15 / issue #128).
- Features step failed: schema drift on ComputeFeaturesRequest. The script sends a minimal FeatureSetConfig with name="demo_featureset" + lag/rolling/calendar configs; if a recent change tightened a Field(strict=...) constraint, the failure surfaces here.
- Train step failed (one of three): the script trains naive / seasonal_naive / moving_average in parallel via asyncio.gather. Check the failing model's RFC 7807 body (echoed in the script output); the request_id correlates with the uvicorn logs.
- Backtest produced NaN WAPE: demo_minimal is tuned to avoid the SPARSE-style NaN trap (moderate noise_sigma=0.10, no sparsity). If you customized the scenario and now hit NaN, follow the app/shared/seeder/tests/test_phase1_regression.py pattern.
- Register step failed: most likely pending → success instead of the mandatory pending → running → success transition, or alias_name doesn't match the registry pattern (^[a-z0-9][a-z0-9\-_]*$). The script uses demo-production, which is compliant; only worry if you forked the script.
- Agent step showed ⏭️ but you expected ✅: OPENAI_API_KEY / ANTHROPIC_API_KEY not set in the environment the backend reads. Verify with grep -E '^(OPENAI|ANTHROPIC)_API_KEY=' .env (name only; never paste the value).
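For a forked script, the two register-step constraints from the flow above can be checked locally before sending anything to the registry. This is a sketch only: the regex is the pattern quoted above, while the helper names and the idea of passing the status sequence as a list are assumptions made for illustration.

```python
import re

# Registry alias pattern as quoted in the diagnosis flow.
ALIAS_RE = re.compile(r"^[a-z0-9][a-z0-9\-_]*$")

# Status order described as mandatory for a run that ends in success.
REQUIRED_PATH = ["pending", "running", "success"]


def is_valid_alias(alias: str) -> bool:
    """True if the alias satisfies the registry pattern."""
    return ALIAS_RE.fullmatch(alias) is not None


def follows_required_path(statuses: list[str]) -> bool:
    """True if the run passed through running before reaching success."""
    return statuses == REQUIRED_PATH
```

An alias with uppercase letters or a leading hyphen fails the first check, and a run that jumps straight from pending to success fails the second.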
Wall-clock soft-warn: ⚠️ Result: GREEN (over budget ...) means the run succeeded but exceeded the 180 s budget. Not a failure; expected on slower hardware. The integration test (tests/test_e2e_demo.py) follows the same soft-warn semantics.
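The relationship between the exit codes and the soft-warn banner can be summarized in a few lines. This is an illustrative sketch, not the logic in scripts/run_demo.py: summarize_run and its banner strings are hypothetical, and only the 180 s budget and the exit codes 1 and 2 come from this runbook.

```python
BUDGET_SECONDS = 180.0  # wall-clock budget quoted in the runbook


def summarize_run(elapsed_s: float, step_failed: bool, precheck_failed: bool) -> tuple[str, int]:
    """Return (banner, exit_code) for a demo run. Hypothetical helper."""
    if precheck_failed:
        return ("RED (precondition)", 2)
    if step_failed:
        return ("RED (step failure)", 1)
    if elapsed_s > BUDGET_SECONDS:
        # Soft warning only: the run still counts as a success.
        return ("GREEN (over budget)", 0)
    return ("GREEN", 0)
```

The point of the design is that exceeding the budget changes the banner but never the exit code, so slow hardware does not turn a passing run into a failing one.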
Capture artifacts for a postmortem:
# Nightly CI uploads .ci-logs/uvicorn.log on failure (e2e-nightly.yml).
# Locally, capture both streams:
uv run python scripts/run_demo.py --seed 42 --quiet 2>&1 | tee demo.log
Symptoms: The dashboard Showcase page (/showcase), or POST /demo/run, shows a step card flip to ❌, the run stops, and the summary banner is red.
Diagnosis flow (matches app/features/demo/pipeline.py step names):
- status step fails: skip_seed=true (the default) ran against an empty database. Seed first: tick Re-seed first on the page, or POST the demo_minimal scenario to /seeder/generate, or run make demo once.
- register step fails with HTTP 500 -- Database Error: the registry's _find_duplicate hit multiple pre-existing model_run rows with the same config hash (accumulated by prior make demo / run_demo.py runs). Not a demo-slice bug; the demo correctly surfaces the registry's 500. Fix by clearing stale runs or running against a fresh database.
- agent step shows ⏭️: no API key matches the configured agent_default_model provider, or the provider rejected the key. Expected; not a failure. The pipeline still goes green.
- Page shows an error banner ("Pipeline could not start"): either the start frame was malformed, or another run is already in progress (409). Only one demo pipeline runs at a time (module-level asyncio.Lock). Wait for the active run to finish.
Notes: the POST /demo/run body and WS /demo/stream events are documented in docs/_base/API_CONTRACTS.md. The pipeline mirrors scripts/run_demo.py; the per-step diagnosis for make demo above applies to the same steps.
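The one-run-at-a-time behavior behind the 409 can be pictured with a module-level asyncio.Lock. This is a minimal sketch assuming a reject-when-busy policy; start_pipeline, demo, and the sleep standing in for the pipeline steps are hypothetical and are not taken from app/features/demo/pipeline.py.

```python
import asyncio

# Module-level lock: at most one pipeline run per process.
_pipeline_lock = asyncio.Lock()


async def start_pipeline(run_seconds: float = 0.05) -> int:
    """Return an HTTP-style status: 200 if the run executed, 409 if one is active."""
    if _pipeline_lock.locked():
        return 409  # reject instead of queueing behind the active run
    async with _pipeline_lock:
        await asyncio.sleep(run_seconds)  # stand-in for the real pipeline steps
        return 200


async def demo() -> list[int]:
    # Two overlapping requests: the first takes the lock, the second is rejected.
    return list(await asyncio.gather(start_pipeline(), start_pipeline()))
```

Because there is no await between the locked() check and acquiring the lock, two requests on the same event loop cannot both pass the check. A lock like this is per process, so it would not coordinate runs across multiple workers.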
Symptoms: dev → main PR is merged, CD Release workflow on main completes in ~10s, no Release PR is opened. release-please log shows No user facing commits found since <sha> - skipping.
Root cause: gh pr merge --merge uses the PR title as the merge-commit subject. If that subject is a valid conventional commit of a non-bumping type (chore, docs, refactor, test, ci), release-please reads it at face value, classifies the whole merge as non-bumping, and stops. Prior dev→main merges done via the GitHub web UI used the default Merge pull request #N from <branch> subject — non-conventional — so release-please traversed to the underlying commits and bumped correctly.
Diagnosis:
git log origin/main -1 --format='%s' # if this matches type(scope): ..., that's the trap
gh run view <cd-release-run-id> --log | grep -E "(Considering|No user facing)"
Prevention (any one of):
- Merge dev → main via the GitHub web UI (uses the default non-conventional subject).
- Force a non-conventional subject from the CLI: gh pr merge <N> --merge --subject "Merge pull request #<N> from w7-mgfcode/dev".
- Title the dev → main PR with a feat: or fix: type so the subject bumps regardless.
Recovery (after the trap fires):
- Open an issue tracking the release (release: cut vX.Y.Z for ...).
- Branch off dev: git switch -c feat/release-trigger-X-Y-Z.
- git commit --allow-empty -m "feat(release): trigger vX.Y.Z release for <slice> (#<issue>)".
- PR to main, wait for all four main status checks to go green (Lint & Format, Type Check, Test, Migration Check; enforced at the branch-protection layer since #108), then admin-merge. The empty feat: becomes the merge subject and release-please bumps PATCH (pre-1.0 config).
Reference example: PRs #99 → #100 → #101 for v0.2.8. Plan ~3-5 min between push and the merge button becoming available.
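The trap described in this incident can be checked mechanically before merging. This is a rough sketch: is_release_trap and the regex are assumptions that approximate the conventional-commit subject shape, not release-please's own parser, and the non-bumping types are the ones listed in the root cause above.

```python
import re

# Conventional-commit subject: type, optional (scope), optional "!", then ": ".
SUBJECT_RE = re.compile(r"^(?P<type>[a-z]+)(?:\([^)]*\))?(?P<bang>!)?: .+")

# Types this runbook lists as non-bumping.
NON_BUMPING = {"chore", "docs", "refactor", "test", "ci"}


def is_release_trap(subject: str) -> bool:
    """True if the merge subject parses as a non-bumping conventional commit."""
    m = SUBJECT_RE.match(subject)
    if m is None:
        return False  # non-conventional, e.g. "Merge pull request #N from ..."
    if m.group("bang"):
        return False  # assumption: a breaking-change marker is treated as user facing
    return m.group("type") in NON_BUMPING
```

Running the check against the PR title before gh pr merge --merge, or against the output of git log origin/main -1 --format='%s' afterwards, shows whether the subject is likely to be skipped.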
There is no "production" — break-glass is N/A. The closest equivalent is the seeder:
uv run python scripts/seed_random.py --delete --confirm
uv run python scripts/seed_random.py --full-new --seed 42 --confirm
uv run python scripts/seed_random.py --verify
There are no managed secrets; keys live in the developer's .env. Rotation = edit .env, restart uvicorn. Never commit .env. .env.example is the canonical schema; new env vars must land there first.
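The rule that new variables land in .env.example first can be spot-checked by comparing variable names only, so no secret value is ever printed. This is a sketch assuming simple KEY=value lines; env_keys and undocumented_keys are hypothetical helpers, not part of the repository.

```python
def env_keys(text: str) -> set[str]:
    """Collect variable names from dotenv-style text, ignoring values entirely."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        keys.add(name)
    return keys


def undocumented_keys(env_text: str, example_text: str) -> list[str]:
    """Names present in .env but missing from .env.example."""
    return sorted(env_keys(env_text) - env_keys(example_text))
```

A typical call reads both files with pathlib.Path(".env").read_text() and pathlib.Path(".env.example").read_text(); any name in the result still needs an entry in .env.example.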
# from dev, ensure CI green
gh pr create --base main --head dev --title "release: ..."
# merge PR via the GitHub web UI (or with an explicit non-conventional --subject)
# → release-please opens "chore(main): release X.Y.Z" PR on main
# merge that PR → release-please tags vX.Y.Z and cd-release.yml uploads wheel
⚠️ Do NOT merge with gh pr merge --merge if the PR title is a non-bumping conventional commit (chore(...), docs(...), etc.): the merge-commit subject becomes the PR title verbatim and release-please will skip the bump. See the "release-please skipped the bump after a dev → main merge" incident above.
# undoing a tag is destructive; prefer cutting a new patch release with the fix
git revert <bad-commit-sha>
gh pr create --base dev --head fix/<slug>
# proceed through normal release flow → vX.Y.(Z+1)
Never git push --force on dev or main (see .claude/rules/security-patterns.md).
- Backend logs: stdout from uvicorn (JSON in production, console in development). Each request carries an X-Request-ID header echoed in error bodies (request_id field); grep logs by that ID.
- Frontend network errors: open browser devtools → Network tab → check /health, then the failing endpoint's status + RFC 7807 body.
- Agent issues: check the agent_session table (app/features/agents/models.py); the message_history JSONB column has the full transcript, including tool calls and pending approvals.
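When the backend logs are in JSON form, filtering by request ID can be done structurally instead of with a plain grep. This is a sketch under an assumption: it expects one JSON object per line with a top-level request_id field, which this runbook does not confirm for log lines (only for error bodies), and lines_for_request is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def lines_for_request(lines: Iterable[str], request_id: str) -> Iterator[dict]:
    """Yield parsed JSON log records whose request_id matches."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip console-format or partial lines
        # Assumption: the ID sits in a top-level "request_id" field.
        if isinstance(record, dict) and record.get("request_id") == request_id:
            yield record
```

Passing an open log file (or sys.stdin) as lines keeps memory use flat, since records are yielded one at a time.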