Claude Code reads this file every session. When you hit a non-obvious problem and find the fix, add it here so future sessions don't repeat the same mistake.
Format: pattern-oriented, not timeline. Group by category. Rule: only add entries confirmed by actual code/test — no speculation.
- Python runtime: the venv under `server/.venv/` uses Python 3.12. Run tests via `cd server && .venv/bin/python -m pytest`. Do NOT use `python3.14 -m pytest` — 3.14 has a separate env without project deps. Exclude `tests/test_e2e/`, which requires a live server.
- Package manager: `uv` (not pip). Run from the `server/` dir. `uv sync --all-extras` installs deps; `uv run <cmd>` runs in the venv.
- Frontend Node shim: some local shells still launch `npm run ...` with an old system `node` even when a newer `fnm` install exists, which breaks Vitest's ESM entrypoint. Route dashboard scripts through `web/scripts/with-node.sh`; it requires Node `>=20.9.0` and falls back to `OPENHIVE_NODE_BIN`, PATH `node`, then repo-local `fnm` / Volta / asdf installs.
- Next 16 dev server: run dashboard dev with webpack (`next dev --webpack`). The default Turbopack dev server has returned App Router 404s for valid routes and skipped the `src/proxy.ts` API rewrite in this repo, while webpack dev and production build both serve `/projects` and `/api/*` correctly.
- PyPI mirror: PyPI is unreachable directly — `server/uv.toml` sets the Tsinghua mirror.
- Env file location: `.env` at repo root. `HiveSettings` searches `(".env", "../.env")`.
- DB_PASSWORD clash: `.env` has `DB_PASSWORD` for docker-compose only. Pydantic Settings would choke on it — `extra="ignore"` in `HiveSettings` prevents the error.
- Postgres: `docker compose up -d postgres` (uses `DB_PASSWORD` from `.env`).
- Dev server: `make run` → `cd server && uv run uvicorn hive.main:app --reload --port 8080`
- lark-oapi + uvloop: `ws.Client` captures the event loop at import time in a module-level variable. Under uvicorn (uvloop), calling `run_until_complete` from a thread raises `RuntimeError: this event loop is already running`. Fix: create a `SelectorEventLoop` per daemon thread, set it as current before constructing `Client`, patch `ws_mod.loop` under `_ws_module_lock`, then run `client._connect()` and `_select()` on that thread-local loop. See `feishu_ws.start_ws_client`.
- Callbacks into closed loop: `asyncio.run_coroutine_threadsafe(coro, loop)` raises `RuntimeError: Event loop is closed` during shutdown. Fix: guard with `if not loop.is_closed()` before calling. Use the `schedule_on_loop()` helper in `feishu_ws.py`.
- SQLAlchemy `metadata` conflict: never name a column `metadata` — it clashes with `Base.metadata`. Use `extra` instead (see `FeedbackQueue.extra`).
- Pydantic v2 immutability: never mutate a model after construction. Use `model_copy(update=...)`.
- pytest-asyncio: requires `asyncio_mode = "auto"` in `pyproject.toml`; with auto mode, async tests are collected without needing an explicit `@pytest.mark.asyncio` marker.
- APScheduler: `AsyncIOScheduler` must be started after the event loop is running, not at import time.
- Alembic renamed revision aliases: if a local DB stores a superseded revision id such as `0042_platform_extension_entitlement_rights`, startup must repair `alembic_version` before `command.upgrade()` loads the repo script directory. Add the alias to `_LEGACY_REVISION_REPAIRS` with schema checks so complete DBs are stamped to the canonical id and incomplete DBs fall back to the prior revision.
- Business-ops Alembic aliases: business-operations slices may rename generated migration ids while local DBs still carry the old id. Confirmed case: `0048_project_entity_registry` must repair to `0048_project_entities` only when both `project_entities` and `project_entity_aliases` plus their indexes exist; otherwise fall back to `0047_source_ingestion` so Alembic can apply the entity migration normally.
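The closed-loop guard above can be sketched as a small helper. This is an illustrative reimplementation of the idea, not the repo's actual `feishu_ws.py` code; the signature is assumed:

```python
import asyncio

def schedule_on_loop(loop: asyncio.AbstractEventLoop, coro) -> bool:
    """Submit coro to loop from another thread; drop it if the loop is gone.

    Guarding with loop.is_closed() avoids the shutdown-time
    RuntimeError: Event loop is closed.
    """
    if loop.is_closed():
        coro.close()  # suppress the "coroutine was never awaited" warning
        return False
    asyncio.run_coroutine_threadsafe(coro, loop)
    return True
```

Call sites then treat a `False` return as "loop is shutting down, drop the event" instead of crashing the WS callback thread.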
- session_factory: returning a raw session caused a context leak → return an async context manager instead. Applied in `auth.py`, `bot.py`, `router.py`.
- Optional[X]: project convention is `X | None` — pyright and ruff both prefer the union syntax.
- Agent infinite loop: `agent.run()` without a turn limit would loop forever on tool-call chains → added a `MAX_TURNS` guard that raises `HiveError` after the limit.
- Mocking internal code: early tests mocked internal functions → violated the "only mock external I/O" rule. Refactored to mock only LLM/Feishu/DB boundaries.
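The `MAX_TURNS` guard above, in miniature. This is a hypothetical shape — `run_turn` stands in for one LLM/tool round-trip, and the limit value is an assumption; the real guard lives inside `agent.run()`:

```python
MAX_TURNS = 10  # actual limit value is an assumption

class HiveError(RuntimeError):
    """Stand-in for the project's HiveError."""

def run(run_turn) -> str:
    """Drive tool-call turns, but never more than MAX_TURNS of them."""
    for _ in range(MAX_TURNS):
        done, text = run_turn()  # one LLM/tool round-trip
        if done:
            return text
    raise HiveError(f"agent exceeded MAX_TURNS={MAX_TURNS}")
```

The key property: a tool-call chain that never terminates surfaces as a loud `HiveError` instead of a silently spinning agent.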
- packages.shared importable:
packages/shared/api_types.pyis outsideserver/— add repo root to sys.path in TWO places:server/conftest.pyfor tests, and at the top ofhive/main.pyfor the live server (Path(__file__).resolve().parent.parent.parent= repo root fromserver/hive/main.py). - TYPE_CHECKING + response_model: FastAPI
response_model=SomeTypeneeds the type at module load time. If the type is inTYPE_CHECKING, it raisesNameError. Always import Pydantic models at module level in API router files. - Eager init.py imports: adding
from hive.gateway.api import admin, auth, projectsin__init__.pyforces all three modules to load together, surfacing import-time errors immediately. Remove eager imports from__init__.py; import modules explicitly where needed. - AsyncMock.add() pitfall:
session.add()in SQLAlchemy is sync.AsyncMock()auto-mocks.addas async, creating an unawaited coroutine warning. Fix:session.add = MagicMock(). - server_default not set in tests: after
session.add(obj); await session.commit(), columns withserver_default=func.now()remainNonein ORM objects. Fix: addawait session.refresh(obj)in production code, then mocksession.refreshasasync def _refresh(obj): obj.created_at = datetime(...)in tests. - DB field names vs API names:
Agent.role(notagent_type),Group.id/Group.name(notgroup_id/group_name),FeedbackQueue.processedbool (notstatusenum),AgentChangehas noapplied_at. Always verify actual model columns before writing API layer code.
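The two mocking fixes above combine into one session test double. A sketch under stated assumptions — the `make_session` helper name and the fixed timestamp are illustrative, not project code:

```python
from datetime import datetime
from unittest.mock import AsyncMock, MagicMock

def make_session():
    session = AsyncMock()       # commit()/refresh() default to async mocks
    session.add = MagicMock()   # SQLAlchemy add() is sync -> avoids the warning

    async def _refresh(obj):    # simulate server_default=func.now() filling in
        obj.created_at = datetime(2024, 1, 1)

    session.refresh = AsyncMock(side_effect=_refresh)
    return session
```

With this double, production code that does `add → commit → refresh` sees a populated `created_at` just as it would against a real Postgres.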
- CORS dev setup: FastAPI has no CORS by default; add `CORSMiddleware` in `hive/main.py` (not `app.py`, which is unused), with `allow_origins=["http://localhost:3000"]` and `allow_credentials=True`.
- Cross-origin cookies blocked: even with CORS fixed, `SameSite=Lax` cookies are not sent on cross-origin fetches. Fix: add `rewrites` in `next.config.ts` to proxy `/api/:path*` → `http://localhost:8080/api/:path*`, and set `API_BASE_URL = ""` in `lib/api.ts` so all requests are same-origin.
- app.py is unused: the live server entry point is `hive/main.py` (module-level `app = FastAPI(...)`). `create_app()` in `app.py` is dead code — changes there have no effect.
- Playwright session cookie: set `httpOnly: false` when injecting a test session cookie via Playwright (`httpOnly` blocks JS reads, but the bigger issue is `sameSite`). Use the Next.js proxy approach instead.
- DB seed order: SQLAlchemy batches inserts in a single flush; FK constraints fail if the parent is not flushed first. Call `await session.flush()` after each parent row before adding children.
- DB model required fields: `Project.config_path`, `Group.config_path`, and `Agent.config_path` are all `nullable=False` — always provide them in seed scripts.
- Dashboard i18n shape: page-local `COPY` dictionaries make locale coverage drift fast once dashboard surfaces expand. Put dashboard strings in centralized `web/src/lib/i18n/` module catalogs, consume them through `translateDashboardText()` / `useDashboardLanguage().t()`, and keep a catalog-parity test so adding a locale like Japanese becomes a structured catalog change instead of a page-by-page hunt.
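The catalog-parity idea above, sketched in Python for illustration. The real parity test lives in the web workspace; the data shape here (locale → key → string) is an assumption:

```python
def catalog_parity_errors(catalogs: dict[str, dict[str, str]]) -> list[str]:
    """Report key drift between locales, relative to the first locale."""
    locales = sorted(catalogs)
    base = set(catalogs[locales[0]])
    errors: list[str] = []
    for locale in locales[1:]:
        keys = set(catalogs[locale])
        if base - keys:
            errors.append(f"{locale} missing: {sorted(base - keys)}")
        if keys - base:
            errors.append(f"{locale} extra: {sorted(keys - base)}")
    return errors
```

A test that asserts `catalog_parity_errors(...) == []` turns a forgotten translation into a CI failure instead of a blank dashboard string.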
- groups.id = chat_id PK: using the Feishu chat_id as the groups-table PK blocks multi-bot-per-group. Fix: use a UUID PK + a `chat_id` column + a `(chat_id, channel_id)` UNIQUE constraint.
- Lazy import patch target: `from hive.gateway.crypto import encrypt_secret` inside a handler function creates a local binding. `patch("hive.gateway.api.projects.encrypt_secret")` fails because the attribute doesn't exist at module level. Either patch the source module (`patch("hive.gateway.crypto.encrypt_secret")` — but only if already imported) or use `skipif` for crypto-dependent tests.
- ScopedDB.query vs get_by_id: `ScopedDB.get_by_id(table, pk)` uses the PK column. After changing the groups PK from chat_id to UUID, all calls that passed `chat_id` as the lookup key must switch to `ScopedDB.query(table, chat_id=value)`.
- Alembic PK swap pattern: you cannot directly change a PK in Postgres without dropping/recreating the constraint. Pattern: add new UUID column → populate → drop old PK → rename columns → add new PK → drop temp column.
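One plausible SQL expansion of the PK-swap pattern above, using the groups example. Constraint names, the temp-column name, and the `gen_random_uuid()` default are assumptions; here the temp column is renamed into place, which replaces the final "drop temp column" step:

```python
# Ordered DDL for the swap; each statement maps to one Alembic op.execute().
PK_SWAP_STEPS = [
    "ALTER TABLE groups ADD COLUMN new_id UUID",
    "UPDATE groups SET new_id = gen_random_uuid() WHERE new_id IS NULL",
    "ALTER TABLE groups ALTER COLUMN new_id SET NOT NULL",
    "ALTER TABLE groups DROP CONSTRAINT groups_pkey",        # old PK gone first
    "ALTER TABLE groups RENAME COLUMN id TO chat_id",        # keep old value
    "ALTER TABLE groups RENAME COLUMN new_id TO id",
    "ALTER TABLE groups ADD PRIMARY KEY (id)",               # new UUID PK
    "ALTER TABLE groups ADD CONSTRAINT uq_groups_chat_channel"
    " UNIQUE (chat_id, channel_id)",
]
```

Ordering is the whole point: the old PK constraint must be dropped before the new one is added, and population must finish before `NOT NULL` is enforced.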
- redirect_slashes loop: Next.js (default `trailingSlash: false`) strips trailing slashes via 308. FastAPI (default `redirect_slashes=True`) adds them back via 307. Result: an infinite redirect loop in the browser. Fix: add `redirect_slashes=False` to `FastAPI(...)` in `hive/main.py` AND to each `APIRouter(...)`, then change `"/"` collection routes to `""` so they match the stripped path.
- app.py vs main.py (again): `create_app()` in `app.py` is dead code — the live app is the module-level `FastAPI` instance in `hive/main.py`. Fixes to `app.py` silently have no effect.
- DB project rows must match the live runtime workspace: if an ad hoc E2E script writes `projects` rows into the shared local DB but stores project files under `.artifacts/`, the live dashboard resolves those files under `HIVE_WORKSPACE` (normally `.runtime`) and endpoints such as `/prompt-review/config` 404. Either keep the whole proof isolated and clean up the DB rows, or materialize project files under `.runtime/projects/<project_id>/`.
- Alembic stamp-then-migrate pitfall: `alembic stamp <rev>` skips all intermediate migrations without running them. If you stamp over unapplied migrations and the next `upgrade head` call then fails mid-run, PostgreSQL's transactional DDL rolls back all changes in that transaction. The result: Alembic thinks it is at head but the schema is missing columns. Fix: apply the missing DDL manually via SQL, then stamp to the correct revision.
- Playwright Chrome session lock: Playwright MCP uses a persistent Chrome profile. If a previous session left Chrome running, a new `browser_navigate` call fails with "Opening in existing browser session." Fix: `pkill -f "mcp-chrome-<hash>"` to release the profile lock.
- Browser 308 cache: a 308 Permanent Redirect is cached by the browser. After fixing a redirect loop on the server side, the browser still loops from cached 308s. Fix: clear the cache via CDP: `const client = await page.context().newCDPSession(page); await client.send('Network.clearBrowserCache')`.
- Rich-text messages silently dropped: `parse_sdk_event` only handled `msg_type="text"`. Backticks, bold, @mentions, or multi-line input cause Feishu to send `msg_type="post"` (rich text) with a nested paragraph structure instead of a top-level `"text"` key → `content.get("text", "")` returns `""` → the message is dropped with zero logs. Fix: check `message.message_type` and add `_extract_text_from_post()` to walk the `[[{tag, text}, ...]]` paragraphs. Also handle the locale-wrapped variant (`{"zh_cn": {"content": [...]}}`).
- Dual-WS duplicate processing: both the global WS and the per-project WS receive events for the same group. Each creates its own `build_event_handler` closure, so `_route_group`'s dedup dict doesn't help. Fix: a module-level `_SEEN_MESSAGE_IDS` dict in `feishu_ws.py` with a `threading.Lock` (WS callbacks run from daemon threads), checked in `handle_message` before dispatching. 30s TTL.
- Fire-and-forget asyncio.create_task: `asyncio.create_task(_run_keeper())` without storing the reference lets Python GC collect the task mid-flight (the event loop only holds weak refs). Long-running tool-call loops are especially vulnerable. Fix: a `_BACKGROUND_TASKS` set in `router.py` with `task.add_done_callback(_BACKGROUND_TASKS.discard)`.
- Keeper missing fallback: `_run_keeper` had no `else` clause — when `agent.run()` returned `""`, Keeper sent nothing. Always mirror Scout's fallback pattern in `_run_keeper`.
- Qwen3 thinking-mode empty content: Qwen3-max sometimes writes its full explanation in `reasoning_content` and leaves `content = None`/`""` for final-turn responses. `OpenAICompatibleProvider._to_llm_response` now falls back to `reasoning_content` when `content` is empty and there are no tool calls.
- Tool-call wire-format mismatch: the runtime stores tool_calls in OpenAI format (`tc["function"]["name"]`), but `AnthropicProvider._convert_message` read them flat (`tc["name"]`) → KeyError on the second LLM turn. Also, `ToolRegistry` used `"tool_call_id"` but Anthropic expects `"tool_use_id"`. Fix: unified the internal format on `"tool_use_id"`; each provider converts to its wire format. Added `_convert_message` to `OpenAICompatibleProvider` for the reverse translation.
- Scout send_message double-reply: if Scout calls the `send_message` tool to answer the user AND returns `resp.text = ""`, the router's fallback fires on top — sending two messages. Fix: track `_sent_this_turn` on `SendMessageTool`, reset it before each `run()`, and suppress the fallback when the tool already sent a reply. Also: update the `send_message` tool description to explicitly say "do NOT use this to reply to the current user request".
- Scout Local Chat setup fallback can hide pool saturation: repeated Local Chat eval failures that return `"An error occurred while preparing this Local Chat room message"` can come from hitting `HIVE_MAX_ACTIVE_AGENTS`, not from the benchmark prompt itself. In one confirmed run, restarting with more agent-pool headroom restored the previously failing `gaia-memory-probes` suite from intermittent failures back to 6/6.
- Business terms in Core: it is easy to accidentally add sentiment/negative/feedback-type enums in `hive/` code. Litmus test: "Would this logic apply to an e-commerce scenario?" If no → it belongs in `plugins/` or `skills/`.
- Bare dicts crossing boundaries: it is tempting to return `{"status": "ok"}` from tools → always use a typed dataclass or Pydantic model.
- Direct LocalAgentPool import: upper-layer code must only use the `AgentPool` abstract type. Caught this in `gateway/router.py` during the Week 1 review.
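The typed-boundary rule above in miniature — a sketch; `ToolResult` is a hypothetical name, not an existing project class:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolResult:
    """Typed replacement for a bare {"status": "ok"} dict at tool boundaries."""
    status: str
    detail: str = ""

    @property
    def ok(self) -> bool:
        return self.status == "ok"
```

A frozen dataclass gives callers attribute access, equality, and a place to hang invariants, where a bare dict gives typos a free pass.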
- Remote sandbox cannot trust operator host paths: passing `workspace_path=/Users/...` from a laptop through Gateway to a K8s sandbox pod fails even if the path exists locally — the pod cannot see the operator filesystem. Fix: for cross-container / cross-host dev-task creation, upload a workspace snapshot (for example `workspace_archive_b64`) and seed the task repo inside the sandbox; only use a raw `workspace_path` when the sandbox can actually access that filesystem.
- Source-local sandbox Codex must inherit the operator's Codex config before inferring provider defaults: for local `.runtime/sandbox/tasks/...` runs, forcing inferred qwen/DashScope bootstrap made benchmark tasks diverge from `~/.codex/config.toml` and reproduced provider-specific tool-call failures. Fix: copy the operator `~/.codex/config.toml` (and `auth.json` when present) into the isolated sandbox home first, set `HIVE_SANDBOX_CODEX_MODEL` from that copied config when available, and only fall back to inferred provider bootstrap when no operator Codex config exists.
- Source-local sandbox Codex copies must honor `preferred_auth_method = "apikey"`: when the operator config selected a provider such as `club` with `requires_openai_auth = true`, source-local dev tasks could fail with expired refresh-token errors even though the copied provider config already had an API-key path. Fix: when seeding the sandbox-local `~/.codex/config.toml`, rewrite the selected provider to `requires_openai_auth = false` whenever the operator config explicitly prefers API-key auth and the provider has `api_key` or `env_key` configured.
- Source-local sandbox Codex copies must also export the copied provider's env key before sandbox env scrubbing: local dev-task launches run with `inherit_parent_env=False`, so copying `~/.codex/config.toml` alone still left Codex without the selected provider's `env_key` secret. Fix: after copying the operator config, read the selected provider metadata, set `HIVE_SANDBOX_CODEX_{MODEL,PROVIDER_ID,PROVIDER_NAME,BASE_URL,ENV_KEY,REQUIRES_OPENAI_AUTH}` from that config, and seed the provider's env var from the copied `api_key` when present so the sandbox can pass the right secret instead of falling back toward DashScope/Qwen defaults.
- Source-local sandbox Codex auth copies must rewrite `auth.json` to match the selected provider's env key: copying the operator `~/.codex/auth.json` verbatim can preserve a conflicting generic key such as `OPENAI_API_KEY` from a different provider, even when the copied sandbox config points at another API-key-backed provider. Fix: when copying `auth.json`, read the selected provider from the copied `config.toml` and overwrite that provider's `env_key` entry with the selected provider's `api_key`.
- Sandbox-local `codex exec` needs bounded retries for transient provider failures: live benchmark dev tasks can fail spuriously on provider-side disconnects such as `stream disconnected before completion` or token-refresh hiccups even when the prompt and workspace are valid. Fix: wrap `codex exec` in a small retry helper that preserves stdout/stderr, retries only known transient provider failures, and leaves deterministic prompt/workspace errors fail-fast.
- Archive-seeded approval needs an explicit apply-back channel: if the remote sandbox was seeded from `workspace_archive_b64`, applying the patch against the task-local repo fails because that repo already contains the edits. Preserve `requested_workspace_path` through the sandbox command handoff, apply back to that path when it is reachable, and fail closed or use `HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL` for operator-host workspaces that only the host can mutate.
- Host apply relays must bypass provider proxy env: sandbox pods may carry `HTTP_PROXY`/`HTTPS_PROXY` for real `codex_cli` provider access. Internal callbacks to an operator-local apply relay must use `httpx.AsyncClient(..., trust_env=False)` or add an explicit no-proxy path; otherwise the provider proxy can intercept the relay call and return misleading 5xx errors.
- Reviewable sandbox tasks need artifact tooling, not just the coding binary: a task can execute and still never reach `awaiting_approval` if the runtime image lacks the tools needed to generate patch evidence. In BG-08 the proof image first lacked `codex`, then lacked `git`; the durable fix was to add a baseline-snapshot diff fallback in `LocalSandboxBackend` so patch / changed-files artifacts still materialize even when `git diff` cannot run.
- Proof-path claims must stay narrower than production-path claims: the deterministic BG-08 `codex` shim proved the operator review/apply control loop in local and cluster proof runs, but it did not prove the default provider-backed `codex_cli` path. Keep docs, run-state notes, and release claims explicit about whether evidence came from the real default runtime or a proof-only shim path.
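The bounded-retry helper from the `codex exec` entry above can be sketched like this. Names and marker strings are assumptions drawn from the entry; only failures whose stderr matches a known transient marker are retried, so deterministic errors still fail fast:

```python
import subprocess
import time

TRANSIENT_MARKERS = ("stream disconnected before completion", "token refresh")

def run_with_retries(cmd: list[str], attempts: int = 3, delay: float = 1.0):
    """Run cmd, retrying only known transient provider failures.

    Always returns the last CompletedProcess so stdout/stderr are preserved.
    """
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        transient = any(m in (result.stderr or "") for m in TRANSIENT_MARKERS)
        if not transient or attempt == attempts:
            return result  # deterministic failure, or retries exhausted
        time.sleep(delay)
    return result
```

Keeping the marker list short and explicit is the point: an unlisted error is treated as a real prompt/workspace problem and surfaces immediately.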