
Lessons Learned

Claude Code reads this file every session. When you hit a non-obvious problem and find the fix, add it here so future sessions don't repeat the same mistake.

Format: pattern-oriented, not timeline. Group by category. Rule: only add entries confirmed by actual code/test — no speculation.


Environment & Toolchain

  • Python runtime: The venv under server/.venv/ uses Python 3.12. Run tests via cd server && .venv/bin/python -m pytest. Do NOT use python3.14 -m pytest — 3.14 has a separate env without project deps. Exclude tests/test_e2e/ which requires a live server.
  • Package manager: uv (not pip). Run from server/ dir. uv sync --all-extras installs deps; uv run <cmd> runs in venv.
  • Frontend Node shim: some local shells still launch npm run ... with an old system node even when a newer fnm install exists, which breaks Vitest's ESM entrypoint. Route dashboard scripts through web/scripts/with-node.sh; it requires Node >=20.9.0 and falls back to OPENHIVE_NODE_BIN, PATH node, then repo-local fnm / Volta / asdf installs.
  • Next 16 dev server: run dashboard dev with webpack (next dev --webpack). The default Turbopack dev server has returned App Router 404s for valid routes and skipped the src/proxy.ts API rewrite in this repo, while webpack dev and production build both serve /projects and /api/* correctly.
  • PyPI mirror: PyPI unreachable directly — server/uv.toml sets Tsinghua mirror.
  • Env file location: .env at repo root. HiveSettings searches (".env", "../.env").
  • DB_PASSWORD clash: .env defines DB_PASSWORD for docker-compose only. Pydantic Settings would reject the unknown key with a validation error — extra="ignore" in HiveSettings prevents it.
  • Postgres: docker compose up -d postgres (uses DB_PASSWORD from .env).
  • Dev server: make run → cd server && uv run uvicorn hive.main:app --reload --port 8080

Library Gotchas

  • lark-oapi + uvloop: ws.Client captures the event loop at import time in a module-level variable. Under uvicorn (uvloop), calling run_until_complete from a thread raises RuntimeError: this event loop is already running. Fix: create a SelectorEventLoop per daemon thread, set it as current before constructing Client, patch ws_mod.loop under _ws_module_lock, then run client._connect() and _select() on that thread-local loop. See feishu_ws.start_ws_client.
  • Callbacks into closed loop: asyncio.run_coroutine_threadsafe(coro, loop) raises RuntimeError: Event loop is closed during shutdown. Fix: guard with if not loop.is_closed() before calling. Use the schedule_on_loop() helper in feishu_ws.py.
  • SQLAlchemy metadata conflict: never name a column metadata — it clashes with Base.metadata. Use extra instead (see FeedbackQueue.extra).
  • Pydantic v2 immutability: never mutate a model after construction. Use model_copy(update=...).
  • pytest-asyncio: set asyncio_mode = "auto" in pyproject.toml so async tests run without per-test markers; under the default strict mode, every async test needs an explicit @pytest.mark.asyncio.
  • APScheduler: AsyncIOScheduler must be started after the event loop is running, not at import time.
  • Alembic renamed revision aliases: if a local DB stores a superseded revision id such as 0042_platform_extension_entitlement_rights, startup must repair alembic_version before command.upgrade() loads the repo script directory. Add the alias to _LEGACY_REVISION_REPAIRS with schema checks so complete DBs are stamped to the canonical id and incomplete DBs fall back to the prior revision.
  • Business-ops Alembic aliases: business-operations slices may rename generated migration ids while local DBs still carry the old id. Confirmed case: 0048_project_entity_registry must repair to 0048_project_entities only when both project_entities and project_entity_aliases plus their indexes exist; otherwise fall back to 0047_source_ingestion so Alembic can apply the entity migration normally.
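
The loop-closed guard above boils down to a few lines. This is a sketch with an assumed signature, not the exact schedule_on_loop() helper from feishu_ws.py:

```python
import asyncio

def schedule_on_loop(coro, loop: asyncio.AbstractEventLoop):
    # During shutdown the target loop may already be closed;
    # run_coroutine_threadsafe would then raise "Event loop is closed".
    if loop.is_closed():
        coro.close()  # avoid a "coroutine was never awaited" warning
        return None
    return asyncio.run_coroutine_threadsafe(coro, loop)
```

Callers must treat a None return as "dropped during shutdown" rather than a future to await.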

Patterns That Failed → What Worked

  • session_factory: returning a raw session caused context leak → return an async context manager instead. Applied in auth.py, bot.py, router.py.
  • Optional[X]: project convention is X | None — pyright and ruff both prefer the union syntax.
  • Agent infinite loop: agent.run() without a turn limit would loop forever on tool-call chains → added MAX_TURNS guard that raises HiveError after the limit.
  • Mocking internal code: early tests mocked internal functions → violated "only mock external I/O" rule. Refactored to mock only LLM/Feishu/DB boundaries.
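
The session_factory fix above, sketched with a stand-in session class (FakeSession is illustrative; the real code yields an AsyncSession):

```python
import asyncio
from contextlib import asynccontextmanager

class FakeSession:
    """Stand-in for an AsyncSession (illustrative only)."""
    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True

@asynccontextmanager
async def session_factory():
    # Yielding inside a context manager guarantees close() runs even when
    # the caller raises — returning the raw session leaked it.
    session = FakeSession()
    try:
        yield session
    finally:
        await session.close()

async def demo():
    async with session_factory() as session:
        kept = session
    return kept.closed  # True: closed on context exit
```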

Week 5 — Dashboard API

  • packages.shared importable: packages/shared/api_types.py is outside server/ — add repo root to sys.path in TWO places: server/conftest.py for tests, and at the top of hive/main.py for the live server (Path(__file__).resolve().parent.parent.parent = repo root from server/hive/main.py).
  • TYPE_CHECKING + response_model: FastAPI response_model=SomeType needs the type at module load time. If the type is in TYPE_CHECKING, it raises NameError. Always import Pydantic models at module level in API router files.
  • Eager __init__.py imports: adding from hive.gateway.api import admin, auth, projects in __init__.py forces all three modules to load whenever the package is imported, so one module's import-time error breaks them all. Remove eager imports from __init__.py; import modules explicitly where needed.
  • AsyncMock.add() pitfall: session.add() in SQLAlchemy is sync. AsyncMock() auto-mocks .add as async, creating an unawaited coroutine warning. Fix: session.add = MagicMock().
  • server_default not set in tests: after session.add(obj); await session.commit(), columns with server_default=func.now() remain None in ORM objects. Fix: add await session.refresh(obj) in production code, then mock session.refresh as async def _refresh(obj): obj.created_at = datetime(...) in tests.
  • DB field names vs API names: Agent.role (not agent_type), Group.id/Group.name (not group_id/group_name), FeedbackQueue.processed bool (not status enum), AgentChange has no applied_at. Always verify actual model columns before writing API layer code.
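
The AsyncMock.add() pitfall above can be reproduced in a few lines; session here is a plain mock, not the real SQLAlchemy session:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

session = AsyncMock()
call = session.add("row")          # AsyncMock auto-mocks .add as async...
assert asyncio.iscoroutine(call)   # ...so this is an unawaited coroutine
call.close()                       # silence the RuntimeWarning

session.add = MagicMock()          # the fix: .add is sync in SQLAlchemy
assert not asyncio.iscoroutine(session.add("row"))
```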

Week 5 — Dashboard Frontend / Playwright Testing

  • CORS dev setup: FastAPI has no CORS by default; add CORSMiddleware in hive/main.py (not app.py, which is unused). allow_origins=["http://localhost:3000"], allow_credentials=True.
  • Cross-origin cookies blocked: even with CORS fixed, SameSite=Lax cookies are not sent on cross-origin fetch. Fix: add rewrites in next.config.ts to proxy /api/:path* → http://localhost:8080/api/:path*, and set API_BASE_URL = "" in lib/api.ts so all requests are same-origin.
  • app.py is unused: the live server entry point is hive/main.py (module-level app = FastAPI(...)). create_app() in app.py is dead code — changes there have no effect.
  • Playwright session cookie: when injecting a test session cookie via Playwright, set httpOnly: false so client JS can read it — but httpOnly is the lesser problem; sameSite still blocks the cookie cross-origin. Prefer the Next.js proxy approach so requests stay same-origin.
  • DB seed order: SQLAlchemy batches inserts in a single flush; FK constraints fail if parent not flushed first. Call await session.flush() after each parent row before adding children.
  • DB model required fields: Project.config_path, Group.config_path, Agent.config_path are all nullable=False — always provide them in seed scripts.
  • Dashboard i18n shape: page-local COPY dictionaries make locale coverage drift fast once dashboard surfaces expand. Put dashboard strings in centralized web/src/lib/i18n/ module catalogs, consume them through translateDashboardText() / useDashboardLanguage().t(), and keep a catalog-parity test so adding a locale like Japanese becomes a structured catalog change instead of a page-by-page hunt.

Week 6 — Multi-Channel Refactor

  • groups.id = chat_id PK: using the Feishu chat_id as the groups table PK blocks multi-bot-per-group. Fix: use UUID PK + chat_id column + (chat_id, channel_id) UNIQUE constraint.
  • Lazy import patch target: from hive.gateway.crypto import encrypt_secret inside a handler function creates a local binding. patch("hive.gateway.api.projects.encrypt_secret") fails because the attribute doesn't exist at module level. Either patch the source module (patch("hive.gateway.crypto.encrypt_secret") — but only if already imported) or use skipif for crypto-dependent tests.
  • ScopedDB.query vs get_by_id: ScopedDB.get_by_id(table, pk) uses the PK column. After changing the groups PK from chat_id to UUID, all calls that passed chat_id as the lookup key must switch to ScopedDB.query(table, chat_id=value).
  • Alembic PK swap pattern: cannot directly change a PK in Postgres without dropping/recreating the constraint. Pattern: add new UUID column → populate → drop old PK → rename columns → add new PK → drop temp column.
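
One concrete realization of the PK-swap pattern above, written as plain Postgres DDL (names are illustrative; the actual migration used Alembic ops, and this variant renames the temp column into place instead of dropping it):

```python
# groups: chat_id PK -> UUID PK, keeping chat_id as a plain column.
PK_SWAP_STEPS = [
    "ALTER TABLE groups ADD COLUMN new_id UUID",
    "UPDATE groups SET new_id = gen_random_uuid()",
    "ALTER TABLE groups ALTER COLUMN new_id SET NOT NULL",
    "ALTER TABLE groups DROP CONSTRAINT groups_pkey",
    "ALTER TABLE groups RENAME COLUMN id TO chat_id",
    "ALTER TABLE groups RENAME COLUMN new_id TO id",
    "ALTER TABLE groups ADD PRIMARY KEY (id)",
]
```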

Playwright E2E Testing

  • redirect_slashes loop: Next.js (default trailingSlash: false) strips trailing slashes via 308. FastAPI (default redirect_slashes=True) adds them back via 307. Result: infinite redirect loop in browser. Fix: add redirect_slashes=False to FastAPI(...) in hive/main.py AND to each APIRouter(...), then change "/" collection routes to "" so they match the stripped path.
  • app.py vs main.py (again): create_app() in app.py is dead code — the live app is the module-level FastAPI instance in hive/main.py. Fixes to app.py silently have no effect.
  • DB project rows must match the live runtime workspace: if an ad hoc E2E script writes projects rows into the shared local DB but stores project files under .artifacts/, the live dashboard resolves those files under HIVE_WORKSPACE (normally .runtime) and endpoints such as /prompt-review/config 404. Either keep the whole proof isolated and clean up DB rows, or materialize project files under .runtime/projects/<project_id>/.
  • Alembic stamp-then-migrate pitfall: alembic stamp <rev> skips all intermediate migrations without running them. If you stamp over unapplied migrations and then the next upgrade head call fails mid-run, PostgreSQL's transactional DDL rolls back all changes in that transaction. The result: alembic thinks it's at head but the schema is missing columns. Fix: apply missing DDL manually via SQL, then stamp to the correct revision.
  • Playwright Chrome session lock: Playwright MCP uses a persistent Chrome profile. If a previous session left Chrome running, a new browser_navigate call fails with "Opening in existing browser session." Fix: pkill -f "mcp-chrome-<hash>" to release the profile lock.
  • Browser 308 cache: 308 Permanent Redirect is cached by the browser. After fixing a redirect loop on the server side, the browser still loops from cached 308s. Fix: use Network.clearBrowserCache via CDP: const client = await page.context().newCDPSession(page); await client.send('Network.clearBrowserCache').

Feishu Message Handling

  • Rich-text messages silently dropped: parse_sdk_event only handled msg_type="text". Backticks, bold, @mentions, or multi-line input cause Feishu to send msg_type="post" (rich text) with a nested paragraph structure instead of a top-level "text" key → content.get("text", "") returns "" → message dropped with zero logs. Fix: check message.message_type and add _extract_text_from_post() to walk [[{tag, text}, ...]] paragraphs. Also handle locale-wrapped variant ({"zh_cn": {"content": [...]}}).
  • Dual-WS duplicate processing: both the global WS and per-project WS receive events for the same group. Each creates its own build_event_handler closure, so _route_group's dedup dict doesn't help. Fix: module-level _SEEN_MESSAGE_IDS dict in feishu_ws.py with threading.Lock (WS callbacks run from daemon threads), checked in handle_message before dispatching. 30s TTL.
  • Fire-and-forget asyncio.create_task: asyncio.create_task(_run_keeper()) without storing the reference lets Python GC collect the task mid-flight (event loop only holds weak refs). Long-running tool-call loops are especially vulnerable. Fix: _BACKGROUND_TASKS set in router.py with task.add_done_callback(_BACKGROUND_TASKS.discard).
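
The strong-reference pattern above in minimal form (create_pinned_task is a hypothetical name; router.py uses a module-level set the same way):

```python
import asyncio

_BACKGROUND_TASKS: set[asyncio.Task] = set()

def create_pinned_task(coro) -> asyncio.Task:
    # The event loop holds only weak refs to tasks, so a bare
    # asyncio.create_task(...) result can be GC'd mid-flight.
    task = asyncio.create_task(coro)
    _BACKGROUND_TASKS.add(task)
    task.add_done_callback(_BACKGROUND_TASKS.discard)
    return task

async def demo():
    async def keeper():
        await asyncio.sleep(0)
        return "done"
    task = create_pinned_task(keeper())
    assert task in _BACKGROUND_TASKS  # strong ref held while running
    result = await task
    await asyncio.sleep(0)            # let the done-callback fire
    assert task not in _BACKGROUND_TASKS
    return result
```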

Agent Reply Handling

  • Keeper missing fallback: _run_keeper had no else clause — when agent.run() returns "", Keeper sent nothing. Always mirror Scout's fallback pattern in _run_keeper.
  • Qwen3 thinking mode empty content: Qwen3-max sometimes writes its full explanation in reasoning_content and leaves content = None/"" for final-turn responses. OpenAICompatibleProvider._to_llm_response now falls back to reasoning_content when content is empty and there are no tool calls.
  • Tool-call wire format mismatch: runtime stores tool_calls in OpenAI format (tc["function"]["name"]), but AnthropicProvider._convert_message read them flat (tc["name"]) → KeyError on second LLM turn. Also, ToolRegistry used "tool_call_id" but Anthropic expects "tool_use_id". Fix: unified internal format on "tool_use_id"; each provider converts to its wire format. Added _convert_message to OpenAICompatibleProvider for the reverse translation.
  • Scout send_message double-reply: If Scout calls send_message (tool) to answer the user AND returns resp.text = "", the router's fallback fires on top — sending two messages. Fix: track _sent_this_turn on SendMessageTool, reset it before each run(), suppress fallback when the tool already sent a reply. Also: update send_message tool description to explicitly say "do NOT use this to reply to the current user request".
  • Scout Local Chat setup fallback can hide pool saturation: repeated Local Chat eval failures that return "An error occurred while preparing this Local Chat room message" can come from hitting HIVE_MAX_ACTIVE_AGENTS, not from the benchmark prompt itself. In one confirmed run, restarting with more agent-pool headroom restored the previously failing gaia-memory-probes suite from intermittent failures back to 6/6.
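
The unified tool-call format above, illustrated by two hypothetical adapter functions (the real code lives in the providers' _convert_message methods):

```python
# Internal tool results carry "tool_use_id"; each provider adapter
# translates to its own wire format at the boundary.
def to_openai_tool_result(msg: dict) -> dict:
    return {
        "role": "tool",
        "tool_call_id": msg["tool_use_id"],  # OpenAI wire name
        "content": msg["content"],
    }

def to_anthropic_tool_result(msg: dict) -> dict:
    return {
        "role": "user",
        "content": [{
            "type": "tool_result",
            "tool_use_id": msg["tool_use_id"],  # Anthropic wire name
            "content": msg["content"],
        }],
    }
```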

Architecture Traps

  • Business terms in Core: easy to accidentally add sentiment/negative/feedback-type enums in hive/ code. Litmus test: "Would this logic apply to an e-commerce scenario?" If no → it belongs in plugins/ or skills/.
  • Bare dict crossing boundaries: tempting to return {"status": "ok"} from tools → always use a typed dataclass or Pydantic model.
  • Direct LocalAgentPool import: upper-layer code must only use the AgentPool abstract type. Caught this in gateway/router.py during Week 1 review.
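
The "no bare dicts across boundaries" rule in minimal form (type and field names are illustrative, not actual Hive types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolResult:
    # Typed + frozen: callers get attribute access, type checking,
    # and immutability instead of a shapeless {"status": "ok"} dict.
    ok: bool
    detail: str = ""

def run_tool() -> ToolResult:
    return ToolResult(ok=True)  # instead of returning {"status": "ok"}
```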

Sandbox / K8s Live Proof

  • Remote sandbox cannot trust operator host paths: passing workspace_path=/Users/... from a laptop through Gateway to a K8s sandbox pod fails even if the path exists locally — the pod cannot see the operator filesystem. Fix: for cross-container / cross-host dev-task creation, upload a workspace snapshot (for example workspace_archive_b64) and seed the task repo inside the sandbox; only use raw workspace_path when the sandbox can actually access that filesystem.
  • Source-local sandbox Codex must inherit the operator's Codex config before inferring provider defaults: for local .runtime/sandbox/tasks/... runs, forcing inferred qwen/DashScope bootstrap made benchmark tasks diverge from ~/.codex/config.toml and reproduced provider-specific tool-call failures. Fix: copy the operator ~/.codex/config.toml (and auth.json when present) into the isolated sandbox home first, set HIVE_SANDBOX_CODEX_MODEL from that copied config when available, and only fall back to inferred provider bootstrap when no operator Codex config exists.
  • Source-local sandbox Codex copies must honor preferred_auth_method = "apikey": when the operator config selected a provider such as club with requires_openai_auth = true, source-local dev tasks could fail with expired refresh-token errors even though the copied provider config already had an API-key path. Fix: when seeding the sandbox-local ~/.codex/config.toml, rewrite the selected provider to requires_openai_auth = false whenever the operator config explicitly prefers API-key auth and the provider has api_key or env_key configured.
  • Source-local sandbox Codex copies must also export the copied provider's env key before sandbox env scrubbing: local dev-task launches run with inherit_parent_env=False, so copying ~/.codex/config.toml alone still left Codex without the selected provider's env_key secret. Fix: after copying the operator config, read the selected provider metadata, set HIVE_SANDBOX_CODEX_{MODEL,PROVIDER_ID,PROVIDER_NAME,BASE_URL,ENV_KEY,REQUIRES_OPENAI_AUTH} from that config, and seed the provider's env var from the copied api_key when present so the sandbox can pass the right secret instead of falling back toward DashScope/Qwen defaults.
  • Source-local sandbox Codex auth copies must rewrite auth.json to match the selected provider's env key: copying the operator ~/.codex/auth.json verbatim can preserve a conflicting generic key such as OPENAI_API_KEY from a different provider, even when the copied sandbox config points at another API-key-backed provider. Fix: when copying auth.json, read the selected provider from the copied config.toml and overwrite that provider's env_key entry with the selected provider api_key.
  • Sandbox-local codex exec needs bounded retries for transient provider failures: live benchmark dev tasks can fail spuriously on provider-side disconnects such as stream disconnected before completion or token-refresh hiccups even when the prompt and workspace are valid. Fix: wrap codex exec in a small retry helper that preserves stdout/stderr, retries only known transient provider failures, and leaves deterministic prompt/workspace errors fail-fast.
  • Archive-seeded approval needs an explicit apply-back channel: if the remote sandbox was seeded from workspace_archive_b64, applying the patch against the task-local repo fails because that repo already contains the edits. Preserve requested_workspace_path through the sandbox command handoff, apply back to that path when it is reachable, and fail closed or use HIVE_SANDBOX_WORKSPACE_APPLY_RELAY_URL for operator-host workspaces that only the host can mutate.
  • Host apply relays must bypass provider proxy env: sandbox pods may carry HTTP_PROXY / HTTPS_PROXY for real codex_cli provider access. Internal callbacks to an operator-local apply relay must use httpx.AsyncClient(..., trust_env=False) or add an explicit no-proxy path; otherwise the provider proxy can intercept the relay call and return misleading 5xx errors.
  • Reviewable sandbox tasks need artifact tooling, not just the coding binary: a task can execute and still never reach awaiting_approval if the runtime image lacks the tools needed to generate patch evidence. In BG-08 the proof image first lacked codex, then lacked git; the durable fix was to add a baseline-snapshot diff fallback in LocalSandboxBackend so patch / changed-files artifacts still materialize even when git diff cannot run.
  • Proof-path claims must stay narrower than production-path claims: the deterministic BG-08 codex shim proved the operator review/apply control loop in local and cluster proof runs, but it did not prove the default provider-backed codex_cli path. Keep docs, run-state notes, and release claims explicit about whether evidence came from the real default runtime or a proof-only shim path.
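
The bounded-retry idea above as a sketch (helper name and signature are assumed; only the stderr marker string comes from the entry itself):

```python
import subprocess
import time

# Retry only on known transient provider failures.
TRANSIENT_MARKERS = ("stream disconnected before completion",)

def exec_with_retries(cmd, attempts=3, delay=1.0, runner=subprocess.run):
    """Run cmd, retrying transient provider failures up to `attempts`
    times; deterministic errors fail fast with stdout/stderr preserved."""
    proc = None
    for attempt in range(1, attempts + 1):
        proc = runner(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc
        transient = any(m in (proc.stderr or "") for m in TRANSIENT_MARKERS)
        if not transient or attempt == attempts:
            return proc  # deterministic failure or retries exhausted
        time.sleep(delay)
    return proc
```

`runner` is injectable so the retry logic can be tested without spawning real processes.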