fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e) by webdevcom01-cell · Pull Request #99 · webdevcom01-cell/agent-studio

webdevcom01-cell · 2026-05-18T19:42:42Z

Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session variable and the subsequent query were issued as separate Prisma calls. With pool connections, the SET LOCAL can land on one connection and the SELECT on another — the ef_search tuning silently reverts to the server default before the vector search runs.

Phase 0a fixed this for the org-context session variable in withOrgContext. Phase 0e applies the same $transaction wrap to all three SET LOCAL hnsw.ef_search call sites surfaced during STEP 1 audit:

src/lib/knowledge/search.ts:201 (searchKnowledgeBase)
- Hot path: 6+ production callers (KB chat/eval API routes, kb-search-handler runtime, MCP tool exposure, rag-inject, codebase-rag SDLC).
- Dynamic ef_search (40 | 60 | 100) based on query word count.
- Vector search on KBChunk.
src/lib/memory/hot-cold-tier.ts:113 (getColdMemories)
- No production callers in src/ today, but exported and reachable dynamically. Patched for consistency with the other two sites so the pattern is uniform across the codebase.
- ef_search = 40 (hardcoded).
- Vector search on AgentMemory.
src/lib/runtime/handlers/memory-read-handler.ts:168 (memoryReadHandler vector search path)
- Runtime handler for the memory_read node type. Triggered for every agent run that includes a memory_read node.
- ef_search = 40 (hardcoded).
- Vector search on AgentMemory.

PLAN-V2 §4.5 named only search.ts; the audit instruction in that section ('Apply same pattern to other SET LOCAL callers if discovered during STEP 1 audit') is honored here with all three sites in one PR.

FORENSIC HAL-9 covers the same finding.

Tests:

63 existing tests still pass (search.test.ts utility tests: 30; hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17).
6 new regression-guard tests added in two new files: src/lib/knowledge/tests/search-vector-shape.test.ts (3) src/lib/runtime/handlers/tests/memory-read-handler-vector-shape.test.ts (3)
hot-cold-tier.test.ts mock extended with $transaction delegating to a tx mock; existing assertions on $executeRawUnsafe still hold via tx-aliased fn references.
Each shape test asserts: (a) $transaction is invoked, (b) SET LOCAL runs on the tx, not the outer prisma, (c) callback receives tx, (d) outer prisma's direct exec/query paths are NOT used.
Mocked tests verify SHAPE of the fix. They cannot verify pool-survival semantics; that needs a real Postgres integration harness (TODO comments left for Phase 1+).

Risk:

search.ts is on the KB hot path. $transaction holds a pool connection for the duration of the callback (one SET LOCAL + one SELECT, both microsecond-millisecond). Should not materially affect pool exhaustion under load; verify in staging if concerned.
hot-cold-tier.ts getColdMemories has no production callers, so functional risk is zero.
memory-read-handler.ts vector search is per-agent-run when memory_read node fires. Similar transaction profile to search.ts.

Out of scope (intentional):

Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts and memory-read-handler.ts. Both keep the existing Unsafe variant; SET LOCAL takes no parameters, no SQL injection vector. Refactoring to safer APIs is a separate workstream.

Verification:

vitest run (search.test.ts + search-vector-shape.test.ts + hot-cold-tier.test.ts + memory-read-handler.test.ts + memory-read-handler-vector-shape.test.ts): 69 passed
tsc --noEmit -p tsconfig.json: exit 0

Refs:

skill-rls-rollout-PLAN-V2.md §4.5 (Phase 0e bundled bug fix)
skill-rls-rollout-FORENSIC-REPORT.md HAL-9
PR fix(db): wrap withOrgContext in $transaction so session var survives … #97 (Phase 0a — same fix pattern for withOrgContext)
PR docs(rls): track 3 deferred items from Phase 0a rollout #98 (docs/rls-tech-debt.md — tracking)

Description

Briefly describe the changes in this PR and the motivation behind them.

Type of change

Checklist

I have read the Contributing Guide
My code follows the project code standards (TypeScript strict, no any, Tailwind only)
I have added tests that cover my changes
All tests pass (pnpm test)
Linting passes (pnpm lint)
Type checking passes (pnpm typecheck)
I have updated documentation if needed
No console.log statements left in code
No hardcoded secrets or internal URLs

Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session variable and the subsequent query were issued as separate Prisma calls. With pool connections, the SET LOCAL can land on one connection and the SELECT on another — the ef_search tuning silently reverts to the server default before the vector search runs. Phase 0a fixed this for the org-context session variable in withOrgContext. Phase 0e applies the same $transaction wrap to all three SET LOCAL hnsw.ef_search call sites surfaced during STEP 1 audit: 1. src/lib/knowledge/search.ts:201 (searchKnowledgeBase) - Hot path: 6+ production callers (KB chat/eval API routes, kb-search-handler runtime, MCP tool exposure, rag-inject, codebase-rag SDLC). - Dynamic ef_search (40 | 60 | 100) based on query word count. - Vector search on KBChunk. 2. src/lib/memory/hot-cold-tier.ts:113 (getColdMemories) - No production callers in src/ today, but exported and reachable dynamically. Patched for consistency with the other two sites so the pattern is uniform across the codebase. - ef_search = 40 (hardcoded). - Vector search on AgentMemory. 3. src/lib/runtime/handlers/memory-read-handler.ts:168 (memoryReadHandler vector search path) - Runtime handler for the memory_read node type. Triggered for every agent run that includes a memory_read node. - ef_search = 40 (hardcoded). - Vector search on AgentMemory. PLAN-V2 §4.5 named only search.ts; the audit instruction in that section ('Apply same pattern to other SET LOCAL callers if discovered during STEP 1 audit') is honored here with all three sites in one PR. FORENSIC HAL-9 covers the same finding. Tests: - 63 existing tests still pass (search.test.ts utility tests: 30; hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17). - 6 new regression-guard tests added in two new files: src/lib/knowledge/__tests__/search-vector-shape.test.ts (3) src/lib/runtime/handlers/__tests__/memory-read-handler-vector-shape.test.ts (3) - hot-cold-tier.test.ts mock extended with $transaction delegating to a tx mock; existing assertions on $executeRawUnsafe still hold via tx-aliased fn references. - Each shape test asserts: (a) $transaction is invoked, (b) SET LOCAL runs on the tx, not the outer prisma, (c) callback receives tx, (d) outer prisma's direct exec/query paths are NOT used. - Mocked tests verify SHAPE of the fix. They cannot verify pool-survival semantics; that needs a real Postgres integration harness (TODO comments left for Phase 1+). Risk: - search.ts is on the KB hot path. $transaction holds a pool connection for the duration of the callback (one SET LOCAL + one SELECT, both microsecond-millisecond). Should not materially affect pool exhaustion under load; verify in staging if concerned. - hot-cold-tier.ts getColdMemories has no production callers, so functional risk is zero. - memory-read-handler.ts vector search is per-agent-run when memory_read node fires. Similar transaction profile to search.ts. Out of scope (intentional): - Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts and memory-read-handler.ts. Both keep the existing Unsafe variant; SET LOCAL takes no parameters, no SQL injection vector. Refactoring to safer APIs is a separate workstream. Verification: - vitest run (search.test.ts + search-vector-shape.test.ts + hot-cold-tier.test.ts + memory-read-handler.test.ts + memory-read-handler-vector-shape.test.ts): 69 passed - tsc --noEmit -p tsconfig.json: exit 0 Refs: - skill-rls-rollout-PLAN-V2.md §4.5 (Phase 0e bundled bug fix) - skill-rls-rollout-FORENSIC-REPORT.md HAL-9 - PR #97 (Phase 0a — same fix pattern for withOrgContext) - PR #98 (docs/rls-tech-debt.md — tracking)

The E2E job in CI has been failing on every push to main since at least PR #99 (Phase 0e merge) with the same error: Error: Timed out waiting 120000ms from config.webServer. Root cause: Playwright's `webServer.timeout` was set to 120_000 ms (2 minutes), but the CI command is `pnpm build && pnpm start`, which has to run a full Next.js production build on a fresh CI runner. The build alone takes 5-15 minutes (4m 11s in the dedicated CI Build job, 15.3 min in the Docker build job). The 120-second window expires before the server can listen on port 3000, and Playwright aborts with the timeout error above. This was masked until now because: 1. The E2E job only runs on push to main (and labeled PRs), so PR-level CI is green even when main-level CI is red. 2. Railway's "Wait for CI" toggle was off, so deploys went through despite the red CI status on the main branch. After enabling "Wait for CI" in preparation for Phase 0b migrations, this latent issue became a hard block: Railway now refuses to deploy any new main commit until CI is green. Fix: bump `webServer.timeout` to 1_500_000 ms (25 minutes) when running under CI. The original 120s remains in place locally so fast feedback isn't lost when running `pnpm test:e2e` against `pnpm dev`. Comment in the config explains the asymmetry. Followups out of scope for this PR: - Caching the `.next` build output between CI Build and E2E jobs so the build doesn't have to run twice. This is the proper long-term fix; tracked separately. - Investigating why Next.js compile is 15 minutes inside Docker builds vs 4 minutes in the dedicated Build job (see PR #98 docs/rls-tech-debt.md item #1). Verification: - tsc --noEmit -p tsconfig.json: exit 0 - Only one file changed (playwright.config.ts). - No app code, no test files, no migrations touched. Risk: low. Worst case, the new 25-minute cap is still too short and we widen it further. Best case, CI goes green and we unblock Phase 0b deploy + future PRs.

webdevcom01-cell merged commit c4cfd9a into main May 18, 2026
7 of 8 checks passed

webdevcom01-cell deleted the phase-0e/set-local-transaction-wrap branch May 18, 2026 19:50

webdevcom01-cell mentioned this pull request May 20, 2026

fix(ci): increase Playwright webServer timeout for CI #105

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e)#99

fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e)#99
webdevcom01-cell merged 1 commit into
mainfrom
phase-0e/set-local-transaction-wrap

webdevcom01-cell commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

webdevcom01-cell commented May 18, 2026

Description

Type of change

Checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant