Skip to content

fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e)#99

Merged
webdevcom01-cell merged 1 commit into
mainfrom
phase-0e/set-local-transaction-wrap
May 18, 2026
Merged

fix(db): wrap SET LOCAL hnsw.ef_search in $transaction (Phase 0e)#99
webdevcom01-cell merged 1 commit into
mainfrom
phase-0e/set-local-transaction-wrap

Conversation

@webdevcom01-cell
Copy link
Copy Markdown
Owner

Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session variable and the subsequent query were issued as separate Prisma calls. With pool connections, the SET LOCAL can land on one connection and the SELECT on another — the ef_search tuning silently reverts to the server default before the vector search runs.

Phase 0a fixed this for the org-context session variable in withOrgContext. Phase 0e applies the same $transaction wrap to all three SET LOCAL hnsw.ef_search call sites surfaced during STEP 1 audit:

  1. src/lib/knowledge/search.ts:201 (searchKnowledgeBase)

    • Hot path: 6+ production callers (KB chat/eval API routes, kb-search-handler runtime, MCP tool exposure, rag-inject, codebase-rag SDLC).
    • Dynamic ef_search (40 | 60 | 100) based on query word count.
    • Vector search on KBChunk.
  2. src/lib/memory/hot-cold-tier.ts:113 (getColdMemories)

    • No production callers in src/ today, but exported and reachable dynamically. Patched for consistency with the other two sites so the pattern is uniform across the codebase.
    • ef_search = 40 (hardcoded).
    • Vector search on AgentMemory.
  3. src/lib/runtime/handlers/memory-read-handler.ts:168 (memoryReadHandler vector search path)

    • Runtime handler for the memory_read node type. Triggered for every agent run that includes a memory_read node.
    • ef_search = 40 (hardcoded).
    • Vector search on AgentMemory.

PLAN-V2 §4.5 named only search.ts; the audit instruction in that section ('Apply same pattern to other SET LOCAL callers if discovered during STEP 1 audit') is honored here with all three sites in one PR.

FORENSIC HAL-9 covers the same finding.

Tests:

  • 63 existing tests still pass (search.test.ts utility tests: 30; hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17).
  • 6 new regression-guard tests added in two new files: src/lib/knowledge/tests/search-vector-shape.test.ts (3) src/lib/runtime/handlers/tests/memory-read-handler-vector-shape.test.ts (3)
  • hot-cold-tier.test.ts mock extended with $transaction delegating to a tx mock; existing assertions on $executeRawUnsafe still hold via tx-aliased fn references.
  • Each shape test asserts: (a) $transaction is invoked, (b) SET LOCAL runs on the tx, not the outer prisma, (c) callback receives tx, (d) outer prisma's direct exec/query paths are NOT used.
  • Mocked tests verify SHAPE of the fix. They cannot verify pool-survival semantics; that needs a real Postgres integration harness (TODO comments left for Phase 1+).

Risk:

  • search.ts is on the KB hot path. $transaction holds a pool connection for the duration of the callback (one SET LOCAL + one SELECT, both microsecond-millisecond). Should not materially affect pool exhaustion under load; verify in staging if concerned.
  • hot-cold-tier.ts getColdMemories has no production callers, so functional risk is zero.
  • memory-read-handler.ts vector search is per-agent-run when memory_read node fires. Similar transaction profile to search.ts.

Out of scope (intentional):

  • Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts and memory-read-handler.ts. Both keep the existing Unsafe variant; SET LOCAL takes no parameters, no SQL injection vector. Refactoring to safer APIs is a separate workstream.

Verification:

  • vitest run (search.test.ts + search-vector-shape.test.ts + hot-cold-tier.test.ts + memory-read-handler.test.ts + memory-read-handler-vector-shape.test.ts): 69 passed
  • tsc --noEmit -p tsconfig.json: exit 0

Refs:

Description

Briefly describe the changes in this PR and the motivation behind them.

Type of change

  • New feature
  • Bug fix
  • Documentation update
  • Refactor (no behavior change)
  • Test update
  • Chore (build, CI, dependencies)

Checklist

  • I have read the Contributing Guide
  • My code follows the project code standards (TypeScript strict, no any, Tailwind only)
  • I have added tests that cover my changes
  • All tests pass (pnpm test)
  • Linting passes (pnpm lint)
  • Type checking passes (pnpm typecheck)
  • I have updated documentation if needed
  • No console.log statements left in code
  • No hardcoded secrets or internal URLs

Same root cause as Phase 0a (PR #97): a SET LOCAL Postgres session
variable and the subsequent query were issued as separate Prisma
calls. With pool connections, the SET LOCAL can land on one
connection and the SELECT on another — the ef_search tuning
silently reverts to the server default before the vector search
runs.

Phase 0a fixed this for the org-context session variable in
withOrgContext. Phase 0e applies the same $transaction wrap to
all three SET LOCAL hnsw.ef_search call sites surfaced during
STEP 1 audit:

1. src/lib/knowledge/search.ts:201 (searchKnowledgeBase)
   - Hot path: 6+ production callers (KB chat/eval API routes,
     kb-search-handler runtime, MCP tool exposure, rag-inject,
     codebase-rag SDLC).
   - Dynamic ef_search (40 | 60 | 100) based on query word count.
   - Vector search on KBChunk.

2. src/lib/memory/hot-cold-tier.ts:113 (getColdMemories)
   - No production callers in src/ today, but exported and reachable
     dynamically. Patched for consistency with the other two sites
     so the pattern is uniform across the codebase.
   - ef_search = 40 (hardcoded).
   - Vector search on AgentMemory.

3. src/lib/runtime/handlers/memory-read-handler.ts:168
   (memoryReadHandler vector search path)
   - Runtime handler for the memory_read node type. Triggered for
     every agent run that includes a memory_read node.
   - ef_search = 40 (hardcoded).
   - Vector search on AgentMemory.

PLAN-V2 §4.5 named only search.ts; the audit instruction in that
section ('Apply same pattern to other SET LOCAL callers if
discovered during STEP 1 audit') is honored here with all three
sites in one PR.

FORENSIC HAL-9 covers the same finding.

Tests:
  - 63 existing tests still pass (search.test.ts utility tests: 30;
    hot-cold-tier.test.ts: 16; memory-read-handler.test.ts: 17).
  - 6 new regression-guard tests added in two new files:
    src/lib/knowledge/__tests__/search-vector-shape.test.ts (3)
    src/lib/runtime/handlers/__tests__/memory-read-handler-vector-shape.test.ts (3)
  - hot-cold-tier.test.ts mock extended with $transaction
    delegating to a tx mock; existing assertions on
    $executeRawUnsafe still hold via tx-aliased fn references.
  - Each shape test asserts: (a) $transaction is invoked, (b)
    SET LOCAL runs on the tx, not the outer prisma, (c) callback
    receives tx, (d) outer prisma's direct exec/query paths are
    NOT used.
  - Mocked tests verify SHAPE of the fix. They cannot verify
    pool-survival semantics; that needs a real Postgres
    integration harness (TODO comments left for Phase 1+).

Risk:
  - search.ts is on the KB hot path. $transaction holds a pool
    connection for the duration of the callback (one SET LOCAL +
    one SELECT, both microsecond-millisecond). Should not
    materially affect pool exhaustion under load; verify in
    staging if concerned.
  - hot-cold-tier.ts getColdMemories has no production callers,
    so functional risk is zero.
  - memory-read-handler.ts vector search is per-agent-run when
    memory_read node fires. Similar transaction profile to
    search.ts.

Out of scope (intentional):
  - Migrating $executeRawUnsafe -> $executeRaw on hot-cold-tier.ts
    and memory-read-handler.ts. Both keep the existing Unsafe
    variant; SET LOCAL takes no parameters, no SQL injection
    vector. Refactoring to safer APIs is a separate workstream.

Verification:
  - vitest run (search.test.ts + search-vector-shape.test.ts +
    hot-cold-tier.test.ts + memory-read-handler.test.ts +
    memory-read-handler-vector-shape.test.ts): 69 passed
  - tsc --noEmit -p tsconfig.json: exit 0

Refs:
  - skill-rls-rollout-PLAN-V2.md §4.5 (Phase 0e bundled bug fix)
  - skill-rls-rollout-FORENSIC-REPORT.md HAL-9
  - PR #97 (Phase 0a — same fix pattern for withOrgContext)
  - PR #98 (docs/rls-tech-debt.md — tracking)
@webdevcom01-cell webdevcom01-cell merged commit c4cfd9a into main May 18, 2026
7 of 8 checks passed
@webdevcom01-cell webdevcom01-cell deleted the phase-0e/set-local-transaction-wrap branch May 18, 2026 19:50
webdevcom01-cell added a commit that referenced this pull request May 20, 2026
The E2E job in CI has been failing on every push to main since at
least PR #99 (Phase 0e merge) with the same error:

    Error: Timed out waiting 120000ms from config.webServer.

Root cause: Playwright's `webServer.timeout` was set to 120_000 ms
(2 minutes), but the CI command is `pnpm build && pnpm start`,
which has to run a full Next.js production build on a fresh CI
runner. The build alone takes 5-15 minutes (4m 11s in the dedicated
CI Build job, 15.3 min in the Docker build job). The 120-second
window expires before the server can listen on port 3000, and
Playwright aborts with the timeout error above.

This was masked until now because:
  1. The E2E job only runs on push to main (and labeled PRs),
     so PR-level CI is green even when main-level CI is red.
  2. Railway's "Wait for CI" toggle was off, so deploys went
     through despite the red CI status on the main branch.

After enabling "Wait for CI" in preparation for Phase 0b
migrations, this latent issue became a hard block: Railway now
refuses to deploy any new main commit until CI is green.

Fix: bump `webServer.timeout` to 1_500_000 ms (25 minutes) when
running under CI. The original 120s remains in place locally so
fast feedback isn't lost when running `pnpm test:e2e` against
`pnpm dev`. Comment in the config explains the asymmetry.

Followups out of scope for this PR:
  - Caching the `.next` build output between CI Build and E2E jobs
    so the build doesn't have to run twice. This is the proper
    long-term fix; tracked separately.
  - Investigating why Next.js compile is 15 minutes inside Docker
    builds vs 4 minutes in the dedicated Build job (see PR #98
    docs/rls-tech-debt.md item #1).

Verification:
  - tsc --noEmit -p tsconfig.json: exit 0
  - Only one file changed (playwright.config.ts).
  - No app code, no test files, no migrations touched.

Risk: low. Worst case, the new 25-minute cap is still too short
and we widen it further. Best case, CI goes green and we unblock
Phase 0b deploy + future PRs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant