
DB latency amplifies significantly across MCP tool calls #2995

@tlgimenes

Summary

A 5s per-TCP-segment latency on the Postgres connection results in 45-60s total time for a single MCP tool call (COLLECTION_CONNECTIONS_LIST). This is because each tool call involves multiple sequential DB roundtrips: authentication, permission checks, and the actual tool query.

Reproduction

Using the resilience test bed (tests/resilience/):

  1. Start the Docker stack
  2. Add 5s latency toxic to Postgres: curl -X POST http://127.0.0.1:18474/proxies/postgres/toxics -d '{"type":"latency","attributes":{"latency":5000},"name":"db-slow"}'
  3. Call any MCP tool and measure: a simple COLLECTION_CONNECTIONS_LIST takes ~45s
  4. With 15s latency, health check (SELECT 1) takes 15s but tool calls would take 2-3 minutes

Observed Behavior

DB Latency     Health Check (SELECT 1)    Tool Call (COLLECTION_CONNECTIONS_LIST)
0ms            ~5ms                       ~30ms
5s/segment     ~5s                        ~45-60s
15s/segment    ~15s                       2-3min (estimated)

Analysis

Each MCP tool call goes through this pipeline, each step hitting the DB:

  1. API key verification — look up key in apikeys table
  2. Session/user resolution — query user/session tables
  3. Organization resolution — query organization membership
  4. Permission check — query API key permissions
  5. Tool execution — the actual query (e.g., list connections)
  6. Audit logging — write audit log entry

With 5s of latency per TCP segment and 9+ sequential DB roundtrips, a single tool call totals 45s or more.
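The amplification above is simple arithmetic. A rough model (the helper name and the 5ms base query cost are illustrative assumptions, not measured from the codebase):

```typescript
// Rough model of sequential roundtrip amplification: each pipeline step
// costs at least one DB roundtrip, and each roundtrip pays the injected
// per-segment latency on top of the normal query time.

/** Estimated wall-clock time for `roundtrips` sequential DB roundtrips. */
function estimateToolCallMs(
  roundtrips: number,
  perSegmentLatencyMs: number,
  baseQueryMs = 5, // assumed healthy per-query cost
): number {
  return roundtrips * (perSegmentLatencyMs + baseQueryMs);
}

// 9 roundtrips at 5s/segment lands at ~45s, matching the observed range.
console.log(estimateToolCallMs(9, 5000)); // 45045
```

The same model predicts ~2.3 minutes at 15s/segment, consistent with the estimate in the table above.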

Impact

  • User experience: During DB slowdowns (e.g., vacuum, replication lag), users experience tool call timeouts even though the DB is technically "up"
  • Connection pool exhaustion: Slow queries hold connections longer, reducing pool capacity for other requests
  • Cascading failures: Health checks pass (single SELECT 1 is fast enough) while actual tool calls timeout — load balancers continue routing to degraded pods
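One way to stop routing work to a degraded DB is a latency-based circuit breaker, as suggested under mitigations below. A minimal sketch; the window size and threshold are illustrative assumptions:

```typescript
// Minimal latency-based circuit breaker: track a rolling window of query
// latencies and "open" (fail fast) when the average exceeds a threshold.

class LatencyBreaker {
  private samples: number[] = [];

  constructor(private thresholdMs: number, private window = 10) {}

  /** Record one observed query latency, keeping only the last `window` samples. */
  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.window) this.samples.shift();
  }

  /** True when the rolling average exceeds the threshold: reject instead of queuing. */
  isOpen(): boolean {
    if (this.samples.length === 0) return false;
    const avg = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
    return avg > this.thresholdMs;
  }
}

// Usage: check before each DB call and return an explicit error when open.
const breaker = new LatencyBreaker(1000); // fail fast above a 1s average
```

Exposing the breaker state to the health endpoint would also let load balancers see the degradation that a bare SELECT 1 hides.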

Potential Mitigations

  1. Connection pooling with statement timeout: Set statement_timeout at the pool level so individual queries fail fast
  2. Caching: Cache auth/permission lookups (they rarely change) to reduce DB roundtrips per tool call
  3. Circuit breaker on DB: If average query latency exceeds threshold, start failing fast instead of queuing
  4. Health check with representative query: Use a query that approximates real tool call cost, not just SELECT 1
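For mitigation 2, a small TTL cache in front of the auth/permission lookups would remove most of the per-call roundtrips. A sketch; the cache class and the 60s TTL are assumptions, and the real loader would be an async Postgres query rather than the synchronous callback shown here:

```typescript
// Minimal in-memory TTL cache for auth/permission lookups. Steps 1-4 of
// the pipeline (API key, session, org, permissions) rarely change, so a
// short TTL trades a bounded staleness window for far fewer roundtrips.

type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();

  constructor(private ttlMs: number) {}

  /** Return the cached value, or compute and cache it on a miss/expiry. */
  getOrLoad(key: string, load: (key: string) => V): V {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = load(key);
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Usage: cache resolved API-key rows for 60s so repeat tool calls skip the DB.
const apiKeyCache = new TtlCache<{ orgId: string }>(60_000);
```

With steps 1-4 cached, a warm tool call drops from 9+ roundtrips to roughly the tool query plus the audit write.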

Found By

Resilience test bed: tests/resilience/scenarios/postgres-slow.test.ts
