Summary
A 5s per-TCP-segment latency on the Postgres connection results in 45-60s total time for a single MCP tool call (`COLLECTION_CONNECTIONS_LIST`). This is because each tool call involves multiple sequential DB roundtrips: authentication, permission checks, and the actual tool query.
Reproduction
Using the resilience test bed (`tests/resilience/`):
- Start the Docker stack
- Add a 5s latency toxic to Postgres:

  ```shell
  curl -X POST http://127.0.0.1:18474/proxies/postgres/toxics \
    -d '{"type":"latency","attributes":{"latency":5000},"name":"db-slow"}'
  ```

- Call any MCP tool and measure: a simple `COLLECTION_CONNECTIONS_LIST` takes ~45s
- With 15s latency, the health check (`SELECT 1`) takes 15s but tool calls would take 2-3 minutes
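For scripting the reproduction, the toxic payload from the `curl` call above can be built in TypeScript. This is a sketch with hypothetical helper names; the endpoint and JSON shape are taken directly from the reproduction step:

```typescript
// Builds the Toxiproxy latency-toxic payload shown in the curl command above.
// `makeLatencyToxic` is a hypothetical helper, not part of the test bed.

interface LatencyToxic {
  name: string;
  type: "latency";
  attributes: { latency: number }; // added latency in milliseconds
}

function makeLatencyToxic(name: string, latencyMs: number): LatencyToxic {
  return { name, type: "latency", attributes: { latency: latencyMs } };
}

// Usage (assumes the resilience stack is running; Node 18+ for global fetch):
// await fetch("http://127.0.0.1:18474/proxies/postgres/toxics", {
//   method: "POST",
//   body: JSON.stringify(makeLatencyToxic("db-slow", 5000)),
// });
```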
Observed Behavior
| DB Latency | Health Check (`SELECT 1`) | Tool Call (`COLLECTION_CONNECTIONS_LIST`) |
| --- | --- | --- |
| 0ms | ~5ms | ~30ms |
| 5s/segment | ~5s | ~45-60s |
| 15s/segment | ~15s | 2-3min (estimated) |
Analysis
Each MCP tool call goes through this pipeline, each step hitting the DB:
- API key verification — look up the key in the `apikeys` table
- Session/user resolution — query user/session tables
- Organization resolution — query organization membership
- Permission check — query API key permissions
- Tool execution — the actual query (e.g., list connections)
- Audit logging — write the audit log entry

With 5s latency per segment, 9+ DB roundtrips = 45s+ total.
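The arithmetic above can be captured as a back-of-the-envelope model: each sequential roundtrip pays the injected per-segment latency roughly once (small queries fit in one segment each way), so total tool-call time scales linearly with the roundtrip count. Note that the figure of 9+ roundtrips is this report's estimate, not a per-step measurement:

```typescript
// Simple latency model: total tool-call time ≈ roundtrips * per-segment latency.
// Assumes each small query/response fits in one TCP segment each way, so each
// roundtrip pays the injected latency roughly once.

function estimatedToolCallSeconds(roundtrips: number, perSegmentLatencySec: number): number {
  return roundtrips * perSegmentLatencySec;
}
```

With 9 roundtrips this predicts 45s at 5s/segment and 135s (~2.25 min) at 15s/segment, matching the observed and estimated figures in the table.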
Impact
- User experience: During DB slowdowns (e.g., vacuum, replication lag), users experience tool call timeouts even though the DB is technically "up"
- Connection pool exhaustion: Slow queries hold connections longer, reducing pool capacity for other requests
- Cascading failures: Health checks pass (a single `SELECT 1` is fast enough) while actual tool calls time out — load balancers continue routing to degraded pods
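The pool-exhaustion point follows from simple arithmetic: a fixed-size pool's maximum throughput is its size divided by per-query hold time, so slow queries collapse capacity. The numbers below are illustrative, not measurements from the test bed:

```typescript
// Max sustainable query throughput for a fixed-size connection pool:
// each connection can serve one query at a time, held for queryTimeSec.
function maxQueriesPerSecond(poolSize: number, queryTimeSec: number): number {
  return poolSize / queryTimeSec;
}

// With 10 connections: 30ms queries allow ~333 qps, but 5s queries allow only 2 qps.
```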
Potential Mitigations
- Connection pooling with statement timeout: Set `statement_timeout` at the pool level so individual queries fail fast
- Caching: Cache auth/permission lookups (they rarely change) to reduce DB roundtrips per tool call
- Circuit breaker on DB: If average query latency exceeds a threshold, start failing fast instead of queuing
- Health check with representative query: Use a query that approximates real tool call cost, not just `SELECT 1`
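Of these, the caching mitigation directly attacks the roundtrip count from the Analysis section. A minimal sketch, assuming a short TTL is acceptable for auth/permission data; the class name and injectable clock are hypothetical, not from the codebase:

```typescript
// Minimal TTL cache for memoizing auth/permission lookups, so repeated tool
// calls skip those DB roundtrips. The clock is injectable for testability.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (e.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict and fall through to the DB
      return undefined;
    }
    return e.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Usage: check the cache before the apikeys/permission query; on a miss,
// query the DB and cache the result for, say, 30s.
```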
Found By
Resilience test bed: `tests/resilience/scenarios/postgres-slow.test.ts`