
DB latency amplifies significantly across MCP tool calls #2995

@tlgimenes

Summary

A 5s per-TCP-segment latency on the Postgres connection results in 45-60s total time for a single MCP tool call (COLLECTION_CONNECTIONS_LIST). This is because each tool call involves multiple sequential DB roundtrips: authentication, permission checks, and the actual tool query.

Reproduction

Using the resilience test bed (tests/resilience/):

  1. Start the Docker stack
  2. Add 5s latency toxic to Postgres: curl -X POST http://127.0.0.1:18474/proxies/postgres/toxics -d '{"type":"latency","attributes":{"latency":5000},"name":"db-slow"}'
  3. Call any MCP tool and measure: a simple COLLECTION_CONNECTIONS_LIST takes ~45s
  4. With 15s latency, health check (SELECT 1) takes 15s but tool calls would take 2-3 minutes

Observed Behavior

DB Latency     Health Check (SELECT 1)    Tool Call (COLLECTION_CONNECTIONS_LIST)
0ms            ~5ms                       ~30ms
5s/segment     ~5s                        ~45-60s
15s/segment    ~15s                       2-3min (estimated)

Analysis

Each MCP tool call goes through this pipeline, each step hitting the DB:

  1. API key verification — look up key in apikeys table
  2. Session/user resolution — query user/session tables
  3. Organization resolution — query organization membership
  4. Permission check — query API key permissions
  5. Tool execution — the actual query (e.g., list connections)
  6. Audit logging — write audit log entry

With 5s of latency per TCP segment and 9+ sequential DB roundtrips, a single tool call totals 45s or more.
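The amplification above is simple arithmetic. A rough model (the helper name and the 5ms base query cost are illustrative assumptions, not measured from the codebase):

```typescript
// Rough model of sequential roundtrip amplification: each pipeline step
// costs at least one DB roundtrip, and each roundtrip pays the injected
// per-segment latency on top of the normal query time.

/** Estimated wall-clock time for `roundtrips` sequential DB roundtrips. */
function estimateToolCallMs(
  roundtrips: number,
  perSegmentLatencyMs: number,
  baseQueryMs = 5, // assumed healthy per-query cost
): number {
  return roundtrips * (perSegmentLatencyMs + baseQueryMs);
}

// 9 roundtrips at 5s/segment lands at ~45s, matching the observed range.
console.log(estimateToolCallMs(9, 5000)); // 45045
```

The same model predicts ~2.3 minutes at 15s/segment, consistent with the estimate in the table above.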

Impact

  • User experience: During DB slowdowns (e.g., vacuum, replication lag), users experience tool call timeouts even though the DB is technically "up"
  • Connection pool exhaustion: Slow queries hold connections longer, reducing pool capacity for other requests
  • Cascading failures: Health checks pass (single SELECT 1 is fast enough) while actual tool calls timeout — load balancers continue routing to degraded pods
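One way to stop routing work to a degraded DB is a latency-based circuit breaker, as suggested under mitigations below. A minimal sketch; the window size and threshold are illustrative assumptions:

```typescript
// Minimal latency-based circuit breaker: track a rolling window of query
// latencies and "open" (fail fast) when the average exceeds a threshold.

class LatencyBreaker {
  private samples: number[] = [];

  constructor(private thresholdMs: number, private window = 10) {}

  /** Record one observed query latency, keeping only the last `window` samples. */
  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    if (this.samples.length > this.window) this.samples.shift();
  }

  /** True when the rolling average exceeds the threshold: reject instead of queuing. */
  isOpen(): boolean {
    if (this.samples.length === 0) return false;
    const avg = this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
    return avg > this.thresholdMs;
  }
}

// Usage: check before each DB call and return an explicit error when open.
const breaker = new LatencyBreaker(1000); // fail fast above a 1s average
```

Exposing the breaker state to the health endpoint would also let load balancers see the degradation that a bare SELECT 1 hides.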

Potential Mitigations

  1. Connection pooling with statement timeout: Set statement_timeout at the pool level so individual queries fail fast
  2. Caching: Cache auth/permission lookups (they rarely change) to reduce DB roundtrips per tool call
  3. Circuit breaker on DB: If average query latency exceeds threshold, start failing fast instead of queuing
  4. Health check with representative query: Use a query that approximates real tool call cost, not just SELECT 1
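For mitigation 2, a small TTL cache in front of the auth/permission lookups would remove most of the per-call roundtrips. A sketch; the cache class and the 60s TTL are assumptions, and the real loader would be an async Postgres query rather than the synchronous callback shown here:

```typescript
// Minimal in-memory TTL cache for auth/permission lookups. Steps 1-4 of
// the pipeline (API key, session, org, permissions) rarely change, so a
// short TTL trades a bounded staleness window for far fewer roundtrips.

type Entry<V> = { value: V; expiresAt: number };

class TtlCache<V> {
  private store = new Map<string, Entry<V>>();

  constructor(private ttlMs: number) {}

  /** Return the cached value, or compute and cache it on a miss/expiry. */
  getOrLoad(key: string, load: (key: string) => V): V {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value;
    const value = load(key);
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Usage: cache resolved API-key rows for 60s so repeat tool calls skip the DB.
const apiKeyCache = new TtlCache<{ orgId: string }>(60_000);
```

With steps 1-4 cached, a warm tool call drops from 9+ roundtrips to roughly the tool query plus the audit write.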

Found By

Resilience test bed: tests/resilience/scenarios/postgres-slow.test.ts
