[vMCP] Implement optional backend session keepalive #3870

@yrobla

Description

Depends on #3865

For stateful backends (e.g., Playwright, database connections), implement an optional keepalive mechanism to prevent backend session expiration while the corresponding vMCP session is active.

Implementation:

  • Prefer the MCP spec-defined ping protocol request (side-effect-free, supported by all compliant servers); fall back to an explicitly configured low-cost tool only if ping consistently fails
  • Per-backend configuration: keepalive_method: ping | tool:<name> | none; default: attempt ping
  • Configurable interval at server level (default ≥ 5 min); jitter keepalive calls across sessions to avoid spikes
  • The keepalive goroutine must acquire the session/backend lock before issuing calls to prevent races with reinitializeBackend
  • Circuit-breaker: after N consecutive failures, disable keepalive for that backend and log a warning; probe again after ~30 min to re-enable without requiring full session recreation
  • Disable by default for stateless backends and backends where TTL alignment already covers the session lifetime
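
A minimal Go sketch of how the jitter, locking, and circuit-breaker points above could fit together. All names here (`session`, `keepaliveOnce`, `jitteredInterval`, `defaultInterval`, the ±10% jitter fraction, and N = 3) are hypothetical and not taken from the vMCP codebase; the mutex stands in for whatever lock reinitializeBackend takes.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"math/rand"
	"sync"
	"time"
)

// Assumed default; the issue only requires a default of at least 5 minutes.
const defaultInterval = 5 * time.Minute

var errBackendDown = errors.New("backend unreachable")

// jitteredInterval shifts base by a random offset of up to ±frac of base,
// so sessions created together do not ping in lockstep.
func jitteredInterval(base time.Duration, frac float64) time.Duration {
	offset := (rand.Float64()*2 - 1) * frac * float64(base)
	return base + time.Duration(offset)
}

// session is a hypothetical stand-in for per-backend keepalive state.
type session struct {
	mu          sync.Mutex // assumed to be the lock reinitializeBackend also takes
	maxFailures int        // N consecutive failures before disabling
	probeAfter  time.Duration
	failures    int
	disabled    bool
	disabledAt  time.Time
}

// keepaliveOnce runs one keepalive under the lock and reports whether the
// backend was actually called. Errors are absorbed, never returned.
func (s *session) keepaliveOnce(now time.Time, call func() error) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.disabled && now.Sub(s.disabledAt) < s.probeAfter {
		return false // breaker open and probe not yet due
	}
	if err := call(); err != nil {
		s.failures++
		if s.disabled {
			s.disabledAt = now // failed probe: wait another probeAfter
		} else if s.failures >= s.maxFailures {
			s.disabled, s.disabledAt = true, now
			log.Printf("keepalive disabled after %d consecutive failures: %v", s.failures, err)
		}
		return true
	}
	s.failures, s.disabled = 0, false // success closes the breaker
	return true
}

// demo replays a failing backend that recovers before the 30-minute probe.
func demo() []bool {
	s := &session{maxFailures: 3, probeAfter: 30 * time.Minute}
	start := time.Now()
	down := func() error { return errBackendDown }
	up := func() error { return nil }
	ticks := []struct {
		minute int
		call   func() error
	}{{0, down}, {5, down}, {10, down}, {15, down}, {40, up}, {45, up}}
	var attempted []bool
	for _, t := range ticks {
		at := start.Add(time.Duration(t.minute) * time.Minute)
		attempted = append(attempted, s.keepaliveOnce(at, t.call))
	}
	return attempted
}

func main() {
	fmt.Println("next keepalive in", jitteredInterval(defaultInterval, 0.1))
	fmt.Println("attempted:", demo())
}
```

In the replay, the third failure trips the breaker, the tick at minute 15 is skipped, and the probe at minute 40 succeeds and re-enables keepalive without recreating the session. Holding the lock for the duration of the backend call is the simplest reading of the locking requirement; a real implementation may want a timeout on the call so a hung backend cannot block reinitializeBackend.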

Acceptance Criteria

  • Keepalive uses ping by default; falls back to configured tool only when ping fails
  • Keepalive is disabled when keepalive_method: none is set
  • The keepalive interval is configurable and defaults to ≥ 5 minutes
  • Keepalive calls across sessions are jittered to avoid synchronized spikes
  • The keepalive goroutine holds the appropriate lock before calling the backend
  • Keepalive failures do not surface as errors to the end user or fail the vMCP session
  • After N consecutive failures, keepalive is disabled for that backend with a logged warning
  • A probe after ~30 min re-enables keepalive if the backend recovers
  • Keepalive is disabled by default for stateless backends
  • Metrics are emitted: keepalive_attempt_count, keepalive_success_count, keepalive_failure_count (by reason), keepalive_latency_ms, keepalive_auto_disabled_total (by reason)
  • Unit tests cover: ping used by default, fallback to tool, circuit breaker, re-enable after probe, metrics emitted
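
The method-selection and metrics criteria above could be exercised with something like the following Go sketch. It assumes one possible reading of the config: `tool:<name>` names the fallback tool, and ping is still tried first until it has failed `pingLimit` times in a row. The types `method`, `keepalive`, `backend`, and `fakeBackend`, the tool name `browser_noop`, and the map used as a metrics stand-in are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// method is the parsed form of keepalive_method: ping | tool:<name> | none.
type method struct {
	kind string // "ping", "tool", or "none"
	tool string // set only when kind == "tool"
}

// parseMethod treats an empty value as the default (ping).
func parseMethod(s string) (method, error) {
	switch {
	case s == "" || s == "ping":
		return method{kind: "ping"}, nil
	case s == "none":
		return method{kind: "none"}, nil
	case strings.HasPrefix(s, "tool:") && len(s) > len("tool:"):
		return method{kind: "tool", tool: strings.TrimPrefix(s, "tool:")}, nil
	}
	return method{}, fmt.Errorf("invalid keepalive_method %q", s)
}

// backend is a hypothetical minimal view of a backend client.
type backend interface {
	Ping() error
	CallTool(name string) error
}

type keepalive struct {
	m            method
	pingLimit    int // consecutive ping failures before using the tool
	pingFailures int
	metrics      map[string]int // stand-in for real metric counters
}

// run performs one attempt and returns the call used ("" when disabled).
func (k *keepalive) run(b backend) string {
	if k.m.kind == "none" {
		return ""
	}
	k.metrics["keepalive_attempt_count"]++
	used := "ping"
	var err error
	if k.m.kind == "tool" && k.pingFailures >= k.pingLimit {
		used = "tool:" + k.m.tool
		err = b.CallTool(k.m.tool)
	} else if err = b.Ping(); err != nil {
		k.pingFailures++
	} else {
		k.pingFailures = 0
	}
	if err != nil {
		k.metrics["keepalive_failure_count"]++
	} else {
		k.metrics["keepalive_success_count"]++
	}
	return used
}

// fakeBackend fails ping with pingErr and always accepts tool calls.
type fakeBackend struct{ pingErr error }

func (f fakeBackend) Ping() error           { return f.pingErr }
func (f fakeBackend) CallTool(string) error { return nil }

// demo runs three attempts against a backend that rejects ping.
func demo() ([]string, map[string]int) {
	m, _ := parseMethod("tool:browser_noop")
	k := &keepalive{m: m, pingLimit: 2, metrics: map[string]int{}}
	b := fakeBackend{pingErr: errors.New("ping unsupported")}
	var used []string
	for i := 0; i < 3; i++ {
		used = append(used, k.run(b))
	}
	return used, k.metrics
}

func main() {
	used, metrics := demo()
	fmt.Println(used, metrics)
}
```

A table-driven unit test could assert on the returned call names and counter values in the same way. Per-reason labels, keepalive_latency_ms, and keepalive_auto_disabled_total are omitted here to keep the sketch short.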

RFC: THV-0038 — Session-scoped client lifecycle

Metadata

Assignees

No one assigned

    Labels

    api (Items related to the API), enhancement (New feature or request), p1 (Medium), telemetry, vmcp (Virtual MCP Server related issues)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
