[Bug]: PilotStatusAgent leaks SSH connections and exhausts memory (~45 GB), overloading SSH gateways #8568

@aldbr


Search before creating an issue

  • I have searched existing issues and confirmed this is not a duplicate

Bug Description

On a setup that interacts with clusters through the SSHComputingElement (the new one based
on fabric), the PilotStatusAgent grows to tens of GB of RAM (~45 GB observed) and floods
the SSH gateways with a burst of connections, overloading both the host running the agent
and the gateways it connects to.

Root cause: when declaring stalled pilots Deleted, PilotStatusAgent._killPilots() calls
DiracAdmin.killPilot() once per pilot. Each call goes through
WMSUtilities.killPilotsInQueues(), which builds a fresh ComputingElement via
ComputingElementFactory.getCE() and calls ce.killJob(). For an SSHComputingElement, this
opens a new SSH connection (and, with SSHTunnel, a second connection to the gateway),
and neither the CE nor its connection is ever closed.
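
The call pattern above can be modelled with a toy example. This is only a sketch: `FakeSSHCE` and `kill_pilots_one_by_one` are hypothetical stand-ins, not DIRAC code, and the counter merely mimics connections that are opened and never closed.

```python
class FakeSSHCE:
    """Hypothetical stand-in for an SSH-based CE (not the DIRAC class)."""

    open_connections = 0  # class-wide count of connections that were never closed

    def __init__(self, use_gateway=False):
        self.use_gateway = use_gateway
        self._connected = False

    def killJob(self, pilot_refs):
        if not self._connected:
            # One connection to the host, plus one to the gateway when tunnelled.
            FakeSSHCE.open_connections += 2 if self.use_gateway else 1
            self._connected = True
        return len(pilot_refs)


def kill_pilots_one_by_one(pilot_refs, use_gateway=True):
    """Mimic the reported pattern: a fresh CE per pilot, none of them closed."""
    for ref in pilot_refs:
        ce = FakeSSHCE(use_gateway=use_gateway)
        ce.killJob([ref])
    return FakeSSHCE.open_connections


stalled = [f"pilot-{i}" for i in range(1000)]
leaked = kill_pilots_one_by_one(stalled)
print(leaked)  # 2000: two connections per stalled pilot when a gateway is used
```

In this model, the number of open connections grows linearly with the backlog of stalled pilots, which matches the burst seen on the gateways.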

Consequences when a backlog of stalled pilots accumulates (e.g. after a site/queue stops
reporting pilot status):

  • one (or two, with a gateway) new SSH connection per stalled pilot → connection burst
    on the gateways;
  • the fabric/paramiko Connection objects and their background Transport threads are
    never released → unbounded memory growth. Fabric's documentation explicitly warns that
    relying on garbage collection to close connections "is not currently safe".
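
The thread side of the leak can be illustrated without any SSH library. The sketch below uses a hypothetical `FakeConnection` that owns a background thread, loosely imitating a connection with a transport thread; it is an assumption-laden model, not fabric or paramiko code. Fabric's own `Connection` can likewise be closed explicitly with `close()` or used as a context manager.

```python
import threading


class FakeConnection:
    """Hypothetical connection owning a background thread (not fabric/paramiko)."""

    def __init__(self):
        self._stop = threading.Event()
        # The thread blocks until close() is called, like an idle transport loop.
        self._thread = threading.Thread(target=self._stop.wait, daemon=True)
        self._thread.start()

    def close(self):
        self._stop.set()
        self._thread.join()

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


baseline = threading.active_count()

# Leaky pattern: connections are created and never closed.
leaked = [FakeConnection() for _ in range(20)]
threads_after_leak = threading.active_count() - baseline

# Safe pattern: each connection is closed as soon as it is no longer needed.
for _ in range(20):
    with FakeConnection():
        pass
threads_after_closed = threading.active_count() - baseline

# Explicit cleanup releases the leaked threads.
for conn in leaked:
    conn.close()
threads_after_cleanup = threading.active_count() - baseline

print(threads_after_leak, threads_after_closed, threads_after_cleanup)
```

The closed connections add no lasting threads, while the unclosed ones keep one thread each alive until `close()` is finally called.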

Steps to Reproduce

  1. Configure a queue served by an SSHComputingElement using an SSHTunnel (gateway).
  2. Let a backlog of pilots in transient states accumulate older than PilotStalledDays
    (e.g. a queue/CE that stops updating pilot status for a few days).
  3. Run PilotStatusAgent.
  4. Observe, during handleOldPilots:
    • a burst of SSH connections from the agent host to the gateway, one per stalled pilot;
    • the dirac-agent .../PilotStatusAgent process climbing into the tens of GB.

Expected Behavior

PilotStatusAgent declares stalled pilots Deleted and kills them on their CEs without
unbounded memory growth and without opening (and leaking) a separate SSH connection per
pilot. Connections to a given queue/gateway should be reused, and released when
no longer valid.

Actual Behavior

  • The agent process grows to ~45 GB and overloads the host.
  • A large number of SSH connections are opened (one or two per stalled pilot).
  • Connections/threads are never closed, so the footprint persists/accumulates within and
    across cycles.

Additional Context

Proposed fix (implemented on a branch):

Reuse CEs/connections across cycles instead of creating one per pilot:

  • extract the CE-caching logic the SiteDirector already uses (hash-based invalidation)
    into a shared QueueCECache (in QueueUtilities), with getCE() (cached-or-rebuilt)
    and drop() (evict + close());
  • PilotStatusAgent._killPilots() groups pilots by queue and issues one killJob()
    per queue on a cached CE, refreshing pilot credentials each cycle;
  • migrate getQueuesResolved() (hence SiteDirector and PushJobAgent) onto
    QueueCECache, which additionally fixes a latent leak there (CEs were dropped from the
    cache on config change / invalid queue without being closed).
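
A minimal sketch of the caching idea follows. It is not the code on the branch: `QueueCECacheSketch`, `FakeCE`, and `kill_pilots_grouped` are hypothetical names, the CE is a stand-in that only records calls, and hashing the queue parameters with SHA-256 over sorted JSON is an assumed way of doing hash-based invalidation. Credential refresh is left out.

```python
import hashlib
import json
from collections import defaultdict


class FakeCE:
    """Hypothetical CE stand-in that records kill calls and whether it was closed."""

    def __init__(self, params):
        self.params = params
        self.closed = False
        self.kill_calls = []

    def killJob(self, pilot_refs):
        self.kill_calls.append(list(pilot_refs))

    def close(self):
        self.closed = True


class QueueCECacheSketch:
    """Cache one CE per queue; rebuild it when the queue parameters change."""

    def __init__(self, factory=FakeCE):
        self._factory = factory
        self._entries = {}  # queue name -> (parameter hash, CE)

    @staticmethod
    def _hash(params):
        payload = json.dumps(params, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def getCE(self, queue, params):
        digest = self._hash(params)
        entry = self._entries.get(queue)
        if entry and entry[0] == digest:
            return entry[1]  # cached CE is still valid
        self.drop(queue)  # close the stale CE before replacing it
        ce = self._factory(params)
        self._entries[queue] = (digest, ce)
        return ce

    def drop(self, queue):
        entry = self._entries.pop(queue, None)
        if entry:
            entry[1].close()  # evict and close, so nothing is leaked


def kill_pilots_grouped(cache, pilots, queue_params):
    """Group (pilot_ref, queue) pairs by queue and issue one killJob() per queue."""
    by_queue = defaultdict(list)
    for ref, queue in pilots:
        by_queue[queue].append(ref)
    for queue, refs in by_queue.items():
        cache.getCE(queue, queue_params[queue]).killJob(refs)
    return {queue: len(refs) for queue, refs in by_queue.items()}


cache = QueueCECacheSketch()
params = {"q1": {"host": "a"}, "q2": {"host": "b"}}
pilots = [("p1", "q1"), ("p2", "q1"), ("p3", "q2")]
counts = kill_pilots_grouped(cache, pilots, params)

ce_q1 = cache.getCE("q1", params["q1"])          # same cached CE as above
rebuilt = cache.getCE("q1", {"host": "changed"})  # config change closes the old CE
```

With this shape, the number of CEs (and therefore connections) is bounded by the number of queues rather than the number of stalled pilots, and a CE is always closed when it leaves the cache.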
