[Bug]: PilotStatusAgent leaks SSH connections and exhausts memory (~45 GB), overloading SSH gateways

## Search before creating an issue
- [x] I have searched existing issues and confirmed this is not a duplicate

## Bug Description
On a setup that interacts with clusters through the `SSHComputingElement` (the new,
fabric-based one), the `PilotStatusAgent` grows to tens of GB of RAM (~45 GB observed)
and floods the SSH gateways with a burst of connections, overloading both the host running
the agent and the gateways it connects to.

Root cause: when declaring stalled pilots `Deleted`, `PilotStatusAgent._killPilots()` calls
`DiracAdmin.killPilot()` **once per pilot**. Each call goes through
`WMSUtilities.killPilotsInQueues()`, which builds a **fresh** `ComputingElement` via
`ComputingElementFactory.getCE()` and calls `ce.killJob()`. For an `SSHComputingElement`,
this opens a new SSH connection (and, with `SSHTunnel`, a second connection to the gateway),
and neither the CE nor its connection is **ever closed**.
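
The pattern can be modelled with a small stand-in. This is a simplified sketch, not DIRAC code: `FakeSSHCE`, `kill_pilots_one_by_one`, and the pilot reference strings are hypothetical, and the class only counts connections rather than opening real ones.

```python
class FakeSSHCE:
    """Hypothetical stand-in for an SSH-backed CE; counts connections instead of opening them."""

    open_connections = 0  # class-wide count of connections not yet closed

    def __init__(self, queue, use_gateway=False):
        self.queue = queue
        # One connection to the host, plus one to the gateway when tunnelling.
        self._held = 2 if use_gateway else 1
        FakeSSHCE.open_connections += self._held

    def killJob(self, pilot_refs):
        # Pretend the kill succeeded; a real CE would run a remote command here.
        return {"OK": True, "Value": len(pilot_refs)}

    def close(self):
        FakeSSHCE.open_connections -= self._held
        self._held = 0


def kill_pilots_one_by_one(pilots, use_gateway=False):
    """Mimics the reported behaviour: a fresh CE per pilot, never closed."""
    for queue, pilot_ref in pilots:
        ce = FakeSSHCE(queue, use_gateway=use_gateway)
        ce.killJob([pilot_ref])
        # No ce.close() here, so the connection(s) stay open.


stalled = [("queueA", f"ssh://pilot-{i}") for i in range(500)]
kill_pilots_one_by_one(stalled, use_gateway=True)
print(FakeSSHCE.open_connections)  # 1000
```

With 500 stalled pilots behind a gateway, the model ends the cycle holding 1000 open connections, all to the same queue.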

Consequences when a backlog of stalled pilots accumulates (e.g. after a site/queue stops
reporting pilot status):
- one (or two, with a gateway) new SSH connection **per stalled pilot** → connection burst
  on the gateways;
- the fabric/paramiko `Connection` objects and their background `Transport` threads are
  never released → unbounded memory growth. Fabric's documentation explicitly warns that
  relying on garbage collection to close connections "is not currently safe".

## Steps to Reproduce

1. Configure a queue served by an `SSHComputingElement` using an `SSHTunnel` (gateway).
2. Let a backlog of pilots in transient states, older than `PilotStalledDays`, accumulate
   (e.g. a queue/CE that stops updating pilot status for a few days).
3. Run `PilotStatusAgent`.
4. Observe, during `handleOldPilots`:
   - a burst of SSH connections from the agent host to the gateway, one per stalled pilot;
   - the `dirac-agent .../PilotStatusAgent` process climbing into the tens of GB of memory.

## Expected Behavior
`PilotStatusAgent` declares stalled pilots `Deleted` and kills them on their CEs without
unbounded memory growth and without opening (and leaking) a separate SSH connection per
pilot. Connections to a given queue/gateway should be reused, and released when
no longer valid.

## Actual Behavior
- The agent process grows to ~45 GB and overloads the host.
- A new SSH connection (two with a gateway) is opened for every stalled pilot.
- Connections/threads are never closed, so the footprint persists/accumulates within and
  across cycles.

## Additional Context
**Proposed fix (implemented on a branch):**

Reuse CEs/connections across cycles instead of creating one per pilot:
   - extract the CE-caching logic the `SiteDirector` already uses (hash-based invalidation)
     into a shared `QueueCECache` (in `QueueUtilities`), with `getCE()` (cached-or-rebuilt)
     and `drop()` (evict + `close()`);
   - `PilotStatusAgent._killPilots()` groups pilots **by queue** and issues one `killJob()`
     per queue on a cached CE, refreshing pilot credentials each cycle;
   - migrate `getQueuesResolved()` (hence `SiteDirector` and `PushJobAgent`) onto
     `QueueCECache`, which additionally fixes a latent leak there (CEs were dropped from the
     cache on config change / invalid queue without being closed).
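
As a rough illustration of the idea (not the code on the branch), a cache with hash-based invalidation plus per-queue grouping might look like the sketch below. Only the names `QueueCECache`, `getCE()`, `drop()`, `close()`, and `killJob()` come from the description above. The constructor argument, the `params` dictionaries, `RecordingCE`, and `kill_pilots_by_queue` are assumptions made for the example, and credential refresh is left out.

```python
import hashlib
import json
from collections import defaultdict


class QueueCECache:
    """Sketch of a CE cache keyed by queue name with hash-based invalidation."""

    def __init__(self, ce_factory):
        self._ce_factory = ce_factory  # assumed: callable(queue_name, params) -> CE
        self._entries = {}  # queue_name -> (params_hash, ce)

    @staticmethod
    def _hash(params):
        return hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()

    def getCE(self, queue_name, params):
        """Return the cached CE, rebuilding it if the queue parameters changed."""
        params_hash = self._hash(params)
        entry = self._entries.get(queue_name)
        if entry and entry[0] == params_hash:
            return entry[1]
        self.drop(queue_name)  # close the stale CE before replacing it
        ce = self._ce_factory(queue_name, params)
        self._entries[queue_name] = (params_hash, ce)
        return ce

    def drop(self, queue_name):
        """Evict the CE for this queue and close it, if one is cached."""
        entry = self._entries.pop(queue_name, None)
        if entry:
            entry[1].close()


class RecordingCE:
    """Hypothetical CE that records calls instead of contacting a cluster."""

    created = 0

    def __init__(self, queue_name, params):
        RecordingCE.created += 1
        self.closed = False
        self.kill_calls = []

    def killJob(self, pilot_refs):
        self.kill_calls.append(list(pilot_refs))

    def close(self):
        self.closed = True


def kill_pilots_by_queue(cache, pilots, queue_params):
    """Group (queue, pilot_ref) pairs by queue and issue one killJob() per queue."""
    by_queue = defaultdict(list)
    for queue_name, pilot_ref in pilots:
        by_queue[queue_name].append(pilot_ref)
    for queue_name, refs in by_queue.items():
        cache.getCE(queue_name, queue_params[queue_name]).killJob(refs)


cache = QueueCECache(RecordingCE)
params = {"queueA": {"host": "a.example.org"}, "queueB": {"host": "b.example.org"}}
pilots = [("queueA", "p1"), ("queueB", "p2"), ("queueA", "p3")]
kill_pilots_by_queue(cache, pilots, params)
kill_pilots_by_queue(cache, pilots, params)  # second cycle reuses the cached CEs
print(RecordingCE.created)  # 2
```

In this sketch, the number of CEs (and hence connections) scales with the number of queues rather than the number of stalled pilots, and a CE replaced after a configuration change is closed by `drop()` before the new one is built.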

