
fix(rpc): prevent connection leak in metrics middleware #1251

Closed

pavelm wants to merge 1 commit into main from fix/rpc-connection-leak

Conversation

pavelm (Contributor) commented Jan 16, 2026

Fix RPC Connection Leak in Metrics Middleware

Problem Summary

The RPC server experiences a connection leak: the active connection count grows without bound until it reaches the max_connections limit (default: 100), at which point 10-20% of incoming requests are rejected. The service continues to operate but cannot accept new connections. A restart temporarily resolves the issue, confirming this is a resource accounting problem rather than actual connection exhaustion.

Root Cause Analysis

The Bug

The connection leak is caused by improper cleanup when requests time out. Here's the sequence:

  1. RPC Request Starts: RpcMetricsMiddleware calls MethodSessionLogger::start() which:

    • Increments open_requests gauge
    • Records the start time
  2. Request Processes: The async handler processes the request

  3. Timeout Occurs: The HTTP layer timeout middleware (configured at 20 seconds via RPC_TIMEOUT_SECONDS) cancels the request future when it exceeds the timeout

  4. Cleanup Skipped: When the future is cancelled, execution never reaches the method_logger.done() call in the middleware, so:

    • open_requests gauge is never decremented
    • Connection accounting is corrupted
    • jsonrpsee's ConnectionGuard thinks connections are still active
  5. Leak Accumulates: Over time, slow requests (especially rundler_maxPriorityFeePerGas, which makes multiple RPC calls to the underlying node) time out frequently, accumulating leaked connection counts

  6. Pool Exhaustion: Eventually the leaked count reaches 100, and jsonrpsee refuses new connections

Code Location

Affected Code (crates/rpc/src/rpc_metrics.rs:69-77):

async move {
    // ... connection tracking ...
    let rp = svc.call(req).await;  // ⚠️ Can be cancelled by timeout
    method_logger.done();           // ⚠️ Never called on cancellation
    // ...
}

The Problem: When the HTTP timeout layer cancels this future, the async block is dropped immediately at the .await point, and done() is never called.
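
To make the cancellation mechanics concrete, here is a minimal standalone sketch (plain tokio, not rundler code) showing that statements after an .await simply never execute once the enclosing future is dropped by a timeout:

use std::time::Duration;

// Demonstrates drop-at-await: the timeout layer behaves like
// tokio::time::timeout, dropping the inner future wherever it is suspended.
#[tokio::main]
async fn main() {
    let work = async {
        println!("start()-equivalent ran"); // always reached
        tokio::time::sleep(Duration::from_secs(5)).await; // cancelled here
        println!("done()-equivalent ran"); // never reached on timeout
    };

    if tokio::time::timeout(Duration::from_millis(100), work).await.is_err() {
        println!("future cancelled; cleanup after the .await was skipped");
    }
}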

Why This Manifests as a Connection Leak

The rundler_jsonrpsee_stats_active_connections metric is set based on the ConnectionGuard state. The guard's accounting relies on proper cleanup of request sessions. When open_requests never decrements:

  1. The middleware believes connections are still active
  2. The jsonrpsee server's connection semaphore is never released
  3. New connection attempts are rejected with "too many connections"
  4. The actual TCP connections may be closed, but the accounting is wrong
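
As a simplified model of that accounting failure (illustrative only, not jsonrpsee's actual internals), consider a connection limiter built on a semaphore where permits are forgotten instead of released:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Each "connection" must hold a permit. Forgetting a permit has the same
// effect as the missing done() call: the limiter rejects new connections
// even though no real work is running.
#[tokio::main]
async fn main() {
    let limit = Arc::new(Semaphore::new(2)); // tiny max_connections for demo

    for _ in 0..2 {
        let permit = limit.clone().acquire_owned().await.unwrap();
        std::mem::forget(permit); // accounting leaks here
    }

    // The third connection is rejected: "too many connections".
    assert!(limit.clone().try_acquire_owned().is_err());
    println!("new connection rejected despite no active work");
}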

Observable Symptoms

From production metrics, this bug manifests as:

  • rundler_jsonrpsee_stats_active_connections steadily climbing toward 100
  • rundler_rpc_stats_open_requests stuck at non-zero values for certain methods
  • rundler_rpc_stats_request_latency P99 approaching/exceeding 20,000ms (timeout threshold)
  • Increased rate of failed requests once the connection limit is reached
  • A restart immediately resolves the issue (metrics reset)

Why Slow Requests Trigger This

The rundler_maxPriorityFeePerGas endpoint is particularly vulnerable because it:

  1. Calls FeeEstimator::latest_bundle_fees() which performs two sequential RPC calls:

    • eth_feeHistory (1 block) to get pending base fee
    • eth_feeHistory (5 blocks) to determine network congestion via the usage-based oracle
  2. On Base, these calls can be slow due to:

    • RPC provider latency
    • Network congestion checking (requires 5 blocks of history)
    • No caching on the latest_bundle_fees() path
  3. When latency exceeds 20 seconds, the request times out and a connection slot is leaked

Solution

Approach: Drop Guard Pattern

The fix implements a RAII (Resource Acquisition Is Initialization) pattern using Rust's Drop trait to guarantee cleanup even when futures are cancelled.

Changes

1. Add MethodSessionGuard (crates/types/src/task/metric_recorder.rs)

/// Guard that ensures `done()` is called even if the future is cancelled.
pub struct MethodSessionGuard<'a> {
    logger: &'a MethodSessionLogger,
}

impl<'a> Drop for MethodSessionGuard<'a> {
    fn drop(&mut self) {
        self.logger.done();
    }
}

This guard calls done() when dropped, regardless of how the async block exits (normal completion, early return, panic, or cancellation).

2. Add guard() Method

impl MethodSessionLogger {
    pub fn guard(&self) -> MethodSessionGuard {
        MethodSessionGuard { logger: self }
    }
}

3. Use Guard in Middleware (crates/rpc/src/rpc_metrics.rs)

async move {
    // Create guard at the start - done() will be called when _guard drops
    let _guard = method_logger.guard();
    
    // ... rest of async block ...
    let rp = svc.call(req).await;
    // ... status recording ...
    
    rp
    // _guard drops here, calling done() automatically
}

Why This Works

  1. Guard Created Early: The _guard is created at the start of the async block
  2. Guaranteed Cleanup: Rust's ownership system guarantees Drop::drop() is called when the guard goes out of scope
  3. Cancellation Safe: Even if the future is cancelled mid-execution, the guard is dropped and done() is called
  4. Panic Safe: Even if the code panics, drop handlers run during unwinding
  5. No Double-Call: The original method_logger.done() call was removed, preventing double-decrement
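
A test along these lines (a sketch using stand-in types that mirror the ones above, not code from this PR) pins the cancellation behavior down:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Stand-in for MethodSessionLogger: `open` models the open_requests gauge.
struct Logger {
    open: AtomicUsize,
}

impl Logger {
    fn start(&self) {
        self.open.fetch_add(1, Ordering::SeqCst);
    }
    fn done(&self) {
        self.open.fetch_sub(1, Ordering::SeqCst);
    }
    fn guard(&self) -> Guard<'_> {
        Guard { logger: self }
    }
}

// Stand-in for MethodSessionGuard: done() runs in Drop.
struct Guard<'a> {
    logger: &'a Logger,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.logger.done();
    }
}

#[tokio::test]
async fn guard_decrements_even_when_future_is_cancelled() {
    let logger = Logger { open: AtomicUsize::new(0) };

    let request = async {
        logger.start();
        let _guard = logger.guard();
        // Simulate a handler that outlives the timeout.
        tokio::time::sleep(Duration::from_secs(60)).await;
    };

    // Cancel the request the same way the HTTP timeout layer would.
    let _ = tokio::time::timeout(Duration::from_millis(10), request).await;

    // The gauge is back at zero: Drop ran despite the cancellation.
    assert_eq!(logger.open.load(Ordering::SeqCst), 0);
}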

Testing & Validation

Before Fix

Monitor these metrics with the Grafana dashboard provided separately:

# Should show steady growth
rundler_jsonrpsee_stats_active_connections{service_name="rundler-rpc-service"}

# Should show positive derivative (leak rate)
deriv(rundler_jsonrpsee_stats_active_connections[5m]) > 0

# Should show stuck non-zero values
rundler_rpc_stats_open_requests{method_name="rundler_maxPriorityFeePerGas"}

After Fix

Expected behavior:

# Should fluctuate with traffic, return to baseline
rundler_jsonrpsee_stats_active_connections{service_name="rundler-rpc-service"}

# Should oscillate around zero
deriv(rundler_jsonrpsee_stats_active_connections[5m])

# Should always return to 0 when no active requests
rundler_rpc_stats_open_requests

Load Testing

To validate the fix under load:

  1. Deploy to a staging environment
  2. Generate traffic with requests that time out (artificially lower the timeout or slow down the node RPC; a minimal traffic generator sketch follows this list)
  3. Verify connection count remains stable over time
  4. Verify open_requests always returns to 0
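
To drive step 2, a throwaway traffic generator can be as simple as the following sketch (the endpoint URL and concurrency are placeholders; assumes the reqwest crate with its json feature, plus serde_json):

use serde_json::json;

// Fires concurrent rundler_maxPriorityFeePerGas requests at a staging node.
#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let url = "http://staging-rundler:3000"; // hypothetical staging endpoint

    let mut tasks = Vec::new();
    for id in 0..50 {
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            let body = json!({
                "jsonrpc": "2.0",
                "id": id,
                "method": "rundler_maxPriorityFeePerGas",
                "params": [],
            });
            client.post(url).json(&body).send().await.map(|rp| rp.status())
        }));
    }

    for task in tasks {
        match task.await.expect("task panicked") {
            Ok(status) => println!("response: {status}"),
            Err(err) => println!("request failed: {err}"), // timeouts land here
        }
    }
}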

Additional Considerations

Alternative Fixes Considered

  1. Reorder Middleware Layers: Place metrics outside timeout

    • ❌ Would still leak on panics or other cancellation sources
    • ❌ Doesn't protect against all future cancellation scenarios
  2. Add Per-Method Timeouts: Wrap slow operations in tokio::time::timeout()

    • ✅ Good idea for additional defense-in-depth (see the sketch after this list)
    • ❌ Doesn't solve the fundamental cleanup issue
  3. Use scopeguard crate: Third-party defer mechanism

    • ❌ Unnecessary dependency when Drop trait works natively
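
For completeness, alternative 2 would look roughly like this (a sketch only; fetch_fee_history and the 5-second budget are hypothetical stand-ins, not rundler code):

use std::time::Duration;
use tokio::time::timeout;

// Bounds a slow upstream call well under the global 20 s limit.
async fn latest_bundle_fees_with_deadline() -> Result<u128, &'static str> {
    match timeout(Duration::from_secs(5), fetch_fee_history()).await {
        Ok(Ok(fee)) => Ok(fee),
        Ok(Err(_)) => Err("upstream RPC error"),
        // The inner future is dropped here; with the Drop guard in place,
        // its metrics are still released correctly.
        Err(_elapsed) => Err("fee history lookup timed out"),
    }
}

async fn fetch_fee_history() -> Result<u128, ()> {
    // Placeholder for the real eth_feeHistory provider call.
    Ok(1_000_000_000)
}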

The Drop guard approach is the most robust solution as it:

  • ✅ Handles all cancellation sources (timeout, panic, early return, drop)
  • ✅ Uses native Rust idioms (no external deps)
  • ✅ Is a well-established pattern for cleanup in async Rust
  • ✅ Provides compile-time guarantees

Performance Impact

None. The guard is a zero-cost abstraction:

  • Compiled to the same cleanup code
  • No heap allocation (stack-only)
  • No additional function calls beyond the original done()

Future Improvements

While this PR fixes the leak, the following improvements could reduce timeout frequency:

  1. Add Caching to latest_bundle_fees(): Similar to how required_bundle_fees() has an LRU cache (a TTL-memo sketch follows this list)
  2. Switch Base to PROVIDER Oracle: Use eth_maxPriorityFeePerGas instead of usage-based (one RPC call vs two)
  3. Reduce Usage-Based History Window: Check fewer blocks for congestion (e.g., 3 instead of 5)
  4. Add Per-Method Timeouts: Wrap individual RPC methods in shorter timeouts before the global 20s limit
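
Improvement 1 could be a short-TTL memo around the fee fetch, sketched below (fetch_latest_bundle_fees and the TTL are hypothetical stand-ins, not rundler code):

use std::time::{Duration, Instant};
use tokio::sync::Mutex;

// Caches the last fetched fee for `ttl`, so bursts of
// rundler_maxPriorityFeePerGas requests don't each pay for two
// eth_feeHistory round trips. Holding the lock across the fetch also
// serializes concurrent refreshes (no thundering herd).
struct CachedFees {
    ttl: Duration,
    slot: Mutex<Option<(Instant, u128)>>,
}

impl CachedFees {
    fn new(ttl: Duration) -> Self {
        Self { ttl, slot: Mutex::new(None) }
    }

    async fn get(&self) -> Result<u128, ()> {
        let mut slot = self.slot.lock().await;
        if let Some((fetched_at, fee)) = *slot {
            if fetched_at.elapsed() < self.ttl {
                return Ok(fee); // fresh enough: no upstream calls
            }
        }
        let fee = fetch_latest_bundle_fees().await?;
        *slot = Some((Instant::now(), fee));
        Ok(fee)
    }
}

async fn fetch_latest_bundle_fees() -> Result<u128, ()> {
    // Placeholder for the two sequential eth_feeHistory calls.
    Ok(1_000_000_000)
}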

Related Issues

This issue is related to:

  • High latency on rundler_maxPriorityFeePerGas endpoint
  • Base network fee estimation using usage-based oracle requiring multiple RPC calls
  • Lack of caching on bundle fee estimation

Checklist

  • Root cause identified and documented
  • Fix implemented using Drop guard pattern
  • Manual testing completed
  • Grafana dashboard for monitoring provided separately
  • Ready for staging deployment and validation

Add Drop guard to ensure metrics cleanup on request timeout/cancellation.

When the HTTP timeout middleware cancels a request future due to timeout,
the RPC metrics middleware was not calling done(), causing:
- open_requests gauge to never decrement
- Connection pool accounting to leak
- Eventually exhausting max_connections limit (100)

This fix adds a MethodSessionGuard that calls done() via Drop, ensuring
cleanup even when futures are cancelled by timeout or other mechanisms.