
fix(rpc): prevent connection leak in metrics middleware #1251

Closed

pavelm wants to merge 1 commit into main from fix/rpc-connection-leak

Conversation

pavelm (Contributor) commented Jan 16, 2026

Fix RPC Connection Leak in Metrics Middleware

Problem Summary

The RPC server experiences a connection leak: the active connection count grows without bound until it reaches the max_connections limit (default: 100), at which point 10-20% of incoming requests are rejected. The service continues to operate but cannot accept new connections. A restart temporarily resolves the issue, confirming this is a resource accounting problem rather than actual connection exhaustion.

Root Cause Analysis

The Bug

The connection leak is caused by improper cleanup when requests time out. Here's the sequence:

  1. RPC Request Starts: RpcMetricsMiddleware calls MethodSessionLogger::start() which:

    • Increments open_requests gauge
    • Records the start time
  2. Request Processes: The async handler processes the request

  3. Timeout Occurs: The HTTP layer timeout middleware (configured at 20 seconds via RPC_TIMEOUT_SECONDS) cancels the request future when it exceeds the timeout

  4. Cleanup Skipped: When the future is cancelled, execution never reaches the method_logger.done() call in the middleware, so:

    • open_requests gauge is never decremented
    • Connection accounting is corrupted
    • jsonrpsee's ConnectionGuard thinks connections are still active
  5. Leak Accumulates: Over time, slow requests (especially rundler_maxPriorityFeePerGas, which makes multiple RPC calls to the underlying node) time out frequently, accumulating leaked connection counts

  6. Pool Exhaustion: Eventually the leaked count reaches 100, and jsonrpsee refuses new connections

Code Location

Affected Code (crates/rpc/src/rpc_metrics.rs:69-77):

async move {
    // ... connection tracking ...
    let rp = svc.call(req).await;  // ⚠️ Can be cancelled by timeout
    method_logger.done();           // ⚠️ Never called on cancellation
    // ...
}

The Problem: When the HTTP timeout layer cancels this future, the async block is dropped immediately at the .await point, and done() is never called.
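
To make the cancellation mechanics concrete, here is a minimal standalone sketch (plain tokio, not rundler code) showing that statements after an .await simply never execute once the enclosing future is dropped by a timeout:

use std::time::Duration;

// Demonstrates drop-at-await: the timeout layer behaves like
// tokio::time::timeout, dropping the inner future wherever it is suspended.
#[tokio::main]
async fn main() {
    let work = async {
        println!("start()-equivalent ran"); // always reached
        tokio::time::sleep(Duration::from_secs(5)).await; // cancelled here
        println!("done()-equivalent ran"); // never reached on timeout
    };

    if tokio::time::timeout(Duration::from_millis(100), work).await.is_err() {
        println!("future cancelled; cleanup after the .await was skipped");
    }
}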

Why This Manifests as a Connection Leak

The rundler_jsonrpsee_stats_active_connections metric is set based on the ConnectionGuard state. The guard's accounting relies on proper cleanup of request sessions. When open_requests never decrements:

  1. The middleware believes connections are still active
  2. The jsonrpsee server's connection semaphore is never released
  3. New connection attempts are rejected with "too many connections"
  4. The actual TCP connections may be closed, but the accounting is wrong
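
As a simplified model of that accounting failure (illustrative only, not jsonrpsee's actual internals), consider a connection limiter built on a semaphore where permits are forgotten instead of released:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Each "connection" must hold a permit. Forgetting a permit has the same
// effect as the missing done() call: the limiter rejects new connections
// even though no real work is running.
#[tokio::main]
async fn main() {
    let limit = Arc::new(Semaphore::new(2)); // tiny max_connections for demo

    for _ in 0..2 {
        let permit = limit.clone().acquire_owned().await.unwrap();
        std::mem::forget(permit); // accounting leaks here
    }

    // The third connection is rejected: "too many connections".
    assert!(limit.clone().try_acquire_owned().is_err());
    println!("new connection rejected despite no active work");
}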

Observable Symptoms

From production metrics, this bug manifests as:

  • rundler_jsonrpsee_stats_active_connections steadily climbing toward 100
  • rundler_rpc_stats_open_requests stuck at non-zero values for certain methods
  • rundler_rpc_stats_request_latency P99 approaching/exceeding 20,000ms (timeout threshold)
  • Increased rate of failed requests once the connection limit is reached
  • A restart immediately resolves the issue (metrics reset)

Why Slow Requests Trigger This

The rundler_maxPriorityFeePerGas endpoint is particularly vulnerable because it:

  1. Calls FeeEstimator::latest_bundle_fees() which performs two sequential RPC calls:

    • eth_feeHistory (1 block) to get pending base fee
    • eth_feeHistory (5 blocks) to determine network congestion via the usage-based oracle
  2. On Base, these calls can be slow due to:

    • RPC provider latency
    • Network congestion checking (requires 5 blocks of history)
    • No caching on the latest_bundle_fees() path
  3. When latency exceeds 20 seconds, the request times out and a connection slot is leaked

Solution

Approach: Drop Guard Pattern

The fix implements a RAII (Resource Acquisition Is Initialization) pattern using Rust's Drop trait to guarantee cleanup even when futures are cancelled.

Changes

1. Add MethodSessionGuard (crates/types/src/task/metric_recorder.rs)

/// Guard that ensures `done()` is called even if the future is cancelled.
pub struct MethodSessionGuard<'a> {
    logger: &'a MethodSessionLogger,
}

impl<'a> Drop for MethodSessionGuard<'a> {
    fn drop(&mut self) {
        self.logger.done();
    }
}

This guard calls done() when dropped, regardless of how the async block exits (normal completion, early return, panic, or cancellation).

2. Add guard() Method

impl MethodSessionLogger {
    pub fn guard(&self) -> MethodSessionGuard {
        MethodSessionGuard { logger: self }
    }
}

3. Use Guard in Middleware (crates/rpc/src/rpc_metrics.rs)

async move {
    // Create guard at the start - done() will be called when _guard drops
    let _guard = method_logger.guard();
    
    // ... rest of async block ...
    let rp = svc.call(req).await;
    // ... status recording ...
    
    rp
    // _guard drops here, calling done() automatically
}

Why This Works

  1. Guard Created Early: The _guard is created at the start of the async block
  2. Guaranteed Cleanup: Rust's ownership system guarantees Drop::drop() is called when the guard goes out of scope
  3. Cancellation Safe: Even if the future is cancelled mid-execution, the guard is dropped and done() is called
  4. Panic Safe: Even if the code panics, drop handlers run during unwinding
  5. No Double-Call: The original method_logger.done() call was removed, preventing double-decrement
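
A test along these lines (a sketch using stand-in types that mirror the ones above, not code from this PR) pins the cancellation behavior down:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;

// Stand-in for MethodSessionLogger: `open` models the open_requests gauge.
struct Logger {
    open: AtomicUsize,
}

impl Logger {
    fn start(&self) {
        self.open.fetch_add(1, Ordering::SeqCst);
    }
    fn done(&self) {
        self.open.fetch_sub(1, Ordering::SeqCst);
    }
    fn guard(&self) -> Guard<'_> {
        Guard { logger: self }
    }
}

// Stand-in for MethodSessionGuard: done() runs in Drop.
struct Guard<'a> {
    logger: &'a Logger,
}

impl Drop for Guard<'_> {
    fn drop(&mut self) {
        self.logger.done();
    }
}

#[tokio::test]
async fn guard_decrements_even_when_future_is_cancelled() {
    let logger = Logger { open: AtomicUsize::new(0) };

    let request = async {
        logger.start();
        let _guard = logger.guard();
        // Simulate a handler that outlives the timeout.
        tokio::time::sleep(Duration::from_secs(60)).await;
    };

    // Cancel the request the same way the HTTP timeout layer would.
    let _ = tokio::time::timeout(Duration::from_millis(10), request).await;

    // The gauge is back at zero: Drop ran despite the cancellation.
    assert_eq!(logger.open.load(Ordering::SeqCst), 0);
}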

Testing & Validation

Before Fix

Monitor these metrics with the Grafana dashboard provided separately:

# Should show steady growth
rundler_jsonrpsee_stats_active_connections{service_name="rundler-rpc-service"}

# Should show positive derivative (leak rate)
deriv(rundler_jsonrpsee_stats_active_connections[5m]) > 0

# Should show stuck non-zero values
rundler_rpc_stats_open_requests{method_name="rundler_maxPriorityFeePerGas"}

After Fix

Expected behavior:

# Should fluctuate with traffic, return to baseline
rundler_jsonrpsee_stats_active_connections{service_name="rundler-rpc-service"}

# Should oscillate around zero
deriv(rundler_jsonrpsee_stats_active_connections[5m])

# Should always return to 0 when no active requests
rundler_rpc_stats_open_requests

Load Testing

To validate the fix under load:

  1. Deploy to a staging environment
  2. Generate traffic with requests that time out (artificially lower the timeout or slow down the node RPC; a minimal traffic generator sketch follows this list)
  3. Verify connection count remains stable over time
  4. Verify open_requests always returns to 0
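
To drive step 2, a throwaway traffic generator can be as simple as the following sketch (the endpoint URL and concurrency are placeholders; assumes the reqwest crate with its json feature, plus serde_json):

use serde_json::json;

// Fires concurrent rundler_maxPriorityFeePerGas requests at a staging node.
#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let url = "http://staging-rundler:3000"; // hypothetical staging endpoint

    let mut tasks = Vec::new();
    for id in 0..50 {
        let client = client.clone();
        tasks.push(tokio::spawn(async move {
            let body = json!({
                "jsonrpc": "2.0",
                "id": id,
                "method": "rundler_maxPriorityFeePerGas",
                "params": [],
            });
            client.post(url).json(&body).send().await.map(|rp| rp.status())
        }));
    }

    for task in tasks {
        match task.await.expect("task panicked") {
            Ok(status) => println!("response: {status}"),
            Err(err) => println!("request failed: {err}"), // timeouts land here
        }
    }
}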

Additional Considerations

Alternative Fixes Considered

  1. Reorder Middleware Layers: Place metrics outside timeout

    • ❌ Would still leak on panics or other cancellation sources
    • ❌ Doesn't protect against all future cancellation scenarios
  2. Add Per-Method Timeouts: Wrap slow operations in tokio::time::timeout()

    • ✅ Good idea for additional defense-in-depth (see the sketch after this list)
    • ❌ Doesn't solve the fundamental cleanup issue
  3. Use scopeguard crate: Third-party defer mechanism

    • ❌ Unnecessary dependency when Drop trait works natively
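
For completeness, alternative 2 would look roughly like this (a sketch only; fetch_fee_history and the 5-second budget are hypothetical stand-ins, not rundler code):

use std::time::Duration;
use tokio::time::timeout;

// Bounds a slow upstream call well under the global 20 s limit.
async fn latest_bundle_fees_with_deadline() -> Result<u128, &'static str> {
    match timeout(Duration::from_secs(5), fetch_fee_history()).await {
        Ok(Ok(fee)) => Ok(fee),
        Ok(Err(_)) => Err("upstream RPC error"),
        // The inner future is dropped here; with the Drop guard in place,
        // its metrics are still released correctly.
        Err(_elapsed) => Err("fee history lookup timed out"),
    }
}

async fn fetch_fee_history() -> Result<u128, ()> {
    // Placeholder for the real eth_feeHistory provider call.
    Ok(1_000_000_000)
}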

The Drop guard approach is the most robust solution as it:

  • ✅ Handles all cancellation sources (timeout, panic, early return, drop)
  • ✅ Uses native Rust idioms (no external deps)
  • ✅ Is a well-established pattern for cleanup in async Rust
  • ✅ Provides compile-time guarantees

Performance Impact

None. The guard is a zero-cost abstraction:

  • Compiled to the same cleanup code
  • No heap allocation (stack-only)
  • No additional function calls beyond the original done()

Future Improvements

While this PR fixes the leak, the following improvements could reduce timeout frequency:

  1. Add Caching to latest_bundle_fees(): Similar to how required_bundle_fees() has an LRU cache (a TTL-memo sketch follows this list)
  2. Switch Base to PROVIDER Oracle: Use eth_maxPriorityFeePerGas instead of usage-based (one RPC call vs two)
  3. Reduce Usage-Based History Window: Check fewer blocks for congestion (e.g., 3 instead of 5)
  4. Add Per-Method Timeouts: Wrap individual RPC methods in shorter timeouts before the global 20s limit
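
Improvement 1 could be a short-TTL memo around the fee fetch, sketched below (fetch_latest_bundle_fees and the TTL are hypothetical stand-ins, not rundler code):

use std::time::{Duration, Instant};
use tokio::sync::Mutex;

// Caches the last fetched fee for `ttl`, so bursts of
// rundler_maxPriorityFeePerGas requests don't each pay for two
// eth_feeHistory round trips. Holding the lock across the fetch also
// serializes concurrent refreshes (no thundering herd).
struct CachedFees {
    ttl: Duration,
    slot: Mutex<Option<(Instant, u128)>>,
}

impl CachedFees {
    fn new(ttl: Duration) -> Self {
        Self { ttl, slot: Mutex::new(None) }
    }

    async fn get(&self) -> Result<u128, ()> {
        let mut slot = self.slot.lock().await;
        if let Some((fetched_at, fee)) = *slot {
            if fetched_at.elapsed() < self.ttl {
                return Ok(fee); // fresh enough: no upstream calls
            }
        }
        let fee = fetch_latest_bundle_fees().await?;
        *slot = Some((Instant::now(), fee));
        Ok(fee)
    }
}

async fn fetch_latest_bundle_fees() -> Result<u128, ()> {
    // Placeholder for the two sequential eth_feeHistory calls.
    Ok(1_000_000_000)
}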

Related Issues

This issue is related to:

  • High latency on rundler_maxPriorityFeePerGas endpoint
  • Base network fee estimation using usage-based oracle requiring multiple RPC calls
  • Lack of caching on bundle fee estimation

Checklist

  • Root cause identified and documented
  • Fix implemented using Drop guard pattern
  • Manual testing completed
  • Grafana dashboard for monitoring provided separately
  • Ready for staging deployment and validation

Add Drop guard to ensure metrics cleanup on request timeout/cancellation.

When the HTTP timeout middleware cancels a request future due to timeout,
the RPC metrics middleware was not calling done(), causing:
- open_requests gauge to never decrement
- Connection pool accounting to leak
- Eventually exhausting max_connections limit (100)

This fix adds a MethodSessionGuard that calls done() via Drop, ensuring
cleanup even when futures are cancelled by timeout or other mechanisms.