fix(rpc): prevent connection leak in metrics middleware (#1251)
Add Drop guard to ensure metrics cleanup on request timeout/cancellation.

When the HTTP timeout middleware cancels a request future due to timeout, the RPC metrics middleware was not calling done(), causing:

- open_requests gauge to never decrement
- Connection pool accounting to leak
- Eventually exhausting max_connections limit (100)

This fix adds a MethodSessionGuard that calls done() via Drop, ensuring cleanup even when futures are cancelled by timeout or other mechanisms.
dancoombs approved these changes on Jan 16, 2026.
Fix RPC Connection Leak in Metrics Middleware
Problem Summary
The RPC server experiences a connection leak where active connections grow unbounded until reaching the `max_connections` limit (default: 100), at which point 10-20% of incoming requests are rejected. The service continues to operate but cannot accept new connections. A restart temporarily resolves the issue, confirming it is a resource accounting problem rather than actual connection exhaustion.

Root Cause Analysis
The Bug
The connection leak occurs due to improper cleanup when requests timeout. Here's the sequence:
1. RPC Request Starts: `RpcMetricsMiddleware` calls `MethodSessionLogger::start()`, which increments the `open_requests` gauge.
2. Request Processes: The async handler processes the request.
3. Timeout Occurs: The HTTP layer timeout middleware (configured at 20 seconds via `RPC_TIMEOUT_SECONDS`) cancels the request future when it exceeds the timeout.
4. Cleanup Skipped: When the future is cancelled, execution never reaches the `method_logger.done()` call in the middleware, so:
   - the `open_requests` gauge is never decremented
   - the `ConnectionGuard` thinks connections are still active
5. Leak Accumulates: Over time, slow requests (especially `rundler_maxPriorityFeePerGas`, which makes multiple RPC calls to the underlying node) time out frequently, accumulating leaked connection counts.
6. Pool Exhaustion: Eventually the leaked count reaches 100, and jsonrpsee refuses new connections.
Code Location
Affected Code (`crates/rpc/src/rpc_metrics.rs:69-77`):

The Problem: When the HTTP timeout layer cancels this future, the async block is dropped immediately at the `.await` point, and `done()` is never called.
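The affected code block did not survive the export of this page. As a stand-in, here is a self-contained sketch of the mechanism using plain tokio; the gauge, timings, and names are illustrative and not the rundler middleware itself:

```rust
// Self-contained illustration of the failure mode (plain tokio, not rundler
// code): a gauge is incremented before an .await and decremented after it,
// and an outer timeout cancels the future in between, so the decrement
// never runs and the gauge stays stuck at 1.
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};
use std::time::Duration;

#[tokio::main]
async fn main() {
    let open_requests = Arc::new(AtomicU64::new(0));

    let gauge = open_requests.clone();
    let handler = async move {
        gauge.fetch_add(1, Ordering::SeqCst); // start(): open_requests += 1
        tokio::time::sleep(Duration::from_secs(5)).await; // "slow" RPC work
        gauge.fetch_sub(1, Ordering::SeqCst); // done(): never reached here
    };

    // The timeout layer drops (cancels) the handler future at the .await point.
    let _ = tokio::time::timeout(Duration::from_millis(50), handler).await;

    // The gauge is still 1: this is the leaked accounting the PR fixes.
    println!("open_requests = {}", open_requests.load(Ordering::SeqCst));
}
```

Running this prints `open_requests = 1`: the decrement after the `.await` never executes because the outer timeout dropped the future first.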
Why This Manifests as a Connection Leak

The `rundler_jsonrpsee_stats_active_connections` metric is set based on the `ConnectionGuard` state. The guard's accounting relies on proper cleanup of request sessions, so when `open_requests` never decrements, the active connection count only ever climbs.

Observable Symptoms
From production metrics, this bug manifests as:
- `rundler_jsonrpsee_stats_active_connections` steadily climbing toward 100
- `rundler_rpc_stats_open_requests` stuck at non-zero values for certain methods
- `rundler_rpc_stats_request_latency` P99 approaching/exceeding 20,000 ms (the timeout threshold)

Why Slow Requests Trigger This
The `rundler_maxPriorityFeePerGas` endpoint is particularly vulnerable because it:

- Calls `FeeEstimator::latest_bundle_fees()`, which performs two sequential RPC calls:
  - `eth_feeHistory` (1 block) to get the pending base fee
  - `eth_feeHistory` (5 blocks) to determine network congestion via the usage-based oracle
- On Base, these calls can be slow on the `latest_bundle_fees()` path

When latency exceeds 20 seconds, the request times out and leaks a connection count.
Solution
Approach: Drop Guard Pattern
The fix implements a RAII (Resource Acquisition Is Initialization) pattern using Rust's `Drop` trait to guarantee cleanup even when futures are cancelled.

Changes
1. Add `MethodSessionGuard` (`crates/types/src/task/metric_recorder.rs`)

This guard calls `done()` when dropped, regardless of how the async block exits (normal completion, early return, panic, or cancellation).

2. Add `guard()` Method
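The code for changes 1 and 2 was stripped from this page; below is a combined sketch of their shape. The simplified `MethodSessionLogger`, its `Clone`-based handoff, and the field layout are assumptions rather than the actual definitions in `crates/types/src/task/metric_recorder.rs`:

```rust
// Sketch only: a simplified MethodSessionLogger standing in for the real type.
#[derive(Clone)]
pub struct MethodSessionLogger {
    // ... existing fields (method name, service name, timers, gauges)
}

impl MethodSessionLogger {
    pub fn done(&self) {
        // existing logic: decrement open_requests, record latency, etc.
    }

    /// Change 2: hand out an RAII guard tied to this session.
    pub fn guard(&self) -> MethodSessionGuard {
        MethodSessionGuard { logger: self.clone() }
    }
}

/// Change 1: guard that finalizes the session when it is dropped.
pub struct MethodSessionGuard {
    logger: MethodSessionLogger,
}

impl Drop for MethodSessionGuard {
    fn drop(&mut self) {
        // Runs on every exit path: normal completion, early return, panic,
        // or cancellation when the timeout layer drops the enclosing future.
        self.logger.done();
    }
}
```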
3. Use Guard in Middleware (`crates/rpc/src/rpc_metrics.rs`)
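Again only as a sketch (the jsonrpsee middleware plumbing is elided and the function signature is an assumption), the request future now owns an RAII guard instead of calling `done()` after the `.await`:

```rust
// Sketch of change 3, reusing the MethodSessionGuard shape from the previous block.
async fn call_with_metrics<Req, Resp, Fut>(
    method_logger: MethodSessionLogger,
    inner: impl FnOnce(Req) -> Fut,
    req: Req,
) -> Resp
where
    Fut: std::future::Future<Output = Resp>,
{
    // Created at the start of the block; its Drop impl runs even if this
    // future is cancelled at the .await below.
    let _guard = method_logger.guard();

    // The explicit method_logger.done() call that used to follow this .await
    // is gone: the guard now performs the cleanup on every exit path.
    inner(req).await
}
```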
Why This Works

- `_guard` is created at the start of the async block
- `Drop::drop()` is called when the guard goes out of scope, so `done()` is called on every exit path
- The explicit `method_logger.done()` call was removed, preventing a double decrement

Testing & Validation
Before Fix
Monitor these metrics with the Grafana dashboard provided separately:
After Fix
Expected behavior:
Load Testing
To validate the fix under load:
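The original validation steps did not survive the export. As a stand-in, a burst driver along these lines could be used, assuming the server listens on `http://localhost:8545` and that `tokio`, `serde_json`, and `reqwest` (with its `json` feature) are available:

```rust
// Fires a burst of concurrent rundler_maxPriorityFeePerGas requests so that
// some of them hit the 20 s timeout, then the gauges can be checked for leaks.
use serde_json::json;

#[tokio::main]
async fn main() {
    let client = reqwest::Client::new();
    let url = "http://localhost:8545"; // assumed local RPC endpoint

    let mut handles = Vec::new();
    for id in 0..200u64 {
        let client = client.clone();
        handles.push(tokio::spawn(async move {
            let body = json!({
                "jsonrpc": "2.0",
                "id": id,
                "method": "rundler_maxPriorityFeePerGas",
                "params": []
            });
            // Timeouts and rejections are expected here; the point is to
            // confirm the server's accounting recovers afterwards.
            let _ = client.post(url).json(&body).send().await;
        }));
    }
    for h in handles {
        let _ = h.await;
    }
    // Afterwards, rundler_rpc_stats_open_requests should read 0 and
    // rundler_jsonrpsee_stats_active_connections should drop back down.
}
```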
- `open_requests` always returns to 0

Additional Considerations
Alternative Fixes Considered
Reorder Middleware Layers: Place metrics outside timeout
Add Per-Method Timeouts: Wrap slow operations in `tokio::time::timeout()` (a brief sketch of this option follows the list)
Use `scopeguard` crate: Third-party defer mechanism
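For reference, a minimal sketch of the per-method timeout alternative; `fetch_fees` and `FeeError` are hypothetical placeholders for the slow fee-history path, and this code is not part of the PR:

```rust
use std::time::Duration;

#[derive(Debug)]
enum FeeError {
    Timeout,
}

// Hypothetical stand-in for the slow two-call eth_feeHistory path.
async fn fetch_fees() -> Result<u128, FeeError> {
    tokio::time::sleep(Duration::from_secs(30)).await;
    Ok(1_000_000)
}

// The alternative: give the slow operation its own deadline so the outer
// 20 s HTTP timeout rarely cancels the whole request future.
async fn fees_with_deadline() -> Result<u128, FeeError> {
    tokio::time::timeout(Duration::from_secs(10), fetch_fees())
        .await
        .unwrap_or(Err(FeeError::Timeout))
}
```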
Performance Impact
None. The guard is a zero-cost abstraction: the same `done()` call simply moves from the end of the async block into `Drop`.

Future Improvements
While this PR fixes the leak, the following improvements could reduce timeout frequency:
- Cache `latest_bundle_fees()` results: similar to how `required_bundle_fees()` has an LRU cache
- Use `eth_maxPriorityFeePerGas` instead of the usage-based oracle (one RPC call vs two)

Related Issues
This issue is related to:
- The `rundler_maxPriorityFeePerGas` endpoint

Checklist