Skip to content

feat(libp2p): tolerate transient ping failures#3504

Draft
tabcat wants to merge 4 commits into
libp2p:mainfrom
tabcat:fix/connection-monitor-silence-gate
Draft

feat(libp2p): tolerate transient ping failures#3504
tabcat wants to merge 4 commits into
libp2p:mainfrom
tabcat:fix/connection-monitor-silence-gate

Conversation

@tabcat
Copy link
Copy Markdown
Member

@tabcat tabcat commented May 13, 2026

Description

Connection-monitor currently aborts on the first failed ping. With the AdaptiveTimeout 5s floor, transient loss or backpressure on a healthy connection routinely trips the abort even though TCP would have recovered.

Supersedes #3463, which fixes the same problem by flipping the default to false (see it for the trans-Pacific disconnect evidence); this PR keeps the default but aborts only on genuine transport silence.

The default abortConnectionOnPingFailure: true is preserved but the abort is now gated on actual transport silence:

  • @libp2p/interface: optional lastReadAt?: number added to MessageStreamTimeline; new ConnectionStaleError export.
  • @libp2p/utils: AbstractMessageStream.onData() sets timeline.lastReadAt = Date.now() whenever bytes arrive — covers every transport in one line since they all route through this method.
  • libp2p: connection-monitor adds connectionStaleTimeout (default 60s). On ping failure it aborts only if Date.now() - (conn.timeline.lastReadAt ?? conn.timeline.open) > connectionStaleTimeout. Aborts emit ConnectionStaleError so consumers can distinguish gate-driven aborts from transport-level failures.

Net effect: ping failures from heavy traffic on other streams, GC pauses, TCP retransmit bursts, or path-level jitter no longer kill healthy connections — only sustained silence does.

Behavior change

abortConnectionOnPingFailure keeps its true default, but its meaning changes: previously a single failed ping aborted the connection; now a failed ping aborts only after the peer has sent no data for connectionStaleTimeout (default 60s). Deployments relying on true for immediate teardown will keep connections open longer when pings fail while data is still flowing. The option and its default are unchanged, so no config action is required.

Notes & open questions

Some structural cleanups (drop AdaptiveTimeout from this file, simplify pingTimeout, drop abortConnectionOnPingFailure) become natural once the abort decision lives elsewhere — they're breaking and noted for v4.

Change checklist

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation if necessary (this includes comments as well)
  • I have added tests that prove my fix is effective or that my feature works

tabcat added 2 commits May 13, 2026 21:50
A single failed ping no longer aborts the connection. The heartbeat now
records `lastReadAt` on every MaConn read and only aborts when the ping
fails AND no data has been received from the peer for
`connectionStaleTimeout` ms (default 60s). Aborts use the new
`ConnectionStaleError` so consumers can distinguish staleness from
transport-level failures.
@tabcat tabcat changed the title fix(libp2p): gate connection-monitor abort on transport silence fix(libp2p): tolerate transient ping failures May 16, 2026
tabcat added 2 commits May 23, 2026 09:47
Add a @libp2p/utils test that received data sets timeline.lastReadAt, a libp2p test that the connection shares its maConn timeline by reference, and assert stale aborts emit ConnectionStaleError.
Drop the unconditional heartbeat error log so a healthy connection with a flaky ping no longer spams errors; error level is reserved for the staleness abort.
@tabcat tabcat changed the title fix(libp2p): tolerate transient ping failures feat(libp2p): tolerate transient ping failures May 23, 2026
@tabcat
Copy link
Copy Markdown
Member Author

tabcat commented May 23, 2026

@marshalleq would you mind testing these changes and see if they work for you. it tolerates multiple ping failures and handles cases where the ping messages havent been decoded yet but data has been read from the stream. default stale connection timeout is 1 minute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant