fix: dedup simultaneous-dial sessions in addPeer to stop connection-flap loop#4
Merged
Merged
Conversation
…lap loop When two peers Bonjour-discover each other within ~50ms on a LAN and both initiate outbound TCP, each side ends up with two handshaked sessions for the same nodeId (one outbound, one inbound). Prior to this change the second handshake silently overwrote the first in `peers[nodeId].transports` without cancelling its NWConnection. The orphaned session kept a live read loop, eventually fired `didDisconnectWith`, and `removeTransport` stripped the surviving winner from the dict — killing the healthy connection ~1–2s after handshake. Symptom from the field: `[SYM] session: handshake complete` → `[SYM] peer: connected` → `cannot add handler to N from M - dropping` (libdispatch warning from the zombie's read loop) → `session: disconnected: Connection closed` → 6s pending → reconnect, repeat. No application payloads delivered across the affected connection during any cycle. Fix: `addPeer` now detects a prior bonjour transport and applies a deterministic tie-break — the lower nodeId acts as client and keeps its outbound; the higher nodeId keeps the matching inbound. Both peers compute the same physical-connection winner without exchanging coordination frames. The losing session has its delegate detached before `disconnect()` so its eventual cancellation cannot ripple back through `removeTransport` and clobber the winner. Tie-break extracted as `SymNode.preferNewSessionInDualDial(...)` static helper so it is unit-testable without network mocking. Two new tests: - `testBothPeersPickSamePhysicalConnection` — symmetry: A's outbound is the same socket as B's inbound; both peers must agree. - `testLowerNodeIdKeepsOutbound` — anchors the convention so future refactors don't silently flip it. 70/70 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When two peers Bonjour-discover each other within ~50ms on a LAN and both initiate outbound TCP, each side ends up with two handshaked sessions for the same nodeId (one outbound, one inbound). Prior to this change the second handshake silently overwrote the first in
peers[nodeId].transportswithout cancelling itsNWConnection. The orphaned session kept a live read loop, eventually fireddidDisconnectWith, andremoveTransportstripped the surviving winner from the dict — killing the healthy connection ~1–2s after handshake.Symptom from the field (iPhone↔Mac Catalyst, both running sym-swift main):
Repeats every ~6 seconds indefinitely. Zero application payloads delivered across the affected connection in any cycle. A third peer with a stable connection to the same iPhone receives mood-bearing CMBs continuously, confirming the iPhone is broadcasting fine — only the dual-dial pair is affected.
Fix
addPeernow detects a priorbonjourtransport and applies a deterministic tie-break:(localNodeId, remoteNodeId)pair, without exchanging coordination frames.disconnect()so its eventual cancellation cannot ripple back throughremoveTransportand clobber the winner.The tie-break is extracted as a static helper
SymNode.preferNewSessionInDualDial(...)so it's unit-testable without network mocking.Tests
Two new tests in
SimultaneousDialDedupTests:testBothPeersPickSamePhysicalConnection— symmetry property: A's outbound is the same TCP socket as B's inbound; both peers must select the same socket consistently. Iterates through 4 nodeId pairs (ascii lowercase, realistic uuid7-prefix neighbors, reverse-ordered, full uuids) and verifies the four-way agreement on every pair.testLowerNodeIdKeepsOutbound— anchors the convention so future refactors don't silently flip it (which would still be locally correct under the symmetry test, but would interop-break against any other implementation following MMP convention).70/70 tests pass.
Test plan
swift test)🤖 Generated with Claude Code