
fix: dedup simultaneous-dial sessions in addPeer to stop connection-flap loop #4

Merged
sym-bot merged 1 commit into main from fix/simultaneous-dial-dedup
Apr 29, 2026


sym-bot commented on Apr 29, 2026

Summary

When two peers discover each other over Bonjour within ~50ms on a LAN and both initiate outbound TCP, each side ends up with two fully handshaken sessions for the same nodeId (one outbound, one inbound). Prior to this change, the second handshake silently overwrote the first in peers[nodeId].transports without cancelling its NWConnection. The orphaned session kept a live read loop and eventually fired didDisconnectWith, at which point removeTransport stripped the surviving winner from the dict, killing the healthy connection ~1–2s after handshake.

Symptom from the field (iPhone↔Mac Catalyst, both running sym-swift main):

[SYM] session: handshake complete with Hongwei (019dd87d)
[SYM] peer: connected: Hongwei (outbound, bonjour)
SYM peer joined: Hongwei (total: 1)
cannot add handler to 4 from 1 - dropping     ← libdispatch warning from zombie's read loop
cannot add handler to 3 from 1 - dropping
[SYM] session: disconnected: Connection closed
[SYM] peer: disconnected: Hongwei
SYM peer-left pending for Hongwei — waiting 6s for reconnect

Repeats every ~6 seconds indefinitely. Zero application payloads delivered across the affected connection in any cycle. A third peer with a stable connection to the same iPhone receives mood-bearing CMBs continuously, confirming the iPhone is broadcasting fine — only the dual-dial pair is affected.

Fix

addPeer now detects a prior bonjour transport and applies a deterministic tie-break:

  • The lower nodeId acts as client and keeps its outbound; the higher nodeId keeps the matching inbound.
  • Both peers compute the same physical-connection winner from the same (localNodeId, remoteNodeId) pair, without exchanging coordination frames.
  • The losing session has its delegate detached before disconnect() so its eventual cancellation cannot ripple back through removeTransport and clobber the winner.
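
The detach-before-disconnect step in the last bullet can be illustrated with a toy model. This is only a sketch, not the code from this PR: `Session`, `SessionDelegate`, `PeerTable`, `install`, and `retire` are hypothetical names, the table is assumed to be keyed by nodeId alone, and the disconnect callback fires synchronously to keep the example deterministic.

```swift
// Hypothetical delegate protocol standing in for the real disconnect callback.
protocol SessionDelegate: AnyObject {
    func sessionDidDisconnect(_ session: Session)
}

// Toy session: no networking, just the delegate wiring that matters here.
final class Session {
    let nodeId: String
    weak var delegate: SessionDelegate?

    init(nodeId: String) { self.nodeId = nodeId }

    // Stands in for cancelling the underlying connection.
    func disconnect() {
        delegate?.sessionDidDisconnect(self)
    }
}

final class PeerTable: SessionDelegate {
    private(set) var active: [String: Session] = [:]

    // A later session for the same nodeId overwrites the earlier entry.
    func install(_ session: Session) {
        session.delegate = self
        active[session.nodeId] = session
    }

    // Retire the tie-break loser without touching the winner.
    func retire(_ loser: Session) {
        loser.delegate = nil   // detach first, so the callback has nowhere to go
        loser.disconnect()
    }

    // Assumption: removal is keyed by nodeId only, so a callback from an
    // orphaned session would remove whichever session is current.
    func sessionDidDisconnect(_ session: Session) {
        active[session.nodeId] = nil
    }
}

let table = PeerTable()
let first = Session(nodeId: "peer")
let second = Session(nodeId: "peer")
table.install(first)
table.install(second)   // overwrites `first` in the dict; `first` is now orphaned
table.retire(first)
print(table.active["peer"] === second)   // true
```

In this model, calling `first.disconnect()` directly instead of `table.retire(first)` would clear the entry for `"peer"` and drop the healthy session, which mirrors the flap described in the summary.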

The tie-break is extracted as a static helper SymNode.preferNewSessionInDualDial(...) so it's unit-testable without network mocking.
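
The convention can be sketched as a pure function. The actual signature of `SymNode.preferNewSessionInDualDial(...)` is not shown in this PR, so the function name, the parameter labels, and the `Direction` enum below are hypothetical. The sketch also assumes that nodeIds are compared as plain strings and that the existing session has the opposite direction to the new one.

```swift
enum Direction { case outbound, inbound }

// Hypothetical stand-in for the PR's static helper.
// Returns true when the newly handshaken session should replace the existing one.
func preferNewSession(localNodeId: String,
                      remoteNodeId: String,
                      newDirection: Direction) -> Bool {
    // The lower nodeId acts as client and keeps the session it dialed;
    // the higher nodeId keeps the matching inbound session.
    let localIsClient = localNodeId < remoteNodeId
    let directionToKeep: Direction = localIsClient ? .outbound : .inbound
    return newDirection == directionToKeep
}

// "019dd87d" comes from the log above; "019dd9aa" is a made-up neighbor.
print(preferNewSession(localNodeId: "019dd87d",
                       remoteNodeId: "019dd9aa",
                       newDirection: .inbound))   // false: the lower id keeps its outbound
```

Because the decision depends only on the two nodeIds and the direction, each peer can evaluate it locally and reach the complementary answer without any extra frames on the wire.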

Tests

Two new tests in SimultaneousDialDedupTests:

  • testBothPeersPickSamePhysicalConnection — symmetry property: A's outbound is the same TCP socket as B's inbound; both peers must select the same socket consistently. Iterates through 4 nodeId pairs (ascii lowercase, realistic uuid7-prefix neighbors, reverse-ordered, full uuids) and verifies the four-way agreement on every pair.
  • testLowerNodeIdKeepsOutbound — anchors the convention so future refactors don't silently flip it (which would still be locally correct under the symmetry test, but would interop-break against any other implementation following MMP convention).
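
The four-way agreement that the symmetry test checks can be sketched without XCTest. This is not the test code from the PR: `preferNewSession`, `survivingDialer`, and `allFourAgree` are hypothetical names, the node ids are illustrative, and a physical connection is identified here simply by the nodeId of the peer that dialed it.

```swift
enum Direction { case outbound, inbound }

// Hypothetical tie-break following the convention described in this PR:
// the lower nodeId keeps its outbound, the higher keeps the inbound.
func preferNewSession(localNodeId: String,
                      remoteNodeId: String,
                      newDirection: Direction) -> Bool {
    let keep: Direction = localNodeId < remoteNodeId ? .outbound : .inbound
    return newDirection == keep
}

// Returns the nodeId of the peer that dialed the connection `local` ends up keeping.
func survivingDialer(local: String, remote: String, newDirection: Direction) -> String {
    // Assumption: the existing session has the opposite direction to the new one.
    let oldDirection: Direction = (newDirection == .outbound) ? .inbound : .outbound
    let takeNew = preferNewSession(localNodeId: local,
                                   remoteNodeId: remote,
                                   newDirection: newDirection)
    let kept = takeNew ? newDirection : oldDirection
    // An outbound session was dialed locally; an inbound one was dialed by the remote.
    return kept == .outbound ? local : remote
}

// Both peers, in both arrival orders, must keep the same physical connection.
func allFourAgree(_ a: String, _ b: String) -> Bool {
    let outcomes: Set<String> = [
        survivingDialer(local: a, remote: b, newDirection: .outbound),
        survivingDialer(local: a, remote: b, newDirection: .inbound),
        survivingDialer(local: b, remote: a, newDirection: .outbound),
        survivingDialer(local: b, remote: a, newDirection: .inbound),
    ]
    return outcomes.count == 1
}

let pairs = [("a", "b"), ("b", "a"), ("019dd87d", "019dd87e")]
for (a, b) in pairs {
    print(a, b, allFourAgree(a, b))
}
```

Under this convention the surviving connection is always the one dialed by the lower nodeId, which is the property the second test pins down.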

70/70 tests pass.

Test plan

  • All existing unit tests pass (swift test)
  • New dedup tests pass
  • Verified end-to-end on iPhone↔Mac Catalyst pair — connection no longer flaps; mood-bearing CMBs deliver; remote peer's mood (valence/arousal) appears on the local mood graph instead of being stuck at neutral

🤖 Generated with Claude Code

…lap loop

When two peers Bonjour-discover each other within ~50ms on a LAN and both
initiate outbound TCP, each side ends up with two handshaked sessions for
the same nodeId (one outbound, one inbound). Prior to this change the
second handshake silently overwrote the first in `peers[nodeId].transports`
without cancelling its NWConnection. The orphaned session kept a live read
loop, eventually fired `didDisconnectWith`, and `removeTransport` stripped
the surviving winner from the dict — killing the healthy connection ~1–2s
after handshake.

Symptom from the field: `[SYM] session: handshake complete` →
`[SYM] peer: connected` → `cannot add handler to N from M - dropping`
(libdispatch warning from the zombie's read loop) → `session: disconnected:
Connection closed` → 6s pending → reconnect, repeat. No application
payloads delivered across the affected connection during any cycle.

Fix: `addPeer` now detects a prior bonjour transport and applies a
deterministic tie-break — the lower nodeId acts as client and keeps its
outbound; the higher nodeId keeps the matching inbound. Both peers compute
the same physical-connection winner without exchanging coordination
frames. The losing session has its delegate detached before `disconnect()`
so its eventual cancellation cannot ripple back through `removeTransport`
and clobber the winner.

Tie-break extracted as `SymNode.preferNewSessionInDualDial(...)` static
helper so it is unit-testable without network mocking. Two new tests:
- `testBothPeersPickSamePhysicalConnection` — symmetry: A's outbound is
  the same socket as B's inbound; both peers must agree.
- `testLowerNodeIdKeepsOutbound` — anchors the convention so future
  refactors don't silently flip it.

70/70 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sym-bot merged commit 67b5eb3 into main on Apr 29, 2026
1 check passed
sym-bot deleted the fix/simultaneous-dial-dedup branch on April 29, 2026 13:08