Fix overlay network IP assignment using Peers-based positioning#57

Closed
firecow wants to merge 9 commits into main from fix/peers-based-ip-assignment

Conversation

@firecow firecow commented Mar 3, 2026

Summary

  • Replace FNV hash-based node offset with Peers list-based deterministic positioning for overlay network IP assignment
  • Each overlay container gets a unique overlay index, selecting its designated IP from the node's band (no more competing for the same candidates)
  • Multi-round candidate generation: 3 rounds of fallback IPs per container, spaced across the top of the subnet
  • Pre-filter IPs already visible on this node to skip known conflicts instantly (avoids 20s Docker timeout)
  • Remove Docker-assigned fallback: skip the network instead of getting a low IP that collides with Swarm

Test plan

  • All 38 unit tests pass (broadcastAddr, addToIP, computeIPCandidates, multi-round, overlay index selection)
  • Multi-round: verified no IP overlap across peers, rounds, and container lanes
  • Tested on staging (swarm-node1-stage-lp1.spilnu.dk): heartbeat gets .254, metricbeat gets .253 on all networks except those where Swarm holds the IP
  • Verify on production after deploy

Replace the FNV hash-based nodeOffset with a deterministic
position derived from the network's Peers list. Each node gets
a tight band of IPs at the very top of the subnet based on its
sorted index among peers, eliminating collisions on large
networks where the hash spread overlapped with Swarm allocations.
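
The band idea above can be sketched as follows. This is a minimal illustration, not the PR's code: it assumes IPv4, assumes each node reserves `perNode` contiguous addresses counting down from the address just below broadcast, and uses a hypothetical helper name `bandFor`.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"net"
	"sort"
)

// bandFor returns the addresses reserved for self, counting down from the
// address just below the subnet's broadcast address.
// Hypothetical helper: the exact band layout in the PR may differ.
func bandFor(cidr string, peers []string, self string, perNode int) ([]string, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	base := ipnet.IP.To4()
	if base == nil {
		return nil, fmt.Errorf("only IPv4 is handled in this sketch: %s", cidr)
	}

	// Parse into a fresh slice and sort numerically; the caller's slice is untouched.
	sorted := make([]net.IP, 0, len(peers))
	for _, p := range peers {
		ip := net.ParseIP(p).To4()
		if ip == nil {
			return nil, fmt.Errorf("invalid IPv4 peer: %q", p)
		}
		sorted = append(sorted, ip)
	}
	sort.Slice(sorted, func(i, j int) bool { return bytes.Compare(sorted[i], sorted[j]) < 0 })

	// The node's position among its peers selects its band.
	selfIP := net.ParseIP(self).To4()
	idx := -1
	for i, ip := range sorted {
		if ip.Equal(selfIP) {
			idx = i
			break
		}
	}
	if idx < 0 {
		return nil, fmt.Errorf("node %s is not in the peers list", self)
	}

	ones, bits := ipnet.Mask.Size()
	size := uint32(1) << uint(bits-ones) // number of addresses in the subnet
	broadcast := binary.BigEndian.Uint32(base) + size - 1

	out := make([]string, 0, perNode)
	for k := 0; k < perNode; k++ {
		offset := uint32(1 + idx*perNode + k) // the 1 skips the broadcast address
		if offset >= size-1 {                 // stop before reaching the network address
			break
		}
		ip := make(net.IP, net.IPv4len)
		binary.BigEndian.PutUint32(ip, broadcast-offset)
		out = append(out, ip.String())
	}
	return out, nil
}

func main() {
	peers := []string{"10.0.0.10", "10.0.0.2", "10.0.0.3"}
	for _, self := range peers {
		band, err := bandFor("10.0.1.0/24", peers, self, 2)
		fmt.Println(self, band, err)
	}
}
```

With two containers per node, the lowest-numbered peer would get the top two usable addresses, and each later peer the next two below them.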
@firecow firecow self-assigned this Mar 3, 2026
firecow added 8 commits March 3, 2026 21:32
Extract computeIPCandidates as a pure function for testability.
Tests cover /21, /24, /16, /28, /30 subnets with 1-100 peers
and 1-5 containers, including edge cases like subnet overflow,
byte boundary crossing, unknown node, and ordering invariance.
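
For context, the helpers named in the test plan might look roughly like this. Only the names `broadcastAddr` and `addToIP` come from the PR description; the signatures, the IPv4-only handling, and the `candidateAt` wrapper are assumptions made for this sketch. It shows the byte boundary crossing and subnet overflow cases the tests mention.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// broadcastAddr returns the last address of an IPv4 subnet, or nil otherwise.
// Assumed signature; only the name appears in the PR.
func broadcastAddr(n *net.IPNet) net.IP {
	ip := n.IP.To4()
	if ip == nil || len(n.Mask) != net.IPv4len {
		return nil
	}
	out := make(net.IP, net.IPv4len)
	for i := range out {
		out[i] = ip[i] | ^n.Mask[i] // set every host bit
	}
	return out
}

// addToIP adds a signed delta to an IPv4 address, carrying across byte
// boundaries. It returns nil on non-IPv4 input or 32-bit overflow.
func addToIP(ip net.IP, delta int) net.IP {
	v4 := ip.To4()
	if v4 == nil {
		return nil
	}
	n := int64(binary.BigEndian.Uint32(v4)) + int64(delta)
	if n < 0 || n > 0xFFFFFFFF {
		return nil
	}
	out := make(net.IP, net.IPv4len)
	binary.BigEndian.PutUint32(out, uint32(n))
	return out
}

// candidateAt is a hypothetical wrapper: the address `below` steps under
// broadcast, or an error if that address falls outside the subnet.
func candidateAt(cidr string, below int) (string, error) {
	_, n, err := net.ParseCIDR(cidr)
	if err != nil {
		return "", err
	}
	ip := addToIP(broadcastAddr(n), -below)
	if ip == nil || !n.Contains(ip) {
		return "", fmt.Errorf("offset %d leaves subnet %s", below, cidr)
	}
	return ip.String(), nil
}

func main() {
	fmt.Println(candidateAt("10.0.0.0/21", 1))   // just below broadcast
	fmt.Println(candidateAt("10.0.0.0/21", 256)) // crosses a byte boundary
	fmt.Println(candidateAt("10.0.0.0/30", 4))   // overflows a tiny subnet
}
```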
- Clone peerIPs before sorting in computeIPCandidates to avoid
  mutating the caller's slice
- Fix test expectations that relied on lexicographic sort side effects
- Add tests: input mutation guards, subnet bounds, contiguous IPs,
  gateway avoidance, peer join stability, real-world spilnu-shared
  scenario, /30 edge cases
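
An input mutation guard of the kind listed above could be sketched like this. All names here are hypothetical, and plain string ordering is used only to keep the example short; the point is the copy-before-sort pattern and the check that detects its absence.

```go
package main

import (
	"fmt"
	"sort"
)

// sortInPlace shows the bug: it reorders the caller's slice.
func sortInPlace(peers []string) []string {
	sort.Strings(peers)
	return peers
}

// sortedCopy clones first, so the caller's slice keeps its order.
func sortedCopy(peers []string) []string {
	cp := append([]string(nil), peers...)
	sort.Strings(cp)
	return cp
}

// mutates reports whether fn changes the slice it is given.
func mutates(fn func([]string) []string, input []string) bool {
	before := append([]string(nil), input...)
	fn(input)
	for i := range input {
		if input[i] != before[i] {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(mutates(sortInPlace, []string{"10.0.0.3", "10.0.0.1"}))
	fmt.Println(mutates(sortedCopy, []string{"10.0.0.3", "10.0.0.1"}))
}
```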
Lexicographic sort caused 10.0.0.10 to sort before 10.0.0.2,
which meant adding a 10th node to a 9-node cluster would shift
all existing nodes' IP bands. Numeric sort (bytes.Compare on
parsed IPs) ensures new higher-IP nodes always append, so
existing bands are never disrupted.

Also remove unused containerName parameter from highIPCandidates.
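
The sort-order problem described in this commit is easy to reproduce. The sketch below compares the two orderings on a three-node list; the function names are illustrative, and `bytes.Compare` is applied to the 4-byte form of each parsed address.

```go
package main

import (
	"bytes"
	"fmt"
	"net"
	"sort"
)

// lexicographic sorts a copy of peers as plain strings.
func lexicographic(peers []string) []string {
	out := append([]string(nil), peers...)
	sort.Strings(out)
	return out
}

// numeric sorts a copy of peers by address value.
// Unparseable entries become nil and sort first in this sketch.
func numeric(peers []string) []string {
	out := append([]string(nil), peers...)
	sort.Slice(out, func(i, j int) bool {
		a := net.ParseIP(out[i]).To4()
		b := net.ParseIP(out[j]).To4()
		return bytes.Compare(a, b) < 0
	})
	return out
}

// indexOf returns the position of want in list, or -1.
func indexOf(list []string, want string) int {
	for i, v := range list {
		if v == want {
			return i
		}
	}
	return -1
}

func main() {
	peers := []string{"10.0.0.2", "10.0.0.10", "10.0.0.1"}
	fmt.Println(lexicographic(peers)) // .10 lands between .1 and .2
	fmt.Println(numeric(peers))       // .10 lands last
}
```

Under string ordering, the new `10.0.0.10` node pushes `10.0.0.2` from index 1 to index 2, which would move its band. Under numeric ordering, `10.0.0.2` stays at index 1.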
Instead of all containers trying the same candidate list and competing
(causing allocation failures), each container now gets only its specific
IP based on its overlay index within the config.
Tests the production scenario: multiple overlay containers on the same
node get different designated IPs, including simulation of the run()
loop's index counting with mixed overlay and non-overlay containers.
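
A rough sketch of the overlay index counting described in the two commits above. The `container` type, its `Overlay` flag, and both function names are hypothetical; the real run() loop is not shown in this PR page. The counter advances only for overlay containers, so non-overlay containers in between do not shift anyone's designated IP.

```go
package main

import "fmt"

// container is a hypothetical stand-in for one entry in the config.
type container struct {
	Name    string
	Overlay bool // true if the container attaches to overlay networks
}

// overlayIndexes maps each overlay container to its zero-based position
// among overlay containers only.
func overlayIndexes(cs []container) map[string]int {
	out := make(map[string]int)
	idx := 0
	for _, c := range cs {
		if !c.Overlay {
			continue // non-overlay containers do not consume an index
		}
		out[c.Name] = idx
		idx++
	}
	return out
}

// designatedIP picks the single IP for an overlay index from the node's band.
// ok is false when the band has no slot for that index.
func designatedIP(band []string, overlayIndex int) (ip string, ok bool) {
	if overlayIndex < 0 || overlayIndex >= len(band) {
		return "", false
	}
	return band[overlayIndex], true
}

func main() {
	cs := []container{
		{Name: "heartbeat", Overlay: true},
		{Name: "logrotate", Overlay: false},
		{Name: "metricbeat", Overlay: true},
	}
	band := []string{"10.0.1.254", "10.0.1.253"}
	idx := overlayIndexes(cs)
	for _, c := range cs {
		if i, isOverlay := idx[c.Name]; isOverlay {
			ip, ok := designatedIP(band, i)
			fmt.Println(c.Name, ip, ok)
		}
	}
}
```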
…allback

When an overlay NetworkConnect with a specific IP fails (e.g. the IP is held by
a Swarm service on another node), Docker returns "context deadline exceeded"
after 20 seconds instead of an immediate error. The old code then fell back to
a Docker-assigned IP, which is a low address that collides with Swarm's
bottom-up allocation and causes "could not allocate IP from IPAM: Address
already in use" task allocation failures.

Three changes:

1. Multi-round candidates: computeIPCandidatesMultiRound generates IPs
   across 3 rounds (spaced below all peers' primary bands). Each container
   gets 3 fallback IPs in its own lane instead of just 1.

2. Pre-filter locally visible IPs: highIPCandidates checks
   networkInfo.Containers to skip IPs already held on this node, avoiding
   the 20-second timeout for known conflicts.

3. Remove Docker-assigned fallback: if all candidate IPs fail, skip the
   network instead of getting a dangerous low IP. The next run cycle will
   retry.
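
The three changes can be sketched together as below. This is an illustration under stated assumptions, not the PR's `computeIPCandidatesMultiRound` or `highIPCandidates`: the round spacing (one full stride of `numPeers * perNode` per round), the function names, and the `connect` callback are all hypothetical, subnet bounds checks are omitted, and how the in-use set is built from `networkInfo.Containers` is left out.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"net"
)

const rounds = 3

// errTaken simulates a failed connect attempt in this sketch.
var errTaken = errors.New("address in use")

// laneCandidates returns one candidate per round for a container lane.
// Each round sits one full stride below the previous one, so it lands
// below every peer's band from the earlier round. Bounds checks omitted.
func laneCandidates(broadcast string, numPeers, perNode, nodeIdx, lane int) []string {
	b4 := net.ParseIP(broadcast).To4()
	if b4 == nil {
		return nil
	}
	b := binary.BigEndian.Uint32(b4)
	stride := numPeers * perNode
	out := make([]string, 0, rounds)
	for r := 0; r < rounds; r++ {
		offset := uint32(1 + r*stride + nodeIdx*perNode + lane)
		ip := make(net.IP, net.IPv4len)
		binary.BigEndian.PutUint32(ip, b-offset)
		out = append(out, ip.String())
	}
	return out
}

// dropInUse removes candidates already visible on this node, so no
// connect attempt (and no long timeout) is spent on a known conflict.
func dropInUse(candidates []string, inUse map[string]bool) []string {
	out := make([]string, 0, len(candidates))
	for _, ip := range candidates {
		if !inUse[ip] {
			out = append(out, ip)
		}
	}
	return out
}

// pickIP tries the remaining candidates in order. ok=false means every
// candidate failed and the network should be skipped until the next run;
// there is deliberately no fallback to a Docker-assigned address.
func pickIP(candidates []string, inUse map[string]bool, connect func(ip string) error) (ip string, ok bool) {
	for _, c := range dropInUse(candidates, inUse) {
		if err := connect(c); err == nil {
			return c, true
		}
	}
	return "", false
}

func main() {
	cands := laneCandidates("10.0.7.255", 3, 2, 0, 0)
	inUse := map[string]bool{cands[0]: true}
	ip, ok := pickIP(cands, inUse, func(string) error { return nil })
	fmt.Println(cands, ip, ok)
}
```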
@firecow firecow changed the title Fix overlay IP collisions using Peers-based positioning Fix overlay network IP assignment using Peers-based positioning Mar 3, 2026
@firecow firecow closed this Mar 3, 2026
@firecow firecow deleted the fix/peers-based-ip-assignment branch March 3, 2026 22:03