Fix overlay network IP assignment using Peers-based positioning #57
Closed
Replace the FNV hash-based nodeOffset with a deterministic position derived from the network's Peers list. Each node gets a tight band of IPs at the very top of the subnet based on its sorted index among peers, eliminating collisions on large networks where the hash spread overlapped with Swarm allocations.
Extract computeIPCandidates as a pure function for testability. Tests cover /21, /24, /16, /28, /30 subnets with 1-100 peers and 1-5 containers, including edge cases like subnet overflow, byte boundary crossing, unknown node, and ordering invariance.
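A minimal sketch of the positioning idea, not the code from this PR. `sketchIPCandidates` and its `bandSize` parameter are hypothetical names, and the exact layout (bands counted down from just below the broadcast address, with network+1 treated as the gateway) is an assumption made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
	"slices"
)

// sketchIPCandidates is a hypothetical stand-in for computeIPCandidates.
// Assumed layout: the node at sorted index i owns bandSize addresses,
// counted down from just below the broadcast address after skipping i bands.
func sketchIPCandidates(cidr string, peerIPs []string, self string, bandSize int) ([]string, error) {
	subnet, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	if b := subnet.Bits(); !subnet.Addr().Is4() || b < 1 || b > 30 {
		return nil, fmt.Errorf("unsupported subnet %s", cidr)
	}

	// Parse into a fresh slice so the caller's input is never reordered.
	sorted := make([]netip.Addr, 0, len(peerIPs))
	for _, p := range peerIPs {
		a, err := netip.ParseAddr(p)
		if err != nil {
			return nil, err
		}
		sorted = append(sorted, a)
	}
	// Numeric ordering, so 10.0.0.2 sorts before 10.0.0.10.
	slices.SortFunc(sorted, func(a, b netip.Addr) int { return a.Compare(b) })

	selfAddr, err := netip.ParseAddr(self)
	if err != nil {
		return nil, err
	}
	idx := slices.Index(sorted, selfAddr)
	if idx < 0 {
		return nil, fmt.Errorf("node %s is not in the peer list", self)
	}

	base := subnet.Masked().Addr().As4()
	network := binary.BigEndian.Uint32(base[:])
	size := uint32(1) << (32 - subnet.Bits())
	broadcast := network + size - 1

	var out []string
	for k := 0; k < bandSize; k++ {
		offset := uint32(idx*bandSize + k + 1)
		// Subnet overflow guard: stop before reaching network+1,
		// which this sketch assumes is the gateway.
		if offset >= size-2 {
			break
		}
		var b [4]byte
		binary.BigEndian.PutUint32(b[:], broadcast-offset)
		out = append(out, netip.AddrFrom4(b).String())
	}
	return out, nil
}
```

Because the arithmetic is done on a 32-bit integer, a band that crosses a byte boundary needs no special handling. A band that would run past the bottom of the subnet is truncated rather than wrapped.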
- Clone peerIPs before sorting in computeIPCandidates to avoid mutating the caller's slice
- Fix test expectations that relied on lexicographic sort side effects
- Add tests: input mutation guards, subnet bounds, contiguous IPs, gateway avoidance, peer join stability, real-world spilnu-shared scenario, /30 edge cases
Lexicographic sort caused 10.0.0.10 to sort before 10.0.0.2, which meant adding a 10th node to a 9-node cluster would shift all existing nodes' IP bands. Numeric sort (bytes.Compare on parsed IPs) ensures new higher-IP nodes always append, so existing bands are never disrupted. Also remove unused containerName parameter from highIPCandidates.
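The ordering difference can be reproduced in isolation. The helper names below are hypothetical and operate on plain strings for brevity. Only the use of bytes.Compare on parsed IPs comes from the description above.

```go
package main

import (
	"bytes"
	"net"
	"sort"
)

// sortPeersLexicographic returns a copy sorted as plain strings,
// which is the ordering that shifted existing bands.
func sortPeersLexicographic(peers []string) []string {
	out := append([]string(nil), peers...)
	sort.Strings(out)
	return out
}

// sortPeersNumeric returns a copy sorted by address value.
// Strings that fail to parse compare as nil and therefore sort first.
func sortPeersNumeric(peers []string) []string {
	out := append([]string(nil), peers...)
	sort.Slice(out, func(i, j int) bool {
		a := net.ParseIP(out[i]).To16()
		b := net.ParseIP(out[j]).To16()
		return bytes.Compare(a, b) < 0
	})
	return out
}
```

With the peers 10.0.0.1, 10.0.0.2 and 10.0.0.10, the lexicographic version places 10.0.0.10 second, so the node at 10.0.0.2 moves from index 1 to index 2 when the new peer joins. The numeric version appends 10.0.0.10 at the end and leaves the existing indices alone.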
Instead of all containers trying the same candidate list and competing (causing allocation failures), each container now gets only its specific IP based on its overlay index within the config.
Tests the production scenario: multiple overlay containers on the same node get different designated IPs, including simulation of the run() loop's index counting with mixed overlay and non-overlay containers.
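A rough sketch of the index counting described above, assuming a simplified config shape. `containerSpec`, its fields, and `designatedIPs` are hypothetical and not taken from this PR. The point is that only overlay containers advance the index, so non-overlay containers never consume a slot in the node's band.

```go
package main

// containerSpec is a hypothetical, simplified view of one configured container.
type containerSpec struct {
	Name    string
	Overlay bool // true if the container attaches to the overlay network
}

// designatedIPs maps each overlay container to exactly one IP from the
// node's band, chosen by the container's overlay index within the config.
// Overlay containers beyond the end of the band get no entry.
func designatedIPs(containers []containerSpec, band []string) map[string]string {
	assigned := make(map[string]string)
	overlayIdx := 0
	for _, c := range containers {
		if !c.Overlay {
			continue // non-overlay containers do not advance the index
		}
		if overlayIdx < len(band) {
			assigned[c.Name] = band[overlayIdx]
		}
		overlayIdx++
	}
	return assigned
}
```

Since every overlay container has its own slot, two containers on the same node never attempt the same address.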
…allback

When an overlay NetworkConnect with a specific IP fails (e.g. IP held by a Swarm service on another node), Docker returns context deadline exceeded after 20 seconds instead of an immediate error. The old code then fell back to Docker-assigned, which gives low IPs that collide with Swarm's bottom-up allocation, causing "could not allocate IP from IPAM: Address already in use" task allocation failures.

Three changes:
1. Multi-round candidates: computeIPCandidatesMultiRound generates IPs across 3 rounds (spaced below all peers' primary bands). Each container gets 3 fallback IPs in its own lane instead of just 1.
2. Pre-filter locally visible IPs: highIPCandidates checks networkInfo.Containers to skip IPs already held on this node, avoiding the 20-second timeout for known conflicts.
3. Remove Docker-assigned fallback: if all candidate IPs fail, skip the network instead of getting a dangerous low IP. The next run cycle will retry.
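The three changes can be sketched roughly as follows. All function names here are hypothetical, the round spacing formula is an assumption about how "spaced below all peers' primary bands" might be computed, and the `connect` callback stands in for the real NetworkConnect call. Locally held addresses are assumed to arrive in CIDR form such as 10.0.0.252/24.

```go
package main

import (
	"errors"
	"strings"
)

// multiRoundOffsets returns one offset-from-top per round for a container.
// Assumed layout: each round starts below every peer's band of the previous round.
func multiRoundOffsets(nodeIdx, containerIdx, numPeers, bandSize, rounds int) []int {
	offsets := make([]int, 0, rounds)
	for r := 0; r < rounds; r++ {
		offsets = append(offsets, r*numPeers*bandSize+nodeIdx*bandSize+containerIdx+1)
	}
	return offsets
}

// dropLocallyHeld removes candidates already held by endpoints visible on
// this node, so known conflicts never reach the slow connect path.
func dropLocallyHeld(candidates, held []string) []string {
	inUse := make(map[string]bool, len(held))
	for _, h := range held {
		ip, _, _ := strings.Cut(h, "/") // strip the prefix length if present
		inUse[ip] = true
	}
	var free []string
	for _, c := range candidates {
		if !inUse[c] {
			free = append(free, c)
		}
	}
	return free
}

var errNoCandidate = errors.New("no candidate IP could be attached; skipping network until next cycle")

// connectFirst tries each candidate in order. It never falls back to an
// engine-assigned address: if every candidate fails, the caller skips the network.
func connectFirst(candidates []string, connect func(ip string) error) (string, error) {
	for _, ip := range candidates {
		if err := connect(ip); err == nil {
			return ip, nil
		}
	}
	return "", errNoCandidate
}
```

Returning an error instead of attaching with an engine-chosen address keeps the container off the low end of the subnet, at the cost of leaving it disconnected until the next cycle.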