- Multiple processes, multiple machines
- Tolerates the failure of any two servers and continues to function
- Clear explanation of how fail-over works in the code
- Solid test suite
- Clear documentation
- Bonus: Can add a new server into its set of replicas (using make run-server, see test instructions below)
After installing all the requirements (e.g. make install && make install-dev) and loading the new environment (as detailed in the root README.md), start the first replica by running:

    make run-server MODE=grpc SERVER_ID=server1 PORT=5555

This starts the first server in the cluster. To add more replicas (say, two more), run:

    make run-server MODE=grpc SERVER_ID=server2 PORT=5556 PEERS=x.x.x.x:5555

where x.x.x.x is the address of the first server (e.g. "localhost" if it runs on the same machine, or its actual address if it runs on a separate machine), and

    make run-server MODE=grpc SERVER_ID=server3 PORT=5557 PEERS=x.x.x.x:5555,y.y.y.y:5556

where y.y.y.y is the address of the second replica.

For example, the last command might look like:

    make run-server MODE=grpc SERVER_ID=server3 PORT=5557 PEERS=10.250.10.214:5555,10.250.10.214:5556

┌──────────────────────────────────────────────────────────────────────────────┐
│                              Replication System                              │
│                                                                              │
│  ┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐  │
│  │    ReplicaNode     │    │    ReplicaNode     │    │    ReplicaNode     │  │
│  │      (Leader)      │◄───┤     (Follower)     │◄───┤     (Follower)     │  │
│  └─────────┬──────────┘    └─────────┬──────────┘    └─────────┬──────────┘  │
│            │                         │                         │             │
│            │ creates                 │ creates                 │ creates     │
│            ▼                         ▼                         ▼             │
│  ┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐  │
│  │    ReplicaState    │    │    ReplicaState    │    │    ReplicaState    │  │
│  └─────────┬──────────┘    └─────────┬──────────┘    └─────────┬──────────┘  │
│            │                         │                         │             │
│            │                         │                         │             │
│            ▼                         ▼                         ▼             │
│  ┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐  │
│  │  ElectionManager   │    │  ElectionManager   │    │  ElectionManager   │  │
│  └────────────────────┘    └────────────────────┘    └────────────────────┘  │
│                                                                              │
│  ┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐  │
│  │  HeartbeatManager  │    │  HeartbeatManager  │    │  HeartbeatManager  │  │
│  └────────────────────┘    └────────────────────┘    └────────────────────┘  │
│                                                                              │
│  ┌────────────────────┐    ┌────────────────────┐    ┌────────────────────┐  │
│  │ ReplicationManager │    │ ReplicationManager │    │ ReplicationManager │  │
│  └────────────────────┘    └────────────────────┘    └────────────────────┘  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

- ReplicaNode: Main entry point that:
  - Creates and initializes the ReplicaState
  - Creates the ElectionManager and HeartbeatManager
  - Establishes cross-references between the managers
  - Handles network joining and communication
- ReplicaState: Central state container that:
  - Stores server identification (ID, address)
  - Tracks the current term and role (leader/follower/candidate)
  - Maintains peer information and voting state
  - Is shared with both managers
- ElectionManager: Election coordinator that:
  - Manages election timers
  - Handles vote solicitation and collection
  - Triggers state transitions based on election results
  - Uses the ReplicaState for decision making
- HeartbeatManager: Communication handler that:
  - Sends heartbeats to followers (when leader)
  - Monitors leader liveness (when follower)
  - References the ElectionManager to trigger elections
  - Uses the ReplicaState to track peer status
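
The wiring described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the class names come from the list above, but every field name, constructor signature, and the `election_manager` attribute used for the cross-reference are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


@dataclass
class ReplicaState:
    """Shared state container (field names are illustrative assumptions)."""
    server_id: str
    address: str
    current_term: int = 0
    role: Role = Role.FOLLOWER
    voted_for: Optional[str] = None
    peers: Dict[str, str] = field(default_factory=dict)  # peer id -> address


class ElectionManager:
    def __init__(self, state: ReplicaState) -> None:
        self.state = state  # reads and updates the shared state


class HeartbeatManager:
    def __init__(self, state: ReplicaState) -> None:
        self.state = state
        # Set by ReplicaNode so a missed leader heartbeat can trigger an election.
        self.election_manager: Optional[ElectionManager] = None


class ReplicaNode:
    def __init__(self, server_id: str, address: str) -> None:
        # One ReplicaState instance is created and handed to both managers.
        self.state = ReplicaState(server_id, address)
        self.election_manager = ElectionManager(self.state)
        self.heartbeat_manager = HeartbeatManager(self.state)
        # Cross-reference: the heartbeat manager can reach the election manager.
        self.heartbeat_manager.election_manager = self.election_manager


node = ReplicaNode("server1", "localhost:5555")
```

Because both managers hold the same `ReplicaState` object, a term or role change made by one is immediately visible to the other, which is the point of keeping the state in a single shared container.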
┌────────────┐    timeout     ┌────────────┐ majority votes ┌────────────┐
│            │───────────────►│            │───────────────►│            │
│  Follower  │                │ Candidate  │                │   Leader   │
│            │◄───────────────┤            │◄───────────────┤            │
└────────────┘  higher term   └────────────┘  higher term   └────────────┘
      ▲                             │                             │
      │                             │                             │
      └─────────────────────────────┴─────────────────────────────┘
                               higher term
- Initialization: All nodes start as followers with randomized election timeouts.
- Election trigger: A follower becomes a candidate when it receives no heartbeat before its election timeout expires. It then:
  - Increments its term
  - Votes for itself
  - Requests votes from its peers
- Leadership: A candidate becomes leader when it receives a majority of the votes. It then:
  - Begins sending heartbeats
  - Coordinates all write operations
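
The election steps above can be sketched as a small state machine. This is a hedged illustration under stated assumptions, not the project's implementation: the `ElectionSketch` class, its method names, the timeout values, and the choice to count a majority over a fixed `cluster_size` are all hypothetical.

```python
import random
from enum import Enum

# Illustrative values in seconds; the real settings live in the project's config.
ELECTION_TIMEOUT_MIN = 1.5
ELECTION_TIMEOUT_MAX = 3.0


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


def random_election_timeout() -> float:
    """Pick a timeout in the configured range so nodes rarely time out together."""
    return random.uniform(ELECTION_TIMEOUT_MIN, ELECTION_TIMEOUT_MAX)


class ElectionSketch:
    def __init__(self, server_id: str, cluster_size: int) -> None:
        self.server_id = server_id
        self.cluster_size = cluster_size  # assumption: majority is over all nodes
        self.term = 0
        self.role = Role.FOLLOWER
        self.voted_for = None
        self.votes = set()

    def on_election_timeout(self) -> dict:
        """No heartbeat arrived in time: become a candidate and ask for votes."""
        self.role = Role.CANDIDATE
        self.term += 1
        self.voted_for = self.server_id
        self.votes = {self.server_id}
        # The returned dict stands in for the vote request sent to each peer.
        return {"term": self.term, "candidate_id": self.server_id}

    def on_vote_granted(self, voter_id: str, term: int) -> None:
        """Count a vote; become leader once a majority is reached."""
        if self.role is not Role.CANDIDATE or term != self.term:
            return  # stale vote or no election in progress
        self.votes.add(voter_id)
        if len(self.votes) > self.cluster_size // 2:
            self.role = Role.LEADER

    def on_higher_term(self, term: int) -> None:
        """Any message carrying a higher term sends the node back to follower."""
        if term > self.term:
            self.term = term
            self.role = Role.FOLLOWER
            self.voted_for = None
            self.votes = set()


node = ElectionSketch("server1", cluster_size=3)
request = node.on_election_timeout()
node.on_vote_granted("server2", request["term"])
```

In this three-node example the candidate's own vote plus one granted vote already form a majority, so `node.role` ends up as `Role.LEADER` in term 1.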
┌──────────┐    Write     ┌──────────┐  Replicate   ┌──────────┐
│  Client  │─────────────►│  Leader  │─────────────►│ Follower │
└──────────┘              └──────────┘              └──────────┘
                               │                         │
                               │                         │
                               ▼                         ▼
                          ┌──────────┐              ┌──────────┐
                          │  Apply   │              │  Apply   │
                          │ Changes  │              │ Changes  │
                          └──────────┘              └──────────┘
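
The write path in the diagram above can be sketched in a few lines. This is a simplified illustration, not the project's code: `LeaderSketch` and `FollowerSketch` are hypothetical names, direct method calls stand in for the gRPC requests, and the order shown (leader applies locally, then replicates) is an assumption.

```python
from typing import Dict, List


class FollowerSketch:
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def replicate(self, key: str, value: str) -> bool:
        """Apply a change forwarded by the leader and acknowledge it."""
        self.data[key] = value
        return True


class LeaderSketch:
    def __init__(self, followers: List[FollowerSketch]) -> None:
        self.data: Dict[str, str] = {}
        self.followers = followers

    def write(self, key: str, value: str) -> int:
        """Apply a client write locally, then forward it to every follower.

        Returns the number of followers that acknowledged the change.
        """
        self.data[key] = value
        acks = 0
        for follower in self.followers:
            if follower.replicate(key, value):
                acks += 1
        return acks


followers = [FollowerSketch(), FollowerSketch()]
leader = LeaderSketch(followers)
acks = leader.write("greeting", "hello")
```

After the call, the leader and both followers hold the same entry, and `acks` reports how many followers confirmed it.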
Leader failure:
- Followers detect missing heartbeats
- A new election starts after the election timeout
- The new leader takes over

Follower failure:
- The leader tracks missed acknowledgments
- It continues with the remaining followers
- Rejoining followers catch up

Configuration parameters:
- ELECTION_TIMEOUT_MIN/MAX: Range from which each node's random election timeout is drawn
- HEARTBEAT_INTERVAL: Time between heartbeats
- MAX_MISSED_HEARTBEATS: Number of missed heartbeats after which a node is marked as down
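
To make the role of MAX_MISSED_HEARTBEATS concrete, here is a minimal sketch of how a leader might track missed acknowledgments per peer. The `PeerTracker` class and its methods are hypothetical, the constant values are placeholders rather than the project's defaults, and treating a peer as down once the count reaches the threshold (rather than exceeds it) is an assumption.

```python
from typing import Dict, Iterable, List

# Placeholder values; the project's actual defaults may differ.
HEARTBEAT_INTERVAL = 0.5       # seconds between heartbeats
MAX_MISSED_HEARTBEATS = 3      # misses before a peer is considered down


class PeerTracker:
    def __init__(self, peer_ids: Iterable[str],
                 max_missed: int = MAX_MISSED_HEARTBEATS) -> None:
        self.missed: Dict[str, int] = {peer: 0 for peer in peer_ids}
        self.max_missed = max_missed

    def record_ack(self, peer_id: str) -> None:
        """A heartbeat was acknowledged: the peer is alive, reset its counter."""
        self.missed[peer_id] = 0

    def record_miss(self, peer_id: str) -> None:
        """A heartbeat went unanswered within HEARTBEAT_INTERVAL."""
        self.missed[peer_id] += 1

    def is_down(self, peer_id: str) -> bool:
        return self.missed[peer_id] >= self.max_missed

    def live_peers(self) -> List[str]:
        """Peers the leader should keep replicating to."""
        return sorted(p for p in self.missed if not self.is_down(p))


tracker = PeerTracker(["server2", "server3"])
for _ in range(MAX_MISSED_HEARTBEATS):
    tracker.record_miss("server3")
```

At this point `tracker.live_peers()` contains only `server2`; a later `tracker.record_ack("server3")` resets the counter, which models a rejoining follower being treated as live again.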