A lightweight, production-ready distributed cluster coordination library for Go
ClusterKit handles cluster coordination so you can focus on building your application.
Features • Quick Start • Service Discovery • Event Hooks • Documentation • Examples
Building a distributed system is hard. You need to solve:
- "Where does this data go?" - Partition assignment across nodes
- "Who's in charge?" - Leader election and consensus
- "Is everyone alive?" - Health checking and failure detection
- "What happens when nodes join/leave?" - Rebalancing and data migration
- "How do I know when to move data?" - Event notifications
Most developers end up doing one of the following:
- ❌ Reinventing the wheel - Writing complex coordination logic from scratch
- ❌ Over-engineering - Using heavy frameworks that dictate your entire architecture
- ❌ Coupling tightly - Mixing coordination logic with business logic
ClusterKit provides just the coordination layer - nothing more, nothing less.
┌─────────────────────────────────────────────────────────┐
│ Your Application (Storage, Replication, Business Logic) │
├─────────────────────────────────────────────────────────┤
│ ClusterKit (Coordination, Partitioning, Consensus) │
└─────────────────────────────────────────────────────────┘
You get:
- ✅ Production-ready coordination (Raft consensus, health checking)
- ✅ Simple API (7 methods + hooks)
- ✅ Complete flexibility (bring your own storage/replication)
- ✅ Zero lock-in (just a library, not a framework)
ClusterKit is a coordination library that manages the distributed aspects of your cluster while giving you complete control over data storage and replication. It handles:
- ✅ Partition Management - Consistent hashing to determine which partition owns a key
- ✅ Node Discovery - Automatic cluster membership and health monitoring
- ✅ Service Discovery - Register and discover application services across nodes
- ✅ Leader Election - Raft-based consensus for cluster decisions
- ✅ Rebalancing - Automatic partition redistribution when nodes join/leave
- ✅ Event Hooks - Rich notifications for partition changes, node lifecycle events
- ✅ Failure Detection - Automatic health checking and node removal
- ✅ Rejoin Handling - Smart detection and data sync for returning nodes
You control:
- 🔧 Data storage (PostgreSQL, Redis, files, memory, etc.)
- 🔧 Replication protocol (HTTP, gRPC, TCP, etc.)
- 🔧 Consistency model (strong, eventual, causal, etc.)
- 🔧 Business logic
- Simple API - 7 core methods + rich event hooks
- Minimal Configuration - Only 2 required fields (NodeID, HTTPAddr)
- Service Discovery - Register multiple services per node (HTTP, gRPC, WebSocket, etc.)
- Production-Ready - WAL, snapshots, crash recovery, metrics
- Health Checking - Automatic failure detection and node removal
- Smart Rejoin - Detects returning nodes and triggers data sync
- Rich Context - Events include timestamps, reasons, offline duration, partition ownership
- 7 Lifecycle Hooks - OnPartitionChange, OnNodeJoin, OnNodeRejoin, OnNodeLeave, OnRebalanceStart, OnRebalanceComplete, OnClusterHealthChange
- Async Execution - Hooks run in background goroutines (max 50 concurrent; see the sketch after this list)
- Panic Recovery - Hooks are isolated and won't crash your application
- Raft Consensus - Built on HashiCorp Raft for strong consistency
- Consistent Hashing - MD5-based partition assignment
- Configurable Replication - Set replication factor (default: 3)
- HTTP API - RESTful endpoints for cluster information
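
A minimal sketch of the async execution and panic recovery described above, using a channel semaphore to bound concurrency. The names and structure here are illustrative, not ClusterKit's internals:

package main

import (
	"log"
	"time"
)

// hookSem bounds the number of concurrent hook goroutines (ClusterKit caps at 50).
var hookSem = make(chan struct{}, 50)

// dispatch runs fn in the background. A panicking hook is logged and
// discarded instead of crashing the process.
func dispatch(name string, fn func()) {
	hookSem <- struct{}{} // acquire a slot
	go func() {
		defer func() { <-hookSem }() // release the slot
		defer func() {
			if r := recover(); r != nil {
				log.Printf("hook %s panicked: %v", name, r)
			}
		}()
		fn()
	}()
}

func main() {
	dispatch("OnNodeJoin", func() { panic("boom") }) // isolated, app keeps running
	dispatch("OnNodeJoin", func() { log.Println("hook ran") })
	time.Sleep(100 * time.Millisecond) // give the goroutines time to finish
}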
ClusterKit uses a layered architecture combining Raft consensus with consistent hashing:
┌─────────────────────────────────────────────────────────────┐
│ Your Application Layer │
│ (KV Store, Cache, Queue, Custom Logic) │
└────────────────────┬────────────────────────────────────────┘
│ API Calls
┌────────────────────▼────────────────────────────────────────┐
│ ClusterKit Public API │
│ GetPartition() • IsPrimary() • GetReplicas() • Hooks │
└────────────────────┬────────────────────────────────────────┘
│
┌────────────────────▼────────────────────────────────────────┐
│ Coordination Layer │
│ ┌──────────────┬──────────────┬──────────────────────┐ │
│ │ Partition │ Health │ Hook │ │
│ │ Manager │ Checker │ Manager │ │
│ │ │ │ │ │
│ │ • Consistent │ • Heartbeats │ • Event dispatch │ │
│ │ Hashing │ • Failure │ • Async execution │ │
│ │ • Rebalance │ detection │ • 7 lifecycle hooks │ │
│ └──────┬───────┴──────┬───────┴──────────┬───────────┘ │
│ │ │ │ │
│ ┌──────▼──────────────▼──────────────────▼───────────┐ │
│ │ Raft Consensus Layer │ │
│ │ • Leader election │ │
│ │ • Log replication │ │
│ │ • State machine (cluster state) │ │
│ └──────────────────┬──────────────────────────────────┘ │
└─────────────────────┼──────────────────────────────────────┘
│
┌─────────────────────▼──────────────────────────────────────┐
│ Persistence Layer │
│ • WAL (Write-Ahead Log) │
│ • Snapshots (cluster state) │
│ • JSON state files │
└────────────────────────────────────────────────────────────┘
- Partition Assignment - MD5 hash of key → partition ID (0-63)
- Node Selection - Consistent hashing assigns partitions to nodes
- Consensus - Raft ensures all nodes agree on cluster state
- Health Monitoring - Periodic checks detect failures
- Rebalancing - Automatic when topology changes
- Event Notification - Hooks fire for lifecycle events
See docs/architecture.md for detailed design
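
For intuition, the first step can be sketched in a few lines: hash the key with MD5 and reduce the digest to one of the 64 partition IDs. This is a simplified illustration, not ClusterKit's exact implementation:

package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

const partitionCount = 64 // default: partition IDs 0-63

// partitionFor maps a key to a partition ID via an MD5 hash.
func partitionFor(key string) int {
	sum := md5.Sum([]byte(key))
	// Interpret the first 8 bytes of the digest as an unsigned integer.
	return int(binary.BigEndian.Uint64(sum[:8]) % partitionCount)
}

func main() {
	// The same key always lands on the same partition.
	fmt.Println(partitionFor("user:123"))
}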
See ClusterKit in action with a 3-node cluster:
# Clone the repository
git clone https://github.com/skshohagmiah/clusterkit
cd clusterkit/example/sync
# Start 3-node cluster
./run.sh
# Output shows:
# ✅ Node formation
# ✅ Leader election
# ✅ Partition distribution
# ✅ Data replication
# ✅ Automatic rebalancing

Example Output:
🚀 Starting node-1 (bootstrap) on ports 8080/9080
[RAFT] Becoming leader
[CLUSTER] Leader elected: node-1
🔗 Starting node-2 (joining) on ports 8081/9081
[JOIN] node-2 joining via node-1
[RAFT] Adding voter: node-2
[REBALANCE] Starting rebalance (trigger: node_join)
[PARTITION] partition-0: node-1 → node-2
[PARTITION] partition-15: node-1 → node-2
[REBALANCE] Complete (moved 21 partitions in 2.3s)
🔗 Starting node-3 (joining) on ports 8082/9082
[JOIN] node-3 joining via node-1
[REBALANCE] Starting rebalance (trigger: node_join)
[PARTITION] partition-5: node-1 → node-3
[REBALANCE] Complete (moved 14 partitions in 1.8s)
✅ Cluster ready: 3 nodes, 64 partitions, RF=3
go get github.com/skshohagmiah/clusterkit

package main
import (
"log"
"time"
"github.com/skshohagmiah/clusterkit"
)
func main() {
// Create first node - only 2 fields required!
ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080",
// Optional: Register application services
Services: map[string]string{
"kv": ":9080", // Your KV store API
"api": ":3000", // Your REST API
"grpc": ":50051", // Your gRPC service
},
// Optional: Enable health checking
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})
if err != nil {
log.Fatal(err)
}
if err := ck.Start(); err != nil {
log.Fatal(err)
}
defer ck.Stop()
log.Println("✅ Bootstrap node started on :8080")
select {} // Keep running
}

Join additional nodes by pointing JoinAddr at the bootstrap node:

ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-2",
HTTPAddr: ":8081",
JoinAddr: "localhost:8080", // Bootstrap node address
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})

The seven core methods:

// 1. Get partition for a key
partition, err := ck.GetPartition(key string) (*Partition, error)
// 2. Get primary node for partition
primary := ck.GetPrimary(partition *Partition) *Node
// 3. Get replica nodes for partition
replicas := ck.GetReplicas(partition *Partition) []Node
// 4. Get all nodes (primary + replicas)
nodes := ck.GetNodes(partition *Partition) []Node
// 5. Check if I'm the primary
isPrimary := ck.IsPrimary(partition *Partition) bool
// 6. Check if I'm a replica
isReplica := ck.IsReplica(partition *Partition) bool
// 7. Get my node ID
myID := ck.GetMyNodeID() string

Cluster operations:

// Get cluster information
cluster := ck.GetCluster() *Cluster
// Trigger manual rebalancing
err := ck.RebalancePartitions() error
// Get metrics
metrics := ck.GetMetrics() *Metrics
// Health check
health := ck.HealthCheck() *HealthStatus

ClusterKit includes built-in service discovery to help smart clients find your application services across the cluster.
// Register multiple services per node
ck, err := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080", // ClusterKit coordination API
Services: map[string]string{
"kv": ":9080", // Key-Value store
"api": ":3000", // REST API
"grpc": ":50051", // gRPC service
"websocket": ":8081", // WebSocket server
"metrics": ":9090", // Prometheus metrics
},
})

// Get cluster topology with services
resp, err := http.Get("http://localhost:8080/cluster")
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()

var cluster ClusterResponse
if err := json.NewDecoder(resp.Body).Decode(&cluster); err != nil {
log.Fatal(err)
}
// Route requests to appropriate services
for _, node := range cluster.Cluster.Nodes {
kvAddr := node.Services["kv"] // ":9080"
apiAddr := node.Services["api"] // ":3000"
grpcAddr := node.Services["grpc"] // ":50051"
// Route different request types to different services
routeKVRequest(node.ID, "localhost"+kvAddr)
routeAPIRequest(node.ID, "localhost"+apiAddr)
routeGRPCRequest(node.ID, "localhost"+grpcAddr)
}

The /cluster endpoint returns service information for each node:
{
"cluster": {
"nodes": [
{
"id": "node-1",
"ip": ":8080",
"name": "Server-1",
"status": "active",
"services": {
"kv": ":9080",
"api": ":3000",
"grpc": ":50051"
}
}
]
}
}

- 🎯 No hardcoded ports - Services are explicitly registered and discoverable
- 🔧 Multi-service nodes - Support HTTP, gRPC, WebSocket, etc. on same node
- 📡 Dynamic discovery - Clients automatically find services as nodes join/leave
- ⚖️ Load balancing - Route different request types to different services
- 🚀 Zero configuration - Services field is optional and backward compatible
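
The ClusterResponse type used in the smart-client snippet above is not shown; here is a sketch inferred from the JSON payload (the field tags are assumptions based on that shape):

type ClusterResponse struct {
	Cluster struct {
		Nodes []struct {
			ID       string            `json:"id"`
			IP       string            `json:"ip"`
			Name     string            `json:"name"`
			Status   string            `json:"status"`
			Services map[string]string `json:"services"`
		} `json:"nodes"`
	} `json:"cluster"`
}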
ClusterKit provides a comprehensive event system with rich context for all cluster lifecycle events.
Triggered when: Partitions are reassigned due to rebalancing
ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
// Only act if I'm the destination node
if event.CopyToNode.ID != myNodeID {
return
}
log.Printf("📦 Partition %s moving (reason: %s)",
event.PartitionID, event.ChangeReason)
log.Printf(" From: %d nodes", len(event.CopyFromNodes))
log.Printf(" Primary changed: %s → %s", event.OldPrimary, event.NewPrimary)
// Fetch and merge data from all source nodes
for _, source := range event.CopyFromNodes {
data := fetchPartitionData(source, event.PartitionID)
mergeData(data)
}
})

Event Structure:
type PartitionChangeEvent struct {
PartitionID string // e.g., "partition-5"
CopyFromNodes []*Node // Nodes that have the data
CopyToNode *Node // Node that needs the data
ChangeReason string // "node_join", "node_leave", "rebalance"
OldPrimary string // Previous primary node ID
NewPrimary string // New primary node ID
Timestamp time.Time // When the change occurred
}

Use Cases:
- Migrate data when partitions move
- Update local indexes
- Trigger background sync jobs
Triggered when: A brand new node joins the cluster
ck.OnNodeJoin(func(event *clusterkit.NodeJoinEvent) {
log.Printf("🎉 Node %s joined (cluster size: %d)",
event.Node.ID, event.ClusterSize)
if event.IsBootstrap {
log.Println(" This is the bootstrap node - initializing cluster")
initializeSchema()
}
// Update monitoring dashboards
updateNodeCount(event.ClusterSize)
})

Event Structure:
type NodeJoinEvent struct {
Node *Node // The joining node
ClusterSize int // Total nodes after join
IsBootstrap bool // Is this the first node?
Timestamp time.Time // When the node joined
}

Use Cases:
- Initialize cluster-wide resources on bootstrap
- Update monitoring/alerting systems
- Trigger capacity planning checks
Triggered when: A node that was previously in the cluster rejoins
ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
if event.Node.ID == myNodeID {
log.Printf("🔄 I'm rejoining after %v offline", event.OfflineDuration)
log.Printf(" Last seen: %v", event.LastSeenAt)
log.Printf(" Had %d partitions before leaving",
len(event.PartitionsBeforeLeave))
// Clear stale data
clearAllLocalData()
// Wait for OnPartitionChange to sync fresh data
log.Println(" Ready for partition reassignment")
} else {
log.Printf("📡 Node %s rejoined after %v",
event.Node.ID, event.OfflineDuration)
}
})

Event Structure:
type NodeRejoinEvent struct {
Node *Node // The rejoining node
OfflineDuration time.Duration // How long it was offline
LastSeenAt time.Time // When it was last seen
PartitionsBeforeLeave []string // Partitions it had before
Timestamp time.Time // When it rejoined
}

Use Cases:
- Clear stale local data before rebalancing
- Decide sync strategy based on offline duration
- Log rejoin events for debugging
- Alert if offline duration was too long
Important: This hook fires BEFORE rebalancing. Use it to prepare (clear data), then let OnPartitionChange handle the actual data sync with correct partition assignments.
Triggered when: A node is removed from the cluster (failure or graceful shutdown)
ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
log.Printf("❌ Node %s left (reason: %s)",
event.Node.ID, event.Reason)
log.Printf(" Owned %d partitions (primary)", len(event.PartitionsOwned))
log.Printf(" Replicated %d partitions", len(event.PartitionsReplica))
// Clean up connections
closeConnectionTo(event.Node.IP)
// Alert if critical
if event.Reason == "health_check_failure" && len(event.PartitionsOwned) > 10 {
alertOps("High partition loss - node failed!")
}
})

Event Structure:
type NodeLeaveEvent struct {
Node *Node // Full node info
Reason string // "health_check_failure", "graceful_shutdown", "removed_by_admin"
PartitionsOwned []string // Partitions this node was primary for
PartitionsReplica []string // Partitions this node was replica for
Timestamp time.Time // When it left
}

Use Cases:
- Clean up network connections
- Alert operations team
- Update capacity planning
- Log failure events
Triggered when: Partition rebalancing operation starts
ck.OnRebalanceStart(func(event *clusterkit.RebalanceEvent) {
log.Printf("⚖️ Rebalance starting (trigger: %s)", event.Trigger)
log.Printf(" Triggered by: %s", event.TriggerNodeID)
log.Printf(" Partitions to move: %d", event.PartitionsToMove)
log.Printf(" Nodes affected: %v", event.NodesAffected)
// Pause background jobs during rebalance
pauseBackgroundJobs()
// Increase operation timeouts
increaseTimeouts()
})

Event Structure:
type RebalanceEvent struct {
Trigger string // "node_join", "node_leave", "manual"
TriggerNodeID string // Which node caused it
PartitionsToMove int // How many partitions will move
NodesAffected []string // Which nodes are affected
Timestamp time.Time // When rebalance started
}

Triggered when: Partition rebalancing operation completes
ck.OnRebalanceComplete(func(event *clusterkit.RebalanceEvent, duration time.Duration) {
log.Printf("✅ Rebalance completed in %v", duration)
log.Printf(" Moved %d partitions", event.PartitionsToMove)
// Resume background jobs
resumeBackgroundJobs()
// Reset timeouts
resetTimeouts()
// Update metrics
recordRebalanceDuration(duration)
})

Triggered when: Overall cluster health status changes
ck.OnClusterHealthChange(func(event *clusterkit.ClusterHealthEvent) {
log.Printf("🏥 Cluster health: %s", event.Status)
log.Printf(" Healthy: %d/%d nodes", event.HealthyNodes, event.TotalNodes)
if event.Status == "critical" {
log.Printf(" Unhealthy nodes: %v", event.UnhealthyNodeIDs)
alertOps("Cluster in critical state!")
enableReadOnlyMode()
} else if event.Status == "healthy" {
log.Println(" All systems operational")
disableReadOnlyMode()
}
})

Event Structure:
type ClusterHealthEvent struct {
HealthyNodes int // Number of healthy nodes
UnhealthyNodes int // Number of unhealthy nodes
TotalNodes int // Total nodes in cluster
Status string // "healthy", "degraded", "critical"
UnhealthyNodeIDs []string // IDs of unhealthy nodes
Timestamp time.Time // When health changed
}

When a new node joins:

1. New node starts and sends join request
↓
2. OnNodeJoin fires
- Event includes: node info, cluster size, bootstrap flag
↓
3. OnRebalanceStart fires
- Event includes: trigger reason, partitions to move
↓
4. Partitions are reassigned
↓
5. OnPartitionChange fires (multiple times, once per partition)
- Event includes: partition ID, source nodes, destination node, reason
↓
6. OnRebalanceComplete fires
- Event includes: duration, partitions moved
Your Application:
- In OnNodeJoin: Log event, update monitoring
- In OnPartitionChange: Migrate data for assigned partitions
- In OnRebalanceComplete: Resume normal operations
When a node fails:

1. Health checker detects node failure (3 consecutive failures)
↓
2. OnNodeLeave fires
- Event includes: node info, reason="health_check_failure", partitions owned
↓
3. OnRebalanceStart fires
↓
4. Partitions are reassigned to remaining nodes
↓
5. OnPartitionChange fires (for each reassigned partition)
↓
6. OnRebalanceComplete fires
Your Application:
- In OnNodeLeave: Clean up connections, alert ops team
- In OnPartitionChange: Take ownership of reassigned partitions
- Data already exists on replicas, so migration is fast!
When a node rejoins:

1. Failed node restarts and rejoins
↓
2. OnNodeRejoin fires
- Event includes: node info, offline duration, partitions before leave
↓
3. Your app clears stale local data
↓
4. OnRebalanceStart fires
↓
5. Partitions are reassigned (may be different than before!)
↓
6. OnPartitionChange fires (for each assigned partition)
- Your app fetches fresh data from replicas
↓
7. OnRebalanceComplete fires
Your Application:
- In OnNodeRejoin: Clear ALL stale data (important!)
- In OnPartitionChange: Fetch fresh data for NEW partition assignments
- Don't assume you'll get the same partitions you had before!
Why clear data?
- Node was offline - data is stale
- Partition assignments may have changed
- Other nodes have the latest data
- Clean slate ensures consistency
A complete example - a distributed KV store built on ClusterKit:

package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
"sync"
"time"
"github.com/skshohagmiah/clusterkit"
)
type KVStore struct {
ck *clusterkit.ClusterKit
data map[string]string
mu sync.RWMutex
nodeID string
isRejoining bool
rejoinMu sync.Mutex
}
func NewKVStore(ck *clusterkit.ClusterKit, nodeID string) *KVStore {
kv := &KVStore{
ck: ck,
data: make(map[string]string),
nodeID: nodeID,
}
// Register all hooks
ck.OnPartitionChange(func(event *clusterkit.PartitionChangeEvent) {
kv.handlePartitionChange(event)
})
ck.OnNodeRejoin(func(event *clusterkit.NodeRejoinEvent) {
if event.Node.ID == kv.nodeID {
kv.handleRejoin(event)
}
})
ck.OnNodeLeave(func(event *clusterkit.NodeLeaveEvent) {
log.Printf("[KV] Node %s left (reason: %s, partitions: %d)",
event.Node.ID, event.Reason,
len(event.PartitionsOwned)+len(event.PartitionsReplica))
})
return kv
}
func (kv *KVStore) handleRejoin(event *clusterkit.NodeRejoinEvent) {
kv.rejoinMu.Lock()
defer kv.rejoinMu.Unlock()
if kv.isRejoining {
return // Already rejoining
}
kv.isRejoining = true
log.Printf("[KV] 🔄 Rejoining after %v offline", event.OfflineDuration)
log.Printf("[KV] 🗑️ Clearing stale data")
// Clear ALL stale data
kv.mu.Lock()
kv.data = make(map[string]string)
kv.mu.Unlock()
log.Printf("[KV] ✅ Ready for partition reassignment")
}
func (kv *KVStore) handlePartitionChange(event *clusterkit.PartitionChangeEvent) {
if event.CopyToNode.ID != kv.nodeID {
return
}
// Check if rejoining
kv.rejoinMu.Lock()
isRejoining := kv.isRejoining
kv.rejoinMu.Unlock()
if len(event.CopyFromNodes) == 0 {
log.Printf("[KV] New partition %s assigned", event.PartitionID)
return
}
log.Printf("[KV] 🔄 Migrating partition %s (reason: %s)",
event.PartitionID, event.ChangeReason)
// Fetch data from source nodes, falling back to the next source on failure
for _, source := range event.CopyFromNodes {
data := kv.fetchPartitionData(source, event.PartitionID)
if data == nil {
continue // Source unreachable, try the next one
}
kv.mu.Lock()
for key, value := range data {
kv.data[key] = value
}
kv.mu.Unlock()
log.Printf("[KV] ✅ Migrated %d keys from %s", len(data), source.ID)
break // Successfully migrated
}
// Clear rejoin flag after first partition
if isRejoining {
kv.rejoinMu.Lock()
kv.isRejoining = false
kv.rejoinMu.Unlock()
log.Printf("[KV] ✅ Rejoin complete")
}
}
func (kv *KVStore) fetchPartitionData(node *clusterkit.Node, partitionID string) map[string]string {
url := fmt.Sprintf("http://%s/migrate?partition=%s", node.IP, partitionID)
resp, err := http.Get(url)
if err != nil {
return nil
}
defer resp.Body.Close()
var result map[string]string
if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
return nil
}
return result
}
func (kv *KVStore) Set(key, value string) error {
partition, err := kv.ck.GetPartition(key)
if err != nil {
return err
}
if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
kv.mu.Lock()
kv.data[key] = value
kv.mu.Unlock()
return nil
}
// Forward to primary
primary := kv.ck.GetPrimary(partition)
return kv.forwardToPrimary(primary, key, value)
}
func (kv *KVStore) Get(key string) (string, error) {
partition, err := kv.ck.GetPartition(key)
if err != nil {
return "", err
}
if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
kv.mu.RLock()
defer kv.mu.RUnlock()
value, exists := kv.data[key]
if !exists {
return "", fmt.Errorf("key not found")
}
return value, nil
}
// Forward to primary
primary := kv.ck.GetPrimary(partition)
return kv.readFromPrimary(primary, key)
}
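
The Set and Get paths above forward requests through forwardToPrimary and readFromPrimary, which the example leaves undefined. A minimal sketch, assuming a hypothetical /kv endpoint on each node (add "io", "net/url", and "strings" to the imports; the endpoint and wire format are yours to define):

func (kv *KVStore) forwardToPrimary(primary *clusterkit.Node, key, value string) error {
	// Hypothetical /kv endpoint; adapt to your own service.
	endpoint := fmt.Sprintf("http://%s/kv?key=%s", primary.IP, url.QueryEscape(key))
	resp, err := http.Post(endpoint, "text/plain", strings.NewReader(value))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("primary returned %s", resp.Status)
	}
	return nil
}

func (kv *KVStore) readFromPrimary(primary *clusterkit.Node, key string) (string, error) {
	endpoint := fmt.Sprintf("http://%s/kv?key=%s", primary.IP, url.QueryEscape(key))
	resp, err := http.Get(endpoint)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("primary returned %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}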
func main() {
ck, _ := clusterkit.New(clusterkit.Options{
NodeID: "node-1",
HTTPAddr: ":8080",
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second,
Timeout: 2 * time.Second,
FailureThreshold: 3,
},
})
ck.Start()
defer ck.Stop()
kv := NewKVStore(ck, "node-1")
// Use the KV store
kv.Set("user:123", "John Doe")
value, _ := kv.Get("user:123")
fmt.Println("Value:", value)
select {}
}

Minimal configuration:

ck, _ := clusterkit.New(clusterkit.Options{
NodeID: "node-1", // Required
HTTPAddr: ":8080", // Required
})

Full configuration:

ck, _ := clusterkit.New(clusterkit.Options{
// Required
NodeID: "node-1",
HTTPAddr: ":8080",
// Cluster Formation
JoinAddr: "node-1:8080", // For non-bootstrap nodes
Bootstrap: false, // Auto-detected
// Partitioning
PartitionCount: 64, // More partitions = better distribution
ReplicationFactor: 3, // Survive 2 node failures
// Storage
DataDir: "/var/lib/clusterkit",
// Health Checking
HealthCheck: clusterkit.HealthCheckConfig{
Enabled: true,
Interval: 5 * time.Second, // Check every 5s
Timeout: 2 * time.Second, // Request timeout
FailureThreshold: 3, // Remove after 3 failures
},
})

ClusterKit exposes RESTful endpoints:
# Get cluster state (includes service discovery)
curl http://localhost:8080/cluster
# Get metrics
curl http://localhost:8080/metrics
# Get detailed health
curl http://localhost:8080/health/detailed
# Check if ready
curl http://localhost:8080/ready

The /cluster endpoint returns comprehensive cluster information including:
- Node membership and status
- Service discovery - All registered services per node
- Partition assignments and replica locations
- Cluster configuration and hash settings
ClusterKit includes 3 complete examples:
# SYNC - Strong consistency (quorum-based)
cd example/sync && ./run.sh
# ASYNC - Maximum throughput (eventual consistency)
cd example/async && ./run.sh
# Server-Side - Simple HTTP clients
cd example/server-side && ./run.sh

Each example demonstrates:
- ✅ Cluster formation (10 nodes)
- ✅ Data distribution (1000 keys)
- ✅ Automatic rebalancing
- ✅ Health checking and failure recovery
- ✅ Node rejoin handling
Deploy with Docker Compose:

version: '3.8'
services:
node-1:
image: your-registry/clusterkit:latest
environment:
- NODE_ID=node-1
- HTTP_PORT=8080
- DATA_DIR=/data
ports:
- "8080:8080"
volumes:
- node1-data:/data
node-2:
image: your-registry/clusterkit:latest
environment:
- NODE_ID=node-2
- HTTP_PORT=8080
- JOIN_ADDR=node-1:8080
- DATA_DIR=/data
ports:
- "8081:8080"
volumes:
- node2-data:/data
depends_on:
- node-1
volumes:
node1-data:
node2-data:
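
The compose file assumes an entrypoint that maps these environment variables onto clusterkit.Options. A minimal sketch of such a wrapper (the variable names match the compose file; the mapping itself is an assumption):

package main

import (
	"log"
	"os"

	"github.com/skshohagmiah/clusterkit"
)

func main() {
	ck, err := clusterkit.New(clusterkit.Options{
		NodeID:   os.Getenv("NODE_ID"),
		HTTPAddr: ":" + os.Getenv("HTTP_PORT"),
		JoinAddr: os.Getenv("JOIN_ADDR"), // empty on the bootstrap node
		DataDir:  os.Getenv("DATA_DIR"),
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := ck.Start(); err != nil {
		log.Fatal(err)
	}
	defer ck.Stop()
	select {} // run until killed
}

Comprehensive guides in the docs/ directory: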
- Architecture - Detailed system design, Raft + consistent hashing
- Partitioning - How data is distributed across nodes
- Replication - Replication strategies and consistency models
- Rebalancing - How partitions move when topology changes
- Node Rejoin - Handling stale data when nodes return
- Health Checking - Failure detection and recovery
- Hooks Guide - Complete event system reference
- Production Deployment - Best practices for production
- SYNC Mode - Strong consistency (quorum-based)
- ASYNC Mode - Maximum throughput (eventual consistency)
- Server-Side - Simple HTTP clients
Contributions welcome! Please read CONTRIBUTING.md first.
MIT License - see LICENSE for details.
Simple - 7 methods + hooks, minimal config
Flexible - Bring your own storage and replication
Production-Ready - Raft consensus, health checking, metrics
Well-Documented - Comprehensive guides and examples
Battle-Tested - Used in production distributed systems
Start building your distributed system today! 🚀