The shared-nothing architecture library is designed to provide maximum performance for concurrent workloads by eliminating all shared state between workers. This document describes the architectural decisions and performance optimizations.
Principle: Workers never share memory, preventing contention and data races.
Implementation:
- Each worker runs in its own thread with isolated state
- State types must be `Send + 'static` but not `Sync`
- Communication happens only through message passing
- No `Arc<Mutex<T>>` or other shared-memory primitives
Benefits:
- Zero lock contention
- Perfect cache locality
- Linear scalability
- Fault isolation
Principle: Use lock-free data structures for inter-worker communication.
Implementation:
- Built on `flume` and `crossbeam` channels
- SPSC, MPSC, and MPMC variants
- Bounded and unbounded options
- Cache-line aligned for performance
Performance Characteristics:
SPSC: ~10ns per message (single core)
MPSC: ~30ns per message (multi-core)
MPMC: ~50ns per message (multi-core)
Principle: Minimize cache coherency traffic between cores.
Implementation:
- Cache-line padding (64 bytes) for shared structures
- Worker state aligned to cache lines
- Statistics counters use separate cache lines
- CPU affinity to keep workers on same core
Memory Layout:
```
┌────────────────────────────────────┐ Cache Line 0
│ Worker State (exclusive to core)   │
├────────────────────────────────────┤ Cache Line 1
│ Padding (prevents false sharing)   │
├────────────────────────────────────┤ Cache Line 2
│ Channel metadata (shared)          │
└────────────────────────────────────┘
```
Principle: Distribute work evenly while maintaining key affinity.
Strategies:
- Hash partitioning
  - Use Case: General purpose, consistent mapping
  - Algorithm: `hash(key) % num_workers`
  - Pros: Simple, fast, uniform distribution
  - Cons: All keys reassigned if workers change
- Consistent hashing
  - Use Case: Dynamic worker pools
  - Algorithm: Virtual nodes on hash ring
  - Pros: Minimal redistribution on worker changes
  - Cons: Slightly more overhead
- Range partitioning
  - Use Case: Ordered data, range queries
  - Algorithm: Map ranges to workers
  - Pros: Locality for range operations
  - Cons: Can create hot spots
- Round-robin
  - Use Case: Load balancing without affinity
  - Algorithm: Sequential distribution
  - Pros: Perfect balance
  - Cons: No key affinity
```
┌─────────┐   spawn()   ┌─────────────┐
│ Factory │ ──────────> │   Thread    │
└─────────┘             │   Spawn     │
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │   init()    │
                        │  (Setup)    │
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │   Message   │
                        │    Loop     │<───┐
                        └──────┬──────┘    │
                               │           │
                        ┌──────▼──────┐    │
                        │  handle_    │    │
                        │  message()  │────┘
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │ shutdown()  │
                        └──────┬──────┘
                               │
                        ┌──────▼──────┐
                        │   Thread    │
                        │    Join     │
                        └─────────────┘
```
```
Client
  │
  │ send_partitioned(key, msg)
  ▼
Partitioner / MessageRouter
  │
  │ partition(key) → worker_id
  ▼
Channel (Lock-Free)
  │
  │ Bounded queue (cache-aligned)
  ▼
Worker Thread
  │
  │ recv() → Message
  ▼
Message Handler
  │
  │ Process with isolated state
  ▼
[Optional: Send to other workers]
```
Each worker has exclusive ownership of its state:
```rust
struct WorkerState<K, V> {
    // All fields are owned; no Arc/Mutex needed
    data: HashMap<K, V>,
    counters: Vec<u64>,
    cache: LruCache<K, V>,
}
```

Messages are moved (not cloned) when possible:
```rust
// Message is moved into the channel
tx.send(expensive_message)?;

// Receiver takes ownership
let msg = rx.recv()?;
```

Read-only or atomic-only access:
```rust
#[repr(align(64))]
struct ChannelStats {
    messages_sent: AtomicU64, // Atomic updates
    _padding: [u8; 56],       // Prevent false sharing
}
```

| Scenario | Channel Type | Reason |
|---|---|---|
| Single sender, single receiver | SPSC | Fastest, no contention |
| Multiple senders, single receiver | MPSC | Common pattern, optimized |
| Multiple senders, multiple receivers | MPMC | Most flexible |
Pin workers to specific CPU cores:
```rust
WorkerConfig::new()
    .with_cpu_affinity(core_id)
```

Benefits:
- Warmer caches
- Reduced context switching
- Predictable performance
- Better NUMA locality
Process multiple messages per iteration:
```rust
fn handle_batch(&mut self, state: &mut State) -> Result<()> {
    let mut batch = Vec::with_capacity(100);
    // Drain available messages
    while let Ok(msg) = self.rx.try_recv() {
        batch.push(msg);
        if batch.len() >= 100 { break; }
    }
    // Process as batch
    self.process_batch(state, batch)
}
```

- Use `&[u8]` for large data with the `bytes` crate
- Pass ownership instead of cloning
- Use `MaybeUninit` for uninitialized buffers
- Memory-map files for large datasets
Built-in statistics for monitoring:
```rust
let stats = channel.stats();
println!("Sent: {}, Received: {}",
    stats.sent(),
    stats.received()
);
```

Scaling is linear until the number of workers equals the number of physical cores.
Bottlenecks:
- Memory bandwidth (>16 cores)
- Cache coherency (>32 cores)
- NUMA effects (>64 cores)
Mitigation:
- Use CPU affinity
- NUMA-aware allocation
- Minimize cross-core communication
The library provides building blocks for distributed systems:
- Serialize messages (with the `serialization` feature)
- Network workers handle socket I/O
- Partitioning extends across machines
- Consistent hashing handles machine failures
```rust
pub enum Error {
    WorkerNotRunning,         // Worker lifecycle
    WorkerAlreadyRunning,
    WorkerPanicked(String),
    SendError(String),        // Channel errors
    ReceiveError(String),
    Timeout,
    InvalidConfig(String),    // Configuration
    PoolFull,
    WorkerNotFound(u64),
    PartitionError(String),   // Partitioning
    Other(String),            // Catch-all
}
```

Workers are isolated:
- Panic in one worker doesn't affect others
- Channel disconnection is handled gracefully
- Pool continues with remaining workers
- Retry: Resend message to same worker
- Failover: Send to different worker
- Restart: Spawn new worker
- Circuit Breaker: Stop sending after N failures
- Individual components in isolation
- Property-based testing with `proptest`
- Edge cases and error conditions
- Multi-worker scenarios
- Message ordering guarantees
- Shutdown sequences
- Throughput measurements
- Latency percentiles (p50, p99, p999)
- Comparison with alternatives
- Scaling characteristics
- Long-running scenarios
- High message rates
- Memory leak detection
- Thread safety verification
- Async/Await Support
  - Tokio integration
  - Async message handlers
  - Async I/O workers
- Network Distribution
  - TCP/UDP transport
  - Protocol buffers
  - Service discovery
- Monitoring
  - Prometheus metrics
  - Distributed tracing
  - Health checks
- Advanced Partitioning
  - Weighted partitioning
  - Geo-aware routing
  - Priority queues
- Persistence
  - Message durability
  - State checkpointing
  - Recovery from crashes
- SIMD Processing: Vectorized message processing
- GPU Offload: Move compute to GPU workers
- RDMA: Zero-copy network transfers
- eBPF: Kernel-level message routing
This architecture achieves high performance through:
- Elimination of locks: Lock-free data structures
- Cache optimization: Alignment and affinity
- Data locality: Partitioning strategies
- Zero sharing: Complete worker isolation
- Efficient messaging: Optimized channels
The result is a library that scales linearly with cores and provides predictable, low-latency performance for concurrent workloads.