This guide provides detailed information on optimizing performance when using the shared-nothing library.
Based on benchmarks on modern hardware (Apple M1/M2, Intel Xeon):
| Channel Type | Latency (median) | Throughput |
|---|---|---|
| SPSC | ~10-20ns | 50M+ msg/sec |
| MPSC | ~30-50ns | 20M+ msg/sec |
| MPMC | ~50-100ns | 10M+ msg/sec |
Note: Results vary based on message size, contention, and CPU architecture.
The library scales linearly with CPU cores up to the number of physical cores:
| Cores | Throughput | Efficiency |
|---|---|---|
| 1 | 1.0x | 100% |
| 2 | 1.98x | 99% |
| 4 | 3.92x | 98% |
| 8 | 7.76x | 97% |
| 16 | 15.2x | 95% |
| 32 | 28.8x | 90% |
Efficiency drops beyond 16-32 cores due to:
- Memory bandwidth saturation
- Cache coherency overhead
- NUMA effects
SPSC (Single Producer Single Consumer)
- Fastest option
- Use when exactly one sender and one receiver
- Example: Pipeline stages
```rust
let (tx, rx) = Channel::spsc(1024);
```
MPSC (Multiple Producer Single Consumer)
- Most common pattern
- Use for collecting results from multiple workers
- Example: Aggregation, logging
```rust
let (tx, rx) = Channel::mpsc(1024);
```
MPMC (Multiple Producer Multiple Consumer)
- Most flexible but slowest
- Use when multiple consumers process from same queue
- Example: Load balancing, work stealing
```rust
let (tx, rx) = Channel::mpmc(1024);
```
Small Queues (64-256)
- Lower latency
- Better cache locality
- Risk of blocking on full queue
- Best for: Real-time systems, low-latency requirements
Medium Queues (1024-4096)
- Balanced latency/throughput
- Default choice for most applications
- Good backpressure characteristics
Large Queues (10000+)
- Maximum throughput
- Higher memory usage
- Can mask performance issues
- Best for: Batch processing, high-throughput systems
```rust
WorkerConfig::new()
    .with_queue_capacity(1024) // Tune based on workload
```
Pin workers to specific CPU cores for better cache locality:
```rust
let config = PoolConfig::new()
    .with_num_workers(8)
    .with_cpu_affinity(true); // Enable CPU pinning
```
Benefits:
- 10-30% performance improvement
- Reduced cache misses
- More predictable latency
- Better NUMA locality
When to use:
- Dedicated servers
- High-performance computing
- Real-time systems
When NOT to use:
- Shared environments
- Container orchestration (K8s)
- Systems with dynamic workloads
Process multiple messages per iteration:
```rust
impl Worker for MyWorker {
    type State = State;
    type Message = Message;

    fn tick(&mut self, state: &mut State) -> Result<()> {
        let mut batch = Vec::with_capacity(100);
        // Collect up to 100 available messages without blocking
        while batch.len() < 100 {
            match self.try_recv_message() {
                Ok(msg) => batch.push(msg),
                Err(_) => break, // queue empty; process what we have
            }
        }
        // Process the whole batch in one pass
        self.process_batch(state, batch)
    }
}
```
Benefits:
- Reduced per-message overhead
- Better CPU cache utilization
- Vectorization opportunities
- 2-5x throughput improvement
Small Messages (<64 bytes)
- Fits in cache line
- Fast to copy
- Prefer passing by value
```rust
#[derive(Clone, Copy)]
struct SmallMessage {
    id: u64,
    value: i32,
}
```
Large Messages (>64 bytes)
- Use `Box<T>` or `Arc<T>`
- Pass ownership, not copies
- Consider zero-copy techniques
```rust
struct LargeMessage {
    data: Box<[u8]>, // Payload lives on the heap; moving the message moves only a fat pointer
}
```
Hash Partitioning
```rust
Arc::new(HashPartitioner::new())
```
- Pros: Fast, uniform distribution, key affinity
- Cons: All keys rehashed if workers change
- Best for: Static worker pools, key-based routing
Consistent Hash Partitioning
```rust
Arc::new(ConsistentHashPartitioner::new(num_workers, 150))
```
- Pros: Minimal redistribution on worker changes
- Cons: Slightly more overhead, potential hotspots
- Best for: Dynamic worker pools, distributed systems
Round Robin Partitioning
```rust
Arc::new(RoundRobinPartitioner::new())
```
- Pros: Perfect load balance
- Cons: No key affinity
- Best for: Stateless processing, load balancing
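Whichever partitioner you pick, it is handed to the pool as an `Arc`. A minimal wiring sketch, assuming a `with_partitioner` setter on `PoolConfig` (that setter name is hypothetical; only the constructor calls above come from this guide):
```rust
use std::sync::Arc;

// `with_partitioner` is an assumed PoolConfig method, named here only for
// illustration; check the library's actual API.
let config = PoolConfig::new()
    .with_num_workers(8)
    .with_partitioner(Arc::new(HashPartitioner::new())); // static pool, key-based routing
```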
Keep State Compact
```rust
// Good: Compact state
struct State {
    counter: u64,
    cache: SmallVec<[Item; 16]>, // Stack-allocated small vec
}

// Bad: Large state
struct State {
    data: HashMap<String, Vec<LargeStruct>>, // Heap-heavy
}
```
Use Appropriate Data Structures
- `Vec<T>` for sequential access
- `HashMap<K, V>` for random access
- `BTreeMap<K, V>` for ordered data
- `SmallVec` for small collections
- `ArrayVec` for fixed-size collections
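For example, the two small-collection types trade off differently; a sketch assuming the `smallvec` and `arrayvec` crates (`Item` is a placeholder type):
```rust
use arrayvec::ArrayVec;
use smallvec::SmallVec;

#[derive(Clone)]
struct Item(u64);

struct RecentItems {
    // Inline storage for up to 16 items; spills to the heap only beyond that
    recent: SmallVec<[Item; 16]>,
    // Hard capacity of 8; pushing past it returns an error instead of allocating
    slots: ArrayVec<Item, 8>,
}
```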
Use Built-in Statistics
```rust
let stats = channel.stats();
println!("Messages sent: {}", stats.sent());
println!("Messages received: {}", stats.received());
println!("Send errors: {}", stats.send_errors());
```
Run Benchmarks
```bash
cargo bench --bench message_passing
cargo bench --bench worker_pool
```
Profile with perf/Instruments
```bash
# Linux
cargo build --release
perf record -g ./target/release/myapp
perf report

# macOS
cargo build --release
instruments -t "Time Profiler" ./target/release/myapp
```
Symptoms: Slow message processing, high p99 latency
Causes:
- Queue too large (messages wait too long)
- Workers doing synchronous I/O
- Lock contention in worker logic
- Large message copies
Solutions:
- Reduce queue capacity
- Use async I/O or separate I/O workers
- Review worker code for locks
- Pass messages by ownership, not copy
Symptoms: Not utilizing all CPU cores
Causes:
- Too few workers
- Imbalanced partitioning
- Small messages with high overhead
- Workers blocked on I/O
Solutions:
- Increase number of workers
- Use different partitioner
- Batch message processing
- Separate I/O from compute workers (see the pipeline sketch below)
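As a sketch of that last point, an I/O stage can feed a compute stage over an SPSC channel. `Channel::spsc` is from this guide; the thread scaffolding, placeholder functions, and `send`/`recv` signatures (assumed to follow the usual Rust channel conventions) are illustrative:
```rust
use std::thread;

// Placeholders for illustration
fn read_records() -> Vec<Vec<u8>> { Vec::new() }
fn process(_rec: Vec<u8>) {}

fn pipeline() {
    let (tx, rx) = Channel::spsc(1024);

    // I/O stage: blocks on input, never on compute
    let io = thread::spawn(move || {
        for rec in read_records() {
            if tx.send(rec).is_err() {
                break; // compute stage hung up
            }
        }
    });

    // Compute stage: pure CPU work, never blocks on I/O
    let compute = thread::spawn(move || {
        while let Ok(rec) = rx.recv() {
            process(rec);
        }
    });

    io.join().unwrap();
    compute.join().unwrap();
}
```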
Symptoms: 100% CPU but low throughput
Causes:
- Busy-waiting in worker loop
- Too many context switches
- Cache thrashing
- False sharing
Solutions:
- Add a small sleep or backoff in the worker `tick()` (sketched after this list)
- Reduce number of workers
- Enable CPU affinity
- Check for false sharing with perf c2c
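A minimal backoff sketch for the first solution; `try_recv_message` matches the batching example earlier, while `idle_ticks` and `handle` are illustrative assumptions:
```rust
use std::time::Duration;

impl Worker for MyWorker {
    type State = State;
    type Message = Message;

    fn tick(&mut self, state: &mut State) -> Result<()> {
        match self.try_recv_message() {
            // Work available: reset the backoff and process the message
            Ok(msg) => {
                self.idle_ticks = 0; // assumed counter field on MyWorker
                self.handle(state, msg) // assumed per-message handler
            }
            // Queue empty: sleep briefly instead of spinning at 100% CPU,
            // doubling the pause up to a cap (~10ms here)
            Err(_) => {
                self.idle_ticks = (self.idle_ticks + 1).min(10);
                std::thread::sleep(Duration::from_micros(10 << self.idle_ticks));
                Ok(())
            }
        }
    }
}
```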
Symptoms: Increasing memory usage over time
Causes:
- Unbounded queues
- Memory leaks in worker state
- Messages not being consumed
Solutions:
- Use bounded queues
- Profile with valgrind/heaptrack
- Monitor queue depths
- Add backpressure handling
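For the backpressure item, this guide does not show the library's send API, so the sketch below uses crossbeam-channel's equivalent bounded-channel calls to illustrate the pattern: pause briefly when the queue is full so a slow consumer throttles the producer instead of memory growing without bound.
```rust
use std::time::Duration;
use crossbeam_channel::{Sender, TrySendError};

fn send_with_backpressure<T>(tx: &Sender<T>, mut msg: T) {
    loop {
        match tx.try_send(msg) {
            Ok(()) => return,
            Err(TrySendError::Full(m)) => {
                msg = m; // take the message back and retry after a pause
                std::thread::sleep(Duration::from_micros(50));
            }
            Err(TrySendError::Disconnected(_)) => return, // receiver gone; drop the message
        }
    }
}
```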
Performance checklist:
- Use SPSC channels where possible
- Tune queue capacity for workload
- Enable CPU affinity on dedicated hardware
- Batch message processing when appropriate
- Keep messages small (<64 bytes) or use indirection
- Choose partitioner based on use case
- Minimize allocations in hot paths
- Profile before optimizing
- Monitor queue depths and statistics
- Test with realistic workloads
x86_64 (Intel/AMD)
- 64-byte cache lines
- Strong memory ordering
- Good branch prediction
- NUMA awareness important on multi-socket systems
Recommendations:
- Enable CPU affinity for NUMA locality
- Pin memory to socket (numactl)
- Use transparent huge pages
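On Linux, the pinning and huge-page recommendations translate to commands like these (standard numactl and kernel THP interfaces; adjust node numbers to your topology):
```bash
# Bind the process and its memory allocations to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./target/release/myapp

# Enable transparent huge pages system-wide (requires root)
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
```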
ARM64 (including Apple Silicon)
- 128-byte cache lines on some CPUs
- Weaker memory ordering
- Excellent power efficiency
- Unified memory architecture
Recommendations:
- May need larger cache line padding (see the sketch below)
- Memory fences more critical
- Excellent for mobile/edge deployments
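A sketch of the padding recommendation: align independently written hot fields to 128 bytes so two counters never share a cache line, even on CPUs with 128-byte lines:
```rust
use std::sync::atomic::AtomicU64;

// 128-byte alignment covers both 64- and 128-byte cache lines
#[repr(align(128))]
struct CachePadded<T>(T);

struct Counters {
    produced: CachePadded<AtomicU64>, // written only by the producer
    consumed: CachePadded<AtomicU64>, // written only by the consumer
}
```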
RISC-V
- Variable cache line sizes
- Emerging architecture
- Growing ecosystem
Recommendations:
- Test on actual hardware
- Profile cache behavior
- Monitor as ecosystem matures
For data-parallel workloads:
```rust
// Requires nightly Rust with #![feature(portable_simd)]
use std::simd::*;

fn process_batch_simd(data: &[f32]) -> Vec<f32> {
    // Process 8 floats at once; chunks_exact silently drops a remainder
    // shorter than 8, so handle the tail separately if it matters
    data.chunks_exact(8)
        .map(|chunk| {
            let vec = f32x8::from_slice(chunk);
            let result = vec * f32x8::splat(2.0);
            result.to_array()
        })
        .flatten()
        .collect()
}
```
For network I/O:
```rust
use bytes::Bytes;

struct Message {
    data: Bytes, // Zero-copy slice; clones share the underlying buffer
}
```
For frequent allocations:
```rust
use typed_arena::Arena;

struct WorkerState {
    arena: Arena<Message>,
}

impl Worker for MyWorker {
    fn handle_message(&mut self, state: &mut WorkerState, msg: Envelope<Message>) -> Result<()> {
        // Arena allocates in large chunks, reducing malloc overhead;
        // all allocations are freed together when the arena is dropped
        let _processed = state.arena.alloc(process(msg.payload));
        Ok(())
    }
}
```
The shared-nothing library is designed for maximum performance, but achieving optimal results requires:
- Understanding your workload: Profile and measure
- Choosing appropriate patterns: Match channel type to use case
- Tuning parameters: Queue sizes, worker counts, affinity
- Iterative optimization: Measure, optimize, repeat
Start with sensible defaults, profile your application, and optimize the hot paths. The library provides the primitives for building extremely fast concurrent systems.