This document provides guidance for running and interpreting benchmarks.
# Run all benchmarks
make bench
# Run with multiple iterations for variance analysis
make bench-count
# Run specific package
go test -bench=. -benchmem ./internal/cancelFor consistent, reproducible results:
# 1. Set CPU governor to performance (prevents frequency scaling)
sudo cpupower frequency-set -g performance
# 2. Disable turbo boost (for consistent clock speed)
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# 3. Verify CPU frequency is stable
watch -n1 "cat /proc/cpuinfo | grep MHz | head -4"
# 4. Check for background processes
top -bn1 | head -20Control how many OS threads execute Go code:
# Single-threaded execution (lowest variance, no goroutine scheduling noise)
GOMAXPROCS=1 go test -bench=. ./internal/...
# Match physical cores (no hyperthreading)
GOMAXPROCS=4 go test -bench=. ./internal/...
# Default: uses all logical CPUs (GOMAXPROCS=runtime.NumCPU())
go test -bench=. ./internal/...When to use:
GOMAXPROCS=1: Best for measuring raw single-threaded performanceGOMAXPROCS=N: For parallel benchmarks (b.RunParallel)- Default: For realistic multi-core scenarios
# Run on CPU 0 only
taskset -c 0 go test -bench=. ./internal/...
# Combined: single core + single GOMAXPROCS (ultimate isolation)
taskset -c 0 GOMAXPROCS=1 go test -bench=. ./internal/...Increase process priority to reduce interference from other processes:
# Run with highest priority (requires root)
sudo nice -n -20 go test -bench=. ./internal/...
# Or renice an existing process
sudo renice -n -20 -p $(pgrep -f "go test")Nice values:
-20: Highest priority (most CPU time)0: Default priority19: Lowest priority (least CPU time)
Combined with CPU pinning for maximum isolation:
sudo nice -n -20 taskset -c 0 GOMAXPROCS=1 go test -bench=. ./internal/...Note: High priority alone doesn't prevent context switches. For true isolation, combine with CPU pinning and consider isolating CPU cores from the scheduler (
isolcpuskernel parameter).
# Disable App Nap (can affect timing)
defaults write NSGlobalDomain NSAppSleepDisabled -bool YES
# Run with elevated priority (macOS equivalent of nice)
sudo nice -n -20 go test -bench=. ./internal/...For the most stable benchmarks on dedicated machines:
# 1. Add to kernel boot parameters (GRUB)
# isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3
# 2. After reboot, CPUs 2-3 are isolated from scheduler
# Run benchmarks on isolated CPU:
sudo taskset -c 2 nice -n -20 GOMAXPROCS=1 go test -bench=. ./internal/...This removes the CPUs from general scheduling entirely.
go test -bench=. -benchmem ./internal/...Run 10 iterations and analyze with benchstat:
# Install benchstat
go install golang.org/x/perf/cmd/benchstat@latest
# Run benchmarks
go test -bench=. -count=10 ./internal/... > results.txt
# Analyze
benchstat results.txt# Before changes
go test -bench=. -count=10 ./internal/... > old.txt
# Make changes...
# After changes
go test -bench=. -count=10 ./internal/... > new.txt
# Compare
benchstat old.txt new.txtBenchmarkCancel_Atomic_Done_Direct-24 1000000000 0.34 ns/op 0 B/op 0 allocs/op
-24: Number of CPUs used (GOMAXPROCS)1000000000: Iterations run0.34 ns/op: Time per operation0 B/op: Bytes allocated per operation0 allocs/op: Heap allocations per operation
- Good: < 2% variance
- Acceptable: 2-5% variance
- Investigate: > 5% variance
High variance causes and mitigations:
| Cause | Mitigation |
|---|---|
| Background processes | nice -n -20, close browsers/IDEs |
| CPU frequency scaling | Set governor to performance |
| Thermal throttling | Let CPU cool between runs |
| Memory pressure | Close memory-heavy apps |
| Goroutine scheduling | GOMAXPROCS=1 |
| OS scheduler preemption | taskset -c 0 + nice -n -20 |
| Hyperthreading noise | Pin to physical core |
- Allocations should be 0 for hot-path operations
- Relative ordering should be stable across runs
- TSC results may vary with CPU frequency changes
Compare context cancellation checking:
go run ./cmd/context -n 10000000Compare queue implementations:
go run ./cmd/channel -n 10000000 -size 1024Compare ticker implementations:
go run ./cmd/ticker -n 10000000Combined benchmark (most realistic):
go run ./cmd/context-ticker -n 10000000Results on AMD Ryzen Threadripper PRO 3945WX:
| Component | Standard | Optimized | Speedup |
|---|---|---|---|
| Cancel check | ~10 ns | ~0.3 ns | 30x |
| Tick check | ~100 ns | ~6 ns (batch) | 16x |
| Combined | ~96 ns | ~5 ns | 18x |
- Micro-benchmarks measure one dimension — Real applications have many factors
- Results are hardware-dependent — Your mileage will vary
- go:linkname may break —
runtime.nanotimeis internal - TSC requires calibration — Accuracy depends on CPU frequency stability
go test -bench=BenchmarkCancel -cpuprofile=cpu.prof ./internal/cancel
go tool pprof -http=:8080 cpu.profgo test -bench=BenchmarkQueue -memprofile=mem.prof ./internal/queue
go tool pprof -http=:8080 mem.profgo test -bench=BenchmarkCombined -trace=trace.out ./internal/combined
go tool trace trace.out