- Date: 2026-03-21
- Additional benchmark runs: 2026-03-26
- Machine / CPU: Apple Silicon (8 cores reported by benchmark)
- OS: macOS (Darwin 25.x)
- Compiler: AppleClang (C++20)
- Build flags: Release, `-O3 -march=native`
- Command: `scripts/run_all.sh` plus targeted runs for new benchmarks
- Key numbers: stride 1/4/16/64 -> 3.49G / 3.33G / 1.01G / 0.346G items/s
- Observation: throughput drops significantly as stride increases.
- Conclusion: reduced spatial locality makes memory access more latency-bound.
- Key takeaway: sequential access is cache-efficient; large strides cause major throughput loss.
- Key numbers: sequential 16.7G items/s vs pointer chasing 12.8M items/s
- Observation: random pointer chasing is roughly three orders of magnitude slower.
- Conclusion: hardware prefetching is ineffective and latency dominates.
- Key takeaway: irregular access patterns can overwhelm any compute-side optimization.
- Key numbers: adjacent 155.7M vs padded 509.4M items/s
- Observation: `alignas(64)` padding increases throughput by about 3.3x.
- Conclusion: independent logical writes on the same cache line cause heavy coherence traffic.
- Key takeaway: cache-line ownership, not variable ownership, controls write scalability.
- Key numbers:
  - Base case: AoS 1.318G vs SoA 1.303G items/s (close)
  - Wide-struct, two fields used: AoS 0.983G vs SoA 1.396G items/s
- Observation: SoA advantage becomes clear when only a subset of fields is used.
- Conclusion: SoA improves effective bandwidth when field utilization is sparse.
- Key takeaway: layout choice should follow access density, not a fixed rule.
- Key numbers:
  - Atomic 1/2/4 threads: 571.9M / 163.3M / 44.9M items/s
  - Mutex 1/2/4 threads: 221.0M / 62.7M / 6.67M items/s
- Observation: both degrade under contention; mutex degrades faster.
- Conclusion: both pay coherence costs, and mutex adds lock/unlock overhead.
- Key takeaway: contention on shared-write state dominates synchronization choice.
- Key numbers: 4KB 1.01G -> 256KB 280M -> 1MB 206M -> 16MB 88M -> 64MB 12.7M items/s
- Observation: clear multi-stage drops as working set increases.
- Conclusion: miss cost progressively dominates when crossing cache hierarchy boundaries.
- Key takeaway: working-set size is a primary performance control variable.
- Key numbers: dependent 670M vs independent 2.30G items/s
- Observation: independent streams provide about 3.4x higher throughput.
- Conclusion: dependency chains limit out-of-order overlap and instruction-level parallelism.
- Key takeaway: reducing dependencies can deliver larger gains than micro-tuning instructions.
- Key numbers:
  - always-taken branch: 7.49G items/s
  - alternating branch: 7.55G items/s
  - pseudo-random branch: 7.56G items/s
  - branchless pseudo-random: 7.88G items/s
- Observation: all four variants are close on this Apple Silicon and AppleClang `-O3` run, with branchless only modestly ahead.
- Conclusion: this code shape does not expose a large branch-prediction penalty on the local platform, so the valid result here is a narrow spread rather than a dramatic textbook gap.
- Key takeaway: if the branch and branchless forms converge, the right conclusion is "no strong signal here", not "branch prediction never matters".
- Key numbers:
  - forced inline: 1.22G items/s
  - forced noinline: 1.23G items/s
  - function pointer: 1.24G items/s
- Observation: the three call shapes land essentially on top of each other in this tight arithmetic chain benchmark.
- Conclusion: with this compiler and this code shape, forced inline vs noinline does not produce a meaningful standalone throughput difference; the stronger dispatch benchmarks remain the more useful signal.
- Key takeaway: when inlining results stay close, the repo keeps the benchmark and records the null result as a local conclusion rather than deleting the topic entirely.
- Key numbers:
  - Friendly lines=64: 928M items/s
  - Conflict lines=64: 254M items/s
  - Sharp drop appears at conflict lines=32/64
- Observation: conflict-stride access collapses throughput beyond a threshold.
- Conclusion: this is conflict-miss behavior (associativity overflow).
- Key takeaway: sufficient cache capacity does not prevent set-mapping conflict penalties.
- Key numbers (aggregate mean):
  - Mutex (`batch=64, backoff=0`): 111.6M ops/s
  - SPSC (`batch=8, backoff=0`): 174.1M ops/s
- Observation: with tuned parameters, SPSC outperforms mutex queue.
- Conclusion: in 1P1C transfer, ring buffers can reduce lock contention and critical-section cost.
- Key takeaway: queue performance is highly sensitive to backoff and batching policy.
- Key numbers:
  - `new`/`delete`: 109.3M ops/s
  - locked pool: 21.3M ops/s
  - thread-local pool: 103.1M ops/s
- Observation: shared locked pool is the slowest; thread-local pool is close to allocator baseline.
- Conclusion: pooling can underperform when global synchronization cost is high.
- Key takeaway: allocator strategy should prioritize thread locality first.
- Key numbers:
  - sequential `read`: 6.70 GiB/s
  - random `pread`: 5.51 GiB/s
  - sequential `mmap`: 676.4 GiB/s
- Observation: mapped-file scanning is dramatically faster than syscall-based reads on the warm-cache path measured here.
- Conclusion: once mapping is established, direct memory access removes per-chunk syscall overhead; random `pread` remains slower because access locality is weaker.
- Key takeaway: `mmap` is a strong baseline for repeated warm reads, but interpretation must account for page-cache state and first-touch fault cost.
- Key numbers (latest rerun):
  - `fetch_add` 1 thread: relaxed 527.5M, acq_rel 527.5M, seq_cst 528.1M items/s
  - `fetch_add` 4 threads: relaxed 25.9M, acq_rel 38.9M, seq_cst 24.0M items/s
  - flag handoff: acq_rel 25.9M vs seq_cst 26.4M handoffs/s
  - publish/consume: release-acquire 13.3M vs seq_cst 13.3M publishes/s
  - SPSC ring metadata: acq_rel 30.9M vs seq_cst 23.9M ring ops/s
  - message-passing litmus: relaxed bad reads mean 1.4 / 100k, release-acquire bad reads mean 0 / 100k
  - store-buffering litmus: relaxed both-zero mean 96.9k / 100k, release-acquire both-zero mean 96.7k / 100k, seq_cst both-zero mean 0 / 100k
- Observation: the 4-thread `fetch_add` case is dominated by cache-line contention, so relaxed vs acq_rel is noisy there, but the correctness litmus tests show the semantic differences cleanly.
- Conclusion: `relaxed` is not safe for publication patterns, release/acquire fixes single-variable message passing, and release/acquire is still weaker than `seq_cst` when you need a single global order across multiple atomics.
- Key takeaway: use throughput tests to measure cost, but use litmus tests to show what can actually go wrong with weaker memory orders.
- Key numbers:
  - default transfer: 29.9M ops/s
  - shared-placement hint: 30.7M ops/s
  - split-placement hint: 30.6M ops/s
  - placement requests: 0
  - placement verified: 0
- Observation: the tightened benchmark now reports whether a placement request was actually issued and verified; on this macOS run, neither happened.
- Conclusion: this machine/runtime combination is not honoring the placement path used by the benchmark, so the three variants should be interpreted as the same baseline.
- Key takeaway: placement experiments must self-report whether the OS accepted the requested policy, otherwise the benchmark can look valid while measuring nothing.
- Key numbers:
  - pipe handoff: 2.31M msgs/s
  - shared-memory mailbox: 10.6M msgs/s
- Observation: the shared-memory mailbox is about 4.6x faster than the pipe path in this thread-to-thread handoff test.
- Conclusion: syscall-heavy message transfer pays a large fixed cost relative to a lock-free shared-memory handoff.
- Key takeaway: for small messages and tight loops, avoiding kernel crossings can materially improve throughput.
- Key numbers:
  - template dispatch: 3.10G items/s
  - function pointer: 2.95G items/s
  - virtual dispatch: 1.14G items/s
- Observation: template and function-pointer dispatch are close, while virtual dispatch is about 2.7x slower in this tight loop.
- Conclusion: when the compiler can keep the call path simple, compile-time or direct-call forms preserve much higher throughput than virtual dispatch.
- Key takeaway: polymorphism choice can materially affect hot-loop throughput, especially when the per-item work is small.
- Key numbers:
  - lambda: 3.57G items/s
  - functor: 3.60G items/s
  - function pointer: 3.63G items/s
  - `std::function`: 1.23G items/s
- Observation: erased callable dispatch through `std::function` is roughly 3x slower than the other forms measured here.
- Conclusion: lightweight callable abstractions remain near direct-call speed, while type erasure introduces a visible hot-path cost.
- Key takeaway: `std::function` is convenient, but it should not be the default in throughput-critical inner loops.
- Key numbers:
  - `steady_clock::now()`: 70.1M calls/s
  - `system_clock::now()`: 65.6M calls/s
  - `clock_gettime`: 73.9M calls/s
  - `gettimeofday`: 98.2M calls/s
- Observation: all four timing APIs are in the same rough cost band, with `gettimeofday` fastest in this run.
- Conclusion: time-source selection still matters in very tight loops, but the gap is tens of millions of calls per second rather than orders of magnitude.
- Key takeaway: timestamping is not free; measure the exact clock path used by a latency-sensitive loop.
- Key numbers:
  - MPSC mutex queue: 8.10M msgs/s
  - MPSC bounded MPMC queue: 9.55M msgs/s
  - MPMC mutex queue: 21.2M msgs/s
  - MPMC bounded MPMC queue: 7.46M msgs/s
- Observation: the bounded lock-free queue is modestly faster in the 4P1C case, but the mutex queue is much faster in the 4P4C run on this machine.
- Conclusion: queue algorithm choice remains workload- and implementation-sensitive; lock-free does not imply universally higher throughput.
- Key takeaway: match queue design to the actual producer-consumer topology instead of assuming one structure wins everywhere.
- Key numbers:
  - spin handoff: 17.5M handoffs/s
  - yield handoff: 253.8k handoffs/s
  - condition variable handoff: 107.2k handoffs/s
- Observation: busy spinning is orders of magnitude faster than yielding or blocking in this tight ping-pong benchmark.
- Conclusion: when both sides stay active and handoffs are frequent, scheduler-mediated wakeups dominate the cost.
- Key takeaway: blocking primitives save CPU, but for extremely hot handoff loops they can impose a large throughput penalty.
- Key numbers:
  - contiguous page walk: 121.5M items/s
  - page-stride walk: 873.8M items/s
  - random page walk: 671.2M items/s
- Observation: randomized page traversal is materially slower than deterministic page-stride access.
- Conclusion: once a benchmark is dominated by page-level access, TLB and page-walk behavior become visible even when every access touches only one value per page.
- Key takeaway: page access order matters; page-locality loss can reduce throughput well before bandwidth is saturated.
- Key numbers:
  - error-code no-fail: 3.50G items/s
  - exception no-fail: 2.66G items/s
  - error-code rare-fail: 2.46G items/s
  - exception rare-fail: 398M items/s
- Observation: the no-throw exception path is somewhat slower than optional-style signaling, and actual throws are dramatically slower even at low failure frequency.
- Conclusion: exception handling changes the cost model sharply once failures occur.
- Key takeaway: exceptions can be acceptable on cold paths, but they are expensive for frequently evaluated or moderately hot error paths.
- Key numbers:
  - work=1, 1 thread: mutex 205M, spinlock 575M, ticket lock 528M items/s
  - work=1, 4 threads: mutex 26.7M, spinlock 3.18M, ticket lock 4.33M items/s
  - work=32, 1 thread: mutex 37.7M, spinlock 39.6M, ticket lock 39.7M items/s
  - work=32, 4 threads: mutex 4.77M, spinlock 2.48M, ticket lock 2.69M items/s
- Observation: once the critical section becomes larger, the uncontended advantage of spin and ticket locks mostly disappears, while their contended behavior remains poor on this machine.
- Conclusion: critical-section size matters as much as lock algorithm choice.
- Key takeaway: benchmark lock variants under both tiny and non-trivial work inside the lock; uncontended lock speed alone is not enough.
- Key numbers:
  - private first-touch write: 7.27 GiB/s
  - private rewrite on already-dirtied pages: 48.5 GiB/s
  - shared write without `msync`: 26.2 GiB/s
  - shared write with `msync`: 11.9 GiB/s
- Observation: first-touch private writes are far slower than rewriting already-private pages, and forcing `msync` cuts shared-write throughput sharply.
- Conclusion: copy-on-write fault cost, dirty-page state, and flush policy all matter enough that a single mapped-write benchmark is too coarse.
- Key takeaway: mapped-write benchmarks should separate first-touch, steady-state rewrite, and explicit durability cost.
- Key numbers:
  - pipe: 2.43M msgs/s
  - Unix stream `socketpair`: 1.42M msgs/s
- Observation: the pipe path is still clearly faster than the Unix-domain stream socket pair after tightening the benchmark to the reliable stream case.
- Conclusion: even same-host kernel communication paths have a measurable abstraction ladder.
- Key takeaway: prefer the narrowest IPC primitive that matches the dataflow and semantics you need.
- Key numbers (refined rerun):
  - `new`/`delete`: 20.8M ops/s
  - `malloc`/`free`: 25.6M ops/s
  - locked pool: 7.01M ops/s
  - `pmr::synchronized_pool_resource`: 882.8k ops/s
- Observation: general-purpose allocation remains much faster than the synchronized pool-style paths once allocation and free happen on different threads, and the PMR synchronized pool is by far the slowest in this benchmark.
- Conclusion: cross-thread ownership transfer is one of the harshest allocator stress patterns because it turns internal synchronization into the dominant cost.
- Key takeaway: allocator strategies that look good in single-owner benchmarks can collapse completely once producer and consumer ownership split across threads.
- Key numbers:
  - `std::variant` dispatch: 553M items/s
  - virtual hierarchy: 656M items/s
- Observation: after removing per-object heap-allocation bias and dispatching through stable preallocated objects, the virtual hierarchy is still faster in this mixed-operation benchmark.
- Conclusion: the earlier result was not just an allocation artifact; for this code shape, `std::variant` visitation still loses to virtual dispatch.
- Key takeaway: compare real dispatch patterns directly instead of assuming sum types are always cheaper.
- Key numbers:
  - small hot set: `map` 123M, `unordered_map` 1.22G, sorted vector 59.8M items/s
  - large mixed set with 50% misses: `map` 15.5M, `unordered_map` 448M, sorted vector 17.1M items/s
- Observation: `unordered_map` wins in both regimes here, but the gap narrows between `map` and sorted-vector lookup in the larger mixed hit/miss case.
- Conclusion: lookup behavior depends materially on keyset size and miss rate, not just the container class name.
- Key takeaway: always test both hot-hit and larger mixed workloads before choosing a lookup container.
- Key numbers:
  - unidirectional stream: TCP 1.57M msgs/s, Unix stream 1.54M msgs/s
  - request/response ping-pong: TCP 45.5k round trips/s, Unix stream 209.6k round trips/s
- Observation: the transport choice barely matters in the one-way stream test, but matters a great deal in the request/response latency shape.
- Conclusion: throughput and round-trip latency can rank the same transports very differently.
- Key takeaway: network-path benchmarks should include both streaming and ping-pong shapes, not just one direction of traffic.
- Key numbers:
  - first-touch mapped access: 15.2 GiB/s
  - prefaulted mapped access: 33.0 GiB/s
  - `mlock` path: 129.9 GiB/s, `mlock_ok=1`
- Observation: prefaulting still removes a large part of the first-touch cost, and this rerun successfully obtained locked memory on the current machine.
- Conclusion: page-fault cost is substantial enough to dominate the first pass over a region, and memory locking meaningfully changes the residency story when it actually succeeds.
- Key takeaway: separate first-touch, prefaulted, and locked-memory cases, and always record whether locking actually worked.
- Key numbers: `vector` 12.7G items/s, `deque` 3.23G items/s, `list` 934M items/s
- Observation: contiguous iteration dominates segmented and pointer-linked iteration in this scan-heavy workload.
- Conclusion: sequence-container iteration cost is primarily a locality story, not an API story.
- Key takeaway: for scan-heavy hot paths, `vector` is the default baseline and other sequence containers need a concrete reason to justify their overhead.
- Key numbers (refined rerun with additional PMR pool path):
  - `new`/`delete`: 33.3M ops/s
  - `malloc`/`free`: 44.3M ops/s
  - `pmr::monotonic_buffer_resource`: 133.8M ops/s
  - `pmr::unsynchronized_pool_resource`: 14.7k ops/s
  - arena pool: 391.8M ops/s
- Observation: in this fixed-size single-owner benchmark, the simple arena pool is the clear winner, while the unsynchronized PMR pool resource performs extremely poorly in the current setup.
- Conclusion: allocator abstractions with different recycling policies can land in completely different performance regimes even within the same PMR family.
- Key takeaway: allocator benchmarking has to be specific about object size, recycling policy, and lifetime shape; “PMR” is not one performance point.
- Key numbers:
  - mixed-size `new`/`delete`: 36.2M ops/s
  - mixed-size `malloc`/`free`: 47.7M ops/s
  - mixed-size `pmr::unsynchronized_pool_resource`: 133.5M ops/s
- Observation: the PMR unsynchronized pool is strong in this mixed-size benchmark, which is the opposite of its behavior in the fixed-size allocator benchmark above.
- Conclusion: allocator performance can flip completely when the size distribution and recycling pattern change.
- Key takeaway: allocator selection has to be benchmarked against the actual allocation mix, not just a single synthetic size class.
- Key numbers:
  - enum-tag dispatch: 988.6M items/s
  - `dynamic_cast` dispatch: 68.0M items/s
- Observation: RTTI-based type dispatch is roughly an order of magnitude slower than the equivalent tag-based dispatch in this benchmark.
- Conclusion: repeated runtime type checks can dominate hot-path cost when the work per element is small.
- Key takeaway: `dynamic_cast` is fine for cold or structural code paths, but it is a poor default for tight dispatch loops.
- Key numbers:
  - 256-byte payload, unbatched: mutex 23.2M msgs/s (5.93 GiB/s), SPSC 34.5M msgs/s (8.83 GiB/s)
  - 256-byte payload, batched-by-8: mutex 40.3M msgs/s (10.3 GiB/s), SPSC 36.5M msgs/s (9.35 GiB/s)
- Observation: real batching materially improves the mutex queue in this workload, while the SPSC ring changes only modestly.
- Conclusion: batching can compensate for lock overhead much more than it helps an already lightweight queue path.
- Key takeaway: queue benchmarks should test batching as an explicit algorithmic parameter, not just payload size.
- Key numbers:
  - potential alias: 10.11G items/s
  - restrict-style no-alias: 10.27G items/s
  - output aliases input: 3.45G items/s
- Observation: the no-alias signature is only slightly ahead here, but the true aliasing case where output overlaps input is much slower.
- Conclusion: the biggest aliasing penalty in this benchmark comes from overlapping read/write streams, not from the mere possibility of aliasing in the function signature.
- Key takeaway: aliasing benchmarks should include an actual overlap case; otherwise the result may say more about compiler heuristics than data dependence.
- Cache/locality: stride, pointer chasing, and associativity all show that access pattern sets the upper bound.
- Latency vs throughput: regular sequential access is throughput-friendly; random/conflicting access is latency-dominated.
- Contention/synchronization: false sharing, mutex/atomic, and queue tests all expose shared-write hotspots.
- Data layout: AoS vs SoA should be decided by field utilization and vectorization opportunities.
- Allocation strategy: memory pools need thread-aware design, otherwise synchronization overhead can erase gains.
- Syscall boundary: `mmap` warm-path scans and shared-memory handoff both outperform syscall-heavy alternatives in these runs.
- Language overhead: virtual dispatch and `std::function` both show clear hot-path cost relative to simpler call forms.
- Coordination strategy: spinning is far faster than blocking in the hottest handoff loop, but that result comes with obvious CPU-usage tradeoffs.
- Error signaling: rare thrown exceptions are already expensive enough to materially reshape throughput.
- IPC and transport: shared memory, pipes, Unix-domain sockets, and TCP loopback form a visible cost ladder on the same machine.
- Data structures and allocators: `unordered_map` wins this lookup workload, while globally synchronized pools lose badly in cross-thread ownership transfer.
- Memory residency and container layout: first-touch page cost and non-contiguous container traversal both show how strongly locality and residency shape throughput.
- Allocator and RTTI choice: allocator lifetime model and runtime type-check strategy can both dominate throughput once they enter a hot loop.
- Allocator variance: the same allocator family can look excellent or terrible depending on size mix and recycling policy.
- Queue measurements: synchronization choice, payload size, and topology all materially change which queue design wins.
- Lock behavior: lock rankings shift when the amount of work inside the critical section changes, so contention studies need more than one lock-scope size.
- Measurement discipline: placement and aliasing results both demonstrate that platform behavior must be validated before turning a benchmark into a general claim.
- Platform caveats: affinity and scheduling experiments need explicit validation because API support and enforcement vary by OS.