Performance

This page is split by branch / workload:

feature/go vs. feature/c — N-Queens head-to-head — 14–22 % Go advantage
feature/c — N-Queens sweep (2026-04-22) — the original C+Sylvan+Lace reference
feature/c — historical N=12 comparison
feature/go vs. feature/c — SRE-NDD full sweep (2026-04-24) — BGP fat-tree network verification

Every number is tagged with the branch it came from. The two libmtpnddjni.so artifacts are drop-in replacements at the JNI level but are not interchangeable for benchmarking.

`feature/go` vs. `feature/c` — N-Queens (6 workers)

A smaller apples-to-apples benchmark that both backends solve natively (no Java or JNI layer). feature/go uses cmd/nqueens-bench; feature/c uses mtpndd_nqueens_benchmark. Median of 3 runs.

N	`feature/c`	`feature/go` v1.18	Go / C	Go advantage
10	0.09 s	0.07 s	0.80×	20 % faster
11	0.36 s	0.30 s	0.86×	14 % faster
12	1.66 s	1.38 s	0.83×	17 % faster
13	9.43 s	7.40 s	0.78×	22 % faster

feature/go leads feature/c by 14–22 % across all tested N on n-queens. The gap is smaller than on the SRE workloads (1.4–2.9×) because n-queens is all BDD work — no NDD SatCount memo to exploit, no JNI-side goroutine-spawn pathology to fix.

`feature/c` — N-Queens sweep (2026-04-22)

All figures in this section are from the 2026-04-22 sweep run with ./bench_sweep.sh on the feature/c branch (Sylvan submodule 1.10.0 + Lace 1.6.2). Raw TSVs: bench_results_{serial,parallel}_{release,pgo}_*.tsv in the repo root.

Test Platform

HP Zhan 86 Pro G2 MT, 6 physical cores (12 threads), Linux 6.19
MTPNDD_LOG_LEVEL=1 (INFO — no stats instrumentation overhead)
CMAKE_BUILD_TYPE=Release (-O3) for the default build; PGO table uses scripts/build-pgo.sh output

Serial Benchmark — Release

mtpndd_nqueens_benchmark, default build.

N	1w	4w	6w	6w speedup	Solutions
10	0.237 s	0.113 s	0.093 s	2.55×	724
11	1.171 s	0.446 s	0.344 s	3.40×	2,680
12	6.021 s	2.132 s	1.540 s	3.91×	14,200
13	35.49 s	12.07 s	8.59 s	4.13×	73,712

Speedup tapers above 4 workers because at the top of the recursion the tree narrows and Lace steal attempts dominate.

Serial Benchmark — PGO + LTO

Built via ./scripts/build-pgo.sh (profile-generate → nqueens training at N=10..12, W=1,4,6 → profile-use).

N	1w	4w	6w	6w speedup	vs Release (6w)
10	0.217 s	0.104 s	0.098 s	2.21×	−5%
11	1.082 s	0.421 s	0.311 s	3.48×	−10%
12	5.711 s	2.033 s	1.478 s	3.86×	−4%
13	34.19 s	11.50 s	8.67 s	3.94×	−1%

Single-thread wins are largest (−8 to −12 %) because PGO helps the single hot recursion most; at 6 workers scheduling overhead masks the per-instruction savings.

Outer-Operation Parallel Variant — PGO + LTO

mtpndd_nqueens_parallel_benchmark — same core plus tree-parallel construction of the N-Queens formula. Useful at small N where the inner AND/OR recursion runs out of parallelism.

N	1w	6w	vs serial PGO (6w)
10	0.220 s	0.080 s	−18%
11	1.093 s	0.295 s	−5%
12	5.804 s	1.473 s	0%
13	33.66 s	8.23 s	−5%

The parallel variant under plain Release (no PGO) is 1–8 % slower than the PGO numbers above — the gap is largest at 1 worker (e.g. N=11 1w: 1.189 s Release → 1.093 s PGO, −8 %) and shrinks to ~1–3 % at 6 workers where scheduling overhead dominates.

Session-over-Session — N=12 (`feature/c`)

The feature/c codebase went through a large refactor in April 2026 (Sylvan + Lace upgrade from vendored 1.9.1 + 1.5.0 to upstream submodule 1.10.0 + 1.6.2, plus targeted perf work).

W	Pre-migration	Current (Release)	Current (PGO+LTO)
1	6.32 s	6.02 s (−5%)	5.71 s (−10%)
2	5.14 s	3.70 s (−28%)	3.47 s (−33%)
4	3.08 s	2.13 s (−31%)	2.03 s (−34%)
6	2.06 s	1.54 s (−25%)	1.48 s (−28%)

Pre-migration data lives at the legacy-vendored-sylvan tag. The biggest multi-worker win comes from per-worker mtbdd_refs sharding (no more global refs mutex) and the ref/deref coalescing cache.

Reproducing

# Default Release sweep
./bench_sweep.sh                   # N=7..13, workers=1..6

# PGO sweep
./scripts/build-pgo.sh             # one-time, ~minutes
./bench_sweep.sh --pgo

# Outer-op parallel sweep
./bench_sweep.sh --parallel

Results are written to bench_results_{serial|parallel}_{release|pgo}_<timestamp>.tsv in the repo root.

Historical Notes

Optimization write-ups and A/B audits from the April 2026 migration live in-repo under docs/archived/perf/. Highlights:

Migration to Sylvan 1.10.0 submodule + Lace 1.6.2 — enables per-worker mtbdd_refs sharding, eliminates global refs mutex contention at 4+ workers
Coalescing ref cache — mtbdd_ref / mtbdd_deref pairs cancel in a thread-local buffer before ever touching per-worker refs tables
Two-phase AND — filter edge-label pairs before recursing; 2–15 % on real workloads, biggest win on dense formulas
Lock-free operation cache — seqlock reads, no mutex on hit path
Slab memory pools with per-worker caches — eliminates cross-core cacheline bouncing on allocation
Nodetable bucket tuning — reduced hash collisions; average probe length from 238 to 1.77 on N-Queens
PGO + LTO build — an additional 4–10 % on top of everything above

SRE-NDD legacy improvements (`feature/c`, pre-migration)

Retained for reference — these deltas came from the two-phase AND commit on feature/c, measured before the Sylvan 1.10 / Lace 1.6 migration:

Dataset	MF	Best improvement (within `feature/c`)
bgp_fattree04	2	−15.4 % (w=6)
bgp_fattree08	1	−5.2 % (w=6)
bgp_fattree12	3	−14.0 % (w=1)

Migration-era (post-Sylvan-1.10) C-only SRE numbers have not been re-measured in isolation; the next section instead compares the post-migration feature/c to the feature/go port head-to-head.

`feature/go` vs. `feature/c` — SRE-NDD full sweep (2026-04-24)

Head-to-head comparison of both libmtpnddjni.so artifacts via sre-ndd/run.sh (Java NDDConfig2spec, same fat-tree workloads). One JVM at a time, w=4 (4 Java worker threads at the app layer).

feature/go artifact: commit 1df0688 (v1.18 dedicated bdd.Diff), ≈ 2.8 MB libmtpnddjni.so
feature/c artifact: feature/c HEAD, ≈ 454 KB libmtpnddjni.so (Sylvan 1.10 submodule + Lace 1.6)
Same 6-core / 31 GB host, JDK 23, -Xmx32768m (-Xmx24576m under 26 GB cgroup cap for the largest workloads to avoid OOM-killing the host)

workload	`feature/go` v1.18	`feature/c`	Go / C	notes
ft04 MF=1	0.50 s	0.49 s	1.02×	sub-second; JNI crossing dominates on both
ft04 MF=2	0.37 s	0.36 s	1.03×	sub-second
ft04 MF=3	0.37 s	0.36 s	1.03×	sub-second
ft08 MF=1	2.43 s	3.38 s	0.72×	Go 1.39× faster
ft08 MF=2	5.50 s	10.89 s	0.50×	Go 1.98× faster
ft08 MF=3	14.00 s (5-run median)	32.40 s	0.43×	Go 2.31× faster
ft12 MF=1	19.25 s	26.47 s	0.73×	Go 1.38× faster
ft12 MF=2	55.68 s	—	—	`feature/c` not re-measured this session
ft12 MF=3	198.29 s (3-run median: 197.76 / 198.29 / 199.25)	578.29 s (3-run median: 578.29 / 573.95 / 578.49)	0.34×	Go 2.92× faster
ft16 MF=1	84.84 s	—	—	`feature/c` not re-measured
ft16 MF=2	OOM	—	—	JVM killed at `-Xmx32G` on 31 GB host
ft16 MF=3	OOM	—	—	killed at 3:22, 22.8 GB peak under 26 GB cgroup cap
ft20 all MFs	skipped	—	—	`run.sh` comment indicates `-Xmx150G` is expected

Observations

ft04 is JNI-bound. Three MFs all run at ≈ 0.4–0.5 s on both branches — the per-call ≈ 2.5 µs cgo crossing on feature/go lands inside the same noise band as JNI on feature/c.
feature/go lead scales with workload. 1.0× (ft04) → 1.4–2.0× (ft08) → 1.4–2.9× (ft12). The ratio tracks how much work lives in pure solving rather than in the JNI layer.
ft12 MF=3's 2.92× is mostly SatCount. Per-op breakdown (see sre-ndd/docs/mtpndd-go-full-sweep.md): feature/go SatCount 10.4 s vs. feature/c 289 s (27.8× on that sub-op alone, thanks to the Go-side global memo). Excluding SatCount, feature/go is 1.39× faster on the rest.
31 GB host is the ceiling. ft16 MF=1 fits (84.8 s); MF=2 and MF=3 OOM on both branches' heap layout. ft20 requires ≈ 150 GB per run.sh and is not representative here.
Noise floor ≈ 3 %. The previously-recorded feature/c ft12 MF=3 number (565.8 s single-run) was ≈ 2 % below the refreshed 3-run median of 578.3 s.

Full experiment log, per-version contribution breakdown, and C-side per-op decomposition live at sre-ndd/docs/mtpndd-go-full-sweep.md.

Headline (feature/go v1.18 vs feature/c, w=4): ft08 MF=3 2.31× · ft12 MF=3 2.92× · nqueens N=13 22 % faster — see Performance.

feature/go @ b8e651c · feature/c @ head · Apache-2.0

Maintained by Augists (<augists@duck.com>). Issues & PRs: github.com/Augists/mtpndd.

Home

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance

Performance

`feature/go` vs. `feature/c` — N-Queens (6 workers)

`feature/c` — N-Queens sweep (2026-04-22)

Test Platform

Serial Benchmark — Release

Serial Benchmark — PGO + LTO

Outer-Operation Parallel Variant — PGO + LTO

Session-over-Session — N=12 (`feature/c`)

Reproducing

Historical Notes

SRE-NDD legacy improvements (`feature/c`, pre-migration)

`feature/go` vs. `feature/c` — SRE-NDD full sweep (2026-04-24)

Observations

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Both branches

`feature/c` (C + Sylvan + Lace)

`feature/go` (pure Go, v1.18)

Clone this wiki locally

Performance

Performance

feature/go vs. feature/c — N-Queens (6 workers)

feature/c — N-Queens sweep (2026-04-22)

Test Platform

Serial Benchmark — Release

Serial Benchmark — PGO + LTO

Outer-Operation Parallel Variant — PGO + LTO

Session-over-Session — N=12 (feature/c)

Reproducing

Historical Notes

SRE-NDD legacy improvements (feature/c, pre-migration)

feature/go vs. feature/c — SRE-NDD full sweep (2026-04-24)

Observations

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Both branches

feature/c (C + Sylvan + Lace)

feature/go (pure Go, v1.18)

Clone this wiki locally

`feature/go` vs. `feature/c` — N-Queens (6 workers)

`feature/c` — N-Queens sweep (2026-04-22)

Session-over-Session — N=12 (`feature/c`)

SRE-NDD legacy improvements (`feature/c`, pre-migration)

`feature/go` vs. `feature/c` — SRE-NDD full sweep (2026-04-24)

`feature/c` (C + Sylvan + Lace)

`feature/go` (pure Go, v1.18)