-
Notifications
You must be signed in to change notification settings - Fork 0
Performance
This page is split by branch / workload:
-
feature/govs.feature/c— N-Queens head-to-head — 14–22 % Go advantage -
feature/c— N-Queens sweep (2026-04-22) — the original C+Sylvan+Lace reference feature/c— historical N=12 comparison-
feature/govs.feature/c— SRE-NDD full sweep (2026-04-24) — BGP fat-tree network verification
Every number is tagged with the branch it came from. The two
libmtpnddjni.so artifacts are drop-in replacements at the JNI level
but are not interchangeable for benchmarking.
A smaller apples-to-apples benchmark that both backends solve natively
(no Java or JNI layer). feature/go uses cmd/nqueens-bench;
feature/c uses mtpndd_nqueens_benchmark. Median of 3 runs.
| N | feature/c |
feature/go v1.18 |
Go / C | Go advantage |
|---|---|---|---|---|
| 10 | 0.09 s | 0.07 s | 0.80× | 20 % faster |
| 11 | 0.36 s | 0.30 s | 0.86× | 14 % faster |
| 12 | 1.66 s | 1.38 s | 0.83× | 17 % faster |
| 13 | 9.43 s | 7.40 s | 0.78× | 22 % faster |
feature/go leads feature/c by 14–22 % across all tested N on
n-queens. The gap is smaller than on the SRE workloads (1.4–2.9×)
because n-queens is all BDD work — no NDD SatCount memo to exploit,
no JNI-side goroutine-spawn pathology to fix.
All figures in this section are from the 2026-04-22 sweep run with ./bench_sweep.sh on the feature/c branch (Sylvan submodule 1.10.0 + Lace 1.6.2). Raw TSVs: bench_results_{serial,parallel}_{release,pgo}_*.tsv in the repo root.
- HP Zhan 86 Pro G2 MT, 6 physical cores (12 threads), Linux 6.19
-
MTPNDD_LOG_LEVEL=1(INFO — no stats instrumentation overhead) -
CMAKE_BUILD_TYPE=Release(-O3) for the default build; PGO table usesscripts/build-pgo.shoutput
mtpndd_nqueens_benchmark, default build.
| N | 1w | 4w | 6w | 6w speedup | Solutions |
|---|---|---|---|---|---|
| 10 | 0.237 s | 0.113 s | 0.093 s | 2.55× | 724 |
| 11 | 1.171 s | 0.446 s | 0.344 s | 3.40× | 2,680 |
| 12 | 6.021 s | 2.132 s | 1.540 s | 3.91× | 14,200 |
| 13 | 35.49 s | 12.07 s | 8.59 s | 4.13× | 73,712 |
Speedup tapers above 4 workers because at the top of the recursion the tree narrows and Lace steal attempts dominate.
Built via ./scripts/build-pgo.sh (profile-generate → nqueens training at N=10..12, W=1,4,6 → profile-use).
| N | 1w | 4w | 6w | 6w speedup | vs Release (6w) |
|---|---|---|---|---|---|
| 10 | 0.217 s | 0.104 s | 0.098 s | 2.21× | −5% |
| 11 | 1.082 s | 0.421 s | 0.311 s | 3.48× | −10% |
| 12 | 5.711 s | 2.033 s | 1.478 s | 3.86× | −4% |
| 13 | 34.19 s | 11.50 s | 8.67 s | 3.94× | −1% |
Single-thread wins are largest (−8 to −12 %) because PGO helps the single hot recursion most; at 6 workers scheduling overhead masks the per-instruction savings.
mtpndd_nqueens_parallel_benchmark — same core plus tree-parallel construction of the N-Queens formula. Useful at small N where the inner AND/OR recursion runs out of parallelism.
| N | 1w | 6w | vs serial PGO (6w) |
|---|---|---|---|
| 10 | 0.220 s | 0.080 s | −18% |
| 11 | 1.093 s | 0.295 s | −5% |
| 12 | 5.804 s | 1.473 s | 0% |
| 13 | 33.66 s | 8.23 s | −5% |
The parallel variant under plain Release (no PGO) is 1–8 % slower than the PGO numbers above — the gap is largest at 1 worker (e.g. N=11 1w: 1.189 s Release → 1.093 s PGO, −8 %) and shrinks to ~1–3 % at 6 workers where scheduling overhead dominates.
The feature/c codebase went through a large refactor in April 2026 (Sylvan + Lace upgrade from vendored 1.9.1 + 1.5.0 to upstream submodule 1.10.0 + 1.6.2, plus targeted perf work).
| W | Pre-migration | Current (Release) | Current (PGO+LTO) |
|---|---|---|---|
| 1 | 6.32 s | 6.02 s (−5%) | 5.71 s (−10%) |
| 2 | 5.14 s | 3.70 s (−28%) | 3.47 s (−33%) |
| 4 | 3.08 s | 2.13 s (−31%) | 2.03 s (−34%) |
| 6 | 2.06 s | 1.54 s (−25%) | 1.48 s (−28%) |
Pre-migration data lives at the legacy-vendored-sylvan tag. The biggest multi-worker win comes from per-worker mtbdd_refs sharding (no more global refs mutex) and the ref/deref coalescing cache.
# Default Release sweep
./bench_sweep.sh # N=7..13, workers=1..6
# PGO sweep
./scripts/build-pgo.sh # one-time, ~minutes
./bench_sweep.sh --pgo
# Outer-op parallel sweep
./bench_sweep.sh --parallelResults are written to bench_results_{serial|parallel}_{release|pgo}_<timestamp>.tsv in the repo root.
Optimization write-ups and A/B audits from the April 2026 migration live in-repo under docs/archived/perf/. Highlights:
- Migration to Sylvan 1.10.0 submodule + Lace 1.6.2 — enables per-worker mtbdd_refs sharding, eliminates global refs mutex contention at 4+ workers
-
Coalescing ref cache —
mtbdd_ref/mtbdd_derefpairs cancel in a thread-local buffer before ever touching per-worker refs tables - Two-phase AND — filter edge-label pairs before recursing; 2–15 % on real workloads, biggest win on dense formulas
- Lock-free operation cache — seqlock reads, no mutex on hit path
- Slab memory pools with per-worker caches — eliminates cross-core cacheline bouncing on allocation
- Nodetable bucket tuning — reduced hash collisions; average probe length from 238 to 1.77 on N-Queens
- PGO + LTO build — an additional 4–10 % on top of everything above
Retained for reference — these deltas came from the two-phase AND
commit on feature/c, measured before the Sylvan 1.10 / Lace 1.6
migration:
| Dataset | MF | Best improvement (within feature/c) |
|---|---|---|
| bgp_fattree04 | 2 | −15.4 % (w=6) |
| bgp_fattree08 | 1 | −5.2 % (w=6) |
| bgp_fattree12 | 3 | −14.0 % (w=1) |
Migration-era (post-Sylvan-1.10) C-only SRE numbers have not been
re-measured in isolation; the next section instead compares the
post-migration feature/c to the feature/go port head-to-head.
Head-to-head comparison of both libmtpnddjni.so artifacts via
sre-ndd/run.sh (Java NDDConfig2spec, same fat-tree workloads).
One JVM at a time, w=4 (4 Java worker threads at the app layer).
-
feature/goartifact: commit1df0688(v1.18 dedicatedbdd.Diff), ≈ 2.8 MBlibmtpnddjni.so -
feature/cartifact:feature/cHEAD, ≈ 454 KBlibmtpnddjni.so(Sylvan 1.10 submodule + Lace 1.6) - Same 6-core / 31 GB host, JDK 23,
-Xmx32768m(-Xmx24576munder 26 GB cgroup cap for the largest workloads to avoid OOM-killing the host)
| workload |
feature/go v1.18 |
feature/c |
Go / C | notes |
|---|---|---|---|---|
| ft04 MF=1 | 0.50 s | 0.49 s | 1.02× | sub-second; JNI crossing dominates on both |
| ft04 MF=2 | 0.37 s | 0.36 s | 1.03× | sub-second |
| ft04 MF=3 | 0.37 s | 0.36 s | 1.03× | sub-second |
| ft08 MF=1 | 2.43 s | 3.38 s | 0.72× | Go 1.39× faster |
| ft08 MF=2 | 5.50 s | 10.89 s | 0.50× | Go 1.98× faster |
| ft08 MF=3 | 14.00 s (5-run median) | 32.40 s | 0.43× | Go 2.31× faster |
| ft12 MF=1 | 19.25 s | 26.47 s | 0.73× | Go 1.38× faster |
| ft12 MF=2 | 55.68 s | — | — |
feature/c not re-measured this session |
| ft12 MF=3 | 198.29 s (3-run median: 197.76 / 198.29 / 199.25) | 578.29 s (3-run median: 578.29 / 573.95 / 578.49) | 0.34× | Go 2.92× faster |
| ft16 MF=1 | 84.84 s | — | — |
feature/c not re-measured |
| ft16 MF=2 | OOM | — | — | JVM killed at -Xmx32G on 31 GB host |
| ft16 MF=3 | OOM | — | — | killed at 3:22, 22.8 GB peak under 26 GB cgroup cap |
| ft20 all MFs | skipped | — | — |
run.sh comment indicates -Xmx150G is expected |
-
ft04 is JNI-bound. Three MFs all run at ≈ 0.4–0.5 s on both branches — the per-call ≈ 2.5 µs cgo crossing on
feature/golands inside the same noise band as JNI onfeature/c. -
feature/golead scales with workload. 1.0× (ft04) → 1.4–2.0× (ft08) → 1.4–2.9× (ft12). The ratio tracks how much work lives in pure solving rather than in the JNI layer. -
ft12 MF=3's 2.92× is mostly SatCount. Per-op breakdown (see
sre-ndd/docs/mtpndd-go-full-sweep.md):feature/goSatCount 10.4 s vs.feature/c289 s (27.8× on that sub-op alone, thanks to the Go-side global memo). Excluding SatCount,feature/gois 1.39× faster on the rest. -
31 GB host is the ceiling. ft16 MF=1 fits (84.8 s); MF=2 and MF=3 OOM on both branches' heap layout. ft20 requires ≈ 150 GB per
run.shand is not representative here. -
Noise floor ≈ 3 %. The previously-recorded
feature/cft12 MF=3 number (565.8 s single-run) was ≈ 2 % below the refreshed 3-run median of 578.3 s.
Full experiment log, per-version contribution breakdown, and C-side per-op decomposition live at
sre-ndd/docs/mtpndd-go-full-sweep.md.
Headline (feature/go v1.18 vs feature/c, w=4): ft08 MF=3 2.31× · ft12 MF=3 2.92× · nqueens N=13 22 % faster — see Performance.
feature/go @ b8e651c · feature/c @ head · Apache-2.0
Maintained by Augists (<augists@duck.com>). Issues & PRs: github.com/Augists/mtpndd.