Skip to content

Performance

Augists edited this page Apr 24, 2026 · 5 revisions

Performance

This page is split by branch / workload:

Every number is tagged with the branch it came from. The two libmtpnddjni.so artifacts are drop-in replacements at the JNI level but are not interchangeable for benchmarking.


feature/go vs. feature/c — N-Queens (6 workers)

A smaller apples-to-apples benchmark that both backends solve natively (no Java or JNI layer). feature/go uses cmd/nqueens-bench; feature/c uses mtpndd_nqueens_benchmark. Median of 3 runs.

N feature/c feature/go v1.18 Go / C Go advantage
10 0.09 s 0.07 s 0.80× 20 % faster
11 0.36 s 0.30 s 0.86× 14 % faster
12 1.66 s 1.38 s 0.83× 17 % faster
13 9.43 s 7.40 s 0.78× 22 % faster

feature/go leads feature/c by 14–22 % across all tested N on n-queens. The gap is smaller than on the SRE workloads (1.4–2.9×) because n-queens is all BDD work — no NDD SatCount memo to exploit, no JNI-side goroutine-spawn pathology to fix.

feature/c — N-Queens sweep (2026-04-22)

All figures in this section are from the 2026-04-22 sweep run with ./bench_sweep.sh on the feature/c branch (Sylvan submodule 1.10.0 + Lace 1.6.2). Raw TSVs: bench_results_{serial,parallel}_{release,pgo}_*.tsv in the repo root.

Test Platform

  • HP Zhan 86 Pro G2 MT, 6 physical cores (12 threads), Linux 6.19
  • MTPNDD_LOG_LEVEL=1 (INFO — no stats instrumentation overhead)
  • CMAKE_BUILD_TYPE=Release (-O3) for the default build; PGO table uses scripts/build-pgo.sh output

Serial Benchmark — Release

mtpndd_nqueens_benchmark, default build.

N 1w 4w 6w 6w speedup Solutions
10 0.237 s 0.113 s 0.093 s 2.55× 724
11 1.171 s 0.446 s 0.344 s 3.40× 2,680
12 6.021 s 2.132 s 1.540 s 3.91× 14,200
13 35.49 s 12.07 s 8.59 s 4.13× 73,712

Speedup tapers above 4 workers because at the top of the recursion the tree narrows and Lace steal attempts dominate.

Serial Benchmark — PGO + LTO

Built via ./scripts/build-pgo.sh (profile-generate → nqueens training at N=10..12, W=1,4,6 → profile-use).

N 1w 4w 6w 6w speedup vs Release (6w)
10 0.217 s 0.104 s 0.098 s 2.21× −5%
11 1.082 s 0.421 s 0.311 s 3.48× −10%
12 5.711 s 2.033 s 1.478 s 3.86× −4%
13 34.19 s 11.50 s 8.67 s 3.94× −1%

Single-thread wins are largest (−8 to −12 %) because PGO helps the single hot recursion most; at 6 workers scheduling overhead masks the per-instruction savings.

Outer-Operation Parallel Variant — PGO + LTO

mtpndd_nqueens_parallel_benchmark — same core plus tree-parallel construction of the N-Queens formula. Useful at small N where the inner AND/OR recursion runs out of parallelism.

N 1w 6w vs serial PGO (6w)
10 0.220 s 0.080 s −18%
11 1.093 s 0.295 s −5%
12 5.804 s 1.473 s 0%
13 33.66 s 8.23 s −5%

The parallel variant under plain Release (no PGO) is 1–8 % slower than the PGO numbers above — the gap is largest at 1 worker (e.g. N=11 1w: 1.189 s Release → 1.093 s PGO, −8 %) and shrinks to ~1–3 % at 6 workers where scheduling overhead dominates.

Session-over-Session — N=12 (feature/c)

The feature/c codebase went through a large refactor in April 2026 (Sylvan + Lace upgrade from vendored 1.9.1 + 1.5.0 to upstream submodule 1.10.0 + 1.6.2, plus targeted perf work).

W Pre-migration Current (Release) Current (PGO+LTO)
1 6.32 s 6.02 s (−5%) 5.71 s (−10%)
2 5.14 s 3.70 s (−28%) 3.47 s (−33%)
4 3.08 s 2.13 s (−31%) 2.03 s (−34%)
6 2.06 s 1.54 s (−25%) 1.48 s (−28%)

Pre-migration data lives at the legacy-vendored-sylvan tag. The biggest multi-worker win comes from per-worker mtbdd_refs sharding (no more global refs mutex) and the ref/deref coalescing cache.

Reproducing

# Default Release sweep
./bench_sweep.sh                   # N=7..13, workers=1..6

# PGO sweep
./scripts/build-pgo.sh             # one-time, ~minutes
./bench_sweep.sh --pgo

# Outer-op parallel sweep
./bench_sweep.sh --parallel

Results are written to bench_results_{serial|parallel}_{release|pgo}_<timestamp>.tsv in the repo root.

Historical Notes

Optimization write-ups and A/B audits from the April 2026 migration live in-repo under docs/archived/perf/. Highlights:

  1. Migration to Sylvan 1.10.0 submodule + Lace 1.6.2 — enables per-worker mtbdd_refs sharding, eliminates global refs mutex contention at 4+ workers
  2. Coalescing ref cachemtbdd_ref / mtbdd_deref pairs cancel in a thread-local buffer before ever touching per-worker refs tables
  3. Two-phase AND — filter edge-label pairs before recursing; 2–15 % on real workloads, biggest win on dense formulas
  4. Lock-free operation cache — seqlock reads, no mutex on hit path
  5. Slab memory pools with per-worker caches — eliminates cross-core cacheline bouncing on allocation
  6. Nodetable bucket tuning — reduced hash collisions; average probe length from 238 to 1.77 on N-Queens
  7. PGO + LTO build — an additional 4–10 % on top of everything above

SRE-NDD legacy improvements (feature/c, pre-migration)

Retained for reference — these deltas came from the two-phase AND commit on feature/c, measured before the Sylvan 1.10 / Lace 1.6 migration:

Dataset MF Best improvement (within feature/c)
bgp_fattree04 2 −15.4 % (w=6)
bgp_fattree08 1 −5.2 % (w=6)
bgp_fattree12 3 −14.0 % (w=1)

Migration-era (post-Sylvan-1.10) C-only SRE numbers have not been re-measured in isolation; the next section instead compares the post-migration feature/c to the feature/go port head-to-head.


feature/go vs. feature/c — SRE-NDD full sweep (2026-04-24)

Head-to-head comparison of both libmtpnddjni.so artifacts via sre-ndd/run.sh (Java NDDConfig2spec, same fat-tree workloads). One JVM at a time, w=4 (4 Java worker threads at the app layer).

  • feature/go artifact: commit 1df0688 (v1.18 dedicated bdd.Diff), ≈ 2.8 MB libmtpnddjni.so
  • feature/c artifact: feature/c HEAD, ≈ 454 KB libmtpnddjni.so (Sylvan 1.10 submodule + Lace 1.6)
  • Same 6-core / 31 GB host, JDK 23, -Xmx32768m (-Xmx24576m under 26 GB cgroup cap for the largest workloads to avoid OOM-killing the host)
workload feature/go v1.18 feature/c Go / C notes
ft04 MF=1 0.50 s 0.49 s 1.02× sub-second; JNI crossing dominates on both
ft04 MF=2 0.37 s 0.36 s 1.03× sub-second
ft04 MF=3 0.37 s 0.36 s 1.03× sub-second
ft08 MF=1 2.43 s 3.38 s 0.72× Go 1.39× faster
ft08 MF=2 5.50 s 10.89 s 0.50× Go 1.98× faster
ft08 MF=3 14.00 s (5-run median) 32.40 s 0.43× Go 2.31× faster
ft12 MF=1 19.25 s 26.47 s 0.73× Go 1.38× faster
ft12 MF=2 55.68 s feature/c not re-measured this session
ft12 MF=3 198.29 s (3-run median: 197.76 / 198.29 / 199.25) 578.29 s (3-run median: 578.29 / 573.95 / 578.49) 0.34× Go 2.92× faster
ft16 MF=1 84.84 s feature/c not re-measured
ft16 MF=2 OOM JVM killed at -Xmx32G on 31 GB host
ft16 MF=3 OOM killed at 3:22, 22.8 GB peak under 26 GB cgroup cap
ft20 all MFs skipped run.sh comment indicates -Xmx150G is expected

Observations

  • ft04 is JNI-bound. Three MFs all run at ≈ 0.4–0.5 s on both branches — the per-call ≈ 2.5 µs cgo crossing on feature/go lands inside the same noise band as JNI on feature/c.
  • feature/go lead scales with workload. 1.0× (ft04) → 1.4–2.0× (ft08) → 1.4–2.9× (ft12). The ratio tracks how much work lives in pure solving rather than in the JNI layer.
  • ft12 MF=3's 2.92× is mostly SatCount. Per-op breakdown (see sre-ndd/docs/mtpndd-go-full-sweep.md): feature/go SatCount 10.4 s vs. feature/c 289 s (27.8× on that sub-op alone, thanks to the Go-side global memo). Excluding SatCount, feature/go is 1.39× faster on the rest.
  • 31 GB host is the ceiling. ft16 MF=1 fits (84.8 s); MF=2 and MF=3 OOM on both branches' heap layout. ft20 requires ≈ 150 GB per run.sh and is not representative here.
  • Noise floor ≈ 3 %. The previously-recorded feature/c ft12 MF=3 number (565.8 s single-run) was ≈ 2 % below the refreshed 3-run median of 578.3 s.

Full experiment log, per-version contribution breakdown, and C-side per-op decomposition live at sre-ndd/docs/mtpndd-go-full-sweep.md.