
V3 throughput bench + FINDINGS entry 5: network bottleneck retires planned experiment #47

Merged
wock9000 merged 6 commits into trunk from v3-throughput-bench
May 14, 2026
Conversation

@wock9000
Contributor

Summary

Wires --backend {numpy,coral} into tools/truth_bench.py (host-side prep from prior PR #46, now closed), then runs the host-vs-Coral comparison at N=1024 Wishart and discovers that the V3 backend over SSH-tunneled TCP is ~209× slower per step than host NumPy on Apple Silicon.

That retires the planned "TPU-accelerated Wishart truth-bench" experiment (named as the follow-up in FINDINGS entry 4) on measured evidence. New FINDINGS entry 5 documents the result and names the architectural fix (co-locate annealer on Coral).

Headline numbers (N=1024, ms per matmul-step)

| backend | ms / step |
| --- | --- |
| Host (M3 Pro) NumPy, in truth-bench loop | 0.33 |
| Coral CPU (ARM) NumPy | 5.59 |
| Coral Edge TPU, local | 2.13 |
| Coral Edge TPU, via SSH-tunneled TCP | 68.78 |

  • TPU vs Coral ARM CPU: 2.6× faster at N=1024 (the V3 hardware claim holds).
  • Host NumPy vs TPU local: 6.5× faster (modern host beats the small TPU at this N).
  • Host NumPy vs operational V3 (TPU over network): 209× faster (network dominates).
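The headline ratios follow directly from the table; a quick arithmetic check (values copied from the table above, ratios rounded as in the PR body):

```python
# Per-step latencies from the table above (ms per matmul-step, N=1024).
ms = {
    "host_numpy": 0.33,
    "coral_cpu": 5.59,
    "tpu_local": 2.13,
    "tpu_ssh_tcp": 68.78,
}

# Headline ratios quoted in the PR body.
tpu_vs_coral_cpu = ms["coral_cpu"] / ms["tpu_local"]    # ~2.6x
host_vs_tpu_local = ms["tpu_local"] / ms["host_numpy"]  # ~6.5x
host_vs_v3_path = ms["tpu_ssh_tcp"] / ms["host_numpy"]  # ~209x

print(round(tpu_vs_coral_cpu, 1), round(host_vs_tpu_local, 1))
```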

What's in this PR

What's NOT in this PR

  • Coral-side annealer co-location — substantive work, deferred
  • Quality-comparison re-bench with a Wishart-J-baked TPU model — only ms_per_step is load-bearing in this entry; gap_rel from the coral path is unreliable because the baked-in J is random_spin_glass, not Wishart

Test plan

  • Host path unchanged at N=256 smoke test
  • Coral path returns valid v vectors at N=1024 (smoke test before benching)
  • FINDINGS entry 5 reflects what was actually measured (no extrapolation)

🤖 Generated with Claude Code

wock9000 and others added 6 commits May 13, 2026 07:13
…e bench

New problem class for the truth harness. Planted random k-colorable
graphs (k colors assigned uniformly, edges drawn only between
differently-colored pairs → planted coloring feasible by construction)
encoded as Ising via one-hot penalty with n·k+1 spins (one ancilla
absorbs the linear field, same layout as QAP/TSP).

QUBO convention used: H = Σ Q_ii x_i + Σ_{i<j} Q_ij x_i x_j, with
symmetric Q where each off-diagonal entry holds the pair coefficient
exactly once (no double-counting). Standard substitution x = (1+s)/2
and ancilla linear-field absorption yields E = -½ sᵀJs.
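The substitution and ancilla absorption can be checked numerically. This is a minimal sketch of the stated convention, not the repo's encoder; the construction of `J` and the constant offset follow the algebra above (x = (1+s)/2 turns each pair coefficient Q_ij into a coupling Q_ij/4 plus a linear field h_i = Q_ii/2 + Σ_{j≠i} Q_ij/4, which the ancilla absorbs):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Symmetric Q, each pair coefficient stored once (then mirrored), per
# the convention H = sum_i Q_ii x_i + sum_{i<j} Q_ij x_i x_j.
Q = np.zeros((n, n))
iu = np.triu_indices(n, 1)
Q[iu] = rng.normal(size=iu[0].size)
Q = Q + Q.T
np.fill_diagonal(Q, rng.normal(size=n))

def qubo_energy(Q, x):
    off = sum(Q[i, j] * x[i] * x[j]
              for i in range(len(x)) for j in range(i + 1, len(x)))
    return np.diag(Q) @ x + off

# x = (1+s)/2: quadratic coeff becomes Q_ij/4, linear field
# h_i = Q_ii/2 + sum_{j != i} Q_ij/4, plus a constant offset.
h = np.diag(Q) / 2 + (Q.sum(axis=1) - np.diag(Q)) / 4
const = np.diag(Q).sum() / 2 + Q[iu].sum() / 4

# Absorb h into couplings to an ancilla spin s_0 (last index), so that
# with E = -1/2 s^T J s and s_0 = +1 the field term is reproduced.
# Each off-diagonal pair appears twice in s^T J s, so -1/2 s^T J s
# contributes -J_ij s_i s_j per pair: J_ij = -Q_ij/4, J_anc = -h.
J = np.zeros((n + 1, n + 1))
J[:n, :n] = -Q / 4
np.fill_diagonal(J, 0.0)
J[n, :n] = J[:n, n] = -h

s = rng.choice([-1.0, 1.0], size=n)
s_ext = np.append(s, 1.0)   # ancilla on +1 (the gauge flip handles -1)
x = (1 + s) / 2

e_ising = -0.5 * s_ext @ J @ s_ext + const
assert np.isclose(qubo_energy(Q, x), e_ising)
print("QUBO and Ising energies agree")
```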

decode_coloring() handles the s → -s global gauge: PT routinely finds
the all-spins-flipped equivalent of the planted ground state, which
has identical Ising energy. Gauge-flip if ancilla landed on -1 before
argmax-decoding the one-hot blocks.
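The gauge-flip decode can be sketched in a few lines. Layout assumed here (n one-hot blocks of k spins, ancilla last) matches the description above, but the function signature is hypothetical, not the repo's `decode_coloring`:

```python
import numpy as np

def decode_coloring(s, n, k):
    """Gauge-flip decode sketch: undo the global s -> -s flip if the
    ancilla landed on -1, then argmax each one-hot block."""
    s = np.asarray(s)
    if s[-1] == -1:        # PT found the globally flipped partner state
        s = -s             # identical Ising energy, so just un-gauge it
    blocks = s[:-1].reshape(n, k)
    return blocks.argmax(axis=1)   # one-hot block -> color index

# Planted coloring, one-hot encoded in +/-1 spins, then globally flipped.
n, k = 5, 3
colors = np.array([0, 2, 1, 1, 0])
spins = -np.ones(n * k + 1)
spins[np.arange(n) * k + colors] = 1.0
spins[-1] = 1.0            # ancilla
flipped = -spins           # energy-equivalent gauge partner

assert (decode_coloring(spins, n, k) == colors).all()
assert (decode_coloring(flipped, n, k) == colors).all()
print("gauge-flip decode recovers planted coloring")
```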

Baseline (vanilla single-spin PT, K=8, 100×100 sweeps × 4 seeds):

  n=100  edges= 450:  conflicts mean=15.2  (3% of edges)
  n=150  edges= 675:  conflicts mean=57.5  (9%)
  n=200  edges= 900:  conflicts mean=117   (13%)
  n=250  edges=1125:  conflicts mean=138   (12%)

The wall is at n≥200 — clear room for structured proposals to crack.
Next PR adds vertex_recolor (2-spin coordinated flip mapping feasible
permutations to feasible permutations); subsequent PR updates FINDINGS
entry 1 with the new portability datapoint.

Reproducible:
    python -m tools.coloring_bench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the existing CoralEvaluator (TCP-shipped matmul to coral_server.py
on the Edge TPU) into the truth bench. Same J, same e_reference, same
algorithm; only the per-step matmul backend differs. Both branches now
emit ms_per_step in the per-seed JSON and summary, so the throughput
delta is in-band with the gap-to-truth number.
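The in-band throughput measurement might look like this. A sketch only: the harness's real loop and update rule live in tools/truth_bench.py, and `timed_steps` plus the stand-in sign update are hypothetical:

```python
import time
import numpy as np

def timed_steps(matmul, J, s, steps):
    """Run `steps` matmul-steps through a pluggable backend and report
    ms_per_step alongside the result, so the throughput number lands in
    the same per-seed record as the quality numbers."""
    t0 = time.perf_counter()
    for _ in range(steps):
        field = matmul(J, s)               # backend-specific: numpy or coral
        s = np.sign(field + (field == 0))  # stand-in update, avoids sign(0)=0
    ms_per_step = (time.perf_counter() - t0) * 1e3 / steps
    return s, ms_per_step

rng = np.random.default_rng(42)
N = 256
J = rng.normal(size=(N, N))
J = (J + J.T) / 2
s = rng.choice([-1.0, 1.0], size=N)

s_out, ms = timed_steps(lambda J, s: J @ s, J, s, steps=200)
print({"backend": "numpy", "ms_per_step": round(ms, 4)})
```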

Runtime plan baked into the docstring: tunnel + coral-server-edgetpu +
truth_bench --backend coral, in that order.

Host smoke (M3 Pro, N=256, Wishart α=0.5, --steps 200):
    ms_per_step: ~30 µs

Next: run on the Coral once entropy capture pauses to measure the
host-vs-TPU per-step delta at N=1024, then re-bench Wishart at matched
wall-clock to see whether the throughput buys us less of the 5.5% gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured host-M3Pro vs Coral-TPU-over-SSH-tunnel at N=1024 with the
new --backend coral wiring. Operationally the V3 path is ~209× slower
per matmul-step (68.8 ms vs 0.33 ms). The TPU itself isn't the
problem — local Coral matmul is 2.6× faster than the Coral's own ARM
CPU at N=1024 — but the per-step SSH+TCP round-trip dwarfs the
compute.

Retires the planned "TPU-accelerated truth-bench" experiment named in
FINDINGS entry 4's followup. The architecturally-clean fix is to
co-locate the annealer on the Coral; that is substantive work and
explicitly deferred.

Data: docs/scaling/wishart_n1024_{host,coral_throughput}.jsonl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architectural fix named in FINDINGS entry 5: move the whole inner MTM
loop onto the Coral so the per-step TCP round-trip becomes a per-anneal
round-trip. The TPU stays the per-step matmul accelerator; the
Boltzmann selection, Metropolis accept, and energy bookkeeping all run
locally on the Coral ARM.

New files
- coral/coral_anneal.py — standalone Py3.7 / numpy 1.16 compatible
  MTM anneal. Reproduces the baked-in random_spin_glass instance and
  recomputes exact energies so they match the host-side computation.
  `from __future__ import annotations` keeps the modern type-hint
  syntax non-evaluated on Py3.7.
- tools/coral_anneal_bench.py — host-side harness comparing host MTM
  vs co-located Coral MTM on the same J. Reports ms_per_step + e_best
  per side.

Modified
- coral/coral_server.py — adds OP_ANNEAL (0x03). Reuses the existing
  TFLite interpreter and cached J; imports run_mtm_anneal lazily.
- coral/coral_client.py — adds .anneal() method, single round-trip
  per anneal regardless of step count.

Wire layout for OP_ANNEAL:
  client -> server: byte 0x03, u32 steps, f64 beta_min, f64 beta_max,
                    u32 top_k, u32 top_k_min (0xffffffff = None), u32 seed
  server -> client: byte 0x01 ack, i64 e_best, f64 ms_total,
                    f64 ms_per_step, N int8 s_best
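The wire layout packs/unpacks directly with `struct`. A sketch under one stated assumption: byte order is little-endian (`<`) here, which the commit message doesn't specify — check coral_server.py for the real choice:

```python
import struct

# Field layout from the commit message; '<' = little-endian, no padding.
REQ = struct.Struct("<BIddIII")    # op, steps, beta_min, beta_max,
                                   # top_k, top_k_min, seed
RESP_HEAD = struct.Struct("<bqdd") # ack, e_best, ms_total, ms_per_step
                                   # ...followed by N int8 s_best

NONE_SENTINEL = 0xFFFFFFFF         # top_k_min = None on the wire

def pack_anneal_request(steps, beta_min, beta_max, top_k, top_k_min, seed):
    tkm = NONE_SENTINEL if top_k_min is None else top_k_min
    return REQ.pack(0x03, steps, beta_min, beta_max, top_k, tkm, seed)

def unpack_anneal_response(buf, n):
    ack, e_best, ms_total, ms_per_step = RESP_HEAD.unpack_from(buf)
    s_best = struct.unpack_from(f"<{n}b", buf, RESP_HEAD.size)
    return ack, e_best, ms_total, ms_per_step, s_best

# Round-trip check with example values from this PR's host baseline.
req = pack_anneal_request(3200, 0.1, 5.0, 8, None, 42)
assert req[0] == 0x03 and len(req) == REQ.size

resp = RESP_HEAD.pack(0x01, -14857, 138560.0, 43.3) \
     + struct.pack("<4b", 1, -1, -1, 1)
ack, e_best, ms_total, ms_per_step, s_best = unpack_anneal_response(resp, 4)
assert (ack, e_best, s_best) == (0x01, -14857, (1, -1, -1, 1))
```

One round-trip per anneal means the request is a fixed 33 bytes regardless of step count, which is the whole point of OP_ANNEAL.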

Host baseline collected on the same instance (random_spin_glass
N=1024 d=0.4 s=42), 4 seeds × 3200 steps: ms_per_step 324 µs,
e_best mean -14463, min -14857. Coral-side numbers pending a
coral_server restart to pick up the new opcode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-located Coral MTM (OP_ANNEAL) takes 43.3 ms/step vs 68.8 ms/step
for per-step-over-SSH — exactly the improvement that FINDINGS entry
5's network-round-trip diagnosis predicted. Host NumPy on M3 Pro is
still 133× faster (0.32 ms/step). The new bottleneck is the Coral
ARM CPU running the Boltzmann selection + exact-energy bookkeeping
at ~41 ms/step; the TPU matmul itself is ~2 ms.

Quality is at parity (Coral mean -14749 vs host -14463 on the same
random_spin_glass N=1024 d=0.4 s=42 instance, 4 seeds × 3200 steps);
the difference is within seed-to-seed variance. The TPU's int8
quantization on ΔE acts like incidental top-k noise that broadens
exploration.

Closes the V3 architectural-fix arc named in entry 5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detached signature ops over pqcrypto.sign.{ml_dsa_44,ml_dsa_65,
ml_dsa_87,sphincs_sha2_128f_simple,sphincs_sha2_128s_simple}. Default
algo is ml-dsa-65 (FIPS 204, NIST L3 / Dilithium3 equivalent).

API shape mirrors lopt keygen: --sk-hex / --pk-hex consume the hex
output of `lopt keygen --format hex`. Message input: --message-hex,
--message-file, or stdin (binary, default fallback).

  lopt keygen --algo ml-dsa-65 > key.txt
  lopt sign   --algo ml-dsa-65 --sk-hex $SK --message-file msg.bin > sig.hex
  lopt verify --algo ml-dsa-65 --pk-hex $PK --signature-hex $(cat sig.hex) \
              --message-file msg.bin   # exits 0 on OK, 1 on FAIL

Correctness verified end-to-end at NIST L3 (pk=1952B, sk=4032B,
sig=3309B). Verify returns OK on a correct (pk, message, sig) tuple
and FAIL on a tampered message; exit code matches.

PQClean's ML-DSA is hedged (fresh OS randomness mixed per signature),
so successive signs of the same (sk, message) yield DIFFERENT bytes;
both verify. SLH-DSA-simple variants are deterministic and reproduce
bit-identically. Docstrings explicit on this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wock9000 wock9000 merged commit 596a485 into trunk May 14, 2026
1 check passed
@wock9000 wock9000 deleted the v3-throughput-bench branch May 14, 2026 06:44