
V3 throughput bench + FINDINGS entry 5: network bottleneck retires planned experiment #47

Merged
wock9000 merged 6 commits into trunk from v3-throughput-bench
May 14, 2026
Conversation

@wock9000
Contributor

Summary

Wires --backend {numpy,coral} into tools/truth_bench.py (host-side prep from prior PR #46, now closed), then runs the host-vs-Coral comparison at N=1024 Wishart and discovers that the V3 backend over SSH-tunneled TCP is ~209× slower per step than host NumPy on Apple Silicon.

That retires the planned "TPU-accelerated Wishart truth-bench" experiment (named as the follow-up in FINDINGS entry 4) on measured evidence. New FINDINGS entry 5 documents the result and names the architectural fix (co-locate annealer on Coral).

Headline numbers (N=1024, ms per matmul-step)

| backend | ms / step |
| --- | --- |
| Host (M3 Pro) NumPy, in truth-bench loop | 0.33 |
| Coral CPU (ARM) NumPy | 5.59 |
| Coral Edge TPU, local | 2.13 |
| Coral Edge TPU, via SSH-tunneled TCP | 68.78 |

  • TPU vs Coral ARM CPU: 2.6× faster at N=1024 (the V3 hardware claim holds).
  • Host NumPy vs TPU local: 6.5× faster (modern host beats the small TPU at this N).
  • Host NumPy vs operational V3 (TPU over network): 209× faster (network dominates).
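The headline ratios follow directly from the table; a quick arithmetic check (values copied from the table above, ratios rounded as in the PR body):

```python
# Per-step latencies from the table above (ms per matmul-step, N=1024).
ms = {
    "host_numpy": 0.33,
    "coral_cpu": 5.59,
    "tpu_local": 2.13,
    "tpu_ssh_tcp": 68.78,
}

# Headline ratios quoted in the PR body.
tpu_vs_coral_cpu = ms["coral_cpu"] / ms["tpu_local"]    # ~2.6x
host_vs_tpu_local = ms["tpu_local"] / ms["host_numpy"]  # ~6.5x
host_vs_v3_path = ms["tpu_ssh_tcp"] / ms["host_numpy"]  # ~209x

print(round(tpu_vs_coral_cpu, 1), round(host_vs_tpu_local, 1))
```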

What's in this PR

What's NOT in this PR

  • Coral-side annealer co-location — substantive work, deferred
  • Quality-comparison re-bench with a Wishart-J-baked TPU model — only ms_per_step is load-bearing in this entry; gap_rel from the coral path is unreliable because the baked-in J is random_spin_glass, not Wishart

Test plan

  • Host path unchanged at N=256 smoke test
  • Coral path returns valid v vectors at N=1024 (smoke test before benching)
  • FINDINGS entry 5 reflects what was actually measured (no extrapolation)

🤖 Generated with Claude Code

wock9000 and others added 6 commits May 13, 2026 07:13
…e bench

New problem class for the truth harness. Planted random k-colorable
graphs (k colors assigned uniformly, edges drawn only between
differently-colored pairs → planted coloring feasible by construction)
encoded as Ising via one-hot penalty with n·k+1 spins (one ancilla
absorbs the linear field, same layout as QAP/TSP).

QUBO convention used: H = Σ Q_ii x_i + Σ_{i<j} Q_ij x_i x_j, with
symmetric Q where each off-diagonal entry holds the pair coefficient
exactly once (no double-counting). Standard substitution x = (1+s)/2
and ancilla linear-field absorption yields E = -½ sᵀJs.
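The substitution and ancilla absorption can be checked numerically. This is a minimal sketch of the stated convention, not the repo's encoder; the construction of `J` and the constant offset follow the algebra above (x = (1+s)/2 turns each pair coefficient Q_ij into a coupling Q_ij/4 plus a linear field h_i = Q_ii/2 + Σ_{j≠i} Q_ij/4, which the ancilla absorbs):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Symmetric Q, each pair coefficient stored once (then mirrored), per
# the convention H = sum_i Q_ii x_i + sum_{i<j} Q_ij x_i x_j.
Q = np.zeros((n, n))
iu = np.triu_indices(n, 1)
Q[iu] = rng.normal(size=iu[0].size)
Q = Q + Q.T
np.fill_diagonal(Q, rng.normal(size=n))

def qubo_energy(Q, x):
    off = sum(Q[i, j] * x[i] * x[j]
              for i in range(len(x)) for j in range(i + 1, len(x)))
    return np.diag(Q) @ x + off

# x = (1+s)/2: quadratic coeff becomes Q_ij/4, linear field
# h_i = Q_ii/2 + sum_{j != i} Q_ij/4, plus a constant offset.
h = np.diag(Q) / 2 + (Q.sum(axis=1) - np.diag(Q)) / 4
const = np.diag(Q).sum() / 2 + Q[iu].sum() / 4

# Absorb h into couplings to an ancilla spin s_0 (last index), so that
# with E = -1/2 s^T J s and s_0 = +1 the field term is reproduced.
# Each off-diagonal pair appears twice in s^T J s, so -1/2 s^T J s
# contributes -J_ij s_i s_j per pair: J_ij = -Q_ij/4, J_anc = -h.
J = np.zeros((n + 1, n + 1))
J[:n, :n] = -Q / 4
np.fill_diagonal(J, 0.0)
J[n, :n] = J[:n, n] = -h

s = rng.choice([-1.0, 1.0], size=n)
s_ext = np.append(s, 1.0)   # ancilla on +1 (the gauge flip handles -1)
x = (1 + s) / 2

e_ising = -0.5 * s_ext @ J @ s_ext + const
assert np.isclose(qubo_energy(Q, x), e_ising)
print("QUBO and Ising energies agree")
```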

decode_coloring() handles the s → -s global gauge: PT routinely finds
the all-spins-flipped equivalent of the planted ground state, which
has identical Ising energy. Gauge-flip if ancilla landed on -1 before
argmax-decoding the one-hot blocks.
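The gauge-flip decode can be sketched in a few lines. Layout assumed here (n one-hot blocks of k spins, ancilla last) matches the description above, but the function signature is hypothetical, not the repo's `decode_coloring`:

```python
import numpy as np

def decode_coloring(s, n, k):
    """Gauge-flip decode sketch: undo the global s -> -s flip if the
    ancilla landed on -1, then argmax each one-hot block."""
    s = np.asarray(s)
    if s[-1] == -1:        # PT found the globally flipped partner state
        s = -s             # identical Ising energy, so just un-gauge it
    blocks = s[:-1].reshape(n, k)
    return blocks.argmax(axis=1)   # one-hot block -> color index

# Planted coloring, one-hot encoded in +/-1 spins, then globally flipped.
n, k = 5, 3
colors = np.array([0, 2, 1, 1, 0])
spins = -np.ones(n * k + 1)
spins[np.arange(n) * k + colors] = 1.0
spins[-1] = 1.0            # ancilla
flipped = -spins           # energy-equivalent gauge partner

assert (decode_coloring(spins, n, k) == colors).all()
assert (decode_coloring(flipped, n, k) == colors).all()
print("gauge-flip decode recovers planted coloring")
```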

Baseline (vanilla single-spin PT, K=8, 100×100 sweeps × 4 seeds):

  n=100  edges= 450:  conflicts mean=15.2  (3% of edges)
  n=150  edges= 675:  conflicts mean=57.5  (9%)
  n=200  edges= 900:  conflicts mean=117   (13%)
  n=250  edges=1125:  conflicts mean=138   (12%)

The wall is at n≥200 — clear room for structured proposals to crack.
Next PR adds vertex_recolor (2-spin coordinated flip mapping feasible
permutations to feasible permutations); subsequent PR updates FINDINGS
entry 1 with the new portability datapoint.

Reproducible:
    python -m tools.coloring_bench

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the existing CoralEvaluator (TCP-shipped matmul to coral_server.py
on the Edge TPU) into the truth bench. Same J, same e_reference, same
algorithm; only the per-step matmul backend differs. Both branches now
emit ms_per_step in the per-seed JSON and summary, so the throughput
delta is in-band with the gap-to-truth number.
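The in-band throughput measurement might look like this. A sketch only: the harness's real loop and update rule live in tools/truth_bench.py, and `timed_steps` plus the stand-in sign update are hypothetical:

```python
import time
import numpy as np

def timed_steps(matmul, J, s, steps):
    """Run `steps` matmul-steps through a pluggable backend and report
    ms_per_step alongside the result, so the throughput number lands in
    the same per-seed record as the quality numbers."""
    t0 = time.perf_counter()
    for _ in range(steps):
        field = matmul(J, s)               # backend-specific: numpy or coral
        s = np.sign(field + (field == 0))  # stand-in update, avoids sign(0)=0
    ms_per_step = (time.perf_counter() - t0) * 1e3 / steps
    return s, ms_per_step

rng = np.random.default_rng(42)
N = 256
J = rng.normal(size=(N, N))
J = (J + J.T) / 2
s = rng.choice([-1.0, 1.0], size=N)

s_out, ms = timed_steps(lambda J, s: J @ s, J, s, steps=200)
print({"backend": "numpy", "ms_per_step": round(ms, 4)})
```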

Runtime plan baked into the docstring: tunnel + coral-server-edgetpu +
truth_bench --backend coral, in that order.

Host smoke (M3 Pro, N=256, Wishart α=0.5, --steps 200):
    ms_per_step: ~30 µs

Next: run on the Coral once entropy capture pauses to measure the
host-vs-TPU per-step delta at N=1024, then re-bench Wishart at matched
wall-clock to see whether the throughput buys us less of the 5.5% gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured host-M3Pro vs Coral-TPU-over-SSH-tunnel at N=1024 with the
new --backend coral wiring. Operationally the V3 path is ~209× slower
per matmul-step (68.8 ms vs 0.33 ms). The TPU itself isn't the
problem — local Coral matmul is 2.6× faster than the Coral's own ARM
CPU at N=1024 — but the per-step SSH+TCP round-trip dwarfs the
compute.

Retires the planned "TPU-accelerated truth-bench" experiment named in
FINDINGS entry 4's followup. The architecturally-clean fix is to
co-locate the annealer on the Coral; that is substantive work and
explicitly deferred.

Data: docs/scaling/wishart_n1024_{host,coral_throughput}.jsonl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architectural fix named in FINDINGS entry 5: move the whole inner MTM
loop onto the Coral so the per-step TCP round-trip becomes a per-anneal
round-trip. The TPU stays the per-step matmul accelerator; the
Boltzmann selection, Metropolis accept, and energy bookkeeping all run
locally on the Coral ARM.

New files
- coral/coral_anneal.py — standalone Py3.7 / numpy 1.16 compatible
  MTM anneal. Reproduces the baked-in random_spin_glass instance and
  recomputes exact energies so they match the host-side computation.
  `from __future__ import annotations` keeps the modern type-hint
  syntax non-evaluated on Py3.7.
- tools/coral_anneal_bench.py — host-side harness comparing host MTM
  vs co-located Coral MTM on the same J. Reports ms_per_step + e_best
  per side.

Modified
- coral/coral_server.py — adds OP_ANNEAL (0x03). Reuses the existing
  TFLite interpreter and cached J; imports run_mtm_anneal lazily.
- coral/coral_client.py — adds .anneal() method, single round-trip
  per anneal regardless of step count.

Wire layout for OP_ANNEAL:
  client -> server: byte 0x03, u32 steps, f64 beta_min, f64 beta_max,
                    u32 top_k, u32 top_k_min (0xffffffff = None), u32 seed
  server -> client: byte 0x01 ack, i64 e_best, f64 ms_total,
                    f64 ms_per_step, N int8 s_best
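The wire layout packs/unpacks directly with `struct`. A sketch under one stated assumption: byte order is little-endian (`<`) here, which the commit message doesn't specify — check coral_server.py for the real choice:

```python
import struct

# Field layout from the commit message; '<' = little-endian, no padding.
REQ = struct.Struct("<BIddIII")    # op, steps, beta_min, beta_max,
                                   # top_k, top_k_min, seed
RESP_HEAD = struct.Struct("<bqdd") # ack, e_best, ms_total, ms_per_step
                                   # ...followed by N int8 s_best

NONE_SENTINEL = 0xFFFFFFFF         # top_k_min = None on the wire

def pack_anneal_request(steps, beta_min, beta_max, top_k, top_k_min, seed):
    tkm = NONE_SENTINEL if top_k_min is None else top_k_min
    return REQ.pack(0x03, steps, beta_min, beta_max, top_k, tkm, seed)

def unpack_anneal_response(buf, n):
    ack, e_best, ms_total, ms_per_step = RESP_HEAD.unpack_from(buf)
    s_best = struct.unpack_from(f"<{n}b", buf, RESP_HEAD.size)
    return ack, e_best, ms_total, ms_per_step, s_best

# Round-trip check with example values from this PR's host baseline.
req = pack_anneal_request(3200, 0.1, 5.0, 8, None, 42)
assert req[0] == 0x03 and len(req) == REQ.size

resp = RESP_HEAD.pack(0x01, -14857, 138560.0, 43.3) \
     + struct.pack("<4b", 1, -1, -1, 1)
ack, e_best, ms_total, ms_per_step, s_best = unpack_anneal_response(resp, 4)
assert (ack, e_best, s_best) == (0x01, -14857, (1, -1, -1, 1))
```

One round-trip per anneal means the request is a fixed 33 bytes regardless of step count, which is the whole point of OP_ANNEAL.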

Host baseline collected on the same instance (random_spin_glass
N=1024 d=0.4 s=42), 4 seeds × 3200 steps: ms_per_step 324 µs,
e_best mean -14463, min -14857. Coral-side numbers pending a
coral_server restart to pick up the new opcode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-located Coral MTM (OP_ANNEAL) takes 43.3 ms/step vs 68.8 ms/step
for per-step-over-SSH — exactly the improvement that FINDINGS entry
5's network-round-trip diagnosis predicted. Host NumPy on M3 Pro is
still 133× faster (0.32 ms/step). The new bottleneck is the Coral
ARM CPU running the Boltzmann selection + exact-energy bookkeeping
at ~41 ms/step; the TPU matmul itself is ~2 ms.

Quality is at parity (Coral mean -14749 vs host -14463 on the same
random_spin_glass N=1024 d=0.4 s=42 instance, 4 seeds × 3200 steps);
the difference is within seed-to-seed variance. The TPU's int8
quantization on ΔE acts like incidental top-k noise that broadens
exploration.

Closes the V3 architectural-fix arc named in entry 5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detached signature ops over pqcrypto.sign.{ml_dsa_44,ml_dsa_65,
ml_dsa_87,sphincs_sha2_128f_simple,sphincs_sha2_128s_simple}. Default
algo is ml-dsa-65 (FIPS 204, NIST L3 / Dilithium3 equivalent).

API shape mirrors lopt keygen: --sk-hex / --pk-hex consume the hex
output of `lopt keygen --format hex`. Message input: --message-hex,
--message-file, or stdin (binary, default fallback).

  lopt keygen --algo ml-dsa-65 > key.txt
  lopt sign   --algo ml-dsa-65 --sk-hex $SK --message-file msg.bin > sig.hex
  lopt verify --algo ml-dsa-65 --pk-hex $PK --signature-hex $(cat sig.hex) \
              --message-file msg.bin   # exits 0 on OK, 1 on FAIL

Correctness verified end-to-end at NIST L3 (pk=1952B, sk=4032B,
sig=3309B). Verify returns OK on a correct (pk, message, sig) tuple
and FAIL on a tampered message; exit code matches.

PQClean's ML-DSA is hedged (fresh OS randomness mixed per signature),
so successive signs of the same (sk, message) yield DIFFERENT bytes;
both verify. SLH-DSA-simple variants are deterministic and reproduce
bit-identically. Docstrings explicit on this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wock9000 wock9000 merged commit 596a485 into trunk May 14, 2026
1 check passed
@wock9000 wock9000 deleted the v3-throughput-bench branch May 14, 2026 06:44