V3 throughput bench + FINDINGS entry 5: network bottleneck retires planned experiment#47
New problem class for the truth harness. Planted random k-colorable
graphs (k colors assigned uniformly, edges drawn only between
differently-colored pairs → planted coloring feasible by construction)
encoded as Ising via one-hot penalty with n·k+1 spins (one ancilla
absorbs the linear field, same layout as QAP/TSP).
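The planted construction above can be sketched in a few lines. This is illustrative only: the function name, signature, and edge-count parameter are hypothetical, not the repo's generator API.

```python
import numpy as np

def planted_k_colorable(n, k, n_edges, seed=0):
    """Hypothetical sketch of the planted construction: colors assigned
    uniformly, edges drawn only between differently-colored pairs, so the
    planted coloring is feasible by construction."""
    rng = np.random.default_rng(seed)
    colors = rng.integers(0, k, size=n)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.integers(0, n, size=2)
        if u != v and colors[u] != colors[v]:   # never join same-colored vertices
            edges.add((min(u, v), max(u, v)))
    return colors, sorted(edges)
```

By construction the returned `colors` array is a zero-conflict coloring of the returned edge list, which is what makes the instance usable as ground truth.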
QUBO convention used: H = Σ Q_ii x_i + Σ_{i<j} Q_ij x_i x_j, with
symmetric Q where each off-diagonal entry holds the pair coefficient
exactly once (no double-counting). Standard substitution x = (1+s)/2
and ancilla linear-field absorption yields E = -½ sᵀJs.
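That convention can be sanity-checked numerically. A minimal sketch, assuming spin 0 is the ancilla and dropping the constant offset (both are assumptions; the repo's actual layout may differ):

```python
import numpy as np

def qubo_to_ising(Q):
    """Sketch of the stated convention. Given symmetric Q with
    H(x) = sum_i Q_ii x_i + sum_{i<j} Q_ij x_i x_j, substitute x = (1+s)/2
    and absorb the resulting linear field into couplings to an ancilla spin
    (index 0 here, an assumption). Returns J with E = -1/2 s^T J s equal to
    H up to a dropped constant whenever s[0] = +1."""
    n = Q.shape[0]
    offdiag = Q - np.diag(np.diag(Q))
    h = np.diag(Q) / 2 + offdiag.sum(axis=1) / 4   # linear field per spin
    J = np.zeros((n + 1, n + 1))
    J[1:, 1:] = -offdiag / 4                       # pair couplings
    J[0, 1:] = J[1:, 0] = -h                       # ancilla absorbs the field
    return J
```

A quick check on a random Q and random bitstring confirms the energies agree once the dropped constant is accounted for.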
decode_coloring() handles the s → -s global gauge: PT routinely finds
the all-spins-flipped equivalent of the planted ground state, which
has identical Ising energy. Gauge-flip if ancilla landed on -1 before
argmax-decoding the one-hot blocks.
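The gauge-fixed decode can be sketched as follows, under an assumed layout (ancilla at index 0, then n one-hot blocks of k spins with +1 marking the chosen color); the real decode_coloring() may index differently.

```python
import numpy as np

def decode_coloring(s, n, k):
    """Sketch: undo the s -> -s global gauge (PT routinely returns the
    all-flipped equivalent, which has identical Ising energy), then argmax
    each one-hot block. Layout assumptions: s[0] is the ancilla, s[1:]
    holds n blocks of k spins."""
    s = np.asarray(s)
    if s[0] == -1:
        s = -s                               # gauge-flip so the ancilla is +1
    return s[1:].reshape(n, k).argmax(axis=1)
```

Decoding s and -s then yields the same coloring, which is the invariance the gauge fix is there to guarantee.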
Baseline (vanilla single-spin PT, K=8, 100×100 sweeps × 4 seeds):
n=100 edges= 450: conflicts mean=15.2 (3% of edges)
n=150 edges= 675: conflicts mean=57.5 (9%)
n=200 edges= 900: conflicts mean=117 (13%)
n=250 edges=1125: conflicts mean=138 (12%)
The wall is at n≥200 — clear room for structured proposals to crack.
Next PR adds vertex_recolor (a 2-spin coordinated flip mapping feasible
one-hot assignments to feasible one-hot assignments); a subsequent PR
updates FINDINGS entry 1 with the new portability datapoint.
Reproducible:
python -m tools.coloring_bench
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the existing CoralEvaluator (TCP-shipped matmul to coral_server.py
on the Edge TPU) into the truth bench. Same J, same e_reference, same
algorithm; only the per-step matmul backend differs. Both branches now
emit ms_per_step in the per-seed JSON and summary, so the throughput
delta is in-band with the gap-to-truth number.
Runtime plan baked into the docstring: tunnel + coral-server-edgetpu +
truth_bench --backend coral, in that order.
Host smoke (M3 Pro, N=256, Wishart α=0.5, --steps 200):
ms_per_step: ~0.03 ms (≈30 µs)
Next: run on the Coral once entropy capture pauses to measure the
host-vs-TPU per-step delta at N=1024, then re-bench Wishart at matched
wall-clock to see whether the extra throughput closes any of the 5.5% gap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measured host-M3Pro vs Coral-TPU-over-SSH-tunnel at N=1024 with the
new --backend coral wiring. Operationally the V3 path is ~209× slower
per matmul-step (68.8 ms vs 0.33 ms). The TPU itself isn't the
problem — local Coral matmul is 2.6× faster than the Coral's own ARM
CPU at N=1024 — but the per-step SSH+TCP round-trip dwarfs the
compute.
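The headline ratio follows directly from the two quoted per-step timings (values copied from the text, not re-measured; since the inputs are rounded, the computed ratio lands within a count of the ~209× figure):

```python
# Back-of-envelope from the numbers above. The two per-step timings alone
# reproduce the headline slowdown, which is why the fix has to target
# transport (the per-step round-trip), not compute.
remote_ms = 68.8   # operational V3 path: one SSH+TCP round-trip per matmul-step
host_ms = 0.33     # host NumPy per matmul-step (M3 Pro)
slowdown = remote_ms / host_ms
print(f"operational slowdown: ~{slowdown:.0f}x per step")
```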
Retires the planned "TPU-accelerated truth-bench" experiment named in
FINDINGS entry 4's followup. The architecturally-clean fix is to
co-locate the annealer on the Coral; that is substantive work and
explicitly deferred.
Data: docs/scaling/wishart_n1024_{host,coral_throughput}.jsonl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Architectural fix named in FINDINGS entry 5: move the whole inner MTM
loop onto the Coral so the per-step TCP round-trip becomes a per-anneal
round-trip. The TPU stays the per-step matmul accelerator; the
Boltzmann selection, Metropolis accept, and energy bookkeeping all run
locally on the Coral ARM.
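One inner-loop step of that split can be sketched as follows. This is an illustrative simplification, not the repo's exact MTM rule: the `J @ s` matmul is the single op the TPU accelerates, and everything after it is the selection, accept, and energy bookkeeping that stays local on the Coral ARM.

```python
import numpy as np

def mtm_step(J, s, beta, top_k, rng):
    """Simplified sketch of one multiple-try step (the repo's acceptance
    rule may differ). Mutates s in place; returns the realized energy
    change so the caller can track a running energy."""
    dE = 2.0 * s * (J @ s)                 # single-flip energy changes (one matmul)
    cand = np.argsort(dE)[:top_k]          # restrict to the top-k best flips
    w = np.exp(-beta * (dE[cand] - dE[cand].min()))
    i = rng.choice(cand, p=w / w.sum())    # Boltzmann selection among candidates
    if dE[i] <= 0 or rng.random() < np.exp(-beta * dE[i]):  # Metropolis accept
        s[i] = -s[i]
        return dE[i]
    return 0.0
```

The returned ΔE values are exact, so the running energy stays consistent with a full -½ sᵀJs recompute; that is the "exact-energy bookkeeping" referred to above.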
New files
- coral/coral_anneal.py — standalone MTM anneal, compatible with Py3.7 /
  numpy 1.16. Reproduces the baked-in random_spin_glass instance and
  recomputes exact energies equivalently to the host path. `from __future__
  import annotations` keeps the modern type-hint syntax unevaluated on 3.7.
- tools/coral_anneal_bench.py — host-side harness comparing host MTM
vs co-located Coral MTM on the same J. Reports ms_per_step + e_best
per side.
Modified
- coral/coral_server.py — adds OP_ANNEAL (0x03). Reuses the existing
TFLite interpreter and cached J; imports run_mtm_anneal lazily.
- coral/coral_client.py — adds .anneal() method, single round-trip
per anneal regardless of step count.
Wire layout for OP_ANNEAL:
client -> server: byte 0x03, u32 steps, f64 beta_min, f64 beta_max,
u32 top_k, u32 top_k_min (0xffffffff = None), u32 seed
server -> client: byte 0x01 ack, i64 e_best, f64 ms_total,
f64 ms_per_step, N int8 s_best
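A Python `struct` sketch of that framing. Little-endian byte order and packed (unaligned) fields are assumptions here; the wire spec above doesn't state endianness or alignment.

```python
import struct

REQ_FMT = "<BIddIII"   # opcode, u32 steps, f64 beta_min, f64 beta_max,
                       # u32 top_k, u32 top_k_min, u32 seed  (33 bytes packed)
RESP_HDR = "<Bqdd"     # ack, i64 e_best, f64 ms_total, f64 ms_per_step;
                       # followed by N int8 s_best
NONE_SENTINEL = 0xFFFFFFFF   # top_k_min = None on the wire

def pack_anneal_request(steps, beta_min, beta_max, top_k, top_k_min, seed):
    tk_min = NONE_SENTINEL if top_k_min is None else top_k_min
    return struct.pack(REQ_FMT, 0x03, steps, beta_min, beta_max,
                       top_k, tk_min, seed)

def unpack_anneal_response(payload):
    n = struct.calcsize(RESP_HDR)
    ack, e_best, ms_total, ms_per_step = struct.unpack(RESP_HDR, payload[:n])
    s_best = struct.unpack(f"{len(payload) - n}b", payload[n:])
    return ack, e_best, ms_total, ms_per_step, s_best
```

One request of this shape replaces steps-many per-matmul round-trips, which is the whole point of OP_ANNEAL.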
Host baseline collected on the same instance (random_spin_glass
N=1024 d=0.4 s=42), 4 seeds × 3200 steps: ms_per_step 324 µs,
e_best mean -14463, min -14857. Coral-side numbers pending a
coral_server restart to pick up the new opcode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-located Coral MTM (OP_ANNEAL) takes 43.3 ms/step vs 68.8 ms/step for
the per-step-over-SSH path — exactly the improvement that FINDINGS entry
5's network-round-trip diagnosis predicted. Host NumPy on the M3 Pro is
still 133× faster (0.32 ms/step). The new bottleneck is the Coral ARM CPU
running the Boltzmann selection + exact-energy bookkeeping at ~41 ms/step;
the TPU matmul itself is ~2 ms.
Quality is at parity (Coral mean -14749 vs host -14463 on the same
random_spin_glass N=1024 d=0.4 s=42 instance, 4 seeds × 3200 steps). The
TPU's int8 quantization on ΔE acts like incidental top-k noise that
broadens exploration; the difference is within seed-to-seed variance.
Closes the V3 architectural-fix arc named in entry 5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Detached signature ops over pqcrypto.sign.{ml_dsa_44,ml_dsa_65,
ml_dsa_87,sphincs_sha2_128f_simple,sphincs_sha2_128s_simple}. Default
algo is ml-dsa-65 (FIPS 204, NIST L3 / Dilithium3 equivalent).
API shape mirrors lopt keygen: --sk-hex / --pk-hex consume the hex
output of `lopt keygen --format hex`. Message input: --message-hex,
--message-file, or stdin (binary, default fallback).
lopt keygen --algo ml-dsa-65 > key.txt
lopt sign --algo ml-dsa-65 --sk-hex $SK --message-file msg.bin > sig.hex
lopt verify --algo ml-dsa-65 --pk-hex $PK --signature-hex $(cat sig.hex) \
--message-file msg.bin # exits 0 on OK, 1 on FAIL
Correctness verified end-to-end at NIST L3 (pk=1952 B, sk=4032 B,
sig=3309 B). Verify returns OK on a correct (pk, message, sig) tuple
and FAIL on a tampered message; the exit code matches.
PQClean's ML-DSA is hedged (fresh OS randomness mixed per signature),
so successive signs of the same (sk, message) yield DIFFERENT bytes;
both verify. SLH-DSA-simple variants are deterministic and reproduce
bit-identically. Docstrings explicit on this.
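The hedged-vs-deterministic contrast can be shown with a toy stdlib construction. This is a symmetric MAC-based scheme, emphatically NOT ML-DSA or SLH-DSA; it only illustrates why mixing fresh randomness into each signature makes successive signatures differ byte-wise while both still verify.

```python
import hashlib
import hmac
import os

# Toy illustration only (symmetric-key, not a real signature scheme).

def sign_hedged(sk: bytes, msg: bytes) -> bytes:
    r = os.urandom(16)                                   # fresh per-signature randomness
    return r + hmac.digest(sk, r + msg, hashlib.sha256)  # randomness rides along

def sign_deterministic(sk: bytes, msg: bytes) -> bytes:
    return hmac.digest(sk, msg, hashlib.sha256)          # bit-identical every call

def verify_hedged(sk: bytes, msg: bytes, sig: bytes) -> bool:
    r, tag = sig[:16], sig[16:]
    return hmac.compare_digest(tag, hmac.digest(sk, r + msg, hashlib.sha256))
```

Two hedged signatures over the same (sk, message) differ in their random prefix but both verify; the deterministic variant reproduces exactly, mirroring the ML-DSA vs SLH-DSA-simple behavior described above.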
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wires `--backend {numpy,coral}` into `tools/truth_bench.py` (host-side
prep from prior PR #46, now closed), then runs the host-vs-Coral
comparison at N=1024 Wishart and discovers that the V3 backend over
SSH-tunneled TCP is ~209× slower per step than host NumPy on Apple
Silicon. That retires the planned "TPU-accelerated Wishart truth-bench"
experiment (named as the follow-up in FINDINGS entry 4) on measured
evidence. New FINDINGS entry 5 documents the result and names the
architectural fix (co-locate the annealer on the Coral).
Headline numbers (N=1024, ms per matmul-step)
- TPU vs Coral ARM CPU: 2.6× faster at N=1024 (the V3 hardware claim holds).
- Host NumPy vs TPU local: 6.5× faster (a modern host beats the small TPU at this N).
- Host NumPy vs operational V3 (TPU over network): 209× faster (network dominates).
What's in this PR
- `tools/truth_bench.py` — `--backend` wiring (carried from PR "truth_bench: --backend {numpy,coral} for V3 host-vs-TPU comparison" #46)
- `docs/scaling/wishart_n1024_{host,coral_throughput}.jsonl` — raw data
- `FINDINGS.md` entry 5

What's NOT in this PR
`ms_per_step` is load-bearing in this entry; `gap_rel` from the coral path is unreliable because the baked-in J is `random_spin_glass`, not Wishart.

Test plan
🤖 Generated with Claude Code