coral_anneal: 13.8× per-step speedup via incremental energy tracking#50

Merged
wock9000 merged 1 commit into trunk from coral-side-perf
May 14, 2026

Conversation

@wock9000
Contributor

Summary

One-line algorithmic fix in `coral_anneal.py`: the previous `run_mtm_anneal` called `exact_energy(spins, J)` on every accepted Metropolis move (O(N²) — a full `-½sᵀJs` recompute), then took the delta. Replaced with incremental `e_cur += dE_true` since we already compute `dE_true` exactly in int64 arithmetic on each accept; the result is bit-equivalent without the recompute.
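A minimal sketch of the pattern (illustrative names and a simplified single-replica loop; the actual `run_mtm_anneal` signature and replica handling differ): one exact O(N²) energy evaluation up front, then exact int64 deltas on every accept.

```python
import numpy as np

def anneal_incremental(J, spins, betas, rng):
    """Metropolis loop with incremental energy tracking.

    Assumes J is symmetric int8 with zero diagonal and spins are +/-1.
    Illustrative sketch only; the real run_mtm_anneal differs.
    """
    J64 = J.astype(np.int64)               # promote once, outside the loop
    s = spins.astype(np.int64).copy()
    n = s.size
    e_cur = -(s @ (J64 @ s)) // 2          # one full O(N^2) recompute, then never again
    e_best = e_cur
    for beta in betas:
        i = rng.integers(n)
        dE_true = 2 * s[i] * (J64[i] @ s)  # exact int64 delta for flipping spin i
        if dE_true <= 0 or rng.random() < np.exp(-beta * dE_true):
            s[i] = -s[i]
            e_cur += dE_true               # the fix: no exact_energy() call
            if e_cur < e_best:
                e_best = e_cur
    return s, e_cur, e_best
```

Because `dE_true` is computed in int64 against an int8 `J`, the running `e_cur` stays bit-equivalent to a full recompute at every step, which is why the e_best parity check below can demand an exact match rather than a tolerance.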

This was FINDINGS entry 6's "Coral-side per-step bookkeeping" bottleneck. The diagnosis (C/Cython rewrite of softmax) was wrong — it was an algorithmic bug in our own code, masquerading as Python/numpy overhead.

Result (same instance + seeds as FINDINGS 6: random_spin_glass N=1024, d=0.4, s=42; 4 seeds × 3200 steps)

| variant | ms/step | e_best mean | host-to-Coral ratio |
| --- | --- | --- | --- |
| Host (M3 Pro) NumPy | 0.34 | −14463 | |
| Coral over-network (PR #47) | 68.78 | (J-mismatch) | 209× |
| Coral co-located v1 (FINDINGS 6) | 43.33 | −14749 | 133× |
| Coral co-located v2 (this PR) | 3.14 | −14749 | 9.4× |

13.8× per-step speedup. Cumulative improvement from the original V3-over-network number: 22×. Quality unchanged.

What's NOT in this PR

  • FINDINGS revisions (the entry 5 over-claim about "retiring the experiment" and the new entry for this win) — separate PR so the perf number can be reviewed in isolation.
  • Wishart-J quality experiment — separate work; requires rebuilding the TPU model with a Wishart-J baked in.
  • Further Coral optimizations (TPU matmul pipelining, batched-replica matmul). The remaining headroom on this arc is ~3× more if pipelined, ~10× more if batched — but the next experiment to run is the quality one, not more optimization.

Test plan

  • e_best parity confirmed against FINDINGS 6's data (−14947 min, −14749 mean — exact match)
  • Quality unchanged (within seed variance)
  • Reproduces from `tools/coral_anneal_bench.py` end-to-end

🤖 Generated with Claude Code

The old code called exact_energy(spins, J) on every accepted move,
which is O(N²) per accept (~10–20ms at N=1024 on the Coral ARM).
Replaced with incremental e_cur += dE_true. Mathematically identical:
integer dE is exact for int8 J + int64 accumulator, no drift.

Also pre-promotes J and spins to int64 once outside the loop so each
per-step row dot product runs without per-call type coercion.
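A small illustration of why the one-time promotion matters (not the PR's code; just numpy's dtype rules): an int8 @ int8 dot accumulates in int8 and silently wraps, so the per-step row dot must run in a wider dtype, and doing the `astype` once outside the loop avoids paying that coercion on every call.

```python
import numpy as np

J_row = np.full(1024, 3, dtype=np.int8)   # one row of an int8 coupling matrix
s = np.ones(1024, dtype=np.int8)          # +/-1 spins, all +1 here

wrapped = J_row @ s                        # int8 accumulator: wraps, cannot hold 3072
exact = J_row.astype(np.int64) @ s.astype(np.int64)  # exact: 3 * 1024 = 3072

print(wrapped.dtype, int(exact))           # int8 3072
```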

Measured on the same instance + seeds as FINDINGS 6 (random_spin_glass
N=1024 d=0.4 s=42, 4 seeds × 3200 steps):

  before (FINDINGS 6): 43.3 ms/step
  after (this commit):  3.14 ms/step
  speedup:              13.8×

Quality unchanged (e_best mean −14749 both runs). The previous
"C/Cython rewrite of ARM bookkeeping" diagnosis in FINDINGS 6 was
wrong — the bottleneck was an algorithmic O(N²) bug in our own code,
not Python/numpy overhead.

Host-vs-Coral gap closes from 133× to 9.4×. Coral is now fast enough
that running real experiments on it is economical, which un-blocks
the planned Wishart truth-bench (separate work).

Reproducible: tools/coral_anneal_bench.py, data in
docs/scaling/coral_colocated_anneal_v2.jsonl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wock9000 wock9000 merged commit 793e319 into trunk May 14, 2026
1 check passed
@wock9000 wock9000 deleted the coral-side-perf branch May 14, 2026 14:05
