coral_anneal: 13.8× per-step speedup via incremental energy tracking #50
Merged
The old code called exact_energy(spins, J) on every accepted move, which is O(N²) per accept (~10–20 ms at N=1024 on the Coral ARM). Replaced it with an incremental e_cur += dE_true. This is mathematically identical: integer dE is exact with int8 J and an int64 accumulator, so there is no drift. Also pre-promotes J and spins to int64 once, outside the loop, so each per-step row dot product runs without per-call type coercion.

Measured on the same instance + seeds as FINDINGS 6 (random_spin_glass N=1024 d=0.4 s=42, 4 seeds × 3200 steps):

before (FINDINGS 6): 43.3 ms/step
after (this commit): 3.14 ms/step
speedup: 13.8×

Quality unchanged (e_best mean −14749 in both runs).

The "C/Cython rewrite of ARM bookkeeping" diagnosis in FINDINGS 6 was wrong: the bottleneck was an algorithmic O(N²) bug in our own code, not Python/numpy overhead.

The host-vs-Coral gap closes from 133× to 9.4×. Coral is now fast enough that running real experiments on it is economical, which unblocks the planned Wishart truth-bench (separate work).

Reproducible: tools/coral_anneal_bench.py, data in docs/scaling/coral_colocated_anneal_v2.jsonl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
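The no-drift claim can be checked with a short derivation. It assumes the energy convention E = -½ sᵀJs with J symmetric and zero on the diagonal, and single-spin-flip moves; both are assumptions about `coral_anneal.py`, not read from it. The bounds use the int8 range, |J_ij| ≤ 128, at N = 1024.

```latex
E(s) = -\tfrac{1}{2}\, s^\top J s, \qquad s \in \{-1,+1\}^N,\quad J = J^\top,\quad J_{ii} = 0

% Flip spin i, i.e. s' = s - 2 s_i e_i:
\Delta E_i = E(s') - E(s) = 2\, s_i \sum_{k \ne i} J_{ik} s_k = 2\, s_i\, (J s)_i

% s^T J s = 2 \sum_{j<k} J_{jk} s_j s_k is even, so E and every \Delta E_i are integers.
|(J s)_i| \le 128\,(N-1) = 130{,}944
|\Delta E_i| \le 2 \cdot 128\,(N-1) = 261{,}888
|E| \le \tfrac{1}{2} \cdot 128 \cdot N(N-1) = 67{,}043{,}328
```

Every increment is an exact integer and both magnitudes sit far below 2³¹, so an int64 accumulator has ample headroom and the tracked energy cannot drift from a full recompute. The row dot product (Js)_i alone already exceeds the int8 range, so it has to be accumulated in a wider integer type such as int64.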
Summary
The core algorithmic fix in `coral_anneal.py` is one line. The previous `run_mtm_anneal` called `exact_energy(spins, J)` on every accepted Metropolis move (an O(N²) full recompute of `-½sᵀJs`), then took the delta. It now updates incrementally with `e_cur += dE_true`. Since `dE_true` is already computed exactly in int64 arithmetic on each accept, the result is bit-equivalent without the recompute.
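A minimal sketch of the incremental pattern, assuming a symmetric int8 `J` with zero diagonal and plain single-spin-flip Metropolis. The real `run_mtm_anneal` uses multiple-try Metropolis, which this does not reproduce. `full_energy`, `random_spin_glass_int8` and `anneal_incremental` are hypothetical stand-ins, not the code in `coral_anneal.py`, and the instance generator is an illustrative guess rather than the repo's `random_spin_glass`.

```python
import math

import numpy as np


def full_energy(s64: np.ndarray, J64: np.ndarray) -> int:
    """O(N^2) recompute of E = -1/2 * s^T J s.

    Exact integer when J is symmetric with zero diagonal (s^T J s is even).
    """
    return int(-(s64 @ J64 @ s64) // 2)


def random_spin_glass_int8(n: int, density: float, rng: np.random.Generator) -> np.ndarray:
    """Hypothetical instance generator: symmetric int8 couplings in {-1, 0, +1}, zero diagonal."""
    mask = np.triu(rng.random((n, n)) < density, k=1)
    signs = rng.choice(np.array([-1, 1], dtype=np.int8), size=(n, n))
    upper = np.where(mask, signs, 0).astype(np.int8)
    return (upper + upper.T).astype(np.int8)


def anneal_incremental(J: np.ndarray, steps: int, beta0: float, beta1: float, seed: int):
    """Single-flip Metropolis anneal that tracks energy incrementally (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n = J.shape[0]
    # Promote once, outside the loop, so every per-step row dot product runs in int64.
    J64 = J.astype(np.int64)
    s64 = rng.choice(np.array([-1, 1], dtype=np.int64), size=n)
    e_cur = full_energy(s64, J64)  # the only O(N^2) evaluation
    e_best = e_cur
    for t in range(steps):
        # Linear inverse-temperature schedule from beta0 to beta1.
        beta = beta0 + (beta1 - beta0) * t / max(steps - 1, 1)
        i = int(rng.integers(n))
        # Energy change of flipping spin i: 2 * s_i * (J s)_i, one O(N) row dot product.
        dE_true = int(2 * s64[i] * (J64[i] @ s64))
        if dE_true <= 0 or rng.random() < math.exp(-beta * dE_true):
            s64[i] = -s64[i]
            # O(1) exact update replaces the O(N^2) recompute.
            # Python ints stand in for the int64 accumulator described in the PR.
            e_cur += dE_true
            e_best = min(e_best, e_cur)
    return s64, e_cur, e_best


if __name__ == "__main__":
    J = random_spin_glass_int8(256, 0.4, np.random.default_rng(42))
    spins, e_cur, e_best = anneal_incremental(J, steps=3200, beta0=0.1, beta1=3.0, seed=0)
    # The tracked energy should equal a from-scratch recompute exactly (no drift).
    assert e_cur == full_energy(spins, J.astype(np.int64))
    print("e_cur", e_cur, "e_best", e_best)
```

Each accepted move now costs one O(N) row dot product plus an O(1) add, and the O(N²) evaluation happens once before the loop. Comparing the tracked `e_cur` against a full recompute at the end, as in the `__main__` block, is a cheap way to check a bit-equivalence claim like this one on a small instance.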
This was the "Coral-side per-step bookkeeping" bottleneck from FINDINGS entry 6. The remedy proposed there (a C/Cython rewrite of softmax) rested on the wrong diagnosis: the cost came from an algorithmic bug in our own code that masqueraded as Python/numpy overhead.
Result (same instance + seeds as FINDINGS 6: random_spin_glass N=1024, d=0.4, s=42; 4 seeds × 3200 steps)
13.8× per-step speedup. Cumulative improvement from the original V3-over-network number: 22×. Quality unchanged.
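For scale, the per-step figures reported in the commit message (43.3 ms/step before, 3.14 ms/step after) work out as follows for a single 3200-step seed. This is straight arithmetic on those two numbers, not a separate measurement, and it assumes one seed's steps run back to back.

```latex
\frac{43.3\ \text{ms}}{3.14\ \text{ms}} \approx 13.79 \approx 13.8\times

3200 \times 43.3\ \text{ms} = 138.56\ \text{s}
\qquad\longrightarrow\qquad
3200 \times 3.14\ \text{ms} = 10.048\ \text{s}
```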
What's NOT in this PR
Test plan
🤖 Generated with Claude Code