Living overview of the project's training runs, results, infra status, and
pointers to deep-dives. Re-edited as numbers land. The repo-side authoritative
snapshot is OVERVIEW.md; this issue is the nav / at-a-glance page.
Maintenance. Runs table regenerated via ./tomat runs links
(slack-paste-ready markdown). The static markers <!-- BEGIN_RUNS_TABLE --> / <!-- END_RUNS_TABLE --> below let future
re-runs splice in updated tables without disturbing the rest of this issue.
./tomat runs links produces this table with one row per wandb run and direct
links to the wandb project + iris dashboard.
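The marker-based splice described above can be sketched as follows. This is a minimal illustration, not the tomat implementation: the function name `splice_runs_table` and its arguments are hypothetical, and only the two marker strings come from this issue.

```python
BEGIN = "<!-- BEGIN_RUNS_TABLE -->"
END = "<!-- END_RUNS_TABLE -->"


def splice_runs_table(issue_body: str, new_table: str) -> str:
    """Replace everything between the two markers with new_table.

    Hypothetical helper: text outside the markers is returned untouched,
    and the markers themselves are preserved for the next re-run.
    """
    start = issue_body.find(BEGIN)
    end = issue_body.find(END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("runs-table markers missing or out of order")
    head = issue_body[: start + len(BEGIN)]  # up to and including BEGIN
    tail = issue_body[end:]                  # END marker and everything after
    return f"{head}\n{new_table.strip()}\n{tail}"


# Example with placeholder text (not real issue content).
body = "intro\n<!-- BEGIN_RUNS_TABLE -->\nold table\n<!-- END_RUNS_TABLE -->\noutro"
updated = splice_runs_table(body, "| run | val NMAE |")
```

Because the markers survive each splice, running it again with the same table leaves the body unchanged.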
Infra status (week of 2026-05-11)
The Marin cluster has been rough for large-slice training:
v6e-16: healthy; cont7k-ext keeps making progress.
v6e-32: from-scratch 1B runs crashloop in the Zephyr cache-build
worker pool; even with shared cache, 1B-BS=256 silently stalled
post-JAX-init (no log output for >1 h, no ckpts).
v5p-16 / v5p-32: both hang after JAX coordination init
(/tensorflow.CoordinationService/RegisterTask deadline-exceeded).
Same signature across model sizes — not workload-dependent.
v4: avoid this week (chip-config errors reported by Tim).
Mitigations landed on the tomat side:
tomat train --share-cache: single cache_dir per data
label, so re-launches don't pay the cache-build cost again. Already
seeded for train-full-v3 (47 GiB, copied in-region from shuf1k).
#3 Sampling-weight study — open; per-mat vs per-electron NMAE
comparison not yet run.
#4 MFU on tomat training — open; ongoing bottleneck-profile
experiments tracker. See the v5p comment for this week's
attempts (v5p hangs at JAX init across slice sizes, so the v5p vs v6e MFU
comparison is still pending a healthy cluster).
Current best (200M, v3, M=64)
cont7k-ext: val NMAE 1.73 % @ step 21000 / NEMD 1.76 % @ step 20000.
… Chinchilla-optimal data budget); trajectory non-monotone but trending lower.
… (~2 epochs through unique data). Last 10 k steps showed −14 % rel on val NMAE.
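For readers unfamiliar with the "rel" shorthand above, a relative change is the signed difference divided by the starting value. The sketch below uses purely hypothetical numbers, not this run's metrics, and the helper name `relative_change` is an assumption made for illustration.

```python
def relative_change(old: float, new: float) -> float:
    """Signed relative change from old to new, as a fraction of old."""
    if old == 0:
        raise ValueError("relative change is undefined for old == 0")
    return (new - old) / old


# Hypothetical values for illustration only; NOT taken from any run here.
start_nmae = 2.00
end_nmae = 1.50
print(f"{relative_change(start_nmae, end_nmae):+.0%}")
```

A negative result means the metric went down, which for an error metric such as NMAE is an improvement.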
Live runs table
train-full-v3-200M-bs128-emd-do-500-v5p16-shuf1k-smoke
train-full-v3-1B-bs512-emd-do-13k-tpu32-shuf1k
train-full-v3-1B-bs256-emd-do-26k-tpu32-shuf1k
train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k-ext
train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont6kwsd
train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont8kwsd
train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k
train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k
train-full-v3-200M-bs128-emd-do-8k-tpu16
A further tomat-side mitigation: tomat train -r (resume) skips cache-build entirely.
See #4 for the MFU side of training-perf work.
Sub-trackers
#2 MPDB: backfill n_atoms/n_electrons, publish to R2 (open): … to R2, but the
elvis-side /mp page (the visible consumer) isn't deployed yet. Keeping #2 open
until that lands.
Tooling
./tomat runs links: regenerate slack-paste-ready runs table.
./tomat runs nmae <substr>: full NMAE+NEMD curve for one run.
./tomat iris ls -f <substr>: iris-side job state.
./tomat train [-r] [--share-cache] LABEL …: launch / resume.
scripts/backfill_eval_to_wandb.py: push per-mat eval JSONs from GCS to wandb after re-evals.
scripts/plot_nmae_nemd_trajectory.py: regenerate trajectory plots.
OVERVIEW.md: full repo-side snapshot.