tomat runs index — overview + live runs table

Living overview of the project's training runs, results, infra status, and
pointers to deep-dives. Re-edited as numbers land. The repo-side authoritative
snapshot is [`OVERVIEW.md`][overview]; this issue is the nav / at-a-glance page.

> **Maintenance.** The runs table is regenerated via [`./tomat runs links`][tomat-cli]
> (Slack-paste-ready markdown). Static marker comments around the table below
> let future re-runs splice in updated tables without disturbing the rest of
> this issue.

## Current best (200M, v3, M=64)

`cont7k-ext`: val **NMAE 1.73 % @ step 21000** / **NEMD 1.76 % @ step 20000**.

- Down from 2.02 % / 2.14 % last week.
- Constant-LR continuation past the Chinchilla-optimal data budget (~5–6×
  past it) still pays off; the trajectory is non-monotone but trending lower.
- Currently re-resumed (after the 5th survived preemption) and targeting
  step ≈ 38 k (~2 epochs through the unique data). The last 10 k steps showed
  a 14 % relative drop in val NMAE.
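
For reference, the week-over-week improvement can be expressed as a relative change. The sketch below assumes the 2.02 / 2.14 figures are last week's val NMAE / NEMD in the same order as the headline numbers; `relative_change` is a helper named here for illustration only.

```python
def relative_change(old, new):
    """Signed relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Val metrics for cont7k-ext in percent: last week vs. current best.
nmae = relative_change(2.02, 1.73)
nemd = relative_change(2.14, 1.76)

print(f"val NMAE: {nmae:+.1%}")  # -14.4%
print(f"val NEMD: {nemd:+.1%}")  # -17.8%
```

This is a separate measurement from the 14 % relative drop quoted for the last 10 k steps, which compares points along the training trajectory rather than weekly snapshots.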

## Live runs table


| run | state | step | runtime | MFU | val NMAE | val NEMD | wandb | iris |
|---|---|---|---|---|---|---|---|---|
| `train-full-v3-200M-bs128-emd-do-500-v5p16-shuf1k-smoke` | running | - | 0.0h | - | - | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-500-v5p16-shuf1k-smoke) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-500-v5p16-shuf1k-smoke) |
| `train-full-v3-1B-bs512-emd-do-13k-tpu32-shuf1k` | crashed | 0 | 0.1h | - | - | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-1B-bs512-emd-do-13k-tpu32-shuf1k) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-1B-bs512-emd-do-13k-tpu32-shuf1k) |
| `train-full-v3-1B-bs256-emd-do-26k-tpu32-shuf1k` | crashed | 4 | 4.4h | - | - | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-1B-bs256-emd-do-26k-tpu32-shuf1k) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-1B-bs256-emd-do-26k-tpu32-shuf1k) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k-ext` | crashed | 26098 | 11.7h | 8.7% | 1.73% @ 22000 | 1.80% @ 22000 | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k-ext) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k-ext) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont6kwsd` | finished | 8002 | 2.5h | 8.6% | 2.88% @ 7999 | 3.63% @ 7999 | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont6kwsd) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont6kwsd) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont8kwsd` | finished | 10004 | 2.4h | 8.7% | 2.69% @ 9999 | 3.19% @ 9999 | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont8kwsd) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont8kwsd) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k` | finished | 8000 | 0.8h | 8.6% | 2.27% @ 7999 | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k` | finished | 8009 | 7.1h | 8.7% | 2.22% @ 7000 | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k) |
| `train-full-v3-200M-bs128-emd-do-8k-tpu16` | finished | 7999 | 5.9h | 8.7% | - | - | [wb](https://wandb.ai/PrinceOA/tomat-lmq-P19/runs/train-full-v3-200M-bs128-emd-do-8k-tpu16) | [iris](https://iris.oa.dev/#/job/%2Fryan%2Ftrain-full-v3-200M-bs128-emd-do-8k-tpu16) |


`./tomat runs links` produces this table with one row per wandb run and direct
links to the [wandb project][wandb-project] + [iris dashboard][iris-dashboard].
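
As a rough illustration of what regenerating the table involves, here is a minimal sketch that builds one row and splices a table between two markers. It is not the actual `./tomat runs links` implementation: the function names and the marker strings are placeholders, and the URL patterns (including the `ryan` job-path prefix) are inferred from the rows above.

```python
from urllib.parse import quote

WANDB_PROJECT = "https://wandb.ai/PrinceOA/tomat-lmq-P19"
IRIS_DASHBOARD = "https://iris.oa.dev"
IRIS_USER = "ryan"  # job-path prefix seen in the iris links above

HEADER = (
    "| run | state | step | runtime | MFU | val NMAE | val NEMD | wandb | iris |\n"
    "|---|---|---|---|---|---|---|---|---|"
)

# Placeholder marker names: the real markers are not shown in this issue.
BEGIN = "<!-- runs-table:begin -->"
END = "<!-- runs-table:end -->"


def table_row(run, state, step="-", runtime="-", mfu="-", nmae="-", nemd="-"):
    """Render one markdown row with wandb and iris links for `run`."""
    wb = f"{WANDB_PROJECT}/runs/{run}"
    # The iris job path is percent-encoded in full, slashes included.
    job_path = quote(f"/{IRIS_USER}/{run}", safe="")
    iris = f"{IRIS_DASHBOARD}/#/job/{job_path}"
    cells = [f"`{run}`", state, str(step), runtime, mfu, nmae, nemd,
             f"[wb]({wb})", f"[iris]({iris})"]
    return "| " + " | ".join(cells) + " |"


def splice_table(body, rows):
    """Replace whatever sits between BEGIN and END with a fresh table."""
    start = body.index(BEGIN) + len(BEGIN)
    stop = body.index(END, start)
    table = "\n".join([HEADER, *rows])
    return f"{body[:start]}\n{table}\n{body[stop:]}"
```

Calling `table_row("train-full-v3-200M-bs128-emd-do-8k-tpu16", "finished", 7999, "5.9h", "8.7%")` reproduces the last row of the table above. Text outside the markers is left untouched by `splice_table`, which is the property that makes repeated regeneration safe.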

## Infra status (week of 2026-05-11)

The Marin cluster has been rough for large-slice training:

- **v6e-16**: healthy; `cont7k-ext` keeps making progress.
- **v6e-32**: from-scratch 1B runs crashloop in the Zephyr cache-build
  worker pool; even with shared cache, 1B-BS=256 silently stalled
  post-JAX-init (no log output for >1 h, no ckpts).
- **v5p-16 / v5p-32**: both hang after JAX coordination init
  (`/tensorflow.CoordinationService/RegisterTask` deadline-exceeded).
  Same signature across model sizes — not workload-dependent.
- **v4**: avoid this week (chip-config errors reported by Tim).
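
The silent-stall signature described above (no log output for more than an hour and no checkpoints) is simple to check mechanically. The following is a hypothetical watchdog sketch, not part of tomat or iris; how the timestamps are obtained is left out, and the one-hour threshold is just the figure quoted in the list.

```python
from datetime import datetime, timedelta

STALL_AFTER = timedelta(hours=1)  # threshold taken from the ">1 h" stall above


def is_stalled(last_log, last_ckpt, now, threshold=STALL_AFTER):
    """True when neither a log line nor a checkpoint appeared within threshold.

    last_ckpt may be None for a run that has never written a checkpoint.
    """
    latest = last_log if last_ckpt is None else max(last_log, last_ckpt)
    return now - latest > threshold


now = datetime(2026, 5, 11, 12, 0)
print(is_stalled(datetime(2026, 5, 11, 10, 30), None, now))  # True
print(is_stalled(datetime(2026, 5, 11, 11, 45), None, now))  # False
```

Treating a recent checkpoint as a sign of life avoids flagging runs that log rarely but are still saving state.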

Mitigations landed on the tomat side:
- [`tomat train --share-cache`][share-cache]: uses a single `cache_dir` per data
  label, so re-launches don't pay the cache-build cost again. Already
  seeded for `train-full-v3` (47 GiB, copied in-region from `shuf1k`).
- `tomat train -r` (resume): skips cache-build entirely.

See [#4][mfu] for the MFU side of training-perf work.

## Sub-trackers

- [#1][i1] v3 patch tokenizer — **closed**; v3 is the current default data.
- [#2][i2] MPDB n_atoms/n_electrons + R2 publish — **open**; data shipped
  to R2 but the elvis-side [`/mp` page][elvis-mp] (the visible consumer)
  isn't deployed yet. Keeping #2 open until that lands.
- [#3][i3] Sampling-weight study — open; per-mat vs per-electron NMAE
  comparison not yet run.
- [#4][mfu] MFU on tomat training — open; ongoing tracker for bottleneck
  profiling and experiments. See the [v5p comment][i4-v5p] for this week's
  attempts (v5p hangs at JAX init across slice sizes, so the v5p vs v6e MFU
  comparison is still pending a healthy cluster).

## Tooling

- [`./tomat runs links`][tomat-cli] — regenerate slack-paste-ready runs table.
- `./tomat runs nmae <substr>` — full NMAE+NEMD curve for one run.
- `./tomat iris ls -f <substr>` — iris-side job state.
- `./tomat train [-r] [--share-cache] LABEL …` — launch / resume.
- [`scripts/backfill_eval_to_wandb.py`][backfill] — push per-mat eval JSONs from
  GCS to wandb after re-evals.
- [`scripts/plot_nmae_nemd_trajectory.py`][plot-traj] — regenerate trajectory plots.
- [`OVERVIEW.md`][overview] — full repo-side snapshot.

[overview]: https://github.com/Open-Athena/tomat/blob/main/OVERVIEW.md
[tomat-cli]: https://github.com/Open-Athena/tomat/blob/main/tomat
[share-cache]: https://github.com/Open-Athena/tomat/blob/main/marin/train_tomat_tpu.py
[backfill]: https://github.com/Open-Athena/tomat/blob/main/scripts/backfill_eval_to_wandb.py
[plot-traj]: https://github.com/Open-Athena/tomat/blob/main/scripts/plot_nmae_nemd_trajectory.py
[wandb-project]: https://wandb.ai/PrinceOA/tomat-lmq-P19
[iris-dashboard]: https://iris.oa.dev/
[i1]: https://github.com/Open-Athena/tomat/issues/1
[i2]: https://github.com/Open-Athena/tomat/issues/2
[i3]: https://github.com/Open-Athena/tomat/issues/3
[mfu]: https://github.com/Open-Athena/tomat/issues/4
[i4-v5p]: https://github.com/Open-Athena/tomat/issues/4#issuecomment-4423248328
[elvis-mp]: https://github.com/Open-Athena/elvis/blob/main/specs/mp-page.md

[Open-Athena/tomat#5]: https://github.com/Open-Athena/tomat/issues/5
