tomat runs index — overview + live runs table #5

@ryan-williams

Living overview of the project's training runs, results, infra status, and
pointers to deep-dives. Re-edited as numbers land. The repo-side authoritative
snapshot is OVERVIEW.md; this issue is the nav / at-a-glance page.

Maintenance. The runs table is regenerated via ./tomat runs links, which
emits slack-paste-ready markdown. The static markers
<!-- BEGIN_RUNS_TABLE --> / <!-- END_RUNS_TABLE --> below let future
re-runs splice in updated tables without disturbing the rest of this issue.
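
The marker-based splice can be sketched as follows. This is a minimal illustration, not the code behind ./tomat runs links: the function name splice_runs_table and the idea of passing the issue body in as a string are assumptions; only the two marker strings come from this issue.

```python
BEGIN = "<!-- BEGIN_RUNS_TABLE -->"
END = "<!-- END_RUNS_TABLE -->"


def splice_runs_table(issue_body: str, new_table: str) -> str:
    """Replace everything between the two markers with new_table.

    Hypothetical helper: text outside the markers is returned untouched.
    """
    start = issue_body.find(BEGIN)
    end = issue_body.find(END)
    if start == -1 or end == -1 or end < start:
        raise ValueError("runs-table markers missing or out of order")
    head = issue_body[: start + len(BEGIN)]  # up to and including BEGIN
    tail = issue_body[end:]                  # END marker and everything after
    return f"{head}\n{new_table.strip()}\n{tail}"
```

Because the markers themselves are preserved, running the splice again with the same table yields the same body, so repeated regeneration is safe.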

Current best (200M, v3, M=64)

cont7k-ext: val NMAE 1.73 % @ step 21000 / NEMD 1.76 % @ step 20000.

  • Down from 2.02 % / 2.14 % (NMAE / NEMD) last week.
  • Constant-LR continuation still pays off at ~5–6× the
    Chinchilla-optimal data budget; the trajectory is non-monotone but trending lower.
  • Currently re-resumed (fifth preemption survived) and targeting step ≈ 38 k
    (~2 epochs through unique data). The last 10 k steps showed a −14 %
    relative change in val NMAE.
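
For reference, the week-over-week relative changes implied by the figures above can be checked with a few lines of arithmetic. This assumes the "2.02 / 2.14" pair is NMAE / NEMD in that order; the −14 % quoted for the last 10 k steps may have been measured over a different window, so the numbers need not match exactly.

```python
def relative_change(old: float, new: float) -> float:
    """Signed relative change from old to new (negative means improvement)."""
    return (new - old) / old


# Values taken from this section: last week's best vs. the current best.
nmae = relative_change(2.02, 1.73)
nemd = relative_change(2.14, 1.76)

print(f"val NMAE: {nmae:+.1%}")  # val NMAE: -14.4%
print(f"val NEMD: {nemd:+.1%}")  # val NEMD: -17.8%
```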

Live runs table

| run | state | step | runtime | MFU | val NMAE | val NEMD | wandb | iris |
|---|---|---|---|---|---|---|---|---|
| train-full-v3-200M-bs128-emd-do-500-v5p16-shuf1k-smoke | running | - | 0.0h | - | - | - | wb | iris |
| train-full-v3-1B-bs512-emd-do-13k-tpu32-shuf1k | crashed | 0 | 0.1h | - | - | - | wb | iris |
| train-full-v3-1B-bs256-emd-do-26k-tpu32-shuf1k | crashed | 4 | 4.4h | - | - | - | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k-ext | crashed | 26098 | 11.7h | 8.7% | 1.73% @ 22000 | 1.80% @ 22000 | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont6kwsd | finished | 8002 | 2.5h | 8.6% | 2.88% @ 7999 | 3.63% @ 7999 | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont8kwsd | finished | 10004 | 2.4h | 8.7% | 2.69% @ 9999 | 3.19% @ 9999 | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k-cont7k | finished | 8000 | 0.8h | 8.6% | 2.27% @ 7999 | - | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16-shuf1k | finished | 8009 | 7.1h | 8.7% | 2.22% @ 7000 | - | wb | iris |
| train-full-v3-200M-bs128-emd-do-8k-tpu16 | finished | 7999 | 5.9h | 8.7% | - | - | wb | iris |

./tomat runs links produces this table with one row per wandb run and direct
links to the wandb project + iris dashboard.
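
The metric cells in the table use a "value% @ step" form, with "-" for missing values. A small sketch of how a reader might parse those cells and pick the lowest val NMAE is shown below. The names Metric and parse_metric are hypothetical, the run names are abbreviated to their distinguishing suffixes, and none of this is part of the tomat CLI.

```python
from typing import NamedTuple, Optional


class Metric(NamedTuple):
    percent: float  # metric value in percent
    step: int       # training step at which it was measured


def parse_metric(cell: str) -> Optional[Metric]:
    """Parse a cell such as '1.73% @ 22000'; '-' means no value."""
    cell = cell.strip()
    if cell == "-":
        return None
    value, _, step = cell.partition("@")
    return Metric(float(value.strip().rstrip("%")), int(step))


# val NMAE cells copied from the table (abbreviated run names are illustrative).
rows = {
    "cont7k-ext": "1.73% @ 22000",
    "cont6kwsd": "2.88% @ 7999",
    "cont8kwsd": "2.69% @ 9999",
    "cont7k": "2.27% @ 7999",
    "shuf1k": "2.22% @ 7000",
    "tpu16": "-",
}

parsed = {name: parse_metric(cell) for name, cell in rows.items()}
scored = {name: m for name, m in parsed.items() if m is not None}
best = min(scored, key=lambda name: scored[name].percent)
print(best, scored[best])
```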

Infra status (week of 2026-05-11)

The Marin cluster has been rough for large-slice training:

  • v6e-16: healthy; cont7k-ext keeps making progress.
  • v6e-32: from-scratch 1B runs crashloop in the Zephyr cache-build
    worker pool; even with shared cache, 1B-BS=256 silently stalled
    post-JAX-init (no log output for >1 h, no ckpts).
  • v5p-16 / v5p-32: both hang after JAX coordination init
    (/tensorflow.CoordinationService/RegisterTask deadline-exceeded).
    Same signature across model sizes — not workload-dependent.
  • v4: avoid this week (chip-config errors reported by Tim).
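
Silent stalls of the kind described for v6e-32 (no log output for more than an hour) can be caught with a simple staleness check on the job's log file. The sketch below is one possible approach under stated assumptions: the helper name is_stalled, the log path, and the one-hour threshold are all hypothetical and are not existing tomat or cluster tooling.

```python
import os
import time
from typing import Optional


def is_stalled(log_path: str,
               max_silence_s: float = 3600.0,
               now: Optional[float] = None) -> bool:
    """Return True if log_path has not been modified for max_silence_s seconds.

    `now` can be injected for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    return now - os.path.getmtime(log_path) > max_silence_s
```

A watcher could poll this every few minutes and flag or relaunch a job once it returns True, rather than waiting for someone to notice the missing checkpoints.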

Mitigations landed on the tomat side:

  • tomat train --share-cache: single cache_dir per data
    label, so re-launches don't pay the cache-build cost again. Already
    seeded for train-full-v3 (47 GiB, copied in-region from shuf1k).
  • tomat train -r (resume) skips cache-build entirely.

See #4 for the MFU side of training-perf work.

Sub-trackers

  • #1 v3 patch tokenizer — closed; v3 is the current default data.
  • #2 MPDB n_atoms/n_electrons + R2 publish — open; data shipped
    to R2 but the elvis-side /mp page (the visible consumer)
    isn't deployed yet. Keeping #2 open until that lands.
  • #3 Sampling-weight study — open; per-mat vs per-electron NMAE
    comparison not yet run.
  • #4 MFU on tomat training — open; ongoing bottleneck-profile
    + experiments tracker. See the v5p comment for this week's
    attempts (v5p hangs at JAX init across slice sizes, so v5p vs v6e MFU
    is still pending a healthy cluster).

Tooling
