7 changes: 7 additions & 0 deletions .gitignore
@@ -21,3 +21,10 @@ env/
# Env / OS
.env
.DS_Store

# Local dev notes / scratch — explainers, plans, idea logs, experiment journals.
.scratch/

# Internal slot-claim metadata written by claim_slot.sh for cross-session
# coordination (session id + heartbeat). Not for upstream.
submissions/*/.CLAIMED
54 changes: 54 additions & 0 deletions MAINTAINING.md
@@ -0,0 +1,54 @@
# Maintaining the leaderboard

Notes for whoever has push access to `cybertronai/wikitext`.

## Branching

- **`main`** — stable. Every row of `README.md`'s Record History was scored
under the same setup.
- **`dev`** — staging. Feature PRs (new submissions, new paradigms, harness
tweaks) target `dev` and merge as soon as review is green.
- **`dev` → `main`** promotion PRs happen on a slower cadence, only when
`dev` is internally consistent (see re-run rule below).

## The setup-change re-run rule

If a PR changes anything that can move where existing submissions land on
the leaderboard, the **prior leaderboard rows in `README.md` must be re-run
on the new setup before that PR merges to `main`**. Otherwise the
half-old/half-new comparison is meaningless.

| Change | Triggers re-run? |
|---|---|
| `EnergyMeter` semantics, idle-baseline default, scoring formula | **Yes** |
| Hardware pin (PCIe ↔ SXM4, A100 ↔ H100) | **Yes** |
| `MAX_TRAIN_SECONDS`, `ACC_MIN`, eval window | **Yes** |
| Container-image bump with numerical drift | **Maybe** — re-run if anything visibly drifts |
| New submission, doc/typo, `.scratch/`, internal refactor | No |
| Additive optional field on `result.json` (existing semantics intact) | No — but new field is `null` on old entries; mention in PR |

When in doubt, re-run. ~$0.50/submission on Modal A100 is cheaper than a
broken leaderboard.

## Process

1. Land the setup change on a branch (typically targeting `dev`); don't merge yet.
2. Re-run the rows currently in `README.md`'s Record History on the new
harness — `python submit.py submissions/<slot> --yes`, fire in parallel
(Modal cap: 10 concurrent).
3. When `result.json` files all reflect the new setup, append the re-run
rows to `README.md` (old rows stay as history) and add a dated banner
above the table noting the schema change.
4. Restate the leaderboard table in the promotion PR body, confirming all
rows shown are under the new setup. Then merge.

Don't: ship a half-new/half-old table; claim a new leader without re-running
the priors; silently overwrite old `result.json` files without a banner in
`README.md`.
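
Step 2 of the process above (re-run every Record History row, at most 10 at a time) could be scripted roughly as follows. This is a minimal sketch, not part of the PR: `build_command`, `run_slot`, and `rerun_all` are hypothetical helper names, and it assumes the script is run from the repository root so that `submit.py` resolves.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

# Concurrency cap quoted in the Process notes ("Modal cap: 10 concurrent").
MODAL_CONCURRENCY_CAP = 10


def build_command(slot: str) -> List[str]:
    """Command line for one re-run, mirroring `python submit.py submissions/<slot> --yes`."""
    return [sys.executable, "submit.py", f"submissions/{slot}", "--yes"]


def run_slot(slot: str) -> int:
    """Run one submission and return its exit code (hypothetical helper)."""
    return subprocess.run(build_command(slot), check=False).returncode


def rerun_all(
    slots: Iterable[str],
    runner: Callable[[str], int] = run_slot,
    max_workers: int = MODAL_CONCURRENCY_CAP,
) -> Dict[str, int]:
    """Re-run every slot with at most `max_workers` in flight; map slot -> exit code."""
    slot_list = list(slots)
    if not slot_list:
        return {}
    workers = min(max_workers, len(slot_list))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = list(pool.map(runner, slot_list))
    return dict(zip(slot_list, codes))


# Example call (left commented out so importing this file launches nothing):
# failed = {s: c for s, c in rerun_all(["lwta_k2", "lwta_k4"]).items() if c != 0}
```

Threads are enough here because each worker only waits on a subprocess. A non-zero exit code for any slot means the table is not yet consistent, so step 3 should wait.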

## Reference: setup-change events

| Date | Change | PR | Re-ran upstream? |
|---|---|---|---|
| 2026-05-18 | Hardware pin: SXM4 → PCIe A100-80GB | (n/a) | partial — older SXM4 rows kept as history |
| 2026-05-19 | `EnergyMeter` gains `cpu_energy_J` + `total_energy_J` via CodeCarbon | #4 | yes — `lwta_k2`, `lwta_k4`, `modded_nanogpt` re-run |
17 changes: 17 additions & 0 deletions README.md
@@ -23,6 +23,8 @@ python submit.py submissions/modded_nanogpt

## Record History

The `Energy (J)` column reports **`total_energy_J`** (GPU NVML net of idle baseline + CodeCarbon CPU estimate, floored at `duration_s × 50 W`) for rows dated **2026-05-20 and later**. Earlier rows report the prior NVML-only `training_energy_J`. The semantic change is the new total-system-energy rule per @yaroslavvb2's Telegram note; see `MAINTAINING.md` and the `EnergyMeter` source for details. Upstream-leaderboard rows from before the change have been re-run under the new harness — those re-runs appear below as the canonical entries for those submissions; the original rows are preserved for history.

| Date | Energy (J) | Val char-acc | GPU | Config | Submission | Contributor |
|------|-----------:|-------------:|-----|--------|------------|-------------|
| 2026-05-12 | 51,704 | 0.7374 | A100 80GB PCIe | modded_nanogpt | [dir](submissions/modded_nanogpt) | @KellerJordan |
@@ -33,6 +35,21 @@ python submit.py submissions/modded_nanogpt
| 2026-05-18 | 3,612 | DQ | A100 80GB PCIe | chunker_d1 | [dir](research/catalog/new_directions/chunker_d1) | @ab-10 |
| 2026-05-18 | 735 | DQ | A100 80GB PCIe | ppm_c | [dir](research/catalog/new_directions/ppm_c) | @ab-10 |
| 2026-05-17 | 70 | DQ | A100 80GB SXM4 | P2-A_random_projection | [dir](research/forward-forward-deep/runs/phase2/P2-A_random_projection) | @ab-10 |
| 2026-05-19 | 60,864 | DQ | A100 80GB PCIe | mamba_byte | [dir](submissions/mamba_byte) | @gabrielnan |
| 2026-05-20 | 1,752 | DQ | A100 80GB SXM4 | gpu_ngram_w31_k10 | [dir](submissions/gpu_ngram_w31_k10) | @gabrielnan |
| 2026-05-20 | 13,936 | DQ | A100 80GB SXM4 | chunker_phase1_v2 | [dir](submissions/chunker_phase1_v2) | @gabrielnan |
| 2026-05-20 | 24,417 | DQ | A100 80GB SXM4 | bpe_internal_nn_v2 | [dir](submissions/bpe_internal_nn_v2) | @gabrielnan |
| 2026-05-20 | 53,683 | 0.7246 | A100 80GB PCIe | lwta_k4 | [dir](submissions/lwta_k4) | @ab-10 (re-run on new harness; total_J = 44,329 gpu + 9,354 cpu) |
| 2026-05-20 | 54,614 | 0.7145 | A100 80GB PCIe | lwta_k2 | [dir](submissions/lwta_k2) | @ab-10 (re-run on new harness; total_J = 44,583 gpu + 10,031 cpu) |
| 2026-05-21 | 2,474 | 0.7031 | A100 80GB PCIe | subset_70_mkn | [dir](submissions/subset_70_mkn) | @gabrielnan |
| 2026-05-21 | 3,092 | 0.7050 | A100 80GB PCIe | gpu_ngram_w31_k11 | [dir](submissions/gpu_ngram_w31_k11) | @gabrielnan |
| 2026-05-21 | 4,607 | 0.7047 | A100 80GB PCIe | paq_mixer_v3 | [dir](submissions/paq_mixer_v3) | @gabrielnan |
| 2026-05-21 | 8,602 | 0.7184 | A100 80GB PCIe | gpu_ngram_o14_xorfix | [dir](submissions/gpu_ngram_o14_xorfix) | @gabrielnan |
| 2026-05-21 | 9,591 | 0.7063 | A100 80GB PCIe | chunker_phase1_v1 | [dir](submissions/chunker_phase1_v1) | @gabrielnan |
| 2026-05-21 | 14,578 | 0.7184 | A100 80GB PCIe | deep_backoff_kn | [dir](submissions/deep_backoff_kn) | @gabrielnan |
| 2026-05-21 | 19,922 | 0.7328 | A100 80GB SXM4 | lwta_k4_alpha_065 | [dir](submissions/lwta_k4_alpha_065) | @gabrielnan |
| 2026-05-21 | 20,743 | 0.7390 | A100 80GB SXM4 | alpha_06 | [dir](submissions/alpha_06) | @gabrielnan |
| 2026-05-21 | 62,006 | 0.7337 | A100 80GB SXM4 | modded_nanogpt | [dir](submissions/modded_nanogpt) | @ab-10 |
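
The `Energy (J)` note in this README hunk describes `total_energy_J` as the GPU figure plus the CPU estimate, with a floor of `duration_s × 50 W`. The sketch below shows one reading of that rule. It assumes the floor applies to the sum, which the note does not spell out, and `total_energy_j` is a hypothetical name rather than the `EnergyMeter` API. The 600-second duration is made up for illustration; the GPU and CPU figures come from the `lwta_k4` re-run row.

```python
# Floor power quoted in the README note ("floored at duration_s × 50 W").
FLOOR_WATTS = 50.0


def total_energy_j(
    gpu_net_j: float,
    cpu_j: float,
    duration_s: float,
    floor_watts: float = FLOOR_WATTS,
) -> float:
    """Sum GPU (net of idle baseline) and CPU energy, then apply the floor.

    Assumption: the floor is applied to the sum, not to either term alone.
    """
    floor_j = duration_s * floor_watts
    return max(gpu_net_j + cpu_j, floor_j)


# lwta_k4 re-run row: 44,329 J GPU + 9,354 J CPU; 600 s is a hypothetical duration.
print(f"{total_energy_j(44_329.0, 9_354.0, 600.0):,.0f}")  # → 53,683

# A very short, very cheap run is lifted to the floor: 2 s × 50 W = 100 J.
print(f"{total_energy_j(10.0, 5.0, 2.0):,.0f}")  # → 100
```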


## Rules
6 changes: 6 additions & 0 deletions requirements.txt
@@ -7,3 +7,9 @@ modal>=0.66
# Optional: tests run with stdlib if pytest is missing, but `pytest
# test_wikitext.py` gives nicer output.
pytest
# CodeCarbon: CPU energy estimation backend for EnergyMeter's
# total_energy_J field. EnergyMeter reads ``tracker._total_cpu_energy``
# after stop, which is internal to CodeCarbon; ``~=3.2`` allows >=3.2,
# <4.0 to keep that path stable. Required on the leaderboard (raises if
# NVML is available and this isn't); optional on dev boxes without a GPU.
codecarbon~=3.2
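
Because the comment above notes that `_total_cpu_energy` is a private CodeCarbon attribute, a defensive read may be worth considering. The sketch below is not the `EnergyMeter` code: `read_cpu_energy_j` is a hypothetical helper, and it assumes the attribute exposes its value as `.kWh`, which should be checked against the pinned CodeCarbon release. A stub object stands in for the tracker so the example runs without CodeCarbon installed.

```python
from types import SimpleNamespace
from typing import Any, Optional

JOULES_PER_KWH = 3.6e6


def read_cpu_energy_j(tracker: Any) -> Optional[float]:
    """Return CPU energy in joules, or None if the private path is missing.

    Assumption: `tracker._total_cpu_energy` carries a `.kWh` attribute.
    """
    energy = getattr(tracker, "_total_cpu_energy", None)
    kwh = getattr(energy, "kWh", None)
    if kwh is None:
        return None
    return float(kwh) * JOULES_PER_KWH


# Stub standing in for a stopped tracker (not a real CodeCarbon object).
stub = SimpleNamespace(_total_cpu_energy=SimpleNamespace(kWh=0.5))
print(read_cpu_energy_j(stub))      # → 1800000.0
print(read_cpu_energy_j(object()))  # → None
```

Returning `None` leaves the policy to the caller: the leaderboard path can raise, while a dev box without a GPU can carry on.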
26 changes: 20 additions & 6 deletions run_eval.py
@@ -119,7 +119,7 @@ def main() -> None:
if m is not None:
print(f"training duration : {m.duration_s:.1f}s")
if m.energy_joules is not None:
-print(f"training energy (J): {m.energy_joules:,.1f} (at kill)")
+print(f"training energy (J): {_fmt_training_energy(m)} (at kill)")
if args.results_json is not None:
payload = {
"submission": submission_name,
@@ -128,6 +128,8 @@ def main() -> None:
"max_train_seconds": args.max_train_seconds,
"training_energy_J": m.energy_joules if m is not None else None,
"training_duration_s": m.duration_s if m is not None else None,
+"cpu_energy_J": m.cpu_energy_J if m is not None else None,
+"total_energy_J": m.total_energy_J if m is not None else None,
"gpu_name": _gpu_name(),
"date_utc": _utc_now(),
}
@@ -153,7 +155,7 @@ def main() -> None:
f"below floor {args.acc_min:.4f}")
print(f"submission : {submission_name}")
if m.energy_joules is not None:
-print(f"training energy (J): {m.energy_joules:,.1f}")
+print(f"training energy (J): {_fmt_training_energy(m)}")
print(f"training duration : {m.duration_s:.1f}s")
if args.results_json is not None:
payload = {
@@ -165,6 +167,8 @@ def main() -> None:
"val_chars": val_result.n_chars,
"training_energy_J": m.energy_joules,
"training_duration_s": m.duration_s,
+"cpu_energy_J": m.cpu_energy_J,
+"total_energy_J": m.total_energy_J,
"gpu_name": _gpu_name(),
"date_utc": _utc_now(),
}
@@ -174,10 +178,7 @@ def main() -> None:

print("---")
print(f"submission : {submission_name}")
-if m.energy_joules is not None:
-    print(f"training energy (J): {m.energy_joules:,.1f}")
-else:
-    print("training energy (J): NOT MEASURED")
+print(f"training energy (J): {_fmt_training_energy(m)}")
print(f"training duration : {m.duration_s:.1f}s")
print(f"val char-accuracy : {val_result.accuracy:.4f}")
print(f"val chars : {val_result.n_chars:,}")
@@ -187,6 +188,8 @@ def main() -> None:
"submission": submission_name,
"training_energy_J": m.energy_joules,
"training_duration_s": m.duration_s,
+"cpu_energy_J": m.cpu_energy_J,
+"total_energy_J": m.total_energy_J,
"val_char_accuracy": val_result.accuracy,
"val_chars": val_result.n_chars,
"gpu_name": _gpu_name(),
@@ -211,5 +214,16 @@ def _utc_now() -> str:
.replace(microsecond=0).isoformat().replace("+00:00", "Z"))


+def _fmt_training_energy(m) -> str:
+    if (m.total_energy_J is not None
+            and m.energy_joules is not None
+            and m.cpu_energy_J is not None):
+        return (f"{m.total_energy_J:,.1f} "
+                f"({m.energy_joules:,.1f} GPU + {m.cpu_energy_J:,.1f} CPU)")
+    if m.energy_joules is not None:
+        return f"{m.energy_joules:,.1f}"
+    return "NOT MEASURED"


if __name__ == "__main__":
main()