
Add total-system-energy reporting (CodeCarbon CPU backend) #4

Open
gabrielnan wants to merge 2 commits into cybertronai:dev from gabrielnan:total-system-energy

@gabrielnan gabrielnan commented May 20, 2026

Summary

Adds CPU-energy accounting to EnergyMeter so result.json reports the total system energy (GPU + CPU), not just the NVML-only GPU number.

Motivation: raised by @yaroslavvb2 in Telegram on 2026-05-19:

Just total system energy, subject to time + accuracy constraint.
Not counting CPU utilization is a bit of leak, didn't expect it to be significant.

NVML is GPU-only — it cannot see what the host CPU is doing. A submission that runs its work on the host CPU (e.g., long np.unique passes during n-gram table construction) registers as nearly zero on NVML even though the host CPU is genuinely burning energy. This PR closes that leak by adding a CodeCarbon-backed CPU estimate alongside the existing NVML reading and surfacing both as separate fields in result.json.

Approach

  • CodeCarbon as the CPU backend. TDP-fallback mode — no MSR / RAPL / /dev/cpu/*/msr needed, since Modal containers cannot access them. CodeCarbon identifies the host CPU from /proc/cpuinfo, looks up its TDP from a bundled CSV (~2000 SKUs), and integrates psutil.cpu_percent() over the run.
  • Field standard for cloud-ML energy reporting. HuggingFace Trainer auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab, Zeus framework) reports GPU-only because the same container constraint applies; MLPerf Power requires physical wall meters and has no cloud submission path.
  • Floor protection. total_energy_J = max(gpu + cpu, duration_s × p_floor_watts), with a default p_floor_watts = 50 W, so CodeCarbon under-attribution can't shrink the reported total below a defensible lower bound (roughly the idle-power share of one GPU slot on a dual-EPYC-7763 A100 host).
  • Fail-loud on GPU-with-no-CPU-backend. If NVML is available but CodeCarbon fails to import, EnergyMeter() raises RuntimeError immediately. Silent half-measurement (GPU-only, CPU None) on a real leaderboard run would land inconsistent rows. Dev machines without NVML stay in soft "neither available" mode (no measurement, no crash) so local smoke tests on a laptop still work.
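The TDP-fallback idea in the first bullet can be sketched in a few lines. This is a minimal illustration, not CodeCarbon's actual implementation: the function name `estimate_cpu_energy_j`, the fixed sampling interval, the linear utilisation-to-power model, and the 280 W figure are all assumptions made for the example.

```python
from typing import Iterable


def estimate_cpu_energy_j(
    tdp_watts: float,
    utilisation_percent: Iterable[float],
    interval_s: float,
) -> float:
    """Integrate an assumed linear power model over utilisation samples.

    Hypothetical helper: power is modelled as tdp_watts * utilisation,
    with one sample (for example from psutil.cpu_percent()) per interval.
    """
    energy_j = 0.0
    for percent in utilisation_percent:
        power_w = tdp_watts * (percent / 100.0)  # instantaneous power estimate
        energy_j += power_w * interval_s         # rectangle-rule integration
    return energy_j


# Illustrative input: a 280 W TDP part sampled at 15 % utilisation, once per
# second, for ten seconds.
samples = [15.0] * 10
print(estimate_cpu_energy_j(280.0, samples, interval_s=1.0))
```

The point of the sketch is only that the estimate needs no hardware counters: a TDP lookup plus a utilisation time series is enough, which is why it works inside a container.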

API changes (in this PR)

  • Measurement dataclass: two new optional fields — cpu_energy_J, total_energy_J (both float | None, default None).
  • EnergyMeter.__init__ accepts gpu_backend / cpu_backend / p_floor_watts kwargs for dependency injection (used by the new unit tests). Defaults wrap pynvml and CodeCarbon respectively. Raises RuntimeError if GPU backend is available but CPU backend is not — see "fail-loud" above.
  • EnergyMeter.measure() populates the two new fields on the yielded Measurement on exit.
  • run_eval.py writes cpu_energy_J + total_energy_J into result.json in all three exit paths (pass, DQ time, DQ acc).
  • submit.py adds one .pip_install("codecarbon") to the Modal image. submit.py:append_record now writes total_energy_J to README's Record History column when present, falling back to training_energy_J for pre-PR runs.
  • requirements.txt adds codecarbon as a local dep.
  • New repo-root doc MAINTAINING.md (also added in this PR) — see "Maintenance rule" below.
  • README.md adds a dated banner above the Record History noting that rows ≥ 2026-05-20 report total_energy_J; earlier rows are kept as historical NVML-only readings.
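As a rough sketch of how the two new fields and the floor fit together: the field names below come from this description, but the dataclass layout and the helper `fill_total` are hypothetical and may not match the code in the diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Measurement:
    # Field names follow the PR description; the real dataclass may differ.
    energy_joules: float | None = None   # GPU (NVML) energy, net of idle baseline
    duration_s: float | None = None
    cpu_energy_J: float | None = None    # new in this PR
    total_energy_J: float | None = None  # new in this PR


def fill_total(m: Measurement, p_floor_watts: float = 50.0) -> Measurement:
    """Hypothetical helper: total = max(gpu + cpu, duration_s * p_floor_watts)."""
    if m.energy_joules is None or m.cpu_energy_J is None or m.duration_s is None:
        return m  # assumption: leave total_energy_J unset if a reading is missing
    measured = m.energy_joules + m.cpu_energy_J
    m.total_energy_J = max(measured, m.duration_s * p_floor_watts)
    return m


# Floor binds: 100 J + 50 J measured over 10 s is below 10 s * 50 W.
floored = fill_total(Measurement(energy_joules=100.0, duration_s=10.0, cpu_energy_J=50.0))
# Floor does not bind: the measured sum exceeds the floor.
summed = fill_total(Measurement(energy_joules=900.0, duration_s=10.0, cpu_energy_J=400.0))
print(floored.total_energy_J, summed.total_energy_J)
```

Keeping `energy_joules` untouched and writing the floored sum into a separate field is what lets older readers of result.json keep working.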

Backward compatibility

This change preserves every existing field's semantics. No existing field changes meaning, no existing test breaks, no existing pre-PR submission needs editing.

  • energy_joules keeps its prior semantics (GPU NVML net of idle baseline). Older result.json files are interpreted identically.
  • EnergyMeter.available still reflects NVML availability only. The pre-existing test_energy_meter_fallback_when_no_nvml and test_wall_clock_guard_captures_partial_measurement tests pass unmodified.
  • Measurement.__str__ still prints what it printed before (energy_joules + duration_s); the new CPU + total fields are additive on the dataclass, not in the human-readable summary.
  • The new floor (max(gpu + cpu, duration_s × p_floor_watts)) only applies to total_energy_J. energy_joules is unchanged regardless of CodeCarbon's behaviour.
  • requirements.txt adds codecarbon as a local dep; on a dev machine without it installed, EnergyMeter() only raises if there's also a real GPU present (i.e., you're trying to do a leaderboard-class measurement without the CPU backend). CPU-only dev boxes construct meters fine — they just don't measure anything (same as before).

In other words: any pre-PR caller using energy_joules reads identical numbers from old result.json files. Leaderboard runs on Modal get both fields populated or fail loudly. CPU-only dev tests don't accidentally start raising.
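The fallback for pre-PR files can be illustrated as follows. The key names `total_energy_J` and `training_energy_J` are the ones named above; the helper `record_energy_j` and the sample values are hypothetical and are not taken from submit.py.

```python
import json
from typing import Optional


def record_energy_j(result: dict) -> Optional[float]:
    """Prefer total_energy_J; fall back to training_energy_J for pre-PR files."""
    total = result.get("total_energy_J")
    if total is not None:
        return total
    return result.get("training_energy_J")


# Illustrative payloads only; real result.json files carry more keys.
old_result = json.loads('{"training_energy_J": 100.0}')
new_result = json.loads(
    '{"training_energy_J": 100.0, "cpu_energy_J": 40.0, "total_energy_J": 140.0}'
)
print(record_energy_j(old_result), record_energy_j(new_result))
```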

Tests (in this PR)

TDD'd one cycle at a time. 5 new tests, 8 existing tests preserved:

| Test | What it covers |
|---|---|
| test_energy_meter_total_is_gpu_plus_cpu | tracer: with both backends present (mocked), total_energy_J = gpu + cpu |
| test_total_energy_enforces_wall_clock_floor | sanity: total_energy_J >= duration_s × p_floor_watts even when CPU backend under-attributes |
| test_default_cpu_backend_uses_codecarbon_when_installed | default cpu_backend populates cpu_energy_J end-to-end |
| test_energy_meter_raises_when_gpu_available_but_cpu_missing | fail-loud: if NVML works but CodeCarbon doesn't, EnergyMeter() raises RuntimeError |
| test_energy_meter_dev_mode_no_raise_when_both_unavailable | dev pattern: both backends unavailable → no raise, no measurement |

python -m pytest test_wikitext.py → 13/13 pass (5 new + 8 existing, none modified).

Followed up with the official code-simplifier Claude plugin — 4 small clarity wins (dead None-checks removed, redundant except: self.available = False collapsed to pass, long ternaries wrapped). 13/13 still pass.

Modal validation

Smoke-tested by re-running an existing on-dev submission through the new harness inside the Modal A100 container. CodeCarbon installs cleanly (one pip_install line on the existing image), and result.json populates with the two new fields. The exact populated numbers from the leaderboard re-runs appear below.

Leaderboard re-validation

This PR is itself a setup change (it adds CPU energy to the scored quantity), so the rule in MAINTAINING.md requires re-running every leaderboard row on dev against the new harness before promoting to main. After PR #5 merged to dev (adding 13 new submissions), the re-validation now covers both the original upstream rows and the new ones.

Done (PCIe, fresh on new schema)

| Submission | Prior NVML J | gpu_J | cpu_J | total_J | acc | CPU / GPU |
|---|---:|---:|---:|---:|---:|---|
| 🥇 subset_70_mkn | 858 | 1,351 | 1,124 | 2,474 | 0.7031 | 0.83× |
| 🥈 gpu_ngram_w31_k11 | 1,245 | 1,612 | 1,480 | 3,092 | 0.7050 | 0.92× |
| 🥉 paq_mixer_v3 | 1,744 | 2,355 | 2,252 | 4,607 | 0.7048 | 0.96× |
| gpu_ngram_o14_xorfix | 3,172 | 3,981 | 4,621 | 8,602 | 0.7184 | 1.16× |
| deep_backoff_kn | 2,236 | 963 | 12,338 | 14,578 | 0.7184 | 12.8× (CPU-heavy build) |
| lwta_k4 | 46,222 | 44,329 | 9,354 | 53,683 | 0.7246 | 0.21× |
| lwta_k2 | 46,132 | 44,583 | 10,031 | 54,614 | 0.7145 | 0.22× |
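The CPU / GPU column is simply cpu_J divided by gpu_J. A short check, using the gpu_J and cpu_J values copied from the table above (the dictionary and variable names are only for illustration):

```python
# (gpu_J, cpu_J) per submission, copied from the re-validation table.
rows = {
    "subset_70_mkn": (1351, 1124),
    "gpu_ngram_w31_k11": (1612, 1480),
    "paq_mixer_v3": (2355, 2252),
    "gpu_ngram_o14_xorfix": (3981, 4621),
    "deep_backoff_kn": (963, 12338),
    "lwta_k4": (44329, 9354),
    "lwta_k2": (44583, 10031),
}

# Ratio of the CPU estimate to the GPU reading for each row.
ratios = {name: cpu_j / gpu_j for name, (gpu_j, cpu_j) in rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}×")
```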

Still in flight

| Submission | Prior NVML J | Status |
|---|---:|---|
| chunker_phase1_v1 | 5,918 | re-running on new harness |
| lwta_k4_alpha_065 | 13,174 | re-running |
| alpha_06 | 14,047 | re-running |
| modded_nanogpt | 51,704 | Modal scheduler keeps landing it on SXM4 → 300 s DQ time; PCIe retry in progress |

Will commit the remaining four when they land.

Findings

  • subset_70_mkn is the clean J leader on the new metric at 2,474 J total / 0.7031 PCIe — 20 % under gpu_ngram_w31_k11 (3,092 J / 0.7050) at the same accuracy band. On the prior NVML-only metric those two were a noise-floor tie (~860 J vs ~1,250 J was within run-to-run variance); the CPU side resolves the tie cleanly because subset_70_mkn's 70 %-data trick also cuts the CPU work proportionally.
  • deep_backoff_kn reranks dramatically. Prior NVML-only: 2,236 J (cheaper than xorfix at 3,172 J). New total: 14,578 J (12.8× heavier on CPU than GPU because its n-gram tables are built single-threaded on the host). Now visible at its true cost on the leaderboard. xorfix overtakes it on the new metric (8,602 vs 14,578) at the same accuracy.
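  • Under CodeCarbon's TDP-fallback the CPU estimate runs at roughly 42 W × duration_s, so the four cheapest submissions land at CPU ≈ GPU (0.83× to 1.16×) while the long lwta runs sit near 0.2×; CPU-bound submissions can have CPU ≫ GPU (deep_backoff_kn at 12.8×). The leak Yaroslav flagged is real and material.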
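The reranking can be reproduced from the seven re-run rows by sorting once on the prior NVML-only number and once on total_J. The values are copied from the "Done" table; the variable names are illustrative.

```python
# (prior NVML-only J, new total J) per submission, from the re-validation table.
rows = {
    "subset_70_mkn": (858, 2474),
    "gpu_ngram_w31_k11": (1245, 3092),
    "paq_mixer_v3": (1744, 4607),
    "gpu_ngram_o14_xorfix": (3172, 8602),
    "deep_backoff_kn": (2236, 14578),
    "lwta_k4": (46222, 53683),
    "lwta_k2": (46132, 54614),
}

by_nvml = sorted(rows, key=lambda name: rows[name][0])
by_total = sorted(rows, key=lambda name: rows[name][1])

# Submissions whose position differs between the two orderings.
moved = [name for name in rows if by_nvml.index(name) != by_total.index(name)]
for name in moved:
    print(f"{name}: rank {by_nvml.index(name) + 1} -> {by_total.index(name) + 1}")
```

Among these seven rows, gpu_ngram_o14_xorfix and deep_backoff_kn trade places, as do lwta_k4 and lwta_k2; the top three keep their order.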

Maintenance rule (MAINTAINING.md, added in this PR)

This PR introduces the first "setup change" since the upstream leaderboard was published. The leaderboard ranks submissions against each other, so half-old half-new comparisons aren't meaningful. To keep that comparison honest going forward, the rule is codified at the repo root in MAINTAINING.md:

When the competition setup changes in a way that can move where existing submissions land on the leaderboard, the upstream-leaderboard submissions must be re-run on the new setup before any new comparison is made.

The doc also includes the main ↔ dev branching cadence (feature PRs target dev; slow-cadence promotion PRs dev → main).

Open questions for review

  • Floor value: 50 W vs 100 W per GPU-slot. 50 W is conservative idle-share; 100 W matches per-slot fair share of a dual-EPYC-7763 A100 host.
  • Should total_energy_J be the new ranking metric, or report both side-by-side? This PR populates both; maintainer call on which becomes canonical.

Test plan

  • Local unit tests pass (pytest test_wikitext.py → 13/13)
  • Modal end-to-end smoke (CodeCarbon installs in the existing image; result.json populates the new fields)
  • Backward-compat: existing tests pass without modification; energy_joules semantics unchanged
  • Fail-loud verified: real-GPU-without-CodeCarbon raises, dev-mode soft path doesn't crash
  • code-simplifier plugin pass for clarity (13/13 still green)
  • MAINTAINING.md documents the setup-change re-run rule + main/dev cadence
  • 7 leaderboard rows re-run + committed (subset_70_mkn, gpu_ngram_w31_k11, paq_mixer_v3, gpu_ngram_o14_xorfix, deep_backoff_kn, lwta_k2, lwta_k4)
  • 4 leaderboard rows still in flight (chunker_phase1_v1, lwta_k4_alpha_065, alpha_06, modded_nanogpt PCIe retry)
  • Maintainer review on the floor value (50 W vs 100 W)
  • Maintainer review on which field becomes the canonical ranking metric

🤖 Generated with Claude Code

@yaroslavvb yaroslavvb self-requested a review May 20, 2026 18:44
@yaroslavvb

I approved it, but then realized that if it's in main, at least one person should run it and make sure it works. Basically, main is the public-facing branch, so maybe it's updated less frequently, kind of like PyTorch releases. Meanwhile, the "dev" branch could be the fast-moving internal branch, where it's okay to have occasional breakage.

@gabrielnan gabrielnan changed the base branch from main to dev May 20, 2026 23:45
yaroslavvb pushed a commit that referenced this pull request May 21, 2026
… hybrids

Builds on top of #3's gradient-free survey with a Pareto sweep
across (a) the chained-KN n-gram family at K=11/12/14, (b) a
data-subset paradigm (locally stationary corpus → cheaper builds),
(c) PAQ-style multi-order context mixing, (d) the Schmidhuber 1991
chunker hierarchical-surprise architecture, (e) NN+n-gram α-hybrids,
and (f) a 5-run AdamW reopen that closes the optimizer cluster
definitively.

## On the leaderboard (val char-acc ≥ 0.70, ranked by NVML energy)

| Submission | Val acc | Energy (J) | Mechanism |
|---|---:|---:|---|
| `subset_70_mkn` | 0.7031 |    858 | Chained-KN @ K=11 on first-70%-of-train; Chen-Goodman per-count discounts (D1, D2, D3+) |
| `gpu_ngram_w31_k11` | 0.7050 |  1,245 | Chained Kneser-Ney @ K=11 on full train (GPU torch.unique table build) |
| `paq_mixer_v3` | 0.7047 |  1,744 | PAQ-style multi-order context mixing: 11 independent count tables + 860-param logistic mixer |
| `deep_backoff_kn` | 0.7184 |  2,236 | Order-14 chained backoff + Kneser-Ney smoothing (CPU build via multiprocessing) |
| `gpu_ngram_o14_xorfix` | 0.7184 |  3,172 | Order-14 GPU n-gram with XOR-bit sort fix (eliminates 150s CPU re-sort at k≥9) |
| `chunker_phase1_v1` | 0.7057 |  5,918 | Schmidhuber 1991 chunker: lower-tier surprise gates a d=192/L=4 upper-tier transformer |
| `lwta_k4_alpha_065` | 0.7382 | 13,174 | LWTA-k=4 sparse activation in d=256/L=4 NN + W31 n-gram at α=0.65 |
| `alpha_06` | 0.7437 | 14,047 | NN + W31 n-gram hybrid at α=0.60 (highest acc clean) |

## DQ — informative paradigm probes (acc < 0.70 or time exceeded)

| Submission | Val acc | Energy (J) | Why it fails |
|---|---:|---:|---|
| `gpu_ngram_w31_k10` | 0.6975 |    878 | K=11 is the floor saturation depth; K=10 misses by 0.25pp |
| `adamw_lr3e3_wd0_long` | 0.7061 (PASS but iso-J dominated) | 41,071 | AdamW at proper LR + 3× more steps reaches floor, but at 2.8× Muon's energy → closes optimizer cluster definitively |
| `chunker_phase1_v2` | 0.5621 | 13,936 | Surprise-gated routing is essential — removing it (fixed α=0.6) loses 14pp |
| `bpe_internal_nn_v2` | 0.3973 | 24,417 | Per-byte argmax over BPE marginalization disagrees with token-level top-1; paradigm needs algorithmic redesign |
| `mamba_byte` |     NaN | 60,864 | Pure-PyTorch Mamba SSM without selective_scan_cuda kernel: NaN at step 300 |

## Headline findings

1. **Lowest validated NVML-J on the leaderboard:** `subset_70_mkn` at 858 J / 0.7031 — 60× under the modded_nanogpt baseline (51,704 J).
2. **K=11 is the floor saturation depth for chained-KN.** K=10 DQ at 0.6975 (-0.25pp below floor); K=11 lands at 0.7050.
3. **Modified Kneser-Ney per-count discounts re-open the "KN discount sweep doesn't help" finding.** Chen-Goodman's D1/D2/D3+ formula adds +0.0016pp at iso-K with no J increase.
4. **Locally-stationary corpus: first-70% data subset ≈ full data at this scale.** 30% J reduction at 0.33pp acc cost; random vs first chunks are indistinguishable.
5. **PAQ paradigm validates, but is structurally dominated by chained-KN at iso-K.** Independent per-order tables + mixer pays +29% J for +0pp acc vs chained backoff.
6. **Schmidhuber 1991 chunker passes on a modern byte-LM benchmark for the first time.** Lower-tier surprise (n-gram) gates a small transformer trained only at surprise positions. Pareto-dominated by chained-KN but paradigm-validated.
7. **Muon optimizer essential at this scale: confirmed by 5-run AdamW reopen.** At iso-architecture + iso-steps, AdamW is 2.8× Muon's energy to reach 0.70; in hybrid composition with W31, the AdamW NN contributes 0pp acc above the n-gram backstop.

## Related

- #4 — total-system-energy reporting (CodeCarbon CPU backend); the J numbers above are NVML-only to remain comparable with #3's leaderboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabrielnan gabrielnan force-pushed the total-system-energy branch from a7e8ada to e8532e2 Compare May 21, 2026 05:02
gabrielnan pushed a commit to gabrielnan/wikitext that referenced this pull request May 21, 2026
…ion pin

Polish pass per the /review feedback on PR cybertronai#4:

- ``_CodeCarbonCpuBackend`` docstring updated to reflect the loud-fail
  behaviour from EnergyMeter.__init__ (was stale post-fail-loud commit).
- Comment on the ``stop()`` signature asymmetry (GPU stop takes
  duration_s for idle subtraction, CPU stop doesn't because CodeCarbon
  timestamps internally).
- Loud-fail error message includes the fix command (``pip install
  codecarbon``).
- ``Measurement.__str__`` prints ``cpu_energy_J`` + ``total_energy_J``
  when populated (was only printing ``energy_joules`` + duration).
- Pin ``codecarbon~=3.2`` in both requirements.txt and the Modal image
  so the ``tracker._total_cpu_energy.kWh`` private-attr path stays
  stable across CodeCarbon updates.
- Two new tests:
  - ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing`` —
    explicit coverage of the dev-with-codecarbon-no-GPU path (previously
    covered only indirectly).
  - ``test_total_energy_none_when_only_one_backend_yields_value`` —
    ensures total stays None if either backend's stop() returns None.
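For context on the version pin: the private attribute named above reports kWh, so a conversion to joules is needed somewhere on the read path. The sketch below shows only that arithmetic. It deliberately does not import CodeCarbon; `fake_tracker` is a stand-in object that mimics the attribute path, and `kwh_to_joules` is a hypothetical helper name.

```python
from types import SimpleNamespace

JOULES_PER_KWH = 3_600_000.0  # 1 kWh = 1000 W * 3600 s


def kwh_to_joules(kwh: float) -> float:
    """Convert an energy reading from kilowatt-hours to joules."""
    return kwh * JOULES_PER_KWH


# Stand-in for a tracker object; NOT a real CodeCarbon tracker.
fake_tracker = SimpleNamespace(_total_cpu_energy=SimpleNamespace(kWh=0.0005))
cpu_energy_j = kwh_to_joules(fake_tracker._total_cpu_energy.kWh)
print(cpu_energy_j)
```

Because the read goes through an underscore-prefixed attribute, a compatible-release pin is a reasonable guard against the path moving in a later release.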

15/15 tests pass. Also committing the just-landed alpha_06 + modded_nanogpt
re-runs on the new schema (last two leaderboard rows; modded finally
passed on SXM4 inside the 300s cap on its third attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap where NVML-only measurement missed host CPU work
(raised by @yaroslavvb2 in Telegram on 2026-05-19: "Just total
system energy, subject to time + accuracy constraint"; "not
counting CPU utilization is a bit of leak").

Approach
---
CodeCarbon as the CPU backend, TDP-fallback mode (no MSR / RAPL /
``/dev/cpu/*/msr`` needed — Modal containers can't access them).
Field standard for cloud-ML energy reporting: HuggingFace ``Trainer``
auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used
the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab Zeus)
reports GPU-only because the same container constraint applies.

Behaviour
---
| NVML | CodeCarbon | Behaviour                                |
|------|------------|------------------------------------------|
| ✓    | ✓          | both fields populate, total = sum (floor)|
| ✓    | ✗          | EnergyMeter() raises RuntimeError        |
| ✗    | ✓          | soft; both energy fields None            |
| ✗    | ✗          | soft; both energy fields None            |

Loud-fail on real-GPU-with-broken-CPU prevents silent half-measurement
from landing inconsistent rows on the leaderboard. Dev-box patterns
(no GPU) stay soft so local smoke tests on a laptop still work.
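The four-row behaviour table can be exercised with a small stand-in. This is a sketch of the constructor-time check only, under the assumption that each backend exposes an `available` flag; `FakeBackend` and `MiniEnergyMeter` are hypothetical names and not the classes in this repository.

```python
class FakeBackend:
    """Hypothetical stand-in for a GPU or CPU energy backend."""

    def __init__(self, available: bool) -> None:
        self.available = available


class MiniEnergyMeter:
    """Sketch of the availability check described in the table above."""

    def __init__(self, gpu_backend: FakeBackend, cpu_backend: FakeBackend) -> None:
        if gpu_backend.available and not cpu_backend.available:
            # Loud failure: a real GPU without a CPU backend would yield
            # a half-measured leaderboard row.
            raise RuntimeError(
                "NVML is available but the CPU backend is not; "
                "try: pip install codecarbon"
            )
        # As in the description, `available` reflects NVML availability only.
        self.available = gpu_backend.available


outcomes = {}
for nvml in (True, False):
    for codecarbon in (True, False):
        try:
            MiniEnergyMeter(FakeBackend(nvml), FakeBackend(codecarbon))
            outcomes[(nvml, codecarbon)] = "ok"
        except RuntimeError:
            outcomes[(nvml, codecarbon)] = "raises"
print(outcomes)
```

Only the NVML-present, CodeCarbon-absent combination raises; the two no-GPU rows construct without error, which is what keeps laptop smoke tests working.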

Code changes
---
- ``Measurement`` dataclass gains ``cpu_energy_J`` and ``total_energy_J``
  (both ``float | None``, default ``None``). ``__str__`` includes them
  when populated.
- ``EnergyMeter`` refactored to take pluggable ``gpu_backend`` /
  ``cpu_backend`` / ``p_floor_watts`` kwargs (dependency injection for
  testability). Default backends wrap pynvml and CodeCarbon. Raises
  RuntimeError if NVML is available but the CPU backend isn't.
- ``measure()`` populates the new fields on the yielded Measurement;
  ``total_energy_J = max(gpu + cpu, duration_s * p_floor_watts)`` —
  floor protects against CodeCarbon under-attribution.
- ``run_eval.py`` writes the new fields to ``result.json`` in all
  three exit paths (pass, DQ time, DQ acc).
- ``submit.py`` adds ``codecarbon~=3.2`` to the Modal image, and
  ``append_record`` writes ``total_energy_J`` to README's Record History
  column when present, falling back to ``training_energy_J`` for
  pre-PR runs.
- ``requirements.txt`` adds ``codecarbon~=3.2`` as a local dep (minor
  pinned because EnergyMeter reads CodeCarbon's internal
  ``tracker._total_cpu_energy.kWh``).
- ``README.md`` adds a dated banner above the Record History noting
  that rows ≥ 2026-05-20 report ``total_energy_J``; earlier rows are
  kept as historical NVML-only readings.
- New ``MAINTAINING.md`` at the repo root documents (a) the
  setup-change re-run rule (when the harness changes in a way that
  shifts where existing submissions land, re-run the leaderboard rows
  before merging to main) and (b) the ``main`` ↔ ``dev`` branching
  cadence (feature PRs target ``dev``; slow-cadence promotion PRs
  ``dev`` → ``main``).
- ``.gitignore`` adds ``submissions/*/.CLAIMED`` (internal slot-claim
  metadata used by cross-session coordination scripts, not for upstream).

Backward compatibility
---
No existing field changes meaning, no existing test breaks.
- ``energy_joules`` keeps its prior semantics (GPU NVML net of idle
  baseline). Older ``result.json`` files are interpreted identically.
- ``EnergyMeter.available`` still reflects NVML availability only.
- The new floor only applies to ``total_energy_J``.
- ``submit.py:append_record`` falls back to ``training_energy_J`` for
  result.json files without the new fields.

Tests
---
TDD'd with 7 new unit tests, 8 pre-existing tests preserved unmodified:
- ``test_energy_meter_total_is_gpu_plus_cpu`` (tracer)
- ``test_total_energy_enforces_wall_clock_floor`` (floor binds)
- ``test_default_cpu_backend_uses_codecarbon_when_installed`` (live)
- ``test_energy_meter_raises_when_gpu_available_but_cpu_missing``
- ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing``
- ``test_total_energy_none_when_only_one_backend_yields_value``
- ``test_energy_meter_dev_mode_no_raise_when_both_unavailable``

15/15 pass.

Followed up with the anthropics/claude-plugins-official ``code-simplifier``
agent for a clarity pass (dead None-checks removed, redundant
``except: self.available = False`` collapsed, long ternaries wrapped).

Leaderboard re-validation (per MAINTAINING.md)
---
This PR is itself a setup change, so every leaderboard row on dev is
re-run on the new harness before merging to main. All 11 rows landed
(PCIe unless noted):

| Submission              | gpu_J  | cpu_J  | total_J | acc    |
|-------------------------|-------:|-------:|--------:|-------:|
| subset_70_mkn           |  1,351 |  1,124 |   2,474 | 0.7031 |
| gpu_ngram_w31_k11       |  1,612 |  1,480 |   3,092 | 0.7050 |
| paq_mixer_v3            |  2,355 |  2,252 |   4,607 | 0.7048 |
| gpu_ngram_o14_xorfix    |  3,981 |  4,621 |   8,602 | 0.7184 |
| chunker_phase1_v1       |  5,570 |  4,021 |   9,591 | 0.7063 |
| deep_backoff_kn         |    963 | 12,338 |  14,578 | 0.7184 |
| lwta_k4_alpha_065 (SXM4)| 13,751 |  6,170 |  19,922 | 0.7328 |
| alpha_06          (SXM4)| 14,614 |  6,129 |  20,743 | 0.7390 |
| lwta_k4                 | 44,329 |  9,354 |  53,683 | 0.7246 |
| lwta_k2                 | 44,583 | 10,031 |  54,614 | 0.7145 |
| modded_nanogpt    (SXM4)| 51,729 | 10,277 |  62,006 | 0.7337 |

Headline: subset_70_mkn lands at 2,474 J total / 0.7031 PCIe — the
new clean J leader, 20% under gpu_ngram_w31_k11 (3,092 J / 0.7050) at
the same accuracy band. On the prior NVML-only metric those two were
a noise-floor tie; the CPU side resolves the tie cleanly because
subset_70_mkn's 70%-data trick also cuts the CPU work proportionally.

CPU-bound submissions rerank dramatically. deep_backoff_kn (prior
NVML: 2,236 J) now reports 14,578 J total — its CPU energy is 12.8×
its GPU reading because its n-gram tables are built single-threaded
on the host. Now visible at full cost on the leaderboard.

Open questions for maintainer review
---
- Floor value: 50 W (default) vs 100 W per GPU-slot fair share. 50 W
  is conservative; 100 W matches dual-EPYC-7763 + DRAM fair-share for
  an 8-GPU host. One-line change.
- Should ``total_energy_J`` be the new canonical ranking metric, or
  report both side-by-side?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gabrielnan gabrielnan force-pushed the total-system-energy branch from b3bf973 to 91e1eb8 Compare May 21, 2026 05:44
@gabrielnan gabrielnan marked this pull request as ready for review May 21, 2026 05:46
@gabrielnan gabrielnan changed the title [draft] Add total-system-energy reporting (CodeCarbon CPU backend) Add total-system-energy reporting (CodeCarbon CPU backend) May 21, 2026

@ab-10 ab-10 left a comment


Verified that running the modded nanogpt submission works on my side.

IMO it's ready to merge to main, after you have the results for pending evals. Please also update the leaderboard on README.md

@yaroslavvb is there anything else you'd like before we merge to main?
