Add total-system-energy reporting (CodeCarbon CPU backend) by gabrielnan · Pull Request #4 · cybertronai/wikitext

gabrielnan · 2026-05-20T16:58:21Z

Summary

Adds CPU-energy accounting to EnergyMeter so result.json reports the total system energy (GPU + CPU), not just the NVML-only GPU number.

Motivation: raised by @yaroslavvb2 in Telegram on 2026-05-19:

Just total system energy, subject to time + accuracy constraint.
Not counting CPU utilization is a bit of leak, didn't expect it to be significant.

NVML is GPU-only — it cannot see what the host CPU is doing. A submission that runs its work on the host CPU (e.g., long np.unique passes during n-gram table construction) registers as nearly zero on NVML even though the host CPU is genuinely burning energy. This PR closes that leak by adding a CodeCarbon-backed CPU estimate alongside the existing NVML reading and surfacing both as separate fields in result.json.

Approach

CodeCarbon as the CPU backend. TDP-fallback mode — no MSR / RAPL / /dev/cpu/*/msr needed, since Modal containers cannot access them. CodeCarbon identifies the host CPU from /proc/cpuinfo, looks up its TDP from a bundled CSV (~2000 SKUs), and integrates psutil.cpu_percent() over the run.
Field standard for cloud-ML energy reporting. HuggingFace Trainer auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab, Zeus framework) reports GPU-only because the same container constraint applies; MLPerf Power requires physical wall meters and has no cloud submission path.
Floor protection. total_energy_J = max(gpu + cpu, duration_s × p_floor_watts), default p_floor_watts = 50 W, so CodeCarbon under-attribution can't shrink the reported total below a defensible lower bound (roughly the idle share of one GPU slot's fair share of a dual-EPYC-7763 A100 host).
Fail-loud on GPU-with-no-CPU-backend. If NVML is available but CodeCarbon fails to import, EnergyMeter() raises RuntimeError immediately. Silent half-measurement (GPU-only, CPU None) on a real leaderboard run would land inconsistent rows. Dev machines without NVML stay in soft "neither available" mode (no measurement, no crash) so local smoke tests on a laptop still work.
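A minimal sketch of the two formulas described in this section, assuming a fixed sampling interval and a utilisation-weighted TDP model. The helper names `cpu_energy_joules` and `total_energy_joules` are hypothetical and are not taken from the PR; CodeCarbon's real fallback model may weight utilisation differently.

```python
def cpu_energy_joules(tdp_watts, utilisation_samples, interval_s):
    """Approximate CPU energy by integrating utilisation-weighted TDP.

    utilisation_samples: CPU utilisation readings in percent (0-100),
    one per sampling interval of interval_s seconds (assumed fixed).
    """
    return sum(tdp_watts * (u / 100.0) * interval_s for u in utilisation_samples)


def total_energy_joules(gpu_j, cpu_j, duration_s, p_floor_watts=50.0):
    """Total energy with the floor from this PR: max(gpu + cpu, duration * floor)."""
    return max(gpu_j + cpu_j, duration_s * p_floor_watts)


# A 280 W TDP part sampled once per second at 50 % and then 100 % utilisation.
cpu_j = cpu_energy_joules(280.0, [50, 100], 1.0)

# The floor only binds when the measured sum falls below duration_s * p_floor_watts.
measured = total_energy_joules(gpu_j=1000.0, cpu_j=500.0, duration_s=10.0)
floored = total_energy_joules(gpu_j=10.0, cpu_j=5.0, duration_s=10.0)
```

In the second call the measured sum (15 J) is below 10 s × 50 W, so the floor value is reported instead.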

API changes (in this PR)

Measurement dataclass: two new optional fields — cpu_energy_J, total_energy_J (both float | None, default None).
EnergyMeter.__init__ accepts gpu_backend / cpu_backend / p_floor_watts kwargs for dependency injection (used by the new unit tests). Defaults wrap pynvml and CodeCarbon respectively. Raises RuntimeError if GPU backend is available but CPU backend is not — see "fail-loud" above.
EnergyMeter.measure() populates the two new fields on the yielded Measurement on exit.
run_eval.py writes cpu_energy_J + total_energy_J into result.json in all three exit paths (pass, DQ time, DQ acc).
submit.py adds one .pip_install("codecarbon") to the Modal image. submit.py:append_record now writes total_energy_J to README's Record History column when present, falling back to training_energy_J for pre-PR runs.
requirements.txt adds codecarbon as a local dep.
New repo-root doc MAINTAINING.md (also added in this PR) — see "Maintenance rule" below.
README.md adds a dated banner above the Record History noting that rows ≥ 2026-05-20 report total_energy_J; earlier rows are kept as historical NVML-only readings.
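The Record History fallback described for submit.py:append_record could look roughly like the sketch below. The function name `record_history_energy` is hypothetical; only the two keys `total_energy_J` and `training_energy_J` come from the description above, and the real append_record may be structured differently.

```python
def record_history_energy(result):
    """Pick the energy value for the Record History column.

    Prefers total_energy_J (new schema) and falls back to
    training_energy_J for result.json files written before this PR.
    """
    total = result.get("total_energy_J")
    if total is not None:
        return total
    return result.get("training_energy_J")


# New-schema result: the total (GPU + CPU) is reported.
new_row = record_history_energy({"training_energy_J": 1351.0, "total_energy_J": 2474.0})

# Pre-PR result: only the NVML-based number exists, so it is used as-is.
old_row = record_history_energy({"training_energy_J": 858.0})
```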

Backward compatibility

This change preserves the semantics of every existing field: no existing field changes meaning, no existing test breaks, and no pre-PR submission needs editing.

energy_joules keeps its prior semantics (GPU NVML net of idle baseline). Older result.json files are interpreted identically.
EnergyMeter.available still reflects NVML availability only. The pre-existing test_energy_meter_fallback_when_no_nvml and test_wall_clock_guard_captures_partial_measurement tests pass unmodified.
Measurement.__str__ still prints what it printed before (energy_joules + duration_s); the new CPU + total fields are additive on the dataclass, not in the human-readable summary.
The new floor (max(gpu + cpu, duration_s × p_floor_watts)) only applies to total_energy_J. energy_joules is unchanged regardless of CodeCarbon's behaviour.
requirements.txt adds codecarbon as a local dep; on a dev machine without it installed, EnergyMeter() only raises if there's also a real GPU present (i.e., you're trying to do a leaderboard-class measurement without the CPU backend). CPU-only dev boxes construct meters fine — they just don't measure anything (same as before).

In other words: any pre-PR caller using energy_joules reads identical numbers from old result.json files. Leaderboard runs on Modal get both fields populated or fail loudly. CPU-only dev tests don't accidentally start raising.
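The constructor behaviour summarised above can be sketched as a small decision function. `check_backends` is a hypothetical name used only for illustration, and the error text is an assumption; the real check lives inside EnergyMeter.__init__.

```python
def check_backends(gpu_available, cpu_available):
    """Return True if a measurement should be taken.

    Raises RuntimeError for the one combination the PR treats as fatal:
    a working GPU backend with no CPU backend.
    """
    if gpu_available and not cpu_available:
        raise RuntimeError(
            "GPU backend is available but the CPU backend is not; "
            "try `pip install codecarbon`"
        )
    # Without a GPU backend the meter stays soft: no measurement, no crash.
    return gpu_available and cpu_available


both = check_backends(True, True)        # leaderboard-class run
cpu_only = check_backends(False, True)   # dev box with codecarbon, no GPU
neither = check_backends(False, False)   # laptop smoke test
```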

Tests (in this PR)

TDD'd one cycle at a time. 5 new tests, 8 existing tests preserved:

| Test | What it covers |
|---|---|
| `test_energy_meter_total_is_gpu_plus_cpu` | tracer: with both backends present (mocked), `total_energy_J = gpu + cpu` |
| `test_total_energy_enforces_wall_clock_floor` | sanity: `total_energy_J >= duration_s × p_floor_watts` even when CPU backend under-attributes |
| `test_default_cpu_backend_uses_codecarbon_when_installed` | default `cpu_backend` populates `cpu_energy_J` end-to-end |
| `test_energy_meter_raises_when_gpu_available_but_cpu_missing` | fail-loud: if NVML works but CodeCarbon doesn't, `EnergyMeter()` raises `RuntimeError` |
| `test_energy_meter_dev_mode_no_raise_when_both_unavailable` | dev pattern: both backends unavailable → no raise, no measurement |

python -m pytest test_wikitext.py → 13/13 pass (5 new + 8 existing, none modified).
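To illustrate the dependency-injection pattern these tests rely on, here is a self-contained sketch with a stand-in meter and fake backends. `SketchMeter` and `FakeBackend` are hypothetical and are not the PR's classes; the field names follow the Measurement fields named in this description, and the `stop()` signatures are an assumption.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    energy_joules: Optional[float] = None
    duration_s: float = 0.0
    cpu_energy_J: Optional[float] = None
    total_energy_J: Optional[float] = None


class FakeBackend:
    """Test double that returns a fixed reading (joules, or None) on stop()."""

    def __init__(self, joules):
        self.joules = joules

    def start(self):
        pass

    def stop(self, duration_s=None):
        return self.joules


class SketchMeter:
    """Stand-in meter with injected backends (not the real EnergyMeter)."""

    def __init__(self, gpu_backend, cpu_backend, p_floor_watts=50.0):
        self.gpu = gpu_backend
        self.cpu = cpu_backend
        self.p_floor_watts = p_floor_watts

    @contextmanager
    def measure(self):
        m = Measurement()
        self.gpu.start()
        self.cpu.start()
        t0 = time.perf_counter()
        try:
            yield m
        finally:
            # Fields are filled in on exit, even if the body raised.
            m.duration_s = time.perf_counter() - t0
            m.energy_joules = self.gpu.stop(m.duration_s)
            m.cpu_energy_J = self.cpu.stop()
            if m.energy_joules is not None and m.cpu_energy_J is not None:
                m.total_energy_J = max(
                    m.energy_joules + m.cpu_energy_J,
                    m.duration_s * self.p_floor_watts,
                )


# With the floor disabled, the total is simply the sum of the two fakes.
meter = SketchMeter(FakeBackend(100.0), FakeBackend(40.0), p_floor_watts=0.0)
with meter.measure() as summed:
    pass
```

Injecting the backends keeps the tests independent of pynvml and CodeCarbon, which is why they can run on a machine with neither installed.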

Followed up with the official code-simplifier Claude plugin — 4 small clarity wins (dead None-checks removed, redundant except: self.available = False collapsed to pass, long ternaries wrapped). 13/13 still pass.

Modal validation

Smoke-tested by re-running an existing submission from dev through the new harness inside the Modal A100 container. CodeCarbon installs cleanly (one pip_install line on the existing image), and result.json is populated with the two new fields. The exact numbers from the leaderboard re-runs appear below.

Leaderboard re-validation

This PR is itself a setup change (it adds CPU energy to the scored quantity), so the rule in MAINTAINING.md requires re-running every leaderboard row on dev against the new harness before promoting to main. After PR #5 merged to dev (adding 13 new submissions), the re-validation now covers both the original upstream rows and the new ones.

Done (PCIe, fresh on new schema)

| Submission | Prior NVML J | gpu_J | cpu_J | total_J | acc | CPU / GPU |
|---|---:|---:|---:|---:|---:|---|
| 🥇 `subset_70_mkn` | 858 | 1,351 | 1,124 | 2,474 | 0.7031 | 0.83× |
| 🥈 `gpu_ngram_w31_k11` | 1,245 | 1,612 | 1,480 | 3,092 | 0.7050 | 0.92× |
| 🥉 `paq_mixer_v3` | 1,744 | 2,355 | 2,252 | 4,607 | 0.7048 | 0.96× |
| `gpu_ngram_o14_xorfix` | 3,172 | 3,981 | 4,621 | 8,602 | 0.7184 | 1.16× |
| `deep_backoff_kn` | 2,236 | 963 | 12,338 | 14,578 | 0.7184 | 12.8× (CPU-heavy build) |
| `lwta_k4` | 46,222 | 44,329 | 9,354 | 53,683 | 0.7246 | 0.21× |
| `lwta_k2` | 46,132 | 44,583 | 10,031 | 54,614 | 0.7145 | 0.22× |

Still in flight

| Submission | Prior NVML J | Status |
|---|---:|---|
| `chunker_phase1_v1` | 5,918 | re-running on new harness |
| `lwta_k4_alpha_065` | 13,174 | re-running |
| `alpha_06` | 14,047 | re-running |
| `modded_nanogpt` | 51,704 | Modal scheduler keeps placing it on SXM4, where it hits the 300 s time DQ; retry in progress, expected to land on PCIe |

Will commit the remaining four when they land.

Findings

subset_70_mkn is the clean J leader on the new metric at 2,474 J total / 0.7031 PCIe — 20 % under gpu_ngram_w31_k11 (3,092 J / 0.7050) at the same accuracy band. On the prior NVML-only metric those two were a noise-floor tie (~860 J vs ~1,250 J was within run-to-run variance); the CPU side resolves the tie cleanly because subset_70_mkn's 70 %-data trick also cuts the CPU work proportionally.
deep_backoff_kn reranks dramatically. Prior NVML-only: 2,236 J (cheaper than xorfix at 3,172 J). New total: 14,578 J (12.8× heavier on CPU than GPU because its n-gram tables are built single-threaded on the host). Now visible at its true cost on the leaderboard. xorfix overtakes it on the new metric (8,602 vs 14,578) at the same accuracy.
GPU-bound submissions have CPU ≈ GPU under CodeCarbon's TDP-fallback (~42 W × duration_s); CPU-bound submissions can have CPU ≫ GPU. The leak Yaroslav flagged is real and material.
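The ratios and the 20 % figure quoted in these findings can be re-derived from the "Done" table with a few lines of arithmetic. The numbers below are copied from that table; the helper names are hypothetical.

```python
# (gpu_J, cpu_J, total_J) copied from the "Done" table.
rows = {
    "subset_70_mkn": (1351, 1124, 2474),
    "gpu_ngram_w31_k11": (1612, 1480, 3092),
    "gpu_ngram_o14_xorfix": (3981, 4621, 8602),
    "deep_backoff_kn": (963, 12338, 14578),
}


def cpu_gpu_ratio(name):
    """CPU energy divided by GPU energy for one submission."""
    gpu_j, cpu_j, _ = rows[name]
    return cpu_j / gpu_j


def percent_under(a, b):
    """How far a's total sits below b's total, as a percentage of b's total."""
    return 100.0 * (rows[b][2] - rows[a][2]) / rows[b][2]


for name in rows:
    print(f"{name}: CPU/GPU = {cpu_gpu_ratio(name):.2f}x")

print(f"subset_70_mkn vs gpu_ngram_w31_k11: "
      f"{percent_under('subset_70_mkn', 'gpu_ngram_w31_k11'):.0f} % lower")
```

The same tuples show that deep_backoff_kn's reported total (14,578 J) is larger than gpu_J + cpu_J (13,301 J). That would be consistent with the floor binding for that run, but the run duration is not given here, so this is only a guess.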

Maintenance rule (`MAINTAINING.md`, added in this PR)

This PR introduces the first "setup change" since the upstream leaderboard was published. The leaderboard ranks submissions against each other, so half-old half-new comparisons aren't meaningful. To keep that comparison honest going forward, the rule is codified at the repo root in MAINTAINING.md:

When the competition setup changes in a way that can move where existing submissions land on the leaderboard, the upstream-leaderboard submissions must be re-run on the new setup before any new comparison is made.

The doc also includes the main ↔ dev branching cadence (feature PRs target dev; slow-cadence promotion PRs dev → main).

Open questions for review

Floor value: 50 W vs 100 W per GPU-slot. 50 W is conservative idle-share; 100 W matches per-slot fair share of a dual-EPYC-7763 A100 host.
Should total_energy_J be the new ranking metric, or report both side-by-side? This PR populates both; maintainer call on which becomes canonical.

Test plan

🤖 Generated with Claude Code

yaroslavvb · 2026-05-20T18:46:38Z

I approved it, but then realized that if it's in main, at least one person should run it and make sure it works. Basically, main is the public-facing branch, so maybe it's updated less frequently, kind of like PyTorch releases. Meanwhile, the "dev" branch could be the fast-moving internal branch, where it's okay to have occasional breakage.

… hybrids

Builds on top of #3's gradient-free survey with a Pareto sweep across (a) the chained-KN n-gram family at K=11/12/14, (b) a data-subset paradigm (locally stationary corpus → cheaper builds), (c) PAQ-style multi-order context mixing, (d) the Schmidhuber 1991 chunker hierarchical-surprise architecture, (e) NN+n-gram α-hybrids, and (f) a 5-run AdamW reopen that closes the optimizer cluster definitively.

## On the leaderboard (val char-acc ≥ 0.70, ranked by NVML energy)

| Submission | Val acc | Energy (J) | Mechanism |
|---|---:|---:|---|
| `subset_70_mkn` | 0.7031 | 858 | Chained-KN @ K=11 on first-70%-of-train; Chen-Goodman per-count discounts (D1, D2, D3+) |
| `gpu_ngram_w31_k11` | 0.7050 | 1,245 | Chained Kneser-Ney @ K=11 on full train (GPU torch.unique table build) |
| `paq_mixer_v3` | 0.7047 | 1,744 | PAQ-style multi-order context mixing: 11 independent count tables + 860-param logistic mixer |
| `deep_backoff_kn` | 0.7184 | 2,236 | Order-14 chained backoff + Kneser-Ney smoothing (CPU build via multiprocessing) |
| `gpu_ngram_o14_xorfix` | 0.7184 | 3,172 | Order-14 GPU n-gram with XOR-bit sort fix (eliminates 150s CPU re-sort at k≥9) |
| `chunker_phase1_v1` | 0.7057 | 5,918 | Schmidhuber 1991 chunker: lower-tier surprise gates a d=192/L=4 upper-tier transformer |
| `lwta_k4_alpha_065` | 0.7382 | 13,174 | LWTA-k=4 sparse activation in d=256/L=4 NN + W31 n-gram at α=0.65 |
| `alpha_06` | 0.7437 | 14,047 | NN + W31 n-gram hybrid at α=0.60 (highest acc clean) |

## DQ — informative paradigm probes (acc < 0.70 or time exceeded)

| Submission | Val acc | Energy (J) | Why it fails |
|---|---:|---:|---|
| `gpu_ngram_w31_k10` | 0.6975 | 878 | K=11 is the floor saturation depth; K=10 misses by 0.25pp |
| `adamw_lr3e3_wd0_long` | 0.7061 (PASS but iso-J dominated) | 41,071 | AdamW at proper LR + 3× more steps reaches floor, but at 2.8× Muon's energy → closes optimizer cluster definitively |
| `chunker_phase1_v2` | 0.5621 | 13,936 | Surprise-gated routing is essential — removing it (fixed α=0.6) loses 14pp |
| `bpe_internal_nn_v2` | 0.3973 | 24,417 | Per-byte argmax over BPE marginalization disagrees with token-level top-1; paradigm needs algorithmic redesign |
| `mamba_byte` | NaN | 60,864 | Pure-PyTorch Mamba SSM without selective_scan_cuda kernel: NaN at step 300 |

## Headline findings

1. **Lowest validated NVML-J on the leaderboard:** `subset_70_mkn` at 858 J / 0.7031 — 60× under the modded_nanogpt baseline (51,704 J).
2. **K=11 is the floor saturation depth for chained-KN.** K=10 DQ at 0.6975 (-0.25pp below floor); K=11 lands at 0.7050.
3. **Modified Kneser-Ney per-count discounts re-open the "KN discount sweep doesn't help" finding.** Chen-Goodman's D1/D2/D3+ formula adds +0.0016pp at iso-K with no J increase.
4. **Locally-stationary corpus: first-70% data subset ≈ full data at this scale.** 30% J reduction at 0.33pp acc cost; random vs first chunks are indistinguishable.
5. **PAQ paradigm validates, but is structurally dominated by chained-KN at iso-K.** Independent per-order tables + mixer pays +29% J for +0pp acc vs chained backoff.
6. **Schmidhuber 1991 chunker passes on a modern byte-LM benchmark for the first time.** Lower-tier surprise (n-gram) gates a small transformer trained only at surprise positions. Pareto-dominated by chained-KN but paradigm-validated.
7. **Muon optimizer essential at this scale: confirmed by 5-run AdamW reopen.** At iso-architecture + iso-steps, AdamW is 2.8× Muon's energy to reach 0.70; in hybrid composition with W31, the AdamW NN contributes 0pp acc above the n-gram backstop.

## Related

- #4 — total-system-energy reporting (CodeCarbon CPU backend); the J numbers above are NVML-only to remain comparable with #3's leaderboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ion pin

Polish pass per the /review feedback on PR cybertronai#4:

- ``_CodeCarbonCpuBackend`` docstring updated to reflect the loud-fail behaviour from EnergyMeter.__init__ (was stale post-fail-loud commit).
- Comment on the ``stop()`` signature asymmetry (GPU stop takes duration_s for idle subtraction, CPU stop doesn't because CodeCarbon timestamps internally).
- Loud-fail error message includes the fix command (``pip install codecarbon``).
- ``Measurement.__str__`` prints ``cpu_energy_J`` + ``total_energy_J`` when populated (was only printing ``energy_joules`` + duration).
- Pin ``codecarbon~=3.2`` in both requirements.txt and the Modal image so the ``tracker._total_cpu_energy.kWh`` private-attr path stays stable across CodeCarbon updates.
- Two new tests:
  - ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing`` — explicit coverage of the dev-with-codecarbon-no-GPU path (previously covered only indirectly).
  - ``test_total_energy_none_when_only_one_backend_yields_value`` — ensures total stays None if either backend's stop() returns None.

15/15 tests pass.

Also committing the just-landed alpha_06 + modded_nanogpt re-runs on the new schema (last two leaderboard rows; modded finally passed on SXM4 inside the 300s cap on its third attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@yaroslavvb2

Closes the gap where NVML-only measurement missed host CPU work (raised by @yaroslavvb2 in Telegram on 2026-05-19: "Just total system energy, subject to time + accuracy constraint"; "not counting CPU utilization is a bit of leak").

Approach
---
CodeCarbon as the CPU backend, TDP-fallback mode (no MSR / RAPL / ``/dev/cpu/*/msr`` needed — Modal containers can't access them). Field standard for cloud-ML energy reporting: HuggingFace ``Trainer`` auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab Zeus) reports GPU-only because the same container constraint applies.

Behaviour
---
| NVML | CodeCarbon | Behaviour |
|------|------------|------------------------------------------|
| ✓ | ✓ | both fields populate, total = sum (floor) |
| ✓ | ✗ | EnergyMeter() raises RuntimeError |
| ✗ | ✓ | soft; both energy fields None |
| ✗ | ✗ | soft; both energy fields None |

Loud-fail on real-GPU-with-broken-CPU prevents silent half-measurement from landing inconsistent rows on the leaderboard. Dev-box patterns (no GPU) stay soft so local smoke tests on a laptop still work.

Code changes
---
- ``Measurement`` dataclass gains ``cpu_energy_J`` and ``total_energy_J`` (both ``float | None``, default ``None``). ``__str__`` includes them when populated.
- ``EnergyMeter`` refactored to take pluggable ``gpu_backend`` / ``cpu_backend`` / ``p_floor_watts`` kwargs (dependency injection for testability). Default backends wrap pynvml and CodeCarbon. Raises RuntimeError if NVML is available but the CPU backend isn't.
- ``measure()`` populates the new fields on the yielded Measurement; ``total_energy_J = max(gpu + cpu, duration_s * p_floor_watts)`` — floor protects against CodeCarbon under-attribution.
- ``run_eval.py`` writes the new fields to ``result.json`` in all three exit paths (pass, DQ time, DQ acc).
- ``submit.py`` adds ``codecarbon~=3.2`` to the Modal image, and ``append_record`` writes ``total_energy_J`` to README's Record History column when present, falling back to ``training_energy_J`` for pre-PR runs.
- ``requirements.txt`` adds ``codecarbon~=3.2`` as a local dep (minor pinned because EnergyMeter reads CodeCarbon's internal ``tracker._total_cpu_energy.kWh``).
- ``README.md`` adds a dated banner above the Record History noting that rows ≥ 2026-05-20 report ``total_energy_J``; earlier rows are kept as historical NVML-only readings.
- New ``MAINTAINING.md`` at the repo root documents (a) the setup-change re-run rule (when the harness changes in a way that shifts where existing submissions land, re-run the leaderboard rows before merging to main) and (b) the ``main`` ↔ ``dev`` branching cadence (feature PRs target ``dev``; slow-cadence promotion PRs ``dev`` → ``main``).
- ``.gitignore`` adds ``submissions/*/.CLAIMED`` (internal slot-claim metadata used by cross-session coordination scripts, not for upstream).

Backward compatibility
---
No existing field changes meaning, no existing test breaks.

- ``energy_joules`` keeps its prior semantic (GPU NVML net of idle baseline). Older ``result.json`` files are interpreted identically.
- ``EnergyMeter.available`` still reflects NVML availability only.
- The new floor only applies to ``total_energy_J``.
- ``submit.py:append_record`` falls back to ``training_energy_J`` for result.json files without the new fields.

Tests
---
TDD'd with 7 new unit tests, 8 pre-existing tests preserved unmodified:

- ``test_energy_meter_total_is_gpu_plus_cpu`` (tracer)
- ``test_total_energy_enforces_wall_clock_floor`` (floor binds)
- ``test_default_cpu_backend_uses_codecarbon_when_installed`` (live)
- ``test_energy_meter_raises_when_gpu_available_but_cpu_missing``
- ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing``
- ``test_total_energy_none_when_only_one_backend_yields_value``
- ``test_energy_meter_dev_mode_no_raise_when_both_unavailable``

15/15 pass. Followed up with the anthropics/claude-plugins-official ``code-simplifier`` agent for a clarity pass (dead None-checks removed, redundant ``except: self.available = False`` collapsed, long ternaries wrapped).

Leaderboard re-validation (per MAINTAINING.md)
---
This PR is itself a setup change, so every leaderboard row on dev is re-run on the new harness before merging to main. All 11 rows landed (PCIe unless noted):

| Submission | gpu_J | cpu_J | total_J | acc |
|-------------------------|-------:|-------:|--------:|-------:|
| subset_70_mkn | 1,351 | 1,124 | 2,474 | 0.7031 |
| gpu_ngram_w31_k11 | 1,612 | 1,480 | 3,092 | 0.7050 |
| paq_mixer_v3 | 2,355 | 2,252 | 4,607 | 0.7048 |
| gpu_ngram_o14_xorfix | 3,981 | 4,621 | 8,602 | 0.7184 |
| chunker_phase1_v1 | 5,570 | 4,021 | 9,591 | 0.7063 |
| deep_backoff_kn | 963 | 12,338 | 14,578 | 0.7184 |
| lwta_k4_alpha_065 (SXM4) | 13,751 | 6,170 | 19,922 | 0.7328 |
| alpha_06 (SXM4) | 14,614 | 6,129 | 20,743 | 0.7390 |
| lwta_k4 | 44,329 | 9,354 | 53,683 | 0.7246 |
| lwta_k2 | 44,583 | 10,031 | 54,614 | 0.7145 |
| modded_nanogpt (SXM4) | 51,729 | 10,277 | 62,006 | 0.7337 |

Headline: subset_70_mkn lands at 2,474 J total / 0.7031 PCIe — the new clean J leader, 20% under gpu_ngram_w31_k11 (3,092 J / 0.7050) at the same accuracy band. On the prior NVML-only metric those two were a noise-floor tie; the CPU side resolves the tie cleanly because subset_70_mkn's 70%-data trick also cuts the CPU work proportionally.

CPU-bound submissions rerank dramatically. deep_backoff_kn (prior NVML: 2,236 J) now reports 14,578 J total — its CPU energy is 12.8× its GPU reading because its n-gram tables are built single-threaded on the host. Now visible at full cost on the leaderboard.

Open questions for maintainer review
---
- Floor value: 50 W (default) vs 100 W per GPU-slot fair share. 50 W is conservative; 100 W matches dual-EPYC-7763 + DRAM fair-share for an 8-GPU host. One-line change.
- Should ``total_energy_J`` be the new canonical ranking metric, or report both side-by-side?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ab-10

Verified that running the modded nanogpt submission works on my side.

IMO it's ready to merge to main, after you have the results for pending evals. Please also update the leaderboard on README.md

@yaroslavvb is there anything else you'd like before we merge to main?

gabrielnan mentioned this pull request May 20, 2026

[draft] Chained-KN @ K=11, data-subset, PAQ, chunker — 13 new submissions #5

Merged

5 tasks

yaroslavvb approved these changes May 20, 2026

yaroslavvb self-requested a review May 20, 2026 18:44

gabrielnan changed the base branch from main to dev May 20, 2026 23:45

yaroslavvb approved these changes May 21, 2026

gabrielnan force-pushed the total-system-energy branch from a7e8ada to e8532e2 Compare May 21, 2026 05:02

gabrielnan force-pushed the total-system-energy branch from b3bf973 to 91e1eb8 Compare May 21, 2026 05:44

gabrielnan marked this pull request as ready for review May 21, 2026 05:46

gabrielnan changed the title ~~[draft] Add total-system-energy reporting (CodeCarbon CPU backend)~~ Add total-system-energy reporting (CodeCarbon CPU backend) May 21, 2026

ab-10 approved these changes May 22, 2026

Add CPU energy usage to total

e2a2544

gabrielnan wants to merge 2 commits into cybertronai:dev from gabrielnan:total-system-energy
3 participants