Add total-system-energy reporting (CodeCarbon CPU backend) #4
Open
gabrielnan wants to merge 2 commits into
yaroslavvb
approved these changes
May 20, 2026
Contributor
I approved it, but then realized that if it's in main, at least one person should run it and make sure it works. Basically, main is the public-facing branch, so maybe it's updated less frequently, kind of like PyTorch releases. Meanwhile, a "dev" branch could be the fast-moving internal branch, where it's okay to have occasional breakage.
yaroslavvb
pushed a commit
that referenced
this pull request
May 21, 2026
… hybrids

Builds on top of #3's gradient-free survey with a Pareto sweep across (a) the chained-KN n-gram family at K=11/12/14, (b) a data-subset paradigm (locally stationary corpus → cheaper builds), (c) PAQ-style multi-order context mixing, (d) the Schmidhuber 1991 chunker hierarchical-surprise architecture, (e) NN+n-gram α-hybrids, and (f) a 5-run AdamW reopen that closes the optimizer cluster definitively.

## On the leaderboard (val char-acc ≥ 0.70, ranked by NVML energy)

| Submission | Val acc | Energy (J) | Mechanism |
|---|---:|---:|---|
| `subset_70_mkn` | 0.7031 | 858 | Chained-KN @ K=11 on first-70%-of-train; Chen-Goodman per-count discounts (D1, D2, D3+) |
| `gpu_ngram_w31_k11` | 0.7050 | 1,245 | Chained Kneser-Ney @ K=11 on full train (GPU torch.unique table build) |
| `paq_mixer_v3` | 0.7047 | 1,744 | PAQ-style multi-order context mixing: 11 independent count tables + 860-param logistic mixer |
| `deep_backoff_kn` | 0.7184 | 2,236 | Order-14 chained backoff + Kneser-Ney smoothing (CPU build via multiprocessing) |
| `gpu_ngram_o14_xorfix` | 0.7184 | 3,172 | Order-14 GPU n-gram with XOR-bit sort fix (eliminates 150s CPU re-sort at k≥9) |
| `chunker_phase1_v1` | 0.7057 | 5,918 | Schmidhuber 1991 chunker: lower-tier surprise gates a d=192/L=4 upper-tier transformer |
| `lwta_k4_alpha_065` | 0.7382 | 13,174 | LWTA-k=4 sparse activation in d=256/L=4 NN + W31 n-gram at α=0.65 |
| `alpha_06` | 0.7437 | 14,047 | NN + W31 n-gram hybrid at α=0.60 (highest acc clean) |

## DQ: informative paradigm probes (acc < 0.70 or time exceeded)

| Submission | Val acc | Energy (J) | Why it fails |
|---|---:|---:|---|
| `gpu_ngram_w31_k10` | 0.6975 | 878 | K=11 is the floor saturation depth; K=10 misses by 0.25pp |
| `adamw_lr3e3_wd0_long` | 0.7061 (PASS but iso-J dominated) | 41,071 | AdamW at proper LR + 3× more steps reaches floor, but at 2.8× Muon's energy → closes optimizer cluster definitively |
| `chunker_phase1_v2` | 0.5621 | 13,936 | Surprise-gated routing is essential: removing it (fixed α=0.6) loses 14pp |
| `bpe_internal_nn_v2` | 0.3973 | 24,417 | Per-byte argmax over BPE marginalization disagrees with token-level top-1; paradigm needs algorithmic redesign |
| `mamba_byte` | NaN | 60,864 | Pure-PyTorch Mamba SSM without selective_scan_cuda kernel: NaN at step 300 |

## Headline findings

1. **Lowest validated NVML-J on the leaderboard:** `subset_70_mkn` at 858 J / 0.7031, 60× under the modded_nanogpt baseline (51,704 J).
2. **K=11 is the floor saturation depth for chained-KN.** K=10 DQ at 0.6975 (-0.25pp below floor); K=11 lands at 0.7050.
3. **Modified Kneser-Ney per-count discounts re-open the "KN discount sweep doesn't help" finding.** Chen-Goodman's D1/D2/D3+ formula adds +0.0016pp at iso-K with no J increase.
4. **Locally-stationary corpus: first-70% data subset ≈ full data at this scale.** 30% J reduction at 0.33pp acc cost; random vs first chunks are indistinguishable.
5. **PAQ paradigm validates, but is structurally dominated by chained-KN at iso-K.** Independent per-order tables + mixer pays +29% J for +0pp acc vs chained backoff.
6. **Schmidhuber 1991 chunker passes on a modern byte-LM benchmark for the first time.** Lower-tier surprise (n-gram) gates a small transformer trained only at surprise positions. Pareto-dominated by chained-KN but paradigm-validated.
7. **Muon optimizer essential at this scale: confirmed by 5-run AdamW reopen.** At iso-architecture + iso-steps, AdamW is 2.8× Muon's energy to reach 0.70; in hybrid composition with W31, the AdamW NN contributes 0pp acc above the n-gram backstop.

## Related

- #4: total-system-energy reporting (CodeCarbon CPU backend); the J numbers above are NVML-only to remain comparable with #3's leaderboard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
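For readers unfamiliar with the per-count discounts named in headline finding 3, here is a minimal sketch of how D1, D2 and D3+ are usually derived from counts-of-counts. It assumes the standard Chen-Goodman closed form; the commit message does not show the submission's own code, so the function name `modified_kn_discounts` and its arguments are illustrative only.

```python
def modified_kn_discounts(n1: int, n2: int, n3: int, n4: int) -> tuple[float, float, float]:
    """Return (D1, D2, D3+) for modified Kneser-Ney smoothing.

    n_k is the number of distinct n-grams observed exactly k times.
    Standard Chen-Goodman closed form; not taken from the submission.
    """
    if min(n1, n2, n3) <= 0:
        raise ValueError("n1, n2 and n3 must be positive")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1       # discount for n-grams seen once
    d2 = 2 - 3 * y * n3 / n2       # discount for n-grams seen twice
    d3_plus = 3 - 4 * y * n4 / n3  # discount for n-grams seen 3+ times
    return d1, d2, d3_plus


if __name__ == "__main__":
    # Made-up counts-of-counts, purely for illustration.
    print(modified_kn_discounts(100, 50, 20, 10))
```

Because the discounts are closed-form functions of table statistics that already exist, computing them adds no training pass, which fits the "no J increase" remark in finding 3.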
yaroslavvb
approved these changes
May 21, 2026
a7e8ada to
e8532e2
Compare
gabrielnan
pushed a commit
to gabrielnan/wikitext
that referenced
this pull request
May 21, 2026
…ion pin

Polish pass per the /review feedback on PR cybertronai#4:

- ``_CodeCarbonCpuBackend`` docstring updated to reflect the loud-fail behaviour from EnergyMeter.__init__ (was stale post-fail-loud commit).
- Comment on the ``stop()`` signature asymmetry (GPU stop takes duration_s for idle subtraction, CPU stop doesn't because CodeCarbon timestamps internally).
- Loud-fail error message includes the fix command (``pip install codecarbon``).
- ``Measurement.__str__`` prints ``cpu_energy_J`` + ``total_energy_J`` when populated (was only printing ``energy_joules`` + duration).
- Pin ``codecarbon~=3.2`` in both requirements.txt and the Modal image so the ``tracker._total_cpu_energy.kWh`` private-attr path stays stable across CodeCarbon updates.
- Two new tests:
  - ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing``: explicit coverage of the dev-with-codecarbon-no-GPU path (previously covered only indirectly).
  - ``test_total_energy_none_when_only_one_backend_yields_value``: ensures total stays None if either backend's stop() returns None.

15/15 tests pass.

Also committing the just-landed alpha_06 + modded_nanogpt re-runs on the new schema (last two leaderboard rows; modded finally passed on SXM4 inside the 300s cap on its third attempt).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
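The ``Measurement.__str__`` change in this commit can be pictured with a small sketch. The four field names come from this PR; their types and defaults for the pre-existing fields, and the exact output format, are assumptions made for illustration and are not the repository's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Measurement:
    # Pre-existing fields (types and defaults assumed for this sketch).
    energy_joules: Optional[float] = None
    duration_s: float = 0.0
    # Fields added by this PR; both stay None when not measured.
    cpu_energy_J: Optional[float] = None
    total_energy_J: Optional[float] = None

    def __str__(self) -> str:
        parts = []
        if self.energy_joules is not None:
            parts.append(f"energy_joules={self.energy_joules:.1f}")
        parts.append(f"duration_s={self.duration_s:.1f}")
        # Only mention the new fields when they were actually populated.
        if self.cpu_energy_J is not None:
            parts.append(f"cpu_energy_J={self.cpu_energy_J:.1f}")
        if self.total_energy_J is not None:
            parts.append(f"total_energy_J={self.total_energy_J:.1f}")
        return ", ".join(parts)
```

Printing the new fields only when populated keeps the summary for GPU-only or unmeasured runs identical to what older callers saw.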
Closes the gap where NVML-only measurement missed host CPU work (raised by @yaroslavvb2 in Telegram on 2026-05-19: "Just total system energy, subject to time + accuracy constraint"; "not counting CPU utilization is a bit of leak").

Approach
---
CodeCarbon as the CPU backend, TDP-fallback mode (no MSR / RAPL / ``/dev/cpu/*/msr`` needed, since Modal containers can't access them). Field standard for cloud-ML energy reporting: HuggingFace ``Trainer`` auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab Zeus) reports GPU-only because the same container constraint applies.

Behaviour
---
| NVML | CodeCarbon | Behaviour                                 |
|------|------------|-------------------------------------------|
| ✓    | ✓          | both fields populate, total = sum (floor) |
| ✓    | ✗          | EnergyMeter() raises RuntimeError         |
| ✗    | ✓          | soft; both energy fields None             |
| ✗    | ✗          | soft; both energy fields None             |

Loud-fail on real-GPU-with-broken-CPU prevents silent half-measurement from landing inconsistent rows on the leaderboard. Dev-box patterns (no GPU) stay soft so local smoke tests on a laptop still work.

Code changes
---
- ``Measurement`` dataclass gains ``cpu_energy_J`` and ``total_energy_J`` (both ``float | None``, default ``None``). ``__str__`` includes them when populated.
- ``EnergyMeter`` refactored to take pluggable ``gpu_backend`` / ``cpu_backend`` / ``p_floor_watts`` kwargs (dependency injection for testability). Default backends wrap pynvml and CodeCarbon. Raises RuntimeError if NVML is available but the CPU backend isn't.
- ``measure()`` populates the new fields on the yielded Measurement; ``total_energy_J = max(gpu + cpu, duration_s * p_floor_watts)``. The floor protects against CodeCarbon under-attribution.
- ``run_eval.py`` writes the new fields to ``result.json`` in all three exit paths (pass, DQ time, DQ acc).
- ``submit.py`` adds ``codecarbon~=3.2`` to the Modal image, and ``append_record`` writes ``total_energy_J`` to README's Record History column when present, falling back to ``training_energy_J`` for pre-PR runs.
- ``requirements.txt`` adds ``codecarbon~=3.2`` as a local dep (minor pinned because EnergyMeter reads CodeCarbon's internal ``tracker._total_cpu_energy.kWh``).
- ``README.md`` adds a dated banner above the Record History noting that rows ≥ 2026-05-20 report ``total_energy_J``; earlier rows are kept as historical NVML-only readings.
- New ``MAINTAINING.md`` at the repo root documents (a) the setup-change re-run rule (when the harness changes in a way that shifts where existing submissions land, re-run the leaderboard rows before merging to main) and (b) the ``main`` ↔ ``dev`` branching cadence (feature PRs target ``dev``; slow-cadence promotion PRs ``dev`` → ``main``).
- ``.gitignore`` adds ``submissions/*/.CLAIMED`` (internal slot-claim metadata used by cross-session coordination scripts, not for upstream).

Backward compatibility
---
No existing field changes meaning, no existing test breaks.

- ``energy_joules`` keeps its prior semantic (GPU NVML net of idle baseline). Older ``result.json`` files are interpreted identically.
- ``EnergyMeter.available`` still reflects NVML availability only.
- The new floor only applies to ``total_energy_J``.
- ``submit.py:append_record`` falls back to ``training_energy_J`` for result.json files without the new fields.
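The behaviour table and the floor rule above can be sketched as two small functions. This is an illustration under stated assumptions, not the PR's ``EnergyMeter``: the names ``check_backends`` and ``combine_energy`` are hypothetical, and the real class wires the same logic through its ``gpu_backend`` / ``cpu_backend`` / ``p_floor_watts`` kwargs.

```python
from typing import Optional


def check_backends(gpu_available: bool, cpu_available: bool) -> bool:
    """Return True when measurement is enabled; raise on GPU-without-CPU.

    Mirrors the four rows of the behaviour table (hypothetical helper).
    """
    if gpu_available and not cpu_available:
        # Loud-fail: a real GPU run must not silently skip CPU accounting.
        raise RuntimeError(
            "NVML is available but the CPU backend is not; "
            "try `pip install codecarbon`"
        )
    # Soft mode: without NVML nothing is measured, with or without CodeCarbon.
    return gpu_available and cpu_available


def combine_energy(
    gpu_J: Optional[float],
    cpu_J: Optional[float],
    duration_s: float,
    p_floor_watts: float = 50.0,
) -> Optional[float]:
    """total_energy_J = max(gpu + cpu, duration_s * p_floor_watts)."""
    if gpu_J is None or cpu_J is None:
        return None  # total stays None if either backend yields nothing
    return max(gpu_J + cpu_J, duration_s * p_floor_watts)
```

With the default 50 W floor, a 10 s run that reports 100 J GPU and 20 J CPU would be recorded as 500 J, since the floor binds; a run whose measured sum already exceeds the floor is reported as the plain sum.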
Tests
---
TDD'd with 7 new unit tests, 8 pre-existing tests preserved unmodified:

- ``test_energy_meter_total_is_gpu_plus_cpu`` (tracer)
- ``test_total_energy_enforces_wall_clock_floor`` (floor binds)
- ``test_default_cpu_backend_uses_codecarbon_when_installed`` (live)
- ``test_energy_meter_raises_when_gpu_available_but_cpu_missing``
- ``test_energy_meter_no_raise_when_cpu_present_but_gpu_missing``
- ``test_total_energy_none_when_only_one_backend_yields_value``
- ``test_energy_meter_dev_mode_no_raise_when_both_unavailable``

15/15 pass. Followed up with the anthropics/claude-plugins-official ``code-simplifier`` agent for a clarity pass (dead None-checks removed, redundant ``except: self.available = False`` collapsed, long ternaries wrapped).

Leaderboard re-validation (per MAINTAINING.md)
---
This PR is itself a setup change, so every leaderboard row on dev is re-run on the new harness before merging to main. All 11 rows landed (PCIe unless noted):

| Submission               |  gpu_J |  cpu_J | total_J |    acc |
|--------------------------|-------:|-------:|--------:|-------:|
| subset_70_mkn            |  1,351 |  1,124 |   2,474 | 0.7031 |
| gpu_ngram_w31_k11        |  1,612 |  1,480 |   3,092 | 0.7050 |
| paq_mixer_v3             |  2,355 |  2,252 |   4,607 | 0.7048 |
| gpu_ngram_o14_xorfix     |  3,981 |  4,621 |   8,602 | 0.7184 |
| chunker_phase1_v1        |  5,570 |  4,021 |   9,591 | 0.7063 |
| deep_backoff_kn          |    963 | 12,338 |  14,578 | 0.7184 |
| lwta_k4_alpha_065 (SXM4) | 13,751 |  6,170 |  19,922 | 0.7328 |
| alpha_06 (SXM4)          | 14,614 |  6,129 |  20,743 | 0.7390 |
| lwta_k4                  | 44,329 |  9,354 |  53,683 | 0.7246 |
| lwta_k2                  | 44,583 | 10,031 |  54,614 | 0.7145 |
| modded_nanogpt (SXM4)    | 51,729 | 10,277 |  62,006 | 0.7337 |

Headline: subset_70_mkn lands at 2,474 J total / 0.7031 PCIe, the new clean J leader, 20% under gpu_ngram_w31_k11 (3,092 J / 0.7050) at the same accuracy band. On the prior NVML-only metric those two were a noise-floor tie; the CPU side resolves the tie cleanly because subset_70_mkn's 70%-data trick also cuts the CPU work proportionally.

CPU-bound submissions rerank dramatically. deep_backoff_kn (prior NVML: 2,236 J) now reports 14,578 J total: its CPU energy is 12.8× its GPU reading because its n-gram tables are built single-threaded on the host. Now visible at full cost on the leaderboard.

Open questions for maintainer review
---
- Floor value: 50 W (default) vs 100 W per GPU-slot fair share. 50 W is conservative; 100 W matches dual-EPYC-7763 + DRAM fair-share for an 8-GPU host. One-line change.
- Should ``total_energy_J`` be the new canonical ranking metric, or report both side-by-side?

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
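To make the reranking concrete, the sketch below extracts the energy/accuracy Pareto front from the re-validated ``total_J`` and ``acc`` columns of the table in this commit message. The helper name ``pareto_front`` is illustrative; note that it mixes PCIe and SXM4 rows, which the leaderboard may prefer to keep separate.

```python
# (submission, total_J, acc) copied from the re-validation table.
ROWS = [
    ("subset_70_mkn", 2474, 0.7031),
    ("gpu_ngram_w31_k11", 3092, 0.7050),
    ("paq_mixer_v3", 4607, 0.7048),
    ("gpu_ngram_o14_xorfix", 8602, 0.7184),
    ("chunker_phase1_v1", 9591, 0.7063),
    ("deep_backoff_kn", 14578, 0.7184),
    ("lwta_k4_alpha_065", 19922, 0.7328),
    ("alpha_06", 20743, 0.7390),
    ("lwta_k4", 53683, 0.7246),
    ("lwta_k2", 54614, 0.7145),
    ("modded_nanogpt", 62006, 0.7337),
]


def pareto_front(rows):
    """Names of rows that no cheaper-or-equal row matches or beats on accuracy."""
    front = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; on energy ties, higher accuracy first.
    for name, _energy, acc in sorted(rows, key=lambda r: (r[1], -r[2])):
        if acc > best_acc:  # strictly better than everything cheaper
            front.append(name)
            best_acc = acc
    return front


if __name__ == "__main__":
    print(pareto_front(ROWS))
```

On these numbers ``deep_backoff_kn`` falls off the front because ``gpu_ngram_o14_xorfix`` reaches the same 0.7184 accuracy at lower total energy.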
b3bf973 to
91e1eb8
Compare
ab-10
approved these changes
May 22, 2026
Contributor
ab-10
left a comment
Verified that running the modded nanogpt submission works on my side.
IMO it's ready to merge to main once you have the results for the pending evals. Please also update the leaderboard in README.md.
@yaroslavvb is there anything else you'd like before we merge to main?
## Summary

Adds CPU-energy accounting to `EnergyMeter` so `result.json` reports the total system energy (GPU + CPU), not just the NVML-only GPU number.

Motivation: raised by @yaroslavvb2 in Telegram on 2026-05-19:

> Just total system energy, subject to time + accuracy constraint
>
> not counting CPU utilization is a bit of leak

NVML is GPU-only: it cannot see what the host CPU is doing. A submission that runs its work on the host CPU (e.g., long `np.unique` passes during n-gram table construction) registers as nearly zero on NVML even though the host CPU is genuinely burning energy. This PR closes that leak by adding a CodeCarbon-backed CPU estimate alongside the existing NVML reading and surfacing both as separate fields in `result.json`.

## Approach

- **CodeCarbon as the CPU backend, TDP-fallback mode.** No MSR / RAPL / `/dev/cpu/*/msr` needed, since Modal containers cannot access them. CodeCarbon identifies the host CPU from `/proc/cpuinfo`, looks up its TDP from a bundled CSV (~2000 SKUs), and integrates `psutil.cpu_percent()` over the run.
- **Field standard for cloud-ML energy reporting.** HuggingFace `Trainer` auto-logs CodeCarbon when installed; Patterson et al. 2021/2022 used the same TDP-based estimate; ML.ENERGY (Michigan SymbioticLab, Zeus framework) reports GPU-only because the same container constraint applies; MLPerf Power requires physical wall meters and has no cloud submission path.
- **Floor.** `total_energy_J = max(gpu + cpu, duration_s × p_floor_watts)`, default `p_floor_watts = 50 W`, so CodeCarbon under-attribution can't shrink the reported total below a defensible lower bound (~ idle share of a single GPU-slot fair share of a dual EPYC 7763 A100 host).
- **Fail-loud.** When NVML is available but the CPU backend is not, `EnergyMeter()` raises `RuntimeError` immediately. Silent half-measurement (GPU-only, CPU `None`) on a real leaderboard run would land inconsistent rows. Dev machines without NVML stay in soft "neither available" mode (no measurement, no crash) so local smoke tests on a laptop still work.

## API changes (in this PR)

- `Measurement` dataclass: two new optional fields, `cpu_energy_J` and `total_energy_J` (both `float | None`, default `None`).
- `EnergyMeter.__init__` accepts `gpu_backend` / `cpu_backend` / `p_floor_watts` kwargs for dependency injection (used by the new unit tests). Defaults wrap pynvml and CodeCarbon respectively. Raises `RuntimeError` if the GPU backend is available but the CPU backend is not (see "fail-loud" above).
- `EnergyMeter.measure()` populates the two new fields on the yielded `Measurement` on exit.
- `run_eval.py` writes `cpu_energy_J` + `total_energy_J` into `result.json` in all three exit paths (pass, DQ time, DQ acc).
- `submit.py` adds one `.pip_install("codecarbon")` to the Modal image.
- `submit.py:append_record` now writes `total_energy_J` to README's Record History column when present, falling back to `training_energy_J` for pre-PR runs.
- `requirements.txt` adds `codecarbon` as a local dep.
- `MAINTAINING.md` (also added in this PR): see "Maintenance rule" below.
- `README.md` adds a dated banner above the Record History noting that rows ≥ 2026-05-20 report `total_energy_J`; earlier rows are kept as historical NVML-only readings.

## Backward compatibility

This change preserves every existing field's semantics. No existing field changes meaning, no existing test breaks, no existing pre-PR submission needs editing.

- `energy_joules` keeps its prior semantic (GPU NVML net of idle baseline). Older `result.json` files are interpreted identically.
- `EnergyMeter.available` still reflects NVML availability only. The pre-existing `test_energy_meter_fallback_when_no_nvml` and `test_wall_clock_guard_captures_partial_measurement` tests pass unmodified.
- `Measurement.__str__` still prints what it printed before (`energy_joules` + `duration_s`); the new CPU + total fields are additive on the dataclass, not in the human-readable summary.
- The floor (`max(gpu + cpu, duration_s × p_floor_watts)`) only applies to `total_energy_J`. `energy_joules` is unchanged regardless of CodeCarbon's behaviour.
- `requirements.txt` adds `codecarbon` as a local dep; on a dev machine without it installed, `EnergyMeter()` only raises if there's also a real GPU present (i.e., you're trying to do a leaderboard-class measurement without the CPU backend). CPU-only dev boxes construct meters fine; they just don't measure anything (same as before).

In other words: any pre-PR caller using `energy_joules` reads identical numbers from old `result.json` files. Leaderboard runs on Modal get both fields populated or fail loudly. CPU-only dev tests don't accidentally start raising.

## Tests (in this PR)

TDD'd one cycle at a time. 5 new tests, 8 existing tests preserved:

- `test_energy_meter_total_is_gpu_plus_cpu`: `total_energy_J = gpu + cpu`
- `test_total_energy_enforces_wall_clock_floor`: `total_energy_J >= duration_s × p_floor_watts` even when the CPU backend under-attributes
- `test_default_cpu_backend_uses_codecarbon_when_installed`: `cpu_backend` populates `cpu_energy_J` end-to-end
- `test_energy_meter_raises_when_gpu_available_but_cpu_missing`: `EnergyMeter()` raises `RuntimeError`
- `test_energy_meter_dev_mode_no_raise_when_both_unavailable`

`python -m pytest test_wikitext.py` → 13/13 pass (5 new + 8 existing, none modified).

Followed up with the official `code-simplifier` Claude plugin: 4 small clarity wins (dead None-checks removed, redundant `except: self.available = False` collapsed to `pass`, long ternaries wrapped). 13/13 still pass.

## Modal validation

Smoke-tested by re-running an existing on-`dev` submission through the new harness inside the Modal A100 container. CodeCarbon installs cleanly (one `pip_install` line on the existing image). `result.json` populates with the two new fields. The exact populated numbers from the leaderboard re-runs appear below.

## Leaderboard re-validation

This PR is itself a setup change (it adds CPU energy to the scored quantity), so the rule in `MAINTAINING.md` requires re-running every leaderboard row on `dev` against the new harness before promoting to `main`. After PR #5 merged to `dev` (adding 13 new submissions), the re-validation now covers both the original upstream rows and the new ones.

Done (PCIe, fresh on new schema):

- `subset_70_mkn`
- `gpu_ngram_w31_k11`
- `paq_mixer_v3`
- `gpu_ngram_o14_xorfix`
- `deep_backoff_kn`
- `lwta_k4`
- `lwta_k2`

Still in flight:

- `chunker_phase1_v1`
- `lwta_k4_alpha_065`
- `alpha_06`
- `modded_nanogpt`

Will commit the remaining four when they land.

## Findings

- `subset_70_mkn` is the clean J leader on the new metric at 2,474 J total / 0.7031 PCIe, 20 % under `gpu_ngram_w31_k11` (3,092 J / 0.7050) at the same accuracy band. On the prior NVML-only metric those two were a noise-floor tie (~860 J vs ~1,250 J was within run-to-run variance); the CPU side resolves the tie cleanly because `subset_70_mkn`'s 70 %-data trick also cuts the CPU work proportionally.
- `deep_backoff_kn` reranks dramatically. Prior NVML-only: 2,236 J (cheaper than `xorfix` at 3,172 J). New total: 14,578 J (12.8× heavier on CPU than GPU because its n-gram tables are built single-threaded on the host). Now visible at its true cost on the leaderboard. `xorfix` overtakes it on the new metric (8,602 vs 14,578) at the same accuracy.

## Maintenance rule (`MAINTAINING.md`, added in this PR)

This PR introduces the first "setup change" since the upstream leaderboard was published. The leaderboard ranks submissions against each other, so half-old half-new comparisons aren't meaningful. To keep that comparison honest going forward, the rule is codified at the repo root in `MAINTAINING.md`: when the harness changes in a way that shifts where existing submissions land, re-run the leaderboard rows before merging to `main`.

The doc also includes the `main` ↔ `dev` branching cadence (feature PRs target `dev`; slow-cadence promotion PRs `dev` → `main`).

## Open questions for review

- Should `total_energy_J` be the new ranking metric, or report both side-by-side? This PR populates both; maintainer call on which becomes canonical.

## Test plan

- `pytest test_wikitext.py` → 13/13
- Modal smoke test: `result.json` populates the new fields
- `energy_joules` semantics unchanged
- `code-simplifier` plugin pass for clarity (13/13 still green)
- `MAINTAINING.md` documents the setup-change re-run rule + main/dev cadence
- Leaderboard re-runs done: `subset_70_mkn`, `gpu_ngram_w31_k11`, `paq_mixer_v3`, `gpu_ngram_o14_xorfix`, `deep_backoff_kn`, `lwta_k2`, `lwta_k4`
- Leaderboard re-runs still in flight: `chunker_phase1_v1`, `lwta_k4_alpha_065`, `alpha_06`, `modded_nanogpt` (PCIe retry)

🤖 Generated with Claude Code
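The TDP-fallback estimate described under Approach amounts to integrating utilisation-scaled TDP over the run. The sketch below is a deliberate simplification for intuition, not CodeCarbon's implementation (which applies its own sampling and scaling); `tdp_energy_joules`, `kwh_to_joules` and the 280 W figure are illustrative assumptions. The kWh conversion is included because the harness reads CodeCarbon's energy in kWh while `result.json` reports joules.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 1000 W * 3600 s


def tdp_energy_joules(samples, tdp_watts: float) -> float:
    """Approximate CPU energy from (interval_s, cpu_percent) samples.

    Each sample contributes tdp_watts * utilisation * interval.
    Simplified model for illustration only.
    """
    return sum(tdp_watts * (pct / 100.0) * dt for dt, pct in samples)


def kwh_to_joules(kwh: float) -> float:
    """Convert an energy reading in kWh to joules."""
    return kwh * JOULES_PER_KWH


if __name__ == "__main__":
    # Two one-second samples at 50 % and 100 % load on a hypothetical 280 W part.
    print(tdp_energy_joules([(1.0, 50.0), (1.0, 100.0)], tdp_watts=280.0))
    print(kwh_to_joules(0.001))
```

Because an estimate of this kind depends on utilisation sampling and a looked-up TDP, it can under-attribute, which is the motivation for the `duration_s × p_floor_watts` floor on `total_energy_J`.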