Context
Today, training samples M patches per material with uniform weight
across the M slots, and then implicitly weights every material equally
in the loss (1 patch / sequence, sequences batched evenly). This means:
- Training loss ≈ per-mat-mean prediction error.
- The headline benchmark we're chasing — ChargE3Net's 0.523% per-mat
NMAE — is also per-mat-mean.
- So our training objective and our headline metric are aligned by
default. ✅
The v3 tokenizer adds a pluggable sampling-weight strategy. The most
natural alternative weight is per-electron: weight each material's
contribution to the loss by its electron count n_e (or equivalently,
sample M per material proportional to n_e).
The physical motivation: a heavy-element 100-atom material can have
well over 10× the electron count (and electron-density mass to
predict) of a light 2-atom material. Per-mat weighting treats the two
materials equally; per-electron weighting treats every electron equally.
The comparability concern
If we train per-electron but evaluate per-mat (to stay apples-to-apples,
a2a, with ChargE3Net / RHOAR-Net / electrAI), we're optimizing a
different metric than the one we report. Concretely:
- Heavy mats get more gradient → better per-voxel fit there → low
per-mat NMAE on heavy mats.
- Light mats get less gradient → worse fit → high per-mat NMAE on
light mats.
- The per-mat mean weights every material equally, so the degraded
light-mat tail can dominate the headline number.
In the limit, a per-electron-trained model can have lower per-electron-
weighted NMAE than a per-mat-trained model, while having a higher
per-mat-mean NMAE. Whether that actually happens depends on:
- The electron-count skew in the dataset (max:min ratio).
- The model's capacity to fit both regimes simultaneously (i.e.
whether re-weighting is zero-sum or both-improve).
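A toy illustration of how the two metrics can disagree. The numbers below are made up for the example (one light and one heavy material), not measurements, and the per-electron-weighted NMAE is assumed here to be the n_e-weighted mean of per-material NMAE.

```python
import numpy as np

def per_mat_mean(nmae):
    """Unweighted mean of per-material NMAE (the headline format)."""
    return float(np.mean(nmae))

def per_electron_weighted(nmae, n_e):
    """Mean of per-material NMAE, weighted by electron count n_e."""
    return float(np.average(nmae, weights=n_e))

# Hypothetical values, for illustration only.
n_e = np.array([10.0, 1000.0])           # light mat, heavy mat
nmae_uniform = np.array([0.50, 0.60])    # model trained per-mat (%)
nmae_electron = np.array([0.80, 0.45])   # model trained per-electron (%)

for name, nmae in [("uniform", nmae_uniform), ("electrons", nmae_electron)]:
    print(name, per_mat_mean(nmae), per_electron_weighted(nmae, n_e))
```

In this made-up case the per-electron-trained model is better on the per-electron-weighted metric and worse on the per-mat mean, which is the scenario the hypotheses below are about.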
Hypotheses
- H1. A per-electron-trained model will have ≥ 0.1pp higher
per-mat-mean NMAE than the same model trained per-mat at matched
budget.
- H2. It will have lower per-electron-weighted NMAE
(essentially by construction).
- H3. The per-mat NMAE gap (H1) will be larger on val mats with
low n_e (bottom decile) than on high n_e (top decile).
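One possible way to score H1-H3 once both models are evaluated, as a sketch only. The function name, the dict keys, and the decile cut via np.percentile are assumptions made for this example; inputs are per-material NMAE in percent, so differences are percentage points.

```python
import numpy as np

def hypothesis_gaps(nmae_uniform, nmae_electron, n_e, h1_threshold_pp=0.1):
    """Gap = per-electron model minus per-mat model, per material.

    Positive gap means the per-electron-trained model is worse.
    """
    u = np.asarray(nmae_uniform, dtype=float)
    e = np.asarray(nmae_electron, dtype=float)
    n_e = np.asarray(n_e, dtype=float)
    gap = e - u

    # Bottom / top decile of materials by electron count.
    lo_cut, hi_cut = np.percentile(n_e, [10, 90])
    h1_gap = float(gap.mean())
    return {
        "h1_gap_pp": h1_gap,                                   # per-mat-mean gap
        "h1_supported": h1_gap >= h1_threshold_pp,
        "h2_gap_pp": float(np.average(gap, weights=n_e)),      # per-electron-weighted gap
        "h3_bottom_decile_gap_pp": float(gap[n_e <= lo_cut].mean()),
        "h3_top_decile_gap_pp": float(gap[n_e >= hi_cut].mean()),
    }
```

H2 would be supported by a negative h2_gap_pp, and H3 by a bottom-decile gap larger than the top-decile gap.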
Plan
Phase 0: empirical preview (cheap, no training)
Compute the M-per-mat distribution under each candidate weighting on
the v3 train set:
- uniform: M=64 per mat, every mat gets equal sequence budget.
- electrons: M_eff per mat ∝ n_e, normalized so total budget matches.
Stat: distribution of M per mat (min, p10, p50, p90, max) and the
max:min ratio.
If max:min ratio under electrons is, say, > 30×, the per-mat-NMAE hit
will be substantial and we should think harder before defaulting to
per-electron.
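A minimal sketch of the Phase 0 preview, assuming per-material electron counts are already available as an array. The function names are hypothetical, and M_eff is left fractional here; how the real v3 sampler rounds or clips it is not specified in this issue.

```python
import numpy as np

def patches_per_material(n_e, weighting="uniform", m_uniform=64):
    """Patch budget per material under a given weighting.

    The total budget is fixed at m_uniform * (number of materials).
    """
    n_e = np.asarray(n_e, dtype=float)
    total = m_uniform * n_e.size
    if weighting == "uniform":
        return np.full(n_e.size, float(m_uniform))
    if weighting == "electrons":
        # M_eff proportional to n_e, normalized to the same total budget.
        return total * n_e / n_e.sum()
    raise ValueError(f"unknown weighting: {weighting!r}")

def summarize(m):
    """Min, p10, p50, p90, max and max:min ratio of a budget array."""
    m = np.asarray(m, dtype=float)
    p10, p50, p90 = np.percentile(m, [10, 50, 90])
    return {
        "min": float(m.min()), "p10": float(p10), "p50": float(p50),
        "p90": float(p90), "max": float(m.max()),
        "max_min_ratio": float(m.max() / m.min()),
    }
```

Usage would be along the lines of summarize(patches_per_material(n_e_train, "electrons")), where n_e_train is a placeholder for the electron counts of the v3 train set; the resulting max_min_ratio is the number to compare against the 30× threshold above.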
Phase 1: v3 a2a baseline (uniform default)
Recommended: the first v3 training run uses weighting=uniform,
so the v3-vs-v2-lat comparison isolates tokenization changes from
weighting changes. Pre-registered hypothesis is in the v3 tokenizer
issue.
Phase 2: per-electron run
Train a second v3 model with weighting=electrons, matched
architecture / budget / steps. Eval both models on the same val_200
under three NMAE metrics:
- Per-mat-mean (the headline / public benchmark format).
- Per-electron-weighted (matches the per-electron training objective).
- Stratified per-mat-mean by n_e decile.
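The three metrics could be computed from per-material results roughly as follows. This is a sketch, not the actual contents of marin/eval_mat_nmae.py: the function name and return keys are hypothetical, the per-electron-weighted metric is assumed to be the n_e-weighted mean of per-material NMAE, and deciles are formed by rank so each bin holds the same number of materials (20 each for val_200).

```python
import numpy as np

def nmae_report(nmae, n_e, n_bins=10):
    """Per-mat-mean, per-electron-weighted, and n_e-stratified NMAE.

    Assumes at least n_bins materials so that no bin is empty.
    """
    nmae = np.asarray(nmae, dtype=float)
    n_e = np.asarray(n_e, dtype=float)

    # Sort materials by electron count, then split into near-equal bins.
    order = np.argsort(n_e, kind="stable")
    bins = np.array_split(order, n_bins)

    return {
        "per_mat_mean": float(nmae.mean()),
        "per_electron_weighted": float(np.average(nmae, weights=n_e)),
        # Index 0 is the lowest-n_e bin, index n_bins - 1 the highest.
        "by_ne_decile": [float(nmae[idx].mean()) for idx in bins],
    }
```

Running the same function on both models' per-material NMAE keeps the comparison on identical bins, since the binning depends only on n_e.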
Reporting
Report all three metrics for both models. The "right" answer depends
on what we want to claim:
- "We match ChargE3Net on per-mat-mean NMAE" → uniform is the aligned
objective and the expected winner.
- "We're better at per-voxel charge density across the dataset" →
per-electron may win.
- "We're better on big mats, worse on small mats" → useful to know
even if not the headline.
Outcome possibilities
- Both models roughly equivalent on per-mat-mean. Means the
electron-count skew isn't large enough for re-weighting to matter
much. We can default to whichever is more convenient.
- Per-electron clearly worse on per-mat-mean. Stick with uniform
for headline numbers; per-electron is a niche tool for studying
per-voxel quality.
- Per-electron clearly better on per-mat-mean. Surprising —
would imply heavy mats were under-represented in capacity allocation,
not just gradient share. Worth digging into.
Implementation
Most of the machinery comes from the v3 tokenizer issue (pluggable
sampler at tokenize time). This issue tracks:
- The empirical preview (Phase 0).
- Pre-registration of the per-electron training run (Phase 2).
- Eval-side plumbing for per-electron-weighted NMAE in
marin/eval_mat_nmae.py.
- A short writeup of the outcome.
Out of scope
- Other weight strategies (n_e^(2/3), per-voxel, etc.) — no clean
physical motivation today; revisit if results suggest more
exploration is warranted.