Sampling-weight study: per-mat vs per-electron, NMAE comparability #3

@ryan-williams

Context

Today, training samples M patches per material with uniform weight
across the M slots, and then implicitly weights every material equally
in the loss (1 patch / sequence, sequences batched evenly). This means:

  • Training loss ≈ per-mat-mean prediction error.
  • The headline benchmark we're chasing — ChargE3Net's 0.523% per-mat
    NMAE — is also per-mat-mean.
  • So our training objective and our headline metric are aligned by
    default. ✅

The v3 tokenizer adds a pluggable sampling-weight strategy. The most
natural alternative weight is per-electron: weight each material's
contribution to the loss by its electron count n_e (or equivalently,
sample M per material proportional to n_e).

The physical motivation: a heavy-element 100-atom material has far
more electrons (a 50× ratio from atom count alone), and correspondingly
more electron-density mass to predict, than a light 2-atom material.
Per-mat weighting treats them equally; per-electron weighting treats
every electron equally.

The comparability concern

If we train per-electron but evaluate per-mat (to stay
apples-to-apples, "a2a", with ChargE3Net / RHOAR-Net / electrAI), we're
optimizing a different metric than we report. Concretely:

  • Heavy mats get more gradient → better per-voxel fit there → low
    per-mat NMAE on heavy mats.
  • Light mats get less gradient → worse fit → high per-mat NMAE on
    light mats.
  • Per-mat mean averages those equally, so the worse-on-light tail
    dominates the headline.

In the limit, a per-electron-trained model can have lower
per-electron-weighted NMAE than a per-mat-trained model, while having
a higher per-mat-mean NMAE. Whether that actually happens depends on:

  • The electron-count skew in the dataset (max:min ratio).
  • The model's capacity to fit both regimes simultaneously (i.e.
    whether re-weighting is zero-sum or both-improve).
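
A toy illustration of that crossover, using made-up numbers: two materials, with NMAE values (in percent) chosen only to make the arithmetic visible. None of these are measured results. The helper names are hypothetical, and per-electron-weighted NMAE is assumed here to be the n_e-weighted mean of per-mat NMAE.

```python
import numpy as np

def per_mat_mean(nmae):
    """Unweighted mean of per-material NMAE (the headline format)."""
    return float(np.mean(nmae))

def per_electron_weighted(nmae, n_e):
    """Mean of per-material NMAE weighted by electron count n_e (assumed definition)."""
    return float(np.average(nmae, weights=n_e))

# Hypothetical two-material dataset: one heavy mat, one light mat.
n_e = np.array([1000.0, 10.0])

# Made-up per-mat NMAE (%) for two hypothetical models.
nmae_uniform = np.array([0.6, 0.5])   # trained per-mat
nmae_electron = np.array([0.4, 0.9])  # trained per-electron

for name, nmae in [("uniform", nmae_uniform), ("electrons", nmae_electron)]:
    print(
        f"{name:>9}: per-mat-mean={per_mat_mean(nmae):.3f}  "
        f"per-electron={per_electron_weighted(nmae, n_e):.3f}"
    )
```

With these numbers the per-electron model is worse on per-mat-mean (0.65 vs 0.55) but better on the per-electron-weighted metric (about 0.405 vs 0.599), because the heavy material carries nearly all of the weight.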

Hypotheses

  • H1. A per-electron-trained model will have ≥ 0.1pp higher
    per-mat-mean NMAE than the same model trained per-mat at matched
    budget.
  • H2. It will have lower per-electron-weighted NMAE
    (essentially by construction).
  • H3. The per-mat NMAE gap (H1) will be larger on val mats with
    low n_e (bottom decile) than on high n_e (top decile).
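
One way the three hypotheses could be checked mechanically once both runs exist is sketched below. It assumes per-mat NMAE is available in percent for both models on the same val mats; `check_hypotheses` and its arguments are hypothetical names, and the decile cut uses simple quantile thresholds on n_e.

```python
import numpy as np

def check_hypotheses(nmae_uniform, nmae_electron, n_e, h1_threshold_pp=0.1):
    """Evaluate H1-H3 from per-mat NMAE (in %) of both models on the same mats."""
    nmae_uniform = np.asarray(nmae_uniform, dtype=float)
    nmae_electron = np.asarray(nmae_electron, dtype=float)
    n_e = np.asarray(n_e, dtype=float)

    # Per-mat gap in percentage points; positive means per-electron is worse.
    gap = nmae_electron - nmae_uniform

    # H1: per-mat-mean NMAE is at least `h1_threshold_pp` higher for per-electron.
    h1 = gap.mean() >= h1_threshold_pp

    # H2: per-electron-weighted NMAE is lower for the per-electron model.
    h2 = np.average(nmae_electron, weights=n_e) < np.average(nmae_uniform, weights=n_e)

    # H3: the gap is larger in the bottom n_e decile than in the top decile.
    lo, hi = np.quantile(n_e, [0.1, 0.9])
    h3 = gap[n_e <= lo].mean() > gap[n_e >= hi].mean()

    return {"H1": bool(h1), "H2": bool(h2), "H3": bool(h3)}

# Synthetic example (not real results): the gap shrinks linearly with n_e.
n_e_demo = np.arange(1, 101, dtype=float)
nmae_u_demo = np.full(100, 0.5)
nmae_e_demo = 1.15 - 0.01 * n_e_demo
print(check_hypotheses(nmae_u_demo, nmae_e_demo, n_e_demo))
```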

Plan

Phase 0: empirical preview (cheap, no training)

Compute the M-per-mat distribution under each candidate weighting on
the v3 train set:

  • uniform: M=64 per mat, every mat gets equal sequence budget.
  • electrons: M_eff per mat ∝ n_e, normalized so total budget matches.

Stats to report: the distribution of M per mat (min, p10, p50, p90,
max) and the max:min ratio.

If max:min ratio under electrons is, say, > 30×, the per-mat-NMAE hit
will be substantial and we should think harder before defaulting to
per-electron.
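
A minimal sketch of the Phase 0 computation, assuming only that an array of per-material electron counts is available. The function names and the example n_e values are hypothetical, and M_eff is left as a real number here (how the tokenizer rounds or samples it is not specified in this issue).

```python
import numpy as np

def electron_budget(n_e, m_uniform=64):
    """M_eff per mat, proportional to n_e, with the same total budget as uniform."""
    n_e = np.asarray(n_e, dtype=float)
    total_budget = m_uniform * n_e.size  # what uniform weighting would spend
    return total_budget * n_e / n_e.sum()

def summarize(m):
    """Min, p10, p50, p90, max and max:min ratio of a per-mat budget array."""
    m = np.asarray(m, dtype=float)
    p10, p50, p90 = np.percentile(m, [10, 50, 90])
    return {
        "min": float(m.min()),
        "p10": float(p10),
        "p50": float(p50),
        "p90": float(p90),
        "max": float(m.max()),
        "max_min_ratio": float(m.max() / m.min()),
    }

# Hypothetical electron counts, for illustration only.
n_e = np.array([8, 16, 40, 120, 400, 1600])
m_eff = electron_budget(n_e, m_uniform=64)
stats = summarize(m_eff)
print(stats)
```

Because M_eff is proportional to n_e, the max:min ratio of M equals the max:min ratio of n_e itself, so the 30× check can be read straight off the electron counts before any sampling is run.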

Phase 1: v3 a2a baseline (uniform default)

Recommended: the first v3 training run uses weighting=uniform,
so the v3-vs-v2-lat comparison isolates tokenization changes from
weighting changes. Pre-registered hypothesis is in the v3 tokenizer
issue.

Phase 2: per-electron run

Train a second v3 model with weighting=electrons, matched
architecture / budget / steps. Eval both models on the same val_200
under three NMAE metrics:

  1. Per-mat-mean (the headline / public benchmark format).
  2. Per-electron-weighted (matches the per-electron training objective).
  3. Stratified per-mat-mean by n_e decile.
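
The three metrics could be computed from per-mat results roughly as follows. This is a sketch under stated assumptions, not the contents of marin/eval_mat_nmae.py: per-mat NMAE is taken as summed absolute error over summed absolute reference density on the voxel grid, per-electron weighting is the n_e-weighted mean, and deciles are equal-count bins by n_e rank (so tied n_e values can straddle a bin edge). All function names are hypothetical.

```python
import numpy as np

def mat_nmae(rho_pred, rho_true):
    """NMAE for one material: sum|pred - true| / sum|true| over voxels (assumed definition)."""
    rho_pred = np.asarray(rho_pred, dtype=float)
    rho_true = np.asarray(rho_true, dtype=float)
    return float(np.abs(rho_pred - rho_true).sum() / np.abs(rho_true).sum())

def three_metrics(nmae, n_e, n_bins=10):
    """Per-mat-mean, per-electron-weighted, and per-bin (by n_e rank) mean NMAE."""
    nmae = np.asarray(nmae, dtype=float)
    n_e = np.asarray(n_e, dtype=float)

    # Sort materials by electron count, then split into equal-count bins.
    order = np.argsort(n_e, kind="stable")
    by_decile = [float(nmae[idx].mean()) for idx in np.array_split(order, n_bins)]

    return {
        "per_mat_mean": float(nmae.mean()),
        "per_electron_weighted": float(np.average(nmae, weights=n_e)),
        "by_ne_decile": by_decile,  # index 0 = lowest n_e, index -1 = highest
    }

# Synthetic demo data standing in for a 200-material val set (not real results).
rng = np.random.default_rng(0)
demo_n_e = rng.integers(4, 2000, size=200)
demo_nmae = rng.uniform(0.2, 1.5, size=200)
metrics = three_metrics(demo_nmae, demo_n_e)
print(metrics["per_mat_mean"], metrics["per_electron_weighted"])
```

Running the same function on both models' per-mat NMAE arrays keeps the comparison on identical bins, since the decile assignment depends only on n_e.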

Reporting

Report all three metrics for both models. The "right" answer depends
on what we want to claim:

  • "We match ChargE3Net on per-mat-mean NMAE" → uniform is the
    aligned training objective and the expected winner.
  • "We're better at per-voxel charge density across the dataset" →
    per-electron may win.
  • "We're better on big mats, worse on small mats" → useful to know
    even if not the headline.

Outcome possibilities

  • Both models roughly equivalent on per-mat-mean. Means the
    electron-count skew isn't large enough for re-weighting to matter
    much. We can default to whichever is more convenient.
  • Per-electron clearly worse on per-mat-mean. Stick with uniform
    for headline numbers; per-electron is a niche tool for studying
    per-voxel quality.
  • Per-electron clearly better on per-mat-mean. Surprising —
    would imply heavy mats were under-represented in capacity allocation,
    not just gradient share. Worth digging into.

Implementation

Most of the machinery comes from the v3 tokenizer issue (pluggable
sampler at tokenize time). This issue tracks:

  • The empirical preview (Phase 0).
  • Pre-registration of the per-electron training run (Phase 2).
  • Eval-side plumbing for per-electron-weighted NMAE in
    marin/eval_mat_nmae.py.
  • A short writeup of the outcome.

Out of scope

  • Other weight strategies (n_e^(2/3), per-voxel, etc.) — no clean
    physical motivation today; revisit if results suggest more
    exploration is warranted.
