Sampling-weight study: per-mat vs per-electron, NMAE comparability

## Context

Today, training samples `M` patches per material with **uniform** weight
across the M slots, and then implicitly weights every material equally
in the loss (1 patch / sequence, sequences batched evenly). This means:

- Training loss ≈ per-mat-mean prediction error.
- The headline benchmark we're chasing — ChargE3Net's 0.523% per-mat
  NMAE — is also per-mat-mean.
- So our training objective and our headline metric are aligned by
  default. ✅

The v3 tokenizer adds a pluggable sampling-weight strategy. The most
natural alternative weight is **per-electron**: weight each material's
contribution to the loss by its electron count `n_e` (or equivalently,
sample `M` per material proportional to `n_e`).

The physical motivation: a heavy-element 100-atom material has ~10×
the electron count (and electron-density mass to predict) of a light
2-atom material. Per-mat weighting treats them equally; per-electron
weighting treats every electron equally.
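As a minimal sketch of the two strategies, the per-material loss weights could be computed as below. The function name `material_weights` and the strategy strings are hypothetical, chosen for illustration; they are not the v3 tokenizer's actual interface.

```python
import numpy as np

def material_weights(n_e, strategy="uniform"):
    """Return per-material loss weights that sum to 1.

    n_e: electron count per material (assumed known for every material).
    strategy: "uniform" (every material equal) or "electrons" (weight ∝ n_e).
    """
    n_e = np.asarray(n_e, dtype=float)
    if strategy == "uniform":
        w = np.ones_like(n_e)
    elif strategy == "electrons":
        w = n_e.copy()
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return w / w.sum()
```

With two materials of 10 and 100 electrons, `uniform` gives each half of the loss, while `electrons` gives them 1/11 and 10/11 respectively.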

## The comparability concern

If we train per-electron but evaluate per-mat (to stay apples-to-apples,
"a2a", with ChargE3Net / RHOAR-Net / electrAI), we're optimizing a
different metric than we report. Concretely:

- Heavy mats get more gradient → better per-voxel fit there → low
  per-mat NMAE on heavy mats.
- Light mats get less gradient → worse fit → high per-mat NMAE on
  light mats.
- Per-mat **mean** averages those equally, so the worse-on-light tail
  dominates the headline.

In the limit, a per-electron-trained model can have lower
**per-electron-weighted** NMAE than a per-mat-trained model, while having
a *higher* per-mat-mean NMAE. Whether that actually happens depends on:
- The electron-count skew in the dataset (max:min ratio).
- The model's capacity to fit both regimes simultaneously (i.e.
  whether re-weighting is zero-sum or both-improve).
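A toy two-material example shows how the two metrics can rank the same pair of models differently. All numbers are made up for illustration and are not measurements. The sketch also assumes that per-electron-weighted NMAE means the `n_e`-weighted mean of per-mat NMAE.

```python
import numpy as np

# Hypothetical electron counts: one heavy material, one light material.
n_e = np.array([1000.0, 10.0])

# Hypothetical per-mat NMAE (%) for two models; illustrative values only.
nmae_mat_trained = np.array([1.0, 1.0])   # trained with uniform weighting
nmae_elec_trained = np.array([0.8, 1.5])  # trained with per-electron weighting

def per_mat_mean(nmae):
    """Unweighted mean over materials (the headline format)."""
    return float(np.mean(nmae))

def per_electron_weighted(nmae, n_e):
    """Mean over materials, weighted by electron count."""
    return float(np.average(nmae, weights=n_e))

for name, nmae in [("per-mat-trained", nmae_mat_trained),
                   ("per-electron-trained", nmae_elec_trained)]:
    print(name, per_mat_mean(nmae), per_electron_weighted(nmae, n_e))
```

In this toy case the per-electron-trained model is better on the per-electron-weighted metric (about 0.81 vs 1.00) and worse on per-mat-mean (1.15 vs 1.00), which is the disagreement described above.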

## Hypotheses

- **H1.** A per-electron-trained model will have ≥ 0.1 pp (percentage
  points) **higher** per-mat-mean NMAE than the same model trained
  per-mat at matched budget.
- **H2.** It will have **lower** per-electron-weighted NMAE
  (essentially by construction).
- **H3.** The per-mat NMAE gap (H1) will be larger on val mats with
  low n_e (bottom decile) than on high n_e (top decile).

## Plan

### Phase 0: empirical preview (cheap, no training)

Compute the M-per-mat distribution under each candidate weighting on
the v3 train set:

- `uniform`: M=64 per mat, every mat gets equal sequence budget.
- `electrons`: M_eff per mat ∝ n_e, normalized so total budget matches.

Stat to report: the distribution of M per mat (min, p10, p50, p90, max)
and the max:min ratio.

If max:min ratio under `electrons` is, say, > 30×, the per-mat-NMAE hit
will be substantial and we should think harder before defaulting to
per-electron.
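The preview could be sketched as follows, assuming an array of per-material electron counts is available. The function name `preview_m_distribution` is hypothetical, and M_eff is left fractional here; a real sampler would need some rounding rule, which this sketch does not address.

```python
import numpy as np

def preview_m_distribution(n_e, m_uniform=64):
    """Summarize M_eff per material under `electrons` weighting.

    The total budget is matched to the uniform case (m_uniform per material).
    """
    n_e = np.asarray(n_e, dtype=float)
    total_budget = m_uniform * n_e.size
    m_eff = total_budget * n_e / n_e.sum()  # M_eff ∝ n_e, same total budget
    p10, p50, p90 = np.percentile(m_eff, [10, 50, 90])
    return {
        "min": float(m_eff.min()),
        "p10": float(p10),
        "p50": float(p50),
        "p90": float(p90),
        "max": float(m_eff.max()),
        "max_min_ratio": float(m_eff.max() / m_eff.min()),
        "total": float(m_eff.sum()),
    }
```

Because M_eff is proportional to n_e, the max:min ratio of M_eff equals the max:min ratio of n_e itself, so the threshold check needs nothing beyond the electron counts.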

### Phase 1: v3 a2a baseline (uniform default)

Recommended: the **first** v3 training run uses `weighting=uniform`,
so the v3-vs-v2-lat comparison isolates tokenization changes from
weighting changes. Pre-registered hypothesis is in the v3 tokenizer
issue.

### Phase 2: per-electron run

Train a second v3 model with `weighting=electrons`, matched
architecture / budget / steps. Eval both models on the same val_200
under three NMAE metrics:

1. Per-mat-mean (the headline / public benchmark format).
2. Per-electron-weighted (matches the per-electron training objective).
3. Stratified per-mat-mean by n_e decile.
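A minimal sketch of the three metrics, assuming per-mat NMAE values and electron counts are already available as arrays. The function name `nmae_report` is hypothetical and is not the interface of `marin/eval_mat_nmae.py`. Deciles here are equal-count bins by n_e rank, which is one possible reading of "n_e decile".

```python
import numpy as np

def nmae_report(nmae, n_e, n_bins=10):
    """Compute the three NMAE summaries from per-material values.

    nmae: per-material NMAE (same units in, same units out).
    n_e: per-material electron counts, same length as nmae.
    """
    nmae = np.asarray(nmae, dtype=float)
    n_e = np.asarray(n_e, dtype=float)
    # Sort materials by electron count, then split into equal-count bins.
    order = np.argsort(n_e, kind="stable")
    bins = np.array_split(order, n_bins)
    return {
        "per_mat_mean": float(nmae.mean()),
        "per_electron_weighted": float(np.average(nmae, weights=n_e)),
        # Index 0 is the lowest-n_e decile, index -1 the highest.
        "per_mat_mean_by_decile": [float(nmae[idx].mean()) for idx in bins],
    }
```

With 200 validation materials each decile holds 20 of them. With fewer materials than bins, some bins would be empty and their mean would be NaN, so this sketch assumes at least `n_bins` materials.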

### Reporting

Report all three metrics for both models. The "right" answer depends
on what we want to claim:
- "We match ChargE3Net on per-mat-mean NMAE" → uniform is the aligned
  training objective, so it is the expected winner.
- "We're better at per-voxel charge density across the dataset" →
  per-electron may win.
- "We're better on big mats, worse on small mats" → useful to know
  even if not the headline.

## Outcome possibilities

- **Both models roughly equivalent on per-mat-mean.** Means the
  electron-count skew isn't large enough for re-weighting to matter
  much. We can default to whichever is more convenient.
- **Per-electron clearly worse on per-mat-mean.** Stick with uniform
  for headline numbers; per-electron is a niche tool for studying
  per-voxel quality.
- **Per-electron clearly better on per-mat-mean.** Surprising —
  would imply heavy mats were under-represented in capacity allocation,
  not just gradient share. Worth digging into.

## Implementation

Most of the machinery comes from the v3 tokenizer issue (pluggable
sampler at tokenize time). This issue tracks:

- The empirical preview (Phase 0).
- Pre-registration of the per-electron training run (Phase 2).
- Eval-side plumbing for per-electron-weighted NMAE in
  `marin/eval_mat_nmae.py`.
- A short writeup of the outcome.

## Out of scope

- Other weight strategies (n_e^(2/3), per-voxel, etc.) — no clean
  physical motivation today; revisit if results suggest more
  exploration is warranted.

[Open-Athena/tomat#3]: https://github.com/Open-Athena/tomat/issues/3
