[codex] Add eval prognostic model configs #860

Draft
loliverhennigh wants to merge 1 commit into NVIDIA:main from loliverhennigh:codex/eval-prognostic-models

loliverhennigh (Collaborator) commented May 14, 2026

Summary

This adds a standard prognostic model catalog for the eval recipe, plus one narrow model-specific fix needed by the validated FengWu CPU fallback path.

The PR includes configs for ACE2, AIFS, AIFSENS, Atlas, Aurora, cBottleVideo, FCN, FengWu, GenCast Mini, GraphCast Small/Operational, Pangu 3/6/24, and SFNO. Existing DLWP, FCN3, DLESyM, and StormScope configs remain in the catalog.

What changed

  • Add standard recipes/eval/cfg/model/*.yaml configs for the prognostic model catalog.
  • Use concrete model module paths in the new configs, avoiding package-level prognostic model export changes.
  • Instantiate nested Hydra model.load_args, which is needed by models with configurable sources/load helpers.
  • Preserve tensor dtype/device after forecast-grid interpolation and initialize DistributedManager before rank-0-only work if needed.
  • Keep DLESyM-specific handling inside the eval recipe pipeline without adding a model-package import shim.
  • Add a FengWu CPU ONNXRuntime fallback path and update the FengWu ONNX test fixture to use dynamic_axes with the legacy exporter.
  • Document how the added model configs map to optional Earth2Studio extras without changing the eval lockfile.
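The nested `model.load_args` item above can be illustrated with a small stand-in. In Hydra, a mapping that carries a `_target_` key describes an object to construct, and `hydra.utils.instantiate` is the usual way to build it. The sketch below is not Hydra's implementation and is not the recipe's code; `resolve_target`, `instantiate_nested`, and the `load_args` contents are hypothetical names chosen only to show why nested entries have to be resolved before they reach a model's load helper.

```python
import importlib
from datetime import timedelta
from typing import Any


def resolve_target(path: str) -> Any:
    """Import a dotted path such as 'datetime.timedelta' and return the object."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def instantiate_nested(node: Any) -> Any:
    """Recursively build every mapping that carries a `_target_` key."""
    if isinstance(node, dict):
        # Resolve children first so nested targets become real objects.
        kwargs = {k: instantiate_nested(v) for k, v in node.items() if k != "_target_"}
        if "_target_" in node:
            return resolve_target(node["_target_"])(**kwargs)
        return kwargs
    if isinstance(node, list):
        return [instantiate_nested(v) for v in node]
    return node


# Hypothetical load_args: one plain value and one nested object description.
load_args = {
    "pretrained": True,
    "step": {"_target_": "datetime.timedelta", "hours": 6},
}

resolved = instantiate_nested(load_args)
# The nested mapping is now a real object, not a dict.
assert resolved["step"] == timedelta(hours=6)
```

Without the recursive step, the load helper would receive the raw `{"_target_": ...}` mapping instead of the constructed object.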

Scope guardrails

This PR intentionally avoids broad import/export changes:

  • No changes to earth2studio/models/px/__init__.py.
  • No lazy package-level model exports.
  • No private PyTorch DTensor import shim in the DLESyM model source.
  • No GraphCast/GenCast source changes; the earlier xarray-copy cleanup was removed because it was not clearly necessary for inference.
  • No eval uv.lock or pyproject.toml changes.

The only Earth2Studio model implementation touched is FengWu, where CPU execution needs to bypass ONNXRuntime IO binding.
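A minimal sketch of what such a CPU bypass might look like, assuming the model holds an ONNX Runtime `InferenceSession`. The function name `run_session` and its arguments are hypothetical and do not come from the FengWu source; only the session methods (`run`, `io_binding`, `run_with_iobinding`, `get_outputs`) and the binding methods (`bind_cpu_input`, `bind_output`, `copy_outputs_to_cpu`) are ONNX Runtime Python API calls.

```python
from typing import Any, Mapping, Sequence


def run_session(session: Any, feeds: Mapping[str, Any], device_type: str) -> Sequence[Any]:
    """Run an ONNX Runtime session, skipping IO binding when executing on CPU."""
    if device_type == "cpu":
        # Plain run: ONNX Runtime manages input and output buffers itself.
        return session.run(None, dict(feeds))

    # Non-CPU path: bind inputs and outputs explicitly, then run with the binding.
    binding = session.io_binding()
    for name, array in feeds.items():
        binding.bind_cpu_input(name, array)
    for output in session.get_outputs():
        binding.bind_output(output.name, device_type)
    session.run_with_iobinding(binding)
    return binding.copy_outputs_to_cpu()
```

Keeping the branch in one helper means the rest of the model code does not need to know which execution path was taken.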

FuXi is intentionally not included in the recipe catalog in this PR. CPU ONNXRuntime fails on its fp16 com.microsoft.Gelu node. A GPU retry confirmed that CUDA and ONNXRuntime were visible, but the model still created its initial session on CPU during model construction.

Validation

Current clean PR checks:

  • uv run pre-commit run --all-files
  • From recipes/eval: PYTHONPATH=../.. uv run --extra dev pytest test -q
    • 333 passed, 4 skipped, 147 warnings
  • PYTHONPATH=recipes/eval /Users/oliverhennigh/Documents/New\ project\ 5/earth2studio/.venv/bin/python -m pytest recipes/eval/test/test_models.py recipes/eval/test/test_data.py -q
    • 49 passed, 21 warnings
  • /Users/oliverhennigh/Documents/New\ project\ 5/earth2studio/.venv/bin/python -m pytest test/models/px/test_fengwu.py -q -k 'test_fengwu_call and cpu'
    • 2 passed, 10 deselected, 13 warnings
  • git diff --check upstream/main

NVL72 sweep notes:

  • The broader model sweep ran the new configs through recipes/eval/main.py with nsteps=1 and produced readable forecast.zarr outputs for the working model set.
  • Temporary broad import/export investigations from that sweep have been removed from this PR.

loliverhennigh force-pushed the codex/eval-prognostic-models branch 10 times, most recently from d3c5229 to 25413f9, on May 18, 2026 18:21
loliverhennigh force-pushed the codex/eval-prognostic-models branch from 25413f9 to ef2deea on May 18, 2026 21:03