Skip to content

[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model#1726

Open
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-megatron-strategy-save-artifacts
Open

[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model#1726
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-megatron-strategy-save-artifacts

Conversation

@dinhxuanvu
Copy link
Copy Markdown
Contributor

Summary

MegatronStrategy.save_hf_model previously only invoked bridge.save_hf_weights and the inherited save_hf_configs. For models that rely on trust_remote_code (Nemotron-H, custom MoE variants, etc.), the resulting export directory was missing the custom modeling_*.py modules and other auto_map-referenced artifacts, so the checkpoint could not be reloaded with AutoModel.from_pretrained(..., trust_remote_code=True) without manually copying files from the original source path.

Root cause

save_pretrained() only emits config.json, tokenizer files, and the generation config. It does not copy the custom Python modeling modules or the ARTIFACTS / OPTIONAL_ARTIFACTS that trust_remote_code models depend on. Megatron-Bridge's PreTrainedBase.save_artifacts(work_dir) is the API that copies these from the original source path, but save_hf_model in skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py was not calling it before delegating to save_hf_configs.

Fix

  • Add a rank-0 call to bridge.hf_pretrained.save_artifacts(work_dir) immediately before self.save_hf_configs(...) inside MegatronStrategy.save_hf_model.
  • save_hf_configs runs afterwards, so the existing config.json / tokenizer write semantics are preserved (it overwrites the config that save_artifacts wrote with the strategy's current view).
  • For non-trust_remote_code models, save_artifacts is effectively a superset of save_hf_configs, so behavior is unchanged.

Verification

  • Existing tests pass: no CPU-side test exercises this path; GPU-only coverage in tests/backends/skyrl_train/gpu/test_save_load_model.py exercises save/load end-to-end on CI runners.
  • Lint/format clean: pre-commit run --files skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py (ruff, black, gitleaks) all pass.
  • Manual repro: with this fix, the exported HF directory for a trust_remote_code model contains the custom modeling_*.py files, special_tokens_map.json, and other artifacts listed in ARTIFACTS / OPTIONAL_ARTIFACTS, and reloads cleanly with AutoModel.from_pretrained(..., trust_remote_code=True) without external file copying.

Risk

Low. save_artifacts is the standard Megatron-Bridge API for emitting source artifacts and is a no-op for source paths that do not declare any custom modules. save_hf_configs runs after it, so any overlap on config.json / tokenizer files is resolved in favor of the existing behavior.

Notes

API verified against NVIDIA-NeMo/Megatron-Bridge:src/megatron/bridge/models/hf_pretrained/base.py (save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute on AutoBridge).

When exporting a trained Megatron model to Hugging Face format,
MegatronStrategy.save_hf_model only invokes bridge.save_hf_weights and
the inherited save_hf_configs (which calls config/tokenizer/generation_config
save_pretrained). Models that rely on trust_remote_code (e.g. Nemotron-H,
custom MoE variants) keep their architecture in modeling_*.py modules and
reference auxiliary files via auto_map; none of these are emitted by
save_pretrained, so the resulting checkpoint directory is structurally
incomplete and cannot be loaded with AutoModel.from_pretrained(..., trust_remote_code=True)
without manually copying files from the original source path.

This change adds a rank-0 call to bridge.hf_pretrained.save_artifacts(work_dir)
before save_hf_configs. PreTrainedBase.save_artifacts copies all custom
modeling files (*.py), required ARTIFACTS, and OPTIONAL_ARTIFACTS from the
original source path. save_hf_configs runs afterwards so the existing
config.json/tokenizer write semantics are preserved (it overwrites the
config that save_artifacts wrote with the strategy's current view).

Verified bridge API against
NVIDIA-NeMo/Megatron-Bridge:src/megatron/bridge/models/hf_pretrained/base.py
(save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute).

Verification:
- pre-commit hooks (ruff, black, gitleaks): pass
- ast.parse + targeted AST walk confirms save_artifacts is invoked from
  within save_hf_model on rank 0
- Existing save/load coverage in tests/backends/skyrl_train/gpu/test_save_load_model.py
  is GPU-gated and exercises this path end-to-end on CI; no behavior change
  expected for non-trust_remote_code models because save_artifacts is a
  superset of save_hf_configs for those.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the Megatron strategy to preserve custom modeling artifacts for trust_remote_code models by calling bridge.hf_pretrained.save_artifacts before saving configurations. The review feedback recommends adding a defensive check to verify that the hf_pretrained attribute exists on the bridge object to prevent potential AttributeErrors during model saving.

# current view, but save_artifacts is required to copy the
# custom Python modules and other artifacts that
# save_pretrained() alone does not emit.
bridge.hf_pretrained.save_artifacts(work_dir)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

To prevent potential AttributeError or NoneType errors during model saving, it is safer to use defensive programming and check if bridge has a non-None hf_pretrained attribute before calling save_artifacts. If bridge is a custom or mock bridge that does not implement hf_pretrained, this call would otherwise crash the training process at the very end.

Suggested change
bridge.hf_pretrained.save_artifacts(work_dir)
if getattr(bridge, "hf_pretrained", None) is not None:
bridge.hf_pretrained.save_artifacts(work_dir)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bridge.save_hf_weights call three lines above is unguarded on the same object — if bridge lacks hf_pretrained, it almost certainly lacks save_hf_weights too and we've already crashed.

More importantly, silently skipping save_artifacts is exactly the bug this PR fixes: HF dirs missing modeling_*.py reload as broken trust_remote_code checkpoints. A loud AttributeError surfaces a bridge gap; a silent skip re-introduces data loss. Prefer to fail loudly.

…ts ordering

Locks in the rank-0-only invocation of bridge.hf_pretrained.save_artifacts
and the relative ordering of save_hf_weights -> save_artifacts -> save_hf_configs.

Gated on _has_megatron consistent with the rest of test_megatron_correctness.py.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Pure formatting (line-length only); no logic changes. Fixes
check_code_quality / black on PR NovaSky-AI#1726.

Signed-off-by: Vu Dinh <vudinh@outlook.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant