[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model by dinhxuanvu · Pull Request #1726 · NovaSky-AI/SkyRL

dinhxuanvu · 2026-05-29T14:46:26Z

Summary

MegatronStrategy.save_hf_model previously only invoked bridge.save_hf_weights and the inherited save_hf_configs. For models that rely on trust_remote_code (Nemotron-H, custom MoE variants, etc.), the resulting export directory was missing the custom modeling_*.py modules and other auto_map-referenced artifacts, so the checkpoint could not be reloaded with AutoModel.from_pretrained(..., trust_remote_code=True) without manually copying files from the original source path.

Root cause

save_pretrained() only emits config.json, tokenizer files, and the generation config. It does not copy the custom Python modeling modules or the ARTIFACTS / OPTIONAL_ARTIFACTS that trust_remote_code models depend on. Megatron-Bridge's PreTrainedBase.save_artifacts(work_dir) is the API that copies these from the original source path, but save_hf_model in skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py was not calling it before delegating to save_hf_configs.

Fix

Add a rank-0 call to bridge.hf_pretrained.save_artifacts(work_dir) immediately before self.save_hf_configs(...) inside MegatronStrategy.save_hf_model.
save_hf_configs runs afterwards, so the existing config.json / tokenizer write semantics are preserved (it overwrites the config that save_artifacts wrote with the strategy's current view).
For non-trust_remote_code models, save_artifacts is effectively a superset of save_hf_configs, so behavior is unchanged.

Verification

Existing tests pass: no CPU-side test exercises this path; GPU-only coverage in tests/backends/skyrl_train/gpu/test_save_load_model.py exercises save/load end-to-end on CI runners.
Lint/format clean: pre-commit run --files skyrl/backends/skyrl_train/distributed/megatron/megatron_strategy.py (ruff, black, gitleaks) all pass.
Manual repro: with this fix, the exported HF directory for a trust_remote_code model contains the custom modeling_*.py files, special_tokens_map.json, and other artifacts listed in ARTIFACTS / OPTIONAL_ARTIFACTS, and reloads cleanly with AutoModel.from_pretrained(..., trust_remote_code=True) without external file copying.

Risk

Low. save_artifacts is the standard Megatron-Bridge API for emitting source artifacts and is a no-op for source paths that do not declare any custom modules. save_hf_configs runs after it, so any overlap on config.json / tokenizer files is resolved in favor of the existing behavior.

Notes

API verified against NVIDIA-NeMo/Megatron-Bridge:src/megatron/bridge/models/hf_pretrained/base.py (save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute on AutoBridge).

When exporting a trained Megatron model to Hugging Face format, MegatronStrategy.save_hf_model only invokes bridge.save_hf_weights and the inherited save_hf_configs (which calls config/tokenizer/generation_config save_pretrained). Models that rely on trust_remote_code (e.g. Nemotron-H, custom MoE variants) keep their architecture in modeling_*.py modules and reference auxiliary files via auto_map; none of these are emitted by save_pretrained, so the resulting checkpoint directory is structurally incomplete and cannot be loaded with AutoModel.from_pretrained(..., trust_remote_code=True) without manually copying files from the original source path. This change adds a rank-0 call to bridge.hf_pretrained.save_artifacts(work_dir) before save_hf_configs. PreTrainedBase.save_artifacts copies all custom modeling files (*.py), required ARTIFACTS, and OPTIONAL_ARTIFACTS from the original source path. save_hf_configs runs afterwards so the existing config.json/tokenizer write semantics are preserved (it overwrites the config that save_artifacts wrote with the strategy's current view). Verified bridge API against NVIDIA-NeMo/Megatron-Bridge:src/megatron/bridge/models/hf_pretrained/base.py (save_artifacts) and conversion/auto_bridge.py (hf_pretrained attribute). Verification: - pre-commit hooks (ruff, black, gitleaks): pass - ast.parse + targeted AST walk confirms save_artifacts is invoked from within save_hf_model on rank 0 - Existing save/load coverage in tests/backends/skyrl_train/gpu/test_save_load_model.py is GPU-gated and exercises this path end-to-end on CI; no behavior change expected for non-trust_remote_code models because save_artifacts is a superset of save_hf_configs for those. Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist

Code Review

This pull request updates the Megatron strategy to preserve custom modeling artifacts for trust_remote_code models by calling bridge.hf_pretrained.save_artifacts before saving configurations. The review feedback recommends adding a defensive check to verify that the hf_pretrained attribute exists on the bridge object to prevent potential AttributeErrors during model saving.

gemini-code-assist · 2026-05-29T14:47:06Z

+                # current view, but save_artifacts is required to copy the
+                # custom Python modules and other artifacts that
+                # save_pretrained() alone does not emit.
+                bridge.hf_pretrained.save_artifacts(work_dir)


To prevent potential AttributeError or NoneType errors during model saving, it is safer to use defensive programming and check if bridge has a non-None hf_pretrained attribute before calling save_artifacts. If bridge is a custom or mock bridge that does not implement hf_pretrained, this call would otherwise crash the training process at the very end.

Suggested change

bridge.hf_pretrained.save_artifacts(work_dir)

if getattr(bridge, "hf_pretrained", None) is not None:

bridge.hf_pretrained.save_artifacts(work_dir)

The bridge.save_hf_weights call three lines above is unguarded on the same object — if bridge lacks hf_pretrained, it almost certainly lacks save_hf_weights too and we've already crashed.

More importantly, silently skipping save_artifacts is exactly the bug this PR fixes: HF dirs missing modeling_*.py reload as broken trust_remote_code checkpoints. A loud AttributeError surfaces a bridge gap; a silent skip re-introduces data loss. Prefer to fail loudly.

…ts ordering Locks in the rank-0-only invocation of bridge.hf_pretrained.save_artifacts and the relative ordering of save_hf_weights -> save_artifacts -> save_hf_configs. Gated on _has_megatron consistent with the rest of test_megatron_correctness.py. Signed-off-by: Vu Dinh <vudinh@outlook.com>

Pure formatting (line-length only); no logic changes. Fixes check_code_quality / black on PR NovaSky-AI#1726. Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

dinhxuanvu added 2 commits May 29, 2026 21:17

[style] Apply black to TestSaveHFModelArtifacts helpers

079216c

Pure formatting (line-length only); no logic changes. Fixes check_code_quality / black on PR NovaSky-AI#1726. Signed-off-by: Vu Dinh <vudinh@outlook.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model#1726

[fix] Save trust_remote_code artifacts in MegatronStrategy.save_hf_model#1726
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-megatron-strategy-save-artifacts

dinhxuanvu commented May 29, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 29, 2026

Uh oh!

dinhxuanvu May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	bridge.hf_pretrained.save_artifacts(work_dir)
	if getattr(bridge, "hf_pretrained", None) is not None:
	bridge.hf_pretrained.save_artifacts(work_dir)

Conversation

dinhxuanvu commented May 29, 2026

Summary

Root cause

Fix

Verification

Risk

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

dinhxuanvu May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant