[fix] Preserve original AutoConfig in MegatronWorker.init_configs by dinhxuanvu · Pull Request #1727 · NovaSky-AI/SkyRL

dinhxuanvu · 2026-05-29T14:46:37Z

Summary

A Megatron-Bridge SFT run launched with overrides like ++transformer_config_kwargs.mtp_num_layers=0 produced an exported HF checkpoint whose config.json carried the runtime override (num_nextn_predict_layers=0) instead of the on-disk source-of-truth value (e.g. 4 for Nemotron-H 120B). The saved directory was silently inconsistent with the reference repo. This PR makes update_model_config non-mutating and ensures MegatronStrategy.save_hf_configs publishes the un-mutated, on-disk AutoConfig.

Root cause

update_model_config in skyrl/train/utils/utils.py mutated the input AutoConfig in place via setattr (recursing into sub-configs the same way). MegatronWorker.init_configs in skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py then assigned the already-mutated object to self.strategy.hf_config, and MegatronStrategy.save_hf_configs (in skyrl/backends/skyrl_train/distributed/strategy.py) published it via model_config.save_pretrained(work_dir).

Fix

skyrl/train/utils/utils.py: update_model_config now returns a deepcopy of the input with overrides applied. The recursive setattr logic is preserved unchanged in a private _apply_overrides_in_place helper used internally on the copy.
skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py: init_configs keeps the un-mutated AutoConfig.from_pretrained(...) result as hf_config_original and assigns it to self.strategy.hf_config, while the Megatron provider/runtime path continues to use the override-applied copy returned by update_model_config.

Trade-off on tokenizer-id overrides

The override dict in MegatronWorker.init_configs mixes two categories: tokenizer-id overrides (bos_token_id, eos_token_id, pad_token_id) and runtime overrides (mtp_num_layers=0, etc). Pre-PR, both leaked into the saved config.json; post-PR, neither does. We accept this as the right behavior because (i) the on-disk config.json should match the source repo exactly, (ii) tokenizer ids are duplicated in tokenizer_config.json / special_tokens_map.json which are written separately via tokenizer.save_pretrained, and (iii) the on-disk-source-of-truth contract is cleaner. If a future use case needs tokenizer-id overrides persisted to the saved config, splitting the override dict into tokenizer_overrides (applied to the saved copy) and model_overrides (runtime-only) is a localized follow-up.

Verification

CPU unit tests added at tests/train/utils/test_utils.py covering the non-mutating contract: input config is unchanged at top level, returned copy carries top-level overrides, nested-dict overrides do not leak back into the input. Includes the Bug B repro (update_model_config(cfg, {"num_nextn_predict_layers": 0}) leaves the input's num_nextn_predict_layers at its original value). Runs in cpu_skyrl_train.yaml.
Lint/format clean: pre-commit run --files skyrl/train/utils/utils.py skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py tests/train/utils/test_utils.py (ruff, black, gitleaks) all pass.
Manual repro target: after this fix, config.json written by MegatronStrategy.save_hf_configs for a Nemotron-H 120B SFT run launched with ++transformer_config_kwargs.mtp_num_layers=0 matches the on-disk reference (no num_nextn_predict_layers=0 leak).

Risk

Low. The only call site of update_model_config is updated in this commit to consume the returned value. The recursion-via-setattr semantics are identical to before; only the target object changed (a deep-copied config instead of the caller's input).

Notes

The _apply_overrides_in_place helper is the original recursive body and remains internally callable, so future call sites that explicitly want in-place behavior can opt in.

When a Megatron-Bridge SFT run was launched with overrides like `++transformer_config_kwargs.mtp_num_layers=0`, the exported HF checkpoint's `config.json` carried the runtime override (`num_nextn_predict_layers=0`) instead of the on-disk source-of-truth value (e.g. 4 for Nemotron-H 120B), making the saved directory silently inconsistent with the reference repo. Root cause: `update_model_config` mutated the input `AutoConfig` in place via `setattr` (recursing into sub-configs the same way). `MegatronWorker.init_configs` then assigned that already-mutated object to `self.strategy.hf_config`, and `MegatronStrategy .save_hf_configs` published it via `model_config.save_pretrained (work_dir)`. Fix: - `skyrl/train/utils/utils.py`: `update_model_config` now returns a `deepcopy` of the input with overrides applied. The recursive setattr logic is preserved unchanged in a private `_apply_overrides_in_place` helper used internally on the copy. - `skyrl/backends/skyrl_train/workers/megatron/megatron_worker.py`: `init_configs` keeps the un-mutated `AutoConfig.from_pretrained (...)` result as `hf_config_original` and assigns it to `self.strategy.hf_config`, while the Megatron provider/runtime path continues to use the override-applied copy returned by `update_model_config`. Verification: - pre-commit hooks (ruff, black, gitleaks): pass on both files. - Standalone unit-style smoke test (recreating the function bodies in isolation, since the module pulls in heavy runtime deps): asserts the original config is unchanged, the returned copy carries top-level overrides, and nested-dict overrides also do not leak back into the input. Includes a Bug B-style repro: `update_model_config(hf_cfg, {"num_nextn_predict_layers": 0})` leaves `hf_cfg.num_nextn_predict_layers` at the original value. - No existing test in `tests/` references `update_model_config`, `MegatronWorker.init_configs`, or `save_hf_configs`, so this path was previously uncovered. Backward compatibility: the only call site of `update_model_config` is `MegatronWorker.init_configs`, which is updated in this commit to consume the returned value. The recursion-via-setattr semantics are identical to before — only the target object changed (a deep-copied config instead of the caller's input). Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist

Code Review

This pull request updates update_model_config to return a deep copy of the configuration with overrides applied, ensuring the original configuration remains unmodified. This allows megatron_worker.py to preserve the unmutated hf_config_original for self.strategy.hf_config. The review feedback points out a potential AttributeError in _apply_overrides_in_place if nested attributes are standard dictionaries (such as rope_scaling) rather than objects, and suggests a more robust implementation that handles both dictionaries and objects while adding defensive None checks.

gemini-code-assist · 2026-05-29T14:47:43Z

+def _apply_overrides_in_place(module_config, override_config_kwargs):
+    """Apply override kwargs to ``module_config`` in place (used for sub-configs)."""
    for key, val in override_config_kwargs.items():
        if isinstance(val, dict):
-            update_model_config(getattr(module_config, key), val)
+            _apply_overrides_in_place(getattr(module_config, key), val)
        else:
            setattr(module_config, key, val)


The current implementation of _apply_overrides_in_place assumes that module_config and all its nested attributes are objects supporting getattr and setattr. However, many HuggingFace configurations contain standard dictionary attributes (e.g., rope_scaling, quantization_config). If an override contains a nested dictionary for one of these attributes, calling getattr on it returns a dict, and recursing into it will cause an AttributeError when setattr is called on the dictionary keys.

Additionally, if module_config is None or if a nested attribute does not exist (returning None), the function will crash. Enforcing defensive programming with proper None checks and handling both dictionaries and objects will make this utility much more robust.

def _apply_overrides_in_place(module_config, override_config_kwargs): """Apply override kwargs to ``module_config`` in place (used for sub-configs).""" if module_config is None: return for key, val in override_config_kwargs.items(): is_dict = isinstance(module_config, dict) current_val = module_config.get(key) if is_dict else getattr(module_config, key, None) if isinstance(val, dict) and current_val is not None: _apply_overrides_in_place(current_val, val) else: if is_dict: module_config[key] = val else: setattr(module_config, key, val)

Real issue, but pre-existing — the recursive getattr/setattr pattern is unchanged from the prior in-place version; I only relocated it. The only caller (init_configs) passes scalar overrides, so no observable regression today.

Also, the suggested patch conflates "recurse into existing sub-config" with "assign new dict" when current_val is None. Worth a dedicated PR. Happy to file a follow-up issue.

Locks in three properties of update_model_config: - the caller's input config is not mutated at the top level, - the returned copy carries top-level overrides, - nested-dict overrides do not leak back into the caller's input. Includes the Bug B repro: update_model_config(cfg, {"num_nextn_predict_layers": 0}) leaves the input's num_nextn_predict_layers at its original value. Runs on cpu_skyrl_train.yaml (uv run --extra skyrl-train --extra dev pytest tests/train/...). Signed-off-by: Vu Dinh <vudinh@outlook.com>

Avoids a pytest module-name collision with tests/train/generators/test_utils.py (no __init__.py in either directory; pytest's rootdir-based collection treats both as the same top-level module name and aborts collection). The new basename is also more descriptive of what the file actually exercises. Signed-off-by: Vu Dinh <vudinh@outlook.com>

gemini-code-assist Bot reviewed May 29, 2026

View reviewed changes

dinhxuanvu added 2 commits May 29, 2026 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix] Preserve original AutoConfig in MegatronWorker.init_configs#1727

[fix] Preserve original AutoConfig in MegatronWorker.init_configs#1727
dinhxuanvu wants to merge 3 commits into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/fix-megatron-worker-config-mutation

dinhxuanvu commented May 29, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 29, 2026

Uh oh!

dinhxuanvu May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dinhxuanvu commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Fix

Trade-off on tokenizer-id overrides

Verification

Risk

Notes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

dinhxuanvu May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

dinhxuanvu commented May 29, 2026 •

edited

Loading