Fix HybridDeviceOptimizer KeyError after mixed-precision param replacement #4046

Open
ma-ben wants to merge 1 commit into NVIDIA:main from ma-ben:fix/hdo-mixed-precision-cpu-offload-keyerror
Conversation

@ma-ben ma-ben commented Mar 28, 2026

Summary

This fixes the HybridDeviceOptimizer KeyError that happens on the first optimizer.step() when CPU optimizer offload is used together with bf16/fp16 mixed precision.

Root Cause

HybridDeviceOptimizer built its internal mapping using tensor object identity from the original param_groups.

Later, Float16OptimizerWithFloat16Params replaces entries in optimizer.param_groups with new FP32 main parameters. On the first step, HDO still looks up the old tensor objects and fails with:

inner_param = self.param_to_inner_param[param]
KeyError: tensor(..., device='cuda:X')
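The failure mode can be reproduced in miniature without the real classes. This is an illustrative sketch, not the actual HDO code: plain objects stand in for torch Parameters (like torch tensors, they hash by object identity), and a new object taking the original's slot in param_groups invalidates the identity-keyed mapping.

```python
# Stand-in for a torch Parameter; hashes by object identity, as tensors do.
class FakeParam:
    def __init__(self, name):
        self.name = name

orig = FakeParam("weight")
param_groups = [{"params": [orig]}]

# HDO-style bookkeeping keyed on the tensor object itself.
param_to_inner_param = {p: p for g in param_groups for p in g["params"]}

# Float16OptimizerWithFloat16Params-style replacement: a *new* FP32 main
# param object takes the original's slot in param_groups.
main_param = FakeParam("weight_fp32")
param_groups[0]["params"][0] = main_param

# First optimizer.step(): the lookup uses the new object and fails.
try:
    param_to_inner_param[param_groups[0]["params"][0]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

The old object is still a key in the dict, but nothing in param_groups points at it anymore, so every per-step lookup misses.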

Fix

Use a stable logical mapping based on parameter position in param_groups instead of relying on tensor object identity remaining unchanged.

This also updates:

  • grad sync from current params to inner/sub-optimizer params
  • copy-back from sub-optimizers to current params
  • HDO state synchronization to follow the current wrapped params
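The positional idea can be sketched as follows. Names here are illustrative, not the exact HDO attributes: the key is a (group index, param index) position, so the wrapped optimizer can swap the tensor object at that slot without invalidating the mapping, and per-step code re-reads the current object at each position.

```python
class FakeParam:
    """Stand-in for a torch Parameter (hashes by object identity)."""
    def __init__(self, name):
        self.name = name

def build_index_to_inner(param_groups, param_to_fp32_param):
    """Map stable (group_idx, param_idx) positions to inner params."""
    index_to_inner = {}
    for gi, group in enumerate(param_groups):
        for pi, param in enumerate(group["params"]):
            # Fall back to the param itself when no FP32 shadow exists.
            index_to_inner[(gi, pi)] = param_to_fp32_param.get(param, param)
    return index_to_inner

def current_param(param_groups, index):
    """Always re-read the *current* object at that position."""
    gi, pi = index
    return param_groups[gi]["params"][pi]

orig = FakeParam("weight")
inner = FakeParam("weight_cpu_copy")
param_groups = [{"params": [orig]}]
index_to_inner = build_index_to_inner(param_groups, {orig: inner})

# The wrapped optimizer replaces the entry; positional lookups still work,
# and grad sync / copy-back address whatever object currently fills the slot.
param_groups[0]["params"][0] = FakeParam("weight_fp32_main")
assert index_to_inner[(0, 0)] is inner
assert current_param(param_groups, (0, 0)).name == "weight_fp32_main"
```

Grad sync and copy-back then pair `current_param(...)` with `index_to_inner[...]` per position rather than consulting an identity-keyed dict.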

Tests

Added a single-GPU regression test covering:

  • HybridDeviceOptimizer + Float16OptimizerWithFloat16Params
  • bf16
  • fp16
  • first optimizer.step() no longer raises KeyError

Validation run:

OMP_NUM_THREADS=1 .venv/bin/python -m pytest tests/unit_tests/test_optimizer_cpu_offloading.py -vv

Fixes #4042

…ment

Switch HybridDeviceOptimizer bookkeeping to stable param-group positions so CPU offload keeps working after Float16OptimizerWithFloat16Params replaces entries in param_groups.

Add a single-GPU regression test covering bf16/fp16 first-step behavior.
@ma-ben ma-ben requested review from a team as code owners March 28, 2026 08:23

copy-pr-bot bot commented Mar 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 28, 2026 08:23
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@ma-ben ma-ben marked this pull request as ready for review March 28, 2026 08:24
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 28, 2026 08:24
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review label Mar 28, 2026
@chtruong814 chtruong814 added the needs-follow-up label Mar 30, 2026
@janEbert
Contributor

/claude review

Comment on lines 194 to +212

    self.param_to_inner_param = {}
    self.inner_param_to_orig_param = {}
    for group in self.param_groups:
        for param in group["params"]:
            if param in self.param_to_fp32_param:
                inner_param = self.param_to_fp32_param[param]
            elif param in self.gpu_params_map_cpu_copy:
                inner_param = self.gpu_params_map_cpu_copy[param]
            else:
                inner_param = param
            self.param_to_inner_param[param] = inner_param
            self.inner_param_to_orig_param[inner_param] = param
    # Keep a stable logical mapping by param-group position so wrapped optimizers
    # can replace tensor objects in-place without breaking HDO's bookkeeping.
    self.param_group_index_to_orig_param = {}
    self.param_group_index_to_inner_param = {}
    self.inner_param_to_param_group_index = {}
    for param_group_index, param in self._iter_current_param_group_entries():
        self.param_group_index_to_orig_param[param_group_index] = param
        if param in self.param_to_fp32_param:
            inner_param = self.param_to_fp32_param[param]
        elif param in self.gpu_params_map_cpu_copy:
            inner_param = self.gpu_params_map_cpu_copy[param]
        else:
            inner_param = param
        self.param_to_inner_param[param] = inner_param
        self.inner_param_to_orig_param[inner_param] = param
        self.param_group_index_to_inner_param[param_group_index] = inner_param
        self.inner_param_to_param_group_index[inner_param] = param_group_index
Contributor
Nit: param_to_inner_param, inner_param_to_orig_param, and param_group_index_to_orig_param are still populated here but no longer read anywhere after this PR. They could be removed to avoid confusion about which mapping is authoritative (the new param_group_index_to_* / inner_param_to_param_group_index dicts).

Contributor

@claude claude bot left a comment
Fix looks correct — switching from tensor identity to positional indexing is the right approach. Test covers the core regression well. Left one minor note about leftover dead dicts that could be cleaned up.

@chtruong814 chtruong814 added and removed the needs-follow-up label Mar 30, 2026

Labels

community-request · Final Review · needs-follow-up

Development

Successfully merging this pull request may close these issues.

[Bug] HybridDeviceOptimizer KeyError with MoE models when using CPU offload

5 participants