Fix HybridDeviceOptimizer KeyError after mixed-precision param replacement #4046

Open
ma-ben wants to merge 1 commit into NVIDIA:main from ma-ben:fix/hdo-mixed-precision-cpu-offload-keyerror
Conversation

@ma-ben ma-ben commented Mar 28, 2026

Summary

This fixes the HybridDeviceOptimizer KeyError that happens on the first optimizer.step() when CPU optimizer offload is used together with bf16/fp16 mixed precision.

Root Cause

HybridDeviceOptimizer built its internal mapping using tensor object identity from the original param_groups.

Later, Float16OptimizerWithFloat16Params replaces entries in optimizer.param_groups with new FP32 main parameters. On the first step, HDO still looks up the old tensor objects and fails with:

inner_param = self.param_to_inner_param[param]
KeyError: tensor(..., device='cuda:X')
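The failure mode can be reproduced in miniature without the real classes. This is an illustrative sketch, not the actual HDO code: plain objects stand in for torch Parameters (like torch tensors, they hash by object identity), and a new object taking the original's slot in param_groups invalidates the identity-keyed mapping.

```python
# Stand-in for a torch Parameter; hashes by object identity, as tensors do.
class FakeParam:
    def __init__(self, name):
        self.name = name

orig = FakeParam("weight")
param_groups = [{"params": [orig]}]

# HDO-style bookkeeping keyed on the tensor object itself.
param_to_inner_param = {p: p for g in param_groups for p in g["params"]}

# Float16OptimizerWithFloat16Params-style replacement: a *new* FP32 main
# param object takes the original's slot in param_groups.
main_param = FakeParam("weight_fp32")
param_groups[0]["params"][0] = main_param

# First optimizer.step(): the lookup uses the new object and fails.
try:
    param_to_inner_param[param_groups[0]["params"][0]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```

The old object is still a key in the dict, but nothing in param_groups points at it anymore, so every per-step lookup misses.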

Fix

Use a stable logical mapping based on parameter position in param_groups instead of relying on tensor object identity remaining unchanged.

This also updates:

  • grad sync from current params to inner/sub-optimizer params
  • copy-back from sub-optimizers to current params
  • HDO state synchronization to follow the current wrapped params
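The positional idea can be sketched as follows. Names here are illustrative, not the exact HDO attributes: the key is a (group index, param index) position, so the wrapped optimizer can swap the tensor object at that slot without invalidating the mapping, and per-step code re-reads the current object at each position.

```python
class FakeParam:
    """Stand-in for a torch Parameter (hashes by object identity)."""
    def __init__(self, name):
        self.name = name

def build_index_to_inner(param_groups, param_to_fp32_param):
    """Map stable (group_idx, param_idx) positions to inner params."""
    index_to_inner = {}
    for gi, group in enumerate(param_groups):
        for pi, param in enumerate(group["params"]):
            # Fall back to the param itself when no FP32 shadow exists.
            index_to_inner[(gi, pi)] = param_to_fp32_param.get(param, param)
    return index_to_inner

def current_param(param_groups, index):
    """Always re-read the *current* object at that position."""
    gi, pi = index
    return param_groups[gi]["params"][pi]

orig = FakeParam("weight")
inner = FakeParam("weight_cpu_copy")
param_groups = [{"params": [orig]}]
index_to_inner = build_index_to_inner(param_groups, {orig: inner})

# The wrapped optimizer replaces the entry; positional lookups still work,
# and grad sync / copy-back address whatever object currently fills the slot.
param_groups[0]["params"][0] = FakeParam("weight_fp32_main")
assert index_to_inner[(0, 0)] is inner
assert current_param(param_groups, (0, 0)).name == "weight_fp32_main"
```

Grad sync and copy-back then pair `current_param(...)` with `index_to_inner[...]` per position rather than consulting an identity-keyed dict.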

Tests

Added a single-GPU regression test covering:

  • HybridDeviceOptimizer + Float16OptimizerWithFloat16Params
  • bf16
  • fp16
  • first optimizer.step() no longer raises KeyError

Validation run:

OMP_NUM_THREADS=1 .venv/bin/python -m pytest tests/unit_tests/test_optimizer_cpu_offloading.py -vv

Fixes #4042

…ment

Switch HybridDeviceOptimizer bookkeeping to stable param-group positions so CPU offload keeps working after Float16OptimizerWithFloat16Params replaces entries in param_groups.

Add a single-GPU regression test covering bf16/fp16 first-step behavior.
@ma-ben ma-ben requested review from a team as code owners March 28, 2026 08:23

copy-pr-bot bot commented Mar 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft March 28, 2026 08:23
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@ma-ben ma-ben marked this pull request as ready for review March 28, 2026 08:24
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team March 28, 2026 08:24
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review label Mar 28, 2026
@chtruong814 chtruong814 added the needs-follow-up label Mar 30, 2026
@janEbert
Contributor

/claude review

Comment on lines 194 to +212

    self.param_to_inner_param = {}
    self.inner_param_to_orig_param = {}
    for group in self.param_groups:
        for param in group["params"]:
            if param in self.param_to_fp32_param:
                inner_param = self.param_to_fp32_param[param]
            elif param in self.gpu_params_map_cpu_copy:
                inner_param = self.gpu_params_map_cpu_copy[param]
            else:
                inner_param = param
            self.param_to_inner_param[param] = inner_param
            self.inner_param_to_orig_param[inner_param] = param
    # Keep a stable logical mapping by param-group position so wrapped optimizers
    # can replace tensor objects in-place without breaking HDO's bookkeeping.
    self.param_group_index_to_orig_param = {}
    self.param_group_index_to_inner_param = {}
    self.inner_param_to_param_group_index = {}
    for param_group_index, param in self._iter_current_param_group_entries():
        self.param_group_index_to_orig_param[param_group_index] = param
        if param in self.param_to_fp32_param:
            inner_param = self.param_to_fp32_param[param]
        elif param in self.gpu_params_map_cpu_copy:
            inner_param = self.gpu_params_map_cpu_copy[param]
        else:
            inner_param = param
        self.param_to_inner_param[param] = inner_param
        self.inner_param_to_orig_param[inner_param] = param
        self.param_group_index_to_inner_param[param_group_index] = inner_param
        self.inner_param_to_param_group_index[inner_param] = param_group_index
Contributor
Nit: param_to_inner_param, inner_param_to_orig_param, and param_group_index_to_orig_param are still populated here but no longer read anywhere after this PR. They could be removed to avoid confusion about which mapping is authoritative (the new param_group_index_to_* / inner_param_to_param_group_index dicts).

Contributor

@claude claude bot left a comment
Fix looks correct — switching from tensor identity to positional indexing is the right approach. Test covers the core regression well. Left one minor note about leftover dead dicts that could be cleaned up.

@chtruong814 chtruong814 added and removed the needs-follow-up label Mar 30, 2026

Labels

community-request · Final Review · needs-follow-up

Development

Successfully merging this pull request may close these issues.

[Bug] HybridDeviceOptimizer KeyError with MoE models when using CPU offload

5 participants