[Fix] Initialize grad_norm before found_inf skip path#1762

Open
kaysonyu wants to merge 3 commits into THUDM:main from kaysonyu:fix/found-inf-grad-norm-init
Conversation

@kaysonyu
Contributor

Summary

This fixes a bug in the local found_inf skip path in Megatron training.

When --no-check-for-nan-in-loss-and-grad is enabled, slime calls optimizer.prepare_grads() before optimizer.step().
If prepare_grads() returns found_inf=True, the code marks the step as invalid and skips optimizer.step() entirely.

Before this change, grad_norm was never assigned in that branch, yet it was still returned at the end of
train_one_step(), which could raise:

UnboundLocalError: cannot access local variable 'grad_norm' where it is not associated with a value

Change

Initialize grad_norm to float("nan") before entering the local bad-step check.

Effect

  • found_inf=True no longer crashes with UnboundLocalError
  • the step still skips parameter update
  • scheduler is still not advanced in this path
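A minimal sketch of the fixed control flow. The optimizer class and return values here are illustrative stand-ins, not the actual slime/Megatron-LM code; the point is that grad_norm is bound before the skip branch can return it:

```python
import math


class FakeOptimizer:
    """Illustrative stand-in for the Megatron optimizer (not the real API)."""

    def __init__(self, found_inf=False):
        self.found_inf = found_inf

    def prepare_grads(self):
        # Returns True when an inf/nan gradient was detected.
        return self.found_inf

    def step(self):
        # Returns (update_successful, grad_norm); values are made up here.
        return True, 1.23


def train_one_step(optimizer):
    # Fix: bind grad_norm before the local bad-step check, so the
    # found_inf skip path can return it without UnboundLocalError.
    grad_norm = float("nan")
    found_inf = optimizer.prepare_grads()
    if found_inf:
        # Skip the parameter update and scheduler advance;
        # grad_norm stays NaN for this invalid step.
        return False, grad_norm
    update_successful, grad_norm = optimizer.step()
    return update_successful, grad_norm
```

With the fix, the found_inf path returns (False, nan) instead of crashing, and a normal step still returns the real grad norm.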
