Fix performance regression in grad_handling_hook #7882
rraminen wants to merge 2 commits into deepspeedai:master from
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2560cf45a9
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
if self._hooks_fired_this_backward == 0:
    self.current_expected_hooks = count_used_parameters_in_backward(all_params_requiring_grad)
self.update_hook_state_and_maybe_run_epilogue(self.current_expected_hooks)
```
Recompute expected hook count for each backward phase
Caching count_used_parameters_in_backward() only when _hooks_fired_this_backward == 0 makes current_expected_hooks fixed to the first phase’s value, but reentrant checkpointing can introduce additional participating params in later phases. In that case update_hook_state_and_maybe_run_epilogue() never sees the higher expected count, so remaining_grad_acc_hooks can drop to zero too early and run the epilogue before all gradients for the backward are processed, which risks incomplete reduction/accumulation for checkpointed models.
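The hazard described above can be illustrated with a minimal, hypothetical simulation (the function and names below are simplified stand-ins, not the DeepSpeed implementation). It assumes that at the time the first hook fires, only the first phase's parameters are visible to the count, so a later reentrant phase never raises the cached expected count:

```python
def simulate(phase_sizes):
    """Return the 1-based hook index at which the epilogue fires.

    phase_sizes: number of participating params in each backward phase
    of a single backward pass (reentrant checkpointing can add phases).
    """
    expected = None
    fired = 0
    for i in range(1, sum(phase_sizes) + 1):
        if fired == 0:
            # Cached only once: at this point only the first phase's
            # params are counted, so later phases are invisible here.
            expected = phase_sizes[0]
        fired += 1
        if fired == expected:
            return i  # epilogue runs now
    return None  # epilogue never fires

# One backward made of two phases (the second added by reentrant
# checkpointing): the epilogue fires after hook 2 even though 5 hooks
# belong to this backward, so 3 gradients are processed too late.
print(simulate([2, 3]))  # -> 2
```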
@codex review
💡 Codex Review
Reviewed commit: 51bec2ba76
```python
if self._remaining_grad_acc_hooks == 0:
    self.current_expected_hooks = count_used_parameters_in_backward(all_params_requiring_grad)
self.update_hook_state_and_maybe_run_epilogue(self.current_expected_hooks)
self._remaining_grad_acc_hooks -= 1
```
Stop decrementing hook-state counter manually
update_hook_state_and_maybe_run_epilogue() already computes and stores _remaining_grad_acc_hooks, so subtracting again here can drive the counter negative after the last hook. In environments that hit the fallback path (e.g., the callback API is unavailable, or hooks fire with backward_active_depth == 0), the next backward then skips the == 0 branch and reuses a stale current_expected_hooks, and reenter_backward_if_needed() no longer sees the zero sentinel it relies on. This can mis-track the expected hook count and prevent the epilogue from running at the correct time for reentrant or dynamic backward graphs.
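The double-decrement concern above can be sketched in isolation (class and method names are simplified stand-ins for illustration, not DeepSpeed's actual code). If the helper owns the decrement, an extra decrement in the hook drives the counter negative, so a later `== 0` check never matches:

```python
class HookState:
    """Toy stand-in for the hook bookkeeping discussed in the review."""

    def __init__(self, expected):
        self.remaining = expected

    def update_hook_state_and_maybe_run_epilogue(self):
        # The helper owns the bookkeeping: it decrements and checks zero.
        self.remaining -= 1
        return self.remaining == 0  # True -> epilogue would run here

def buggy_hook(state):
    ran_epilogue = state.update_hook_state_and_maybe_run_epilogue()
    state.remaining -= 1  # BUG: second decrement per hook firing
    return ran_epilogue

state = HookState(expected=2)
buggy_hook(state)  # remaining: 1 after helper, 0 after extra decrement
buggy_hook(state)  # remaining: -1 after helper (zero check misses), then -2

# The counter is -2 instead of 0, so a subsequent `remaining == 0`
# branch is skipped and stale state is reused on the next backward.
print(state.remaining)  # -> -2
```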
Hi @rraminen,
This PR fixes a performance drop introduced by calling count_used_parameters_in_backward() inside every gradient hook.
In the previous implementation, the expected hook count was computed once per backward phase. After a recent change (311674f#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addL1002-L1010), it is recomputed on every hook invocation, which reduces throughput (samples/sec).
With the fix in this PR, performance returns to the pre-regression samples/sec values.
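The shape of the fix can be sketched as follows (class and method names here are simplified assumptions for illustration, not DeepSpeed's actual implementation): the potentially expensive count runs once at the start of each backward phase, and every subsequent per-parameter hook reuses the cached value.

```python
class GradHookOwner:
    """Toy sketch: cache the expected hook count once per backward phase."""

    def __init__(self, params_requiring_grad):
        self.params = params_requiring_grad
        self.current_expected_hooks = 0
        self.hooks_fired_this_backward = 0

    def count_used_parameters_in_backward(self):
        # Stand-in for the (potentially expensive) per-backward count.
        return len(self.params)

    def grad_hook(self):
        if self.hooks_fired_this_backward == 0:
            # Recomputed once per backward phase, not on every hook.
            self.current_expected_hooks = self.count_used_parameters_in_backward()
        self.hooks_fired_this_backward += 1
        if self.hooks_fired_this_backward == self.current_expected_hooks:
            self.run_epilogue()

    def run_epilogue(self):
        # Reset so the next backward phase recomputes the count.
        self.hooks_fired_this_backward = 0

owner = GradHookOwner(["w1", "w2", "w3"])
for _ in owner.params:
    owner.grad_hook()  # the count runs only on the first of the 3 hooks

print(owner.hooks_fired_this_backward)  # -> 0 (epilogue ran, state reset)
```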