Replace apex clip_grad_norm_ with PyTorch native in dints templates (#426)
HeyangQin wants to merge 1 commit into Project-MONAI:main
Conversation
apex.contrib.clip_grad.clip_grad_norm_ crashes on PyTorch >=2.10 with "RuntimeError: Cannot access data pointer of Tensor that doesn't have storage" because apex's multi_tensor_applier cannot handle tensors with lazy/functional storage introduced in newer PyTorch versions. The try/except only caught ModuleNotFoundError (apex not installed) but not the runtime crash when apex is installed but incompatible. torch.nn.utils.clip_grad_norm_ handles all tensor types correctly and is the standard approach. The apex version offered marginal performance gains that are not worth the compatibility breakage. Fixes Project-MONAI/MONAI#8737
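A minimal sketch of the replacement call described above (the toy model and shapes are illustrative; the actual dints template code is not shown in this PR):

```python
import torch
from torch.nn.utils import clip_grad_norm_  # PyTorch-native clipping

# Toy model and one backward pass, standing in for the dints training loop.
model = torch.nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).sum()
loss.backward()

# Clips gradients in place and returns the total norm measured before clipping.
total_norm = clip_grad_norm_(model.parameters(), max_norm=1.0)
```

Because `torch.nn.utils.clip_grad_norm_` operates through standard tensor ops rather than raw data pointers, it works regardless of the tensor's storage backend.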
Walkthrough

This change replaces conditional imports that attempted to use Apex's gradient clipping function with an unconditional import of PyTorch's implementation. The modification affects two training/search script files to ensure compatibility with PyTorch 2.10, which breaks Apex's multi-tensor operations on tensors with non-traditional storage.
Summary
- Removes `apex.contrib.clip_grad.clip_grad_norm_` from dints auto3dseg templates (`train.py` and `search.py`)
- Uses `torch.nn.utils.clip_grad_norm_` instead, which handles all tensor types correctly, including those with lazy/functional storage in PyTorch >=2.10
- The previous try/except only caught `ModuleNotFoundError` (apex not installed), but not the `RuntimeError` raised when apex is installed but its `multi_tensor_applier` is incompatible with newer PyTorch tensor storage

Root cause
On PyTorch >=2.10 (e.g., NGC 25.12), some gradient tensors use lazy/functional storage that no longer exposes a traditional data pointer.
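This also explains why the original import-time guard could not help: the failure happens when the function is called, not when it is imported. A toy illustration of that distinction (no apex involved; all names here are made up):

```python
# Toy illustration: an import-style try/except cannot catch an error that is
# only raised when the guarded function is later called.
def broken_clip(params, max_norm):
    # Stands in for apex's clip_grad_norm_ on newer PyTorch.
    raise RuntimeError("Cannot access data pointer of Tensor that doesn't have storage")

def safe_clip(params, max_norm):
    # Stands in for torch.nn.utils.clip_grad_norm_.
    return 0.0

try:
    clip = broken_clip  # the "import" succeeds, so the except branch never runs
except ModuleNotFoundError:
    clip = safe_clip

# The crash only surfaces here, deep inside the training loop:
try:
    clip([], max_norm=1.0)
except RuntimeError as e:
    print("call-time failure:", e)
```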
`apex.contrib.clip_grad.clip_grad_norm_` calls `multi_tensor_applier`, which tries to access the raw data pointer, causing `RuntimeError: Cannot access data pointer of Tensor that doesn't have storage`.

Test plan
- No remaining references to `apex.contrib.clip_grad` in the repository
- Both templates now use `torch.nn.utils.clip_grad_norm_` (either via import alias or fully-qualified)

Fixes Project-MONAI/MONAI#8737
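The repository-wide check in the test plan amounts to a grep over the templates; the throwaway directory and file contents below are made up for illustration:

```shell
# Create a throwaway directory standing in for the dints templates,
# then verify no apex.contrib.clip_grad references remain.
workdir=$(mktemp -d)
printf 'from torch.nn.utils import clip_grad_norm_\n' > "$workdir/train.py"
printf 'from torch.nn.utils import clip_grad_norm_\n' > "$workdir/search.py"

if grep -rq "apex.contrib.clip_grad" "$workdir"; then
  echo "leftover apex reference found"
else
  echo "clean"
fi
```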