Fix bug in momentum type declaration in HIP TBE kernel #147

Open

aryaman-gupta wants to merge 5 commits into aryaman/upstream from aryaman/hip-tbe-momentum-fix
Conversation

@aryaman-gupta
This PR fixes a bug in the HIP TBE kernel, which declared p_momentum with the data type cache_t. When cache_t is half, p_momentum is declared as half, which is incorrect and inconsistent with the Python code (https://github.com/ROCm/FBGEMM/blob/aryaman/hip-tbe-momentum-fix/fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops_training.py#L1232-L1240). Before pytorch@ea2a302 was merged, cache_t was always float, so the bug never surfaced.

With this change, the data type of p_momentum is determined using PyTorch's acc_type, which yields float when cache_t is half. This matches how the CUDA codepath handles the data type of the momentum buffer (references: https://github.com/ROCm/FBGEMM/blob/aryaman/hip-tbe-momentum-fix/fbgemm_gpu/codegen/genscript/optimizer_args.py#L956-L961, https://github.com/ROCm/FBGEMM/blob/aryaman/hip-tbe-momentum-fix/fbgemm_gpu/codegen/genscript/optimizers.py#L266).

@aryaman-gupta aryaman-gupta marked this pull request as ready for review March 19, 2026 22:16