Description
Describe the bug
In this NKI kernel (https://gist.github.com/jinhongyii/492d7079efdf186b1cbb6b2f0c0478be), line 158 reads:
p_transposed[p_loop_29[:, None, ], ((b_loop_2 * 128) + f_loop_22[None, :, ])] = nisa.tensor_copy(acc_psum_2[(block_kv * 126 + block_q) * 64 + b_loop_2, b_loop_2 % 8, p_loop_29[:, None, ], f_loop_22[None, :, ]])
This tensor_copy accesses acc_psum_2 in bank b_loop_2 % 8, so the instruction in iteration b_loop_2 should have an anti-dependency only on iteration b_loop_2 - 8, the most recent earlier iteration that uses the same bank.
However, the profile shows that this instruction depends on the immediately preceding iteration (b_loop_2 - 1). I believe this is a compiler bug that hurts performance.
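To make the bank argument concrete, here is a small plain-Python sketch (it does not use NKI or the compiler). It only models the `b_loop_2 % 8` bank index from the kernel line; the helper names and the trip count of 64 are assumptions for illustration, the latter guessed from the `* 64 + b_loop_2` index.

```python
NUM_BANKS = 8      # from the `b_loop_2 % 8` bank index in the kernel line
TRIP_COUNT = 64    # assumed trip count of b_loop_2 (hypothetical)


def bank_of(b):
    """Bank accessed by the tensor_copy in iteration b."""
    return b % NUM_BANKS


def expected_prior_iteration(b):
    """Most recent earlier iteration that touches the same bank, or None."""
    for j in range(b - 1, -1, -1):
        if bank_of(j) == bank_of(b):
            return j
    return None


def observed_prior_iteration(b):
    """Dependency the profile appears to show: the immediately preceding iteration."""
    return b - 1 if b > 0 else None


if __name__ == "__main__":
    for b in range(TRIP_COUNT):
        print(b, bank_of(b), expected_prior_iteration(b), observed_prior_iteration(b))
```

Under this model, iteration 9 shares a bank only with iteration 1 among its recent predecessors, whereas the profile suggests it waits on iteration 8, which uses a different bank.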
Expected Behavior
The copy instruction should not depend on its previous iteration. The event semaphore in the screenshot should not exist.
Current Behavior
An event semaphore appears between consecutive copy instructions in the b_loop_2 loop.
Reproduction Steps
Run the NKI kernel on trn1. You can reproduce all the logs and profiles.
Regression Issue
- Select this option if this issue appears to be a regression.
Possible Solution
Fix the dependency building pass.
Additional Information/Context
No response
neuronx-cc version used
2.17.194
Framework(s) and their versions used (JAX, PyTorch, etc..)
No response