SOLVE_TRI extension to more dimensions #17793
Conversation
Does this supersede #17703?
@am17an nope, that other one is an optimization for the small kernel.
Force-pushed from 468bec5 to 5efca20
am17an
left a comment
Does this kernel have similar performance to the templated one? If there is not a huge performance difference, it is much preferred to keep one kernel for future maintenance + binary size (a big difference in e2e performance of qwen3-next would justify its existence, however).
ggml/src/ggml-cuda/solve_tri.cu (Outdated)

```cpp
float * X_batch = (float *) (X + i02 * nb2 + i03 * nb3);

// Shared memory for current tile
__shared__ float sA[GENERAL_TILE_SIZE * GENERAL_TILE_SIZE];
```
This would probably suffer from the same bank conflict problem, right?
ggml/src/ggml-cuda/solve_tri.cu (Outdated)

```cpp
int tile_end = min(tile_start + GENERAL_TILE_SIZE, n);
int tile_n   = tile_end - tile_start;
// Load tile of A matrix
for (int i = tid; i < tile_n * tile_n; i += blockDim.x) {
    int local_row  = i / tile_n;
    int local_col  = i % tile_n;
    int global_row = tile_start + local_row;
    int global_col = tile_start + local_col;
    if (global_col <= global_row) {
```
const wherever
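For context on what the quoted tile loop is building toward, here is a hedged NumPy sketch of blocked forward substitution for a lower-triangular system: solve the diagonal tile directly, then fold already-solved rows into the remaining right-hand side. The tile size, function name, and variables are illustrative, not the kernel's actual identifiers.

```python
import numpy as np

TILE = 4  # illustrative stand-in for GENERAL_TILE_SIZE

def blocked_solve_lower(A, B):
    """Solve A @ X = B for X, with A lower-triangular, one tile of rows at a time."""
    n = A.shape[0]
    X = B.astype(np.float64).copy()
    for start in range(0, n, TILE):
        end = min(start + TILE, n)
        # Subtract contributions from already-solved rows (off-diagonal tiles of A).
        X[start:end] -= A[start:end, :start] @ X[:start]
        # Solve the diagonal tile with plain forward substitution.
        for i in range(start, end):
            X[i] -= A[i, start:i] @ X[start:i]
            X[i] /= A[i, i]
    return X

rng = np.random.default_rng(0)
n, k = 10, 3
A = np.tril(rng.standard_normal((n, n))) + n * np.eye(n)  # well-conditioned lower-triangular
B = rng.standard_normal((n, k))
X = blocked_solve_lower(A, B)
assert np.allclose(A @ X, B)
```

The point of the tiling is that each diagonal tile (and the off-diagonal update) touches a bounded working set, which is what lets the CUDA kernel stage tiles of `A` in shared memory regardless of `n`.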
JohannesGaessler
left a comment
Copy-pasting and modifying the code like this is not acceptable in terms of maintainability. Instead, modify the existing kernel to handle tiling for the case n_template == 0 && k_template == 0.
I think you added this unintentionally, please remove it from this PR.
It adds itself automatically due to a GitHub hook installed by the server development code 😐
Also be aware that this PR will need to be rebased on top of master.
I have a reworked approach using cuBLAS trsmBatched for the big sizes instead, but I haven't been able to figure out some weird race conditions...
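For reference, the operation `cublasStrsmBatched` performs is one triangular solve per batch entry. A minimal NumPy/SciPy sketch of the equivalent math (shapes and names are illustrative; this mirrors the side=left, uplo=lower, non-transposed case):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(1)
batch, n, k = 4, 8, 2
# One well-conditioned lower-triangular system per batch entry.
A = np.tril(rng.standard_normal((batch, n, n))) + n * np.eye(n)
B = rng.standard_normal((batch, n, k))

# Per-batch solve of A[i] @ X[i] = B[i], as trsmBatched would do on-device.
X = np.stack([solve_triangular(A[i], B[i], lower=True) for i in range(batch)])

assert np.allclose(A @ X, B)
```

The CUDA call operates on an array of device pointers (one per batch entry) and solves all systems in a single launch, which is what makes it attractive for the batched SOLVE_TRI op here.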
Okay, @JohannesGaessler @am17an, question: I rewrote the entire kernel to just use cuBLAS. [benchmark tables comparing the optimized kernel in master vs. cuBLAS omitted] Does it make sense to keep the optimized kernel for the small cases, given its ~25% performance increase, or should I just replace the entire kernel with the cuBLAS version?
Is there an equivalent solution for HIP?
@JohannesGaessler from what I've checked there's both
My opinion is that both a manually written kernel and a library call would be fine. I don't think that performance numbers for
Force-pushed from c8abbe0 to 60dcf1b
Okay, I merged both versions then; I think we can keep it. SOLVE_TRI is pretty mission-critical for tuning the new generation of hybrid models with recurrent logic, since its efficiency determines the chunk size that can be used (and thus potentially the graph size). It's only used now in Qwen3Next, but Kimi Linear is probably next in line.
Extend SOLVE_TRI with a tiled kernel so it works on all dimensions. Asked my LLM friends to generate a tiled kernel and added some backend tests for it. Not really too worried about optimizations for now, since the base case is 64, but adding this in case we want to experiment with different / configurable chunk sizes for hybrid models with chunking (like Qwen3Next or the hopefully-coming Kimi Linear).