Commit ec16a07
Optimize MOE GEMV kernel for BS > 1. (ggml-org#20905)
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 launched too many thread blocks, with grid (nrows_x, nchannels_dst, ncols_dst), and did very little work per block: a block of (32, 4) threads computed the inner dot product for a single row.
The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently, using warp-level reduction only (no shared memory sync).
This change does not increase compilation time, since only a single template instance is needed per type. It also simplifies the original GEMV kernel and gets rid of the `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR ggml-org#20885 to enable the small_k optimization only in cases where it is beneficial
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
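
To make the launch-geometry change described above concrete, here is a minimal host-side sketch that only counts thread blocks for the two layouts. It is not the code from this commit: `Dim3`, `Launch`, `old_launch`, `moe_launch`, `ceil_div` and `num_blocks` are hypothetical names, and the example value `rpb = 2` is an assumption, since the message does not state what `rpb` is set to.

```cpp
#include <cstdint>

// Hypothetical stand-in for CUDA's dim3 so the sketch builds without the CUDA toolkit.
struct Dim3 {
    uint32_t x, y, z;
};

struct Launch {
    Dim3 grid;
    Dim3 block;
};

static uint32_t ceil_div(uint32_t a, uint32_t b) {
    return (a + b - 1) / b;
}

static uint64_t num_blocks(const Dim3 & g) {
    return (uint64_t) g.x * g.y * g.z;
}

// Old layout as described in the commit message:
// grid (nrows_x, nchannels_dst, ncols_dst), block (32, 4).
static Launch old_launch(uint32_t nrows_x, uint32_t nchannels_dst, uint32_t ncols_dst) {
    return { { nrows_x, nchannels_dst, ncols_dst }, { 32, 4, 1 } };
}

// New layout as described in the commit message:
// grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst).
// rpb is passed in by the caller; its real value is not given in the message.
static Launch moe_launch(uint32_t nrows_x, uint32_t nchannels_dst, uint32_t ncols_dst,
                         uint32_t rpb, uint32_t warp_size) {
    return { { ceil_div(nrows_x, rpb), nchannels_dst, 1 }, { warp_size, ncols_dst, 1 } };
}
```

With illustrative sizes nrows_x = 4096, nchannels_dst = 8, ncols_dst = 4 and the assumed rpb = 2, `num_blocks(old_launch(...).grid)` is 131072 while `num_blocks(moe_launch(...).grid)` is 16384. The token dimension moves from the grid into the block, so the number of blocks no longer grows with the batch size.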
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
1 parent f5d1c41 commit ec16a07
3 files changed
Lines changed: 358 additions & 59 deletions