[Common] Persistent Grouped MXFP8 quantization kernel #2738
Oleg-Goncharov wants to merge 27 commits into NVIDIA:main
Conversation
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
/te-ci
Greptile Summary
This PR adds a persistent grouped MXFP8 quantization kernel with static scheduling and upgrades the
Key changes:
Two issues identified:
Confidence Score: 4/5
Last reviewed commit: 5815335
```diff
 } else {
-  NVTE_CHECK(num_tensors < MAX_SUPPORTED_TENSOR_DESCRIPTORS,
+  NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
             "Number of tensors in a group is larger than "
             "the MAX number of supported descriptors (64).");
-  // Only full tiles supported
-  NVTE_CHECK(last_logical_dim % CHUNK_DIM_X == 0,
-             "Last dimension of a grouped tensor should be divisible by 128.");
-  blocks_Y = 1;
-  blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
+  work_blocks_Y = 1;
+  work_blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
 }
```
Missing column-alignment check for non-single-tensor grouped tensors
The original code included an `NVTE_CHECK` that enforced `last_logical_dim % CHUNK_DIM_X == 0` for the non-single-tensor path (VARYING_LAST_DIM, VARYING_BOTH_DIMS). This check was removed in this PR, but the kernel still requires this alignment for correctness in that path.
The unit tests themselves still skip when this condition is not met:

```cpp
if (!is_single_tensor && (last_dims[t] % CHUNK_DIM_X != 0)) {
  GTEST_SKIP();
}
```

Without the runtime check, callers can pass non-128-aligned last dimensions for non-single-tensor groups and receive silently wrong results. For example, with cols = 160:

- `blocks_X_num_in_current_tensor = DIVUP(160, 128) = 2`
- `decode_block` maps `block_id = 1` to `block_offset_X = 128`, issuing a TMA load at column offset 128 in a 160-wide tensor
- Meanwhile, `is_job_valid` computes the flat-space element offset as `16384`, which maps to `(row=102, col=64)`; these coordinates conflict with what `decode_block` produces, leading to incorrect quantization
The check should be restored:

```cpp
} else {
  NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
             "Number of tensors in a group is larger than "
             "the MAX number of supported descriptors (64).");
  NVTE_CHECK(last_logical_dim % CHUNK_DIM_X == 0,
             "Last dimension of a grouped tensor must be divisible by 128.");
  work_blocks_Y = 1;
  work_blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
}
```
```cpp
if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
  return false;
```
Redundant `block_offset_X_in_tensor >= job.cols` condition is always false
`block_offset_X_in_tensor` is computed as `tensor_offset_from_start % job.cols` (line 210). By definition of the modulo operator, this result is always in `[0, job.cols - 1]`, so the condition `block_offset_X_in_tensor >= job.cols` can never be true.
The actual guard that matters is `block_offset_Y_in_tensor >= job.rows`. The dead half of the condition silently provides no protection against out-of-bounds blocks.
```diff
-if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
-  return false;
+if (block_offset_Y_in_tensor >= job.rows) {
+  return false;
 }
```
| } | ||
|
|
||
```cpp
const float *const thread_in_base = dbias_partial + dbias_in_offset_Y * cols + thread_id * nvec;
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```
Output stride assumes uniform cols across all tensors
The output write offset is computed as:

```cpp
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```

where `cols` is `last_logical_dim`, a single value shared across all tensors in the group. This is correct for SAME_BOTH_DIMS and VARYING_FIRST_DIM (where all tensors share the same last dimension), but the kernel receives `shape_rep` as a parameter and does not enforce that restriction.
For VARYING_LAST_DIM or VARYING_BOTH_DIMS, where per-tensor `cols` differ, the fixed `tensor_id * cols` stride would compute wrong output offsets. Currently, tests skip dbias validation for these cases, but the kernel would produce incorrect results if actually called with varying-last-dim tensors.
Consider adding a device-side assertion to enforce the precondition:

```diff
+if (shape_rep != ShapeRepresentation::SAME_BOTH_DIMS &&
+    shape_rep != ShapeRepresentation::VARYING_FIRST_DIM) {
+  NVTE_DEVICE_ERROR("group_reduce_dbias_kernel requires uniform last dimensions across tensors");
+}
 OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
```
Description
This PR adds a persistent grouped MXFP8 quantization kernel with static scheduling.
It is built on top of PR #2674, [Common] MOE Split dBias.
Type of change
Changes
Checklist: