
[Common] Persistent Grouped MXFP8 quantization kernel #2738

Open
Oleg-Goncharov wants to merge 27 commits into NVIDIA:main from
Oleg-Goncharov:pr_persistent_grouped_mxfp8_kernel

Conversation


@Oleg-Goncharov (Collaborator) commented Mar 5, 2026

Description

This PR adds a persistent grouped MXFP8 quantization kernel with static scheduling.
It is built on top of PR #2674 ([Common] MOE Split dBias).

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Added persistent kernel
  • Added TunableConfig structure to tune performance

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Oleg-Goncharov and others added 22 commits February 27, 2026 15:53
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci

@Oleg-Goncharov added the enhancement (New feature or request) and MoE labels on Mar 5, 2026
@Oleg-Goncharov requested a review from ptrendx on March 5, 2026 at 16:18
@Oleg-Goncharov (Collaborator, Author) commented:

/te-ci


greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR adds a persistent grouped MXFP8 quantization kernel with static scheduling and upgrades the dbias output from a single flat tensor to a per-group NVTEGroupedTensor. The persistent kernel introduces a TunableConfig struct, a grid-stride work scheduler that maps a compact physical CTA grid over a virtual work grid, and a ping-pong double-buffer pipeline for TMA loads and stores.

Key changes:

  • TunableConfig centralises tunable constants (CHUNK_DIM_Y/X, THREADS_PER_CHUNK, PREFETCH_STAGES, PERSISTENT, STATIC_PERSISTENT_BLOCKS_PER_SM); the existing CHUNK_DIM_X/Y constants are now aliases.
  • The main kernel gains work_blocks_X/Y parameters and a while (!job_finished) outer loop; job decoding is split into decode_job / decode_block / is_job_valid device helpers.
  • Colwise and rowwise processing are extracted into process_colwise_stage / process_rowwise_stage device functions, reducing the kernel body significantly.
  • grouped_reduce_dbias (new host function + group_reduce_dbias_kernel) replaces reduce_dbias, writing one output row per tensor into a [num_tensors, cols] layout.
  • ShapeRepresentation enum is moved to common.cuh so both the main kernel and the new reduction kernel share the same type.
  • All nvte_group_quantize_dbias* public API signatures change NVTETensor dbias → NVTEGroupedTensor dbias.
  • Tests are updated to exercise the new grouped dbias layout, run the reference over all tensors uniformly, and skip non-16-byte-aligned last dimensions.

Two issues identified:

  1. The is_job_valid function contains a mathematically impossible condition (block_offset_X_in_tensor >= job.cols): because of the modulo arithmetic that produces the offset, the condition can never be true and should be removed for clarity.
  2. The group_reduce_dbias_kernel assumes all tensors in a group share the same last dimension (last_logical_dim), but does not enforce this precondition. The kernel would produce incorrect results if called with VARYING_LAST_DIMS or VARYING_BOTH_DIMS shapes. Tests currently skip dbias validation for these cases, preventing the bug from manifesting, but adding a precondition check or documentation is recommended.

Confidence Score: 4/5

  • The PR is functionally correct for the common paths (single-tensor and varying-first-dim cases) but has one low-severity issue in the persistent kernel and one latent bug in the dbias reduction path.
  • The redundant condition in is_job_valid is a clarity issue that doesn't affect correctness. The uniform-cols assumption in group_reduce_dbias_kernel is a real bug that could corrupt dbias output for varying-last-dim groups, but tests currently skip these cases, so no failures manifest. Both issues are addressable with minimal code changes (remove dead condition, add precondition check). The overall refactoring is well-structured and the job decoding, barrier management, and grouped dbias reduction logic is correct.
  • Files needing changes: transformer_engine/common/cast/mxfp8/group_quantize_mxfp8.cuh (remove the dead condition) and transformer_engine/common/cast/core/common.cuh (add a precondition check for varying-last-dim support).

Last reviewed commit: 5815335

Comment on lines 1091 to +1097

  } else {
-   NVTE_CHECK(num_tensors < MAX_SUPPORTED_TENSOR_DESCRIPTORS,
+   NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
               "Number of tensors in a group is larger than "
               "the MAX number of supported descriptors (64).");
-   // Only full tiles supported
-   NVTE_CHECK(last_logical_dim % CHUNK_DIM_X == 0,
-              "Last dimension of a grouped tensor should be divisible by 128.");
-   blocks_Y = 1;
-   blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
+   work_blocks_Y = 1;
+   work_blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
  }

Missing column-alignment check for non-single-tensor grouped tensors

The original code included an NVTE_CHECK that enforced last_logical_dim % CHUNK_DIM_X == 0 for the non-single-tensor path (VARYING_LAST_DIM, VARYING_BOTH_DIMS). This check was removed in this PR, but the kernel still requires this alignment for correctness in the non-single-tensor path.

The unit tests themselves still skip when this condition is not met:

if (!is_single_tensor && (last_dims[t] % CHUNK_DIM_X != 0)) {
    GTEST_SKIP();
}

Without the runtime check, callers can pass non-128-aligned last dimensions for non-single-tensor groups and receive silently wrong results. For example, with cols = 160:

  • blocks_X_num_in_current_tensor = DIVUP(160, 128) = 2
  • decode_block maps block_id = 1 to block_offset_X = 128, issuing a TMA load at column offset 128 in a 160-wide tensor
  • Meanwhile is_job_valid computes the flat-space element offset as 16384, which maps to (row=102, col=64) — these coordinates conflict with what decode_block produces, leading to incorrect quantization

The check should be restored:

  } else {
    NVTE_CHECK(num_tensors <= MAX_SUPPORTED_TENSOR_DESCRIPTORS,
               "Number of tensors in a group is larger than "
               "the MAX number of supported descriptors (64).");
    NVTE_CHECK(last_logical_dim % CHUNK_DIM_X == 0,
               "Last dimension of a grouped tensor must be divisible by 128.");
    work_blocks_Y = 1;
    work_blocks_X = DIVUP(elts_total, CHUNK_DIM_Y * CHUNK_DIM_X);
  }

@Oleg-Goncharov force-pushed the pr_persistent_grouped_mxfp8_kernel branch from 924ff91 to 325181b on March 6, 2026 at 10:39
Comment on lines +211 to +212
if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
return false;

Redundant block_offset_X_in_tensor >= job.cols condition is always false

block_offset_X_in_tensor is computed as tensor_offset_from_start % job.cols (line 210). By definition of the modulo operator, this result always lies in [0, job.cols - 1], so the condition block_offset_X_in_tensor >= job.cols can never be true.

The actual guard that matters is block_offset_Y_in_tensor >= job.rows. The dead half of the condition silently provides no protection against out-of-bounds blocks.

Suggested change

- if (block_offset_Y_in_tensor >= job.rows || block_offset_X_in_tensor >= job.cols) {
-   return false;
+ if (block_offset_Y_in_tensor >= job.rows) {
+   return false;
  }

Quoted lines (group_reduce_dbias_kernel):

const float *const thread_in_base = dbias_partial + dbias_in_offset_Y * cols + thread_id * nvec;
OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Output stride assumes uniform cols across all tensors

The output write offset is computed as:

OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;

where cols is last_logical_dim — a single value shared across all tensors in the group. This is correct for SAME_BOTH_DIMS and VARYING_FIRST_DIM (where all tensors share the same last dimension), but the kernel receives shape_rep as a parameter and does not enforce that restriction.

For VARYING_LAST_DIM or VARYING_BOTH_DIMS where per-tensor cols differ, the fixed tensor_id * cols stride would compute wrong output offsets. Currently, tests skip dbias validation for these cases, but the kernel would produce incorrect results if actually called with varying-last-dim tensors.

Consider adding a device-side assertion to enforce the precondition:

Suggested change

  OType *const thread_out_base = dbias_output + tensor_id * cols + thread_id * nvec;
+ if (shape_rep != ShapeRepresentation::SAME_BOTH_DIMS && shape_rep != ShapeRepresentation::VARYING_FIRST_DIM) {
+   NVTE_DEVICE_ERROR("group_reduce_dbias_kernel requires uniform last dimensions across tensors");
+ }
