[Common] Persistent Grouped NVFP4 quantization kernel #2743
Oleg-Goncharov wants to merge 34 commits into NVIDIA:main from
Conversation
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR introduces a new persistent grouped NVFP4 quantize+transpose kernel. The new kernel itself is well-designed. However, the PR contains several changes that appear to be development-time leftovers and would cause serious regressions if merged as-is:
Confidence Score: 1/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["nvte_group_quantize()"] --> B["group_quantize_fwd_helper()"]
    B --> C{scaling_mode?}
    C -->|NVTE_NVFP4_1D_SCALING| D["nvfp4::group_quantize_transpose()"]
    C -->|NVTE_MXFP8_1D_SCALING| E["❌ NVTE_ERROR\n(case commented out)"]
    C -->|other| F["NVTE_ERROR"]
    D --> G{use_single_work_grid?}
    G -->|SAME_BOTH_DIMS or\nVARYING_FIRST_DIM| H["Single contiguous work grid\n(ctaid_Y * cols + ctaid_X)"]
    G -->|VARYING_LAST_DIM or\nVARYING_BOTH_DIMS| I["Per-element block grid\n(block_id * ELTS_PER_CHUNK)"]
    H --> J["Persistent kernel\n(static grid-stride loop)"]
    I --> J
    J --> K["decode_job()\nResolve tensor_id, rows, cols"]
    K --> L["is_job_valid()\nBounds check vs offsets"]
    L -->|invalid| M["Drain in-flight TMA\nthen exit"]
    L -->|valid| N["decode_block()\nblock_id_Y, block_id_X, offsets"]
    N --> O["Prime pipeline:\nPrefetch PREFETCH_STAGES stage(s)\nvia TMA global→shared"]
    O --> P["Main stage loop\n(STAGES = TILES_Y × TILES_X)"]
    P --> Q["process_colwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    P --> R["process_rowwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    Q --> S["store_output_stage()\nTMA shared→global commit"]
    R --> S
    S --> T{Last PREFETCH_STAGES?}
    T -->|yes| U["Prefetch stage-0\nof next job"]
    U --> P
    T -->|next job in stride| V["Advance ctaid\nstatic_next_block_id += stride"]
    V --> K
    T -->|no more jobs| W["cp_async_bulk_wait\nthen exit CTA"]
```
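The static grid-stride scheduling in the flowchart (each CTA starts at its own ctaid and repeatedly advances by the total grid size until the decoded job fails validation) can be modeled on the host. This is an illustrative sketch of the scheduling policy only, not the kernel's code; `jobs_for_cta`, the grid size, and the job count below are all made up for the example:

```cpp
#include <cassert>
#include <vector>

// Host-side model of a persistent kernel's static grid-stride loop:
// CTA `ctaid` processes jobs ctaid, ctaid + grid_size, ctaid + 2*grid_size, ...
// until the decoded job index falls outside the valid range (is_job_valid fails),
// at which point the CTA drains in-flight TMA transfers and exits.
std::vector<int> jobs_for_cta(int ctaid, int grid_size, int num_jobs) {
  std::vector<int> jobs;
  for (int job = ctaid; job < num_jobs; job += grid_size) {
    jobs.push_back(job);  // decode_job() + the stage pipeline would run here
  }
  return jobs;
}
```

Across all CTAs this covers every job exactly once with no dynamic work queue, which is what makes the schedule "static".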
```cmake
# set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
set(CMAKE_CUDA_ARCHITECTURES 100)
```
Development-only architecture restriction should not be merged
The old multi-architecture fallback (75 80 89 90) has been replaced with a Blackwell-only build (100), with the original line left behind as a commented-out breadcrumb. As a result the test binary is compiled only for sm_100, silently dropping all Volta/Ampere/Ada/Hopper targets, and it will fail to build at all on CI runners whose CUDA toolkit predates sm_100 support. This is clearly a local development shortcut and must be reverted before merging.
Suggested change:

```diff
-# set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
-set(CMAKE_CUDA_ARCHITECTURES 100)
+set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
```
```cmake
# test_cast.cu
# test_cast_current_scaling.cu
# test_cast_dbias.cu
# test_cast_dbias_dgelu.cu
# test_cast_gated_swiglu.cu
# test_cast_mxfp8_gated_swiglu.cu
# test_qdq.cu
# test_cast_mxfp8.cu
# test_cast_mxfp8_grouped.cu
# test_cast_nvfp4_transpose.cu
test_cast_nvfp4_transpose_grouped.cu
# test_cast_float8blockwise.cu
# test_dequantize_mxfp8.cu
# test_transpose.cu
# test_cast_transpose.cu
# test_cast_transpose_current_scaling.cu
# test_cast_transpose_dbias.cu
# test_cast_transpose_dbias_dgelu.cu
# test_cast_transpose_dgeglu.cu
# test_act.cu
# test_normalization.cu
# test_normalization_mxfp8.cu
# test_memset.cu
# test_multi_cast_transpose.cu
# test_multi_padding.cu
# test_multi_unpadding.cu
# test_causal_softmax.cu
# test_swizzle.cu
# test_swap_first_dims.cu
# test_grouped_gemm.cu
```
All existing operator tests commented out
Every pre-existing test (test_cast.cu, test_cast_mxfp8.cu, test_cast_mxfp8_grouped.cu, test_normalization.cu, etc.) has been commented out, leaving only test_cast_nvfp4_transpose_grouped.cu in the build. This completely disables the existing regression suite and hides any breakage introduced by the changes in quantize.cuh and group_quantize_mxfp8.cuh. These comment-outs appear to be a local development convenience and must be reverted before merging.
The new test file should simply be added to the existing list, not substituted for it.
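Concretely, the fix is to append rather than substitute; a sketch of the restored list (truncated, and the surrounding CMake context is assumed rather than taken from the PR):

```cmake
# Restore the full operator test list and add the new file alongside it.
test_cast.cu
test_cast_mxfp8.cu
test_cast_mxfp8_grouped.cu
test_cast_nvfp4_transpose.cu
test_cast_nvfp4_transpose_grouped.cu  # new grouped NVFP4 test
# ... remaining existing tests unchanged ...
```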
```cpp
  // using namespace detail;

  // const Tensor *input_tensor = convertNVTETensorCheck(input);
  // Tensor *output_tensor = convertNVTETensorCheck(output);

  // // Quantization config
  // QuantizationConfig quant_config_cpp;
  // if (quant_config != nullptr) {
  //   quant_config_cpp = *reinterpret_cast<QuantizationConfig *>(quant_config);
  // }

  // // Noop flag
  // Tensor dummy_tensor;
  // Tensor *noop_tensor = &dummy_tensor;
  // if (quant_config_cpp.noop_tensor != nullptr) {
  //   noop_tensor = convertNVTETensorCheck(quant_config_cpp.noop_tensor);
  // }

  // // Check for unsupported options
  // if (quant_config_cpp.stochastic_rounding) {
  //   NVTE_CHECK(output_tensor->scaling_mode == NVTE_NVFP4_1D_SCALING,
  //              "Stochastic rounding is only supported for NVFP4 quantization.");
  // }

  // NVTE_CHECK(output_tensor->has_data() || output_tensor->has_columnwise_data(),
  //            "Either rowwise or columnwise output data need to be allocated.");

  // // Dispatch to quantization kernel depending on data format
  // switch (output_tensor->scaling_mode) {
  //   case NVTE_DELAYED_TENSOR_SCALING: {
  //     const Tensor *dummy_input_tensor = nullptr;
  //     Tensor *dummy_dbias_tensor = nullptr;
  //     Tensor *dummy_workspace_tensor = nullptr;
  //     if (output_tensor->has_columnwise_data()) {
  //       NVTE_CHECK(output_tensor->has_data(),
  //                  "Quantizing in only the columnwise direction not supported yet!");
  //       if constexpr (!IS_ACT) {
  //         cast_transpose(*input_tensor, *noop_tensor, output_tensor, stream);
  //       } else {
  //         cast_transpose_fused</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, float, ParamOP, OP>(
  //             *input_tensor, dummy_input_tensor, output_tensor, dummy_dbias_tensor,
  //             dummy_workspace_tensor, stream);
  //       }
  //     } else if (output_tensor->has_data()) {
  //       fp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
  //           *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
  //           dummy_workspace_tensor, stream);
  //     }
  //     break;
  //   }
  //   case NVTE_MXFP8_1D_SCALING: {
  //     const Tensor *dummy_input_tensor = nullptr;
  //     Tensor *dummy_dbias_tensor = nullptr;
  //     Tensor *dummy_workspace_tensor = nullptr;
  //     mxfp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
  //         *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
  //         dummy_workspace_tensor, stream);
  //     break;
  //   }
  //   case NVTE_NVFP4_1D_SCALING: {
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not supported by FWD NVTE_NVFP4_1D_SCALING");

  //     // Check tensors
  //     CheckNoopTensor(*noop_tensor, "cast_noop");
  //     CheckInputTensor(*input_tensor, "input");
  //     CheckOutputTensor(*output_tensor, "output", false);

  //     // Choose kernel
  //     int32_t rows = input_tensor->flat_first_dim();
  //     int32_t cols = input_tensor->flat_last_dim();
  //     auto dtype = input_tensor->dtype();
  //     bool use_optimized_kernel = (dtype == DType::kBFloat16) && (rows % 32 == 0) &&
  //                                 (cols % 32 == 0) && output_tensor->has_data();

  //     // Launch NVFP4 quantize kernel
  //     if (use_optimized_kernel) {
  //       if (quant_config_cpp.nvfp4_2d_quantization) {
  //         nvfp4::quantize_transpose</*use_2d_quantization=*/true>(
  //             *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
  //       } else {
  //         nvfp4::quantize_transpose</*use_2d_quantization*/ false>(
  //             *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
  //       }
  //     } else {
  //       auto &global_amax = (output_tensor->amax.dptr != nullptr) ? output_tensor->amax
  //                                                                 : output_tensor->columnwise_amax;
  //       quantize_transpose_vector_blockwise_fp4(
  //           /*input=*/input_tensor->data, /*global_amax=*/global_amax,
  //           /*scale_inv=*/output_tensor->scale_inv,
  //           /*scale_inv_t=*/output_tensor->columnwise_scale_inv,
  //           /*output=*/output_tensor->data, /*output_t=*/output_tensor->columnwise_data,
  //           /*epsilon=*/0.0f, /*return_identity=*/output_tensor->has_data(),
  //           /*return_transpose=*/output_tensor->has_columnwise_data(), /*pow2_scale=*/false,
  //           /*swizzled_scale=*/false,
  //           /*use_stochastic_rounding=*/quant_config_cpp.stochastic_rounding,
  //           /*rng_state=*/quant_config_cpp.rng_state,
  //           /*use_2d_quantization=*/quant_config_cpp.nvfp4_2d_quantization,
  //           /*noop_tensor=*/noop_tensor->data, /*stream=*/stream);
  //     }
  //     break;
  //   }
  //   case NVTE_BLOCK_SCALING_2D: {
  //     // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_2D");
  //     bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
  //     float epsilon = quant_config_cpp.amax_epsilon;
  //     quantize_transpose_square_blockwise(
  //         input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
  //         output_tensor->data, output_tensor->columnwise_data, epsilon,
  //         /*return_transpose=*/output_tensor->has_columnwise_data(), force_pow_2_scales,
  //         /*noop_tensor=*/noop_tensor->data, stream);
  //     break;
  //   }
  //   case NVTE_BLOCK_SCALING_1D: {
  //     // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_1D");
  //     bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
  //     float epsilon = quant_config_cpp.amax_epsilon;
  //     FP8BlockwiseRowwiseOption rowwise_option = FP8BlockwiseRowwiseOption::NONE;
  //     FP8BlockwiseColumnwiseOption columnwise_option = FP8BlockwiseColumnwiseOption::NONE;
  //     if (output_tensor->has_data()) {
  //       rowwise_option = FP8BlockwiseRowwiseOption::ROWWISE_GEMM_READY;
  //     }
  //     if (output_tensor->has_columnwise_data()) {
  //       columnwise_option = FP8BlockwiseColumnwiseOption::COLUMNWISE_GEMM_READY;
  //     }
  //     quantize_transpose_vector_blockwise(
  //         input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
  //         output_tensor->data, output_tensor->columnwise_data, epsilon, rowwise_option,
  //         columnwise_option, force_pow_2_scales, noop_tensor->data, stream);
  //     break;
  //   }
  //   default:
  //     NVTE_ERROR("Not implemented scaling mode: " + to_string(output_tensor->scaling_mode) + ".");
  // }
}

template <bool IS_DBIAS, bool IS_DACT, typename ParamOP, float (*OP)(float, const ParamOP &)>
void quantize_bwd_helper(const NVTETensor grad, const NVTETensor input, NVTETensor output,
                         NVTETensor dbias, NVTETensor workspace,
                         const NVTEQuantizationConfig quant_config, cudaStream_t stream)
```
quantize_fwd_helper body completely commented out — function is now a no-op
The entire implementation of quantize_fwd_helper (covering NVTE_DELAYED_TENSOR_SCALING, NVTE_MXFP8_1D_SCALING, NVTE_NVFP4_1D_SCALING, NVTE_BLOCK_SCALING_2D, and NVTE_BLOCK_SCALING_1D) has been replaced with block comments, leaving the function as an empty stub. Any call to single-tensor forward quantization (nvte_quantize, nvte_quantize_dbias, etc.) now silently does nothing.
This is a severe regression that would cause all non-grouped quantization code paths to silently produce uninitialized output. These commented-out lines must be restored.
```cpp
// case NVTE_MXFP8_1D_SCALING: {
//   mxfp8::group_quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
//       input_tensor, activations_tensor, noop_tensor, output_tensor, dbias_tensor,
//       workspace_tensor, stream);
//   break;
// }
case NVTE_NVFP4_1D_SCALING: {
```
MXFP8 grouped quantization dispatch disabled
The NVTE_MXFP8_1D_SCALING case has been commented out from both group_quantize_fwd_helper (line ~410) and group_quantize_bwd_helper (line ~464). As a result, any call to grouped MXFP8 forward or backward quantization falls through to the default branch and calls NVTE_ERROR, breaking existing MXFP8 grouped workflows.
If this is intentional (e.g., MXFP8 is being refactored in parallel), the PR description and checklist should call this out explicitly as a breaking change, and the NVTE_MXFP8_1D_SCALING cases should at least be preserved in a functional state or guarded by a feature flag.
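If a feature flag is the chosen route, the shape of the guard could look like the following. This is a hedged sketch, not the PR's code: the macro name `NVTE_ENABLE_GROUPED_MXFP8`, the simplified enum, and the `dispatch` helper are all invented for illustration; only the case labels mirror the real switch:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical feature flag; a real integration would set this from the build system.
#define NVTE_ENABLE_GROUPED_MXFP8 1

enum ScalingMode { MXFP8_1D, NVFP4_1D };

// Simplified stand-in for group_quantize_fwd_helper's dispatch switch:
// the MXFP8 case stays in the code, guarded, instead of being commented out.
std::string dispatch(ScalingMode mode) {
  switch (mode) {
#if NVTE_ENABLE_GROUPED_MXFP8
    case MXFP8_1D:
      return "mxfp8::group_quantize";  // preserved, compile-time switchable
#endif
    case NVFP4_1D:
      return "nvfp4::group_quantize_transpose";
    default:
      throw std::runtime_error("Not implemented scaling mode");
  }
}
```

With the flag off, MXFP8 falls through to the explicit error, which at least makes the breakage deliberate and greppable rather than a silent comment-out.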
```cpp
fp4e2m1x2 casted_to_e2m1_pair(scaled_elt_pair);
output[idx_pair] = casted_to_e2m1_pair;

const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);
```
Unused variable truncated_pair
truncated_pair is computed from cvt_fp4x2_to_double2(casted_to_e2m1_pair) but its value is never read. This will generate a compiler warning and the dead computation should be removed.
Suggested change (remove the dead computation):

```diff
-const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);
```
```cpp
void dump_nvfp4_tensor_data(const std::string& prefix,
                            const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
                            const int rows, const int cols) {
  std::string test_file = prefix + "_test.txt";
  std::string ref_file = prefix + "_ref.txt";
  std::string diff_file = prefix + "_diff.txt";

  std::ofstream test_out(test_file);
  std::ofstream ref_out(ref_file);
  std::ofstream diff_out(diff_file);

  if (test_out.is_open() && ref_out.is_open() && diff_out.is_open()) {
    for (int i = 0; i < rows; ++i) {
      for (int j = 0; j < cols; j += 2) {
        const int idx = i * cols + j;
        double2 test_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[idx/2]));
        double2 ref_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[idx/2]));

        for (int k = 0; k < 2; ++k) {
          const double t = (k == 0 ? test_data_pair.x : test_data_pair.y);
          const double r = (k == 0 ? ref_data_pair.x : ref_data_pair.y);
          const int pos = idx + k;

          test_out << "pos[" << pos << "] = " << t << std::endl;
          ref_out << "pos[" << pos << "] = " << r << std::endl;
          diff_out << "pos[" << pos << "] test=" << t << " ref=" << r
                   << " abs_diff=" << fabs(t - r)
                   << " rel_diff=" << (r == 0 ? 0.0 : fabs((t - r) / r)) << std::endl;
        }
      }
    }
    std::cout << "DEBUG: Dumped tensor data to files: " << test_file << ", " << ref_file << ", " << diff_file << std::endl;
  } else {
    std::cout << "WARNING: Could not open files for tensor data dump" << std::endl;
  }
}

void print_detailed_tensor_comparison(const std::string& name,
                                      const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
                                      const int rows, const int cols) {
  printf("\n=== DETAILED COMPARISON for %s (%d×%d = %d elements) ===\n",
         name.c_str(), rows, cols, rows * cols);

  const int total_elements = rows * cols;
  const int check_count = 128;

  printf("--- FIRST %d ELEMENTS ---\n", check_count);
  printf("Index | Test_Value | Ref_Value | Match\n");
  printf("------|---------------|---------------|-------\n");
  for (int i = 0; i < std::min(check_count, total_elements); ++i) {
    double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
    double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

    double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
    double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
    bool match = (fabs(t - r) < 1e-6);

    printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
  }

  if (total_elements > 2 * check_count) {
    printf("\n--- LAST %d ELEMENTS ---\n", check_count);
    printf("Index | Test_Value | Ref_Value | Match\n");
    printf("------|---------------|---------------|-------\n");
    for (int i = total_elements - check_count; i < total_elements; ++i) {
      double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
      double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

      double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
      double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
      bool match = (fabs(t - r) < 1e-6);

      printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
    }
  }
  printf("==================================\n");
}
```
Debug dump and verbose comparison utilities should be removed or hidden before merging
dump_nvfp4_tensor_data (line 234) writes test/ref/diff text files to disk during CI runs. print_detailed_tensor_comparison (line 271) prints element-by-element tables to stdout. Both functions are only conditionally exercised and print_detailed_tensor_comparison is already fully commented out at the call site (lines 323–324). These are development-time diagnostic aids that add noise, may fail in read-only CI environments, and were not part of any documented public API. They should be removed from the final PR, or at minimum wrapped in #ifdef NVTE_DEBUG_DUMP guards so they never run in CI.
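If the helpers are kept, the guard suggested above might look like the following sketch. `NVTE_DEBUG_DUMP` is the guard name proposed in the comment; the simplified `dump_debug` function and its boolean return are invented here purely to illustrate the pattern:

```cpp
#include <cassert>
#include <string>

// Hypothetical guard macro: define NVTE_DEBUG_DUMP at build time to enable dumps.
#ifdef NVTE_DEBUG_DUMP
bool dump_debug(const std::string& prefix) {
  // file-writing diagnostics (dump_nvfp4_tensor_data etc.) would run here
  return true;  // dump performed
}
#else
inline bool dump_debug(const std::string&) { return false; }  // no-op in CI builds
#endif
```

In a default build the stub compiles to nothing, so CI never touches the filesystem or floods stdout, while developers can still opt in locally.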
Description
This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738 ([Common] Persistent Grouped MXFP8 quantization kernel).
Type of change
Changes
Checklist: