[Common] Persistent Grouped NVFP4 quantization kernel #2743
Oleg-Goncharov wants to merge 34 commits into NVIDIA:main from
Conversation
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR introduces a new persistent grouped NVFP4 quantize+transpose kernel. The new kernel itself is well-designed. However, the PR contains several changes that appear to be development-time leftovers and would cause serious regressions if merged as-is:
Confidence Score: 1/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["nvte_group_quantize()"] --> B["group_quantize_fwd_helper()"]
    B --> C{scaling_mode?}
    C -->|NVTE_NVFP4_1D_SCALING| D["nvfp4::group_quantize_transpose()"]
    C -->|NVTE_MXFP8_1D_SCALING| E["❌ NVTE_ERROR\n(case commented out)"]
    C -->|other| F["NVTE_ERROR"]
    D --> G{use_single_work_grid?}
    G -->|SAME_BOTH_DIMS or\nVARYING_FIRST_DIM| H["Single contiguous work grid\n(ctaid_Y * cols + ctaid_X)"]
    G -->|VARYING_LAST_DIM or\nVARYING_BOTH_DIMS| I["Per-element block grid\n(block_id * ELTS_PER_CHUNK)"]
    H --> J["Persistent kernel\n(static grid-stride loop)"]
    I --> J
    J --> K["decode_job()\nResolve tensor_id, rows, cols"]
    K --> L["is_job_valid()\nBounds check vs offsets"]
    L -->|invalid| M["Drain in-flight TMA\nthen exit"]
    L -->|valid| N["decode_block()\nblock_id_Y, block_id_X, offsets"]
    N --> O["Prime pipeline:\nPrefetch PREFETCH_STAGES stage(s)\nvia TMA global→shared"]
    O --> P["Main stage loop\n(STAGES = TILES_Y × TILES_X)"]
    P --> Q["process_colwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    P --> R["process_rowwise_stage()\nAmax + E4M3 scale + FP4 quantize"]
    Q --> S["store_output_stage()\nTMA shared→global commit"]
    R --> S
    S --> T{Last PREFETCH_STAGES?}
    T -->|yes| U["Prefetch stage-0\nof next job"]
    U --> P
    T -->|next job in stride| V["Advance ctaid\nstatic_next_block_id += stride"]
    V --> K
    T -->|no more jobs| W["cp_async_bulk_wait\nthen exit CTA"]
```
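The static grid-stride scheduling in the flowchart (each CTA starts at its own ctaid and repeatedly advances by the total grid size until the decoded job fails validation) can be modeled on the host. This is an illustrative sketch of the scheduling policy only, not the kernel's code; `jobs_for_cta`, the grid size, and the job count below are all made up for the example:

```cpp
#include <cassert>
#include <vector>

// Host-side model of a persistent kernel's static grid-stride loop:
// CTA `ctaid` processes jobs ctaid, ctaid + grid_size, ctaid + 2*grid_size, ...
// until the decoded job index falls outside the valid range (is_job_valid fails),
// at which point the CTA drains in-flight TMA transfers and exits.
std::vector<int> jobs_for_cta(int ctaid, int grid_size, int num_jobs) {
  std::vector<int> jobs;
  for (int job = ctaid; job < num_jobs; job += grid_size) {
    jobs.push_back(job);  // decode_job() + the stage pipeline would run here
  }
  return jobs;
}
```

Across all CTAs this covers every job exactly once with no dynamic work queue, which is what makes the schedule "static".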
```cmake
# set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
set(CMAKE_CUDA_ARCHITECTURES 100)
```
Development-only architecture restriction should not be merged
The old multi-architecture fallback (75 80 89 90) has been replaced with a Blackwell-only build (100), with the original line left behind as a commented-out breadcrumb. As a result the test binary is compiled only for sm_100, silently dropping all Volta/Ampere/Ada/Hopper targets, and it will fail to build at all on CI runners whose CUDA toolkit predates sm_100 support. This is clearly a local development shortcut and must be reverted before merging.
Suggested change:

```diff
-# set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
-set(CMAKE_CUDA_ARCHITECTURES 100)
+set(CMAKE_CUDA_ARCHITECTURES 75 80 89 90)
```
```cmake
# test_cast.cu
# test_cast_current_scaling.cu
# test_cast_dbias.cu
# test_cast_dbias_dgelu.cu
# test_cast_gated_swiglu.cu
# test_cast_mxfp8_gated_swiglu.cu
# test_qdq.cu
# test_cast_mxfp8.cu
# test_cast_mxfp8_grouped.cu
# test_cast_nvfp4_transpose.cu
test_cast_nvfp4_transpose_grouped.cu
# test_cast_float8blockwise.cu
# test_dequantize_mxfp8.cu
# test_transpose.cu
# test_cast_transpose.cu
# test_cast_transpose_current_scaling.cu
# test_cast_transpose_dbias.cu
# test_cast_transpose_dbias_dgelu.cu
# test_cast_transpose_dgeglu.cu
# test_act.cu
# test_normalization.cu
# test_normalization_mxfp8.cu
# test_memset.cu
# test_multi_cast_transpose.cu
# test_multi_padding.cu
# test_multi_unpadding.cu
# test_causal_softmax.cu
# test_swizzle.cu
# test_swap_first_dims.cu
# test_grouped_gemm.cu
```
All existing operator tests commented out
Every pre-existing test (test_cast.cu, test_cast_mxfp8.cu, test_cast_mxfp8_grouped.cu, test_normalization.cu, etc.) has been commented out, leaving only test_cast_nvfp4_transpose_grouped.cu in the build. This completely disables the existing regression suite and hides any breakage introduced by the changes in quantize.cuh and group_quantize_mxfp8.cuh. These comment-outs appear to be a local development convenience and must be reverted before merging.
The new test file should simply be added to the existing list, not substituted for it.
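Concretely, the fix is to append rather than substitute; a sketch of the restored list (truncated, and the surrounding CMake context is assumed rather than taken from the PR):

```cmake
# Restore the full operator test list and add the new file alongside it.
test_cast.cu
test_cast_mxfp8.cu
test_cast_mxfp8_grouped.cu
test_cast_nvfp4_transpose.cu
test_cast_nvfp4_transpose_grouped.cu  # new grouped NVFP4 test
# ... remaining existing tests unchanged ...
```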
```cpp
  // using namespace detail;

  // const Tensor *input_tensor = convertNVTETensorCheck(input);
  // Tensor *output_tensor = convertNVTETensorCheck(output);

  // // Quantization config
  // QuantizationConfig quant_config_cpp;
  // if (quant_config != nullptr) {
  //   quant_config_cpp = *reinterpret_cast<QuantizationConfig *>(quant_config);
  // }

  // // Noop flag
  // Tensor dummy_tensor;
  // Tensor *noop_tensor = &dummy_tensor;
  // if (quant_config_cpp.noop_tensor != nullptr) {
  //   noop_tensor = convertNVTETensorCheck(quant_config_cpp.noop_tensor);
  // }

  // // Check for unsupported options
  // if (quant_config_cpp.stochastic_rounding) {
  //   NVTE_CHECK(output_tensor->scaling_mode == NVTE_NVFP4_1D_SCALING,
  //              "Stochastic rounding is only supported for NVFP4 quantization.");
  // }

  // NVTE_CHECK(output_tensor->has_data() || output_tensor->has_columnwise_data(),
  //            "Either rowwise or columnwise output data need to be allocated.");

  // // Dispatch to quantization kernel depending on data format
  // switch (output_tensor->scaling_mode) {
  //   case NVTE_DELAYED_TENSOR_SCALING: {
  //     const Tensor *dummy_input_tensor = nullptr;
  //     Tensor *dummy_dbias_tensor = nullptr;
  //     Tensor *dummy_workspace_tensor = nullptr;
  //     if (output_tensor->has_columnwise_data()) {
  //       NVTE_CHECK(output_tensor->has_data(),
  //                  "Quantizing in only the columnwise direction not supported yet!");
  //       if constexpr (!IS_ACT) {
  //         cast_transpose(*input_tensor, *noop_tensor, output_tensor, stream);
  //       } else {
  //         cast_transpose_fused</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, float, ParamOP, OP>(
  //             *input_tensor, dummy_input_tensor, output_tensor, dummy_dbias_tensor,
  //             dummy_workspace_tensor, stream);
  //       }
  //     } else if (output_tensor->has_data()) {
  //       fp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
  //           *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
  //           dummy_workspace_tensor, stream);
  //     }
  //     break;
  //   }
  //   case NVTE_MXFP8_1D_SCALING: {
  //     const Tensor *dummy_input_tensor = nullptr;
  //     Tensor *dummy_dbias_tensor = nullptr;
  //     Tensor *dummy_workspace_tensor = nullptr;
  //     mxfp8::quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
  //         *input_tensor, dummy_input_tensor, noop_tensor, output_tensor, dummy_dbias_tensor,
  //         dummy_workspace_tensor, stream);
  //     break;
  //   }
  //   case NVTE_NVFP4_1D_SCALING: {
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not supported by FWD NVTE_NVFP4_1D_SCALING");

  //     // Check tensors
  //     CheckNoopTensor(*noop_tensor, "cast_noop");
  //     CheckInputTensor(*input_tensor, "input");
  //     CheckOutputTensor(*output_tensor, "output", false);

  //     // Choose kernel
  //     int32_t rows = input_tensor->flat_first_dim();
  //     int32_t cols = input_tensor->flat_last_dim();
  //     auto dtype = input_tensor->dtype();
  //     bool use_optimized_kernel = (dtype == DType::kBFloat16) && (rows % 32 == 0) &&
  //                                 (cols % 32 == 0) && output_tensor->has_data();

  //     // Launch NVFP4 quantize kernel
  //     if (use_optimized_kernel) {
  //       if (quant_config_cpp.nvfp4_2d_quantization) {
  //         nvfp4::quantize_transpose</*use_2d_quantization=*/true>(
  //             *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
  //       } else {
  //         nvfp4::quantize_transpose</*use_2d_quantization*/ false>(
  //             *input_tensor, noop_tensor, output_tensor, &quant_config_cpp, stream);
  //       }
  //     } else {
  //       auto &global_amax = (output_tensor->amax.dptr != nullptr) ? output_tensor->amax
  //                                                                 : output_tensor->columnwise_amax;
  //       quantize_transpose_vector_blockwise_fp4(
  //           /*input=*/input_tensor->data, /*global_amax=*/global_amax,
  //           /*scale_inv=*/output_tensor->scale_inv,
  //           /*scale_inv_t=*/output_tensor->columnwise_scale_inv,
  //           /*output=*/output_tensor->data, /*output_t=*/output_tensor->columnwise_data,
  //           /*epsilon=*/0.0f, /*return_identity=*/output_tensor->has_data(),
  //           /*return_transpose=*/output_tensor->has_columnwise_data(), /*pow2_scale=*/false,
  //           /*swizzled_scale=*/false,
  //           /*use_stochastic_rounding=*/quant_config_cpp.stochastic_rounding,
  //           /*rng_state=*/quant_config_cpp.rng_state,
  //           /*use_2d_quantization=*/quant_config_cpp.nvfp4_2d_quantization,
  //           /*noop_tensor=*/noop_tensor->data, /*stream=*/stream);
  //     }
  //     break;
  //   }
  //   case NVTE_BLOCK_SCALING_2D: {
  //     // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_2D");
  //     bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
  //     float epsilon = quant_config_cpp.amax_epsilon;
  //     quantize_transpose_square_blockwise(
  //         input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
  //         output_tensor->data, output_tensor->columnwise_data, epsilon,
  //         /*return_transpose=*/output_tensor->has_columnwise_data(), force_pow_2_scales,
  //         /*noop_tensor=*/noop_tensor->data, stream);
  //     break;
  //   }
  //   case NVTE_BLOCK_SCALING_1D: {
  //     // TODO(kwyss): IS_ACT, ParamOP, OP parameters support.
  //     NVTE_CHECK(!IS_ACT, "IS_ACT is not implemented for FWD NVTE_BLOCK_SCALING_1D");
  //     bool force_pow_2_scales = quant_config_cpp.force_pow_2_scales;
  //     float epsilon = quant_config_cpp.amax_epsilon;
  //     FP8BlockwiseRowwiseOption rowwise_option = FP8BlockwiseRowwiseOption::NONE;
  //     FP8BlockwiseColumnwiseOption columnwise_option = FP8BlockwiseColumnwiseOption::NONE;
  //     if (output_tensor->has_data()) {
  //       rowwise_option = FP8BlockwiseRowwiseOption::ROWWISE_GEMM_READY;
  //     }
  //     if (output_tensor->has_columnwise_data()) {
  //       columnwise_option = FP8BlockwiseColumnwiseOption::COLUMNWISE_GEMM_READY;
  //     }
  //     quantize_transpose_vector_blockwise(
  //         input_tensor->data, output_tensor->scale_inv, output_tensor->columnwise_scale_inv,
  //         output_tensor->data, output_tensor->columnwise_data, epsilon, rowwise_option,
  //         columnwise_option, force_pow_2_scales, noop_tensor->data, stream);
  //     break;
  //   }
  //   default:
  //     NVTE_ERROR("Not implemented scaling mode: " + to_string(output_tensor->scaling_mode) + ".");
  // }
}

template <bool IS_DBIAS, bool IS_DACT, typename ParamOP, float (*OP)(float, const ParamOP &)>
void quantize_bwd_helper(const NVTETensor grad, const NVTETensor input, NVTETensor output,
                         NVTETensor dbias, NVTETensor workspace,
                         const NVTEQuantizationConfig quant_config, cudaStream_t stream)
```
quantize_fwd_helper body completely commented out — function is now a no-op
The entire implementation of quantize_fwd_helper (covering NVTE_DELAYED_TENSOR_SCALING, NVTE_MXFP8_1D_SCALING, NVTE_NVFP4_1D_SCALING, NVTE_BLOCK_SCALING_2D, and NVTE_BLOCK_SCALING_1D) has been replaced with block comments, leaving the function as an empty stub. Any call to single-tensor forward quantization (nvte_quantize, nvte_quantize_dbias, etc.) now silently does nothing.
This is a severe regression that would cause all non-grouped quantization code paths to silently produce uninitialized output. These commented-out lines must be restored.
```cpp
// case NVTE_MXFP8_1D_SCALING: {
//   mxfp8::group_quantize</*IS_DBIAS=*/false, /*IS_DACT=*/false, IS_ACT, ParamOP, OP>(
//       input_tensor, activations_tensor, noop_tensor, output_tensor, dbias_tensor,
//       workspace_tensor, stream);
//   break;
// }
case NVTE_NVFP4_1D_SCALING: {
```
MXFP8 grouped quantization dispatch disabled
The NVTE_MXFP8_1D_SCALING case has been commented out from both group_quantize_fwd_helper (line ~410) and group_quantize_bwd_helper (line ~464). As a result, any call to grouped MXFP8 forward or backward quantization falls through to the default branch and calls NVTE_ERROR, breaking existing MXFP8 grouped workflows.
If this is intentional (e.g., MXFP8 is being refactored in parallel), the PR description and checklist should call this out explicitly as a breaking change, and the NVTE_MXFP8_1D_SCALING cases should at least be preserved in a functional state or guarded by a feature flag.
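If a feature flag is the chosen route, the shape of the guard could look like the following. This is a hedged sketch, not the PR's code: the macro name `NVTE_ENABLE_GROUPED_MXFP8`, the simplified enum, and the `dispatch` helper are all invented for illustration; only the case labels mirror the real switch:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical feature flag; a real integration would set this from the build system.
#define NVTE_ENABLE_GROUPED_MXFP8 1

enum ScalingMode { MXFP8_1D, NVFP4_1D };

// Simplified stand-in for group_quantize_fwd_helper's dispatch switch:
// the MXFP8 case stays in the code, guarded, instead of being commented out.
std::string dispatch(ScalingMode mode) {
  switch (mode) {
#if NVTE_ENABLE_GROUPED_MXFP8
    case MXFP8_1D:
      return "mxfp8::group_quantize";  // preserved, compile-time switchable
#endif
    case NVFP4_1D:
      return "nvfp4::group_quantize_transpose";
    default:
      throw std::runtime_error("Not implemented scaling mode");
  }
}
```

With the flag off, MXFP8 falls through to the explicit error, which at least makes the breakage deliberate and greppable rather than a silent comment-out.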
```cpp
fp4e2m1x2 casted_to_e2m1_pair(scaled_elt_pair);
output[idx_pair] = casted_to_e2m1_pair;

const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);
```
Unused variable truncated_pair
truncated_pair is computed from cvt_fp4x2_to_double2(casted_to_e2m1_pair) but its value is never read. This will generate a compiler warning and the dead computation should be removed.
Suggested change (remove the dead computation):

```diff
-const double2 truncated_pair = cvt_fp4x2_to_double2(casted_to_e2m1_pair);
```
```cpp
void dump_nvfp4_tensor_data(const std::string& prefix,
                            const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
                            const int rows, const int cols) {
  std::string test_file = prefix + "_test.txt";
  std::string ref_file = prefix + "_ref.txt";
  std::string diff_file = prefix + "_diff.txt";

  std::ofstream test_out(test_file);
  std::ofstream ref_out(ref_file);
  std::ofstream diff_out(diff_file);

  if (test_out.is_open() && ref_out.is_open() && diff_out.is_open()) {
    for (int i = 0; i < rows; ++i) {
      for (int j = 0; j < cols; j += 2) {
        const int idx = i * cols + j;
        double2 test_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[idx/2]));
        double2 ref_data_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[idx/2]));

        for (int k = 0; k < 2; ++k) {
          const double t = (k == 0 ? test_data_pair.x : test_data_pair.y);
          const double r = (k == 0 ? ref_data_pair.x : ref_data_pair.y);
          const int pos = idx + k;

          test_out << "pos[" << pos << "] = " << t << std::endl;
          ref_out << "pos[" << pos << "] = " << r << std::endl;
          diff_out << "pos[" << pos << "] test=" << t << " ref=" << r
                   << " abs_diff=" << fabs(t - r)
                   << " rel_diff=" << (r == 0 ? 0.0 : fabs((t - r) / r)) << std::endl;
        }
      }
    }
    std::cout << "DEBUG: Dumped tensor data to files: " << test_file << ", " << ref_file << ", " << diff_file << std::endl;
  } else {
    std::cout << "WARNING: Could not open files for tensor data dump" << std::endl;
  }
}

void print_detailed_tensor_comparison(const std::string& name,
                                      const fp4e2m1 *test_data, const fp4e2m1 *ref_data,
                                      const int rows, const int cols) {
  printf("\n=== DETAILED COMPARISON for %s (%d×%d = %d elements) ===\n",
         name.c_str(), rows, cols, rows * cols);

  const int total_elements = rows * cols;
  const int check_count = 128;

  printf("--- FIRST %d ELEMENTS ---\n", check_count);
  printf("Index | Test_Value | Ref_Value | Match\n");
  printf("------|---------------|---------------|-------\n");
  for (int i = 0; i < std::min(check_count, total_elements); ++i) {
    double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
    double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

    double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
    double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
    bool match = (fabs(t - r) < 1e-6);

    printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
  }

  if (total_elements > 2 * check_count) {
    printf("\n--- LAST %d ELEMENTS ---\n", check_count);
    printf("Index | Test_Value | Ref_Value | Match\n");
    printf("------|---------------|---------------|-------\n");
    for (int i = total_elements - check_count; i < total_elements; ++i) {
      double2 test_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&test_data[i/2]));
      double2 ref_pair = cvt_fp4x2_to_double2(*reinterpret_cast<const fp4e2m1x2*>(&ref_data[i/2]));

      double t = (i % 2 == 0) ? test_pair.x : test_pair.y;
      double r = (i % 2 == 0) ? ref_pair.x : ref_pair.y;
      bool match = (fabs(t - r) < 1e-6);

      printf("%5d | %13.6f | %13.6f | %s\n", i, t, r, match ? "✓" : "✗");
    }
  }
  printf("==================================\n");
}
```
Debug dump and verbose comparison utilities should be removed or hidden before merging
dump_nvfp4_tensor_data (line 234) writes test/ref/diff text files to disk during CI runs. print_detailed_tensor_comparison (line 271) prints element-by-element tables to stdout. Both functions are only conditionally exercised and print_detailed_tensor_comparison is already fully commented out at the call site (lines 323–324). These are development-time diagnostic aids that add noise, may fail in read-only CI environments, and were not part of any documented public API. They should be removed from the final PR, or at minimum wrapped in #ifdef NVTE_DEBUG_DUMP guards so they never run in CI.
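If the helpers are kept, the guard suggested above might look like the following sketch. `NVTE_DEBUG_DUMP` is the guard name proposed in the comment; the simplified `dump_debug` function and its boolean return are invented here purely to illustrate the pattern:

```cpp
#include <cassert>
#include <string>

// Hypothetical guard macro: define NVTE_DEBUG_DUMP at build time to enable dumps.
#ifdef NVTE_DEBUG_DUMP
bool dump_debug(const std::string& prefix) {
  // file-writing diagnostics (dump_nvfp4_tensor_data etc.) would run here
  return true;  // dump performed
}
#else
inline bool dump_debug(const std::string&) { return false; }  // no-op in CI builds
#endif
```

In a default build the stub compiles to nothing, so CI never touches the filesystem or floods stdout, while developers can still opt in locally.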
Description
This PR adds a persistent grouped NVFP4 quantization + transpose kernel with static scheduling.
It is built on top of PR #2738 ([Common] Persistent Grouped MXFP8 quantization kernel).
Type of change
Changes
Checklist: