Conversation

@protonu (Collaborator) commented Dec 23, 2025

This adds quantized scaled MM ops to our Python benchmark.

This will create/quantize the module to:

        (feed_forward): Llama4MoE(
          (gate): NVFP4InferenceLinear()
          (shared_experts): NVFP4InferenceSwiGLU(
            (gate_proj): NVFP4InferenceLinear()
            (up_proj): NVFP4InferenceLinear()
            (down_proj): NVFP4InferenceLinear()
          )
          (routed_experts): NVFP4InferenceGroupedSwiGLU(
            (gate_proj): NVFP4InferenceGroupedLinear()
            (up_proj): NVFP4InferenceGroupedLinear()
            (down_proj): NVFP4InferenceGroupedLinear()
          )
        )

This also fixes a small bug: when inferring the output allocation, we no longer call tensor_.view when one of the splits is not a divisible split. The problem shows up when we pad the inner dimension by 4 and the "padded" outer split dimension is one.
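
A minimal PyTorch illustration of why the guard is needed (plain torch, not the nvFuser code itself): splitting an inner extent of 3 by a factor of 4 pads the allocation to 1 x 4 = 4 elements, but the underlying tensor still holds only 3, so view() must fail.

```python
import torch

# Minimal illustration: an extent-3 dimension split by 4 implies an
# allocation shape of [1, 4] (4 elements), but the tensor holds only 3.
t = torch.arange(3)
try:
    t.view(1, 4)
except RuntimeError as e:
    print(e)  # shape '[1, 4]' is invalid for input of size 3
```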

@protonu protonu changed the title Pbasu/nvfp4 linear bench [DO NOT REVIEW] benchmark for nvfp4 scaled mm Dec 23, 2025
github-actions bot commented Jan 5, 2026

Review updated until commit d6c1cb6

Description

  • Add nvfp4 scaled MM operations to Python benchmark framework

  • Implement NVFP4InferenceLinear and NVFP4InferenceSwiGLU classes for quantized inference

  • Fix tensor view bug when split dimensions are not divisible

  • Integrate new layers into the Llama4MoE quantization pipeline (sketched below)
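
A hedged sketch of what that integration typically looks like; the class and method names come from this PR, but the traversal itself is illustrative, not the PR's exact code:

```python
import torch.nn as nn

# Hedged sketch: recursively swap eligible submodules for their NVFP4
# inference counterparts. Llama4MoE, NVFP4InferenceLinear, and
# NVFP4InferenceSwiGLU are the classes named in this PR and are assumed
# to be in scope; the traversal logic is an assumption.
def _quantize_llama4(model: nn.Module) -> nn.Module:
    for child in model.children():
        if isinstance(child, Llama4MoE):
            child.gate = NVFP4InferenceLinear.from_linear(child.gate)
            child.shared_experts = NVFP4InferenceSwiGLU.from_swiglu(
                child.shared_experts
            )
        _quantize_llama4(child)
    return model
```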

Changes walkthrough

Relevant files:

Bug fix: csrc/runtime/allocations.cpp (fix tensor view bug for non-divisible splits, +5/-1)

  • Add a divisibility check before calling tensor_.view() for merged dimensions
  • Fix the bug where the tensor view fails when split dimensions are not divisible

Enhancement: benchmarks/python/benchmark_inference.py (add nvfp4 scaled MM benchmarking support, +57/-0)

  • Register the nvfp4 scaled MM custom operation with Thunder
  • Add nvfp4_scaled_mm_translator for nvfuser integration
  • Extend quantization to include NVFP4InferenceSwiGLU and NVFP4InferenceLinear
  • Add quantization of the gate projection in Llama4MoE modules

Enhancement: benchmarks/python/layers_for_inference_benchmark.py (implement NVFP4 inference layer classes, +154/-0; the custom-op registration pattern is sketched after this list)

  • Add the nvfuser_f16a_nvfp4weight_scaled_mm custom operation with a fake implementation
  • Implement the NVFP4InferenceLinear class for quantized linear layers
  • Implement the NVFP4InferenceSwiGLU class for quantized SwiGLU layers
  • Add from_linear and from_swiglu conversion methods
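
For reviewers unfamiliar with the fake-implementation pattern, here is a hedged sketch of how such an op is typically registered with torch.library so tracing can propagate shapes without running the real kernel. The namespace, signature, and dequantize_nvfp4 helper below are assumptions, not the PR's actual code:

```python
import torch

# Hedged sketch of a custom op plus a "fake" (meta) implementation.
@torch.library.custom_op("nvf_bench::f16a_nvfp4weight_scaled_mm", mutates_args=())
def f16a_nvfp4weight_scaled_mm(
    activation: torch.Tensor,    # [m, k] fp16/bf16 activations
    fp4_weight: torch.Tensor,    # [n, k // 2] FP4 weights, two packed per byte
    weight_scale: torch.Tensor,  # per-block scaling factors
) -> torch.Tensor:
    # Reference path: dequantize, then use a plain linear.
    hp_weight = dequantize_nvfp4(fp4_weight, weight_scale)  # hypothetical helper
    return torch.nn.functional.linear(activation, hp_weight)

@f16a_nvfp4weight_scaled_mm.register_fake
def _(activation, fp4_weight, weight_scale):
    # Only shapes/dtypes matter during tracing: [m, k] x [n, k] -> [m, n].
    return activation.new_empty(activation.shape[0], fp4_weight.shape[0])
```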

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 No relevant tests
⚡ Recommended focus areas for review

Missing Performance Data

The PR description says this is a benchmark for nvfp4 scaled MM, but no benchmark results or performance data are provided. Per the guidelines, PRs should include current performance metrics, expected performance gains, and a gap analysis. The PR should demonstrate the performance benefit of the new nvfp4 scaled MM operations over existing implementations.

Incomplete Translator Implementation

The nvfp4_scaled_mm_translator function appears to be a basic wrapper that calls existing PyTorch operations rather than using nvfuser's specialized hardware acceleration. The translator should leverage nvfuser's optimized implementations; as written, it may not deliver the intended performance benefits.

Bug Fix Validation

The bug fix for tensor view allocation with non-divisible splits needs validation. While the logic appears sound (checking divisibility before calling view), the PR should include test cases that specifically exercise this code path to confirm the fix works and introduces no regressions.

Test failures

• (High, 45) NVFuser internal assertion failures (evaluate() const-prop) in AllocationDomainTest, LowerCollectiveTest, and MultiDeviceTest across multiple suites:

    AllocationDomainTest.NCHW4d_To_NHWC2d
    AllocationDomainTest.NHWC2d_To_NHWC2d_cacheAfter
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_WithGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_128x64_WithGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_WithGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_NoGlobalScale_WithSwizzle
    BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_128x64_NoGlobalScale_WithSwizzle
    ... with 12 more test failures omitted; check internal logs.

• (Medium, 35) NVFuser internal assert on Val::evaluate (const value inference) in multiple *multidevice* and *stream* tests:

    tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
    tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
    tests.python.multidevice.test_dtensor.test_plus_one
    tests.python.multidevice.test_matmul.test_sequence_parallel_linear
    tests.python.multidevice.test_multidevice.test_binary
    tests.python.multidevice.test_multidevice.test_inner_reduction
    tests.python.multidevice.test_multidevice.test_insert_resharding_after
    tests.python.multidevice.test_multidevice.test_privatize_squeeze
    tests.python.multidevice.test_overlap.test_row_parallel_linear_forward

@protonu protonu changed the title [DO NOT REVIEW] benchmark for nvfp4 scaled mm Benchmark for nvfp4 scaled mm Jan 5, 2026
@protonu protonu marked this pull request as ready for review January 5, 2026 22:21
greptile-apps bot commented Jan 5, 2026

Greptile Summary

This PR adds nvfp4 quantized scaled MM operations to the Python benchmark infrastructure. It implements NVFP4InferenceLinear and NVFP4InferenceSwiGLU classes that use the custom nvf_cutlass::f16a_nvfp4weight_scaled_mm operation for efficient inference with FP4-quantized weights.

Key changes:

• Added the nvfuser_f16a_nvfp4weight_scaled_mm custom op with fake registration for graph compilation
• Implemented the NVFP4InferenceLinear layer that performs FP4-quantized matrix multiplication
• Implemented NVFP4InferenceSwiGLU, which composes three quantized linear layers
• Extended _quantize_llama4() to quantize SwiGLU modules and the gate projection layer
• Fixed a C++ allocation bug where tensor_.view() was incorrectly called on non-divisible splits

The implementation follows the existing pattern for grouped operations and properly registers translators with Thunder's nvfuser backend.
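
A hedged sketch of the forward paths this summary and the sequence diagram below describe; the buffer names and the custom op's exact signature are assumptions, not the PR's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NVFP4InferenceLinear(nn.Module):
    # Sketch only: fp4_weight and weight_scale are assumed to be buffers
    # produced by from_linear(); the op signature is an assumption.
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        orig_shape = hidden_states.shape
        # Flatten leading dims so the scaled MM sees a 2D activation.
        activation = hidden_states.view(-1, orig_shape[-1])
        out = torch.ops.nvf_cutlass.f16a_nvfp4weight_scaled_mm(
            activation, self.fp4_weight, self.weight_scale
        )
        return out.view(*orig_shape[:-1], out.shape[-1])

class NVFP4InferenceSwiGLU(nn.Module):
    # Composes the three quantized projections in the usual SwiGLU form.
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```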

Confidence Score: 4/5

• The PR is mostly safe to merge, with known shape-handling issues already flagged
• The implementation follows established patterns and includes proper validation. The C++ bug fix is well scoped and addresses a specific edge case. Two previously identified issues are not critical for the initial merge: an output-shape documentation mismatch and dtype consistency in the fake function. Both are known and can be addressed in follow-ups if they cause problems.
• benchmarks/python/layers_for_inference_benchmark.py has known issues already flagged in a previous review

Important Files Changed

benchmarks/python/benchmark_inference.py: adds nvfp4 scaled MM op registration and quantization logic for SwiGLU and gate projection layers
benchmarks/python/layers_for_inference_benchmark.py: implements NVFP4InferenceLinear and NVFP4InferenceSwiGLU with custom ops; has a shape-handling issue in forward (already flagged)
csrc/runtime/allocations.cpp: fixes a bug by checking divisibility before calling tensor_.view() for non-divisible splits with padding

Sequence Diagram

    sequenceDiagram
        participant User
        participant Benchmark
        participant Quantize as _quantize_llama4
        participant Model as Llama4MoE
        participant NVFP4Linear as NVFP4InferenceLinear
        participant CustomOp as nvf_cutlass::f16a_nvfp4weight_scaled_mm
        participant Thunder as Thunder/nvfuser
    
        User->>Benchmark: Run benchmark
        Benchmark->>Quantize: quantize model
        Quantize->>Model: Find SwiGLU modules
        Quantize->>NVFP4Linear: Replace with NVFP4InferenceSwiGLU.from_swiglu()
        Note over NVFP4Linear: Quantizes weights to FP4 format
        Quantize->>Model: Find Llama4MoE.gate layer
        Quantize->>NVFP4Linear: Replace with NVFP4InferenceLinear.from_linear()
        Note over NVFP4Linear: Stores fp4_weight, scaling factors
        
        User->>Model: Forward pass with inputs
        Model->>NVFP4Linear: forward(hidden_states)
        NVFP4Linear->>NVFP4Linear: Flatten to 2D: view(-1, in_features)
        NVFP4Linear->>CustomOp: f16a_nvfp4weight_scaled_mm(activation, fp4_weight, scales)
        CustomOp->>CustomOp: Dequantize FP4 weights to high precision
        CustomOp->>CustomOp: torch.nn.functional.linear(activation, hp_weight)
        CustomOp-->>NVFP4Linear: Output tensor (2D)
        NVFP4Linear-->>Model: Return output
        
        Note over Benchmark,Thunder: During compilation
        Thunder->>CustomOp: Register nvfp4_scaled_mm_translator
        Thunder->>Thunder: Translate to nvfuser operations
        Thunder->>Thunder: Generate optimized kernel
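
The "Quantizes weights to FP4 format" note corresponds to NVFP4's layout: FP4 (e2m1) element values with a scale stored per 16-element block. A hedged sketch of per-block quantization follows; the PR's actual conversion code may differ:

```python
import torch

# Representable magnitudes of FP4 e2m1. NVFP4 stores one scale per
# 16-element block; this sketch omits bit-packing and the fp8 scale dtype.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_blocks(weight_row: torch.Tensor, block_size: int = 16):
    # Assumes the row length is a multiple of block_size.
    blocks = weight_row.view(-1, block_size)
    # Scale each block so its max magnitude lands on the grid's top (6.0).
    scales = blocks.abs().amax(dim=1, keepdim=True) / 6.0
    scaled = blocks / scales.clamp(min=1e-12)
    # Round each magnitude to the nearest grid point, keeping the sign.
    idx = (scaled.abs().unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    return E2M1_GRID[idx] * scaled.sign(), scales
```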
    

greptile-apps bot left a review comment: 3 files reviewed, 1 comment

@protonu (Collaborator, Author) commented Jan 5, 2026

!test

greptile-apps bot left a review comment: 3 files reviewed, 1 comment

@protonu (Collaborator, Author) commented Jan 5, 2026

!test

@protonu protonu requested review from jjsjann123 and tbqh January 5, 2026 22:51
