Benchmark for nvfp4 scaled mm #5737
Conversation
Review updated until commit d6c1cb6

Relevant files:
- Bug fix
- Enhancement
PR Reviewer Guide
Here are some key observations to aid the review process:
- 🧪 No relevant tests
- ⚡ Recommended focus areas for review: Missing Performance Data
Test failures
- (High, 45) NVFuser internal assertion failures (evaluate() const-prop) in AllocationDomainTest, LowerCollectiveTest, and MultiDeviceTest across multiple suites. Failures span A100, GB200, and H100, including distributed runs:
  - AllocationDomainTest.NCHW4d_To_NHWC2d
  - AllocationDomainTest.NHWC2d_To_NHWC2d_cacheAfter
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_NoGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_1024x1024_WithGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_128x64_WithGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x128_NoGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_NoGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/__bfloat_2048x2048_WithGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_1024x1024_NoGlobalScale_WithSwizzle
  - BlockQuantizationSchedulingTestSuite/BlockQuantizationSchedulingTest.AutoScheduleSingleOp/float_128x64_NoGlobalScale_WithSwizzle
  - ... with 12 more test failures omitted. Check internal logs.
- (Medium, 35) NVFuser internal assert on Val::evaluate (const value inference) in multiple *multidevice* and *stream* tests. Failures span A100, GB200, and H100, including distributed runs:
  - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=eager]
  - tests.python.direct.test_stream.test_two_matmuls_inlinable[nvfuser_direct_test=lru_cache]
  - tests.python.multidevice.test_dtensor.test_plus_one
  - tests.python.multidevice.test_matmul.test_sequence_parallel_linear
  - tests.python.multidevice.test_multidevice.test_binary
  - tests.python.multidevice.test_multidevice.test_inner_reduction
  - tests.python.multidevice.test_multidevice.test_insert_resharding_after
  - tests.python.multidevice.test_multidevice.test_privatize_squeeze
  - tests.python.multidevice.test_overlap.test_row_parallel_linear_forward
Greptile Summary

This PR adds nvfp4 quantized scaled MM operations to the Python benchmark infrastructure. Key changes:
- NVFP4InferenceLinear and NVFP4InferenceSwiGLU modules that store packed fp4 weights and their scaling factors
- a quantization pass that replaces the model's SwiGLU modules and the Llama4MoE gate layer
- a nvf_cutlass::f16a_nvfp4weight_scaled_mm custom op with an nvfp4_scaled_mm_translator registered for Thunder's nvfuser backend
The implementation follows the existing pattern for grouped operations and properly registers translators with Thunder's nvfuser backend.

Confidence Score: 4/5
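For reference, here is a minimal sketch of what a PyTorch-level fallback for the custom op could look like, following the dequantize-then-linear path in the sequence diagram below. Only the op name comes from this PR; the signature, the packed-FP4 layout (two E2M1 codes per byte, low nibble first), and the 16-element scale blocks are assumptions.

```python
import torch

# Magnitudes representable in FP4 E2M1 (sign is a separate bit).
_E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

@torch.library.custom_op("nvf_cutlass::f16a_nvfp4weight_scaled_mm", mutates_args=())
def f16a_nvfp4weight_scaled_mm(
    activation: torch.Tensor,    # [M, K] fp16/bf16 activation
    fp4_weight: torch.Tensor,    # [N, K // 2] uint8, two FP4 codes per byte (assumed)
    block_scales: torch.Tensor,  # [N, K // 16] per-block scales (assumed block size)
    global_scale: float,
) -> torch.Tensor:
    # Unpack 4-bit codes, then split each into a sign bit and an E2M1 magnitude.
    codes = torch.stack([fp4_weight & 0xF, fp4_weight >> 4], dim=-1).flatten(-2)
    sign = torch.where((codes & 0x8) != 0, -1.0, 1.0)
    w = sign * _E2M1.to(codes.device)[(codes & 0x7).long()]  # [N, K] high precision
    # Apply per-block and global scales, then fall back to a plain linear.
    w = (w.view(w.shape[0], -1, 16) * block_scales.unsqueeze(-1)).flatten(1)
    hp_weight = (w * global_scale).to(activation.dtype)
    return torch.nn.functional.linear(activation, hp_weight)
```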
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant Benchmark
participant Quantize as _quantize_llama4
participant Model as Llama4MoE
participant NVFP4Linear as NVFP4InferenceLinear
participant CustomOp as nvf_cutlass::f16a_nvfp4weight_scaled_mm
participant Thunder as Thunder/nvfuser
User->>Benchmark: Run benchmark
Benchmark->>Quantize: quantize model
Quantize->>Model: Find SwiGLU modules
Quantize->>NVFP4Linear: Replace with NVFP4InferenceSwiGLU.from_swiglu()
Note over NVFP4Linear: Quantizes weights to FP4 format
Quantize->>Model: Find Llama4MoE.gate layer
Quantize->>NVFP4Linear: Replace with NVFP4InferenceLinear.from_linear()
Note over NVFP4Linear: Stores fp4_weight, scaling factors
User->>Model: Forward pass with inputs
Model->>NVFP4Linear: forward(hidden_states)
NVFP4Linear->>NVFP4Linear: Flatten to 2D: view(-1, in_features)
NVFP4Linear->>CustomOp: f16a_nvfp4weight_scaled_mm(activation, fp4_weight, scales)
CustomOp->>CustomOp: Dequantize FP4 weights to high precision
CustomOp->>CustomOp: torch.nn.functional.linear(activation, hp_weight)
CustomOp-->>NVFP4Linear: Output tensor (2D)
NVFP4Linear-->>Model: Return output
Note over Benchmark,Thunder: During compilation
Thunder->>CustomOp: Register nvfp4_scaled_mm_translator
Thunder->>Thunder: Translate to nvfuser operations
Thunder->>Thunder: Generate optimized kernel
```
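The forward path from the diagram, as a hedged Python sketch. The class and op names follow the diagram; the attribute names (fp4_weight, block_scales, global_scale, in_features, out_features) and the omitted __init__ are assumptions.

```python
import torch

class NVFP4InferenceLinear(torch.nn.Module):
    """Sketch of the forward path shown above; the constructor (which would
    store fp4_weight, block_scales, and global_scale, e.g. via from_linear)
    is omitted."""

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Flatten all leading dimensions so the custom op sees a 2D activation.
        activation = hidden_states.view(-1, self.in_features)
        out = torch.ops.nvf_cutlass.f16a_nvfp4weight_scaled_mm(
            activation, self.fp4_weight, self.block_scales, self.global_scale
        )
        # Restore the original leading shape on the 2D output.
        return out.view(*hidden_states.shape[:-1], self.out_features)
```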
3 files reviewed, 1 comment
…to pbasu/nvfp4_linear_bench
!test
3 files reviewed, 1 comment
!test
This adds quantized scaled MM ops to our Python benchmark.
It creates/quantizes the module by replacing the model's SwiGLU modules with NVFP4InferenceSwiGLU and the Llama4MoE gate layer with NVFP4InferenceLinear, as sketched below.
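A hypothetical sketch of that replacement pass. SwiGLU, Llama4MoE, NVFP4InferenceSwiGLU, and NVFP4InferenceLinear are this PR's names and are not defined here; the from_swiglu/from_linear signatures are assumptions.

```python
import torch.nn as nn

def quantize_llama4_sketch(model: nn.Module) -> nn.Module:
    for module in model.modules():
        for name, child in list(module.named_children()):
            if isinstance(child, SwiGLU):
                # Quantize expert MLPs: weights become packed FP4 plus scales.
                setattr(module, name, NVFP4InferenceSwiGLU.from_swiglu(child))
        if isinstance(module, Llama4MoE):
            # The MoE router gate becomes an NVFP4 linear.
            module.gate = NVFP4InferenceLinear.from_linear(module.gate)
    return model
```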
There was also a small bug fixed: when inferring the output allocation, we no longer call `tensor_.view` when one of the splits is not a divisible split. This problem shows up when we pad the inner dimension by 4 and the "padded" outer split dimension is one.
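A standalone PyTorch illustration (not nvFuser internals) of why a non-divisible split cannot be realized as a view:

```python
import torch

t = torch.randn(2, 6)

t.view(2, 3, 2)        # divisible split: 6 = 3 * 2, so a zero-copy view works

try:
    # Non-divisible split: splitting 6 by 4 needs ceil(6/4) = 2 outer elements
    # and padding to 2 * 4 = 8 inner elements, which view() cannot express.
    t.view(2, 2, 4)
except RuntimeError as e:
    print(e)           # shape '[2, 2, 4]' is invalid for input of size 12
```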