
NVFP4 dequantization #505

Open

aris134 wants to merge 63 commits into dev from amartin/nvfp4-dequant

Conversation

@aris134 commented Mar 25, 2026

Description

Fixes https://github.com/ROCm/frameworks-internal/issues/15998

Enable NVFP4 dequantization on AMD GPUs (gfx950) and add a unit test.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Enable compilation of NVFP4 dequantization kernel for AMD GPU
  • Add unit test that verifies NVFP4 dequantization works on gfx950

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@aris134 self-assigned this Mar 26, 2026
@aris134 marked this pull request as ready for review March 26, 2026 13:16
ASSERT_EQ(err, hipSuccess) << hipGetErrorString(err);

const float amax = 1.0f;
input.set_tensor_amax(amax);
Contributor:

set_scale() instead?

Collaborator:

Yeah, I think for dequantization, the scale is needed

Collaborator @ipanfilo left a comment:

This is based on PR #472. To avoid reviewing the same changes twice, let's wait for that PR to merge first.

}

std::vector<std::pair<size_t, size_t>> tensor_dims = {
{32, 32},
Collaborator:

Like mxfp8, NVFP4 has its own scale_inv layout agreement for rowwise/colwise data:

constexpr size_t nvfp4_scale_tensor_alignment_Y_rowwise = 128;
constexpr size_t scale_tensor_alignment_X_rowwise = 4;

Take tensor dims {32, 32} as an example: the rowwise scale_inv will not be a contiguous array for the first and second rows, because nvfp4_scale_tensor_alignment_Y_rowwise = 128, so padding is needed from 32/16 = 2 to 128 per row.

Comment on lines +154 to +155
generate_data(host_input.get(), rows, cols, gen, fp4_dis);
generate_scales(host_scales.get(),
Collaborator:

According to the layout alignment requirement, the data and scales for NVFP4 are not contiguous in memory. Probably we can reuse the NVFP4 quantization here to generate a valid NVFP4 tensor.

const size_t blocks_per_row = cols / block_size_1d;

Tensor input("input", std::vector<size_t>{rows, cols}, itype,
true, false, NVTE_NVFP4_1D_SCALING);
Collaborator:

Try to also test with 2D scaling and with columnwise data.


6 participants