
Conversation


@wj-laskowski wj-laskowski commented Jan 16, 2026

Proposed changes

Added the bias + bnorm + clamp fused operation for the WMMA grouped conv fwd large tensor path (FP16/BF16 data types and NHWGCxGKYXC layout).

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to the REGRESSION_TESTS list defined at the top of tests/CMakeLists.txt, if the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers to understand the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

The following operations are added for the FP16/BF16 data types and the NHWGCxGKYXC layout (a rough sketch of the fused per-element computation follows the list).
- grouped_conv2d_fwd_bias_bnorm_clamp
- grouped_conv3d_fwd_bias_bnorm_clamp
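
To illustrate what the fused operation computes per output element, here is a rough sketch (hypothetical names, host-side only; not the CK element-wise operator added in this PR) of a bias add, followed by batch-norm in its inference form, followed by a clamp:

    // Rough sketch (hypothetical): per-element epilogue of bias add +
    // batchnorm (inference form) + clamp. The real operator is a device-side
    // element-wise functor; only the arithmetic is illustrated here.
    struct BiasBnormClampSketch
    {
        float mean_, inv_std_, gamma_, beta_; // per-channel bnorm parameters
        float lo_, hi_;                       // clamp bounds

        float operator()(float conv_out, float bias) const
        {
            const float biased = conv_out + bias;
            const float normed = gamma_ * (biased - mean_) * inv_std_ + beta_;
            return normed < lo_ ? lo_ : (normed > hi_ ? hi_ : normed);
        }
    };
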
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-large-tensor-bias_bnorm_clamp branch from 7b0341d to b5c541f on January 19, 2026 09:35
@wj-laskowski wj-laskowski changed the title from "Streamhpc/grouped conv fwd wmma large tensor bias bnorm clamp" to "WMMA grouped conv fwd large tensor bias bnorm clamp" on Jan 19, 2026
@wj-laskowski wj-laskowski marked this pull request as ready for review January 19, 2026 12:58

    gemm_desc_kernel_args_.At(valid_gemms_count_) = new_args;
    auto* gemm_args = &gemm_desc_kernel_args_.At(valid_gemms_count_);
    new(gemm_args) GemmArgs{p_as_grid,

@chris-tsiaousis-hpc chris-tsiaousis-hpc Jan 19, 2026

Nice use of placement new here. The basic idea is good, but it is difficult to maintain, because placement new comes with a caveat: you have to call the destructor manually.
In our case the implicitly defaulted destructor probably doesn't do anything useful (I haven't checked that), but the implicitly deleted copy assignment operator that made you use this approach is concerning.
My point is basically this: if a future dev changes something that requires the GemmArgs destructor to be called, they could very easily overlook it.
My proposal is to define an Emplace function in the Array struct to handle this case:

    // Enabled only when TData is nothrow-constructible from Args, so a throwing
    // constructor can never leave a destroyed-but-unreconstructed slot behind.
    template <typename... Args>
    auto Emplace(ck::index_t i, Args&&... args)
        -> std::enable_if_t<std::is_nothrow_constructible_v<TData, Args...>>
    {
        if(i >= ck::index_t{0} && i < NSize)
        {
            mData[i].~TData();                                 // destroy the current element
            new(mData + i) TData(std::forward<Args>(args)...); // construct the new one in place
        }
    }

On another note... I see this way of initializing structs quite often and I can boldly say that I really don't like it. In C++17 we can already use designated initializers (a compiler extension there; they became standard in C++20), given that all members are listed in declaration order. This makes the code more explicit, more robust and safer. So I'd change this section of the code to:

    gemm_desc_kernel_args_.Emplace(valid_gemms_count_,
                                   GemmArgs{.a_ptrs_         = p_as_grid,
                                            .b_ptrs_         = p_bs_grid,
                                            .ds_ptrs_        = p_ds_grid,
                                            .e_ptr_          = p_e_grid,
                                            .a_element_op_   = a_element_op_,
                                            .b_element_op_   = b_element_op_,
                                            .cde_element_op_ = cde_element_op_,
                                            .M_              = gemm_m,
                                            .N_              = gemm_n,
                                            .a_grid_desc_    = a_grid_desc,
                                            .b_grid_desc_    = b_grid_desc,
                                            .ds_grid_desc_mblock_mperblock_nblock_nperblock_ =
                                                ds_desc_mblock_mperblock_nblock_nperblock,
                                            .e_grid_desc_mblock_mperblock_nblock_nperblock_ =
                                                e_desc_mblock_mperblock_nblock_nperblock,
                                            .BlockStart_ = BlockStart,
                                            .BlockEnd_   = BlockEnd});

@wj-laskowski (Contributor, Author)

Thanks for the useful suggestion and explanation, Chris! I like this approach to handling the GemmArgs array and the explicit initialization. I added it to the PR with a slight modification: a runtime bound check.
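
For reference, a bound-checked Emplace might look roughly like the sketch below; the exact failure behaviour (throwing here) and the wording are assumptions, not necessarily the merged code.

    // Sketch of a bound-checked Emplace (assumed shape, not the merged code).
    // Requires <stdexcept>; an assert or an error code would also work if
    // throwing is not acceptable in this context.
    template <typename... Args>
    auto Emplace(ck::index_t i, Args&&... args)
        -> std::enable_if_t<std::is_nothrow_constructible_v<TData, Args...>>
    {
        if(!(i >= ck::index_t{0} && i < NSize))
        {
            throw std::out_of_range("Array::Emplace: index out of range");
        }
        mData[i].~TData();                                 // destroy the current element
        new(mData + i) TData(std::forward<Args>(args)...); // construct the new one in place
    }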

@krithalith (Contributor)

Looks good overall! I have some comments:

  1. They wanted only a single generic instance per instance list for (Large Tensor) bias bnorm clamp. You can just make two new instance lists that contain only one generic instance each and use them only for bias bnorm clamp. We did the same for the non-large version. Please first find an actual generic instance, because it seems the current instance lists do not contain one. You can probably adapt one from the XDL Large Tensor instance lists. It should have all the ScalarPerVector values equal to 1.
  2. Are the current bias bnorm clamp tests sufficient for testing the Large Tensor implementation? I.e. are the tensors large enough to actually cause splitting (see the size-check sketch after this list)? If not, it might be useful to add a "Large Cases" test like for the other flavors.
  3. Did you check that the Large Tensor implementation is actually run and can support all the test cases (especially after reducing the instance lists to generic only)?
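
For point 2, a quick sanity check is to compare the output tensor's element count against the 32-bit indexing limit; the helper below is hypothetical and assumes (not something stated in this PR) that exceeding that limit is roughly what triggers the large-tensor splitting path:

    // Hypothetical helper: does an NHWGC output tensor exceed 32-bit element
    // indexing? This threshold is an assumption about when splitting kicks in.
    #include <cstdint>
    #include <limits>

    bool ExceedsInt32ElementCount(std::int64_t N, std::int64_t H, std::int64_t W,
                                  std::int64_t G, std::int64_t C)
    {
        const std::int64_t elements = N * H * W * G * C;
        return elements > std::int64_t{std::numeric_limits<std::int32_t>::max()};
    }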

@krithalith krithalith self-requested a review January 20, 2026 09:53
