GEMM Blockscale ABQuant Optimization #3620

kensclin · 2026-01-20T15:57:09Z

Proposed changes

Improve abquant performance

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Copilot

Pull request overview

This PR optimizes GEMM (General Matrix Multiply) operations with ABQuant (A and B quantization) by improving scheduling, reducing redundant computations, and adding support for special cases. The changes focus on performance improvements through better instruction scheduling, handling of NQPerBlock==1 cases, and configuration updates.

Changes:

Improved instruction scheduling in the pipeline by adjusting MFMA and VALU instruction counts and barrier placements
Added optimized path for NQPerBlock==1 case that pre-computes quantization scale products
Enhanced sweep_tile_span to handle zero-dimensional spans
Updated configuration to enable transpose_c for ABQuantGrouped mode

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
gemm_wp_abquant_pipeline_ag_bg_cr_v2.hpp	Optimized instruction scheduling by adjusting ds_read_inst, mfma_inst calculations and VALU barrier count; improved code comments
block_universal_gemm_as_aquant_bs_bquant_cr.hpp	Added fast path for NQPerBlock==1 by pre-multiplying quantization scales; improved code formatting
block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp	Added NQPerBlock==1 optimization path similar to the other block implementation
sweep_tile.hpp	Added handling for zero-dimensional spans to prevent compilation issues
run_gemm_quant_example.inc	Updated transpose_c configuration to be conditional based on QuantMode
gemm_utils.hpp	Added new configuration structs for ABQuant operations with appropriate settings
gemm_abquant_quantgrouped.cpp	Updated to use new ABQuant-specific configuration templates

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-01-20T22:10:41Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_abquant_pipeline_ag_bg_cr_v2.hpp

        constexpr index_t buffer_load_inst = Aload_inst + Bload_inst + BQload_inst;
        // Approximate number of LDS reads per block
-        constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle;
+        constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle / nloop;


The division by nloop appears twice in the calculation formulas (lines 130 and 135) but nloop is not defined or visible in this diff. Verify that nloop is properly defined in scope and consider adding a comment explaining its purpose in these calculations.

Copilot · 2026-01-20T22:10:41Z

include/ck_tile/ops/gemm_quant/pipeline/gemm_wp_abquant_pipeline_ag_bg_cr_v2.hpp

        constexpr index_t ds_write_inst = Aload_inst;
        // Number of MFMA instructions per wave for one block tile:
-        constexpr index_t mfma_inst = (kMPerBlock / WG::kM) * (kNPerBlock / WG::kN);
+        constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop;


The formula divides by nloop twice but lacks parentheses to clarify operator precedence. Consider rewriting as ((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop) for clarity.

Suggested change

constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop;

constexpr index_t mfma_inst =

((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop);

include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_aquant_bs_bquant_cr.hpp

include/ck_tile/ops/gemm_quant/block/block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp

example/ck_tile/38_block_scale_gemm/run_gemm_quant_example.inc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

GEMM Blockscale ABQuant Optimization

651e612

kensclin requested review from Snektron, ThomasNing, afagaj, andriy-ca, aosewski, asleepzzz, bartekxk, carlushuang, cgmillette, coderfeli, geyyer, illsilin, poyenc, qianfengz, shumway, tenpercent, vidyasagar-amd and vpietila-amd as code owners January 20, 2026 15:57

afagaj requested a review from Copilot January 20, 2026 22:09

Copilot AI reviewed Jan 20, 2026

View reviewed changes

ThomasNing and others added 3 commits January 20, 2026 14:38

Apply suggestion from @Copilot

4a32045

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

b39c391

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Apply suggestion from @Copilot

05d0f85

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GEMM Blockscale ABQuant Optimization #3620

GEMM Blockscale ABQuant Optimization #3620

kensclin commented Jan 20, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 20, 2026

Uh oh!

Copilot AI Jan 20, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop;
	constexpr index_t mfma_inst =
	((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop);

GEMM Blockscale ABQuant Optimization #3620

Are you sure you want to change the base?

GEMM Blockscale ABQuant Optimization #3620

Conversation

kensclin commented Jan 20, 2026

Proposed changes

Checklist

Discussion

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants