-
Notifications
You must be signed in to change notification settings - Fork 267
GEMM Blockscale ABQuant Optimization #3620
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes GEMM (General Matrix Multiply) operations with ABQuant (A and B quantization) by improving scheduling, reducing redundant computations, and adding support for special cases. The changes focus on performance improvements through better instruction scheduling, handling of NQPerBlock==1 cases, and configuration updates.
Changes:
- Improved instruction scheduling in the pipeline by adjusting MFMA and VALU instruction counts and barrier placements
- Added optimized path for NQPerBlock==1 case that pre-computes quantization scale products
- Enhanced sweep_tile_span to handle zero-dimensional spans
- Updated configuration to enable transpose_c for ABQuantGrouped mode
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| gemm_wp_abquant_pipeline_ag_bg_cr_v2.hpp | Optimized instruction scheduling by adjusting ds_read_inst, mfma_inst calculations and VALU barrier count; improved code comments |
| block_universal_gemm_as_aquant_bs_bquant_cr.hpp | Added fast path for NQPerBlock==1 by pre-multiplying quantization scales; improved code formatting |
| block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp | Added NQPerBlock==1 optimization path similar to the other block implementation |
| sweep_tile.hpp | Added handling for zero-dimensional spans to prevent compilation issues |
| run_gemm_quant_example.inc | Updated transpose_c configuration to be conditional based on QuantMode |
| gemm_utils.hpp | Added new configuration structs for ABQuant operations with appropriate settings |
| gemm_abquant_quantgrouped.cpp | Updated to use new ABQuant-specific configuration templates |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| constexpr index_t buffer_load_inst = Aload_inst + Bload_inst + BQload_inst; | ||
| // Approximate number of LDS reads per block | ||
| constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle; | ||
| constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle / nloop; |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The division by nloop appears twice in the calculation formulas (lines 130 and 135) but nloop is not defined or visible in this diff. Verify that nloop is properly defined in scope and consider adding a comment explaining its purpose in these calculations.
| constexpr index_t ds_write_inst = Aload_inst; | ||
| // Number of MFMA instructions per wave for one block tile: | ||
| constexpr index_t mfma_inst = (kMPerBlock / WG::kM) * (kNPerBlock / WG::kN); | ||
| constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop; |
Copilot
AI
Jan 20, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The formula divides by nloop twice but lacks parentheses to clarify operator precedence. Consider rewriting as ((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop) for clarity.
| constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop; | |
| constexpr index_t mfma_inst = | |
| ((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop); |
include/ck_tile/ops/gemm_quant/block/block_universal_gemm_as_aquant_bs_bquant_cr.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/gemm_quant/block/block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp
Outdated
Show resolved
Hide resolved
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Proposed changes
Improve abquant performance
Checklist
Please put an
xinto the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.clang-formaton all changed filesDiscussion
If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered