Skip to content

Conversation

@kensclin
Copy link
Contributor

Proposed changes

Improve abquant performance

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes GEMM (General Matrix Multiply) operations with ABQuant (A and B quantization) by improving scheduling, reducing redundant computations, and adding support for special cases. The changes focus on performance improvements through better instruction scheduling, handling of NQPerBlock==1 cases, and configuration updates.

Changes:

  • Improved instruction scheduling in the pipeline by adjusting MFMA and VALU instruction counts and barrier placements
  • Added optimized path for NQPerBlock==1 case that pre-computes quantization scale products
  • Enhanced sweep_tile_span to handle zero-dimensional spans
  • Updated configuration to enable transpose_c for ABQuantGrouped mode

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
gemm_wp_abquant_pipeline_ag_bg_cr_v2.hpp Optimized instruction scheduling by adjusting ds_read_inst, mfma_inst calculations and VALU barrier count; improved code comments
block_universal_gemm_as_aquant_bs_bquant_cr.hpp Added fast path for NQPerBlock==1 by pre-multiplying quantization scales; improved code formatting
block_universal_gemm_ar_aquant_flatbr_bquant_cr.hpp Added NQPerBlock==1 optimization path similar to the other block implementation
sweep_tile.hpp Added handling for zero-dimensional spans to prevent compilation issues
run_gemm_quant_example.inc Updated transpose_c configuration to be conditional based on QuantMode
gemm_utils.hpp Added new configuration structs for ABQuant operations with appropriate settings
gemm_abquant_quantgrouped.cpp Updated to use new ABQuant-specific configuration templates

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

constexpr index_t buffer_load_inst = Aload_inst + Bload_inst + BQload_inst;
// Approximate number of LDS reads per block
constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle;
constexpr index_t ds_read_inst = kMPerBlock / kLdsInstCycle / nloop;
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The division by nloop appears twice in the calculation formulas (lines 130 and 135) but nloop is not defined or visible in this diff. Verify that nloop is properly defined in scope and consider adding a comment explaining its purpose in these calculations.

Copilot uses AI. Check for mistakes.
constexpr index_t ds_write_inst = Aload_inst;
// Number of MFMA instructions per wave for one block tile:
constexpr index_t mfma_inst = (kMPerBlock / WG::kM) * (kNPerBlock / WG::kN);
constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop;
Copy link

Copilot AI Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formula divides by nloop twice but lacks parentheses to clarify operator precedence. Consider rewriting as ((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop) for clarity.

Suggested change
constexpr index_t mfma_inst = (kMPerBlock / WG::kM) / nloop * (kNPerBlock / WG::kN) / nloop;
constexpr index_t mfma_inst =
((kMPerBlock / WG::kM) / nloop) * ((kNPerBlock / WG::kN) / nloop);

Copilot uses AI. Check for mistakes.
ThomasNing and others added 3 commits January 20, 2026 14:38
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants