Fix LRU cache pollution causing BLOCK_SIZE_S3 KeyError in gemm_afp4wfp4 by Duyi-Wang · Pull Request #2169 · ROCm/aiter

Duyi-Wang · 2026-03-04T07:42:44Z

Motivation

Fix KeyError: 'Keyword argument BLOCK_SIZE_S3 was specified but unrecognised' crash in gemm_afp4wfp4 triggered when using quark_w4a4_mxfp4 quantization in SGLang.

Technical Details

get_gemm_config() in aiter/ops/triton/utils/gemm_config_utils.py uses @functools.lru_cache, which caches the returned (dict, bool) tuple by reference. The fused kernel fused_gemm_afp4wfp4_split_cat mutates the cached dict in-place by injecting BLOCK_SIZE_S3:

config["BLOCK_SIZE_S3"] = triton.next_power_of_2(...)

Subsequent calls to gemm_afp4wfp4 with the same (M, N, K) receive the same polluted dict from the cache. When unpacked via **config into a Triton kernel that doesn't accept BLOCK_SIZE_S3, it raises a KeyError.
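The mechanism can be reproduced without Triton. The following is a minimal sketch, not the aiter code: `get_config`, `plain_kernel`, `fused_caller` and `plain_caller` are hypothetical stand-ins, and the block sizes are made up. A plain Python function raises `TypeError` for an unexpected keyword, whereas the Triton launcher in this PR reports a `KeyError`; the cause is the same.

```python
import functools


@functools.lru_cache(maxsize=None)
def get_config(m: int, n: int, k: int):
    # Hypothetical stand-in for get_gemm_config(): returns a (dict, bool) tuple.
    # lru_cache stores this exact tuple, so every cache hit hands back
    # the same dict object.
    return {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}, True


def plain_kernel(*, BLOCK_SIZE_M: int, BLOCK_SIZE_N: int) -> int:
    # Stand-in for a kernel that has no BLOCK_SIZE_S3 parameter.
    return BLOCK_SIZE_M * BLOCK_SIZE_N


def fused_caller(m: int, n: int, k: int) -> dict:
    config, _ = get_config(m, n, k)
    config["BLOCK_SIZE_S3"] = 16  # in-place mutation of the cached dict
    return config


def plain_caller(m: int, n: int, k: int) -> int:
    config, _ = get_config(m, n, k)
    return plain_kernel(**config)


first = plain_caller(128, 128, 256)  # cache entry is still clean
fused_caller(128, 128, 256)          # pollutes the cached dict
try:
    plain_caller(128, 128, 256)
    polluted = False
except TypeError:
    # The extra key leaked through the cache into an unrelated caller.
    polluted = True
```

The second `plain_caller` call fails only because of call order, which is why the crash shows up in some workloads and not in others.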

Fix: Added config.pop("BLOCK_SIZE_S3", None) in the three affected functions in gemm_afp4wfp4.py (gemm_afp4wfp4_, gemm_afp4wfp4_preshuffled_scales, gemm_afp4wfp4_preshuffle) immediately after obtaining the config dict. This strips the polluted key before it reaches the kernel call, regardless of whether the cache was polluted or not.
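The workaround described above can be sketched as follows. `plain_kernel` and `launch_plain` are hypothetical names used for illustration only, and the block sizes are invented.

```python
def plain_kernel(*, BLOCK_SIZE_M: int, BLOCK_SIZE_N: int) -> int:
    # Hypothetical stand-in for a kernel with no BLOCK_SIZE_S3 parameter.
    return BLOCK_SIZE_M * BLOCK_SIZE_N


def launch_plain(config: dict) -> int:
    # Drop the foreign key if present. The None default makes this a no-op
    # (rather than a KeyError from dict.pop) when the dict is clean.
    config.pop("BLOCK_SIZE_S3", None)
    return plain_kernel(**config)


clean = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64}
polluted = {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_S3": 16}

clean_result = launch_plain(clean)
polluted_result = launch_plain(polluted)
```

Note that `pop` also mutates the shared dict. In this sketch that is harmless, and in the real code it would appear to be safe only as long as the fused path sets the key again on every call.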

This is a minimal workaround. A more thorough fix would be to have get_gemm_config() return defensive copies to prevent cache mutation entirely.
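A defensive-copy variant might look like the sketch below. It assumes the config is a flat dict and uses hypothetical names (`_cached_config`, `get_config`); it is not the actual `get_gemm_config()` implementation.

```python
import functools


@functools.lru_cache(maxsize=None)
def _cached_config(m: int, n: int, k: int):
    # Hypothetical cached loader; the returned dict is never handed out directly.
    return {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}, True


def get_config(m: int, n: int, k: int):
    config, tuned = _cached_config(m, n, k)
    # Each caller gets a fresh dict, so mutations cannot reach the cache.
    return dict(config), tuned


c1, _ = get_config(128, 128, 256)
c1["BLOCK_SIZE_S3"] = 16          # mutates only the caller's copy
c2, _ = get_config(128, 128, 256)  # still clean
```

The cost is one small dict allocation per call, while the expensive lookup stays cached.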

Test Plan

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

Copilot

Pull request overview

Addresses a Triton kernel launch crash (KeyError: '... BLOCK_SIZE_S3 ... unrecognised') caused by BLOCK_SIZE_S3 being injected into (and then reused from) a cached GEMM config when using MXFP4 quantization paths.

Changes:

Strip BLOCK_SIZE_S3 from the GEMM config dict after config retrieval in gemm_afp4wfp4_.
Strip BLOCK_SIZE_S3 from the GEMM config dict after config retrieval in gemm_afp4wfp4_preshuffled_scales.
Strip BLOCK_SIZE_S3 from the GEMM config dict after config retrieval in gemm_afp4wfp4_preshuffle.

k50112113 · 2026-03-04T19:57:39Z

Instead of popping BLOCK_SIZE_S3 with a default of None in the gemm_afp4wfp4 API, my suggestion would be to add .copy() in _get_config() in /root/aiter/aiter/ops/triton/_triton_kernels/gemm/basic/gemm_afp4wfp4.py,

like the following:

def _get_config(
    M: int,
    N: int,
    K: int,
    shuffle: bool = False,
):
    # Note: Config files use K=2*K in their naming
    K = 2 * K
    if shuffle:
        return get_gemm_config("GEMM-AFP4WFP4_PRESHUFFLED", M, N, K).copy()
    else:
        return get_gemm_config("GEMM-AFP4WFP4", M, N, K).copy()

This copies the config dict so that gemm_afp4wfp4 and fused_gemm_afp4wfp4_split_cat each get their own config.

Please let me know if this works.

Thanks,
Shao-Chun Lee

Duyi-Wang · 2026-03-05T04:10:05Z

Returning a deep copy in _get_config() would be a better solution to the cache pollution issue, and it works. I was unsure about its impact on performance or other components, so I used the pop() approach as a workaround for my current workload.
The deep-copy-based fix has now been pushed.
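On the performance question, the difference between a shallow and a deep copy can be illustrated as below. This is a sketch under the assumption that the GEMM config is a flat dict of scalars. The `nested` dict and its `extra` key are hypothetical and exist only to show when a deep copy would be needed.

```python
import copy

flat = {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64}
nested = {"BLOCK_SIZE_M": 64, "extra": {"num_warps": 4}}  # hypothetical shape

# Flat dict: a shallow copy fully isolates the caller.
flat_copy = flat.copy()
flat_copy["BLOCK_SIZE_S3"] = 16

# Nested dict: a shallow copy still shares the inner dict with the original.
shallow = nested.copy()
shallow["extra"]["num_warps"] = 8  # also changes nested["extra"]

# A deep copy duplicates inner containers too, at a higher per-call cost.
deep = copy.deepcopy(nested)
deep["extra"]["num_warps"] = 2     # nested is unaffected
```

If the config values are all scalars, `.copy()` gives the same isolation as `copy.deepcopy` with less work per call.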

Duyi-Wang · 2026-03-06T03:19:53Z

A similar one, #2173, was merged. However, this is the same workaround and is only applicable to gemm_afp4wfp4. @k50112113

k50112113

Yes, this looks good, thanks for the work

…p4 (ROCm#2169)
* Walk around "BLOCK_SIZE_S3" error
* Remove workaround for "BLOCK_SIZE_S3" key in GEMM configuration functions
* Revert "Copy config before mutate (ROCm#2173)"
This reverts commit 213b76f.

Duyi-Wang requested review from a team and Copilot March 4, 2026 07:42

Copilot started reviewing on behalf of Duyi-Wang March 4, 2026 07:43

Copilot AI reviewed Mar 4, 2026

Duyi-Wang changed the title ~~Walk around "BLOCK_SIZE_S3" error~~ Fix LRU cache pollution causing BLOCK_SIZE_S3 KeyError in gemm_afp4wfp4 Mar 4, 2026

azaidy requested a review from k50112113 March 4, 2026 19:00

Duyi-Wang added 3 commits March 6, 2026 11:14

Walk around "BLOCK_SIZE_S3" error

f5b44e5

Remove workaround for "BLOCK_SIZE_S3" key in GEMM configuration functions

e5a483d

Revert "Copy config before mutate (ROCm#2173)"

a0d87c6

This reverts commit 213b76f.

Duyi-Wang force-pushed the fix_block_size_3_error branch from cb05b0b to a0d87c6 March 6, 2026 03:15

k50112113 approved these changes Mar 9, 2026

k50112113 merged commit 7d29fce into ROCm:main Mar 12, 2026
36 checks passed

1am9trash mentioned this pull request Apr 13, 2026

[AMD] Remove aiter hotfixes in Dockerfile covered by aiter v0.1.12.post1 sgl-project/sglang#22657

Merged

5 tasks
