
Debug large address draft #2150

Closed
JaxChen29 wants to merge 1290 commits into ROCm:main from JaxChen29:debug_large_address_draft

Conversation

@JaxChen29
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

xiaohuguo2023 and others added 30 commits December 9, 2025 20:24
…is (ROCm#1464)

* add iris as aiter optional dependency

* prevents torch.distributed from leaking into aiter namespace

* add reduce_scatter triton ops

* add all_gather triton ops

* add fused rs_rms_quant_ag triton comm op

* add unit tests

* complete integration

* update format

* update fused kernel

* more format update

* suspect this may break vllm tests

* add iris.py and remove unnecessary iris function API

* fix CI import error by enabling conditional import of iris

* remove unnecessary tests

* Refactored the reduce-scatter and all-gather kernel implementations to eliminate code duplication between standalone operations and fused kernels

* use new restructured reduce_scatter and all_gather

* add calculate_heap_size to automatically allocate the heap based on M, N

* make sure calculate_heap_size is exported at each level

* fix undefined shmem issue

* use aiter rmsnorm triton kernel instead

* add howto

* add test for fused rs_rmsnorm_quant_ag kernel

* use multiple processing for this test as well

* resolve format issue

* fix typo for black

* fix another black format issue

* ctx is always going to be required

* remove unused variables

* remove unnecessary pass

* remove unused math module

* Update aiter/ops/triton/comms/iris.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_reduce_scatter_all_gather.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove dead link

* Update aiter/ops/triton/comms/fused/__init__.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_fused_rs_rmsnorm_quant_ag.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/reduce_scatter.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* fix iterations for distributed tests to prevent deadlocks

* update license

* pin iris SHA

* fix format

* Update aiter/ops/triton/comms/all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>
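[Editorial sketch] The commits above make iris an optional dependency and gate it behind a conditional import so CI without iris still passes. A minimal sketch of that pattern, with illustrative names (`HAS_IRIS`, `reduce_scatter` here are not necessarily the identifiers aiter uses):

```python
# Conditional import of an optional dependency: the import failure is
# recorded, and the error is deferred to usage time instead of import time.
try:
    import iris  # optional dependency; may be absent in CI environments
    HAS_IRIS = True
except ImportError:
    iris = None
    HAS_IRIS = False


def reduce_scatter(tensor):
    """Raise at usage time, not import time, when iris is missing."""
    if not HAS_IRIS:
        raise RuntimeError(
            "iris is required for triton comm ops; "
            "install the optional dependency to use them"
        )
    # ... actual communication logic would go here ...
```

This keeps `import aiter` working everywhere while making the comm ops fail loudly only when actually called.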
Add chmod step before checkout to fix permissions on .eggs files
* opt(pa_ps): replace pa_ps asm due to reduce ignore nan

Signed-off-by: Double Young <yang.yang2@amd.com>

* opt(pa_ps): optimize ceil in v1_2 gen_metadata

* opt(pa_ps): use dense representation to decrease reduce overhead

* tmp(pa_ps): enhance v1_2 gen_metadata

---------

Signed-off-by: Double Young <yang.yang2@amd.com>
* Add instances supporting inter_dim=192 for CK 2stage MoE

* revert changes to other than bf16 types

* fix per-token instances

* add some instances

* adjust some instances

* adjust some instances

* remove bad tunings for inter_dim=192

* Add tuned ck2stage kernels for Qwen3-235B

* disable KPerBlock=64 fp8 cases for gfx950

* add one more instance

* revert changes to A4W4

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* update

* update ut

* update

* fix prefill block_size

* rm tmp files

* update ck

* update tuning range

* update ck & ut

* update ck

---------

Co-authored-by: Zzz9990 <Zzz9990>
* fix setup develop prebuild bugs

* add more module in prebuild

* format code

* skip mi350 a8w8 kernel

* bug fix

* bug fix II

* bug fix III

* format II

* format III

* format IV

* format V

* bugs fix

* bug new fix II

* bug new fix III

* format new I

---------

Co-authored-by: zufayu <zufayu@amd.com>
* Add batch prefill benchmark

Create initial bench_batch_prefill.py for benchmarking mha_batch_prefill_func
with paged KV cache. equal_seqlens now defaults to true; pass
"--no-equal_seqlens" for a random-length test.

* [lint style fix] bench_batch_prefill.py

pylint:
Your code has been rated at 8.53/10
ruff:
All checks passed!
black:
All done!

---------

Co-authored-by: root <root@smci350-odcdh1-a07-2.png-odc.dcgpu>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
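[Editorial sketch] The benchmark's default-true flag with a `--no-equal_seqlens` override, as described in the commit above, can be expressed with `argparse.BooleanOptionalAction` (Python 3.9+), which generates both flag forms automatically. The actual bench_batch_prefill.py may implement this differently; this is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--equal_seqlens",
    action=argparse.BooleanOptionalAction,  # also creates --no-equal_seqlens
    default=True,
    help="use equal sequence lengths; pass --no-equal_seqlens for random lengths",
)

args_default = parser.parse_args([])                     # equal_seqlens=True
args_random = parser.parse_args(["--no-equal_seqlens"])  # equal_seqlens=False
```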
…or TGMM (ROCm#1541)

* Initial commit

* Added TGMM test and fixed the wrapper

* Added test for accumulate

* atol and rtol changes, comment fixes

* Addressing PR comments

* Added benchmark script modifications

* Added bias and accumulate to non persistent

* Refactor TGMM test: enabled bias grad test for nptgmm and relaxed tolerance for gfx950

* Removed accumulate boolean for bias grad and added tests for accumulate.

* Applied black formatting

* Add accumulate configs to avoid triton.runtime.errors.OutOfResources (shared memory required: 131072, hardware limit: 65536) when accumulate = True

* Fixed black formatting

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
* CI: Temporarily migrate SGlang and vLLM tests
* Add attention sink support to forward pass

* Add attention sink forward pass support to benchmark script

* Add attention sink support to backward pass

* Add attention sink backward pass support to benchmark script

* Conditionally relax dv error tolerance on `gfx942`

* Decrease error tolerance for `dsink`
* bf16gemm_asm_tune

* mdf_csv

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update

* update GemmTuner.py

* update kl

* update_mi30x_kl

* update

* update

* update

* update

* update

* add_gpt_oss_tuned

* update test

* update

* update

* update tuner with bias

* Update GemmTuner.py

* filter bpreshuffle

* fix lint

* mdf test

* set max splitk 16

* rm tuned result

* Update bf16_tuned_gemm.csv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ying.Zhou2 <yzhou103@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
…is - try 2 (ROCm#1607)

* add iris as aiter optional dependency

* prevents torch.distributed from leaking into aiter namespace

* add reduce_scatter triton ops

* add all_gather triton ops

* add fused rs_rms_quant_ag triton comm op

* add unit tests

* complete integration

* update format

* update fused kernel

* more format update

* suspect this may break vllm tests

* add iris.py and remove unnecessary iris function API

* fix CI import error by enabling conditional import of iris

* remove unnecessary tests

* Refactored the reduce-scatter and all-gather kernel implementations to eliminate code duplication between standalone operations and fused kernels

* use new restructured reduce_scatter and all_gather

* add calculate_heap_size to automatically allocate the heap based on M, N

* make sure calculate_heap_size is exported at each level

* fix undefined shmem issue

* use aiter rmsnorm triton kernel instead

* add howto

* add test for fused rs_rmsnorm_quant_ag kernel

* use multiple processing for this test as well

* resolve format issue

* fix typo for black

* fix another black format issue

* ctx is always going to be required

* remove unused variables

* remove unnecessary pass

* remove unused math module

* Update aiter/ops/triton/comms/iris.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_reduce_scatter_all_gather.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove dead link

* Update aiter/ops/triton/comms/fused/__init__.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_fused_rs_rmsnorm_quant_ag.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/reduce_scatter.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* fix iterations for distributed tests to prevent deadlocks

* update license

* pin iris SHA

* fix format

* The warnings were emitted at import time (too early); now they appear only at usage time (when it matters)

* add IRIS_DEP to pin the iris release and avoid sudden iris API changes

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove commented lines

* Fixed fp8 dtype check inconsistency

* Add public is_initialized property to fix encapsulation issue

* remove unused

* Fixed multiprocessing race condition

* fix aiter imports and remove unused M_shard

* fix inefficient mem allocation

* fix format issue

* completely silent iris log when not using iris

* fix the import crash

* move triton_comms README

* add triton comms info to README.md

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
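[Editorial sketch] Several commits above add a `calculate_heap_size` helper that sizes the symmetric heap automatically from M and N. A hypothetical version of such a helper is sketched below; the signature, defaults, and formula are assumptions for illustration, not aiter's actual implementation:

```python
def calculate_heap_size(M, N, dtype_bytes=2, world_size=8,
                        num_buffers=2, alignment=1 << 21):
    """Rough upper bound on heap bytes: per-rank shards of an (M, N)
    tensor, a few staging buffers, rounded up to a 2 MiB alignment."""
    payload = M * N * dtype_bytes           # full tensor in bytes
    per_rank = -(-payload // world_size)    # ceil-divide across ranks
    total = per_rank * num_buffers * world_size
    return -(-total // alignment) * alignment  # round up to alignment
```

The point of such a helper is that callers pass only the problem shape and the heap is guaranteed large enough for the fused rs_rmsnorm_quant_ag pipeline's intermediate buffers.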
* refactor fmoe profile data, and log

* Update fused_moe.py
* add pa_decode_triton_fp8_gluon kernel and perf benchmark

* enhance code readability, make the code more elegant

* add gluon version paged attention decode and test

- Create __init__.py for aiter.ops.triton.gluon module
- Add test_paged_attention_decode_gluon.py to test gluon FP8 implementation
- Test compares performance between assembly and gluon kernels

* add per_tensor quant test for gluon version PA

* fix per_tensor quant bug of paged_attention_decode_v2_gluon_large_block_fp8

* use different reduce code paths for Triton 3.4 and Triton 3.5 to avoid the performance regression introduced by Triton 3.5

* fix bug in varlen KV use; fix per_token and per_tensor bugs; add torch_mha_extend_flashattn_style, which is consistent with the Triton kernel, to compare diffs against it

* rm useless code

* add pa_gluon AOT compile flow and unittest

* fix pa_gluon AOT compile flow bug

* remove xxx_POW2 params of PA triton kernels && add separate test for pa_decode_attention_kernel AOT compile

* modify pa kernel interface && fix add compute_type param problem in AOT mode

* merge pa_decode attention and reduce kernel into one C++ wrapper

* rename some variable && add paged_attention_decode_v2_reduce_kernel_triton34 into pa_decode_triton_gluon_fp8.py and remove pa_decode_triton_gluon_fp8_triton34.py

* rename file

* change the triton version comparison logic

* format file

* fix the issue of pa gluon performance decline in triton AOT mode

* add assertions to all tests pass in test_pa_decode_gluon.py

* fix per-token quant bug

* add support for COMPUTE_TYPE: bf16,fp16 && QUANT_Q_AND_KV: (False,False),(False,True)

* AOT mode support COMPUTE_TYPE: bf16,fp16 && QUANT_Q_AND_KV: (False,False),(False,True)

* add script to build PA AOT so

* the script to build PA AOT so support multiprocessing

* add loop version of paged_attention_decode_v2_reduce_kernel to remove MAX_CONTEXT_PARTITION_NUM and reduce the .so count in AOT mode

* reduce kernel templates

* modify build PA AOT so options

* test support sample_rate to save time

* fix bug of building PA AOT so with multiprocessing

* refactor: generalize gluon kernel && implement build cache cleanup with .so file reporting

* feat: add backward compatibility for Triton without gluon
Gracefully handle missing triton.experimental.gluon imports by adding try-except blocks and runtime checks. Add simplified test configuration.

* optimize paged_attention_decode_v2_reduce_kernel in long context scenarios

* fix deadlock bug of building PA AOT so with multiprocessing && refine unit test

* disable bf16 and fp16 support for the time being due to precision problems

* Add Gluon transpose kernels for query and output in paged attention decode

- Add transpose_query_gluon_kernel to transpose query from
  [batch*seq, num_heads, head_size] to Gluon format
- Add transpose_output_gluon_kernel to transpose output back
  to standard format
- Support both JIT and AOT compilation modes
- Update pa_decode_gluon API to handle transposition internally
- Add unit tests and prebuild scripts for transpose kernels

* ali test

* use gl.int64 to hold kv_cache offsets to avoid overflow when the kv_cache's shape is too large

* clean cache file after aot compile function, add doc for func pa_decode_gluon and pa_decode_gluon_aot

* remove paged_attention_decode_v2_reduce_kernel_triton34, PA kernel can only run in triton 3.5.0 or higher

* rename PA dot kernel

* Add arch assertion to restrict pa_decode_gluon to gfx942 (CDNA3)

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
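[Editorial sketch] One commit above switches kv_cache offsets to `gl.int64` to avoid overflow on large shapes. The failure mode is easy to demonstrate outside Triton: a flattened offset computed in 32-bit arithmetic wraps once it exceeds 2**31 - 1, while 64-bit arithmetic stays correct. The block/stride values below are made up for illustration:

```python
import numpy as np

block_idx = np.array([70_000], dtype=np.int32)
block_stride = 40_000  # elements per block (hypothetical)

# 32-bit product wraps silently: 70_000 * 40_000 = 2.8e9 > 2**31 - 1
off32 = block_idx * block_stride

# Promoting to 64-bit before multiplying gives the true offset
off64 = block_idx.astype(np.int64) * block_stride
```

This is why holding offsets in 64-bit integers matters once the kv_cache grows past a couple of billion elements.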
* mdf package_name to amd-aiter

* update

* update
…8 and mtp>=1 (ROCm#1626)

* introduce a new solution which can reduce #splits

rev 1.1

1.2

* Revert kFixedOverheadNumBlocks

* revert attention.py
* fix ci triton error

* rebase
* Move triton_tests into folders

* fix import errors

* formatting fix

* remove misc folder

* Update op_tests/triton_tests/quant/test_fused_mxfp4_quant.py - copilot suggestion

* remove import workarounds

* fix op_benchmarks imports
gyohuangxin and others added 26 commits January 29, 2026 15:40
* add sageish attn implementation (fav3_sage.py), i.e. attn quantization scheme suited for video generation
* Support gdn ops; remove layernorm_gated.py to pass the Triton test
* draft

* Update CONTRIBUTE.md

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix 1

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* mla ps support paged 64 and 3buffer layout for ds3.2
* CI: Fix Sglang dummy_grok 403 issue
* add mha bwd 192_128 causal br asm kernel

* add causal br a16 kernel

* fix kernel name

* fix_a16_kernel_nan
* CI: Fix docker ratelimit issues

* Fix
* enhance cast

* update WIP fp4

* update

* remove std dependency for integral based static_for

* update

* let gfx950 use the instruction-based cast f32->bf16 via static_cast
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* mla ps support paged 64 and 3buffer layout for ds3.2

* fix the github-actions

* upload kernel

* fix the comments

* modify the scale values
Signed-off-by: Double Young <yang.yang2@amd.com>
@JaxChen29 JaxChen29 requested a review from a team March 2, 2026 12:06
@JaxChen29 JaxChen29 marked this pull request as draft March 2, 2026 12:06
@valarLip valarLip closed this Mar 18, 2026
