
Debug large address draft #2150

Closed
JaxChen29 wants to merge 1290 commits into ROCm:main from JaxChen29:debug_large_address_draft

Conversation

@JaxChen29
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

xiaohuguo2023 and others added 30 commits December 9, 2025 20:24
…is (ROCm#1464)

* add iris as aiter optional dependency

* prevents torch.distributed from leaking into aiter namespace

* add reduce_scatter triton ops

* add all_gather triton ops

* add fused rs_rms_quant_ag triton comm op

* add unit tests

* complete integration

* update format

* update fused kernel

* more format update

* suspect this may break vllm tests

* add iris.py and remove unnecessary iris function API

* fix CI import error by enabling conditional import of iris

* remove unnecessary tests

* Refactored the reduce-scatter and all-gather kernel implementations to eliminate code duplication between standalone operations and fused kernels

* use new restructured reduce_scatter and all_gather

* add calculate_heap_size to automatically allocate the heap based on M, N

* make sure calculate_heap_size is exported at each level

* fix undefined shmem issue

* use aiter rmsnorm triton kernel instead

* add howto

* add test for fused rs_rmsnorm_quant_ag kernel

* use multiple processing for this test as well

* resolve format issue

* fix typo for black

* fix another black format issue

* ctx is always going to be required

* remove unused variables

* remove unnecessary pass

* remove unused math module

* Update aiter/ops/triton/comms/iris.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_reduce_scatter_all_gather.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove dead link

* Update aiter/ops/triton/comms/fused/__init__.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_fused_rs_rmsnorm_quant_ag.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/reduce_scatter.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* fix iterations for distributed tests to prevent deadlocks

* update license

* pin iris SHA

* fix format

* Update aiter/ops/triton/comms/all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>
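[Editorial sketch] The commits above make iris an optional dependency and gate it behind a conditional import so CI without iris still passes. A minimal sketch of that pattern, with illustrative names (`HAS_IRIS`, `reduce_scatter` here are not necessarily the identifiers aiter uses):

```python
# Conditional import of an optional dependency: the import failure is
# recorded, and the error is deferred to usage time instead of import time.
try:
    import iris  # optional dependency; may be absent in CI environments
    HAS_IRIS = True
except ImportError:
    iris = None
    HAS_IRIS = False


def reduce_scatter(tensor):
    """Raise at usage time, not import time, when iris is missing."""
    if not HAS_IRIS:
        raise RuntimeError(
            "iris is required for triton comm ops; "
            "install the optional dependency to use them"
        )
    # ... actual communication logic would go here ...
```

This keeps `import aiter` working everywhere while making the comm ops fail loudly only when actually called.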
Add chmod step before checkout to fix permissions on .eggs files
* opt(pa_ps): replace pa_ps asm due to reduce ignore nan

Signed-off-by: Double Young <yang.yang2@amd.com>

* opt(pa_ps): optimize ceil in v1_2 gen_metadata

* opt(pa_ps): use dense representation to decrease reduce overhead

* tmp(pa_ps): enhance v1_2 gen_metadata

---------

Signed-off-by: Double Young <yang.yang2@amd.com>
* Add instances supporting inter_dim=192 for CK 2stage MoE

* revert changes to other than bf16 types

* fix per-token instances

* add some instances

* adjust some instances

* adjust some instances

* remove bad tunings for inter_dim=192

* Add tuned ck2stage kernels for Qwen3-235B

* disable KPerBlock=64 fp8 cases for gfx950

* add one more instance

* revert changes to A4W4

---------

Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* update

* update ut

* update

* fix prefill block_size

* rm tmp files

* update ck

* update tuning range

* update ck & ut

* update ck

---------

Co-authored-by: Zzz9990 <Zzz9990>
* fix setup develop prebuild bugs

* add more module in prebuild

* format code

* skip mi350 a8w8 kernel

* bug fix

* bug fix II

* bug fix III

* format II

* format III

* format IV

* format V

* bugs fix

* bug new fix II

* bug new fix III

* format new I

---------

Co-authored-by: zufayu <zufayu@amd.com>
* Add batch prefill benchmark

Create initial bench_batch_prefill.py for benchmarking mha_batch_prefill_func
with paged KV cache. equal_seqlens now defaults to true; pass
"--no-equal_seqlens" for a random-length test.

* [lint style fix] bench_batch_prefill.py

pylint:
Your code has been rated at 8.53/10
ruff:
All checks passed!
black:
All done!

---------

Co-authored-by: root <root@smci350-odcdh1-a07-2.png-odc.dcgpu>
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
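[Editorial sketch] The benchmark's default-true flag with a `--no-equal_seqlens` override, as described in the commit above, can be expressed with `argparse.BooleanOptionalAction` (Python 3.9+), which generates both flag forms automatically. The actual bench_batch_prefill.py may implement this differently; this is illustrative:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--equal_seqlens",
    action=argparse.BooleanOptionalAction,  # also creates --no-equal_seqlens
    default=True,
    help="use equal sequence lengths; pass --no-equal_seqlens for random lengths",
)

args_default = parser.parse_args([])                     # equal_seqlens=True
args_random = parser.parse_args(["--no-equal_seqlens"])  # equal_seqlens=False
```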
…or TGMM (ROCm#1541)

* Initial commit

* Added TGMM test and fixed the wrapper

* Added test for accumulate

* atol and rtol changes, comment fixes

* Addressing PR comments

* Added benchmark script modifications

* Added bias and accumulate to non persistent

* Refactor TGMM test: enabled bias grad test for nptgmm and relaxed tolerance for gfx950

* Removed accumulate boolean for bias grad and added tests for accumulate.

* Applied black formatting

* Add accumulate configs to avoid triton.runtime.errors.OutOfResources (shared memory required: 131072, hardware limit: 65536) when accumulate = True

* Fixed black formatting

---------

Co-authored-by: Sudharshan Govindan <sugovind@amd.com>
* CI: Temporarily migrate SGlang and vLLM tests
* Add attention sink support to forward pass

* Add attention sink forward pass support to benchmark script

* Add attention sink support to backward pass

* Add attention sink backward pass support to benchmark script

* Conditionally relax dv error tolerance on `gfx942`

* Decrease error tolerance for `dsink`
* bf16gemm_asm_tune

* mdf_csv

* update

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update

* update GemmTuner.py

* update kl

* update_mi30x_kl

* update

* update

* update

* update

* update

* add_gpt_oss_tuned

* update test

* update

* update

* update tuner with bias

* Update GemmTuner.py

* filter bpreshuffle

* fix lint

* mdf test

* set max splitk 16

* rm tuned result

* Update bf16_tuned_gemm.csv

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Ying.Zhou2 <yzhou103@amd.com>
Co-authored-by: yzhou103 <Ying.Zhou2@amd.com>
…is - try 2 (ROCm#1607)

* add iris as aiter optional dependency

* prevents torch.distributed from leaking into aiter namespace

* add reduce_scatter triton ops

* add all_gather triton ops

* add fused rs_rms_quant_ag triton comm op

* add unit tests

* complete integration

* update format

* update fused kernel

* more format update

* suspect this may break vllm tests

* add iris.py and remove unnecessary iris function API

* fix CI import error by enabling conditional import of iris

* remove unnecessary tests

* Refactored the reduce-scatter and all-gather kernel implementations to eliminate code duplication between standalone operations and fused kernels

* use new restructured reduce_scatter and all_gather

* add calculate_heap_size to automatically allocate the heap based on M, N

* make sure calculate_heap_size is exported at each level

* fix undefined shmem issue

* use aiter rmsnorm triton kernel instead

* add howto

* add test for fused rs_rmsnorm_quant_ag kernel

* use multiple processing for this test as well

* resolve format issue

* fix typo for black

* fix another black format issue

* ctx is always going to be required

* remove unused variables

* remove unnecessary pass

* remove unused math module

* Update aiter/ops/triton/comms/iris.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_reduce_scatter_all_gather.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove dead link

* Update aiter/ops/triton/comms/fused/__init__.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/fused/reduce_scatter_rmsnorm_quant_all_gather.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update op_tests/multigpu_tests/triton/test_fused_rs_rmsnorm_quant_ag.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* Update aiter/ops/triton/comms/reduce_scatter.py

Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>

* fix iterations for distributed tests to prevent deadlocks

* update license

* pin iris SHA

* fix format

* The warnings were emitted at import time (too early); now they appear only at usage time (when it matters)

* add IRIS_DEP to pin the iris release and avoid sudden iris API changes

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* remove commented lines

* Fixed fp8 dtype check inconsistency

* Add public is_initialized property to fix encapsulation issue

* remove unused

* Fixed multiprocessing race condition

* fix aiter imports and remove unused M_shard

* fix inefficient mem allocation

* fix format issue

* completely silent iris log when not using iris

* fix the import crash

* move triton_comms README

* add triton comms info to README.md

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Muhammad Awad <112003944+mawad-amd@users.noreply.github.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: Zhu Yuhua <yuhzhu@amd.com>
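[Editorial sketch] Several commits above add a `calculate_heap_size` helper that sizes the symmetric heap automatically from M and N. A hypothetical version of such a helper is sketched below; the signature, defaults, and formula are assumptions for illustration, not aiter's actual implementation:

```python
def calculate_heap_size(M, N, dtype_bytes=2, world_size=8,
                        num_buffers=2, alignment=1 << 21):
    """Rough upper bound on heap bytes: per-rank shards of an (M, N)
    tensor, a few staging buffers, rounded up to a 2 MiB alignment."""
    payload = M * N * dtype_bytes           # full tensor in bytes
    per_rank = -(-payload // world_size)    # ceil-divide across ranks
    total = per_rank * num_buffers * world_size
    return -(-total // alignment) * alignment  # round up to alignment
```

The point of such a helper is that callers pass only the problem shape and the heap is guaranteed large enough for the fused rs_rmsnorm_quant_ag pipeline's intermediate buffers.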
* refactor fmoe profile data, and log

* Update fused_moe.py
* add pa_decode_triton_fp8_gluon kernel and perf benchmark

* enhance code readability, make the code more elegant

* add gluon version paged attention decode and test

- Create __init__.py for aiter.ops.triton.gluon module
- Add test_paged_attention_decode_gluon.py to test gluon FP8 implementation
- Test compares performance between assembly and gluon kernels

* add per_tensor quant test for gluon version PA

* fix per_tensor quant bug of paged_attention_decode_v2_gluon_large_block_fp8

* use different reduce code paths for Triton 3.4 and Triton 3.5 to avoid the performance regression introduced by Triton 3.5

* fix bug in varlen KV use; fix per_token and per_tensor bugs; add torch_mha_extend_flashattn_style, which is consistent with the Triton kernel, to compare diffs against it

* rm useless code

* add pa_gluon AOT compile flow and unittest

* fix pa_gluon AOT compile flow bug

* remove xxx_POW2 params of PA triton kernels && add separate test for pa_decode_attention_kernel AOT compile

* modify pa kernel interface && fix add compute_type param problem in AOT mode

* merge pa_decode attention and reduce kernel into one C++ wrapper

* rename some variable && add paged_attention_decode_v2_reduce_kernel_triton34 into pa_decode_triton_gluon_fp8.py and remove pa_decode_triton_gluon_fp8_triton34.py

* rename file

* change the triton version comparison logic

* format file

* fix the issue of pa gluon performance decline in triton AOT mode

* add assertions to all tests pass in test_pa_decode_gluon.py

* fix per-token quant bug

* add support for COMPUTE_TYPE: bf16,fp16 && QUANT_Q_AND_KV: (False,False),(False,True)

* AOT mode support COMPUTE_TYPE: bf16,fp16 && QUANT_Q_AND_KV: (False,False),(False,True)

* add script to build PA AOT so

* the script to build PA AOT so support multiprocessing

* add loop version of paged_attention_decode_v2_reduce_kernel to remove MAX_CONTEXT_PARTITION_NUM and reduce the .so count in AOT mode

* reduce kernel templates

* modify build PA AOT so options

* test support sample_rate to save time

* fix bug of building PA AOT so with multiprocessing

* refactor: generalize gluon kernel && implement build cache cleanup with .so file reporting

* feat: add backward compatibility for Triton without gluon
Gracefully handle missing triton.experimental.gluon imports by adding try-except blocks and runtime checks. Add simplified test configuration.

* optimize paged_attention_decode_v2_reduce_kernel in long context scenarios

* fix deadlock bug of building PA AOT so with multiprocessing && refine unit test

* disable bf16 and fp16 support for the time being due to precision problems

* Add Gluon transpose kernels for query and output in paged attention decode

- Add transpose_query_gluon_kernel to transpose query from
  [batch*seq, num_heads, head_size] to Gluon format
- Add transpose_output_gluon_kernel to transpose output back
  to standard format
- Support both JIT and AOT compilation modes
- Update pa_decode_gluon API to handle transposition internally
- Add unit tests and prebuild scripts for transpose kernels

* ali test

* use gl.int64 to hold kv_cache offsets to avoid overflow when the kv_cache's shape is too large

* clean cache file after aot compile function, add doc for func pa_decode_gluon and pa_decode_gluon_aot

* remove paged_attention_decode_v2_reduce_kernel_triton34, PA kernel can only run in triton 3.5.0 or higher

* rename PA dot kernel

* Add arch assertion to restrict pa_decode_gluon to gfx942 (CDNA3)

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
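[Editorial sketch] One commit above switches kv_cache offsets to `gl.int64` to avoid overflow on large shapes. The failure mode is easy to demonstrate outside Triton: a flattened offset computed in 32-bit arithmetic wraps once it exceeds 2**31 - 1, while 64-bit arithmetic stays correct. The block/stride values below are made up for illustration:

```python
import numpy as np

block_idx = np.array([70_000], dtype=np.int32)
block_stride = 40_000  # elements per block (hypothetical)

# 32-bit product wraps silently: 70_000 * 40_000 = 2.8e9 > 2**31 - 1
off32 = block_idx * block_stride

# Promoting to 64-bit before multiplying gives the true offset
off64 = block_idx.astype(np.int64) * block_stride
```

This is why holding offsets in 64-bit integers matters once the kv_cache grows past a couple of billion elements.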
* mdf package_name to amd-aiter

* update

* update
…8 and mtp>=1 (ROCm#1626)

* introduce a new solution which can reduce #splits

rev 1.1

1.2

* Revert kFixedOverheadNumBlocks

* revert attention.py
* fix ci triton error

* rebase
* Move triton_tests into folders

* fix import errors

* formatting fix

* remove misc folder

* Update op_tests/triton_tests/quant/test_fused_mxfp4_quant.py - copilot suggestion

* remove import workarounds

* fix op_benchmarks imports
gyohuangxin and others added 26 commits January 29, 2026 15:40
* add sageish attn implementation (fav3_sage.py), i.e. attn quantization scheme suited for video generation
* Support gdn ops; remove layernorm_gated.py to pass the Triton test
* draft

* Update CONTRIBUTE.md

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix 1

---------

Co-authored-by: zufayu <zufayu@amd.com>
Co-authored-by: Xin Huang <Xin.Huang@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* mla ps support paged 64 and 3buffer layout for ds3.2
* CI: Fix Sglang dummy_grok 403 issue
* add mha bwd 192_128 causal br asm kernel

* add causal br a16 kernel

* fix kernel name

* fix_a16_kernel_nan
* CI: Fix docker ratelimit issues

* Fix
* enhance cast

* update WIP fp4

* update

* remove std dependency for integral based static_for

* update

* let gfx950 use the instruction-based cast f32->bf16 via static_cast
Co-authored-by: Lingpeng Jin <103567126+valarLip@users.noreply.github.com>
* mla ps support paged 64 and 3buffer layout for ds3.2

* fix the github-actions

* upload kernel

* fix the comments

* modify the scale values
Signed-off-by: Double Young <yang.yang2@amd.com>
@JaxChen29 JaxChen29 requested a review from a team March 2, 2026 12:06
@JaxChen29 JaxChen29 marked this pull request as draft March 2, 2026 12:06
@valarLip valarLip closed this Mar 18, 2026
