Tune a8w8_blockscale_bpreshuffle_tuned_gemm for gfx942 (MI308X) #2172

Closed
chuanbowang2026 wants to merge 24 commits into ROCm:main from chuanbowang2026:tune/a8w8-blockscale-bpreshuffle-gfx942-mi308x
Conversation

@chuanbowang2026
Contributor

  • Tune a8w8_blockscale_bpreshuffle_tuned_gemm for gfx942 (MI308X)
  • Update tuned/untuned CSV configs
  • Update tuner splitK behavior for ASM
  • Ran the tuner and checked the output CSV format.

@chuanbowang2026 chuanbowang2026 requested a review from a team March 4, 2026 10:06
@chuanbowang2026
Contributor Author

Addressed, removed redundant local definitions and imported from aiter.jit.core directly.

@valarLip
Collaborator

valarLip commented Mar 4, 2026

@chuanbowang2026 please resolve the failed CI.

@chuanbowang2026
Contributor Author

Addressed CI lint failures from black and ruff in gemm_a8w8_blockscale_bpreshuffle_tune.py.

Changes in this update:

  • Added file-level # ruff: noqa: E402 to allow intentional import ordering (we modify sys.path before module imports).
  • Removed unused local variables reported by Ruff (F841) in tune().
  • Cleaned up comment style and removed stale commented-out lines.
  • Applied Black formatting only.

No functional tuning logic was changed; this is a lint/format cleanup to pass CI.

@junxiaguo junxiaguo requested review from DDEle and yzhou103 and removed request for DDEle March 5, 2026 07:58
@yzhou103
Contributor

yzhou103 commented Mar 5, 2026

aiter/csrc/ck_gemm_a8w8_blockscale_bpreshuffle/gen_instances.py should be updated; you can refer to gen_instances.py in ck_gemm_a88_bpreshuffle.
We should filter ASM solutions out when generating the CK lookup file.

@chuanbowang2026
Contributor Author

Thank you, this has been completed:

  • Set asm_kernel_id to start from 0 in gemm_a8w8_blockscale_bpreshuffle_tune.py.
  • In gen_instances.py, filter tuned results with libtype == "ck" before building the CK lookup table, so ASM solutions are excluded from CK lookup generation.
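The filtering step described above can be sketched as follows. The row fields (`libtype`, `kernel_id`, and the M/N/K key) are assumptions based on this discussion, not the exact schema of the tuned CSV:

```python
# Hypothetical tuned-result rows, as they might be parsed from the tuned CSV.
tuned_rows = [
    {"M": 1,  "N": 7168, "K": 2048, "kernel_id": 3,   "libtype": "ck"},
    {"M": 16, "N": 7168, "K": 2048, "kernel_id": 101, "libtype": "asm"},
    {"M": 32, "N": 7168, "K": 2048, "kernel_id": 7,   "libtype": "ck"},
]

# Keep only CK-tuned rows before building the CK lookup table, so ASM
# solutions never leak into the generated CK instance list.
ck_rows = [r for r in tuned_rows if r["libtype"] == "ck"]
ck_lookup = {(r["M"], r["N"], r["K"]): r["kernel_id"] for r in ck_rows}
```

The key point is that the filter runs before lookup generation, so an ASM `kernel_id` (which indexes a different kernel table entirely) can never be misread as a CK instance index.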

@amd-ruitang3
Contributor

Hi @chuanbowang2026, I committed "add_mi355_tuned".

@amd-ruitang3 amd-ruitang3 force-pushed the tune/a8w8-blockscale-bpreshuffle-gfx942-mi308x branch from 11cff67 to 7a64d88 Compare March 6, 2026 12:36
@valarLip
Collaborator

valarLip commented Mar 7, 2026

let's split the config into per-model CSVs

amd-ruitang3 and others added 8 commits March 7, 2026 09:47
- Add a8w8_blockscale_bpreshuffle_tuned_gemm_dsv3.csv (MI308/MI355 tuned configs)
- Add a8w8_blockscale_bpreshuffle_untuned_gemm_dsv3.csv (M,N,K shapes for tuning)
- Update a8w8_blockscale_bpreshuffle_tuned/untuned_gemm.csv
- Keep a8w8_blockscale_bpreshuffle_tuned_gemm_dsv3.csv (MI308/MI355 DSv3 results)
- Keep tuned/untuned_gemm.csv with headers for config merge compatibility
- Add headers to root tuned/untuned for model_configs merge to work
Auto-select cu80/cu256 tuned and untuned files for blockscale bpreshuffle tuning and codegen so each machine only consumes its own config set.
@chuanbowang2026
Contributor Author

Split DSV3 blockscale bpreshuffle tuned/untuned configs into cu80 and cu256 variants, and updated tuning/codegen to auto-select the machine-specific config by CU count. Because the previously tuned results were already concatenated, the untuned shapes had to be run separately, which caused the 1-hour CI timeout.
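The CU-count-based auto-selection might look like the following sketch. `get_cu_count`, `select_config`, and the fallback default are illustrative names (not the actual functions in this PR); the CU counts come from the cu80/cu256 file naming above, and `multi_processor_count` is PyTorch's device property for CU/SM count on ROCm:

```python
import os


def get_cu_count(default: int = 80) -> int:
    """Query the CU count of device 0, falling back to a default (an
    assumption for this sketch) when no GPU or torch is available."""
    try:
        import torch
        return torch.cuda.get_device_properties(0).multi_processor_count
    except Exception:
        return default


def select_config(base_dir: str, kind: str = "tuned") -> str:
    """Pick the per-machine CSV (e.g. cu80 for MI308X-class parts,
    cu256 for MI355-class), falling back to the merged file."""
    cu = get_cu_count()
    candidate = os.path.join(
        base_dir, f"a8w8_blockscale_bpreshuffle_{kind}_gemm_cu{cu}.csv"
    )
    if os.path.exists(candidate):
        return candidate
    return os.path.join(base_dir, f"a8w8_blockscale_bpreshuffle_{kind}_gemm.csv")
```

Falling back to the merged CSV keeps machines with an unexpected CU count working, which matters since this split flow was later reverted in favor of the merged layout.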

Drop the temporary cu80/cu256 split config flow and restore the merged blockscale bpreshuffle tuning/codegen paths so CI keeps using the original non-retuning config layout.
@chuanbowang2026
Contributor Author

This CI timeout issue has also been observed by others. One suspicion is that MAX_JOBS is set too low for forked PRs, which may cause longer runtimes on some GPUs.
Huang, Xin is currently verifying this.

yzhou103
yzhou103 previously approved these changes Mar 17, 2026
@chuanbowang2026
Contributor Author

Due to commit-history issues, this PR is no longer needed; the work has been transferred to #2366. To prevent data loss, this PR will be closed after the new PR is completed.
