
Add fused RoPE kernel #272

Open
amd-weisun wants to merge 1 commit into ROCm:main from amd-weisun:fused-rope-cache-kernel

Conversation


@amd-weisun amd-weisun commented Mar 23, 2026

Motivation

  • FlyDSL implementation of fused Rotary Position Embedding (RoPE) + KV cache write, replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel; used in GPT-OSS and other models (Qwen3, Llama-3.1)

  • Supports both flash and non-flash (ATOM production default, x-packed) KV cache layouts

  • Computes rotation in native bf16 precision, matching AITER/Triton; cross-validation shows 0 error
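As a rough illustration of what the fused op computes, here is a minimal PyTorch reference sketch. It assumes a NeoX-style rotate-half pairing and a flat per-slot cache write; the actual kernel's lane pairing and cache addressing may differ, and all names here are hypothetical:

```python
import torch

def ref_fused_rope_cache(q, k, v, cos, sin, key_cache, value_cache, slots):
    """Hypothetical PyTorch reference for the fused op: rotate q/k, then
    scatter the rotated k (and v) into the caches at the given slots.
    Rotation is done directly in bf16, mirroring the precision claim above."""
    def rotate_half(x):
        d = x.shape[-1] // 2
        return torch.cat((-x[..., d:], x[..., :d]), dim=-1)

    q_out = q * cos + rotate_half(q) * sin
    k_out = k * cos + rotate_half(k) * sin
    # Flash-style cache write: one cache row per token slot.
    key_cache[slots] = k_out
    value_cache[slots] = v
    return q_out, k_out
```

Keeping every multiply/add in bf16 (rather than upcasting to fp32) is what makes bit-exact cross-validation against the AITER/Triton kernel possible.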

Test Plan

Usage:

# Fast CI — correctness only (GPT-OSS 120B TP=8, 10 tests):
PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# All models × TPs (multi-model sweep):
FLYDSL_ALL_MODELS=1 PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# With benchmarking + optional AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# CLI — all models:
PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

# CLI — with benchmark + AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

Test Result

  • Tested on MI350: 0 numerical error vs PyTorch reference. Performance: 1.4-1.8x faster than Triton (AITER) across all configs (GPT-OSS-120B, Qwen3, Llama-3.1); both layouts verified. Cross-validated against AITER output.
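The "0 numerical error" claim amounts to a max-absolute-difference check between the two kernels' bf16 outputs; a minimal sketch (helper name is hypothetical, not from the PR):

```python
import torch

def cross_validate(flydsl_out, aiter_out):
    """Compare two same-shape, same-dtype outputs and return the max
    absolute difference in fp32. A result of 0.0 means every element
    matches exactly (bit-exact for finite bf16 values)."""
    assert flydsl_out.shape == aiter_out.shape
    assert flydsl_out.dtype == aiter_out.dtype
    return (flydsl_out.float() - aiter_out.float()).abs().max().item()
```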

Submission Checklist

Copilot AI review requested due to automatic review settings March 23, 2026 16:50
@amd-weisun amd-weisun requested review from coderfeli and liligwu March 23, 2026 16:52

Copilot AI left a comment


Pull request overview

Adds a new FlyDSL fused RoPE rotation + KV-cache write kernel (supporting both flash and ATOM non-flash cache layouts) along with a correctness/perf test harness and optional AITER cross-validation to align with GPT-OSS/ATOM usage.

Changes:

  • Introduce build_fused_rope_cache_module() that emits two GPU kernels: Q RoPE, then K RoPE + KV-cache write.
  • Add correctness tests for both cache layouts, plus an optional multi-model sweep and optional AITER perf/cross-check path.
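The two cache layouts differ only in how elements are addressed: the commit message describes flash as [T, BS, KH, D] and the non-flash x-packed layout as [T, KH, D//16, BS, 16]. A small reshape/permute sketch makes the mapping concrete (illustrative only; the kernel computes these offsets directly, and the function name is hypothetical):

```python
import torch

def flash_to_xpacked(key_cache_flash, x=16):
    """Map a flash-layout key cache [T, BS, KH, D] to the x-packed
    non-flash layout [T, KH, D//x, BS, x]: split the head dim into
    D//x groups of x contiguous lanes, then move block-size inward."""
    T, BS, KH, D = key_cache_flash.shape
    assert D % x == 0
    # [T, BS, KH, D] -> [T, BS, KH, D//x, x] -> [T, KH, D//x, BS, x]
    return (key_cache_flash
            .view(T, BS, KH, D // x, x)
            .permute(0, 2, 3, 1, 4)
            .contiguous())
```

Under this mapping, element [t, b, h, d] of the flash cache lands at [t, h, d // x, b, d % x] in the x-packed cache.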

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File descriptions:

  • kernels/fused_rope_cache_kernel.py: New fused RoPE + KV-cache kernel builder using @flyc.kernel + @flyc.jit with flash and non-flash cache layout support.
  • tests/kernels/test_fused_rope_cache.py: New GPU test validating Q/K outputs and KV-cache writes against a PyTorch reference, with optional AITER comparison.


@amd-weisun amd-weisun marked this pull request as draft March 24, 2026 10:44
@amd-weisun amd-weisun force-pushed the fused-rope-cache-kernel branch 3 times, most recently from f5deef8 to e19f417, March 24, 2026 11:16
@amd-weisun amd-weisun requested a review from Copilot March 24, 2026 11:16
@amd-weisun amd-weisun marked this pull request as ready for review March 24, 2026 11:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



@amd-weisun amd-weisun marked this pull request as draft March 24, 2026 12:00
@amd-weisun amd-weisun requested a review from Copilot March 24, 2026 12:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



FlyDSL implementation of fused RoPE rotation + KV cache write,
replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel.

- kernels/fused_rope_cache_kernel.py: Two-kernel design (Q RoPE +
  K RoPE/cache), supports flash [T,BS,KH,D] and non-flash x-packed
  [T,KH,D//16,BS,16] key_cache layouts. Computes rotation in native
  bf16 matching AITER/Triton precision (bit-exact cross-validation).

- tests/kernels/test_fused_rope_cache.py: 10 default + 72 multi-model
  correctness tests, optional AITER perf comparison with cross-validation.
  Cached kernel compilation, vectorized reference, CUDA event timing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@amd-weisun amd-weisun force-pushed the fused-rope-cache-kernel branch from 966355e to 0c60dae, March 24, 2026 14:58
@amd-weisun amd-weisun marked this pull request as ready for review March 24, 2026 15:01
@amd-weisun amd-weisun requested review from coderfeli and liligwu March 24, 2026 15:18
@amd-weisun amd-weisun changed the title from "Add fused RoPE + KV cache kernel" to "Add fused RoPE kernel" Mar 24, 2026