
Add fused RoPE kernel #272

Open
amd-weisun wants to merge 1 commit into ROCm:main from amd-weisun:fused-rope-cache-kernel

Conversation


@amd-weisun amd-weisun commented Mar 23, 2026

Motivation

  • FlyDSL implementation of fused Rotary Position Embedding (RoPE) + KV cache write, replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel; used in GPT-OSS and other models (Qwen3, Llama-3.1)

  • Supports both flash and non-flash (ATOM production default, x-packed) KV cache layouts

  • Computes rotation in native bf16 precision, matching AITER/Triton; cross-validation shows 0 error
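As a rough illustration of what the fused op computes, here is a minimal PyTorch reference sketch. It assumes a NeoX-style rotate-half pairing and a flat per-slot cache write; the actual kernel's lane pairing and cache addressing may differ, and all names here are hypothetical:

```python
import torch

def ref_fused_rope_cache(q, k, v, cos, sin, key_cache, value_cache, slots):
    """Hypothetical PyTorch reference for the fused op: rotate q/k, then
    scatter the rotated k (and v) into the caches at the given slots.
    Rotation is done directly in bf16, mirroring the precision claim above."""
    def rotate_half(x):
        d = x.shape[-1] // 2
        return torch.cat((-x[..., d:], x[..., :d]), dim=-1)

    q_out = q * cos + rotate_half(q) * sin
    k_out = k * cos + rotate_half(k) * sin
    # Flash-style cache write: one cache row per token slot.
    key_cache[slots] = k_out
    value_cache[slots] = v
    return q_out, k_out
```

Keeping every multiply/add in bf16 (rather than upcasting to fp32) is what makes bit-exact cross-validation against the AITER/Triton kernel possible.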

Test Plan

Usage:

# Fast CI — correctness only (GPT-OSS 120B TP=8, 10 tests):
PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# All models × TPs (multi-model sweep):
FLYDSL_ALL_MODELS=1 PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# With benchmarking + optional AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ pytest tests/kernels/test_fused_rope_cache.py -v -s

# CLI — all models:
PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

# CLI — with benchmark + AITER comparison:
FLYDSL_BENCH=1 AITER_REPO=../aiter PYTHONPATH=./ python tests/kernels/test_fused_rope_cache.py --all-models

Test Result

  • Tested on MI350: 0 numerical error vs PyTorch reference. Performance: 1.4-1.8x faster than Triton (AITER) across all configs (GPT-OSS-120B, Qwen3, Llama-3.1); both layouts verified. Cross-validated against AITER output.
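The "0 numerical error" claim amounts to a max-absolute-difference check between the two kernels' bf16 outputs; a minimal sketch (helper name is hypothetical, not from the PR):

```python
import torch

def cross_validate(flydsl_out, aiter_out):
    """Compare two same-shape, same-dtype outputs and return the max
    absolute difference in fp32. A result of 0.0 means every element
    matches exactly (bit-exact for finite bf16 values)."""
    assert flydsl_out.shape == aiter_out.shape
    assert flydsl_out.dtype == aiter_out.dtype
    return (flydsl_out.float() - aiter_out.float()).abs().max().item()
```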

Submission Checklist

Copilot AI review requested due to automatic review settings March 23, 2026 16:50
@amd-weisun amd-weisun requested review from coderfeli and liligwu March 23, 2026 16:52

Copilot AI left a comment


Pull request overview

Adds a new FlyDSL fused RoPE rotation + KV-cache write kernel (supporting both flash and ATOM non-flash cache layouts) along with a correctness/perf test harness and optional AITER cross-validation to align with GPT-OSS/ATOM usage.

Changes:

  • Introduce build_fused_rope_cache_module() that emits two GPU kernels: Q RoPE, then K RoPE + KV-cache write.
  • Add correctness tests for both cache layouts, plus an optional multi-model sweep and optional AITER perf/cross-check path.
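The two cache layouts differ only in how elements are addressed: the commit message describes flash as [T, BS, KH, D] and the non-flash x-packed layout as [T, KH, D//16, BS, 16]. A small reshape/permute sketch makes the mapping concrete (illustrative only; the kernel computes these offsets directly, and the function name is hypothetical):

```python
import torch

def flash_to_xpacked(key_cache_flash, x=16):
    """Map a flash-layout key cache [T, BS, KH, D] to the x-packed
    non-flash layout [T, KH, D//x, BS, x]: split the head dim into
    D//x groups of x contiguous lanes, then move block-size inward."""
    T, BS, KH, D = key_cache_flash.shape
    assert D % x == 0
    # [T, BS, KH, D] -> [T, BS, KH, D//x, x] -> [T, KH, D//x, BS, x]
    return (key_cache_flash
            .view(T, BS, KH, D // x, x)
            .permute(0, 2, 3, 1, 4)
            .contiguous())
```

Under this mapping, element [t, b, h, d] of the flash cache lands at [t, h, d // x, b, d % x] in the x-packed cache.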

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.

File descriptions:

  • kernels/fused_rope_cache_kernel.py: New fused RoPE + KV-cache kernel builder using @flyc.kernel + @flyc.jit with flash and non-flash cache layout support.
  • tests/kernels/test_fused_rope_cache.py: New GPU test validating Q/K outputs and KV-cache writes against a PyTorch reference, with optional AITER comparison.


@amd-weisun amd-weisun marked this pull request as draft March 24, 2026 10:44
@amd-weisun amd-weisun force-pushed the fused-rope-cache-kernel branch 3 times, most recently from f5deef8 to e19f417, March 24, 2026 11:16
@amd-weisun amd-weisun requested a review from Copilot March 24, 2026 11:16
@amd-weisun amd-weisun marked this pull request as ready for review March 24, 2026 11:17

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.



@amd-weisun amd-weisun marked this pull request as draft March 24, 2026 12:00
@amd-weisun amd-weisun requested a review from Copilot March 24, 2026 12:28

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.



FlyDSL implementation of fused RoPE rotation + KV cache write,
replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel.

- kernels/fused_rope_cache_kernel.py: Two-kernel design (Q RoPE +
  K RoPE/cache), supports flash [T,BS,KH,D] and non-flash x-packed
  [T,KH,D//16,BS,16] key_cache layouts. Computes rotation in native
  bf16 matching AITER/Triton precision (bit-exact cross-validation).

- tests/kernels/test_fused_rope_cache.py: 10 default + 72 multi-model
  correctness tests, optional AITER perf comparison with cross-validation.
  Cached kernel compilation, vectorized reference, CUDA event timing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@amd-weisun amd-weisun force-pushed the fused-rope-cache-kernel branch from 966355e to 0c60dae, March 24, 2026 14:58
@amd-weisun amd-weisun marked this pull request as ready for review March 24, 2026 15:01
@amd-weisun amd-weisun requested review from coderfeli and liligwu March 24, 2026 15:18
@amd-weisun amd-weisun changed the title from "Add fused RoPE + KV cache kernel" to "Add fused RoPE kernel" Mar 24, 2026