Conversation
Pull request overview
Adds a new FlyDSL fused RoPE rotation + KV-cache write kernel (supporting both flash and ATOM non-flash cache layouts) along with a correctness/perf test harness and optional AITER cross-validation to align with GPT-OSS/ATOM usage.
Changes:
- Introduce `build_fused_rope_cache_module()`, which emits two GPU kernels: Q RoPE, then K RoPE + KV-cache write.
- Add correctness tests for both cache layouts, plus an optional multi-model sweep and an optional AITER perf/cross-check path.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| `kernels/fused_rope_cache_kernel.py` | New fused RoPE + KV-cache kernel builder using `@flyc.kernel` + `@flyc.jit`, with flash and non-flash cache layout support. |
| `tests/kernels/test_fused_rope_cache.py` | New GPU test validating Q/K outputs and KV-cache writes against a PyTorch reference, with optional AITER comparison. |
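The test validates the kernels against a vectorized reference. A minimal sketch of the rotate-half (NEOX-style) RoPE such a reference might implement; NumPy is used here for illustration (the actual harness uses PyTorch), and the pairing convention is an assumption, not taken from the kernel source:

```python
import numpy as np

def rope_rotate(x, cos, sin):
    """Apply rotate-half (NEOX-style) RoPE to one head vector.

    x:   [D]      query or key vector for a single head
    cos: [D // 2] cosines of position * inverse frequencies
    sin: [D // 2] sines of the same angles
    """
    d2 = x.shape[-1] // 2
    x1, x2 = x[:d2], x[d2:]
    # Each (x1[i], x2[i]) pair is rotated by its own angle, so the
    # transform is norm-preserving.
    return np.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin])
```

Because every pair is rotated by a pure angle, the transform preserves the vector's norm, which makes for a cheap sanity check independent of the pairing convention.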
Force-pushed f5deef8 to e19f417
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
FlyDSL implementation of fused RoPE rotation + KV cache write, replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel.
- kernels/fused_rope_cache_kernel.py: Two-kernel design (Q RoPE + K RoPE/cache write); supports flash [T,BS,KH,D] and non-flash x-packed [T,KH,D//16,BS,16] key_cache layouts. Computes the rotation in native bf16, matching AITER/Triton precision (bit-exact cross-validation).
- tests/kernels/test_fused_rope_cache.py: 10 default + 72 multi-model correctness tests, plus an optional AITER perf comparison with cross-validation. Cached kernel compilation, vectorized reference, CUDA event timing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
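The non-flash x-packed key_cache layout above stores each 16-element chunk of the head dimension contiguously per block slot. A hypothetical NumPy sketch of that index mapping; function names and shapes are illustrative, not the kernel's actual API:

```python
import numpy as np

def write_key_xpacked(key_cache, key, slot, x=16):
    """Scatter one token's key heads into an x-packed cache.

    key_cache: [num_blocks, KH, D // x, block_size, x]
    key:       [KH, D] for a single token
    slot:      flat slot index = block * block_size + offset
    """
    _, kh, d_over_x, block_size, _ = key_cache.shape
    block, offset = divmod(slot, block_size)
    for h in range(kh):
        for d in range(d_over_x * x):
            # element d of the head dim lands in chunk d // x, lane d % x
            key_cache[block, h, d // x, offset, d % x] = key[h, d]

def read_key_xpacked(key_cache, slot, x=16):
    """Gather the [KH, D] key for one slot back out of the packed cache."""
    _, kh, d_over_x, block_size, _ = key_cache.shape
    block, offset = divmod(slot, block_size)
    return key_cache[block, :, :, offset, :].reshape(kh, d_over_x * x)
```

A round-trip through write then read should recover the original key exactly, which is the property a layout test can assert.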
Force-pushed 966355e to 0c60dae
Motivation
FlyDSL implementation of fused Rotary Position Embedding (RoPE) + KV cache write, replacing AITER's Triton fused_qk_rope_reshape_and_cache kernel, which is used in GPT-OSS and several other models (Qwen3, Llama-3.1).
Supports both flash and non-flash (ATOM production default, x-packed) KV cache layouts.
Computes the rotation in native bf16 precision, matching AITER/Triton; cross-validation shows zero error (bit-exact).
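Bit-exactness against the Triton kernel depends on rounding every intermediate product to bf16, not only the final result. A small NumPy demonstration of why, with bf16 round-to-nearest-even emulated via bit manipulation since NumPy has no bfloat16 dtype (all names here are illustrative):

```python
import numpy as np

def to_bf16(a):
    """Round a float32 array to bfloat16 (round-to-nearest-even),
    keeping the result stored as float32. Pure emulation."""
    bits = np.asarray(a, dtype=np.float32).view(np.uint32).astype(np.uint64)
    # add 0x7FFF plus the LSB of the kept half, then truncate to 16 bits
    rounded = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return rounded.astype(np.uint32).view(np.float32)

def rotate_step_bf16(x1, x2, c, s):
    # every intermediate product rounded to bf16, as a native-bf16 kernel does
    return to_bf16(to_bf16(x1 * c) - to_bf16(x2 * s))

def rotate_step_f32(x1, x2, c, s):
    # computed in float32 and rounded once at the end
    return to_bf16(x1 * c - x2 * s)
```

The two schemes agree to within a couple of bf16 ulps but are not bitwise identical, which is why a reference that computes in fp32 generally cannot be compared bit-exactly against a bf16 kernel.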
Test Plan
Usage:
Test Result
Submission Checklist