[WIP] [KDA] support GVA for delta_h and fwd_o#73

Open
KevinZeng08 wants to merge 13 commits into main from feat/gva-cutedsl

Conversation

Collaborator

@KevinZeng08 KevinZeng08 commented May 20, 2026

📌 Description

🔍 Related Issues

#55

🚀 Pull Request Checklist

Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing.

⚡ Performance

Reviewer Notes

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the flash-linear-attention baseline to v0.5.0 and introduces support for Grouped Value Attention (GVA) across the chunk_delta_h and fwd_o kernels. The implementation includes updated indexing logic to map value heads to QK heads and extends benchmark scripts to support configurable head counts. Documentation and benchmark results have been refreshed to reflect performance improvements on Blackwell and Hopper architectures. Feedback was provided to include explicit assertions validating that the number of value heads is a multiple of the QK heads and that head dimensions are restricted to 128, as required by the current kernel tiling logic.

Comment thread cula/ops/chunk_delta_h.py
HV = u.shape[2]
V_dim = u.shape[3]
BT = chunk_size
is_varlen = cu_seqlens is not None

medium

For Grouped Value Attention (GVA) to work correctly with the current head-mapping logic (i_h = hidx // (HV // H)), the number of value heads (HV) must be a multiple of the number of QK heads (H). Additionally, since the kernel tiling is hardcoded for specific dimensions, we should also validate that V_dim matches the expected 128.

Suggested change
- is_varlen = cu_seqlens is not None
+ is_varlen = cu_seqlens is not None
+ assert HV >= H and HV % H == 0, f"HV ({HV}) must be >= H ({H}) and divisible by H"
+ assert K_dim == 128 and V_dim == 128, f"current kernel only supports head_dim=128, got K={K_dim}, V={V_dim}"
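
To illustrate why the divisibility check matters, here is a minimal pure-Python sketch of the head mapping quoted in the comment above (i_h = hidx // (HV // H)). The function names validate_gva_shapes and qk_head_for_value_head are hypothetical and are not part of cuLA; the 128 head dimension is taken from the review comment, and the sketch uses plain integers rather than tensors or kernel code.

```python
def validate_gva_shapes(H, HV, K_dim, V_dim, head_dim=128):
    """Hypothetical helper: check the GVA preconditions named in the review.

    Returns the group size, i.e. how many value heads share one QK head.
    """
    if H <= 0 or HV < H or HV % H != 0:
        raise ValueError(f"HV ({HV}) must be >= H ({H}) and divisible by H")
    if K_dim != head_dim or V_dim != head_dim:
        raise ValueError(
            f"only head_dim={head_dim} is supported, got K={K_dim}, V={V_dim}"
        )
    return HV // H


def qk_head_for_value_head(hidx, H, HV):
    """Map a value-head index to its QK-head index: i_h = hidx // (HV // H)."""
    return hidx // (HV // H)


# Valid case: 8 value heads grouped over 4 QK heads, two per group.
H, HV = 4, 8
group = validate_gva_shapes(H, HV, K_dim=128, V_dim=128)
mapping = [qk_head_for_value_head(hidx, H, HV) for hidx in range(HV)]
print(group, mapping)  # 2 [0, 0, 1, 1, 2, 2, 3, 3]

# Invalid case: H=3, HV=4 gives HV // H == 1, so value head 3 would map to
# QK head 3, which is out of range for H=3. The check rejects it up front.
bad_index = qk_head_for_value_head(3, H=3, HV=4)
print(bad_index)  # 3
```

Without the check, a non-divisible head count silently produces an out-of-range QK head index, which in a kernel would mean reading the wrong head or out-of-bounds memory rather than raising an error.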
