Releases: NVIDIA/cudnn-frontend
v1.23.0-release
cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.
cudnn-frontend now has pip wheels for Python 3.14t.
New APIs 🚀 🚀
Causal Conv1d
- Depthwise causal 1-D convolution with optional fused SiLU activation (requires cuDNN 9.22.0): `y = activation(conv1d_causal(x, w) + b)`. Supports forward and backward passes with `torch.autograd` and `torch.compile`. (Not yet supported on Windows.)
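A minimal PyTorch reference of the computation, for clarity only (the shipped op is the fused cuDNN kernel; the channels-first shape convention below is an assumption):

```python
import torch
import torch.nn.functional as F

def causal_conv1d_reference(x, w, b=None, activation=None):
    # x: (batch, channels, seqlen), w: (channels, kernel_size), b: (channels,)
    channels, kernel_size = w.shape
    # Left-pad so each output position only sees current and past inputs.
    x_padded = F.pad(x, (kernel_size - 1, 0))
    # Depthwise: one filter per channel (groups == channels).
    y = F.conv1d(x_padded, w.unsqueeze(1), bias=b, groups=channels)
    return F.silu(y) if activation == "silu" else y

x, w, b = torch.randn(2, 8, 16), torch.randn(8, 4), torch.randn(8)
assert causal_conv1d_reference(x, w, b, activation="silu").shape == x.shape
```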
Updates to Graph API
Transpose (requires cuDNN 9.22.0)
- Added new `Graph::transpose` with `Transpose_attributes` (permutation, optional compute dtype, name).
Slice (requires cuDNN 9.22.0)
- Extend `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step` (see the sketch below).
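As a sanity check on the semantics, a per-axis step changes the inferred output extent the same way a Python slice does (illustrative helper only, not a cudnn-frontend API):

```python
def sliced_extent(start, stop, step):
    # ceil((stop - start) / step) for a positive step
    return max(0, -(-(stop - start) // step))

# Slicing 10 elements as [1:9:2] yields 4 elements, matching Python:
assert sliced_extent(1, 9, 2) == len(list(range(10))[1:9:2]) == 4
```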
Concatenate (requires cuDNN 9.22.0)
- Extend `Concatenate_attributes` with `set_in_place_index` (optional). When unset, concatenate runs out-of-place per backend rules.
Reshape (requires cuDNN 9.22.0)
- Introduce `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so reshapes can select view-style vs lexicographic logical reshape.
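The distinction is analogous to PyTorch's `view` vs `reshape` (an analogy for illustration; the exact cuDNN semantics are governed by `ReshapeMode_t`):

```python
import torch

x = torch.arange(6).reshape(2, 3).t()  # non-contiguous (3, 2) tensor
# View-style reshape reinterprets strides only and fails when no valid
# stride-only reinterpretation exists:
try:
    x.view(6)
except RuntimeError:
    pass
# A logical (lexicographic) reshape reads elements in logical order and
# may materialize a copy:
y = x.reshape(6)  # tensor([0, 3, 1, 4, 2, 5])
```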
Compile-time constants (requires cuDNN 9.22.0)
- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be execution-time variant-pack inputs or constants embedded in the plan. `Tensor_attributes` can be marked as a compile-time constant or a normal runtime pass-by-value scalar.
Open source kernels 🚀 🚀
- GEMM + sReLU: High-performance implementation of squared-ReLU fused with GEMM.
- GEMM + dsReLU: High-performance implementation of dsquared-ReLU fused with GEMM.
- Grouped GEMM + GLU + Hadamard: Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
- Grouped GEMM + sReLU: Contiguous grouped squared-ReLU GEMM for MoE workloads.
- Grouped GEMM + dsReLU: Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
- RMSNorm + RHT + amax: A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA `amax` reduction.
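For reference, the squared-ReLU activations named in the list above have these standard definitions (a plain PyTorch sketch; the kernels fuse them into the GEMM epilogue):

```python
import torch

def srelu(x):
    # squared-ReLU: relu(x)**2
    return torch.relu(x) ** 2

def dsrelu(grad, x):
    # backward of squared-ReLU: d/dx relu(x)**2 = 2 * relu(x)
    return grad * 2.0 * torch.relu(x)
```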
Fix block-scale quantize
The scale tensor uses a 128x4 reordered layout (TensorReordering_t::F8_128x4). When the reordering type is set on the scale tensor, the frontend will automatically pad the inferred scale dimensions to align with the 128x4 block structure (non-batch, non-axis dimensions are padded to multiples of 128, and the quantize axis dimension is padded to multiples of 4).
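A sketch of the padding rule described above (an illustrative helper, not a cudnn-frontend API; the batch-axis bookkeeping is an assumption):

```python
def pad_scale_dims(dims, quantize_axis, batch_axes=(0,)):
    def round_up(v, m):
        return ((v + m - 1) // m) * m
    return [
        d if i in batch_axes
        else round_up(d, 4) if i == quantize_axis  # quantize axis -> multiple of 4
        else round_up(d, 128)                      # other dims -> multiple of 128
        for i, d in enumerate(dims)
    ]

# e.g. an inferred (1, 100, 3) scale shape, quantized along the last axis,
# pads to (1, 128, 4) to align with the 128x4 block structure:
assert pad_scale_dims([1, 100, 3], quantize_axis=2) == [1, 128, 4]
```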
General Improvements ✨✨
- Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior (see the snippet below).
- Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).
- Removed the unused internal `c_tensor` from the Grouped GEMM quant path.
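For example, to opt out from Python (setting the variable before the wrappers are first used; the timing requirement is an assumption, as environment toggles are typically read at initialization):

```python
import os

# Restore the previous M-only dynamic behavior for grouped GEMM wrappers.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"
```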
Bug fix 🐛
- Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.
- Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.
Acknowledgements:
- Thanks @haowen-han for fixing a bug in the block-scale matmul sample.
v1.22.1-release
cuDNN Frontend v1.22.1 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator wrapping cuDNN's MoE Grouped GEMM operation:

```python
def moe_grouped_matmul(
    token: torch.Tensor,
    weight: torch.Tensor,
    first_token_offset: torch.Tensor,
    token_index: Optional[torch.Tensor] = None,
    token_ks: Optional[torch.Tensor] = None,
    mode: str = "none",
    top_k: int = 1,
) -> torch.Tensor
```

See test/python/test_moe_grouped_matmul_op.py for usage.
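A hedged usage sketch; the tensor shapes below follow common MoE grouped-GEMM conventions and are assumptions, so treat test/python/test_moe_grouped_matmul_op.py as authoritative:

```python
import torch

num_experts, hidden, inter = 4, 512, 1024
tokens_per_expert = torch.tensor([8, 16, 4, 12])
num_tokens = int(tokens_per_expert.sum())

# Tokens for all experts packed contiguously; one weight matrix per expert.
token = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(num_experts, hidden, inter, device="cuda", dtype=torch.bfloat16)
# Prefix-sum offsets marking where each expert's token block begins.
first_token_offset = torch.zeros(num_experts + 1, dtype=torch.int32, device="cuda")
first_token_offset[1:] = tokens_per_expert.cumsum(0)

out = moe_grouped_matmul(token, weight, first_token_offset)
```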
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
Open-Source Kernels 🚀 🚀
- Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL. Support is added through the torch op above or callable as a standalone API; see the samples for API usage. Requires `nvidia-cutlass-dsl[cu13]==4.4.1`.
Updates:
- `GroupedGemmWgradSm100` and `grouped_gemm_wgrad_wrapper_sm100` expose the grouped GEMM weight-gradient kernel. See grouped_gemm_wgrad.html for the API reference and moe_blockscaled_grouped_gemm_wgrad.py for samples.
Acknowledgements:
The Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.
v1.22.0-release
cuDNN Frontend v1.22.0 Release Notes
cuDNN Frontend v1.22.0 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator wrapping cuDNN's Scaled Dot-Product Attention (SDPA), with `scaled_dot_product_attention` as the public entry point, closely matching the signature of `torch.nn.functional.scaled_dot_product_attention`:

```python
def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attn_mask: Optional[torch.Tensor] = None,
    dropout_p: float = 0.0,
    is_causal: bool = False,
    scale: Optional[float] = None,
    enable_gqa: bool = False,
    *,
    diagonal_alignment: int = 0,
    left_bound: int = -1,
    right_bound: int = -1,
    seq_len_q: Optional[torch.Tensor] = None,
    seq_len_kv: Optional[torch.Tensor] = None,
    cumulative_seq_len_q: Optional[torch.Tensor] = None,
    cumulative_seq_len_kv: Optional[torch.Tensor] = None,
) -> torch.Tensor:
```
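A brief usage sketch; the (batch, heads, seq, head_dim) layout mirrors the torch API and is an assumption here:

```python
import torch

q = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)

# Drop-in call with causal masking, matching the torch-style signature above.
out = scaled_dot_product_attention(q, k, v, is_causal=True)
```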
- Introduce a preindexed execute method that reduces CPU execution overhead.
- Improve the reproducer tool to report and reproduce SDPA failures for fp8 data types as well.
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
Open-Source Kernels 🚀 🚀
- Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL. Support is added through the torch op above or callable as a standalone API; see the samples for API usage. Requires `nvidia-cutlass-dsl[cu13]==4.4.1`.
- Grouped GEMM + quantize kernels now support dynamic shape and layout, controllable via an environment toggle.
- Grouped GEMM + GLU/SwiGLU now support optional bias fusion in both dense and discrete modes, including partial-N support and optional bias-gradient generation for discrete backward paths.
Updates:
- The fp8 datatype with packed variable sequences (THD) is no longer supported on the SM90 (Hopper) architecture.
- Fixed an issue where SDPA fp8 failed when used with CUDA Toolkit 12.9.
Acknowledgements:
The Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.
v1.21.0-release
cuDNN Frontend v1.21.0 Release Notes (#213)
cuDNN Frontend v1.21.0 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀
- Dropped dependency on the CUDA driver API for the frontend library, enabling builds without direct CUDA driver linkage.
Open-Source Kernels
Added new kernels for GEMM fusions:
- Grouped GEMM + GLU: Unified grouped GEMM GLU API supporting dense and discrete MoE weight layouts with optional bias.
- Grouped GEMM + dGLU: Unified grouped GEMM dGLU backward API supporting dense and discrete MoE weight layouts with optional bias.
- Discrete Grouped GEMM + SwiGLU: Per-expert-pointer SwiGLU grouped GEMM for MoE workloads without weight packing.
- Discrete Grouped GEMM + dSwiGLU: Per-expert-pointer dSwiGLU backward grouped GEMM for MoE workloads without weight packing. Uses the dSwiGLU/dGeGLU backward epilogue.
- Grouped GEMM + dSwiGLU: dSwiGLU activation fused with grouped GEMM.
- Grouped GEMM + Quant: Grouped GEMM with output quantization for MoE FC2/dFC1 workloads.
v1.20.0 release
cuDNN Frontend v1.20.0 is the recommended version for cuDNN 9.20.0 and later releases.
Open-Source Kernels 🚀 🚀
- Fused RMSNorm + SiLU: The Fused RMSNorm + SiLU engine implements a single-kernel fusion of RMS normalization followed by SiLU (Swish) activation. It is designed and optimized specifically for the WAN VAE decoder's L2Norm + SiLU pattern on B200, but supports arbitrary problem sizes on SM80 to SM103 GPUs.
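A minimal PyTorch reference of the fused pattern, written unfused for clarity (the engine computes this in a single kernel; the epsilon placement is an assumption):

```python
import torch
import torch.nn.functional as F

def rmsnorm_silu_reference(x, gamma, eps=1e-6):
    # RMS-normalize over the last dimension, scale, then apply SiLU (Swish).
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return F.silu(x * rms * gamma)
```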
Improvements:
- Allow `GEMM + Amax`, `GEMM + SwiGLU`, `Grouped GEMM + SwiGLU`, `Grouped GEMM + dSwiGLU`, and `NSA` kernels to run on GB300.
- Improve the reproducer tool to report and reproduce SDPA failures.
v1.19.1 release
cuDNN Frontend v1.19.1 Release
- Pinned the pybind version to prevent failures with older versions.
- Restored support for the CUDA 12 toolkit that was accidentally dropped in the 1.19.0 release.
v1.19.0-release
cuDNN Frontend v1.19.0 Release Notes
cuDNN Frontend v1.19.0 is the recommended version for cuDNN 9.19.1 and later releases.
Open-Source Kernels 🚀 🚀
- Blackwell and Hopper SDPA Fprop Kernels: cuDNN's SDPA Fprop implementation is now open source. This kernel supports causal masking and outputs stats for use in bprop. Additional kernels will be added in future releases.
- Grouped GEMM + dSwiGLU Fusion: A contiguous grouped block-scaled GEMM fused with a dSwiGLU backward epilogue on NVIDIA Blackwell GPUs (SM100+), designed for MoE (Mixture of Experts) workloads.
General Improvements 🚀
- Removed multiple device queries for SM version during graph validation, replacing them with a single query that can be skipped by setting `sm_version` on the cuDNN graph.
- Fixed an issue where enabling logging with CUDA graphs in certain scenarios would cause a crash.
- Significantly reduced the CPU overhead of the cuDNN OSS API by using tvm-ffi.
- Added a new cudnn-repro tool that builds a standalone reproducer from cuDNN frontend logs; see the documentation for details.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- Support Checks: Improved support checks for cleaner support surface queries.
- New API: Added Python bindings for score-mod bprop function to enable the score bprop API.
- Stats: Support independent generation of SDPA stats (LSE, SE, Max) in SDPA fprop (requires cuDNN 9.20.0 and up).
Normalization
- More Benchmarks: New normalization benchmark results posted for GB200, GB300, and H200.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.19.1
v1.18.0-release
cuDNN Frontend v1.18.0 Release Notes
cuDNN Frontend v1.18.0 is the recommended version for cuDNN 9.18.1 and later releases.
General Improvements 🚀
- Moved away from internally using the v0.x API; the cuDNN backend API is now called directly.
- Reduced execution overhead by caching repeated graph queries.
Open-Source Kernels
New open-source kernel for Grouped GEMM + SwiGLU fusion.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- New Features: Added support for dynamic shapes in fprop, which helps reduce graph rebuilding across different batch and sequence lengths.
- Support Surface:
  - Now allows deterministic bprop for SDPA.
  - Added support for bprop with ragged tensors on A100.
- More samples:
  - Open-sourced our SDPA test harness, showcasing additional testing for determinism and fp8 sizes for MLA.
  - Added samples showcasing chunked prefill.
Mixture of Experts (MoE)
- New API: Added support for `moe_grouped_matmul`. See the C++ sample and documentation for the API reference.
Matmul
- More samples: Open-sourced cuDNN's fuzz testing of matmuls.
Convolution
- More samples: Open-sourced cuDNN's fuzz testing of convolutions.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.18.1.
v1.17.0-release
cuDNN Frontend v1.17.0 Release Notes
cuDNN Frontend v1.17.0 is the recommended version for cuDNN 9.17.0 and later releases.
New Features 🚀
Open-Source Kernels
- Native Sparse Attention: The Native Sparse Attention (NSA) module implements native sparse attention as described in the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention". Usage samples for the Blackwell architecture are in test/python/fe_api/nsa.
- Gemm/Swiglu: `Gemm_Swiglu` now supports block-scaled FP8/FP4 datatypes. API changes:
  - Output tensors have been renamed from "C" and "Glu" to "AB12" and "C", respectively.
  - The "use_2cta_intrs" option has been removed; it is now inferred automatically from the tile shape.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- More samples: Open-sourcing our SDPA test harness and fp8 samples in test/python/test_sdpa_fp8.py.
Additional Improvements
- Tensor properties: Added vector Dim and vectorization count to the tensor properties.
- Graph wrapper: Fixed an issue in the native graph wrapper that caused `BufferError` with non-PyTorch tensors.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.17.0, including GB200 and GB300 data.
Samples
- **cuDNN Llama model**: Added a reference implementation of the Llama model entirely in cuDNN.
v1.16.1-release
What's Changed
- Find cudnn libraries with NAMES_PER_DIR for python site by @take-cheeze in #180
- Don't override if users provide max/sum_exp shape and stride #181
- Fix issues in warmup function leading to error in deserialize #183
New Contributors
- @take-cheeze made their first contribution in #180
Full Changelog: v1.16.0...v1.16.1