
Releases: NVIDIA/cudnn-frontend

v1.23.0-release

29 Apr 18:04
fb682ce

cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.

cudnn-frontend now ships pip wheels for Python 3.14t (the free-threaded build).

New APIs 🚀 🚀

Causal Conv1d

  • Depthwise causal 1-D convolution with an optional fused SiLU activation (requires cuDNN 9.22.0): y = activation(conv1d_causal(x, w) + b). Supports forward and backward passes with torch.autograd and torch.compile. (Not yet supported on Windows.)
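As a sketch of the causal semantics, here is a NumPy reference of the formula above (a reference for the math only, not the cuDNN kernel):

```python
import numpy as np

def silu(z):
    # SiLU (Swish): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def causal_conv1d_ref(x, w, b=None, activation=None):
    """Depthwise causal 1-D convolution.

    x: (C, T) input, w: (C, K) per-channel filters, b: (C,) optional bias.
    Output y[c, t] depends only on x[c, t-K+1 .. t] (zero-padded on the left).
    """
    C, T = x.shape
    K = w.shape[1]
    xp = np.pad(x, ((0, 0), (K - 1, 0)))              # left-pad for causality
    y = np.stack([
        np.convolve(xp[c], w[c][::-1], mode="valid")  # reversed kernel = correlation
        for c in range(C)
    ])
    if b is not None:
        y += b[:, None]
    return activation(y) if activation is not None else y
```

For example, x = [[1, 2, 3]] with w = [[1, 1]] yields [[1, 3, 5]]: each output step sums the current and previous sample only.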

Updates to Graph API

Transpose (requires cuDNN 9.22.0)

  • Added a new Graph::transpose with Transpose_attributes (permutation, optional compute dtype, name).

Slice (requires cuDNN 9.22.0)

  • Extended Slice_attributes with set_strides for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
  • Python: pygraph.slice now honors each dimension's slice.step.
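The inferred-shape arithmetic for a strided slice can be sketched as follows (the helper names are illustrative, not part of the frontend API):

```python
def strided_slice_extent(start, stop, step):
    """Number of elements selected by slice(start, stop, step) with step >= 1."""
    return max(0, -(-(stop - start) // step))  # ceil((stop - start) / step)

def strided_slice_meta(shape, strides, spec):
    """Inferred output shape and strides for a per-axis (start, stop, step) spec."""
    out_shape = [strided_slice_extent(b, e, s) for (b, e, s) in spec]
    out_strides = [st * s for st, (_, _, s) in zip(strides, spec)]
    return out_shape, out_strides
```

For a row-major (8, 10) tensor with strides (10, 1), slicing with steps 2 and 3 over ranges [0, 8) and [1, 10) gives output shape [4, 3] and strides [20, 3].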

Concatenate (requires cuDNN 9.22.0)

  • Extended Concatenate_attributes with an optional set_in_place_index. When unset, concatenation runs out-of-place per backend rules.

Reshape (requires cuDNN 9.22.0)

  • Introduced ReshapeMode_t (VIEW_ONLY, LOGICAL) and Reshape_attributes::set_reshape_mode, so reshapes can select a view-style or a lexicographic logical reshape.
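The distinction mirrors NumPy's behavior, where reshaping a contiguous tensor is a pure view while a non-contiguous input must be re-read (copied) in lexicographic, row-major order. As an analogy only, not the cuDNN semantics verbatim:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # contiguous, row-major
t = a.T                          # transposed: strided, non-contiguous

# View-style reshape: contiguous input, no data movement.
v = a.reshape(3, 2)
assert np.shares_memory(a, v)

# Logical reshape: elements re-read in lexicographic (row-major) order,
# which for a non-contiguous input requires a copy.
l = t.reshape(6)
assert not np.shares_memory(t, l)
assert l.tolist() == [0, 3, 1, 4, 2, 5]
```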

Compile-time constants (requires cuDNN 9.22.0)

  • Added cudnn.scalar_type (RUNTIME_PARAM, COMPILE_TIME_CONST) and Graph::tensor(scalar, ScalarType) overloads, so scalars can be execution-time variant-pack inputs or constants embedded in the plan.
  • Tensor_attributes can be marked as a compile-time constant or as a normal runtime pass-by-value scalar.

Open source kernels 🚀 🚀

  • GEMM + sReLU: High-performance implementation of squared-ReLU fused with GEMM.
  • GEMM + dsReLU: High-performance implementation of dsquared-ReLU fused with GEMM.
  • Grouped GEMM + GLU + Hadamard: Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
  • Grouped GEMM + sReLU: Contiguous grouped squared-ReLU GEMM for MoE workloads.
  • Grouped GEMM + dsReLU: Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
  • RMSNorm + RHT + amax: A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
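A NumPy sketch of the fused RMSNorm + RHT + amax pattern, assuming orthonormal 1/sqrt(16)-scaled Hadamard blocks and a small eps (the kernel's exact scaling conventions may differ):

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def rmsnorm_rht_amax(x, g, eps=1e-6, block=16):
    """RMSNorm -> block-diagonal (block=16) Hadamard transform -> amax."""
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * g
    H = hadamard(block) / np.sqrt(block)          # orthonormal 16x16 blocks
    yb = y.reshape(*y.shape[:-1], -1, block) @ H  # block-diagonal transform
    out = yb.reshape(y.shape)
    return out, np.abs(out).max()                 # amax reduction
```

For an all-ones input of width 16 with unit gains, normalization leaves the values at ~1 and the Hadamard block concentrates each row into its first coefficient (~4), so amax is ~4.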

Fix block-scale quantize
The scale tensor uses a 128x4 reordered layout (TensorReordering_t::F8_128x4). When the reordering type is set on the scale tensor, the frontend will automatically pad the inferred scale dimensions to align with the 128x4 block structure (non-batch, non-axis dimensions are padded to multiples of 128, and the quantize axis dimension is padded to multiples of 4).
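The padding rule above can be sketched as follows (the helper names are illustrative, not frontend API):

```python
def pad_up(v, multiple):
    """Round v up to the nearest multiple."""
    return -(-v // multiple) * multiple

def padded_scale_dims(dims, quantize_axis, batch_axes=(0,)):
    """Scale-tensor dims padded for the 128x4 reordered layout:
    non-batch, non-axis dims -> multiples of 128; quantize axis -> multiples of 4."""
    out = []
    for i, d in enumerate(dims):
        if i in batch_axes:
            out.append(d)                # batch dims are left unpadded
        elif i == quantize_axis:
            out.append(pad_up(d, 4))     # quantize axis -> multiple of 4
        else:
            out.append(pad_up(d, 128))   # remaining dims -> multiple of 128
    return out
```

For example, inferred scale dims [2, 200, 5] with the quantize axis last become [2, 256, 8].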

General Improvements ✨✨

  • Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 to restore the previous M-only dynamic behavior.

  • Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (wgrad_tensor for dense, wgrad_ptrs for discrete).

  • Removed the unused internal c_tensor from the Grouped GEMM quant path.

Bug fix 🐛

  • Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.

  • Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.

Benchmarking 📊

  • Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.

Acknowledgements:

  • Thanks @haowen-han for fixing a bug in the block-scale matmul sample.

v1.22.1-release

10 Apr 17:29
a91f0e0

cuDNN Frontend v1.22.1 is the recommended version for cuDNN 9.20.0 and later releases.

General Improvements 🚀 🚀

  • Introducing a PyTorch custom operator wrapping cuDNN's MoE Grouped GEMM operation.

      ```python
          def moe_grouped_matmul(
              token: torch.Tensor,
              weight: torch.Tensor,
              first_token_offset: torch.Tensor,
              token_index: Optional[torch.Tensor] = None,
              token_ks: Optional[torch.Tensor] = None,
              mode: str = "none",
              top_k: int = 1,
          ) -> torch.Tensor
      ```
    

    See test/python/test_moe_grouped_matmul_op.py for usage.
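For intuition, the contiguous ("dense") semantics can be sketched in NumPy, assuming tokens are pre-sorted by expert and first_token_offset holds E+1 row offsets (an assumption for this sketch; it is a reference for the math only, not the custom op):

```python
import numpy as np

def moe_grouped_matmul_ref(token, weight, first_token_offset):
    """Reference semantics for a contiguous MoE grouped GEMM.

    token:  (T, K) tokens, sorted so each expert's rows are contiguous.
    weight: (E, K, N) one weight matrix per expert.
    first_token_offset: (E + 1,) offsets; expert e owns rows [off[e], off[e+1]).
    """
    T, K = token.shape
    E, _, N = weight.shape
    out = np.zeros((T, N), dtype=token.dtype)
    for e in range(E):
        lo, hi = first_token_offset[e], first_token_offset[e + 1]
        out[lo:hi] = token[lo:hi] @ weight[e]   # one GEMM per expert
    return out
```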

  • 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃

Open-Source Kernels 🚀 🚀

  • Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL. Available through the torch op above or callable as a standalone API; see the samples for usage. Requires nvidia-cutlass-dsl[cu13]==4.4.1

Acknowledgements:

The Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.

v1.22.0-release

03 Apr 02:24
97f6cb3

cuDNN Frontend v1.22.0 Release Notes

cuDNN Frontend v1.22.0 is the recommended version for cuDNN 9.20.0 and later releases.

General Improvements 🚀 🚀

  • Introducing a PyTorch custom operator wrapping cuDNN's Scaled Dot-Product Attention (SDPA), with scaled_dot_product_attention as the public entry point, closely
    matching the signature of torch.nn.functional.scaled_dot_product_attention.

      ```python
      def scaled_dot_product_attention(
          query: torch.Tensor,
          key: torch.Tensor,
          value: torch.Tensor,
          attn_mask: Optional[torch.Tensor] = None,
          dropout_p: float = 0.0,
          is_causal: bool = False,
          scale: Optional[float] = None,
          enable_gqa: bool = False,
          *,
          diagonal_alignment: int = 0,
          left_bound: int = -1,
          right_bound: int = -1,
          seq_len_q: Optional[torch.Tensor] = None,
          seq_len_kv: Optional[torch.Tensor] = None,
          cumulative_seq_len_q: Optional[torch.Tensor] = None,
          cumulative_seq_len_kv: Optional[torch.Tensor] = None,
      ) -> torch.Tensor:
      ```
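The underlying math is standard scaled dot-product attention, softmax(QKᵀ·scale)V with an optional causal mask. A minimal single-head NumPy reference of those semantics (a sketch of the math only; the custom op runs on cuDNN and adds the bound/ragged arguments above):

```python
import math
import numpy as np

def sdpa_ref(q, k, v, is_causal=False, scale=None):
    """softmax(q @ k^T * scale) @ v for a single head; q, k, v: (S, D)."""
    scale = scale if scale is not None else 1.0 / math.sqrt(q.shape[-1])
    s = (q @ k.T) * scale
    if is_causal:
        sq, sk = s.shape
        # Mask out future positions (strictly above the diagonal).
        s = np.where(np.tril(np.ones((sq, sk), dtype=bool)), s, -np.inf)
    p = np.exp(s - s.max(axis=-1, keepdims=True))   # numerically stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v
```

With is_causal=True, position 0 can attend only to itself, so the first output row equals v[0].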
    
  • Introduced a preindexed execute method that reduces CPU execution overhead.

  • Improved the reproducer tool to report and reproduce SDPA failures for fp8 data types as well.

  • 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃

Open-Source Kernels 🚀 🚀

  • Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL. Available through the torch op above or callable as a standalone API; see the samples for usage. Requires nvidia-cutlass-dsl[cu13]==4.4.1

  • Grouped Gemm + quantize kernels now support dynamic shape and layout. This is controllable via an environment toggle.

  • Grouped GEMM + GLU/SwiGLU now supports optional bias fusion in both dense and discrete modes, including partial-N support and optional bias-gradient generation for discrete backward paths.

Updates:

  • fp8 datatype with packed variable sequences (THD) is no longer supported for SM90 (Hopper) architecture.

  • Fixed an issue where SDPA fp8 failed when used with CUDA Toolkit 12.9.

Acknowledgements:

The Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.

v1.21.0-release

25 Mar 03:18
7b9b711

cuDNN Frontend v1.21.0 Release Notes (#213)

cuDNN Frontend v1.21.0 is the recommended version for cuDNN 9.20.0 and later releases.

General Improvements 🚀

  • Dropped dependency on the CUDA driver API for the frontend library, enabling builds without direct CUDA driver linkage.

Open-Source Kernels

Added new kernels for GEMM fusions:

  • Grouped GEMM + GLU: Unified grouped GEMM GLU API supporting dense and discrete MoE weight layouts with optional bias.
  • Grouped GEMM + dGLU: Unified grouped GEMM dGLU backward API supporting dense and discrete MoE weight layouts with optional bias.
  • Discrete Grouped GEMM + SwiGLU: Per-expert-pointer SwiGLU grouped GEMM for MoE workloads without weight packing.
  • Discrete Grouped GEMM + dSwiGLU: Per-expert-pointer dSwiGLU backward grouped GEMM for MoE workloads without weight packing. Uses the dSwiGLU/dGeGLU backward epilogue.
  • Grouped GEMM + dSwiGLU: dSwiGLU activation fused with grouped GEMM.
  • Grouped GEMM + Quant: Grouped GEMM with output quantization for MoE FC2/dFC1 workloads.

v1.20.0 release

16 Mar 18:09
d33027a

cuDNN Frontend v1.20.0 is the recommended version for cuDNN 9.20.0 and later releases.

Open-Source Kernels 🚀 🚀

  • Fused RMSNorm + SiLU: The Fused RMSNorm + SiLU engine implements a single-kernel fusion of RMS normalization followed by SiLU (Swish) activation. It is designed and optimized specifically for the WAN VAE decoder's L2Norm + SiLU pattern on B200, but supports arbitrary problem sizes on SM80 to SM103 GPUs.

Improvements:

  • Allow GEMM + Amax, GEMM + SwiGLU, Grouped GEMM + SwiGLU, Grouped GEMM + dSwiGLU, and NSA kernels to run on GB300.

  • Improve the reproducer tool to report and reproduce SDPA failures.

v1.19.1 release

11 Mar 05:11
7500fd8

cuDNN Frontend v1.19.1 Release

Pinned the pybind version to prevent failures with older versions.

Restored support for the CUDA 12 toolkit, which was accidentally dropped in the 1.19.0 release.

v1.19.0-release

09 Mar 17:35
df73764

cuDNN Frontend v1.19.0 Release Notes

cuDNN Frontend v1.19.0 is the recommended version for cuDNN 9.19.1 and later releases.

Open-Source Kernels 🚀 🚀

  • Blackwell and Hopper SDPA Fprop Kernels: cuDNN's SDPA Fprop implementation is now open source. This kernel supports causal masking and outputs stats for use in bprop. Additional kernels will be added in future releases.
  • Grouped GEMM + dSwiGLU Fusion: A contiguous grouped block-scaled GEMM fused with a dSwiGLU backward epilogue on NVIDIA Blackwell GPUs (SM100+), designed for MoE (Mixture of Experts) workloads.

General Improvements 🚀

  • Removed multiple device queries for SM version during graph validation and replaced with a single query that can be skipped by setting sm_version on the cuDNN graph.
  • Fixed an issue where enabling logging with CUDA graphs in certain scenarios would cause a crash.
  • Significantly reduced the CPU overhead of the cuDNN OSS API by using tvm-ffi.
  • We are adding a new cudnn-repro tool that produces a standalone reproducer from the cuDNN frontend logs. See details

Enhancements ✨

Scaled Dot-Product Attention (SDPA)

  • Support Checks: Improved support checks for cleaner support surface queries.
  • New API: Added Python bindings for score-mod bprop function to enable the score bprop API.
  • Stats: Support independent generation of SDPA stats (LSE, SE, Max) in sdpa fprop (Requires 9.20.0 and up).

Normalization

  • More Benchmarks: New normalization benchmark results posted for GB200, GB300, and H200.

Benchmarking 📊

  • Updated the benchmark results for the SDPA improvements added in cuDNN 9.19.1

v1.18.0-release

27 Jan 23:08
b8c0656

cuDNN Frontend v1.18.0 Release Notes

cuDNN Frontend v1.18.0 is the recommended version for cuDNN 9.18.1 and later releases.

General Improvements 🚀

  • Moved away from internally using the v0.x API; the cuDNN backend API is now called directly.
  • Reduced execution overhead by caching repeated graph queries.

Open-Source Kernels

New open-source kernel for Grouped GEMM + SwiGLU fusion.

Enhancements ✨

Scaled Dot-Product Attention (SDPA)

  • New Features: Added support for dynamic shapes for fprop, which helps reduce graph rebuilding across different batch and sequence lengths.

  • Support Surface:

    • Now allows deterministic bprop for SDPA
    • Added support for bprop for ragged tensors on A100
  • More samples:

    • Open-sourced our SDPA test harness, showcasing additional testing for determinism and fp8 sizes for MLA.
    • Added samples to showcase chunked prefill.

Mixture of Experts (MoE)

  • New API: Added support for moe_grouped_matmul. See cpp sample and documentation for API reference.

Benchmarking 📊

  • Updated the benchmark results for the SDPA improvements added in cuDNN 9.18.1

v1.17.0-release

20 Dec 00:10
b372d39

cuDNN Frontend v1.17.0 Release Notes

cuDNN Frontend v1.17.0 is the recommended version for cuDNN 9.17.0 and later releases.

New Features 🚀

Open-Source Kernels

  • Native Sparse Attention: The Native Sparse Attention (NSA) module implements native sparse attention as described in the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention". Usage samples for the Blackwell architecture are in test/python/fe_api/nsa

  • Gemm/Swiglu: Gemm_Swiglu now supports block-scaled FP8/FP4 datatypes.
    API changes:

    • Output tensors have been renamed from "C" and "Glu" to "AB12" and "C", respectively.
    • The "use_2cta_intrs" option has been removed; it is now inferred automatically from the tile shape.

Enhancements ✨

Additional Improvements

  • Tensor properties: Added vector Dim and vectorization count to the tensor properties.
  • Graph wrapper: Fixed an issue in the native graph wrapper that caused BufferError with non-PyTorch tensors.

Benchmarking 📊

  • Updated the benchmark results for the SDPA improvements added in cuDNN 9.17.0, including GB200 and GB300 data.

v1.16.1-release

01 Dec 21:33
0258951

What's Changed

  • Find cuDNN libraries with NAMES_PER_DIR for the Python site, by @take-cheeze in #180
  • Don't override user-provided max/sum_exp shape and stride (#181)
  • Fix issues in the warmup function leading to an error in deserialize (#183)

Full Changelog: v1.16.0...v1.16.1