Releases: NVIDIA/cudnn-frontend
v1.23.0-release
cuDNN Frontend v1.23.0 is the recommended version for cuDNN 9.21.0 and later releases.
cudnn-frontend now has pip wheels for Python 3.14t.
New APIs 🚀 🚀
Causal Conv1d
- Depthwise causal 1-D convolution with optional fused SiLU activation (requires cuDNN 9.22.0): `y = activation(conv1d_causal(x, w) + b)`. Supports forward and backward passes with `torch.autograd` and `torch.compile`. (Not yet supported on Windows.)
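A minimal PyTorch reference of the computation, for clarity only (the shipped op is the fused cuDNN kernel; the channels-first shape convention below is an assumption):

```python
import torch
import torch.nn.functional as F

def causal_conv1d_reference(x, w, b=None, activation=None):
    # x: (batch, channels, seqlen), w: (channels, kernel_size), b: (channels,)
    channels, kernel_size = w.shape
    # Left-pad so each output position only sees current and past inputs.
    x_padded = F.pad(x, (kernel_size - 1, 0))
    # Depthwise: one filter per channel (groups == channels).
    y = F.conv1d(x_padded, w.unsqueeze(1), bias=b, groups=channels)
    return F.silu(y) if activation == "silu" else y

x, w, b = torch.randn(2, 8, 16), torch.randn(8, 4), torch.randn(8)
assert causal_conv1d_reference(x, w, b, activation="silu").shape == x.shape
```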
Updates to Graph API
Transpose (requires cuDNN 9.22.0)
- Added new `Graph::transpose` with `Transpose_attributes` (permutation, optional compute dtype, name).
Slice (requires cuDNN 9.22.0)
- Extend `Slice_attributes` with `set_strides` for per-axis slice steps; strided slices update the inferred output shape and strides accordingly.
- Python: `pygraph.slice` now honors each dimension's `slice.step` (see the sketch below).
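As a sanity check on the semantics, a per-axis step changes the inferred output extent the same way a Python slice does (illustrative helper only, not a cudnn-frontend API):

```python
def sliced_extent(start, stop, step):
    # ceil((stop - start) / step) for a positive step
    return max(0, -(-(stop - start) // step))

# Slicing 10 elements as [1:9:2] yields 4 elements, matching Python:
assert sliced_extent(1, 9, 2) == len(list(range(10))[1:9:2]) == 4
```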
Concatenate (requires cuDNN 9.22.0)
- Extend `Concatenate_attributes` with `set_in_place_index` (optional). When unset, concatenate runs out-of-place per backend rules.
Reshape (requires cuDNN 9.22.0)
- Introduce `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) and `Reshape_attributes::set_reshape_mode` so reshapes can select view-style vs lexicographic logical reshape.
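The distinction is analogous to PyTorch's `view` vs `reshape` (an analogy for illustration; the exact cuDNN semantics are governed by `ReshapeMode_t`):

```python
import torch

x = torch.arange(6).reshape(2, 3).t()  # non-contiguous (3, 2) tensor
# View-style reshape reinterprets strides only and fails when no valid
# stride-only reinterpretation exists:
try:
    x.view(6)
except RuntimeError:
    pass
# A logical (lexicographic) reshape reads elements in logical order and
# may materialize a copy:
y = x.reshape(6)  # tensor([0, 3, 1, 4, 2, 5])
```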
Compile-time constants (requires cuDNN 9.22.0)
- Added `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and `Graph::tensor(scalar, ScalarType)` overloads, so scalars can be execution-time variant-pack inputs or constants embedded in the plan. `Tensor_attributes` can be marked as a compile-time constant or a normal runtime pass-by-value scalar.
Open source kernels 🚀 🚀
- GEMM + sReLU: High-performance implementation of squared-ReLU fused with GEMM.
- GEMM + dsReLU: High-performance implementation of dsquared-ReLU fused with GEMM.
- Grouped GEMM + GLU + Hadamard: Dense grouped GEMM GLU forward fusion with a fused Hadamard transform and per-expert AMAX reduction.
- Grouped GEMM + sReLU: Contiguous grouped squared-ReLU GEMM for MoE workloads.
- Grouped GEMM + dsReLU: Contiguous and discrete grouped dsquared-ReLU GEMM for MoE workloads.
- RMSNorm + RHT + amax: A fused CUTE DSL kernel for NVIDIA Blackwell GPUs (SM100+) that applies RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA `amax` reduction.
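For reference, the squared-ReLU activations named in the list above have these standard definitions (a plain PyTorch sketch; the kernels fuse them into the GEMM epilogue):

```python
import torch

def srelu(x):
    # squared-ReLU: relu(x)**2
    return torch.relu(x) ** 2

def dsrelu(grad, x):
    # backward of squared-ReLU: d/dx relu(x)**2 = 2 * relu(x)
    return grad * 2.0 * torch.relu(x)
```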
Fix block-scale quantize
The scale tensor uses a 128x4 reordered layout (TensorReordering_t::F8_128x4). When the reordering type is set on the scale tensor, the frontend will automatically pad the inferred scale dimensions to align with the 128x4 block structure (non-batch, non-axis dimensions are padded to multiples of 128, and the quantize axis dimension is padded to multiples of 4).
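A sketch of the padding rule described above (an illustrative helper, not a cudnn-frontend API; the batch-axis bookkeeping is an assumption):

```python
def pad_scale_dims(dims, quantize_axis, batch_axes=(0,)):
    def round_up(v, m):
        return ((v + m - 1) // m) * m
    return [
        d if i in batch_axes
        else round_up(d, 4) if i == quantize_axis  # quantize axis -> multiple of 4
        else round_up(d, 128)                      # other dims -> multiple of 128
        for i, d in enumerate(dims)
    ]

# e.g. an inferred (1, 100, 3) scale shape, quantized along the last axis,
# pads to (1, 128, 4) to align with the 128x4 block structure:
assert pad_scale_dims([1, 100, 3], quantize_axis=2) == [1, 128, 4]
```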
General Improvements ✨✨
- Grouped GEMM APIs now default to dynamic MNKL compilation across GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to restore the previous M-only dynamic behavior (see the snippet below).
- Grouped GEMM wgrad wrapper APIs now support caller-provided output buffers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).
- Removed the unused internal `c_tensor` from the Grouped GEMM quant path.
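For example, to opt out from Python (setting the variable before the wrappers are first used; the timing requirement is an assumption, as environment toggles are typically read at initialization):

```python
import os

# Restore the previous M-only dynamic behavior for grouped GEMM wrappers.
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"
```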
Bug fix 🐛
- Fixed a Grouped GEMM GLU bias compilation issue for 64B-aligned inputs with dynamic MNKL.
- Fixed an issue with dropout on Blackwell when cuDNN frontend 1.21 is used with cuDNN backend 9.21 or 9.22.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements. Added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.
Acknowledgements:
- Thanks @haowen-han for fixing a bug in the block-scale matmul sample.
v1.22.1-release
cuDNN Frontend v1.22.1 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator wrapping cuDNN's MoE Grouped GEMM operation:

```python
def moe_grouped_matmul(
    token: torch.Tensor,
    weight: torch.Tensor,
    first_token_offset: torch.Tensor,
    token_index: Optional[torch.Tensor] = None,
    token_ks: Optional[torch.Tensor] = None,
    mode: str = "none",
    top_k: int = 1,
) -> torch.Tensor
```

See test/python/test_moe_grouped_matmul_op.py for usage.
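A hedged usage sketch; the tensor shapes below follow common MoE grouped-GEMM conventions and are assumptions, so treat test/python/test_moe_grouped_matmul_op.py as authoritative:

```python
import torch

num_experts, hidden, inter = 4, 512, 1024
tokens_per_expert = torch.tensor([8, 16, 4, 12])
num_tokens = int(tokens_per_expert.sum())

# Tokens for all experts packed contiguously; one weight matrix per expert.
token = torch.randn(num_tokens, hidden, device="cuda", dtype=torch.bfloat16)
weight = torch.randn(num_experts, hidden, inter, device="cuda", dtype=torch.bfloat16)
# Prefix-sum offsets marking where each expert's token block begins.
first_token_offset = torch.zeros(num_experts + 1, dtype=torch.int32, device="cuda")
first_token_offset[1:] = tokens_per_expert.cumsum(0)

out = moe_grouped_matmul(token, weight, first_token_offset)
```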
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
Open-Source Kernels 🚀 🚀
- Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL. Support is added through the torch op above or callable as a standalone API; see the samples for API usage. Requires `nvidia-cutlass-dsl[cu13]==4.4.1`.
Updates:
- `GroupedGemmWgradSm100` and `grouped_gemm_wgrad_wrapper_sm100` expose the grouped GEMM weight-gradient kernel. See grouped_gemm_wgrad.html for the API reference and moe_blockscaled_grouped_gemm_wgrad.py for samples.
Acknowledgements:
The Blackwell SDPA fprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.
v1.22.0-release
cuDNN Frontend v1.22.0 Release Notes
cuDNN Frontend v1.22.0 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀 🚀
- Introducing a PyTorch custom operator wrapping cuDNN's Scaled Dot-Product Attention (SDPA), with `scaled_dot_product_attention` as the public entry point, closely matching the signature of `torch.nn.functional.scaled_dot_product_attention`:

```python
def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    attn_mask: Optional[torch.Tensor] = None,
    dropout_p: float = 0.0,
    is_causal: bool = False,
    scale: Optional[float] = None,
    enable_gqa: bool = False,
    *,
    diagonal_alignment: int = 0,
    left_bound: int = -1,
    right_bound: int = -1,
    seq_len_q: Optional[torch.Tensor] = None,
    seq_len_kv: Optional[torch.Tensor] = None,
    cumulative_seq_len_q: Optional[torch.Tensor] = None,
    cumulative_seq_len_kv: Optional[torch.Tensor] = None,
) -> torch.Tensor:
```
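A brief usage sketch; the (batch, heads, seq, head_dim) layout mirrors the torch API and is an assumption here:

```python
import torch

q = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(2, 8, 1024, 128, device="cuda", dtype=torch.bfloat16)

# Drop-in call with causal masking, matching the torch-style signature above.
out = scaled_dot_product_attention(q, k, v, is_causal=True)
```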
- Introduce a preindexed execute method that reduces CPU execution overhead.
- Improve the reproducer tool to report and reproduce SDPA failures for fp8 data types as well.
- 🕒 We will be rolling out new native custom torch ops in upcoming releases – stay tuned! 😃
Open-Source Kernels 🚀 🚀
- Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL. Support is added through the torch op above or callable as a standalone API; see the samples for API usage. Requires `nvidia-cutlass-dsl[cu13]==4.4.1`.
- Grouped GEMM + quantize kernels now support dynamic shape and layout, controllable via an environment toggle.
- Grouped GEMM + GLU/SwiGLU now support optional bias fusion in both dense and discrete modes, including partial-N support and optional bias-gradient generation for discrete backward paths.
Updates:
- The fp8 datatype with packed variable sequences (THD) is no longer supported on the SM90 (Hopper) architecture.
- Fixed an issue where SDPA fp8 failed when used with CUDA Toolkit 12.9.
Acknowledgements:
The Blackwell SDPA bprop kernel supporting head dim = 256, written in CuTe DSL, was jointly developed by Shengbin Di, Yuxi Chi, and Linfeng Zheng in close collaboration with Alibaba. We would like to extend special thanks to the core contributors from Alibaba: Siyu Wang, Haoyan Huang, Lanbo Li, Yun Zhong, Man Yuan, Minmin Sun, Yong Li, and Wei Lin for their significant contributions to this work.
v1.21.0-release
cuDNN Frontend v1.21.0 Release Notes (#213)
cuDNN Frontend v1.21.0 is the recommended version for cuDNN 9.20.0 and later releases.
General Improvements 🚀
- Dropped dependency on the CUDA driver API for the frontend library, enabling builds without direct CUDA driver linkage.
Open-Source Kernels
Added new kernels for GEMM fusions:
- Grouped GEMM + GLU: Unified grouped GEMM GLU API supporting dense and discrete MoE weight layouts with optional bias.
- Grouped GEMM + dGLU: Unified grouped GEMM dGLU backward API supporting dense and discrete MoE weight layouts with optional bias.
- Discrete Grouped GEMM + SwiGLU: Per-expert-pointer SwiGLU grouped GEMM for MoE workloads without weight packing.
- Discrete Grouped GEMM + dSwiGLU: Per-expert-pointer dSwiGLU backward grouped GEMM for MoE workloads without weight packing. Uses the dSwiGLU/dGeGLU backward epilogue.
- Grouped GEMM + dSwiGLU: dSwiGLU activation fused with grouped GEMM.
- Grouped GEMM + Quant: Grouped GEMM with output quantization for MoE FC2/dFC1 workloads.
v1.20.0 release
cuDNN Frontend v1.20.0 is the recommended version for cuDNN 9.20.0 and later releases.
Open-Source Kernels 🚀 🚀
- Fused RMSNorm + SiLU: The Fused RMSNorm + SiLU engine implements a single-kernel fusion of RMS normalization followed by SiLU (Swish) activation. It is designed and optimized specifically for the WAN VAE decoder's L2Norm + SiLU pattern on B200, but supports arbitrary problem sizes on SM80 to SM103 GPUs.
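A minimal PyTorch reference of the fused pattern, written unfused for clarity (the engine computes this in a single kernel; the epsilon placement is an assumption):

```python
import torch
import torch.nn.functional as F

def rmsnorm_silu_reference(x, gamma, eps=1e-6):
    # RMS-normalize over the last dimension, scale, then apply SiLU (Swish).
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return F.silu(x * rms * gamma)
```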
Improvements:
- Allow `GEMM + Amax`, `GEMM + SwiGLU`, `Grouped GEMM + SwiGLU`, `Grouped GEMM + dSwiGLU`, and `NSA` kernels to run on GB300.
- Improve the reproducer tool to report and reproduce SDPA failures.
v1.19.1 release
cuDNN Frontend v1.19.1 Release
- Pinned the pybind version to prevent failures with older versions.
- Restored support for the CUDA 12 toolkit that was accidentally dropped in the 1.19.0 release.
v1.19.0-release
cuDNN Frontend v1.19.0 Release Notes
cuDNN Frontend v1.19.0 is the recommended version for cuDNN 9.19.1 and later releases.
Open-Source Kernels 🚀 🚀
- Blackwell and Hopper SDPA Fprop Kernels: cuDNN's SDPA Fprop implementation is now open source. This kernel supports causal masking and outputs stats for use in bprop. Additional kernels will be added in future releases.
- Grouped GEMM + dSwiGLU Fusion: A contiguous grouped block-scaled GEMM fused with a dSwiGLU backward epilogue on NVIDIA Blackwell GPUs (SM100+), designed for MoE (Mixture of Experts) workloads.
General Improvements 🚀
- Removed multiple device queries for SM version during graph validation, replacing them with a single query that can be skipped by setting `sm_version` on the cuDNN graph.
- Fixed an issue where enabling logging with CUDA graphs in certain scenarios would cause a crash.
- Significantly reduced the CPU overhead of the cuDNN OSS API by using tvm-ffi.
- Added a new cudnn-repro tool that builds a standalone reproducer from cuDNN frontend logs; see the documentation for details.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- Support Checks: Improved support checks for cleaner support surface queries.
- New API: Added Python bindings for score-mod bprop function to enable the score bprop API.
- Stats: Support independent generation of SDPA stats (LSE, SE, Max) in SDPA fprop (requires cuDNN 9.20.0 and up).
Normalization
- More Benchmarks: New normalization benchmark results posted for GB200, GB300, and H200.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.19.1
v1.18.0-release
cuDNN Frontend v1.18.0 Release Notes
cuDNN Frontend v1.18.0 is the recommended version for cuDNN 9.18.1 and later releases.
General Improvements 🚀
- Moved away from internally using the v0.x API; the cuDNN backend API is now called directly.
- Reduced execution overhead by caching repeated graph queries.
Open-Source Kernels
New open-source kernel for Grouped GEMM + SwiGLU fusion.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- New Features: Added support for dynamic shapes in fprop, which helps reduce graph rebuilding across different batch and sequence lengths.
- Support Surface:
  - Now allows deterministic bprop for SDPA.
  - Added support for bprop with ragged tensors on A100.
- More samples:
  - Open-sourced our SDPA test harness, showcasing additional testing for determinism and fp8 sizes for MLA.
  - Added samples showcasing chunked prefill.
Mixture of Experts (MoE)
- New API: Added support for `moe_grouped_matmul`. See the C++ sample and documentation for the API reference.
Matmul
- More samples: Open-sourced cuDNN's fuzz testing of matmuls.
Convolution
- More samples: Open-sourced cuDNN's fuzz testing of convolutions.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.18.1.
v1.17.0-release
cuDNN Frontend v1.17.0 Release Notes
cuDNN Frontend v1.17.0 is the recommended version for cuDNN 9.17.0 and later releases.
New Features 🚀
Open-Source Kernels
- Native Sparse Attention: The Native Sparse Attention (NSA) module implements native sparse attention as described in the paper "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention". Usage samples for the Blackwell architecture are in test/python/fe_api/nsa.
- Gemm/Swiglu: `Gemm_Swiglu` now supports block-scaled FP8/FP4 datatypes. API changes:
  - Output tensors have been renamed from "C" and "Glu" to "AB12" and "C", respectively.
  - The "use_2cta_intrs" option has been removed; it is now inferred automatically from the tile shape.
Enhancements ✨
Scaled Dot-Product Attention (SDPA)
- More samples: Open-sourcing our SDPA test harness and fp8 samples in test/python/test_sdpa_fp8.py.
Additional Improvements
- Tensor properties: Added vector Dim and vectorization count to the tensor properties.
- Graph wrapper: Fixed an issue in the native graph wrapper that caused `BufferError` with non-PyTorch tensors.
Benchmarking 📊
- Updated the benchmark results for the SDPA improvements added in cuDNN 9.17.0, including GB200 and GB300 data.
Samples
- **cuDNN Llama model**: Added a reference implementation of the Llama model entirely in cuDNN.
v1.16.1-release
What's Changed
- Find cudnn libraries with NAMES_PER_DIR for python site by @take-cheeze in #180
- Don't override if users provide max/sum_exp shape and stride #181
- Fix issues in warmup function leading to error in deserialize #183
New Contributors
- @take-cheeze made their first contribution in #180
Full Changelog: v1.16.0...v1.16.1