Conversation
|
Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:

diff --git a/lib/cudnn/src/util.jl b/lib/cudnn/src/util.jl
index 8923ff9b5..c7ec0c2bd 100644
--- a/lib/cudnn/src/util.jl
+++ b/lib/cudnn/src/util.jl
@@ -4,13 +4,13 @@ using BFloat16s: BFloat16
cptr(x,a::DenseCuArray{Float64})=Float64[x]
cptr(x,a::DenseCuArray{Float32})=Float32[x]
cptr(x,a::DenseCuArray{Float16})=Float32[x]
-cptr(x,a::DenseCuArray{BFloat16})=Float32[x]
+cptr(x, a::DenseCuArray{BFloat16}) = Float32[x]
# Conversion between Julia and cuDNN datatypes
cudnnDataType(::Type{Float16})=CUDNN_DATA_HALF
cudnnDataType(::Type{Float32})=CUDNN_DATA_FLOAT
cudnnDataType(::Type{Float64})=CUDNN_DATA_DOUBLE
-cudnnDataType(::Type{BFloat16})=CUDNN_DATA_BFLOAT16
+cudnnDataType(::Type{BFloat16}) = CUDNN_DATA_BFLOAT16
cudnnDataType(::Type{Int8}) = CUDNN_DATA_INT8
cudnnDataType(::Type{UInt8}) = CUDNN_DATA_UINT8
cudnnDataType(::Type{Int32}) = CUDNN_DATA_INT32
@@ -21,7 +21,7 @@ cudnnDataType(::Type{Int32}) = CUDNN_DATA_INT32
juliaDataType(a)=(a==CUDNN_DATA_HALF ? Float16 :
a==CUDNN_DATA_FLOAT ? Float32 :
a==CUDNN_DATA_DOUBLE ? Float64 :
- a==CUDNN_DATA_BFLOAT16 ? BFloat16 :
+ a == CUDNN_DATA_BFLOAT16 ? BFloat16 :
a==CUDNN_DATA_INT8 ? Int8 :
a==CUDNN_DATA_UINT8 ? UInt8 :
a==CUDNN_DATA_INT32 ? Int32 : error())
diff --git a/lib/cudnn/test/activation.jl b/lib/cudnn/test/activation.jl
index e25cf4c7c..33f76ed94 100644
--- a/lib/cudnn/test/activation.jl
+++ b/lib/cudnn/test/activation.jl
@@ -62,9 +62,9 @@ activationtest(alpha=2)
activationtest(beta=2)
if capability(device()) >= v"8.0"
- (ax,ay) = randn.(BFloat16, (10,10))
- (cx,cy) = CuArray.((ax,ay))
- activationtest(mode=CUDNN_ACTIVATION_SIGMOID)
- activationtest(mode=CUDNN_ACTIVATION_RELU)
- activationtest(mode=CUDNN_ACTIVATION_TANH)
+ (ax, ay) = randn.(BFloat16, (10, 10))
+ (cx, cy) = CuArray.((ax, ay))
+ activationtest(mode = CUDNN_ACTIVATION_SIGMOID)
+ activationtest(mode = CUDNN_ACTIVATION_RELU)
+ activationtest(mode = CUDNN_ACTIVATION_TANH)
end
diff --git a/lib/cudnn/test/softmax.jl b/lib/cudnn/test/softmax.jl
index 2102d6d02..74befeeae 100644
--- a/lib/cudnn/test/softmax.jl
+++ b/lib/cudnn/test/softmax.jl
@@ -46,8 +46,8 @@ softmaxtest(algo=CUDNN_SOFTMAX_ACCURATE)
softmaxtest(algo=CUDNN_SOFTMAX_LOG)
if capability(device()) >= v"8.0"
- ax,ay = randn(BFloat16,10,10),randn(BFloat16,10,10)
- cx,cy = CuArray.((ax,ay))
+ ax, ay = randn(BFloat16, 10, 10), randn(BFloat16, 10, 10)
+ cx, cy = CuArray.((ax, ay))
softmaxtest()
- softmaxtest(algo=CUDNN_SOFTMAX_LOG)
+ softmaxtest(algo = CUDNN_SOFTMAX_LOG)
end
|
Hm, duplicate of #1092? That one doesn't define the
|
1.12 failure unrelated, retried CI.
|
Second CI failure also seems unrelated, rerunning. If that succeeds you should rebase on top of master.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #2987       +/-   ##
===========================================
+ Coverage   76.53%   89.22%   +12.68%
===========================================
  Files         148      148
  Lines       12860    12950       +90
===========================================
+ Hits         9842    11554     +1712
+ Misses       3018     1396     -1622
|
Force-pushed from 7f8d47b to 6d2bea8.
|
Thanks, Katharine. I’ve updated the BFloat16s compat entry to align with CUDA.jl’s.
CUDA.jl Benchmarks
| Benchmark suite | Current: a23ee46 | Previous: 5d9474a | Ratio |
|---|---|---|---|
| latency/precompile | 55341296562.5 ns | 55510377029.5 ns | 1.00 |
| latency/ttfp | 7795931886 ns | 7790703567 ns | 1.00 |
| latency/import | 4120273995 ns | 4122189304 ns | 1.00 |
| integration/volumerhs | 9609087 ns | 9624973 ns | 1.00 |
| integration/byval/slices=1 | 146813 ns | 147064 ns | 1.00 |
| integration/byval/slices=3 | 425709 ns | 425893 ns | 1.00 |
| integration/byval/reference | 144869 ns | 145082 ns | 1.00 |
| integration/byval/slices=2 | 286216 ns | 286384 ns | 1.00 |
| integration/cudadevrt | 103582 ns | 103602 ns | 1.00 |
| kernel/indexing | 14109 ns | 14225 ns | 0.99 |
| kernel/indexing_checked | 14715 ns | 14969 ns | 0.98 |
| kernel/occupancy | 719.9230769230769 ns | 732.5227272727273 ns | 0.98 |
| kernel/launch | 2487 ns | 2249.4444444444443 ns | 1.11 |
| kernel/rand | 14840 ns | 18642 ns | 0.80 |
| array/reverse/1d | 19661 ns | 19990 ns | 0.98 |
| array/reverse/2dL_inplace | 66712 ns | 66917 ns | 1.00 |
| array/reverse/1dL | 69754 ns | 70158 ns | 0.99 |
| array/reverse/2d | 21612 ns | 21954 ns | 0.98 |
| array/reverse/1d_inplace | 9523 ns | 9677 ns | 0.98 |
| array/reverse/2d_inplace | 10951 ns | 11077 ns | 0.99 |
| array/reverse/2dL | 73712 ns | 74051.5 ns | 1.00 |
| array/reverse/1dL_inplace | 66734 ns | 66880 ns | 1.00 |
| array/copy | 20406 ns | 20660 ns | 0.99 |
| array/iteration/findall/int | 156259 ns | 158373 ns | 0.99 |
| array/iteration/findall/bool | 139121.5 ns | 140139 ns | 0.99 |
| array/iteration/findfirst/int | 160318 ns | 161271 ns | 0.99 |
| array/iteration/findfirst/bool | 160750.5 ns | 162049 ns | 0.99 |
| array/iteration/scalar | 71772 ns | 72812.5 ns | 0.99 |
| array/iteration/logical | 214982 ns | 216894.5 ns | 0.99 |
| array/iteration/findmin/1d | 51356 ns | 50981 ns | 1.01 |
| array/iteration/findmin/2d | 95639 ns | 96704 ns | 0.99 |
| array/reductions/reduce/Int64/1d | 42833 ns | 43491 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1 | 44548.5 ns | 52642.5 ns | 0.85 |
| array/reductions/reduce/Int64/dims=2 | 61336 ns | 61484 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 88872 ns | 88879 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 87447 ns | 87977 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 35926 ns | 37248.5 ns | 0.96 |
| array/reductions/reduce/Float32/dims=1 | 47245 ns | 43278 ns | 1.09 |
| array/reductions/reduce/Float32/dims=2 | 59452 ns | 60066 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1L | 52057 ns | 52282 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 71472 ns | 72365.5 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42652 ns | 43561 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=1 | 44200 ns | 44306 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 61326 ns | 61482 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 88603 ns | 89001 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 87623 ns | 88320 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 36241 ns | 38092.5 ns | 0.95 |
| array/reductions/mapreduce/Float32/dims=1 | 45113 ns | 41962 ns | 1.08 |
| array/reductions/mapreduce/Float32/dims=2 | 59389 ns | 60039 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=1L | 52306 ns | 52636 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 71604 ns | 72310 ns | 0.99 |
| array/broadcast | 19787 ns | 20127 ns | 0.98 |
| array/copyto!/gpu_to_gpu | 12685 ns | 12738 ns | 1.00 |
| array/copyto!/cpu_to_gpu | 213773 ns | 217857 ns | 0.98 |
| array/copyto!/gpu_to_cpu | 286784 ns | 287088 ns | 1.00 |
| array/accumulate/Int64/1d | 124050 ns | 124778 ns | 0.99 |
| array/accumulate/Int64/dims=1 | 83524.5 ns | 83708 ns | 1.00 |
| array/accumulate/Int64/dims=2 | 157928 ns | 158367 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1709757 ns | 1710164 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 965855 ns | 967254 ns | 1.00 |
| array/accumulate/Float32/1d | 108765 ns | 109314 ns | 0.99 |
| array/accumulate/Float32/dims=1 | 80132 ns | 80184 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 147436 ns | 147922 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1618684 ns | 1618786 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 697887 ns | 698724 ns | 1.00 |
| array/construct | 1307.2 ns | 1295.5 ns | 1.01 |
| array/random/randn/Float32 | 47414.5 ns | 47861 ns | 0.99 |
| array/random/randn!/Float32 | 24572 ns | 24875 ns | 0.99 |
| array/random/rand!/Int64 | 27209 ns | 27408 ns | 0.99 |
| array/random/rand!/Float32 | 8733.333333333334 ns | 8909.666666666666 ns | 0.98 |
| array/random/rand/Int64 | 38056 ns | 30055 ns | 1.27 |
| array/random/rand/Float32 | 13097 ns | 13184 ns | 0.99 |
| array/permutedims/4d | 55080.5 ns | 55109 ns | 1.00 |
| array/permutedims/2d | 53606 ns | 53832 ns | 1.00 |
| array/permutedims/3d | 54539 ns | 54841 ns | 0.99 |
| array/sorting/1d | 2757516.5 ns | 2757534 ns | 1.00 |
| array/sorting/by | 3344110 ns | 3344541 ns | 1.00 |
| array/sorting/2d | 1080569 ns | 1081521 ns | 1.00 |
| cuda/synchronization/stream/auto | 1007 ns | 1036.5 ns | 0.97 |
| cuda/synchronization/stream/nonblocking | 7651.9 ns | 7410.8 ns | 1.03 |
| cuda/synchronization/stream/blocking | 808.6 ns | 820.6336633663366 ns | 0.99 |
| cuda/synchronization/context/auto | 1154.2 ns | 1154.3 ns | 1.00 |
| cuda/synchronization/context/nonblocking | 7125.8 ns | 7124.4 ns | 1.00 |
| cuda/synchronization/context/blocking | 936.6078431372549 ns | 887.4107142857143 ns | 1.06 |
This comment was automatically generated by workflow using github-action-benchmark.
|
The cuDNN run on CUDA 13 appears to fail due to running on SM75.
|
rebase on master?
Force-pushed from e7e97ca to 6359bf0.
|
Done!
Force-pushed from 213affc to 2931a8f.
|
CI failures look related.
|
Hi Tim. The test outcome seems to depend on compute capability / architecture: the cuDNN tests on CUDA 13.0 ran on an A100 and passed, but the cuDNN tests on CUDA 12.0 happened to run on a Turing GPU and failed. See also my earlier comment.

Should the tests be conditionally skipped instead? Is there a way to require a certain compute capability for e.g. cuDNN?
|
Yes, the tests should take care not to cover unsupported code paths, e.g. by inspecting capability(device()).
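A minimal sketch of that kind of guard, assuming the standard CUDA.jl capability(device()) query and a hypothetical bfloat16_tests() helper wrapping the new test cases:

```julia
using CUDA

# BFloat16 support in cuDNN requires Ampere (sm_80) or newer, so skip
# these tests entirely on older GPUs such as SM75.
if capability(device()) >= v"8.0"
    bfloat16_tests()   # hypothetical helper running the BFloat16 test cases
end
```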
|
I put the tests behind a conditional block, so they should pass on SM75 now.
This PR defines methods for making cuDNN work with BFloat16s.BFloat16.

In the following example, I show how the new methods fix the BFloat16 backward pass of Flux.logitcrossentropy:

Before
Note: Core.BFloat16 === BFloat16s.BFloat16, but I didn't explicitly import it in this REPL session.
After defining cudnnDataType(::Type{BFloat16})
After defining scalingParameter(::Type{BFloat16}, val)
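A rough sketch of the call being exercised in the Before/After sessions above (the sizes and label setup are illustrative, not taken from the original REPL transcript):

```julia
using CUDA, cuDNN, Flux
using BFloat16s: BFloat16

# BFloat16 logits on the GPU and one-hot targets (sizes are illustrative).
ŷ = CuArray(randn(BFloat16, 10, 32))
y = CuArray(Flux.onehotbatch(rand(1:10, 32), 1:10))

# Without the new methods the pullback reaches cuDNN with no BFloat16
# data-type mapping and errors; with cudnnDataType(::Type{BFloat16}) and
# scalingParameter(::Type{BFloat16}, val) defined, the gradient computes.
grads = Flux.gradient(ŷ -> Flux.logitcrossentropy(ŷ, y), ŷ)
```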
I also define a cptr method for consistency, but it appears that function isn't used anywhere.

Tests are added for softmax, activations, and pooling. I initially also tested convolutions, normalization, RNNs, and MHA, but they don't appear to support BFloat16.
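Roughly, the added tests exercise calls like the sketch below; it assumes cuDNN.jl's high-level cudnnSoftmaxForward / cudnnActivationForward entry points and an Ampere-class (sm_80+) GPU.

```julia
using CUDA
using cuDNN: cudnnSoftmaxForward, cudnnActivationForward, CUDNN_ACTIVATION_RELU
using BFloat16s: BFloat16

if capability(device()) >= v"8.0"            # BFloat16 needs sm_80 or newer
    x = CuArray(randn(BFloat16, 10, 10))
    y = cudnnSoftmaxForward(x)                                   # softmax over a BFloat16 tensor
    z = cudnnActivationForward(x; mode = CUDNN_ACTIVATION_RELU)  # elementwise ReLU via cuDNN
end
```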
Along with my proposed fix in FluxML/Optimisers.jl#215, this has allowed me to train LLMs in BFloat16 with Flux.jl in Julia v1.12. I am still tinkering with Optimisers.jl, but these together would be a significant unlock for my lab.