Port non-blocking synchronization from CUDA.jl #783
Wait for command buffers via a completion handler that notifies the Julia scheduler, instead of parking the calling thread in waitUntilCompleted. Fixes task switches from command buffer callbacks (#532).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

This essentially reverts #690, but if synchronization is faster I don't mind doing this until we want to support Metal 4.

It's not faster. The only reason is to keep the calling thread from blocking, since that can cause deadlocks when it's thread 0 and a callback from Metal also wants to do I/O (which in Julia can only happen on thread 0). It also enables running other code during synchronization, but that's not the immediate goal here.
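
To make the difference concrete, here is a minimal sketch of the pattern in plain Julia. It is not the code from this PR: `FakeCommandBuffer`, `add_completed_handler!`, `commit!`, and `nonblocking_wait` are hypothetical stand-ins, and the "GPU" is simulated with a sleeping task. The point is only that waiting on an event suspends the current task, so the scheduler can keep running other tasks on the same thread.

```julia
# Hypothetical sketch: none of these names are Metal.jl API.
struct FakeCommandBuffer
    handlers::Vector{Function}
    FakeCommandBuffer() = new(Function[])
end

# Register a callback to run once the (simulated) work has completed.
add_completed_handler!(buf::FakeCommandBuffer, f::Function) = push!(buf.handlers, f)

# Simulate the GPU finishing after `seconds`, then invoking the handlers.
function commit!(buf::FakeCommandBuffer; seconds=0.1)
    @async begin
        sleep(seconds)
        foreach(h -> h(), buf.handlers)
    end
    return buf
end

# Non-blocking wait: only the current task is suspended, so the Julia
# scheduler stays free to run other tasks on this thread.
function nonblocking_wait(buf::FakeCommandBuffer)
    done = Base.Event()
    add_completed_handler!(buf, () -> notify(done))
    commit!(buf)
    wait(done)
    return nothing
end

# Another task makes progress while we are "synchronizing".
function demo()
    ticks = Ref(0)
    @async begin
        for _ in 1:3
            ticks[] += 1
            yield()
        end
    end
    nonblocking_wait(FakeCommandBuffer())
    return ticks[]  # ticks counted while the wait was in progress
end
```

One caveat on the sketch: here the handler runs on a Julia task, so calling `notify` directly is fine. A real Metal completion handler runs on a thread Julia does not own, so the notification presumably has to go through something that is safe to signal from foreign threads, such as `Base.AsyncCondition`.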

Codecov Report

❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| | main | #783 | +/- |
|---|---|---|---|
| Coverage | 80.84% | 80.42% | -0.42% |
| Files | 63 | 64 | +1 |
| Lines | 3017 | 3025 | +8 |
| Hits | 2439 | 2433 | -6 |
| Misses | 578 | 592 | +14 |

Metal Benchmarks
| Benchmark suite | Current: 04a0ddf | Previous: 08dd32c | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 1272729 ns | 1121812.5 ns | 1.13 |
| array/accumulate/Float32/dims=1 | 1650375 ns | 1540833 ns | 1.07 |
| array/accumulate/Float32/dims=1L | 9895666.5 ns | 9853833 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 1967000 ns | 1854104 ns | 1.06 |
| array/accumulate/Float32/dims=2L | 7336208.5 ns | 7215709 ns | 1.02 |
| array/accumulate/Int64/1d | 1350250 ns | 1235583 ns | 1.09 |
| array/accumulate/Int64/dims=1 | 1918208 ns | 1877791 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 11800375 ns | 11684292 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 2255500 ns | 2154625 ns | 1.05 |
| array/accumulate/Int64/dims=2L | 9917688 ns | 9782354.5 ns | 1.01 |
| array/broadcast | 694583 ns | 623250 ns | 1.11 |
| array/construct | 5542 ns | 5500 ns | 1.01 |
| array/permutedims/2d | 1246541.5 ns | 1128916 ns | 1.10 |
| array/permutedims/3d | 1736833 ns | 1639771 ns | 1.06 |
| array/permutedims/4d | 2647667 ns | 2584062 ns | 1.02 |
| array/private/copy | 717291 ns | 580458.5 ns | 1.24 |
| array/private/copyto!/cpu_to_gpu | 943917 ns | 781145.5 ns | 1.21 |
| array/private/copyto!/gpu_to_cpu | 939375 ns | 782833 ns | 1.20 |
| array/private/copyto!/gpu_to_gpu | 782542 ns | 612666 ns | 1.28 |
| array/private/iteration/findall/bool | 1531500 ns | 1439229.5 ns | 1.06 |
| array/private/iteration/findall/int | 1650167 ns | 1538875 ns | 1.07 |
| array/private/iteration/findfirst/bool | 2147209 ns | 1956270.5 ns | 1.10 |
| array/private/iteration/findfirst/int | 2184833.5 ns | 2000541.5 ns | 1.09 |
| array/private/iteration/findmin/1d | 2500729.5 ns | 2099000 ns | 1.19 |
| array/private/iteration/findmin/2d | 1748791 ns | 1580958.5 ns | 1.11 |
| array/private/iteration/logical | 2750833 ns | 2524020.5 ns | 1.09 |
| array/private/iteration/scalar | 6317375 ns | 4844500 ns | 1.30 |
| array/random/rand/Float32 | 1215375 ns | 1117833 ns | 1.09 |
| array/random/rand/Int64 | 1352353.5 ns | 1256375 ns | 1.08 |
| array/random/rand!/Float32 | 1024917 ns | 878542 ns | 1.17 |
| array/random/rand!/Int64 | 930583 ns | 838416 ns | 1.11 |
| array/random/randn/Float32 | 1160000 ns | 1060750 ns | 1.09 |
| array/random/randn!/Float32 | 927437.5 ns | 837333 ns | 1.11 |
| array/reductions/mapreduce/Float32/1d | 1200666 ns | 1022124.5 ns | 1.17 |
| array/reductions/mapreduce/Float32/dims=1 | 954583 ns | 833333.5 ns | 1.15 |
| array/reductions/mapreduce/Float32/dims=1L | 1444270.5 ns | 1346542 ns | 1.07 |
| array/reductions/mapreduce/Float32/dims=2 | 954209 ns | 839062.5 ns | 1.14 |
| array/reductions/mapreduce/Float32/dims=2L | 1936979 ns | 1769791 ns | 1.09 |
| array/reductions/mapreduce/Int64/1d | 1539875 ns | 1556625 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1215834 ns | 1141250 ns | 1.07 |
| array/reductions/mapreduce/Int64/dims=1L | 2136750 ns | 2019375 ns | 1.06 |
| array/reductions/mapreduce/Int64/dims=2 | 1402500 ns | 1360750 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=2L | 4436874.5 ns | 4395958 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 1199042 ns | 1014209 ns | 1.18 |
| array/reductions/reduce/Float32/dims=1 | 942583 ns | 841708 ns | 1.12 |
| array/reductions/reduce/Float32/dims=1L | 1456250 ns | 1356687 ns | 1.07 |
| array/reductions/reduce/Float32/dims=2 | 938062.5 ns | 827542 ns | 1.13 |
| array/reductions/reduce/Float32/dims=2L | 1928396 ns | 1774666 ns | 1.09 |
| array/reductions/reduce/Int64/1d | 1508542 ns | 1468125 ns | 1.03 |
| array/reductions/reduce/Int64/dims=1 | 1214417 ns | 1119291 ns | 1.08 |
| array/reductions/reduce/Int64/dims=1L | 2092417 ns | 2121167 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 1406500 ns | 1358750 ns | 1.04 |
| array/reductions/reduce/Int64/dims=2L | 4436875 ns | 4144291 ns | 1.07 |
| array/shared/copy | 314458 ns | 239520.5 ns | 1.31 |
| array/shared/copyto!/cpu_to_gpu | 115667 ns | 78541 ns | 1.47 |
| array/shared/copyto!/gpu_to_cpu | 107666 ns | 79083 ns | 1.36 |
| array/shared/copyto!/gpu_to_gpu | 107666 ns | 80041 ns | 1.35 |
| array/shared/iteration/findall/bool | 1526187 ns | 1449208 ns | 1.05 |
| array/shared/iteration/findall/int | 1666000 ns | 1528312.5 ns | 1.09 |
| array/shared/iteration/findfirst/bool | 1706917 ns | 1558750 ns | 1.10 |
| array/shared/iteration/findfirst/int | 1702583 ns | 1577500 ns | 1.08 |
| array/shared/iteration/findmin/1d | 2067625 ns | 1871458.5 ns | 1.10 |
| array/shared/iteration/findmin/2d | 1749395.5 ns | 1606250 ns | 1.09 |
| array/shared/iteration/logical | 2407291.5 ns | 2379063 ns | 1.01 |
| array/shared/iteration/scalar | 301333 ns | 183709 ns | 1.64 |
| integration/byval/reference | 1677104.5 ns | 1544334 ns | 1.09 |
| integration/byval/slices=1 | 1658313 ns | 1565125 ns | 1.06 |
| integration/byval/slices=2 | 2740979.5 ns | 2620312.5 ns | 1.05 |
| integration/byval/slices=3 | 7852708.5 ns | 8876292 ns | 0.88 |
| integration/metaldevrt | 991541 ns | 867708 ns | 1.14 |
| kernel/indexing | 745209 ns | 631125 ns | 1.18 |
| kernel/indexing_checked | 750500 ns | 641917 ns | 1.17 |
| kernel/launch | 11417 ns | 11458 ns | 1.00 |
| kernel/rand | 654542 ns | 589270.5 ns | 1.11 |
| latency/import | 1383343250 ns | 1381855687.5 ns | 1.00 |
| latency/precompile | 29348889041 ns | 29156740542 ns | 1.01 |
| latency/ttfp | 1649143959 ns | 1647092167 ns | 1.00 |
| metal/synchronization/context | 30708 ns | 18750 ns | 1.64 |
| metal/synchronization/stream | 30625 ns | 17722.333333333332 ns | 1.73 |

This comment was automatically generated by workflow using github-action-benchmark.
Closes #532