Port non-blocking synchronization from CUDA.jl #783

Open: maleadt wants to merge 2 commits into main from tb/nonblocking_sync
@maleadt
Member

maleadt commented May 26, 2026

Closes #532

maleadt and others added 2 commits May 26, 2026 21:59
Wait for command buffers via a completion handler that notifies the Julia
scheduler, instead of parking the calling thread in waitUntilCompleted.
Fixes task switches from command buffer callbacks (#532).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
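
The mechanism described in the commit message can be sketched in plain Julia as follows. This is a minimal illustration, not the code in this PR: `FakeCommandBuffer`, `on_completed!`, `commit!` and `wait_nonblocking` are hypothetical names, and the GPU work is simulated with a spawned task that sleeps. The only real APIs used are `Base.Event`, `notify`, `wait` and `Threads.@spawn`.

```julia
# Hypothetical stand-in for a command buffer; none of these names are Metal.jl API.
struct FakeCommandBuffer
    handlers::Vector{Function}   # callbacks to invoke on completion
    work_seconds::Float64        # simulated GPU execution time
end
FakeCommandBuffer(work_seconds::Real) = FakeCommandBuffer(Function[], work_seconds)

# Register a callback that runs once the simulated work has finished.
on_completed!(f, buf::FakeCommandBuffer) = push!(buf.handlers, f)

# "Commit" the buffer: the work runs elsewhere, then the handlers fire.
function commit!(buf::FakeCommandBuffer)
    Threads.@spawn begin
        sleep(buf.work_seconds)
        foreach(h -> h(), buf.handlers)
    end
    return buf
end

# Wait by suspending only the current task. The completion handler notifies
# the event, and the Julia scheduler resumes this task afterwards.
function wait_nonblocking(buf::FakeCommandBuffer)
    done = Base.Event()
    on_completed!(() -> notify(done), buf)
    commit!(buf)
    wait(done)   # yields to the scheduler instead of parking the OS thread
    return nothing
end

t0 = time()
wait_nonblocking(FakeCommandBuffer(0.05))
elapsed = time() - t0
```

In a real binding the handler would be invoked by the GPU runtime, possibly on a thread Julia does not manage, so the notification step would likely need more care than a direct `notify` call.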
@christiangnrd
Member

This essentially reverts #690, but if synchronization is faster I don't mind doing this until we want to support Metal 4.

@maleadt
Member Author

maleadt commented May 26, 2026

> if synchronization is faster I don't mind doing this until we want to support Metal 4.

It's not. The only reason is to avoid blocking the calling thread, since that can cause deadlocks when it is thread 0 and a callback from Metal also wants to do I/O (which in Julia can only happen on thread 0).

Well, it also enables running code during synchronization, but that's not the immediate goal here.
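
To illustrate the difference, here is a small sketch that does not touch Metal at all. It assumes `Libc.systemsleep` as a stand-in for a thread-blocking call such as `waitUntilCompleted`, and `sleep` as a stand-in for a wait that yields to the scheduler. `progress_during` is a hypothetical helper that counts how often a background task on the same thread gets to run while the wait is in progress.

```julia
# Count how many times a background task runs while `waiter` executes.
# `@async` tasks are sticky, so the background task shares the caller's thread.
function progress_during(waiter)
    ticks = Ref(0)
    stop = Ref(false)
    bg = @async while !stop[]
        ticks[] += 1
        sleep(0.01)          # stands in for I/O that needs this thread
    end
    yield()                  # let the background task start
    before = ticks[]
    waiter()
    during = ticks[] - before
    stop[] = true
    wait(bg)
    return during
end

# Blocking wait: the OS thread is parked and never returns to the scheduler.
blocked = progress_during(() -> Libc.systemsleep(0.2))

# Cooperative wait: the task is suspended and the thread keeps running other tasks.
cooperative = progress_during(() -> sleep(0.2))

println("ticks during blocking wait:    ", blocked)
println("ticks during cooperative wait: ", cooperative)
```

With the blocking wait the background task should make no progress, which is the situation that turns into a deadlock if the wait itself depends on that task. With the cooperative wait the background task keeps running.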

@codecov

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 78.43137% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.42%. Comparing base (08dd32c) to head (04a0ddf).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| lib/mps/ndarray.jl | 0.00% | 6 Missing ⚠️ |
| src/memory.jl | 66.66% | 2 Missing ⚠️ |
| src/synchronization.jl | 87.50% | 2 Missing ⚠️ |
| lib/mps/matrixrandom.jl | 93.33% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #783      +/-   ##
==========================================
- Coverage   80.84%   80.42%   -0.42%     
==========================================
  Files          63       64       +1     
  Lines        3017     3025       +8     
==========================================
- Hits         2439     2433       -6     
- Misses        578      592      +14     

Contributor

@github-actions github-actions Bot left a comment

Metal Benchmarks

Details
| Benchmark suite | Current: 04a0ddf | Previous: 08dd32c | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 1272729 ns | 1121812.5 ns | 1.13 |
| array/accumulate/Float32/dims=1 | 1650375 ns | 1540833 ns | 1.07 |
| array/accumulate/Float32/dims=1L | 9895666.5 ns | 9853833 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 1967000 ns | 1854104 ns | 1.06 |
| array/accumulate/Float32/dims=2L | 7336208.5 ns | 7215709 ns | 1.02 |
| array/accumulate/Int64/1d | 1350250 ns | 1235583 ns | 1.09 |
| array/accumulate/Int64/dims=1 | 1918208 ns | 1877791 ns | 1.02 |
| array/accumulate/Int64/dims=1L | 11800375 ns | 11684292 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 2255500 ns | 2154625 ns | 1.05 |
| array/accumulate/Int64/dims=2L | 9917688 ns | 9782354.5 ns | 1.01 |
| array/broadcast | 694583 ns | 623250 ns | 1.11 |
| array/construct | 5542 ns | 5500 ns | 1.01 |
| array/permutedims/2d | 1246541.5 ns | 1128916 ns | 1.10 |
| array/permutedims/3d | 1736833 ns | 1639771 ns | 1.06 |
| array/permutedims/4d | 2647667 ns | 2584062 ns | 1.02 |
| array/private/copy | 717291 ns | 580458.5 ns | 1.24 |
| array/private/copyto!/cpu_to_gpu | 943917 ns | 781145.5 ns | 1.21 |
| array/private/copyto!/gpu_to_cpu | 939375 ns | 782833 ns | 1.20 |
| array/private/copyto!/gpu_to_gpu | 782542 ns | 612666 ns | 1.28 |
| array/private/iteration/findall/bool | 1531500 ns | 1439229.5 ns | 1.06 |
| array/private/iteration/findall/int | 1650167 ns | 1538875 ns | 1.07 |
| array/private/iteration/findfirst/bool | 2147209 ns | 1956270.5 ns | 1.10 |
| array/private/iteration/findfirst/int | 2184833.5 ns | 2000541.5 ns | 1.09 |
| array/private/iteration/findmin/1d | 2500729.5 ns | 2099000 ns | 1.19 |
| array/private/iteration/findmin/2d | 1748791 ns | 1580958.5 ns | 1.11 |
| array/private/iteration/logical | 2750833 ns | 2524020.5 ns | 1.09 |
| array/private/iteration/scalar | 6317375 ns | 4844500 ns | 1.30 |
| array/random/rand/Float32 | 1215375 ns | 1117833 ns | 1.09 |
| array/random/rand/Int64 | 1352353.5 ns | 1256375 ns | 1.08 |
| array/random/rand!/Float32 | 1024917 ns | 878542 ns | 1.17 |
| array/random/rand!/Int64 | 930583 ns | 838416 ns | 1.11 |
| array/random/randn/Float32 | 1160000 ns | 1060750 ns | 1.09 |
| array/random/randn!/Float32 | 927437.5 ns | 837333 ns | 1.11 |
| array/reductions/mapreduce/Float32/1d | 1200666 ns | 1022124.5 ns | 1.17 |
| array/reductions/mapreduce/Float32/dims=1 | 954583 ns | 833333.5 ns | 1.15 |
| array/reductions/mapreduce/Float32/dims=1L | 1444270.5 ns | 1346542 ns | 1.07 |
| array/reductions/mapreduce/Float32/dims=2 | 954209 ns | 839062.5 ns | 1.14 |
| array/reductions/mapreduce/Float32/dims=2L | 1936979 ns | 1769791 ns | 1.09 |
| array/reductions/mapreduce/Int64/1d | 1539875 ns | 1556625 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1215834 ns | 1141250 ns | 1.07 |
| array/reductions/mapreduce/Int64/dims=1L | 2136750 ns | 2019375 ns | 1.06 |
| array/reductions/mapreduce/Int64/dims=2 | 1402500 ns | 1360750 ns | 1.03 |
| array/reductions/mapreduce/Int64/dims=2L | 4436874.5 ns | 4395958 ns | 1.01 |
| array/reductions/reduce/Float32/1d | 1199042 ns | 1014209 ns | 1.18 |
| array/reductions/reduce/Float32/dims=1 | 942583 ns | 841708 ns | 1.12 |
| array/reductions/reduce/Float32/dims=1L | 1456250 ns | 1356687 ns | 1.07 |
| array/reductions/reduce/Float32/dims=2 | 938062.5 ns | 827542 ns | 1.13 |
| array/reductions/reduce/Float32/dims=2L | 1928396 ns | 1774666 ns | 1.09 |
| array/reductions/reduce/Int64/1d | 1508542 ns | 1468125 ns | 1.03 |
| array/reductions/reduce/Int64/dims=1 | 1214417 ns | 1119291 ns | 1.08 |
| array/reductions/reduce/Int64/dims=1L | 2092417 ns | 2121167 ns | 0.99 |
| array/reductions/reduce/Int64/dims=2 | 1406500 ns | 1358750 ns | 1.04 |
| array/reductions/reduce/Int64/dims=2L | 4436875 ns | 4144291 ns | 1.07 |
| array/shared/copy | 314458 ns | 239520.5 ns | 1.31 |
| array/shared/copyto!/cpu_to_gpu | 115667 ns | 78541 ns | 1.47 |
| array/shared/copyto!/gpu_to_cpu | 107666 ns | 79083 ns | 1.36 |
| array/shared/copyto!/gpu_to_gpu | 107666 ns | 80041 ns | 1.35 |
| array/shared/iteration/findall/bool | 1526187 ns | 1449208 ns | 1.05 |
| array/shared/iteration/findall/int | 1666000 ns | 1528312.5 ns | 1.09 |
| array/shared/iteration/findfirst/bool | 1706917 ns | 1558750 ns | 1.10 |
| array/shared/iteration/findfirst/int | 1702583 ns | 1577500 ns | 1.08 |
| array/shared/iteration/findmin/1d | 2067625 ns | 1871458.5 ns | 1.10 |
| array/shared/iteration/findmin/2d | 1749395.5 ns | 1606250 ns | 1.09 |
| array/shared/iteration/logical | 2407291.5 ns | 2379063 ns | 1.01 |
| array/shared/iteration/scalar | 301333 ns | 183709 ns | 1.64 |
| integration/byval/reference | 1677104.5 ns | 1544334 ns | 1.09 |
| integration/byval/slices=1 | 1658313 ns | 1565125 ns | 1.06 |
| integration/byval/slices=2 | 2740979.5 ns | 2620312.5 ns | 1.05 |
| integration/byval/slices=3 | 7852708.5 ns | 8876292 ns | 0.88 |
| integration/metaldevrt | 991541 ns | 867708 ns | 1.14 |
| kernel/indexing | 745209 ns | 631125 ns | 1.18 |
| kernel/indexing_checked | 750500 ns | 641917 ns | 1.17 |
| kernel/launch | 11417 ns | 11458 ns | 1.00 |
| kernel/rand | 654542 ns | 589270.5 ns | 1.11 |
| latency/import | 1383343250 ns | 1381855687.5 ns | 1.00 |
| latency/precompile | 29348889041 ns | 29156740542 ns | 1.01 |
| latency/ttfp | 1649143959 ns | 1647092167 ns | 1.00 |
| metal/synchronization/context | 30708 ns | 18750 ns | 1.64 |
| metal/synchronization/stream | 30625 ns | 17722.333333333332 ns | 1.73 |

This comment was automatically generated by workflow using github-action-benchmark.

Development

Successfully merging this pull request may close these issues.

Support task switches from command buffer callbacks

2 participants