Current status: Post-Phase 5 — 177/177 tests passing
Phase 4 is fully complete. Phase 5 performance work is complete. Intentional non-goals per §2.2 (CUDA Graphs, dynamic parallelism, texture objects, multi-GPU, graphics interop) remain deferred to v2.
Phase 5 items implemented:
- `metal_backend::launch_kernel_timed()` — synchronous kernel launch that captures `MTLCommandBuffer.GPUStartTime`/`GPUEndTime` for precise GPU-execution-time measurement.
- `metal_backend::GpuTimingResult` — GPU start/end in CFTimeInterval (seconds), with a `duration_ms()` helper.
- `tools/cumetal_bench/bench_kernels.metal` — native Metal MSL baseline kernels: `vector_add`, `saxpy` (memory-bound SAXPY with scalar alpha as a 1-element buffer), `reduce_f32` (tree reduction using threadgroup shared memory, one partial sum per threadgroup).
- `tools/cumetal_bench/main.cpp` — rewritten multi-kernel Phase 5 benchmark:
  - Supports `--all-kernels` to sweep vector_add, saxpy, and reduce_f32.
  - Reports native GPU time (from `MTLCommandBuffer` timestamps) and wall-clock time for both paths; the ratio uses wall-clock (apples-to-apples: both paths synchronize per iteration).
  - Prints a tabular comparison: kernel | elements | native_gpu_ms | native_wall_ms | cumetal_wall_ms | ratio | PASS/FAIL.
  - `--max-ratio <x>` enforces the spec §5.7 / §10.6 gate (Phase 5 target: ≤ 2.0×).
  - Measured ratios on Apple Silicon: vector_add 0.74×, saxpy 0.98×, reduce_f32 1.00×.
- `scripts/generate_bench_metallib.sh` — compiles `bench_kernels.metal` to `bench_kernels.metallib` via `xcrun metal` + `xcrun metallib`; exits 77 if the toolchain is unavailable (CTest skip).
- `scripts/run_bench_phase5.sh` — end-to-end Phase 5 gate script: generates the metallib, then runs `cumetal_bench --all-kernels --max-ratio 2.0`.
- `bench_phase5_all_kernels` CTest — registered in CMakeLists.txt (APPLE only, SKIP_RETURN_CODE 77); enforces the 2× ceiling defined in spec §5.7.
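The pass/fail logic the `--max-ratio` gate applies can be sketched as follows. This is an illustrative Python model (function and argument names are hypothetical), not the C++ implementation in `tools/cumetal_bench/main.cpp`:

```python
def gate(results, max_ratio=2.0):
    """Sketch of the --max-ratio check.

    results: list of (kernel, cumetal_wall_ms, native_wall_ms) tuples.
    Returns the list of kernels whose wall-clock ratio exceeds the ceiling.
    """
    failures = []
    for kernel, cumetal_ms, native_ms in results:
        # Wall-clock vs wall-clock: both paths synchronize per iteration.
        ratio = cumetal_ms / native_ms
        status = "PASS" if ratio <= max_ratio else "FAIL"
        print(f"{kernel:12s} ratio={ratio:.2f}x {status}")
        if ratio > max_ratio:
            failures.append(kernel)
    return failures
```

With the measured Phase 5 ratios (0.74×, 0.98×, 1.00×), every kernel clears the 2.0× ceiling and the failure list is empty.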
Post-Phase 5 work completed:
- MTLHeap auto-threshold: MTLHeap sub-allocation is now auto-enabled for allocations ≥ 4 MiB (configurable via `CUMETAL_MTLHEAP_THRESHOLD_BYTES`). Three modes:
  - `CUMETAL_MTLHEAP_ALLOC` unset → auto (heap for size ≥ threshold, default 4 MiB)
  - `CUMETAL_MTLHEAP_ALLOC=1` → always use heap
  - `CUMETAL_MTLHEAP_ALLOC=0` → never use heap
  Tests: `functional_runtime_heap_auto_threshold`, `functional_runtime_heap_disabled`.
- Binary shim JIT cache: Registration-path PTX→metallib compilations are now cached persistently at `$CUMETAL_CACHE_DIR/registration-jit/<hash>.metallib` (default: `$HOME/Library/Caches/io.cumetal/registration-jit/`), keyed by an FNV-1a-64 hash of `ptx_source + kernel_name`. Persistent cache files survive `__cudaUnregisterFatBinary` and process restart — a second registration of the same kernel skips xcrun. Test: `functional_runtime_registration_jit_cache`.
- `CUMETAL_DEBUG_REGISTRATION=1` — opt-in stderr trace for binary-shim diagnostics: logs fatbinary format detection, JIT compile path (Metal vs LLVM IR lowering), cache hits/misses, arg-count inference, and kernel/symbol registration events.
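The cache key is a standard FNV-1a 64-bit hash. A minimal sketch, assuming the hash of the concatenated `ptx_source + kernel_name` is rendered as lowercase hex for the filename (the exact filename rendering and the `cache_path` helper are assumptions; the FNV-1a constants are standard):

```python
def fnv1a64(data: bytes) -> int:
    """FNV-1a 64-bit hash."""
    h = 0xcbf29ce484222325                 # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b                             # xor byte, then multiply
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF   # FNV prime, wrap to 64 bits
    return h

def cache_path(ptx_source: str, kernel_name: str, cache_dir: str) -> str:
    # Hypothetical helper: build the <hash>.metallib cache filename.
    key = fnv1a64((ptx_source + kernel_name).encode())
    return f"{cache_dir}/registration-jit/{key:016x}.metallib"
```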
Post-Phase 5 work completed (continued, part 2):
- cuBLAS extended APIs (`runtime/rt/cublas.cpp`, `runtime/api/cublas_v2.h`): Added `cudaDataType_t`, `cublasDiagType_t`, `cublasSideMode_t`, `cublasGemmAlgo_t` enums. New functions:
  - `cublasGemmEx` — extended GEMM: routes CUDA_R_32F → cublasSgemm, CUDA_R_64F → cublasDgemm, FP16/mixed types via a scalar upconvert loop.
  - `cublasGemmStridedBatchedEx` — batched strided GemmEx; routes fp32/fp64 to typed variants.
  - `cublasHgemm` — half-precision GEMM via upconvert to float through GemmEx.
  - `cublasSgemmBatched`/`cublasDgemmBatched` — array-of-pointers batched GEMM.
  - `cublasStrsm`/`cublasDtrsm` — triangular solve (BLAS3); supports LEFT/RIGHT side, UPPER/LOWER fill, N/T/C transpose, UNIT/NON_UNIT diagonal, alpha scaling.
  - `cublasSetVector`/`cublasGetVector`/`cublasSetMatrix`/`cublasGetMatrix` — strided host↔device copy helpers (no-op overhead on Apple Silicon UMA).
  - Async variants (`*Async`) alias to their synchronous counterparts (stream ignored; UMA).
  Test: `functional_cublas_extended_api`.
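For reference, the column-major GEMM semantics that `cublasGemmEx` routes to for CUDA_R_32F can be sketched in Python. This is a naive reference for the math only, not the MPS-backed implementation:

```python
def sgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc):
    """Column-major reference GEMM: C = alpha * op(A) * op(B) + beta * C.

    A, B, C are flat lists in column-major order; transa/transb are 'N' or 'T'.
    """
    # op(A)[i][p]: element access depends on the transpose flag.
    a = (lambda i, p: A[p * lda + i]) if transa == 'N' else (lambda i, p: A[i * lda + p])
    b = (lambda p, j: B[j * ldb + p]) if transb == 'N' else (lambda p, j: B[p * ldb + j])
    for j in range(n):
        for i in range(m):
            acc = sum(a(i, p) * b(p, j) for p in range(k))
            C[j * ldc + i] = alpha * acc + beta * C[j * ldc + i]
```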
- Miscellaneous extended APIs (`runtime/api/`, `runtime/rt/`, `runtime/driver/`): Fills remaining API gaps identified in the post-Phase-5 survey.
  - cuRAND: `curandGeneratePoisson(generator, ptr, n, lambda)` — Poisson-distributed uint32 via `std::poisson_distribution`; `curandGetProperty(type, value)` returning major/minor/patch version (mirrors the CUDA `libraryPropertyType` enum).
  - cuBLAS: `cublasGetStatusName(status)` — returns the enum-name string (e.g. `"CUBLAS_STATUS_SUCCESS"`); `cublasGetStatusString(status)` — returns a human-readable description.
  - cuFFT: `cufftSetWorkArea(plan, workArea)` — no-op stub (vDSP manages its own scratch on UMA); `cufftEstimate1d/2d/3d/Many` — returns a conservative upper-bound scratch-size estimate without building a full plan.
  - 3D pitched memory (`cuda_runtime.h`/`cuda.h`): Added types `cudaExtent`, `cudaPitchedPtr`, `cudaPos`, `cudaMemcpy3DParms` (with C++ `make_*` helpers) and opaque `cudaArray_t`. New runtime APIs: `cudaMalloc3D(pitchedDevPtr, extent)` — allocates pitch×height×depth bytes, pitch aligned to 512 bytes; `cudaMemcpy3D(parms)`/`cudaMemcpy3DAsync(parms, stream)` — 3D pitched copy (plane-by-row stride walk; stream ignored on UMA).
  - Driver API 3D copy (`cuda.h`/`cuda_driver.cpp`): Added the `CUmemorytype` enum, the `CUarray` opaque typedef, the `CUDA_MEMCPY3D` struct, and `cuMemcpy3D(pCopy)`/`cuMemcpy3DAsync(pCopy, hStream)` — 3D strided copy resolving host/device pointers from `CUmemorytype` (UMA: both are host-accessible).
  Test: `functional_misc_extended_api` (6 sub-tests covering all new APIs).
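The plane-by-row stride walk behind `cudaMemcpy3D` can be modeled in a few lines of Python. `memcpy3d` is a hypothetical name, and the real implementation also honors the `cudaPos` source/destination offsets:

```python
def memcpy3d(dst, dst_pitch, dst_height, src, src_pitch, src_height,
             width_bytes, height_rows, depth):
    """Copy a width x height x depth region between two pitched buffers.

    Each z-plane occupies pitch * height bytes; each row starts at a
    pitch-aligned offset, so only width_bytes of each row are touched.
    """
    for z in range(depth):
        for y in range(height_rows):
            s = z * src_pitch * src_height + y * src_pitch
            d = z * dst_pitch * dst_height + y * dst_pitch
            dst[d:d + width_bytes] = src[s:s + width_bytes]
```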
- Extended APIs batch 2 (`runtime/api/`, `runtime/rt/`, `runtime/driver/`):
  - cuRAND: `curandCreateGeneratorHost` — on Apple Silicon UMA host=device, so it aliases `curandCreateGenerator` (no separate host/device distinction needed).
  - cuBLAS: `cublasGetProperty(type, value)` — returns the cuBLAS version (major/minor/patch) via the `libraryPropertyType` enum (same guard as `curand.h` to prevent double definition). Symmetric BLAS:
    - `cublasSsyr`/`cublasDsyr` — symmetric rank-1 update: `A += alpha * x * x^T` (column-major, only the upper or lower triangle updated).
    - `cublasSsyrk`/`cublasDsyrk` — symmetric rank-k update: `C = alpha * op(A) * op(A)^T + beta * C`.
    - `cublasSsyr2k`/`cublasDsyr2k` — symmetric rank-2k update: `C = alpha * (op(A)*op(B)^T + op(B)*op(A)^T) + beta * C`.
  - Driver API:
    - `cuFuncSetAttribute` — no-op (Metal manages occupancy automatically).
    - `cuOccupancyMaxActiveBlocksPerMultiprocessorWithFlags` — delegates to the base function; flags ignored.
    - `cuCtxPushCurrent`/`cuCtxPopCurrent` — thin wrappers around `cuCtxSetCurrent`/`GetCurrent`.
    - `cuDevicePrimaryCtxRetain`/`cuDevicePrimaryCtxRelease` — create/destroy the primary context (single GPU on Apple Silicon).
    - `cuStreamGetPriority`/`cuStreamGetFlags` — return 0 (single-priority stream model).
    - `cuModuleGetGlobal` — stub returning `CUDA_ERROR_NOT_FOUND` (no runtime-addressable `__device__` globals in CuMetal).
  - Runtime peer copy: `cudaMemcpyPeer`/`cudaMemcpyPeerAsync` — UMA single GPU; forward to `cudaMemcpy`/`cudaMemcpyAsync` with `cudaMemcpyDefault`.
  - `cudaLaunchHostFunc(stream, fn, userData)` — synchronizes the stream, then calls `fn(userData)`.
  - `cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags` — delegates to the base function.
  Test: `functional_extended_api_v2` (18 sub-tests covering all new APIs).
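As an illustration of the symmetric-update semantics, here is a minimal Python reference for the rank-1 case with `uplo = LOWER` (hypothetical helper; column-major storage as in cuBLAS):

```python
def ssyr_lower(n, alpha, x, A, lda):
    """Symmetric rank-1 update: A += alpha * x * x^T.

    A is a flat column-major n x n matrix; only the lower triangle
    (rows at or below the diagonal of each column) is written, matching
    cublasSsyr with CUBLAS_FILL_MODE_LOWER.
    """
    for j in range(n):
        for i in range(j, n):              # i >= j: lower triangle only
            A[j * lda + i] += alpha * x[i] * x[j]
```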
- Extended APIs batch 3 (`runtime/api/`, `runtime/rt/`):
  - cuRAND: `curandGenerateExponential`/`curandGenerateExponentialDouble` — exponential distribution via inverse transform: X = -ln(U), U ~ Uniform(0,1).
  - cuFFT: `cufftGetProperty(type, value)` — returns the cuFFT version major/minor/patch (consistent with the `curandGetProperty`/`cublasGetProperty` pattern).
  - cuBLAS BLAS2: `cublasSsyr2`/`cublasDsyr2` — symmetric rank-2 update: A += α·(x·yᵀ + y·xᵀ); only the upper or lower triangle updated.
  - cuBLAS BLAS2: `cublasStrmv`/`cublasDtrmv` — triangular matrix-vector multiply: x := op(A)·x; supports UPPER/LOWER, NO_TRANS/TRANS, UNIT/NON_UNIT diagonal; uses a temp buffer for in-place correctness.
  - cuBLAS BLAS3: `cublasSsymm`/`cublasDsymm` — symmetric matrix-matrix multiply: C = α·A·B + β·C (SIDE_LEFT) or C = α·B·A + β·C (SIDE_RIGHT); symmetric element lookup reconstructs the missing triangle from the stored half.
  - cuBLAS BLAS3: `cublasStrmm`/`cublasDtrmm` — triangular matrix-matrix multiply: C = α·op(A)·B (SIDE_LEFT) or C = α·B·op(A) (SIDE_RIGHT); output written to C (cuBLAS v2 API); supports all trans/diag/uplo/side combinations.
  - cuBLAS BLAS1: `cublasSrot`/`cublasDrot` — apply a Givens rotation: x[i] = c·x[i] + s·y[i]; y[i] = c·y[i] - s·x[i].
  - cuBLAS BLAS1: `cublasSrotg`/`cublasDrotg` — construct a Givens rotation: given (a,b), compute (c,s,r,z) such that [c s; -s c]·[a; b] = [r; 0].
  Test: `functional_extended_api_v3` (14 sub-tests covering all new APIs).
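The rotg/rot pair above reduces to a few lines. Note that this simplified `srotg` omits the BLAS `z` output and the sign convention for negative inputs, which the real routines handle:

```python
import math

def srotg(a, b):
    """Construct a Givens rotation: [c s; -s c] @ [a; b] = [r; 0].

    Simplified: returns (c, s, r) only; BLAS also returns z and applies
    a magnitude-based sign convention.
    """
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0, 0.0
    return a / r, b / r, r

def srot(x, y, c, s):
    """Apply the rotation in place to vectors x and y (unit stride)."""
    for i in range(len(x)):
        xi, yi = x[i], y[i]
        x[i] = c * xi + s * yi
        y[i] = c * yi - s * xi
```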
- Extended APIs batch 4 (`runtime/api/`, `runtime/rt/`, `runtime/driver/`):
  - Runtime 3D memset: `cudaMemset2DAsync` — async variant of 2D memset (stream ignored on UMA); `cudaMemset3D`/`cudaMemset3DAsync` — fill a 3D pitched volume plane-by-row using `pitchedDevPtr.pitch × pitchedDevPtr.ysize` as the plane stride.
  - Driver 2D memset: `cuMemsetD2D8`/`cuMemsetD2D16`/`cuMemsetD2D32` — strided per-row fill (8-bit uses `memset`; 16/32-bit use typed element loops); `*Async` variants alias the synchronous versions (stream ignored; UMA).
  - Driver allocation query: `cuMemGetAddressRange(pbase, psize, dptr)` — queries CuMetal's allocation table via `cumetalRuntimeGetAllocationInfo` to return the base address and allocation size for any pointer within a `cudaMalloc`-ed block.
  - Driver pointer attributes: `cuPointerGetAttribute(data, attribute, ptr)` — supports `CU_POINTER_ATTRIBUTE_MEMORY_TYPE` (returns `CU_MEMORYTYPE_UNIFIED`), `CU_POINTER_ATTRIBUTE_DEVICE_POINTER` and `CU_POINTER_ATTRIBUTE_HOST_POINTER` (both return the pointer itself; UMA identity), `CU_POINTER_ATTRIBUTE_IS_MANAGED`, `CU_POINTER_ATTRIBUTE_MAPPED`, and `CU_POINTER_ATTRIBUTE_CONTEXT`.
  - cuBLAS BLAS1: `cublasSrotm`/`cublasDrotm` — apply a modified Givens rotation H to (x,y); flag encoding: -2 = identity no-op, -1 = general [h11 h12; h21 h22], 0 = diagonal-1 [1 h12; h21 1], 1 = off-diagonal [h11 1; -1 h22].
  - cuBLAS BLAS1: `cublasSrotmg`/`cublasDrotmg` — construct a modified Givens rotation using the Lawson et al. algorithm; encodes H into param[0..4] with a rescaling loop to prevent overflow/underflow; updates d1, d2, x1 in place.
  Test: `functional_extended_api_v4` (34 sub-tests covering all new APIs).
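The `rotm` flag encoding is easy to get wrong, so here is a small Python model of the dispatch (hypothetical helper; the real routine also honors the incx/incy strides):

```python
def srotm(x, y, param):
    """Apply a modified Givens rotation H to vectors x and y in place.

    param[0] selects the encoding:
      -2 -> identity (no-op)          -1 -> general [h11 h12; h21 h22]
       0 -> unit diagonal [1 h12; h21 1]   1 -> fixed off-diagonal [h11 1; -1 h22]
    """
    flag = param[0]
    if flag == -2.0:
        return
    if flag == -1.0:
        h11, h21, h12, h22 = param[1], param[2], param[3], param[4]
    elif flag == 0.0:
        h11, h21, h12, h22 = 1.0, param[2], param[3], 1.0
    else:  # flag == 1.0
        h11, h21, h12, h22 = param[1], -1.0, 1.0, param[4]
    for i in range(len(x)):
        xi, yi = x[i], y[i]
        x[i] = h11 * xi + h12 * yi
        y[i] = h21 * xi + h22 * yi
```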
- Threadgroup memory tiling hints (`compiler/passes/src/threadgroup_tiling.cpp`): New `analyse_threadgroup_tiling()` pass that scans a PTX kernel's instruction stream for shared-memory bank-conflict patterns. The pass detects `mul.lo`/`shl` stride constants (window of 4) immediately preceding `ld.shared`/`st.shared`/`atom.shared`/`red.shared` accesses and emits `TilingHint` entries for every power-of-2 stride ≥ 16 that aligns to the 32-bank, 4-byte-per-bank Metal threadgroup memory boundary. Each hint carries the detected stride, element size, recommended padding (1 element = `elem_bytes`), and a human-readable reason string. Covered by `unit_threadgroup_tiling` (9 sub-cases).
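The bank-conflict model the pass assumes (32 banks, 4 bytes per bank) can be demonstrated numerically. This toy snippet shows why a power-of-2 stride serializes a column walk and why padding by one element restores full bank spread:

```python
from collections import Counter

def max_bank_conflict(addrs_words, num_banks=32):
    """Degree of conflict: max number of lanes that hit the same bank.

    addrs_words are word (4-byte) offsets; bank = offset mod num_banks.
    1 means conflict-free; 32 means fully serialized.
    """
    return max(Counter(a % num_banks for a in addrs_words).values())

# Column walk of a 32x32 float tile: lane i reads element [i][0], at
# word offset i * stride.
unpadded = [i * 32 for i in range(32)]   # stride 32 -> every lane hits bank 0
padded   = [i * 33 for i in range(32)]   # stride 33 (pad 1 elem) -> banks all distinct
```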
Items remaining (deferred per spec §2.2):
- Kernel fusion via MLIR GPU dialect (optional, deferred to v2).
Implemented:
- Phase 0.5 tooling:
  - `air_inspect`: `.metallib` container inspection — parses Apple function-list tags (`NAME`/`TYPE`/`HASH`/`MDSZ`/`OFFT`/`VERS`) on the current Xcode layout
  - `cumetal-air-emitter`: `.metallib` emission (xcrun-backed + experimental mode)
  - `cumetalc`: thin compiler-driver CLI over the AIR emitter
  - `air_validate`: structural checks + optional `xcrun metal -validate`
  - `cumetal_metal_load_test`: `MTLDevice.newLibraryWithData:` acceptance test
- Phase 1 scaffolding:
  - minimal PTX text parser (`.version`/`.target`/`.entry`/`.param` + instruction stream) with tolerant/strict unsupported-op modes in `compiler/ptx/`
  - `cumetal-ptx2llvm`: PTX text to LLVM IR (AIR metadata scaffold) via the phase1 pipeline, including concrete vector-add and matrix-multiply body emission for recognized signatures
  - PTX signature lowering now also covers unary `negate` and `reduce_sum` (atomic add) kernels used in regression tests for `neg.f32`, `shl.b64`, and `atom.global.add.f32` paths
  - intrinsic-lowering opcode coverage expanded for `div`, `rem`, `and`, `or`, `xor`, `not`, `selp`, and `rcp` instruction roots, with strict PTX sweep coverage
  - math intrinsic lowering extended: `fma`, `max`/`min`/`abs` (with float/int variants), `sqrt`, `rsqrt`, `ex2` → exp2, `lg2` → log2, `sin`, `cos`
  - warp primitive lowering: `shfl.sync.{idx,down,up,bfly}` → `air.simdgroup.shuffle*`, `vote.sync.{ballot,any,all}` → `air.simdgroup.{ballot,any,all}`, `bar.warp.sync` → `air.simdgroup.barrier` (__syncwarp emulation)
  - memory barrier lowering: `membar.gl/sys` → `air.mem.barrier.device`, `membar.cta` → `air.mem.barrier.threadgroup` (__threadfence/__threadfence_block)
  - async copy lowering: `cp.async.*` → `air.cp_async` (serialized ld+st); `cp.async.commit_group/wait_group/wait_all` → `air.threadgroup_barrier`
  - warp reduction lowering: `redux.sync.{add,and,or,xor,min,max}` → `air.simdgroup.reduce_{add,and,or,xor,min,max}[.f32]` (__redux_sync emulation)
  - parser: targeted error diagnostics for Hopper cluster ops (`cluster.*`, `mbarrier.*`), TMA (`cp.async.bulk.tensor.*`), and FP8 (`cvt.rn.f8*`) with specific messages
  - `cumetalc` accepts `.ptx` input via internal PTX→LLVM lowering (`--entry`, `--ptx-strict`)
  - `cumetalc` accepts initial `.cu` input via xcrun clang++ frontend lowering to LLVM IR
  - expanded PTX sweep harness (`tests/ptx_sweep`) for strict-mode supported/unsupported opcode checks
  - initial `intrinsic_lower` pass for thread-index/barrier/basic-math mappings
  - initial `printf_lower` pass for PTX `printf`/`vprintf` call extraction and format-table metadata
  - initial `addrspace` pass for shared/global/local load-store + `cvta.to.*` rewrites
  - initial `metadata` pass for AIR-style kernel metadata fields
  - initial phase1 pipeline API chaining parser + passes for a selected PTX entry
  - PTX parser handles entry attributes between signature/body (e.g. `.maxntid`, `.minnctapersm`) and `.param` qualifiers (`.ptr`, `.align`) used by clang-emitted PTX
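To make the parser's job concrete, here is a toy sketch of `.entry`/`.param` extraction. It is regex-based and far simpler than the real parser in `compiler/ptx/`, which also handles `.maxntid`, `.ptr`, `.align`, and the instruction stream:

```python
import re

# A minimal PTX module for illustration (hypothetical kernel name).
PTX = """
.version 7.0
.target sm_80
.entry vec_add(
    .param .u64 vec_add_param_0,
    .param .u64 vec_add_param_1
)
{
    ret;
}
"""

def parse_entries(ptx: str):
    """Map each .entry name to its list of (.param type, name) pairs."""
    entries = {}
    for m in re.finditer(r'\.entry\s+(\w+)\s*\(([^)]*)\)', ptx):
        params = re.findall(r'\.param\s+(\.\w+)\s+(\w+)', m.group(2))
        entries[m.group(1)] = params
    return entries
```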
- Early Phase 0 runtime path:
  - allocation tracking (`ptr -> MTLBuffer`) with offset resolution
  - optional `MTLHeap`-backed sub-allocation path for `cudaMalloc`/`cuMemAlloc` (`CUMETAL_MTLHEAP_ALLOC=1`, chunk size override: `CUMETAL_MTLHEAP_CHUNK_BYTES`)
  - synchronous `cudaMemcpy` on UMA via `memcpy`
  - kernel launch through Metal compute pipelines (`setBuffer` + `setBytes`)
  - default-stream, per-thread default stream, and user-stream execution (`cudaStreamCreate`/`Destroy`/`Synchronize`, `cudaStreamPerThread`, `cudaStreamLegacy`)
  - runtime functional tests for vector add, matrix multiply, and saxpy
  - initial library shims for cuRAND and cuBLAS v2
  - cuBLAS `cublasSgemm`/`cublasSgemmStridedBatched` backed by MetalPerformanceShaders GEMM
  - driver module loading from both in-memory metallib bytes and filesystem paths
  - on-disk cache for `cuModuleLoadData` metallib byte payloads
  - driver stream/event/memory APIs enforce `cuInit` + current-context requirements
  - shared runtime artifact: `libcumetal.dylib` (plus `cuda.h`/`cuda_runtime.h` install headers)
  - startup conflict warning if another `libcuda.dylib` is already loaded
  - Metal command-buffer failures map to CUDA timeout/illegal-address/devices-unavailable errors
  - default module cache root: `$HOME/Library/Caches/io.cumetal/kernels` (override: `CUMETAL_CACHE_DIR`)
  - `samples/vectorAdd` source flow exercised end-to-end (compile `.cu` with `cumetalc`, link host app against `libcumetal`, execute and validate output)
  - opt-in registration path symbols for binary-shim style launches (`__cudaRegisterFatBinary`, `__cudaRegisterFatBinary2`, `__cudaRegisterFatBinary3`, `__cudaRegisterFatBinaryEnd`, `__cudaRegisterFunction`, `__cudaRegisterVar`, `__cudaRegisterManagedVar`, `__cudaPushCallConfiguration`)
  - legacy runtime launch path (`cudaConfigureCall`/`cudaSetupArgument`/`cudaLaunch`)
  - llm.c FP32 CUDA stress binary can be built and executed through the CuMetal registration path using `scripts/build_llmc_test_gpt2fp32cu.sh` + `scripts/run_llmc_test_gpt2fp32cu.sh`
  - `conformance_llmc_gpt2fp32cu` now enforces numerical parity markers and passes with `OK (LOGITS)`, `LOSS OK`, `TENSOR OK`, and `overall okay: 1`
  - llm.c harness build shim supports `CUMETAL_LLMC_GRAD_TOL` (default `1.2e-2`) to tune the gradient-check tolerance applied to the generated test translation unit
  - llm.c runtime emulation fallback is now explicitly traceable (`CUMETAL_TRACE_LLMC_EMULATION=1`) and can be disabled (`CUMETAL_DISABLE_LLMC_EMULATION=1`) to validate pure PTX-lowered execution
  - direct Metal lowering for all 17 llm.c GPT-2 training kernels (`compiler/ptx/src/lower_to_metal.cpp`); `CUMETAL_LLMC_REQUIRE_NO_EMULATION=1` now passes (`OK (LOGITS)`, `LOSS OK`, `TENSOR OK`, `overall okay: 1`) without any emulation fallback
  - PTX sweep extended with 30+ new test cases: `shfl.sync.{idx,down,up,bfly}`, `vote.sync.{ballot,any,all}`, `bar.warp.sync`, `membar.{gl,cta,sys}`, `cp.async.{ca,commit_group,wait_all}`, `redux.sync.{add,and,or,xor,min,max}`, and math intrinsics `sqrt`, `rsqrt`, `ex2`, `lg2`, `sin`, `cos`, `fma`, `abs`, `min`, `max`
  - unsupported-op sweep extended with targeted diagnostic cases for Hopper cluster ops (`cluster.sync.aligned`, `mbarrier.init`, `mbarrier.arrive`), TMA (`cp.async.bulk.tensor.1d.*`), and FP8 (`cvt.rn.f8x2.*`)
  - `--fp64=native|emulate|warn` flag added to `cumetalc` (spec §8.1); `warn` mode emits per-instruction warnings for `.f64` opcodes; `emulate` implements Dekker FP32-pair decomposition for recognized fp64 kernels; the runtime defaults to `kEmulate` because the Apple Silicon GPU rejects `fmul double` in Metal pipelines at runtime (set `CUMETAL_FP64_MODE=native` to force native mode for compilation-path testing)
  - functional tests added:
    - `functional_runtime_warp_shuffle` (simd_shuffle broadcast, 64 threads, lane-0 broadcast)
    - `functional_runtime_fp16_ops` (half-precision add, 256 elements, exact integer check)
    - `functional_runtime_shared_reduce` (256-thread tree reduction, output[0] == 256.0)
    - `functional_runtime_grid_2d` (4×4 grid of 2×2 blocks, linear index check)
    - `functional_runtime_grid_3d` (2×3×4 grid of 2×2×2 blocks, 3D linear index check)
    - `functional_runtime_fp64_ops` (PTX fma.rn.f64 via driver API; PASS via emulate mode)
    - `functional_runtime_atomic_shared` (threadgroup atomic, 128 blocks × 256 threads = 32768)
    - `functional_runtime_warp_vote` (simd_any/all/ballot; 64 threads, ballot = 0x55555555)
    - `functional_runtime_struct_arg` (struct by-value argument via CUMETAL_ARG_BYTES)
    - `functional_runtime_barrier_order` (thread 0 writes sentinel; all threads verify post-barrier)
    - `functional_runtime_cp_async_emul` (cp.async emulated as ld+st+threadgroup_barrier)
    - `functional_runtime_warp_partial_mask` (spec §5.3/§10.3 partial-mask conservative lowering)
  - intrinsic lowering: `brev.b32/b64` → `llvm.bitreverse.i32/i64` added to pass and parser
  - intrinsic_lower unit tests: Test 6 (abs/shr), Test 7 (brev), Test 8 (f32/f64 math, b64 bitwise)
  - PTX sweep: expanded to 93+ cases covering all kSupportedRoots opcode roots, including:
    - `clz.b64`, `popc.b64` (64-bit bit-count ops)
    - `add/sub/mul/div.f32` (basic float arithmetic)
    - `neg/abs/min/max.f64` (double-precision unary/binary)
    - `and/or/xor/not.b64` (64-bit bitwise ops)
    - `mul.lo.u64`, `rem.u32`, `rem.s64`
    - `abs.{s32,s64,f32,f64}`, `shr.{b32,u32,s32,b64,u64,s64}`
    - `vote.{ballot,any,all}` non-sync forms
    - `st.global.{u32,u64,f64}`, `ld.global.{u8,s8,u16,s16}`
    - `atom.global.{cas,and,or,xor,min,max,exch}.b32`
    - `redux.sync.{min,max}.f32`
    - partial-mask variants: `shfl.sync` and `vote.sync.ballot` with mask=0x0000FFFF
Supported runtime API subset:
- `cudaInit`, `cudaDriverGetVersion`, `cudaRuntimeGetVersion`
- `cudaGetDeviceCount`, `cudaGetDevice`, `cudaSetDevice`, `cudaGetDeviceProperties`, `cudaDeviceGetAttribute`
- `cudaSetDeviceFlags`, `cudaGetDeviceFlags`
- `cudaMalloc`, `cudaMallocManaged`, `cudaMallocHost`, `cudaFree`
- `cudaHostAlloc`, `cudaFreeHost`, `cudaHostGetDevicePointer`, `cudaHostGetFlags`
- `cudaMemGetInfo`
- `cudaMemcpy`, `cudaMemcpyAsync`
- `cudaMemcpyToSymbol`, `cudaMemcpyFromSymbol`, `cudaMemcpyToSymbolAsync`, `cudaMemcpyFromSymbolAsync`
- `cudaMemset`, `cudaMemsetAsync`
- `cudaLaunchKernel`
- `cudaConfigureCall`, `cudaSetupArgument`, `cudaLaunch`
- `cudaStreamCreate`, `cudaStreamCreateWithFlags`, `cudaStreamDestroy`
- `cudaStreamSynchronize`, `cudaStreamQuery`, `cudaStreamAddCallback`
- `cudaStreamWaitEvent`
- `cudaEventCreate`, `cudaEventCreateWithFlags`, `cudaEventRecord`
- `cudaEventQuery`, `cudaEventSynchronize`, `cudaEventElapsedTime`, `cudaEventDestroy`
- `cudaDeviceReset`
- `cudaDeviceSynchronize`
- `cudaGetLastError`, `cudaPeekAtLastError`, `cudaGetErrorName`, `cudaGetErrorString`
- `cudaProfilerStart`, `cudaProfilerStop`
- `cudaFuncGetAttributes`, `cudaFuncSetCacheConfig`, `cudaFuncSetSharedMemConfig`, `cudaFuncSetAttribute`
- `cudaOccupancyMaxActiveBlocksPerMultiprocessor`, `cudaOccupancyMaxPotentialBlockSize`
- `cudaPointerGetAttributes`, `cudaChooseDevice`
- `cudaStreamCreateWithPriority` (priority ignored; creates a regular stream)
- `cudaDeviceSetLimit` (no-op), `cudaDeviceGetLimit` (returns sensible defaults)
- `cudaLaunchCooperativeKernel` (forwards to `cudaLaunchKernel`; threadgroup CG works)
- `cudaDeviceSetCacheConfig`, `cudaDeviceGetCacheConfig` (no-op stubs; all memory is UMA)
- `cudaDeviceSetSharedMemConfig`, `cudaDeviceGetSharedMemConfig` (no-op stubs)
- `cudaGetSymbolAddress`, `cudaGetSymbolSize`
- `cudaMemPrefetchAsync`, `cudaMemAdvise`, `cudaMemRangeGetAttribute` (meaningful no-ops on Apple Silicon UMA)
- `cudaDeviceGetStreamPriorityRange` (returns 0,0 — Metal has no priority queues)
- `cudaMemcpy2D`, `cudaMemcpy2DAsync`, `cudaMemset2D` (row-by-row on UMA)
- `cudaMallocPitch` (aligned 2D allocation; pitch rounded to 512 bytes)
- `cudaDeviceCanAccessPeer`, `cudaDeviceEnablePeerAccess`, `cudaDeviceDisablePeerAccess` (no-op stubs; single GPU)
- `cuda_runtime_api.h` forwarding header (for programs that include this directly)
Device intrinsics added to cuda_runtime.h:
- Type-punning: `__int_as_float`, `__float_as_int`, `__uint_as_float`, `__float_as_uint`, `__longlong_as_double`, `__double_as_longlong`
- Integer: `__mulhi`, `__umulhi`, `__mul24`, `__umul24`, `__sad`, `__usad`
- Fast math: `__sinf`, `__cosf`, `__tanf`, `__expf`, `__exp2f`, `__logf`, `__log2f`, `__log10f`, `__powf`, `__sqrtf`, `__rsqrtf`, `__fdividef`, `__frcp_rn`, `__fsqrt_rn`
- Lane masks: `__lanemask_eq`, `__lanemask_lt`, `__lanemask_le`, `__lanemask_gt`, `__lanemask_ge`
- Warp reductions: `__reduce_add_sync`, `__reduce_and_sync`, `__reduce_or_sync`, `__reduce_xor_sync`, `__reduce_min_sync`, `__reduce_max_sync`
- Warp shuffle: `__shfl_sync`, `__shfl_down_sync`, `__shfl_up_sync`, `__shfl_xor_sync` (int + float overloads; partial masks map to full-group on Apple Silicon)
- Warp vote: `__any_sync`, `__all_sync`, `__ballot_sync` (mask parameter accepted, but Apple Silicon is always full-group)
- Double atomics: `atomicAdd(double*, double)` via a 64-bit CAS loop
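The double-precision `atomicAdd` CAS loop can be modeled with bit-punning between double and u64. This is a single-threaded Python stand-in for exposition; on the GPU the compare-and-swap is a hardware atomic:

```python
import struct

def bits(d: float) -> int:
    """Reinterpret a double's bit pattern as a u64."""
    return struct.unpack('<Q', struct.pack('<d', d))[0]

def value(q: int) -> float:
    """Reinterpret a u64 bit pattern as a double."""
    return struct.unpack('<d', struct.pack('<Q', q))[0]

def atomic_add_double(mem, addr, val):
    """Model of atomicAdd(double*, double) lowered to a 64-bit CAS loop."""
    while True:
        old = mem[addr]                    # u64 snapshot of the word
        new = bits(value(old) + val)       # compute the updated bit pattern
        if mem[addr] == old:               # stand-in for atomicCAS success
            mem[addr] = new
            return value(old)              # CUDA returns the previous value
        # CAS failed: another thread changed the word; loop and retry
```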
cuda_fp16.h expanded:
- Comparison: `__hge`, `__hle`
- Math: `__hfma`, `__hneg`, `__habs`, `__hmax`, `__hmin`
- Conversions: `__half2int_rn`, `__half2uint_rn`, `__half2short_rn`, `__half2ll_rn`, `__int2half_rn`, `__uint2half_rn`, `__short2half_rn`, `__ll2half_rn`
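For context on these conversions, decoding an IEEE 754 binary16 value from its bit pattern takes only a few lines. This Python sketch mirrors what a host-side `__half`-to-float conversion must do:

```python
def half_to_float(h: int) -> float:
    """Decode IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits."""
    sign = -1.0 if h & 0x8000 else 1.0
    exp = (h >> 10) & 0x1F
    mant = h & 0x3FF
    if exp == 0:                      # subnormal (or zero): no implicit 1 bit
        return sign * mant * 2.0 ** -24
    if exp == 0x1F:                   # all-ones exponent: inf or NaN
        return sign * float('inf') if mant == 0 else float('nan')
    return sign * (1.0 + mant / 1024.0) * 2.0 ** (exp - 15)
```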
Driver API additions:
- `cuMemAllocPitch`, `cuCtxEnablePeerAccess`, `cuCtxDisablePeerAccess`
- `cuCtxGetStreamPriorityRange` (returns 0,0)
- `cuLaunchHostFunc` (launches a CPU callback asynchronously on a stream; implemented via `cudaStreamAddCallback`)
cudaDeviceProp fields now populated per spec §6.8:
- `unifiedAddressing = 1`, `managedMemory = 1`, `concurrentManagedAccess = 1` (UMA)
- `maxBufferArguments = 31` (Metal buffer argument limit)
- `clockRate`, `memoryClockRate` (1296000 kHz), `memoryBusWidth` (128-bit)
- `totalConstMem` (64 KB), `sharedMemPerMultiprocessor`, `maxThreadsPerMultiProcessor` (2048)
- `l2CacheSize` (4 MB), `canMapHostMemory = 1`, `integrated = 1`, `concurrentKernels = 1`
- `asyncEngineCount = 0`, `computeMode = cudaComputeModeDefault`
- `pciBusID`, `pciDeviceID`, `pciDomainID` (all 0 — no discrete PCI GPU)
- `tccDriver = 0`, `kernelExecTimeoutEnabled = 0`
- `pageableMemoryAccess = 1`, `pageableMemoryAccessUsesHostPageTables = 1`
`cudaComputeMode` enum added: `cudaComputeModeDefault`, `cudaComputeModeExclusive`, `cudaComputeModeProhibited`, `cudaComputeModeExclusiveProcess`
cudaDeviceGetAttribute and cuDeviceGetAttribute now support additional attributes:
- `cudaDevAttrComputeCapabilityMajor`/`CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR` → 8
- `cudaDevAttrComputeCapabilityMinor`/`CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR` → 0
- `cudaDevAttrMaxRegistersPerBlock`/`CU_DEVICE_ATTRIBUTE_MAX_REGISTERS_PER_BLOCK` → 65536
- `cudaDevAttrClockRate`/`CU_DEVICE_ATTRIBUTE_CLOCK_RATE` → 1296000 kHz
- `cudaDevAttrTextureAlignment` → 512 bytes
- `cudaDevAttrGpuOverlap`/`CU_DEVICE_ATTRIBUTE_GPU_OVERLAP` → 1
- `cudaDevAttrMemoryBusWidth` → 128, `cudaDevAttrL2CacheSize` → 4 MB
- `cudaDevAttrMaxThreadsPerMultiProcessor` → 2048, `cudaDevAttrMemoryClockRate` → 1296000
- `cudaDevAttrIntegrated` → 1, `cudaDevAttrCanMapHostMemory` → 1
- `cudaDevAttrComputeMode` → 0, `cudaDevAttrConcurrentKernels` → 1
- `cudaDevAttrPciBusId`, `cudaDevAttrPciDeviceId`, `cudaDevAttrPciDomainId` → 0
- `cudaDevAttrTccDriver` → 0, `cudaDevAttrKernelExecTimeout` → 0, `cudaDevAttrAsyncEngineCount` → 0
- `cudaDevAttrPageableMemoryAccess` → 1, `cudaDevAttrPageableMemoryAccessUsesHostPageTables` → 1
- `cudaDevAttrSharedMemPerBlockOptin` → sharedMemPerBlock
cooperative_groups::thread_block_tile&lt;N&gt; extended with:
- `shfl(val, src_rank)`, `shfl_down(val, delta)`, `shfl_xor(val, mask)`
- `any(pred)`, `all(pred)`, `ballot(pred)` (via `__nvvm_vote_*` builtins)
- `cooperative_groups::less<T>` binary operator alongside the existing `plus<T>` and `greater<T>`
CUDA vector types added to cuda_runtime.h:
- All standard types: `char2/3/4`, `short2/3/4`, `int2/3/4`, `uint2/4`, `long2/4`, `longlong2/4`, `ulong2/4`, `ulonglong2/4`, `float2/3`, `double2/3/4` with `__align__` annotations and `make_*` constructors
Device atomics added (CUDA device code path, spec §6.7):
- `atomicSub`, `atomicExch` (int/uint/float), `atomicMin`/`atomicMax` (int/uint), `atomicCAS` (uint/int/ull), `atomicAnd`/`atomicOr`/`atomicXor` (int/uint)

Device intrinsics added (guarded by `#ifndef __CLANG_CUDA_DEVICE_FUNCTIONS_H__`):
- `__syncwarp`, `__threadfence`, `__threadfence_block`, `__threadfence_system`
- `__activemask`, `__popc`/`__popcll`, `__clz`/`__clzll`, `__brev`/`__brevll`
- `__ffs`/`__ffsll`, `__fmaf_rn`, `__fma_rn`
install.sh / uninstall.sh now detect fish shell (`$SHELL = */fish`) and write `set -gx` syntax to `~/.config/fish/config.fish`; `CUMETAL_SHELL_RC` overrides the target file.
Supported driver API subset:
- `cuInit`, `cuDriverGetVersion`, `cuDeviceGetCount`, `cuDeviceGet`, `cuDeviceGetName`, `cuDeviceTotalMem`, `cuDeviceGetAttribute`
- `cuCtxCreate`, `cuCtxDestroy`, `cuCtxSetCurrent`, `cuCtxGetCurrent`, `cuCtxGetDevice`, `cuCtxGetFlags`, `cuCtxSetFlags`, `cuCtxSynchronize`
- `cuStreamCreate`, `cuStreamDestroy`, `cuStreamSynchronize`, `cuStreamQuery`, `cuStreamAddCallback`, `cuStreamWaitEvent`
- `cuEventCreate`, `cuEventDestroy`, `cuEventRecord`, `cuEventQuery`, `cuEventSynchronize`, `cuEventElapsedTime`
- `cuModuleLoad`, `cuModuleLoadData`, `cuModuleLoadDataEx`, `cuModuleUnload`, `cuModuleGetFunction`
- `cuModuleLoadData` accepts metallib bytes/paths and PTX text images (including basic CUDA fatbin wrapper PTX variants)
- `cuModuleLoadDataEx` accepts option arrays in compatibility mode (options are currently ignored)
- `cuLaunchKernel` (kernel-params path and `extra` packed-argument path)
- `cuMemAlloc`, `cuMemAllocManaged`, `cuMemFree`
- `cuMemGetInfo`
- `cuMemAllocHost`, `cuMemHostAlloc`, `cuMemHostGetDevicePointer`, `cuMemHostGetFlags`, `cuMemFreeHost`
- `cuMemcpyHtoD`, `cuMemcpyDtoH`, `cuMemcpyDtoD`
- `cuMemcpyHtoDAsync`, `cuMemcpyDtoHAsync`, `cuMemcpyDtoDAsync`
- `cuMemsetD8`, `cuMemsetD8Async`
- `cuGetErrorName`, `cuGetErrorString`
- `cuProfilerStart`, `cuProfilerStop`
- `cuOccupancyMaxActiveBlocksPerMultiprocessor`, `cuOccupancyMaxPotentialBlockSize`
- `cuFuncGetAttribute`, `cuFuncSetCacheConfig`
- `cuStreamCreateWithPriority` (priority ignored; creates a regular stream)
- `cuLaunchCooperativeKernel` (forwards to `cuLaunchKernel`; threadgroup CG works)
- `cuMemsetD16`, `cuMemsetD32`, `cuMemsetD16Async`, `cuMemsetD32Async`
- `cuDeviceComputeCapability` (returns 8.0 — synthetic Ampere-equivalent)
- `cuDeviceCanAccessPeer` (returns 0; single GPU on Apple Silicon)
Public headers now installed: cuda.h, cuda_runtime.h, cuda_fp16.h, cuda_bf16.h,
cublas_v2.h, cublas_api.h, cublasLt.h, cufft.h, curand.h, cusparse.h,
cusolver_common.h, cusolverDn.h, cudnn.h, nvml.h, nccl.h,
cooperative_groups.h, cooperative_groups/reduce.h, cuComplex.h,
nvToolsExt.h, nvtx3/nvToolsExt.h.
Forwarding headers (route to existing implementations):
device_launch_parameters.h, driver_types.h, library_types.h,
channel_descriptor.h, device_atomic_functions.h, math_functions.h,
cuda_profiler_api.h, cuda_occupancy.h, cuda_runtime_api.h,
sm_20_intrinsics.h, sm_30_intrinsics.h, sm_60_intrinsics.h,
sm_70_intrinsics.h, sm_80_intrinsics.h.
Header-only library shims:
- thrust (`thrust/`): `device_vector`, `host_vector`, `device_ptr`, `sort`, `sort_by_key`, `stable_sort`, `reduce`, `transform_reduce`, `inclusive_scan`, `exclusive_scan`, `transform`, `fill`, `copy`, `for_each`, `unique`, `sequence`, `counting_iterator`, `zip_iterator`, `transform_iterator`, `constant_iterator`, `discard_iterator`, `permutation_iterator`, `execution_policy`, `functional`, `pair`. CPU-backed on UMA (device memory is host-accessible).
- CUB (`cub/`): `BlockReduce`, `BlockScan`, `BlockExchange`, `BlockLoad`, `BlockStore`, `WarpReduce`, `WarpScan`, `DeviceReduce` (Sum/Min/Max/ArgMin/ArgMax), `DeviceScan` (Inclusive/Exclusive), `DeviceRadixSort` (SortKeys/SortPairs), `DeviceSelect` (If/Flagged/Unique), `DeviceHistogram` (Even/Range), `DeviceRunLengthEncode` (Encode/NonTrivialRuns). Sequential fallback for host-side compilation; device ops run on UMA.
- NVTX (`nvtx3/nvToolsExt.h`): No-op stubs for profiling annotations. Range push/pop, mark, domain API, and naming API are all silently ignored.
- `cuda_fp16.h` provides host-side `__half` (IEEE 754 float16 via bit manipulation) and device-side `__half = _Float16`; `atomicAdd(__half*, __half)` via a CAS loop (spec §8).
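The scan semantics that CUB's `DeviceScan` exposes (and that thrust's `inclusive_scan`/`exclusive_scan` mirror) reduce to a simple reference, sketched here in Python:

```python
def exclusive_scan(xs, init=0):
    """Exclusive prefix sum: out[i] = init + xs[0] + ... + xs[i-1]."""
    out, acc = [], init
    for x in xs:
        out.append(acc)      # emit the running total *before* adding xs[i]
        acc += x
    return out

def inclusive_scan(xs):
    """Inclusive prefix sum: out[i] includes xs[i] itself."""
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out
```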
Supported library shim subset:
- cuRAND (`curand.h`)
  - `curandCreateGenerator`, `curandDestroyGenerator`
  - `curandGetVersion`
  - `curandSetStream`, `curandGetStream`
  - `curandSetPseudoRandomGeneratorSeed`, `curandSetGeneratorOffset`
  - `curandGenerate` (uint32 output), `curandGenerateLongLong` (uint64 output)
  - `curandGenerateUniform`, `curandGenerateUniformDouble`
  - `curandGenerateNormal`, `curandGenerateNormalDouble`
  - `curandGenerateLogNormal`, `curandGenerateLogNormalDouble`
- cuFFT (`cufft.h`)
  - `cufftCreate`, `cufftDestroy`, `cufftSetStream`, `cufftGetSize`, `cufftGetVersion`
  - `cufftPlan1d`, `cufftPlan2d`, `cufftPlan3d`, `cufftPlanMany`
  - `cufftMakePlan1d`, `cufftMakePlan2d`, `cufftMakePlan3d`, `cufftMakePlanMany`
  - `cufftExecC2C`, `cufftExecR2C`, `cufftExecC2R` (single-precision)
  - `cufftExecZ2Z`, `cufftExecD2Z`, `cufftExecZ2D` (double-precision)
  - backed by Apple Accelerate `vDSP_DFT_Execute` (arbitrary N, any batch size)
  - `libcufft.dylib` symlink alias to `libcumetal.dylib`
- cuBLAS v2 (`cublas_v2.h`)
  - `cublasCreate`, `cublasDestroy`, `cublasGetVersion`
  - `cublasSetStream`, `cublasGetStream`
  - `cublasSetMathMode`, `cublasGetMathMode`
  - `cublasSaxpy`, `cublasSscal`, `cublasScopy`, `cublasSgemm`
  - `cublasSgemmStridedBatched`, `cublasDgemmStridedBatched`
  - `cublasSswap`, `cublasDswap`
  - `cublasSdot`, `cublasDdot`
  - `cublasSasum`, `cublasDasum`
  - `cublasSnrm2`, `cublasDnrm2`
  - `cublasIsamax`, `cublasIdamax`
  - `cublasIsamin`, `cublasIdamin`
  - `cublasSgemv`, `cublasDgemv`
  - `cublasSger`, `cublasDger`
  - `cublasSsymv`, `cublasDsymv`
  - `cublasDaxpy`, `cublasDscal`, `cublasDcopy`, `cublasDgemm`
Library alias compatibility:
- Build/install also provides `libcublas.dylib` and `libcurand.dylib` aliases to `libcumetal.dylib`, so software linked against CUDA library names can resolve shim symbols.
- Optional binary-shim alias: when `CUMETAL_ENABLE_BINARY_SHIM=ON`, build/install also provides `libcuda.dylib -> libcumetal.dylib`.
Known limitations (intentional per spec §2.2 and §8):
- Default kernel launch uses a CuMetal descriptor (`cumetalKernel_t`).
- Binary-shim registration: CuMetal `CMTL` envelopes, direct PTX images, and basic CUDA fatbin PTX images are supported; full NVCC fatbinary variants are not yet implemented.
- CUDA Graphs (`cudaGraphCreate`, `cuGraphCreate`): deferred to v2 per spec §2.2.
- Dynamic parallelism: compile-time error per spec §2.2.
- Texture/surface objects: deferred to v2 per spec §2.2 and §8.
- Multi-GPU peer access: single GPU only on Apple Silicon; peer APIs return appropriate errors.
- CUDA graphics interop (OpenGL/Vulkan): non-goal per spec §2.2.
- `cooperative_groups::grid_group::sync()`: no-op stub; Metal has no cross-threadgroup barrier.
- Warp partial-mask operations: conservative full-group emulation (spec §5.3).
- FP64: Apple Silicon GPU has minimal FP64 throughput; `--fp64=emulate` recommended (spec §8.1).
- Device printf: buffer-based; format strings limited to 256 bytes (spec §5.3).