chore(nccl): migrate internal/nccl to purego/dlopen #80
Merged
Implements internal/nccl as a zero-CGo runtime dlopen of libnccl.so.2, mirroring the pattern in internal/cublas/cublas_purego.go. The package now compiles on every platform without -tags cuda; non-linux GOOS returns a clean "not supported" error rather than panicking. ABI constants for ncclResult, ncclDataType, ncclRedOp, and NCCL_UNIQUE_ID_BYTES are hardcoded against the stable NCCL 2.x ABI. ncclCommInitRank takes the 128-byte ncclUniqueId by value. Per AAPCS64 rule B.4 (composites > 16 bytes are passed by hidden pointer), passing &uid.id[0] as a uintptr is the correct calling convention on linux/arm64, which is the only NCCL platform ztensor targets today. CI's go vet exclude list adds /internal/nccl$ alongside the other GPU runtime bindings that rely on unsafe.Pointer(uintptr(...)) trampolines. Refs #78
Renames the legacy nccl.go to nccl_cgo.go and tightens its build tag to //go:build cuda && cgo && nccl_cgo so the CGo implementation is OFF by default. The new purego/dlopen binding in nccl_purego.go is the default and only path on every supported platform; the CGo file is retained as a debugging fallback.
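To see why the CGo file is off by default, the constraint line can be evaluated with the standard library's go/build/constraint package. The sketch below is illustrative only; `cgoFileEnabled` is a hypothetical helper, and it treats `cgo` as an ordinary tag, whereas the real toolchain sets it when CGO_ENABLED=1.

```go
package main

import (
	"fmt"
	"go/build/constraint"
)

// cgoFileEnabled (hypothetical helper) reports whether a file carrying
// the given //go:build line would be compiled when exactly the listed
// tags are satisfied.
func cgoFileEnabled(line string, tags ...string) (bool, error) {
	expr, err := constraint.Parse(line)
	if err != nil {
		return false, err
	}
	set := make(map[string]bool, len(tags))
	for _, t := range tags {
		set[t] = true
	}
	return expr.Eval(func(tag string) bool { return set[tag] }), nil
}

func main() {
	const line = "//go:build cuda && cgo && nccl_cgo"
	for _, tags := range [][]string{
		{},
		{"cuda", "cgo"},
		{"cuda", "cgo", "nccl_cgo"},
	} {
		ok, err := cgoFileEnabled(line, tags...)
		fmt.Println(tags, ok, err)
	}
}
```

Only the last tag set enables the file, so an ordinary `-tags cuda` build still selects the purego binding.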
Drops the //go:build cuda guard from nccl_test.go so the package's tests compile on every platform. Tests that require libnccl.so.2 call a requireNccl helper that t.Skips when Available() returns false. Adds two new tests that exercise the pure-Go marshaling and ABI-constant paths without touching the runtime library.
Documents the rationale for replacing the CGo nccl binding with a runtime dlopen of libnccl.so.2, the AArch64 hidden-pointer ABI trick that lets us pass ncclUniqueId by value through the shared cuda.Ccall trampoline, and the consequences for build/CI/test posture.
dndungu added a commit that referenced this pull request on Apr 9, 2026
Closes #78.
Replaces the CGo binding in internal/nccl with a runtime dlopen of libnccl.so.2, mirroring the pattern in internal/cublas/cublas_purego.go. The package now compiles on every platform without -tags cuda; non-linux GOOS returns a clean "not supported" error rather than panicking.

Summary
- internal/nccl/nccl_purego.go resolves ncclGetUniqueId, ncclCommInitRank, ncclCommDestroy, ncclCommGetAsyncError, ncclAllReduce, ncclBroadcast, ncclGroupStart, ncclGroupEnd, and ncclGetErrorString via cuda.DlopenPath / cuda.Dlsym / cuda.Ccall. ABI constants for ncclResult, ncclDataType, ncclRedOp, and NCCL_UNIQUE_ID_BYTES = 128 are hardcoded against the stable NCCL 2.x ABI.
- The legacy CGo binding is renamed to internal/nccl/nccl_cgo.go and gated behind //go:build cuda && cgo && nccl_cgo (OFF by default; opt-in fallback only).
- nccl_test.go no longer requires -tags cuda. A requireNccl(t) helper skips when libnccl.so.2 is not dlopen-able. Two new tests (TestConstants, TestUniqueIDFromBytesRoundTripNoLib) exercise the pure-Go marshaling and ABI constants on every platform.
- Documents the rationale for the migration and the AArch64 hidden-pointer ABI trick that passes ncclUniqueId by value through the shared cuda.Ccall trampoline.
- CI's go vet exclude list adds /internal/nccl$ alongside the other GPU runtime binding packages.

AArch64 ABI note
ncclCommInitRank takes ncclUniqueId by value (128 bytes). Per AAPCS64 rule B.4, composites larger than 16 bytes are passed by hidden caller-allocated pointer, so passing uintptr(unsafe.Pointer(&uid.id[0])) is the correct calling convention on linux/arm64 (the only NCCL platform ztensor targets today). If we ever need linux/amd64 NCCL, the SysV ABI passes large aggregates on the stack, so we will need either an assembly trampoline or the nccl_cgo fallback.

Verification on DGX (linux/arm64, real libnccl.so.2)
(Two-GPU tests skip on the single-GPU DGX Spark host as expected.)
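The skip behaviour noted above can be sketched with a small gating helper. Everything here is hypothetical: `skipUnlessNccl` and `skipUnlessGPUs` stand in for the package's requireNccl(t) helper and its device-count checks, and the `skipper` interface is just the subset of *testing.T the sketch needs, so that it runs outside `go test`.

```go
package main

import "fmt"

// skipper is the subset of *testing.T used below; *testing.T satisfies it.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// skipUnlessNccl (hypothetical) skips when the runtime library could
// not be loaded, for example on darwin or on a host without NCCL.
func skipUnlessNccl(t skipper, available bool) {
	t.Helper()
	if !available {
		t.Skipf("libnccl.so.2 is not available")
	}
}

// skipUnlessGPUs (hypothetical) skips multi-GPU tests on hosts with too
// few devices, such as a single-GPU machine.
func skipUnlessGPUs(t skipper, have, want int) {
	t.Helper()
	if have < want {
		t.Skipf("need %d GPUs, have %d", want, have)
	}
}

// recorder is a stand-in for *testing.T that records skip messages.
type recorder struct{ skipped []string }

func (r *recorder) Helper() {}
func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = append(r.skipped, fmt.Sprintf(format, args...))
}

func main() {
	r := &recorder{}
	skipUnlessNccl(r, true) // library present: no skip
	skipUnlessGPUs(r, 1, 2) // one GPU, two required: skip
	fmt.Println(r.skipped)
}
```

With a real *testing.T, Skipf stops the test immediately, so the checks can sit at the top of each NCCL-dependent test.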
Test plan
- go build ./... on darwin/arm64 (no tags): clean
- go test ./internal/nccl/... on darwin/arm64: pure tests pass, NCCL tests skip
- go build ./... on linux/arm64 DGX: clean
- go test ./internal/nccl/... on linux/arm64 DGX: all NCCL paths exercised against real libnccl.so.2

Follow-up
The internal/nccl copy inside the zerfoo repository is not touched by this PR. It should be migrated separately.