chore(nccl): migrate internal/nccl to purego/dlopen #80

Merged
dndungu merged 4 commits into main from chore/nccl-purego
Apr 9, 2026

dndungu commented on Apr 9, 2026

Closes #78.

Replaces the CGo binding in internal/nccl with a runtime dlopen of libnccl.so.2, mirroring the pattern in internal/cublas/cublas_purego.go. The package now compiles on every platform without -tags cuda; on any non-Linux GOOS it returns a clean "not supported" error instead of panicking.

Summary

  • New internal/nccl/nccl_purego.go resolves ncclGetUniqueId, ncclCommInitRank, ncclCommDestroy, ncclCommGetAsyncError, ncclAllReduce, ncclBroadcast, ncclGroupStart, ncclGroupEnd, and ncclGetErrorString via cuda.DlopenPath/cuda.Dlsym/cuda.Ccall. ABI constants for ncclResult, ncclDataType, ncclRedOp, and NCCL_UNIQUE_ID_BYTES = 128 are hardcoded against the stable NCCL 2.x ABI.
  • Legacy CGo file moved to internal/nccl/nccl_cgo.go and gated behind //go:build cuda && cgo && nccl_cgo (OFF by default; opt-in fallback only).
  • nccl_test.go no longer requires -tags cuda. A requireNccl(t) helper skips when libnccl.so.2 is not dlopen-able. Two new tests (TestConstants, TestUniqueIDFromBytesRoundTripNoLib) exercise the pure-Go marshaling and ABI constants on every platform.
  • ADR-002 documents the rationale and the AAPCS64 hidden-pointer trick that lets us pass the 128-byte ncclUniqueId by value through the shared cuda.Ccall trampoline.
  • CI vet exclude list adds /internal/nccl$ alongside the other GPU runtime binding packages.

AArch64 ABI note

ncclCommInitRank takes ncclUniqueId by value (128 bytes). Per AAPCS64 rule B.4, a composite larger than 16 bytes is passed as a pointer to caller-allocated memory, so passing uintptr(unsafe.Pointer(&uid.id[0])) is the correct calling convention on linux/arm64 (the only NCCL platform ztensor targets today). If we ever need linux/amd64 NCCL, the System V AMD64 ABI instead copies large aggregates onto the stack, and we will need either an assembly trampoline or the nccl_cgo fallback.

Verification on DGX (linux/arm64, real libnccl.so.2)

=== RUN   TestConstants
--- PASS: TestConstants (0.00s)
=== RUN   TestUniqueIDFromBytesRoundTripNoLib
--- PASS: TestUniqueIDFromBytesRoundTripNoLib (0.00s)
=== RUN   TestGetUniqueID
--- PASS: TestGetUniqueID (0.01s)
=== RUN   TestUniqueIDRoundTrip
--- PASS: TestUniqueIDRoundTrip (0.00s)
=== RUN   TestUniqueIDFromBytesInvalidLength
--- PASS: TestUniqueIDFromBytesInvalidLength (0.00s)
=== RUN   TestSingleRankInitDestroy
--- PASS: TestSingleRankInitDestroy (1.15s)
=== RUN   TestSingleRankAllReduce
--- PASS: TestSingleRankAllReduce (0.52s)
=== RUN   TestGroupStartEnd
--- PASS: TestGroupStartEnd (0.00s)
PASS
ok  	github.com/zerfoo/ztensor/internal/nccl	1.816s

(Two-GPU tests skip on the single-GPU DGX Spark host as expected.)

Test plan

  • go build ./... on darwin/arm64 (no tags) — clean
  • go test ./internal/nccl/... on darwin/arm64 — pure tests pass, NCCL tests skip
  • go build ./... on linux/arm64 DGX — clean
  • go test ./internal/nccl/... on linux/arm64 DGX — all NCCL paths exercised against real libnccl.so.2
  • CI green
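
The skip behaviour that the test plan relies on can be sketched as below. This is an assumption-laden illustration rather than the helper in nccl_test.go: the real requireNccl takes only t and consults Available(), whereas this version injects the availability check and accepts a small `skipper` interface (which *testing.T satisfies) so it can run outside `go test`. The `recorder` type is a stand-in used purely for the demonstration.

```go
package main

import "fmt"

// skipper is the subset of testing.TB the helper needs.
// *testing.T satisfies it.
type skipper interface {
	Helper()
	Skipf(format string, args ...any)
}

// requireNccl skips the calling test when the NCCL library cannot be
// loaded. The available callback is an assumption made for this sketch.
func requireNccl(t skipper, available func() bool) {
	t.Helper()
	if !available() {
		t.Skipf("libnccl.so.2 is not loadable; skipping NCCL test")
	}
}

// recorder is a fake skipper. Unlike *testing.T, its Skipf does not stop
// the calling goroutine; it only records that a skip was requested.
type recorder struct {
	skipped bool
	msg     string
}

func (r *recorder) Helper() {}

func (r *recorder) Skipf(format string, args ...any) {
	r.skipped = true
	r.msg = fmt.Sprintf(format, args...)
}

func main() {
	r := &recorder{}
	requireNccl(r, func() bool { return false })
	fmt.Println(r.skipped, r.msg)
}
```

In a real test file the call would simply be the first line of each test that needs the library, leaving the pure-Go tests free to run everywhere.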

Follow-up

  • The duplicate internal/nccl copy inside the zerfoo repository is not touched by this PR. It should be migrated separately.

dndungu added 4 commits April 9, 2026 10:35
Implements internal/nccl as a zero-CGo runtime dlopen of libnccl.so.2,
mirroring the pattern in internal/cublas/cublas_purego.go. The package now
compiles on every platform without -tags cuda; non-linux GOOS returns a
clean "not supported" error rather than panicking. ABI constants for
ncclResult, ncclDataType, ncclRedOp, and NCCL_UNIQUE_ID_BYTES are
hardcoded against the stable NCCL 2.x ABI.

ncclCommInitRank takes the 128-byte ncclUniqueId by value. Per AAPCS64
rule B.4 (composites > 16 bytes are passed by hidden pointer), passing
&uid.id[0] as a uintptr is the correct calling convention on linux/arm64,
which is the only NCCL platform ztensor targets today.

CI's go vet exclude list adds /internal/nccl$ alongside the other GPU
runtime bindings that rely on unsafe.Pointer(uintptr(...)) trampolines.

Refs #78

Renames the legacy nccl.go to nccl_cgo.go and tightens its build tag to
//go:build cuda && cgo && nccl_cgo so the CGo implementation is OFF by
default. The new purego/dlopen binding in nccl_purego.go is the default
and only path on every supported platform; the CGo file is retained as a
debugging fallback.

Drops the //go:build cuda guard from nccl_test.go so the package's tests
compile on every platform. Tests that require libnccl.so.2 call a
requireNccl helper that t.Skips when Available() returns false. Adds two
new tests that exercise the pure-Go marshaling and ABI-constant paths
without touching the runtime library.

Documents the rationale for replacing the CGo nccl binding with a
runtime dlopen of libnccl.so.2, the AArch64 hidden-pointer ABI trick
that lets us pass ncclUniqueId by value through the shared cuda.Ccall
trampoline, and the consequences for build/CI/test posture.

dndungu merged commit af8af73 into main on Apr 9, 2026
1 check passed
dndungu deleted the chore/nccl-purego branch on April 9, 2026 17:58
Linked issue: Migrate internal/nccl from CGo to purego/dlopen