fix(actor_group): support cu13 TMS preload and enable NCCL CUMEM by d…#307

Open
aoshen02 wants to merge 2 commits into main from fix/cu13-tms-nccl-cumem
@aoshen02 aoshen02 commented Jul 1, 2026

Collaborator

…efault

  • Add torch_memory_saver_hook_mode_preload_cu13.abi3.so to the TMS dynamic library search list. Without this, cu13 (CUDA 13) containers fail to find the preload hook and TMS memory management is disabled.

  • Change NCCL_CUMEM_ENABLE default from "0" to "1". GB300 (sm103) requires CUMEM for NVLink/NVLS transports; disabling it causes NCCL init failures on Blackwell GPUs.
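The two changes above can be sketched roughly as follows. This is an illustration only, not the code in vime/ray/actor_group.py: the helper names `find_tms_preload` and `build_actor_env`, the unsuffixed library name, and the use of `LD_PRELOAD` to load the hook are assumptions. Only the cu13 file name and the `NCCL_CUMEM_ENABLE` default come from this PR.

```python
import os
from pathlib import Path

# Candidate preload hook file names, tried in order.
# Only the cu13 name is taken from the PR; the first entry is a
# hypothetical stand-in for whatever names the list already held.
TMS_PRELOAD_CANDIDATES = [
    "torch_memory_saver_hook_mode_preload.abi3.so",       # hypothetical
    "torch_memory_saver_hook_mode_preload_cu13.abi3.so",  # added by this PR
]


def find_tms_preload(search_dirs, candidates=TMS_PRELOAD_CANDIDATES):
    """Return the first existing candidate library path, or None."""
    for directory in search_dirs:
        for name in candidates:
            path = Path(directory) / name
            if path.is_file():
                return str(path)
    return None


def build_actor_env(search_dirs, base_env=None):
    """Build an environment dict for an actor process (hypothetical helper).

    NCCL_CUMEM_ENABLE defaults to "1" but an explicit value in the base
    environment is left untouched. Loading the hook through LD_PRELOAD
    is an assumption made for this sketch.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.setdefault("NCCL_CUMEM_ENABLE", "1")

    lib = find_tms_preload(search_dirs)
    if lib is not None:
        existing = env.get("LD_PRELOAD")
        env["LD_PRELOAD"] = f"{lib}:{existing}" if existing else lib
    return env
```

Because the value is only a default, a deployment that still needs the old behavior could export `NCCL_CUMEM_ENABLE=0` before launch and the sketch would leave it alone.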

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates vime/ray/actor_group.py by changing the default value of the NCCL_CUMEM_ENABLE environment variable from '0' to '1'. Additionally, it adds support for CUDA 13 by including 'torch_memory_saver_hook_mode_preload_cu13.abi3.so' in the list of preloaded shared library paths for torch_memory_saver. As there are no review comments, I have no feedback to provide.

read-the-docs-community Bot commented Jul 1, 2026

CalvinXKY previously approved these changes Jul 1, 2026
aoshen02 and others added 2 commits July 2, 2026 07:43
…efault

- Add torch_memory_saver_hook_mode_preload_cu13.abi3.so to the TMS
  dynamic library search list. Without this, cu13 (CUDA 13) containers
  fail to find the preload hook and TMS memory management is disabled.

- Change NCCL_CUMEM_ENABLE default from "0" to "1". GB300 (sm103)
  requires CUMEM for NVLink/NVLS transports; disabling it causes NCCL
  init failures on Blackwell GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…onfig

Docker:
- Dockerfile: add cu13 (CUDA 13) build target alongside cu12, with
  sm_100f family-compatible kernels for GB300 (sm103)
- justfile: add cu13 build/push targets
- Remove obsolete vllm.patch (fixes merged upstream)

Scripts:
- run-glm5.2-744B-A40B.sh: update parallel config for 64 GPU GB300
  (PP=4, TP=8, CP=2, EP=16), DSA layer split (first=18, mid=20,
  last=20), workload sizing (rollout-batch-size=8, n-samples=8,
  global-batch-size=64, max-tokens-per-gpu=65536 for 128K support),
  unique log filenames with run ID

Docs:
- Update GLM-5.2 744B example for GB300 configuration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
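The sizing numbers in the commit message above can be sanity-checked with a little arithmetic. The sketch below assumes the common Megatron-style convention that world size equals PP × TP × CP × DP, and that each generated sample counts as one item of the global batch. Neither assumption is stated in the PR, and the function names are hypothetical. EP=16 and the DSA layer split are not checked because their constraints depend on the framework.

```python
def data_parallel_size(world_size, pp, tp, cp):
    """Data-parallel degree implied by the other parallel sizes.

    Assumes world_size = pp * tp * cp * dp (an assumption, see lead-in).
    """
    model_parallel = pp * tp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"pp*tp*cp={model_parallel}"
        )
    return world_size // model_parallel


def samples_per_rollout(rollout_batch_size, n_samples):
    """Number of samples produced by one rollout step."""
    return rollout_batch_size * n_samples


# Values from the commit message: 64 GPUs, PP=4, TP=8, CP=2.
dp = data_parallel_size(64, pp=4, tp=8, cp=2)
print(dp)  # 1

# rollout-batch-size=8, n-samples=8, global-batch-size=64.
samples = samples_per_rollout(rollout_batch_size=8, n_samples=8)
print(samples)  # 64
```

Under these assumptions the 64 GPUs are fully consumed by model parallelism (DP = 1), and one rollout of 8 prompts × 8 samples fills exactly one global batch of 64.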
@aoshen02 aoshen02 force-pushed the fix/cu13-tms-nccl-cumem branch from 95f0602 to 339f599 Compare July 2, 2026 07:44
aoshen02 added a commit to aoshen02/vime that referenced this pull request Jul 3, 2026
Port the cu13 build support from vllm-project#307: ENABLE_CUDA_13 branches the apt
dev headers, cublas header, TransformerEngine (source-built for cu13),
TMS_CUDA_MAJOR auto-detect, and the cudnn pin. justfile gains a
build-cu13 target and a VARIANT-prefixed manifest. actor_group preloads
the cu13 TMS .so.

Also switch the vLLM patch apply to --allow-empty so the build survives
once the patch is emptied upstream.

Excludes vllm-project#307's NCCL_CUMEM_ENABLE default flip (0->1) and the glm5.2
scripts by request.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>