fix(actor_group): support cu13 TMS preload and enable NCCL CUMEM by d… by aoshen02 · Pull Request #307 · vllm-project/vime

aoshen02 · 2026-07-01T11:08:32Z

…efault

- Add torch_memory_saver_hook_mode_preload_cu13.abi3.so to the TMS dynamic library search list. Without this, cu13 (CUDA 13) containers fail to find the preload hook and TMS memory management is disabled.
- Change NCCL_CUMEM_ENABLE default from "0" to "1". GB300 (sm103) requires CUMEM for NVLink/NVLS transports; disabling it causes NCCL init failures on Blackwell GPUs.
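
The search-list change in the first item can be illustrated with a minimal sketch. This is not the code in vime/ray/actor_group.py: the helper name `find_tms_preload_lib`, the idea of passing search directories explicitly, and the non-cu13 file name are assumptions made for illustration. Only the cu13 file name comes from this PR.

```python
from pathlib import Path
from typing import Iterable, Optional

# Only the cu13 name is taken from this PR. The first entry is an assumed
# stand-in for whatever names the search list already contained.
CANDIDATE_LIBS = (
    "torch_memory_saver_hook_mode_preload.abi3.so",       # assumed existing entry
    "torch_memory_saver_hook_mode_preload_cu13.abi3.so",  # added by this PR
)


def find_tms_preload_lib(
    search_dirs: Iterable[Path],
    names: Iterable[str] = CANDIDATE_LIBS,
) -> Optional[Path]:
    """Return the first candidate library file that exists, or None."""
    names = tuple(names)  # allow re-iteration for every directory
    for directory in search_dirs:
        for name in names:
            candidate = Path(directory) / name
            if candidate.is_file():
                return candidate
    # No candidate found: the caller would have to run without the preload hook.
    return None
```

With a list that lacks the cu13 name, a container that ships only the cu13 build of the hook makes such a lookup return `None`, which matches the failure mode described above.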

gemini-code-assist

Code Review

This pull request updates vime/ray/actor_group.py by changing the default value of the NCCL_CUMEM_ENABLE environment variable from '0' to '1'. Additionally, it adds support for CUDA 13 by including 'torch_memory_saver_hook_mode_preload_cu13.abi3.so' in the list of preloaded shared library paths for torch_memory_saver. As there are no review comments, I have no feedback to provide.
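
One way to apply an environment-variable default like the one summarized above is sketched below. It assumes an explicit user setting should win over the default. Whether vime/ray/actor_group.py behaves this way is not shown on this page, and the function name `build_nccl_env` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def build_nccl_env(
    base_env: Optional[Mapping[str, str]] = None,
    default_cumem: str = "1",
) -> Dict[str, str]:
    """Copy the environment and fill in NCCL_CUMEM_ENABLE only if it is unset."""
    env = dict(os.environ if base_env is None else base_env)
    # setdefault keeps a value the user exported explicitly (assumed policy).
    env.setdefault("NCCL_CUMEM_ENABLE", default_cumem)
    return env
```

Under this policy, a user on hardware where CUMEM causes problems could still export `NCCL_CUMEM_ENABLE=0` before launching.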

read-the-docs-community · 2026-07-01T11:09:50Z

Documentation build overview

📚 vime | 🛠️ Build #33406240 | 📁 Comparing 339f599 against latest (d679c75)

🔍 Preview build

41 files changed · ± 41 modified

…efault

- Add torch_memory_saver_hook_mode_preload_cu13.abi3.so to the TMS dynamic library search list. Without this, cu13 (CUDA 13) containers fail to find the preload hook and TMS memory management is disabled.
- Change NCCL_CUMEM_ENABLE default from "0" to "1". GB300 (sm103) requires CUMEM for NVLink/NVLS transports; disabling it causes NCCL init failures on Blackwell GPUs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…onfig

Docker:
- Dockerfile: add cu13 (CUDA 13) build target alongside cu12, with sm_100f family-compatible kernels for GB300 (sm103)
- justfile: add cu13 build/push targets
- Remove obsolete vllm.patch (fixes merged upstream)

Scripts:
- run-glm5.2-744B-A40B.sh: update parallel config for 64 GPU GB300 (PP=4, TP=8, CP=2, EP=16), DSA layer split (first=18, mid=20, last=20), workload sizing (rollout-batch-size=8, n-samples=8, global-batch-size=64, max-tokens-per-gpu=65536 for 128K support), unique log filenames with run ID

Docs:
- Update GLM-5.2 744B example for GB300 configuration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
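
The numbers in the Scripts entry can be sanity-checked with a little arithmetic. The sketch below assumes a Megatron-style layout in which the world size factors as TP × PP × CP × DP. That the script's backend uses this decomposition is an assumption, and the helper name `data_parallel_size` is hypothetical.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Derive the data-parallel degree left over after TP, PP and CP are fixed."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# Values quoted in the commit message.
WORLD_SIZE, PP, TP, CP, EP = 64, 4, 8, 2, 16

dp = data_parallel_size(WORLD_SIZE, tp=TP, pp=PP, cp=CP)  # 64 // (8 * 4 * 2) = 1
gpus_per_stage = WORLD_SIZE // PP                         # 64 // 4 = 16, same as EP

# rollout-batch-size * n-samples matches the quoted global-batch-size.
samples_per_rollout = 8 * 8                               # 64
```

Under that assumption the three model-parallel degrees already account for all 64 GPUs, so the data-parallel degree is 1, and the 16 GPUs in each pipeline stage match EP=16. Whether the framework requires that match is not stated in the source.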

Port the cu13 build support from vllm-project#307: ENABLE_CUDA_13 branches the apt dev headers, cublas header, TransformerEngine (source-built for cu13), TMS_CUDA_MAJOR auto-detect, and the cudnn pin. justfile gains a build-cu13 target and a VARIANT-prefixed manifest. actor_group preloads the cu13 TMS .so. Also switch the vLLM patch apply to --allow-empty so the build survives once the patch is emptied upstream.

Excludes vllm-project#307's NCCL_CUMEM_ENABLE default flip (0->1) and the glm5.2 scripts by request.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: aoshen02 <aoshen@inferact.ai>

gemini-code-assist Bot reviewed Jul 1, 2026

CalvinXKY previously approved these changes Jul 1, 2026

aoshen02 dismissed CalvinXKY’s stale review via 95f0602 July 1, 2026 11:48

aoshen02 and others added 2 commits July 2, 2026 07:43

aoshen02 force-pushed the fix/cu13-tms-nccl-cumem branch from 95f0602 to 339f599 July 2, 2026 07:44

aoshen02 wants to merge 2 commits into main from fix/cu13-tms-nccl-cumem

2 participants
