
Fused all-gather matmul on multi GPUs with PyTorch Symmetric Memory#62

Closed
kwen2501 wants to merge 1 commit into NVIDIA:main from kwen2501:ag_matmul

Conversation

Contributor

@kwen2501 kwen2501 commented Jan 20, 2026

Description

Each rank gathers inputs from all peer GPUs and performs a matrix multiplication with its local weight.

Peer inputs are made visible via PyTorch Symmetric Memory, i.e.

import torch.distributed._symmetric_memory as symm_mem

# Allocate a tensor whose storage can be mapped and read directly by peer GPUs.
symm_mem.empty(...)

The fused kernel is equivalent to:

# Gather every rank's input into ag_out, then multiply by the local weight.
dist.all_gather_into_tensor(ag_out, inp, group)
out = ag_out @ w

The fusion overlaps communication and computation at a fine granularity.
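For reference, here is a minimal sketch of the unfused path on each rank, assuming a per-rank input of shape (m_per_rank, k) and a local weight of shape (k, n). The shape names and the rendezvous call are assumptions for illustration and are not taken from this PR's kernel:

import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

def reference_ag_matmul(inp, w, group):
    # Unfused reference: gather all per-rank inputs, then do one local matmul.
    world_size = dist.get_world_size(group)
    ag_out = torch.empty(world_size * inp.shape[0], inp.shape[1],
                         dtype=inp.dtype, device=inp.device)
    dist.all_gather_into_tensor(ag_out, inp, group=group)
    return ag_out @ w

# Example setup on each rank (assumed shapes and dtype):
# inp = symm_mem.empty(m_per_rank, k, dtype=torch.bfloat16, device="cuda")
# symm_mem.rendezvous(inp, group=dist.group.WORLD)  # assumed call to establish peer mappings
# w = torch.randn(k, n, dtype=torch.bfloat16, device="cuda")
# out_ref = reference_ag_matmul(inp, w, dist.group.WORLD)

The fused kernel is expected to produce the same result as reference_ag_matmul while hiding the gather latency behind the matmul.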

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Argparser

Add test

Signed-off-by: Ke Wen <kwen@nvidia.com>
Collaborator

haijieg commented Feb 13, 2026

Integrated.

@haijieg haijieg closed this Feb 13, 2026
