[tx] Cuda tile for expert parallelism #880

agolajko · 2026-01-15T00:19:22Z

Draft PR for #862.

Replaces the JAX ragged_dot with a cuTile (CUDA Tile) implementation.
Inspired by https://github.com/NVIDIA/cutile-python/blob/main/samples/MoE.py
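For readers unfamiliar with the operation being replaced, here is a minimal pure-Python sketch of what a ragged (grouped) dot computes: rows of `lhs` are sorted by expert, `group_sizes[e]` says how many consecutive rows belong to expert `e`, and each row is multiplied by that expert's weight matrix. The function name `ragged_dot_reference` is hypothetical and is not code from this PR. The sketch also assumes the group sizes sum to the number of rows, which is a simplification.

```python
def ragged_dot_reference(lhs, rhs, group_sizes):
    """Reference semantics of a grouped matmul.

    lhs:         m rows, each a list of k numbers (tokens sorted by expert)
    rhs:         one k x n weight matrix (list of lists) per expert
    group_sizes: number of consecutive lhs rows owned by each expert
    Returns m rows of n numbers.
    """
    if sum(group_sizes) != len(lhs):
        # Simplifying assumption of this sketch: every row belongs to an expert.
        raise ValueError("group_sizes must sum to the number of lhs rows")
    out = []
    row = 0
    for expert, size in enumerate(group_sizes):
        weight = rhs[expert]
        n = len(weight[0])
        for _ in range(size):
            x = lhs[row]
            # Plain dot product of one token with each output column.
            out.append(
                [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(n)]
            )
            row += 1
    return out


# Three tokens, two experts: expert 0 is the identity, expert 1 doubles.
tokens = [[1, 2], [3, 4], [5, 6]]
weights = [
    [[1, 0], [0, 1]],
    [[2, 0], [0, 2]],
]
result = ragged_dot_reference(tokens, weights, [1, 2])
print(result)
```

The first token goes through expert 0 and the remaining two through expert 1; a tiled kernel has to produce the same result while handling groups whose sizes are not multiples of the tile height.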

Benchmark of the cuTile kernel against the existing ragged_dot implementation, run on an RTX Pro 6000 via Runpod:


================================================================================
                      CUTILE vs RAGGED_DOT BENCHMARK SUITE                      
================================================================================

Small (original)
  Config: 1024 tokens × 512 hidden → 512 out, 16 experts
  ----------------------------------------------------------------------------

Benchmark Results:
  ragged_dot: 1.739 ms
  cutile:     1.721 ms
  Speedup:    1.01x

Medium (Qwen-0.6B scale)
  Config: 2048 tokens × 1024 hidden → 1024 out, 16 experts
  ----------------------------------------------------------------------------

Benchmark Results:
  ragged_dot: 3.222 ms
  cutile:     3.506 ms
  Speedup:    0.92x

Large (Qwen2.5-1.5B scale)
  Config: 4096 tokens × 1536 hidden → 1536 out, 32 experts
  ----------------------------------------------------------------------------

Benchmark Results:
  ragged_dot: 9.152 ms
  cutile:     9.142 ms
  Speedup:    1.00x

Large+ (2B scale)
  Config: 4096 tokens × 2048 hidden → 2048 out, 32 experts
  ----------------------------------------------------------------------------

Benchmark Results:
  ragged_dot: 14.459 ms
  cutile:     14.464 ms
  Speedup:    1.00x

XLarge (Llama 3 8B scale)
  Config: 8192 tokens × 4096 hidden → 4096 out, 64 experts
  ----------------------------------------------------------------------------

Benchmark Results:
  ragged_dot: 94.049 ms
  cutile:     92.585 ms
  Speedup:    1.02x
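As a quick sanity check on the table above, the speedup column can be recomputed from the reported timings as `ragged_dot / cutile`, so values above 1.00x mean cuTile is faster. The snippet below is only an illustrative recomputation from the printed numbers; the dictionary and function names are hypothetical and not part of the benchmark script.

```python
# (ragged_dot ms, cutile ms) copied from the benchmark output above.
results_ms = {
    "Small": (1.739, 1.721),
    "Medium": (3.222, 3.506),
    "Large": (9.152, 9.142),
    "Large+": (14.459, 14.464),
    "XLarge": (94.049, 92.585),
}


def speedup(baseline_ms, candidate_ms):
    """Speedup of the candidate over the baseline (> 1 means candidate is faster)."""
    return baseline_ms / candidate_ms


formatted = {
    name: f"{speedup(ragged, cutile):.2f}x"
    for name, (ragged, cutile) in results_ms.items()
}
for name, value in formatted.items():
    print(f"{name:8s} {value}")
```

At these sizes the two implementations are within about 2% of each other, except for the Medium config, where cuTile is roughly 8% slower.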

Results of time_cutile_parts.py, which breaks down the time spent in each stage:

TX_USE_CUTILE_LORA=1 uv run tests/cutile/time_cutile_parts.py
Config: m=2048, d=1024, out=1024, E=16, dtype=torch.float16
TILE_M/N/K = 128/128/64
rhs contiguous=True stride=(1048576, 1024, 1)

=== CUDA-event timing breakdown ===
pad_groups:    0.199 ms
cutile_launch: 0.068 ms
combined:      0.272 ms
pad fraction:  73.3%
launch frac:   25.0%
(pad+launch):  0.267 ms (rough expected)
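The breakdown shows padding dominating the kernel launch. The internals of `pad_groups` are not shown here, so the following is only a sketch under the assumption that it rounds each expert's token group up to a multiple of `TILE_M`, so that every row tile belongs to exactly one expert. The helper names `pad_group_layout` and `tile_to_expert` are hypothetical.

```python
def pad_group_layout(group_sizes, tile_m=128):
    """Round each group up to a multiple of tile_m.

    Returns (padded_sizes, padded_offsets, total_padded_rows), where
    padded_offsets[e] is the first row of expert e in the padded buffer.
    """
    # Ceiling division written with floor division on the negated value.
    padded_sizes = [-(-g // tile_m) * tile_m for g in group_sizes]
    padded_offsets = []
    running = 0
    for size in padded_sizes:
        padded_offsets.append(running)
        running += size
    return padded_sizes, padded_offsets, running


def tile_to_expert(padded_sizes, tile_m=128):
    """Map each row tile of the padded buffer to the expert that owns it."""
    owners = []
    for expert, size in enumerate(padded_sizes):
        owners.extend([expert] * (size // tile_m))
    return owners


# Assumed example: four experts with uneven loads, one of them empty.
sizes, offsets, total = pad_group_layout([100, 128, 129, 0], tile_m=128)
owners = tile_to_expert(sizes, tile_m=128)
print(sizes, offsets, total, owners)
```

Under this scheme the kernel never needs a tile that straddles two experts, at the cost of copying tokens into the padded layout, which is consistent with the padding step showing up as the larger share of the measured time.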

Todo:

Multi-GPU support
Backward pass
More tests
Profiling

agolajko added 17 commits January 14, 2026 14:40

forward pass and tests

248a36d

updated pyproject

29b2408

updated jax to torch

0a95ea7

dlpack update

e7e4a70

cutile config update

df7cb3c

group size

7b0cbcc

added axis

0728190

indexing tile

030811a

dtype

29a9b17

load index replaced

4ed3c28

fixed broadcast

2646fb4

fix testing

b7ed4eb

fix more tests

b605e1b

moved tests

0c6bc10

updated tx tests readme

30d9e94

removed test

d09ae4c

updated tests readme

eb2c716

pcmoritz added the tx label Jan 15, 2026

agolajko added 10 commits January 15, 2026 10:49

updated benchmark test

948fd53

updated benchmark dimensions

13adecc

updated cutile ragged_dot

b897925

more cutile opts

540e03a

more cutile opts fix 1

0865377

more cutile opts fix 2

1694bf6

padding timing

da97f6e

padding timing 2

774f575

padding timing 3

57de4ae

padding opt

c8becce
