Port necessary FBGEMM MoE kernels by Clydingus · Pull Request #30 · Overworldai/world_engine

Clydingus · 2026-03-10T19:17:04Z

Fixes #24

Yoinks the two kernels we need from FBGEMM, index_shuffling and scatter_add_dense_tokens, as Triton kernels that should work on Windows even when fbgemm isn't installed, which will bring Biome WP1.5 MoE models up to par.
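For readers unfamiliar with the two kernels, here is a rough plain-Python sketch of the routing pattern they are typically used for. It assumes that index_shuffling groups token-to-expert assignments by expert and that scatter_add_dense_tokens accumulates expert outputs back into their source token rows. The function names, argument layouts, and return values below are hypothetical and do not mirror the FBGEMM or PR signatures; the real kernels operate on GPU tensors.

```python
from typing import List, Tuple


def shuffle_by_expert(
    expert_ids: List[List[int]], n_experts: int
) -> Tuple[List[int], List[int]]:
    """Group routed (expert, token) pairs by expert (hypothetical helper).

    expert_ids[t] holds the top-k expert ids chosen for token t.
    Returns per-expert token counts and the source token of each routed
    copy, ordered so that every expert sees one contiguous block.
    """
    pairs = [(e, t) for t, row in enumerate(expert_ids) for e in row]
    pairs.sort(key=lambda p: p[0])  # stable sort keeps token order within an expert
    counts = [0] * n_experts
    for e, _ in pairs:
        counts[e] += 1
    token_idx = [t for _, t in pairs]
    return counts, token_idx


def scatter_add_rows(out, src, token_idx):
    """out[token_idx[i]] += src[i], row by row (hypothetical helper)."""
    for row, t in zip(src, token_idx):
        out[t] = [a + b for a, b in zip(out[t], row)]
    return out


# Three tokens, top-2 routing over three experts.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_ids = [[2, 0], [0, 1], [1, 2]]

counts, token_idx = shuffle_by_expert(expert_ids, n_experts=3)
expert_inputs = [x[t] for t in token_idx]  # gather rows in expert order
expert_outputs = expert_inputs             # identity "experts" keep the example checkable
y = scatter_add_rows([[0.0, 0.0] for _ in x], expert_outputs, token_idx)
```

With identity experts and no routing weights, each token row comes back multiplied by top_k, which makes the round trip easy to check by hand.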

This PR comes with some tests and a uv.lock file, which should be removed before the final merge.

Benchmarks

Two benchmarks are run: one for the MoE module itself, and one full MoE model rollout with random weights for 256 frames. There's no comparison with the fbgemm MoE forward pass, as that one has a breaking bug. Legend for the MoE implementation matrix benchmark:

self_max / self_mean: max and mean absolute difference between two runs of the same implementation on the same inputs (nondeterminism check)
vs_base_max: max absolute difference between that implementation's eager output and the eager baseline output
vs_base_mean: mean absolute difference between that implementation's eager output and the eager baseline output
comp_vs_base_max / comp_vs_base_mean: as above, but for the compiled version of the implementation
comp_vs_eager_max / comp_vs_eager_mean: max and mean absolute difference between the eager and compiled versions of the same implementation
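As a rough illustration of how these difference columns can be computed, here is a minimal sketch in plain Python, using the column names from the benchmark table. The helper names and the flat-list inputs are assumptions made for the example; the PR's benchmark presumably works on tensors and may differ in detail.

```python
from typing import Dict, Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference between two outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average elementwise absolute difference between two outputs."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diff_report(
    impl_eager: Sequence[float],
    impl_eager_rerun: Sequence[float],
    impl_compiled: Sequence[float],
    base_eager: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical helper that fills one row of the difference columns."""
    return {
        # Same implementation, same inputs, run twice.
        "self_max": max_abs_diff(impl_eager, impl_eager_rerun),
        "self_mean": mean_abs_diff(impl_eager, impl_eager_rerun),
        # Eager implementation against the eager baseline.
        "vs_base_max": max_abs_diff(impl_eager, base_eager),
        "vs_base_mean": mean_abs_diff(impl_eager, base_eager),
        # Compiled implementation against the eager baseline.
        "comp_vs_base_max": max_abs_diff(impl_compiled, base_eager),
        "comp_vs_base_mean": mean_abs_diff(impl_compiled, base_eager),
        # Compiled against eager for the same implementation.
        "comp_vs_eager_max": max_abs_diff(impl_compiled, impl_eager),
        "comp_vs_eager_mean": mean_abs_diff(impl_compiled, impl_eager),
    }


report = diff_report(
    impl_eager=[1.0, 2.0, 4.0],
    impl_eager_rerun=[1.0, 2.0, 4.0],
    impl_compiled=[1.0, 2.5, 4.0],
    base_eager=[1.0, 2.0, 3.0],
)
```

A nonzero self_max points at run-to-run nondeterminism in the implementation itself, while large comp_vs_eager values with small vs_base values suggest a problem introduced by compilation.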

4090 benchmarks

moe implementation matrix
impl          eager_ms   compiled_ms     speedup    self_max    self_mean   vs_base_max   vs_base_mean  comp_vs_base_max  comp_vs_base_mean  comp_vs_eager_max  comp_vs_eager_mean              status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fbgemm          1.8661        0.0430     43.358x    0.000000     0.000000     16.000000       0.886417       2608.000000         289.523926        2608.000000          289.512695                  ok
custom_bf16      1.7449        1.6280      1.072x    0.000000     0.000000     16.000000       0.813475         16.000000           0.813475           0.000000            0.000000                  ok
custom_fp32      2.7033        2.5997      1.040x    0.250000     0.000000     16.000000       0.146415         16.000000           0.483098          16.000000            0.458849                  ok
baseline        5.0534           n/a         n/a    0.000000     0.000000      0.000000       0.000000               n/a                n/a                n/a                 n/a  expected:_slow_grouped_mm_not_compileable
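The speedup column is not described in the legend. The small check below assumes it is simply eager_ms divided by compiled_ms and recomputes it from the 4090 rows above; the helper name is hypothetical. Because the printed timings are rounded to four decimals, the recomputed fbgemm value differs slightly from the printed 43.358x.

```python
def speedup(eager_ms: float, compiled_ms: float) -> float:
    """Assumed definition: how many times faster the compiled run is."""
    return eager_ms / compiled_ms


# (eager_ms, compiled_ms) copied from the 4090 matrix above.
rows_4090 = {
    "fbgemm": (1.8661, 0.0430),
    "custom_bf16": (1.7449, 1.6280),
    "custom_fp32": (2.7033, 2.5997),
}

recomputed = {name: speedup(e, c) for name, (e, c) in rows_4090.items()}
for name, value in recomputed.items():
    print(f"{name:<12} {value:.3f}x")
```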

256-frame rollout in ~8 seconds, about 32 fps on a 4090.

---------------------------------------------------------------------------------------------------------------------------------------------- benchmark: 1 tests ----------------------------------------------------------------------------------------------------------------------------------------------
Name (time in s)                                                                                                                                                                                                                               Median     Max    Mean  StdDev  max_vram_alloc  max_vram_reserved
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=16,moe_top_k=8,shared_frame_experts=False,n_layers=8-256-True] | params=703,359,112 | active=434,923,656     7.9852  7.9929  7.9816  0.0106            2.04               2.62
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

5090 benchmarks (Linux, @philpax)

moe implementation matrix
impl          eager_ms   compiled_ms     speedup    self_max    self_mean   vs_base_max   vs_base_mean  comp_vs_base_max  comp_vs_base_mean  comp_vs_eager_max  comp_vs_eager_mean              status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fbgemm          1.3057        0.0377     34.638x    0.000061     0.000000     12.785400       0.900568       2703.395508         287.864868        2705.968750          287.855652                  ok
custom_bf16      1.0459        0.9433      1.109x    0.000061     0.000000     12.347900       0.810782         12.347900           0.810782           0.000061            0.000000                  ok
custom_fp32      1.6152        1.6044      1.007x    0.000244     0.000001      1.421631       0.145962          8.163818           0.483711           7.636108            0.459396                  ok
baseline        3.4143           n/a         n/a    0.000000     0.000000      0.000000       0.000000               n/a                n/a                n/a                 n/a  expected:_slow_grouped_mm_not_compileable

256-frame rollout in ~7 seconds, about 36 fps on a 5090.
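The fps figures follow from dividing the frame count by the rollout time. A quick sketch, assuming the median time reported by the rollout benchmark covers the whole 256-frame rollout (the function name is made up for the example):

```python
def frames_per_second(n_frames: int, seconds: float) -> float:
    """Average throughput over a rollout."""
    return n_frames / seconds


# Median rollout times in seconds, taken from the two benchmark tables.
fps_4090 = frames_per_second(256, 7.9852)
fps_5090 = frames_per_second(256, 7.1618)

print(f"4090: {fps_4090:.1f} fps, 5090: {fps_5090:.1f} fps")
```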

---------------------------------------------------------------------------------------------------------------------------------------------- benchmark: 1 tests ----------------------------------------------------------------------------------------------------------------------------------------------
Name (time in s)                                                                                                                                                                                                                               Median     Max    Mean  StdDev  max_vram_alloc  max_vram_reserved
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=16,moe_top_k=8,shared_frame_experts=False,n_layers=8-256-True] | params=703,359,112 | active=434,923,656     7.1618  7.3075  7.0543  0.2926            2.04               2.64
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Clydingus added 9 commits March 11, 2026 02:32

bingus kernels

5baf6d8

feat: grouped_gemm kernels

e2dcf2e

add: tests

aea71a2

fix: ported kernels compile efficiency

88e9cf4

tests: temp push for benchmarking

3e43b7c

kernels go brrr

a5c89ad

fix: removes HAS_FBGEMM export

fa3b1d5

fix: output dtype cast

2fc5685

rm: has_fbgemm in moe.py

a76c2d9

Clydingus marked this pull request as ready for review March 16, 2026 14:51

Clydingus requested a review from lapp0 March 16, 2026 14:51

Clydingus wants to merge 9 commits into wp-1.5 from yoink-fbgemm-kernels