Port necessary FBGEMM MoE kernels #30

Open
Clydingus wants to merge 9 commits into wp-1.5 from yoink-fbgemm-kernels

@Clydingus Clydingus commented Mar 10, 2026

Fixes #24

Yoinks the two kernels we need from FBGEMM, index_shuffling and scatter_add_dense_tokens, as Triton kernels that should work on Windows even if FBGEMM isn't installed. This will bring Biome WP1.5 MoE models up to par.
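For readers unfamiliar with these two kernels, here is a minimal, dependency-free sketch of the data movement their names suggest in a top-k MoE layer: group (token, expert) assignments by expert, then scatter-add the expert outputs back into per-token rows. This is an illustration under assumptions, not the FBGEMM or Triton code from this PR. The function names `shuffle_by_expert` and `scatter_add_rows`, their signatures, and their return values are hypothetical, and the real kernels operate on GPU tensors rather than Python lists.

```python
from typing import List, Sequence, Tuple


def shuffle_by_expert(
    scores: Sequence[Sequence[float]], top_k: int
) -> Tuple[List[int], List[int], List[float]]:
    """Pick the top_k experts per token and group the assignments by expert.

    Hypothetical helper. Returns (token_counts, token_indices, gates):
      token_counts[e]   number of assignments routed to expert e
      token_indices[i]  source token of the i-th assignment, grouped by expert
      gates[i]          routing score of the i-th assignment
    """
    n_experts = len(scores[0])
    assignments = []  # (expert, token, score)
    for token, row in enumerate(scores):
        best = sorted(range(n_experts), key=lambda e: row[e], reverse=True)[:top_k]
        assignments.extend((e, token, row[e]) for e in best)
    # Sorting by expert makes each expert's tokens contiguous.
    assignments.sort(key=lambda a: (a[0], a[1]))
    token_counts = [0] * n_experts
    for expert, _, _ in assignments:
        token_counts[expert] += 1
    token_indices = [t for _, t, _ in assignments]
    gates = [s for _, _, s in assignments]
    return token_counts, token_indices, gates


def scatter_add_rows(
    out: List[List[float]],
    rows: Sequence[Sequence[float]],
    token_indices: Sequence[int],
) -> List[List[float]]:
    """Hypothetical helper: out[token_indices[i]] += rows[i], in place."""
    for row, token in zip(rows, token_indices):
        for d, value in enumerate(row):
            out[token][d] += value  # duplicates of a token accumulate
    return out


# Two tokens, three experts, top-2 routing.
scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]
token_counts, token_indices, gates = shuffle_by_expert(scores, top_k=2)

# Stand-in for the expert MLPs: each assignment yields a gate-weighted row.
expert_rows = [[g, 2.0 * g] for g in gates]
combined = scatter_add_rows([[0.0, 0.0], [0.0, 0.0]], expert_rows, token_indices)
```

With top-k routing, each token appears `top_k` times in `token_indices`, which is why the combine step has to accumulate rather than overwrite.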

The PR also includes some tests and a uv.lock file that should be removed before the final merge.

Benchmarks

Two benchmarks were run: one for the MoE module itself, and one full-model rollout of 256 frames using an MoE model with random weights. There's no comparison with the fbgemm MoE forward pass, as it has a breaking bug. The legend below explains the columns of the MoE implementation matrix:

self_max / self_mean: max / mean absolute difference between two runs of the same implementation on the same inputs (nondeterminism check)
vs_base_max: max absolute difference between that implementation's eager output and the eager baseline output
vs_base_mean: mean absolute difference between that implementation's eager output and the eager baseline output
comp_vs_base_max / comp_vs_base_mean: as above, but for the compiled version of the implementation
comp_vs_eager_max / comp_vs_eager_mean: max / mean absolute difference between the eager and compiled versions of the same implementation
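As a rough sketch of how difference columns like these can be computed, the snippet below uses the column names from the tables that follow. It is not the benchmark script from this PR: `diff_metrics` and its arguments are hypothetical, and outputs are treated as flat lists of floats to keep the example dependency-free.

```python
from typing import Callable, Dict, List, Sequence

Output = List[float]
Module = Callable[[Sequence[float]], Output]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference."""
    return max(abs(x - y) for x, y in zip(a, b))


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Average element-wise absolute difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def diff_metrics(
    impl_eager: Module,
    impl_compiled: Module,
    baseline_eager: Module,
    inputs: Sequence[float],
) -> Dict[str, float]:
    """Hypothetical helper: fill in the difference columns for one implementation."""
    first = impl_eager(inputs)
    second = impl_eager(inputs)  # same inputs again, to detect nondeterminism
    compiled = impl_compiled(inputs)
    base = baseline_eager(inputs)
    return {
        "self_max": max_abs_diff(first, second),
        "self_mean": mean_abs_diff(first, second),
        "vs_base_max": max_abs_diff(first, base),
        "vs_base_mean": mean_abs_diff(first, base),
        "comp_vs_base_max": max_abs_diff(compiled, base),
        "comp_vs_base_mean": mean_abs_diff(compiled, base),
        "comp_vs_eager_max": max_abs_diff(compiled, first),
        "comp_vs_eager_mean": mean_abs_diff(compiled, first),
    }


# Toy stand-ins for the real modules, chosen so the differences are easy to read.
metrics = diff_metrics(
    impl_eager=lambda xs: [2.0 * x for x in xs],
    impl_compiled=lambda xs: [2.0 * x + 0.5 for x in xs],
    baseline_eager=lambda xs: [2.0 * x + (1.0 if i == 0 else 0.0) for i, x in enumerate(xs)],
    inputs=[1.0, 2.0],
)
```

A large gap between the max and mean columns points to a few outlier elements, while both being large points to a systematic mismatch.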
4090 benchmarks
moe implementation matrix
impl          eager_ms   compiled_ms     speedup    self_max    self_mean   vs_base_max   vs_base_mean  comp_vs_base_max  comp_vs_base_mean  comp_vs_eager_max  comp_vs_eager_mean              status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fbgemm          1.8661        0.0430     43.358x    0.000000     0.000000     16.000000       0.886417       2608.000000         289.523926        2608.000000          289.512695                  ok
custom_bf16      1.7449        1.6280      1.072x    0.000000     0.000000     16.000000       0.813475         16.000000           0.813475           0.000000            0.000000                  ok
custom_fp32      2.7033        2.5997      1.040x    0.250000     0.000000     16.000000       0.146415         16.000000           0.483098          16.000000            0.458849                  ok
baseline        5.0534           n/a         n/a    0.000000     0.000000      0.000000       0.000000               n/a                n/a                n/a                 n/a  expected:_slow_grouped_mm_not_compileable

256 frame rollout in ~8 seconds, about 32 fps on a 4090.

---------------------------------------------------------------------------------------------------------------------------------------------- benchmark: 1 tests ----------------------------------------------------------------------------------------------------------------------------------------------
Name (time in s)                                                                                                                                                                                                                               Median     Max    Mean  StdDev  max_vram_alloc  max_vram_reserved
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=16,moe_top_k=8,shared_frame_experts=False,n_layers=8-256-True] | params=703,359,112 | active=434,923,656     7.9852  7.9929  7.9816  0.0106            2.04               2.62
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
5090 benchmarks (Linux, @philpax)
moe implementation matrix
impl          eager_ms   compiled_ms     speedup    self_max    self_mean   vs_base_max   vs_base_mean  comp_vs_base_max  comp_vs_base_mean  comp_vs_eager_max  comp_vs_eager_mean              status
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
fbgemm          1.3057        0.0377     34.638x    0.000061     0.000000     12.785400       0.900568       2703.395508         287.864868        2705.968750          287.855652                  ok
custom_bf16      1.0459        0.9433      1.109x    0.000061     0.000000     12.347900       0.810782         12.347900           0.810782           0.000061            0.000000                  ok
custom_fp32      1.6152        1.6044      1.007x    0.000244     0.000001      1.421631       0.145962          8.163818           0.483711           7.636108            0.459396                  ok
baseline        3.4143           n/a         n/a    0.000000     0.000000      0.000000       0.000000               n/a                n/a                n/a                 n/a  expected:_slow_grouped_mm_not_compileable

256 frame rollout in ~7 seconds, about 36 fps on a 5090.

---------------------------------------------------------------------------------------------------------------------------------------------- benchmark: 1 tests ----------------------------------------------------------------------------------------------------------------------------------------------
Name (time in s)                                                                                                                                                                                                                               Median     Max    Mean  StdDev  max_vram_alloc  max_vram_reserved
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_ar_rollout[taehv_ae=True,ae_uri=Overworld-Models/taehv1_5,gated_linear=False,n_kv_heads=16,n_heads=32,moe=True,moe_n_experts=16,moe_top_k=8,shared_frame_experts=False,n_layers=8-256-True] | params=703,359,112 | active=434,923,656     7.1618  7.3075  7.0543  0.2926            2.04               2.64
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
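The quoted frame rates follow from dividing the 256 frames by the rollout time. A quick check using the median times and parameter counts from the two benchmark rows above (variable names are illustrative only):

```python
FRAMES = 256

# Median rollout times in seconds, copied from the benchmark rows.
median_s = {"4090": 7.9852, "5090": 7.1618}
fps = {gpu: FRAMES / seconds for gpu, seconds in median_s.items()}

# Parameter counts from the test ID: total vs. active.
total_params = 703_359_112
active_params = 434_923_656
active_fraction = active_params / total_params

for gpu, value in fps.items():
    print(f"{gpu}: {value:.1f} fps")
print(f"active parameter fraction: {active_fraction:.1%}")
```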

@Clydingus Clydingus marked this pull request as ready for review March 16, 2026 14:51
@Clydingus Clydingus requested a review from lapp0 March 16, 2026 14:51