
[BENCH] Add SwiGLU, Cross Entropy, and Fused Linear JSD benchmarks#308

Merged
mark14wu merged 14 commits into main from add-new-perf-benchmarks
Mar 15, 2026

Conversation


@mark14wu mark14wu commented Mar 3, 2026

Summary

  • Add SwiGLU forward/backward kernel benchmarks (from Liger-Kernel ops/swiglu.py)
  • Add fused Cross Entropy kernel benchmark (from Liger-Kernel ops/cross_entropy.py)
  • Add Fused Linear JSD benchmark with a simple config reusing the existing jsd_kernel + element_mul_kernel

Test plan

  • CI benchmark workflow runs successfully with the new entries
  • New benchmarks appear in the PR comment results table

Add four new benchmark entries to the CI sanitizer benchmark suite:
- swiglu_forward/swiglu_backward (Liger-Kernel SwiGLU kernels)
- cross_entropy (Liger-Kernel fused cross entropy with online softmax)
- fused_linear_jsd (simple config reusing existing JSD + element_mul kernels)

github-actions bot commented Mar 3, 2026

Sanitizer Performance Benchmark

| Benchmark | main (min) | PR (min) | Change |
| --- | --- | --- | --- |
| gemm | 0.023s | 0.189s | +708.6% ⚠️ |
| gemm_oob | 0.024s | 0.195s | +706.8% ⚠️ |
| indirect_load | 0.077s | 0.300s | +290.9% ⚠️ |
| nested_loop | 0.026s | 0.383s | +1391.2% ⚠️ |
| block_pointer_loop_advance | 0.007s | 0.190s | +2464.7% ⚠️ |
| liger_jsd | 0.153s | 0.154s | +0.5% |
| flaggems_layernorm | 2.881s | 0.470s | -83.7% |
| swiglu | N/A | 0.189s | N/A |
| cross_entropy | N/A | 0.177s | N/A |
| fused_linear_jsd | N/A | 0.231s | N/A |
| Total | N/A | 2.476s | N/A |

Threshold: >5% regression flagged with ⚠️
Iterations: 1 warmup + 20 measured

mark14wu added 9 commits March 2, 2026 23:06
… Linear JSD

Scale up benchmark configs to match actual test parameter spaces:
- SwiGLU: 4 shapes × 2 dtypes = 8 configs (fwd+bwd grouped)
- Cross Entropy: 4 shapes × 2 reductions × 6 (scalar, dtype) pairs = 48 configs
- Fused Linear JSD: 3 shapes × 2 dtypes × 2 (temp, beta) pairs = 12 configs
The lazy allocation pattern caused each iteration to re-allocate large
tensors via torch.randn, which is extremely slow for bf16 (~8s for a
single 4096x128256 tensor). 48 configs x ~2.85s allocation = 137s per
iteration, exceeding the 180s timeout.

Fix: pre-allocate tensor templates once in setup(), clone only the
in-place-modified tensors (X, a_bwd, b_bwd, dX, grad_input) per run.
Clone is ~10x faster than randn. Also deduplicate the repeated
(2,4096,32000) CE shape and drop the scalar dimension (does not affect
symbolic analysis).

Benchmark run times (1 warmup + 5 measured):
- swiglu: ~0.9s (8 configs)
- cross_entropy: ~2.4s (12 configs)
- fused_linear_jsd: ~0.6s (12 configs)
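The setup()/clone pattern described in that commit can be sketched as follows. This is a minimal illustration with hypothetical names and toy sizes, not the benchmark suite's actual harness:

```python
import torch

# setup(): allocate the expensive template exactly once.
# torch.randn on a large tensor is the slow step we want out of the loop.
template = torch.randn(64, 128, dtype=torch.float32)

def run_once():
    # Each iteration gets a cheap fresh copy via clone(), so kernels that
    # modify their input in place never contaminate later iterations.
    x = template.clone()
    x.add_(1.0)  # stand-in for an in-place-modifying kernel
    return x
```

Only tensors that are actually mutated in place need the per-run clone; read-only inputs can be shared across iterations directly.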
…ting

Pre-allocating all 6 CE templates (~5 GB) plus clones (~2.1 GB peak)
exceeded CI runner's ~7 GB RAM limit.

Switch to streaming: setup stores only metadata, run allocates one
(BT, V, dtype) template at a time, runs both reductions, then frees
before the next. Peak memory ≈ 4.2 GB (one template + one clone).
Also use f32→bf16 conversion instead of direct bf16 randn (~3x faster).
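The streaming scheme above can be sketched roughly like this (shape metadata and function names are illustrative, and the sizes are scaled down):

```python
import torch

# setup() stores only metadata, never the tensors themselves.
shapes = [(256, 1000), (512, 1000)]  # hypothetical (BT, V) pairs

def stream_templates():
    for bt, v in shapes:
        # One template alive at a time: sample in f32 and convert, which
        # is faster than drawing bf16 directly from torch.randn.
        template = torch.randn(bt, v, dtype=torch.float32).to(torch.bfloat16)
        for reduction in ("mean", "sum"):
            # Hand each run a fresh copy of the current template.
            yield template.clone(), reduction
        del template  # freed before the next shape, capping peak memory
```

Peak memory is one template plus one clone, rather than every template at once.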
Shrink grid sizes from thousands to ~512 and reduce iterations from
40 to 20. The sanitizer simulates every program instance, so large
grids caused benchmarks to exceed the 180s per-benchmark timeout.
- CI workflow: reduce A/B rounds from 8 to 4 (4×5 = 20 iterations)
- Increase old benchmark sizes (gemm 64→128, indirect 256→1024, etc.)
  to bring them closer to the new benchmarks in runtime
- Reduce flaggems_layernorm configs (5 shapes × 3 dtypes → 3 shapes ×
  1 dtype) to cut its dominant 2.5 s runtime down to ~0.3 s
@mark14wu
Collaborator Author

The ⚠️ regressions in gemm, gemm_oob, indirect_load, and block_pointer_loop_advance are expected and intentional — we deliberately increased their sizes (e.g. gemm 64→128, indirect 256→1024, block_pointer grid 1→16) to bring their runtimes closer to the new benchmarks and reduce the spread across the table.

Similarly, flaggems_layernorm dropping 84% is intentional: we trimmed it from 5 shapes × 3 dtypes = 15 configs down to 3 shapes × 1 dtype = 3 configs, since it previously dominated total runtime at ~2.8 s.

simple_load_store and nested_loop don't change regardless of grid size because the symbolic cache (_fn_symbolic_cache_set) makes all program instances after the first one nearly free for these simple kernels.

Once this PR merges, these new sizes become the baseline and future comparisons will be apples-to-apples.
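The caching effect can be modeled with a toy memoization sketch. The real `_fn_symbolic_cache_set` is internal to the sanitizer; this is only an illustrative model of why runtime scales with distinct constexpr combinations rather than with grid size:

```python
# Toy model: if symbolic analysis is memoized per (kernel, constexpr-args)
# key, only the first program instance of each configuration pays the cost.
symbolic_cache = {}
expensive_passes = 0

def analyze(kernel_name, constexpr_args):
    global expensive_passes
    key = (kernel_name, constexpr_args)
    if key not in symbolic_cache:
        expensive_passes += 1  # full symbolic pass, done once per key
        symbolic_cache[key] = ("result", key)
    return symbolic_cache[key]

# A huge grid of identical program instances costs one analysis...
for _ in range(10_000):
    analyze("nested_loop", (16, 16))
# ...so runtime grows only with *distinct* constexpr combinations.
analyze("nested_loop", (32, 8))
```

This is also why later commits in this PR add varied constexpr parameter combinations: each unique combination forces a fresh cache entry and therefore real measured work.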

Shrink grid from ~512 to ~128 to speed up these three benchmarks.
Add more diverse constexpr parameter combinations to generate unique
symbolic cache entries, bringing these benchmarks closer in runtime
to the other complex benchmarks.

- swiglu: 3 → 11 shape configs with varied intermediate_size
- cross_entropy: 3 → 7 V values + label_smoothing variation (×2)
- fused_linear_jsd: 3 → 9 shape configs with varied V/H
Convert these benchmarks to grouped pattern with multiple constexpr
parameter combinations to defeat symbolic cache and increase runtime:
- gemm/gemm_oob: 7 (M, N, K, TILE_SIZE) configs
- nested_loop: parameterize loop bounds as constexpr, 7 (OUTER, INNER) configs
Too trivial for meaningful benchmarking — symbolic cache makes it
complete in constant time regardless of grid/block size.
@mark14wu mark14wu merged commit f5ec5a2 into main Mar 15, 2026
4 checks passed
@mark14wu mark14wu deleted the add-new-perf-benchmarks branch March 15, 2026 03:04