[BENCH] Add SwiGLU, Cross Entropy, and Fused Linear JSD benchmarks #308
Add four new benchmark entries to the CI sanitizer benchmark suite:
- `swiglu_forward`/`swiglu_backward` (Liger-Kernel SwiGLU kernels)
- `cross_entropy` (Liger-Kernel fused cross entropy with online softmax)
- `fused_linear_jsd` (simple config reusing existing JSD + element_mul kernels)
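For reference, the forward math being benchmarked is small: SwiGLU is a SiLU-gated elementwise product of two inputs. A minimal pure-Python sketch (illustrative only, not the Liger-Kernel Triton implementation):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_forward(a, b):
    # SwiGLU forward: elementwise silu(a) * b over paired inputs
    return [silu(x) * y for x, y in zip(a, b)]
```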
Sanitizer Performance Benchmark
Threshold: regressions of more than 5% are flagged.
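A regression check of this kind can be sketched as follows (hypothetical helper, not the actual CI script):

```python
def is_regression(baseline_s: float, candidate_s: float,
                  threshold: float = 0.05) -> bool:
    # Flag when the candidate run is more than `threshold` (5% by
    # default) slower than the baseline; faster runs are never flagged.
    return (candidate_s - baseline_s) / baseline_s > threshold
```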
… Linear JSD
Scale up benchmark configs to match the actual test parameter spaces:
- SwiGLU: 4 shapes × 2 dtypes = 8 configs (fwd+bwd grouped)
- Cross Entropy: 4 shapes × 2 reductions × 6 (scalar, dtype) = 48 configs
- Fused Linear JSD: 3 shapes × 2 dtypes × 2 (temp, beta) = 12 configs
The lazy allocation pattern caused each iteration to re-allocate large tensors via torch.randn, which is extremely slow for bf16 (~8 s for a single 4096×128256 tensor); 48 configs × ~2.85 s allocation = 137 s per iteration, exceeding the 180 s timeout. Fix: pre-allocate tensor templates once in setup(), and clone only the in-place-modified tensors (X, a_bwd, b_bwd, dX, grad_input) per run; clone is ~10× faster than randn. Also deduplicate the repeated (2, 4096, 32000) CE shape and drop the scalar dimension (it does not affect symbolic analysis).

Benchmark run times (1 warmup + 5 measured):
- swiglu: ~0.9 s (8 configs)
- cross_entropy: ~2.4 s (12 configs)
- fused_linear_jsd: ~0.6 s (12 configs)
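The setup/run split described above can be sketched generically. In this hypothetical illustration, the list allocation stands in for `torch.randn` and `copy.deepcopy` stands in for `torch.Tensor.clone`:

```python
import copy

class PreallocatedBench:
    """Pre-allocate expensive buffers once; clone per iteration."""

    def __init__(self, shapes):
        self.shapes = shapes
        self.templates = {}

    def setup(self):
        # Expensive allocation happens exactly once per shape,
        # instead of once per measured iteration.
        for shape in self.shapes:
            self.templates[shape] = self.allocate(shape)

    def allocate(self, shape):
        # Stand-in for torch.randn: build the full-size buffer.
        n = 1
        for dim in shape:
            n *= dim
        return [0.0] * n

    def run(self, shape):
        # Clone only the buffers the kernel mutates in place;
        # cloning is far cheaper than re-running randn.
        x = copy.deepcopy(self.templates[shape])
        x[0] = 1.0  # in-place mutation stays local to this iteration
        return x
```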
…ting
Pre-allocating all 6 CE templates (~5 GB) plus clones (~2.1 GB peak) exceeded the CI runner's ~7 GB RAM limit. Switch to streaming: setup stores only metadata; run allocates one (BT, V, dtype) template at a time, runs both reductions, then frees it before the next. Peak memory ≈ 4.2 GB (one template + one clone). Also use f32→bf16 conversion instead of direct bf16 randn (~3× faster).
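The streaming loop can be sketched as follows (hypothetical `alloc` and `run_one` callables; the real benchmark allocates torch tensors):

```python
def run_streaming(configs, alloc, run_one):
    # Hold at most one large template at a time: allocate, run both
    # reductions against it, then drop the reference before the next
    # config so peak memory stays near one template plus one clone.
    results = []
    for cfg in configs:
        template = alloc(cfg)
        for reduction in ("mean", "sum"):
            results.append(run_one(template, reduction))
        template = None  # release before the next allocation
    return results
```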
Shrink grid sizes from thousands to ~512 and reduce iterations from 40 to 20. The sanitizer simulates every program instance, so large grids caused benchmarks to exceed the 180 s per-benchmark timeout.
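Because the sanitizer interprets every program instance, wall time scales roughly linearly with grid size and iteration count rather than with GPU throughput. A back-of-the-envelope model (illustrative numbers only, not measured constants):

```python
def estimated_runtime_s(grid_size: int, iterations: int,
                        per_program_s: float) -> float:
    # Sanitizer cost model: every program instance in the grid is
    # simulated on every iteration, so runtime is the product of all
    # three factors. Shrinking the grid 8x and halving iterations
    # cuts the estimate 16x.
    return grid_size * iterations * per_program_s
```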
- CI workflow: reduce A/B rounds from 8 to 4 (4×5 = 20 iterations) - Increase old benchmark sizes (gemm 64→128, indirect 256→1024, etc.) to bring them closer to the new benchmarks in runtime - Reduce flaggems_layernorm configs (5 shapes × 3 dtypes → 3 shapes × 1 dtype) to cut its dominant 2.5 s runtime down to ~0.3 s
Once this PR merges, these new sizes become the baseline and future comparisons will be apples-to-apples.
Shrink grid from ~512 to ~128 to speed up these three benchmarks.
Add more diverse constexpr parameter combinations to generate unique symbolic cache entries, bringing these benchmarks closer in runtime to the other complex benchmarks:
- swiglu: 3 → 11 shape configs with varied intermediate_size
- cross_entropy: 3 → 7 V values + label_smoothing variation (×2)
- fused_linear_jsd: 3 → 9 shape configs with varied V/H
Convert these benchmarks to the grouped pattern with multiple constexpr parameter combinations to defeat the symbolic cache and increase runtime:
- gemm/gemm_oob: 7 (M, N, K, TILE_SIZE) configs
- nested_loop: parameterize loop bounds as constexpr, 7 (OUTER, INNER) configs
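A grouped config grid of this shape might be generated as follows (hypothetical shapes and tile sizes, for illustration; the PR's actual 7-config lists differ):

```python
from itertools import product

# Each distinct constexpr combination compiles to a separate symbolic
# cache entry, so the benchmark does real work for every combination
# instead of hitting the cache after the first config.
TILE_SIZES = [16, 32, 64]
MNK_SHAPES = [(128, 128, 64), (256, 256, 128)]

def gemm_configs():
    return [
        {"M": m, "N": n, "K": k, "TILE_SIZE": t}
        for (m, n, k), t in product(MNK_SHAPES, TILE_SIZES)
    ]
```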
Too trivial for meaningful benchmarking: the symbolic cache makes it complete in constant time regardless of grid/block size.
Summary
- `ops/swiglu.py`
- `ops/cross_entropy.py`
- `jsd_kernel` + `element_mul_kernel`

Test plan