[BENCH] Add SwiGLU, Cross Entropy, and Fused Linear JSD benchmarks #308
Add four new benchmark entries to the CI sanitizer benchmark suite:
- `swiglu_forward`/`swiglu_backward` (Liger-Kernel SwiGLU kernels)
- `cross_entropy` (Liger-Kernel fused cross entropy with online softmax)
- `fused_linear_jsd` (simple config reusing existing JSD + element_mul kernels)
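For reference, the forward math being benchmarked is small: SwiGLU is a SiLU-gated elementwise product of two inputs. A minimal pure-Python sketch (illustrative only, not the Liger-Kernel Triton implementation):

```python
import math

def silu(x: float) -> float:
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

def swiglu_forward(a, b):
    # SwiGLU forward: elementwise silu(a) * b over paired inputs
    return [silu(x) * y for x, y in zip(a, b)]
```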
Sanitizer Performance Benchmark
Threshold: regressions of more than 5% are flagged.
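A regression check of this kind can be sketched as follows (hypothetical helper, not the actual CI script):

```python
def is_regression(baseline_s: float, candidate_s: float,
                  threshold: float = 0.05) -> bool:
    # Flag when the candidate run is more than `threshold` (5% by
    # default) slower than the baseline; faster runs are never flagged.
    return (candidate_s - baseline_s) / baseline_s > threshold
```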
… Linear JSD
Scale up benchmark configs to match the actual test parameter spaces:
- SwiGLU: 4 shapes × 2 dtypes = 8 configs (fwd+bwd grouped)
- Cross Entropy: 4 shapes × 2 reductions × 6 (scalar, dtype) = 48 configs
- Fused Linear JSD: 3 shapes × 2 dtypes × 2 (temp, beta) = 12 configs
The lazy allocation pattern caused each iteration to re-allocate large tensors via torch.randn, which is extremely slow for bf16 (~8 s for a single 4096×128256 tensor); 48 configs × ~2.85 s allocation = 137 s per iteration, exceeding the 180 s timeout. Fix: pre-allocate tensor templates once in setup(), and clone only the in-place-modified tensors (X, a_bwd, b_bwd, dX, grad_input) per run; clone is ~10× faster than randn. Also deduplicate the repeated (2, 4096, 32000) CE shape and drop the scalar dimension (it does not affect symbolic analysis).

Benchmark run times (1 warmup + 5 measured):
- swiglu: ~0.9 s (8 configs)
- cross_entropy: ~2.4 s (12 configs)
- fused_linear_jsd: ~0.6 s (12 configs)
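The setup/run split described above can be sketched generically. In this hypothetical illustration, the list allocation stands in for `torch.randn` and `copy.deepcopy` stands in for `torch.Tensor.clone`:

```python
import copy

class PreallocatedBench:
    """Pre-allocate expensive buffers once; clone per iteration."""

    def __init__(self, shapes):
        self.shapes = shapes
        self.templates = {}

    def setup(self):
        # Expensive allocation happens exactly once per shape,
        # instead of once per measured iteration.
        for shape in self.shapes:
            self.templates[shape] = self.allocate(shape)

    def allocate(self, shape):
        # Stand-in for torch.randn: build the full-size buffer.
        n = 1
        for dim in shape:
            n *= dim
        return [0.0] * n

    def run(self, shape):
        # Clone only the buffers the kernel mutates in place;
        # cloning is far cheaper than re-running randn.
        x = copy.deepcopy(self.templates[shape])
        x[0] = 1.0  # in-place mutation stays local to this iteration
        return x
```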
…ting
Pre-allocating all 6 CE templates (~5 GB) plus clones (~2.1 GB peak) exceeded the CI runner's ~7 GB RAM limit. Switch to streaming: setup stores only metadata; run allocates one (BT, V, dtype) template at a time, runs both reductions, then frees it before the next. Peak memory ≈ 4.2 GB (one template + one clone). Also use f32→bf16 conversion instead of direct bf16 randn (~3× faster).
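The streaming loop can be sketched as follows (hypothetical `alloc` and `run_one` callables; the real benchmark allocates torch tensors):

```python
def run_streaming(configs, alloc, run_one):
    # Hold at most one large template at a time: allocate, run both
    # reductions against it, then drop the reference before the next
    # config so peak memory stays near one template plus one clone.
    results = []
    for cfg in configs:
        template = alloc(cfg)
        for reduction in ("mean", "sum"):
            results.append(run_one(template, reduction))
        template = None  # release before the next allocation
    return results
```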
Shrink grid sizes from thousands to ~512 and reduce iterations from 40 to 20. The sanitizer simulates every program instance, so large grids caused benchmarks to exceed the 180 s per-benchmark timeout.
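Because the sanitizer interprets every program instance, wall time scales roughly linearly with grid size and iteration count rather than with GPU throughput. A back-of-the-envelope model (illustrative numbers only, not measured constants):

```python
def estimated_runtime_s(grid_size: int, iterations: int,
                        per_program_s: float) -> float:
    # Sanitizer cost model: every program instance in the grid is
    # simulated on every iteration, so runtime is the product of all
    # three factors. Shrinking the grid 8x and halving iterations
    # cuts the estimate 16x.
    return grid_size * iterations * per_program_s
```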
- CI workflow: reduce A/B rounds from 8 to 4 (4×5 = 20 iterations) - Increase old benchmark sizes (gemm 64→128, indirect 256→1024, etc.) to bring them closer to the new benchmarks in runtime - Reduce flaggems_layernorm configs (5 shapes × 3 dtypes → 3 shapes × 1 dtype) to cut its dominant 2.5 s runtime down to ~0.3 s
Once this PR merges, these new sizes become the baseline and future comparisons will be apples-to-apples.
Shrink grid from ~512 to ~128 to speed up these three benchmarks.
Add more diverse constexpr parameter combinations to generate unique symbolic cache entries, bringing these benchmarks closer in runtime to the other complex benchmarks:
- swiglu: 3 → 11 shape configs with varied intermediate_size
- cross_entropy: 3 → 7 V values + label_smoothing variation (×2)
- fused_linear_jsd: 3 → 9 shape configs with varied V/H
Convert these benchmarks to the grouped pattern with multiple constexpr parameter combinations to defeat the symbolic cache and increase runtime:
- gemm/gemm_oob: 7 (M, N, K, TILE_SIZE) configs
- nested_loop: parameterize loop bounds as constexpr, 7 (OUTER, INNER) configs
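A grouped config grid of this shape might be generated as follows (hypothetical shapes and tile sizes, for illustration; the PR's actual 7-config lists differ):

```python
from itertools import product

# Each distinct constexpr combination compiles to a separate symbolic
# cache entry, so the benchmark does real work for every combination
# instead of hitting the cache after the first config.
TILE_SIZES = [16, 32, 64]
MNK_SHAPES = [(128, 128, 64), (256, 256, 128)]

def gemm_configs():
    return [
        {"M": m, "N": n, "K": k, "TILE_SIZE": t}
        for (m, n, k), t in product(MNK_SHAPES, TILE_SIZES)
    ]
```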
Too trivial for meaningful benchmarking: the symbolic cache makes it complete in constant time regardless of grid/block size.
Summary
- `ops/swiglu.py`
- `ops/cross_entropy.py`
- `jsd_kernel` + `element_mul_kernel`

Test plan