|
| 1 | +# Next steps to actually unlock B200 throughput on PR #2014 |
| 2 | + |
| 3 | +Phase Z ruled out MLP block-size tuning (3 variants, none ≥ baseline). This |
| 4 | +doc proposes the remaining feasible unlocks — what to test, how to verify, |
| 5 | +and what the engineering cost is. **No pods will be spawned without your |
| 6 | +explicit go-ahead.** |
| 7 | + |
| 8 | +## Workload facts (extracted from Phase Z v6 train.out) |
| 9 | + |
| 10 | +| Metric | Value | |
| 11 | +|---|---| |
| 12 | +| Model | 35.9M params, 11 layers, model_dim=512, mlp_mult=4.0, head_dim=64 | |
| 13 | +| Seq | 3072 (eval+train), num_loops=2 | |
| 14 | +| Step time on B200 | ~0.6 s/step (100 steps in 60 s wallclock) | |
| 15 | +| Reported tok/s | 1,378,166 (mean, steady state) | |
| 16 | +| Peak VRAM | 42.2 GB / 180 GB → **23% utilized** | |
| 17 | +| FA dispatch | basic=FA4 (CuTe sm_100), varlen=FA2 | |
| 18 | +| Compile | `torch.compile(dynamic=False, fullgraph=True)` | |
| 19 | +| CUDA graphs | **not explicitly enabled** (inductor default = off) | |
| 20 | +| Loop layers per step | encoder=8 ops, decoder=9 ops = 17 layer-applies | |
| 21 | + |
| 22 | +VRAM headroom is real (only 23% used), so bigger batch would help if PR #2014 |
| 23 | +allowed it. It doesn't — `local_microbatch_tokens` is fixed by `TRAIN_SEQ_LEN` |
| 24 | +× per-rank batch. |
| 25 | + |
| 26 | +## Three unlock paths, ranked by effort × confidence |
| 27 | + |
| 28 | +### Tier 1 — Enable CUDA Graphs (recommended first) |
| 29 | + |
| 30 | +**What**: Set `torch._inductor.config.triton.cudagraphs = True` (or env var |
| 31 | +`TORCHINDUCTOR_CUDAGRAPHS=1`) before the first `torch.compile` call. Inductor |
| 32 | +will then capture a CUDA graph per shape bucket. The pre-baking via |
| 33 | +`COMPILE_SHAPE_WARMUP=1` (4 buckets × 3 iters) already pays the capture cost |
| 34 | +exactly once; subsequent steps replay the graph. |
| 35 | + |
| 36 | +**Why it should help**: At 0.6 s/step on a 35.9M-param workload, a meaningful |
| 37 | +fraction is launch overhead (we launch ~17 layer-applies × ~20 kernels each = |
| 38 | +~340 kernels per step → tens of µs each of host launch). CUDA Graphs collapse |
| 39 | +that to a single `cudaGraphLaunch` per step. |
| 40 | + |
| 41 | +**Expected delta**: +5-15% wall (literature for sub-50M-param + sm_100 + |
| 42 | +seq=3K typically reports 10-20% from CUDA graphs). |
| 43 | + |
| 44 | +**Engineering**: 1 line in train_gpt.py, 1 pod (~15 min wall, ~$1.50 cost) to |
| 45 | +A/B test. Risk: CUDA graphs are brittle if shape buckets aren't exhaustive; |
| 46 | +but `COMPILE_SHAPE_WARMUP=1` already enumerates `64,128,192,256` cu_buckets |
| 47 | +so we know which shapes appear. Worst case: graph capture fails on an unseen |
| 48 | +shape and falls back; no correctness impact. |
| 49 | + |
| 50 | +**Greenlight criterion**: a single `TORCHINDUCTOR_CUDAGRAPHS=1` env var test. |
| 51 | + |
| 52 | +### Tier 2 — FP8 transformer-engine for MLP up/down |
| 53 | + |
| 54 | +**What**: Replace the Triton `linear_leaky_relu_square` MLP kernel with NVIDIA |
| 55 | +TransformerEngine's `te.Linear` + manual leaky_relu/square. TE auto-emits |
| 56 | +sm_100 native FP8 GEMM via tcgen05 + tensor-memory accumulator. |
| 57 | + |
| 58 | +**Why it should help**: B200's FP8 tensor cores are 2× FP16 throughput. MLP |
| 59 | +up/down is the heaviest compute in this model (the only matmuls with |
| 60 | +M=B*S=3072, N=2048, K=512 — peak compute ~6.5 GFLOPs × 2 (up+down) × 17 |
| 61 | +loop-layers × 100 steps / 0.6s = ~36 TFLOPs MLP per step on B200 dense FP16 |
| 62 | +peak ~990 TFLOPs → MLP ≈ 4% MFU end-to-end). If FP8 gets 2× speedup on the |
| 63 | +MLP and MLP is ~half of step time (rough estimate — we don't have per-kernel |
| 64 | +profiling), end-to-end +25-30%. |
| 65 | + |
| 66 | +**Why it MIGHT not help**: |
| 67 | +- (a) FP8 quantization could regress BPB by >0.005 nat (i.e., a record-breaker |
| 68 | + worth of regression). Would need careful per-tensor scaling factor |
| 69 | + calibration and BPB verification on a full eval run, not just throughput. |
| 70 | +- (b) The Triton kernel is already BF16 GEMM and might be memory-bandwidth |
| 71 | + bound for the K=512 dimension (low arithmetic intensity), in which case FP8 |
| 72 | + compute doesn't help. |
| 73 | + |
| 74 | +**Engineering**: ~half-day to 1 day. Includes (a) wiring TE into PR #2014's |
| 75 | +MLP class without breaking the leaky_relu_square activation (TE's GELU/SiLU |
| 76 | +are baked-in; LeakyReLU² isn't), (b) calibrating FP8 scaling factors for the |
| 77 | +specific weight distribution, (c) re-running 3-seed BPB to verify no |
| 78 | +regression. 2-3 pods (~$10-15 cost). |
| 79 | + |
| 80 | +**Greenlight criterion**: a profiler trace showing MLP > 30% of step time on |
| 81 | +B200. Without that, we shouldn't bet 1 engineer-day. |
| 82 | + |
| 83 | +### Tier 3 — torch 2.7 + Triton 3.7 + rebuild FA2/FA4 |
| 84 | + |
| 85 | +**What**: Build a new docker image with torch 2.7+, Triton 3.7+, FA2 rebuilt |
| 86 | +against the new ABI, FA4 rebuilt against new cuda-python + cutlass-dsl. |
| 87 | + |
| 88 | +**Why it should help**: Triton 3.7+ has sm_100-native PTX codegen (tcgen05 |
| 89 | ++ TMEM). The current Triton 3.5.1 emits Hopper-era PTX on Blackwell, which |
| 90 | +the hardware then runs through legacy code paths. |
| 91 | + |
| 92 | +**Risk**: Lots — torch ABI changes, FA4 cute API changes, every dependency |
| 93 | +needs verification. Phase Z v5 attempt confirmed Triton 3.7 alone breaks |
| 94 | +torch 2.6 inductor; a coordinated stack upgrade is needed. |
| 95 | + |
| 96 | +**Engineering**: 1-2 full days of docker image work plus eval validation. |
| 97 | +5-10 pods of debug iterations expected (~$30-50). |
| 98 | + |
| 99 | +**Greenlight criterion**: only if Tiers 1+2 don't reach the desired |
| 100 | +throughput target, OR if you specifically want to validate the |
| 101 | +"Triton-version-matters-on-Blackwell" thesis. |
| 102 | + |
| 103 | +## Recommended path |
| 104 | + |
| 105 | +1. **Now (zero risk, ~$1.50)**: Tier 1 — test CUDA Graphs. One-line change, |
| 106 | + single pod, ~15 min wall. If +10%+ confirmed, lock it in. |
| 107 | +2. **If Tier 1 gives +10%+**: stop here. B200 is at ~1.5M tok/s, which is |
| 108 | + a respectable result for a tiny model. Stack any further hardware |
| 109 | + investigation behind a clearer use-case. |
| 110 | +3. **If Tier 1 gives <5%**: profile the workload (Nsight Systems trace, |
| 111 | + ~$3 of pod time) to see where time actually goes. Don't pursue Tier 2 |
| 112 | + blindly without knowing MLP's actual fraction. |
| 113 | + |
| 114 | +Let me know if you want to greenlight Tier 1 (the CUDA Graphs test). |
0 commit comments