Skip to content

Commit 4a8aeeb

Browse files
Yin Chenclaude
authored andcommitted
Phase Z aftermath: NEXT_STEPS — 3 tiers of remaining B200 unlocks
After Phase Z ruled out block-size tuning, this doc proposes the 3 remaining feasible paths and explicitly asks for greenlight before spawning more pods. Tier 1 (recommended): enable CUDA Graphs via TORCHINDUCTOR_CUDAGRAPHS=1. One-line change. PR openai#2014 already does torch.compile(dynamic=False, fullgraph=True) which is graph-capture-friendly. Expected +5-15%. Cost ~$1.50 for one verification pod. Tier 2: FP8 transformer-engine for MLP up/down. Half-day to 1d engineering. Risk: BPB regression from FP8 quant; also unclear if MLP is compute-bound vs memory-bw-bound at K=512. Greenlight criterion: profiler trace showing MLP >30% of step time. Tier 3: torch 2.7 + Triton 3.7 + FA stack rebuild. Most uncertain; 1-2d work. Only do if Tiers 1+2 don't reach target. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 16bbac0 commit 4a8aeeb

1 file changed

Lines changed: 114 additions & 0 deletions

File tree

Lines changed: 114 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,114 @@
1+
# Next steps to actually unlock B200 throughput on PR #2014
2+
3+
Phase Z ruled out MLP block-size tuning (3 variants, none ≥ baseline). This
4+
doc proposes the remaining feasible unlocks — what to test, how to verify,
5+
and what the engineering cost is. **No pods will be spawned without your
6+
explicit go-ahead.**
7+
8+
## Workload facts (extracted from Phase Z v6 train.out)
9+
10+
| Metric | Value |
11+
|---|---|
12+
| Model | 35.9M params, 11 layers, model_dim=512, mlp_mult=4.0, head_dim=64 |
13+
| Seq | 3072 (eval+train), num_loops=2 |
14+
| Step time on B200 | ~0.6 s/step (100 steps in 60 s wallclock) |
15+
| Reported tok/s | 1,378,166 (mean, steady state) |
16+
| Peak VRAM | 42.2 GB / 180 GB → **23% utilized** |
17+
| FA dispatch | basic=FA4 (CuTe sm_100), varlen=FA2 |
18+
| Compile | `torch.compile(dynamic=False, fullgraph=True)` |
19+
| CUDA graphs | **not explicitly enabled** (inductor default = off) |
20+
| Loop layers per step | encoder=8 ops, decoder=9 ops = 17 layer-applies |
21+
22+
VRAM headroom is real (only 23% used), so bigger batch would help if PR #2014
23+
allowed it. It doesn't — `local_microbatch_tokens` is fixed by `TRAIN_SEQ_LEN`
24+
× per-rank batch.
25+
26+
## Three unlock paths, ranked by effort × confidence
27+
28+
### Tier 1 — Enable CUDA Graphs (recommended first)
29+
30+
**What**: Set `torch._inductor.config.triton.cudagraphs = True` (or env var
31+
`TORCHINDUCTOR_CUDAGRAPHS=1`) before the first `torch.compile` call. Inductor
32+
will then capture a CUDA graph per shape bucket. The pre-baking via
33+
`COMPILE_SHAPE_WARMUP=1` (4 buckets × 3 iters) already pays the capture cost
34+
exactly once; subsequent steps replay the graph.
35+
36+
**Why it should help**: At 0.6 s/step on a 35.9M-param workload, a meaningful
37+
fraction is launch overhead (we launch ~17 layer-applies × ~20 kernels each =
38+
~340 kernels per step → tens of µs each of host launch). CUDA Graphs collapse
39+
that to a single `cudaGraphLaunch` per step.
40+
41+
**Expected delta**: +5-15% wall (literature for sub-50M-param + sm_100 +
42+
seq=3K typically reports 10-20% from CUDA graphs).
43+
44+
**Engineering**: 1 line in train_gpt.py, 1 pod (~15 min wall, ~$1.50 cost) to
45+
A/B test. Risk: CUDA graphs are brittle if shape buckets aren't exhaustive;
46+
but `COMPILE_SHAPE_WARMUP=1` already enumerates `64,128,192,256` cu_buckets
47+
so we know which shapes appear. Worst case: graph capture fails on an unseen
48+
shape and falls back; no correctness impact.
49+
50+
**Greenlight criterion**: a single `TORCHINDUCTOR_CUDAGRAPHS=1` env var test.
51+
52+
### Tier 2 — FP8 transformer-engine for MLP up/down
53+
54+
**What**: Replace the Triton `linear_leaky_relu_square` MLP kernel with NVIDIA
55+
TransformerEngine's `te.Linear` + manual leaky_relu/square. TE auto-emits
56+
sm_100 native FP8 GEMM via tcgen05 + tensor-memory accumulator.
57+
58+
**Why it should help**: B200's FP8 tensor cores are 2× FP16 throughput. MLP
59+
up/down is the heaviest compute in this model (the only matmuls with
60+
M=B*S=3072, N=2048, K=512 — peak compute ~6.5 GFLOPs × 2 (up+down) × 17
61+
loop-layers × 100 steps / 0.6s = ~36 TFLOPs MLP per step on B200 dense FP16
62+
peak ~990 TFLOPs → MLP ≈ 4% MFU end-to-end). If FP8 gets 2× speedup on the
63+
MLP and MLP is ~half of step time (rough estimate — we don't have per-kernel
64+
profiling), end-to-end +25-30%.
65+
66+
**Why it MIGHT not help**:
67+
- (a) FP8 quantization could regress BPB by >0.005 nat (i.e., a record-breaker
68+
worth of regression). Would need careful per-tensor scaling factor
69+
calibration and BPB verification on a full eval run, not just throughput.
70+
- (b) The Triton kernel is already BF16 GEMM and might be memory-bandwidth
71+
bound for the K=512 dimension (low arithmetic intensity), in which case FP8
72+
compute doesn't help.
73+
74+
**Engineering**: ~half-day to 1 day. Includes (a) wiring TE into PR #2014's
75+
MLP class without breaking the leaky_relu_square activation (TE's GELU/SiLU
76+
are baked-in; LeakyReLU² isn't), (b) calibrating FP8 scaling factors for the
77+
specific weight distribution, (c) re-running 3-seed BPB to verify no
78+
regression. 2-3 pods (~$10-15 cost).
79+
80+
**Greenlight criterion**: a profiler trace showing MLP > 30% of step time on
81+
B200. Without that, we shouldn't bet 1 engineer-day.
82+
83+
### Tier 3 — torch 2.7 + Triton 3.7 + rebuild FA2/FA4
84+
85+
**What**: Build a new docker image with torch 2.7+, Triton 3.7+, FA2 rebuilt
86+
against the new ABI, FA4 rebuilt against new cuda-python + cutlass-dsl.
87+
88+
**Why it should help**: Triton 3.7+ has sm_100-native PTX codegen (tcgen05
89+
+ TMEM). The current Triton 3.5.1 emits Hopper-era PTX on Blackwell, which
90+
the hardware then runs through legacy code paths.
91+
92+
**Risk**: Lots — torch ABI changes, FA4 cute API changes, every dependency
93+
needs verification. Phase Z v5 attempt confirmed Triton 3.7 alone breaks
94+
torch 2.6 inductor; a coordinated stack upgrade is needed.
95+
96+
**Engineering**: 1-2 full days of docker image work plus eval validation.
97+
5-10 pods of debug iterations expected (~$30-50).
98+
99+
**Greenlight criterion**: only if Tiers 1+2 don't reach the desired
100+
throughput target, OR if you specifically want to validate the
101+
"Triton-version-matters-on-Blackwell" thesis.
102+
103+
## Recommended path
104+
105+
1. **Now (zero risk, ~$1.50)**: Tier 1 — test CUDA Graphs. One-line change,
106+
single pod, ~15 min wall. If +10%+ confirmed, lock it in.
107+
2. **If Tier 1 gives +10%+**: stop here. B200 is at ~1.5M tok/s, which is
108+
a respectable result for a tiny model. Stack any further hardware
109+
investigation behind a clearer use-case.
110+
3. **If Tier 1 gives <5%**: profile the workload (Nsight Systems trace,
111+
~$3 of pod time) to see where time actually goes. Don't pursue Tier 2
112+
blindly without knowing MLP's actual fraction.
113+
114+
Let me know if you want to greenlight Tier 1 (the CUDA Graphs test).

0 commit comments

Comments
 (0)