
Commit 84adf58

Add Claude skills for correctness validation (#13)
Quick regression check for code changes — run 15 training steps on both the base and feature branch, compare loss curves, and flag if anything diverged.
1 parent bfb2826 commit 84adf58

---
name: correctness-validation
description: Validates that code changes do not break training correctness by comparing loss curves between a base branch and the current feature branch. Use when user asks to "validate correctness", "check if changes break training", "compare loss curves", "run a regression test", or "verify my changes are correct". Also use when a feature branch modifies model code, operators, pipeline logic, or distributed training modules.
---

# Correctness Validation

Validates training correctness by running a short 15-step training run on both a base branch and the current feature branch, then comparing three metrics step-by-step: cross-entropy loss, load-balance loss, and gradient norm.

## Overview

The validation has two phases:

1. **Shared setup** (run once, reused across branches): download a minimal DCLM corpus shard, tokenize it, download the HuggingFace checkpoint, and convert it to DCP format.
2. **Branch comparison**: run 15 training steps on the base branch (via git worktree) and the feature branch, then compare the stdout logs.

Shared setup artifacts live in `workspace/` and are deterministic given the same seed and released checkpoint, so they are safe to share between branches.
## Prerequisites

- **Python environment**: Use the `.venv` in the original repo root (not the worktree). Activate it before running any scripts: `source $REPO_ROOT/.venv/bin/activate`. If `.venv` does not exist, create it following the README instructions (`uv venv && uv sync`).
- **Hardware**: Minimum **4x B200 GPUs** (PP=2, EP=2 with DeepSeek-V2-Lite).

Note: both `.venv` and `workspace/` live in the original repo root. The worktree gets both via symlink (see Step 4).
## Supported Models

Each model has a validation script and a setup script under `scripts/`:

| Model | Setup Script | Validation Script | GPUs |
|---|---|---|---|
| DeepSeek-V2-Lite | `setup_deepseek_v2_lite.py` | `validate_deepseek_v2_lite.py` | 4 (PP=2, EP=2) |
| Qwen3-30B-A3B | `setup_qwen3_30b_a3b.py` | `validate_qwen3_30b_a3b.py` | 16 (PP=2, EP=8) |
## Step-by-Step Workflow

### Step 1: Determine Impact and Select Models

Analyze the code change to decide which models need validation. The goal is to run validation on **every model whose behavior could be affected**.

**How to analyze impact:**

1. Get the list of changed files:

   ```bash
   git diff --name-only <base_branch>
   ```

2. **If changes are under a model-specific directory** (e.g., `pithtrain/models/deepseek_v2_lite/` or `pithtrain/models/qwen3_moe/`), only that model is affected.

3. **If changes are in shared code** (e.g., `pithtrain/operators/`, `pithtrain/layers/`, `pithtrain/dualpipe/`, `pithtrain/modules/`, `pithtrain/tasks/`), read the changed code and determine whether it touches a feature that is model-specific or universal:
   - Read each model's `config.json` at `examples/pretrain_language_model/<model>/config.json` to understand what features that model uses (attention type, shared experts, expert count, RoPE variant, etc.)
   - Read the changed code to understand what architectural features it touches
   - A model is affected if it uses any feature touched by the change

4. **If unsure whether a model is affected, include it.** Over-validating is better than missing a regression.
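The path-based part of this triage can be sketched as follows. This is only an illustration: the `affected_models` helper is hypothetical (it is not one of the skill's scripts), and the directory-to-model mapping is taken from the paths and model names above.

```python
# Hypothetical helper: map changed file paths to the set of models to validate.
# Directory prefixes follow the repo layout described above; any change outside
# a model-specific directory conservatively selects every supported model.
MODEL_DIRS = {
    "pithtrain/models/deepseek_v2_lite/": "deepseek-v2-lite",
    "pithtrain/models/qwen3_moe/": "qwen3-30b-a3b",
}
ALL_MODELS = set(MODEL_DIRS.values())

def affected_models(changed_files):
    affected = set()
    for path in changed_files:
        for prefix, model in MODEL_DIRS.items():
            if path.startswith(prefix):
                affected.add(model)
                break
        else:
            # Shared code (operators, layers, dualpipe, ...): when in doubt,
            # include every model. Over-validating beats missing a regression.
            affected |= ALL_MODELS
    return affected

print(sorted(affected_models(["pithtrain/models/qwen3_moe/attention.py"])))
```

In practice step 3 still requires reading the changed code and each model's `config.json`; a pure path match is only the first filter.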
### Step 2: Detect Environment

Check if running under SLURM by testing for `SLURM_JOB_ID`:

```bash
if [ -n "${SLURM_JOB_ID:-}" ]; then
  echo "SLURM detected (job $SLURM_JOB_ID) — will use srun for multi-node launch"
else
  echo "No SLURM — single-node launch"
fi
```

This determines whether to prefix commands with `srun -W 0`. The workspace directory is **node-local storage**, so setup (data download, tokenization, checkpoint conversion) must run on **every node**.

### Step 3: Shared Setup

Run the setup launch script for each affected model. The setup scripts are idempotent — they skip steps whose output already exists.

```bash
# Single-node (replace <model> with deepseek-v2-lite or qwen3-30b-a3b)
bash .claude/skills/correctness-validation/scripts/launch_setup.sh <model>

# Multi-node (SLURM) — must run on every node since workspace is node-local
srun -W 0 .claude/skills/correctness-validation/scripts/launch_setup.sh <model>
```

This downloads a single minimal DCLM shard (`global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.jsonl.zst`), tokenizes it with the model's tokenizer, downloads the HuggingFace checkpoint, and converts it to DCP format.
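The skip-if-output-exists pattern that makes these setup scripts safe to re-run can be sketched as below. `run_step` and the artifact name are hypothetical stand-ins, not the actual script API; the real scripts track their own artifact paths.

```python
import os

def run_step(output_path, build):
    """Run `build` only if its output artifact is missing (idempotent step).

    Hypothetical sketch of the skip-if-exists pattern; re-running is a no-op
    once the artifact is in place.
    """
    if os.path.exists(output_path):
        return "skipped"
    tmp = output_path + ".tmp"
    build(tmp)                    # write to a temp path first,
    os.replace(tmp, output_path)  # then publish atomically, so an interrupted
    return "built"                # build never leaves a half-written artifact

os.makedirs("workspace", exist_ok=True)
make = lambda p: open(p, "wb").close()  # stand-in for download/tokenize/convert
print(run_step("workspace/tokens.bin", make))  # -> built
print(run_step("workspace/tokens.bin", make))  # -> skipped
```

The temp-file-plus-rename step matters on node-local storage: if a node dies mid-download, the next run rebuilds rather than trusting a truncated file.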
### Step 4: Create Git Worktree for Base Branch

Create a worktree for the base branch. Symlink `workspace/` and `.venv` from the repo root so both branches share the same data and environment.

```bash
BASE_BRANCH=main  # or the branch this feature was based on
WORKTREE=$(mktemp -d)
REPO_ROOT=$(git rev-parse --show-toplevel)

git worktree add $WORKTREE $BASE_BRANCH
ln -sfn $REPO_ROOT/workspace $WORKTREE/workspace
ln -sfn $REPO_ROOT/.venv $WORKTREE/.venv
```

### Step 5: Run Validation on Base Branch

Run 15 training steps in the base worktree. Only run the model(s) selected in Step 1.

```bash
cd $WORKTREE

# Single-node (replace <model> with deepseek-v2-lite or qwen3-30b-a3b)
bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>

# Multi-node (SLURM)
srun -W 0 .claude/skills/correctness-validation/scripts/launch_validate.sh <model>
```

The launch script auto-detects SLURM environment variables (`SLURM_NNODES`, `SLURM_NODEID`, `SLURM_STEP_GPUS`, `SLURM_STEP_NODELIST`) to configure `torchrun` arguments. On single-node, it falls back to localhost defaults.
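One plausible shape of that mapping from SLURM variables to `torchrun` flags is sketched below. This is an assumption about the launch script's internals, not its actual code; in particular, real `SLURM_STEP_NODELIST` values can be compressed ranges like `node[01-04]`, which this sketch does not expand.

```python
import os
import shlex

def torchrun_args(gpus_per_node=4, port=29500):
    """Sketch: derive torchrun flags from SLURM env vars, with localhost
    defaults when no SLURM variables are set (single-node case)."""
    nnodes = int(os.environ.get("SLURM_NNODES", "1"))
    node_rank = int(os.environ.get("SLURM_NODEID", "0"))
    # First host in the step's node list acts as the rendezvous master.
    nodelist = os.environ.get("SLURM_STEP_NODELIST", "localhost")
    master = nodelist.split(",")[0]  # naive split; ignores node[NN-MM] ranges
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--nproc_per_node={gpus_per_node}",
        f"--master_addr={master}",
        f"--master_port={port}",
    ]

print(shlex.join(torchrun_args()))
```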
Logs are written to `logging/correctness-validation/validate_<model>_node<N>.log`.

Return to the original repo directory after the run completes.

### Step 6: Run Validation on Feature Branch

Run the same 15 steps in the current (feature) working directory, for the same model(s).

```bash
cd $REPO_ROOT

# Single-node
bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>

# Multi-node (SLURM)
srun -W 0 .claude/skills/correctness-validation/scripts/launch_validate.sh <model>
```

### Step 7: Compare Results

Run the compare script for each model that was validated. Use the node-0 logs (rank 0 emits the metrics). Run `python3 .claude/skills/correctness-validation/scripts/compare.py --help` for full options.

```bash
python3 .claude/skills/correctness-validation/scripts/compare.py \
  $WORKTREE/logging/correctness-validation/validate_<model>_node0.log \
  logging/correctness-validation/validate_<model>_node0.log
```

The compare script parses both logs, extracts per-step metrics, and reports pass/fail. It checks:

- **cross-entropy-loss**: relative tolerance per step
- **load-balance-loss**: relative tolerance per step
- **gradient-norm**: relative tolerance per step

Default tolerance is 1e-3 relative difference. Use `--tolerance` to adjust.
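A per-step relative-tolerance check of this kind can be sketched as follows. The exact formula is an assumption about `compare.py`'s internals (treat it as illustrative), but the numbers below reproduce the 6.16e-03 divergence shown in the failure example later in this document.

```python
def rel_diff(base, feature, eps=1e-12):
    """Relative difference, guarded against division by zero."""
    return abs(base - feature) / max(abs(base), eps)

def within_tolerance(base_vals, feature_vals, tolerance=1e-3):
    """True if every step's metric agrees within the relative tolerance."""
    return all(rel_diff(b, f) <= tolerance
               for b, f in zip(base_vals, feature_vals))

# Per-step cross-entropy-loss values; the last step diverges by about
# 6.16e-3 relative, beyond the 1e-3 default tolerance.
base    = [2.9871, 2.8412, 2.6637]
feature = [2.9871, 2.8413, 2.6801]
print(within_tolerance(base, feature))  # -> False
```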
Expected output on success:

```
PASS: All metrics within tolerance across all steps.
```

Expected output on failure:

```
FAIL: Metrics diverged beyond tolerance:
cross-entropy-loss:
  step 003: cross-entropy-loss diverged — base=2.663700, feature=2.680100, rel_diff=6.16e-03 > tolerance=1e-03
```

### Step 8: Clean Up

```bash
git worktree remove $WORKTREE
```

## Log Format

The training scripts emit lines like:

```
2026-04-02 12:32:40 | INFO | step 00000001/00000015 | step-time 110.990 sec | cross-entropy-loss 2.6637 | load-balance-loss 0.001234 | learning-rate 1.000000e-06 | gradient-norm 20.3210 | tokens-per-second 18,895 | peak-gpu-memory 47.20 GB
```

The compare script parses pipe-separated key-value pairs from lines containing `| INFO | step `.
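A minimal sketch of that parsing is below. It assumes each pipe-separated field after the step counter has the shape `name value [unit]`, with optional thousands separators in the value; the real `compare.py` may be stricter.

```python
def parse_step_line(line):
    """Extract numeric metrics from a '| INFO | step ' training log line.

    Sketch of the assumed format: split on '|', skip the timestamp, level,
    and step-counter fields, then read 'name value [unit]' pairs.
    """
    if "| INFO | step " not in line:
        return None
    metrics = {}
    for field in line.split("|")[3:]:  # fields after the step counter
        parts = field.split()
        if len(parts) >= 2:
            try:
                metrics[parts[0]] = float(parts[1].replace(",", ""))
            except ValueError:
                pass  # non-numeric value; ignore the field
    return metrics

line = ("2026-04-02 12:32:40 | INFO | step 00000001/00000015 "
        "| step-time 110.990 sec | cross-entropy-loss 2.6637 "
        "| load-balance-loss 0.001234 | learning-rate 1.000000e-06 "
        "| gradient-norm 20.3210 | tokens-per-second 18,895 "
        "| peak-gpu-memory 47.20 GB")
print(parse_step_line(line)["cross-entropy-loss"])  # -> 2.6637
```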
## Common Issues

### Setup fails on HuggingFace download

Ensure `HF_TOKEN` is set if the model is gated. DeepSeek-V2-Lite and Qwen3-30B-A3B are public models.

### OOM during validation

DeepSeek-V2-Lite requires 4x B200 GPUs. Qwen3-30B-A3B requires 16x B200 GPUs. If OOM occurs, check that no other processes are using GPU memory.

### Logs show no load-balance-loss

The validation scripts set `moe_load_balance_coef > 0` to ensure this metric is logged. If it is missing, check that the validation script (not an example script) was used.

### Tolerance too strict

FP8 operations and flash attention can introduce small non-determinism. If validation fails with very small differences, try increasing the tolerance:

```bash
python3 .claude/skills/correctness-validation/scripts/compare.py \
  base.log feature.log --tolerance 4e-3
```

### Worktree conflicts

If the worktree was not cleaned up from a previous run, use `git worktree list` to find it and `git worktree remove <path> --force` to remove it.
