---
name: correctness-validation
description: Validates that code changes do not break training correctness by comparing loss curves between a base branch and the current feature branch. Use when user asks to "validate correctness", "check if changes break training", "compare loss curves", "run a regression test", or "verify my changes are correct". Also use when a feature branch modifies model code, operators, pipeline logic, or distributed training modules.
---

# Correctness Validation

Validates training correctness by running a short 15-step training run on both a base branch and the current feature branch, then comparing three metrics step by step: cross-entropy loss, load-balance loss, and gradient norm.

## Overview

The validation has two phases:

1. **Shared setup** (run once, reused across branches): download a minimal DCLM corpus shard, tokenize it, download and convert the HuggingFace checkpoint to DCP format.
2. **Branch comparison**: run 15 training steps on the base branch (via git worktree) and the feature branch, then compare the stdout logs.

Shared setup artifacts live in `workspace/` and are deterministic given the same seed and released checkpoint, so they are safe to share between branches.

## Prerequisites

- **Python environment**: Use the `.venv` in the original repo root (not the worktree). Activate it before running any scripts: `source $REPO_ROOT/.venv/bin/activate`. If `.venv` does not exist, create it following the README instructions (`uv venv && uv sync`).
- **Hardware**: Minimum **4x B200 GPUs** (PP=2, EP=2 with DeepSeek-V2-Lite).

Note: both `.venv` and `workspace/` live in the original repo root. The worktree gets both via symlink (see Step 4).

## Supported Models

Each model has a validation script and a setup script under `scripts/`:

| Model | Setup Script | Validation Script | GPUs |
|---|---|---|---|
| DeepSeek-V2-Lite | `setup_deepseek_v2_lite.py` | `validate_deepseek_v2_lite.py` | 4 (PP=2, EP=2) |
| Qwen3-30B-A3B | `setup_qwen3_30b_a3b.py` | `validate_qwen3_30b_a3b.py` | 16 (PP=2, EP=8) |

## Step-by-Step Workflow

### Step 1: Determine Impact and Select Models

Analyze the code change to decide which models need validation. The goal is to run validation on **every model whose behavior could be affected**.

**How to analyze impact:**

1. Get the list of changed files:
   ```bash
   git diff --name-only <base_branch>
   ```

2. **If changes are under a model-specific directory** (e.g., `pithtrain/models/deepseek_v2_lite/` or `pithtrain/models/qwen3_moe/`), only that model is affected.

3. **If changes are in shared code** (e.g., `pithtrain/operators/`, `pithtrain/layers/`, `pithtrain/dualpipe/`, `pithtrain/modules/`, `pithtrain/tasks/`), read the changed code and determine whether it touches a feature that is model-specific or universal:
   - Read each model's `config.json` at `examples/pretrain_language_model/<model>/config.json` to understand what features that model uses (attention type, shared experts, expert count, RoPE variant, etc.)
   - Read the changed code to understand what architectural features it touches
   - A model is affected if it uses any feature touched by the change

4. **If unsure whether a model is affected, include it.** Over-validating is better than missing a regression.
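
The triage above can be sketched as a small helper. The directory-to-model mapping below is an assumption reconstructed from the example paths in this section, not a script that ships with the repo; for shared-code changes it stays deliberately conservative, and the human review in step 3 is what narrows the set down.

```python
# Hypothetical triage helper: map changed file paths to models to validate.
# Directory prefixes are taken from this document's examples.
MODEL_DIRS = {
    "pithtrain/models/deepseek_v2_lite/": "deepseek-v2-lite",
    "pithtrain/models/qwen3_moe/": "qwen3-30b-a3b",
}
SHARED_DIRS = (
    "pithtrain/operators/",
    "pithtrain/layers/",
    "pithtrain/dualpipe/",
    "pithtrain/modules/",
    "pithtrain/tasks/",
)
ALL_MODELS = frozenset(MODEL_DIRS.values())

def affected_models(changed_files):
    """Return the set of models to validate for a list of changed paths."""
    affected = set()
    for path in changed_files:
        for prefix, model in MODEL_DIRS.items():
            if path.startswith(prefix):
                affected.add(model)
                break
        else:
            if path.startswith(SHARED_DIRS):
                # Shared code: conservatively include every model; reviewing
                # the change (step 3 above) may narrow this down.
                return set(ALL_MODELS)
    return affected
```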

### Step 2: Detect Environment

Check if running under SLURM by testing for `SLURM_JOB_ID`:

```bash
if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "SLURM detected (job $SLURM_JOB_ID) — will use srun for multi-node launch"
else
    echo "No SLURM — single-node launch"
fi
```

This determines whether to prefix commands with `srun -W 0`. The workspace directory is **node-local storage**, so setup (data download, tokenization, checkpoint conversion) must run on **every node**.

### Step 3: Shared Setup

Run the setup launch script for each affected model. The setup scripts are idempotent — they skip steps whose output already exists.

```bash
# Single-node (replace <model> with deepseek-v2-lite or qwen3-30b-a3b)
bash .claude/skills/correctness-validation/scripts/launch_setup.sh <model>

# Multi-node (SLURM) — must run on every node since workspace is node-local
srun -W 0 bash .claude/skills/correctness-validation/scripts/launch_setup.sh <model>
```

This downloads a single minimal DCLM shard (`global-shard_01_of_10/local-shard_0_of_10/shard_00000000_processed.jsonl.zst`), tokenizes it with the model's tokenizer, downloads the HuggingFace checkpoint, and converts it to DCP format.
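
The idempotence mentioned above typically comes down to skip-if-the-artifact-exists checks. A minimal sketch of that pattern (the stage and file names here are hypothetical, not the actual setup scripts):

```python
from pathlib import Path

def run_stage(output: Path, build) -> bool:
    """Run a setup stage only if its output artifact is missing.

    Returns True if the stage ran, False if it was skipped. `build` is a
    callable that produces `output`; a robust stage would also write to a
    temp path and rename, so a crashed run is not mistaken for a finished one.
    """
    if output.exists():
        print(f"skip: {output} already exists")
        return False
    output.parent.mkdir(parents=True, exist_ok=True)
    build(output)
    return True
```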

### Step 4: Create Git Worktree for Base Branch

Create a worktree for the base branch. Symlink `workspace/` and `.venv` from the repo root so both branches share the same data and environment.

```bash
BASE_BRANCH=main # or the branch this feature was based on
WORKTREE=$(mktemp -d)
REPO_ROOT=$(git rev-parse --show-toplevel)

git worktree add $WORKTREE $BASE_BRANCH
ln -sfn $REPO_ROOT/workspace $WORKTREE/workspace
ln -sfn $REPO_ROOT/.venv $WORKTREE/.venv
```

### Step 5: Run Validation on Base Branch

Run 15 training steps in the base worktree. Only run the model(s) selected in Step 1.

```bash
cd $WORKTREE

# Single-node (replace <model> with deepseek-v2-lite or qwen3-30b-a3b)
bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>

# Multi-node (SLURM)
srun -W 0 bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>
```

The launch script auto-detects SLURM environment variables (`SLURM_NNODES`, `SLURM_NODEID`, `SLURM_STEP_GPUS`, `SLURM_STEP_NODELIST`) to configure `torchrun` arguments. On single-node, it falls back to localhost defaults.
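
As a rough illustration of that fallback logic (a sketch, not the launch script itself; note in particular that `SLURM_STEP_NODELIST` can be a compressed hostlist like `gpu[01-04]`, which real launchers expand with `scontrol show hostnames`):

```python
import os

def torchrun_rendezvous(env=None):
    """Derive torchrun node arguments from SLURM variables, falling back
    to single-node localhost defaults when they are absent."""
    env = os.environ if env is None else env
    nnodes = int(env.get("SLURM_NNODES", 1))
    node_rank = int(env.get("SLURM_NODEID", 0))
    # Naive: treats the node list as comma-separated hostnames and picks the
    # first as master; compressed ranges like gpu[01-04] would need
    # `scontrol show hostnames` instead.
    master = env.get("SLURM_STEP_NODELIST", "localhost").split(",")[0]
    return [
        f"--nnodes={nnodes}",
        f"--node_rank={node_rank}",
        f"--master_addr={master}",
    ]
```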

Logs are written to `logging/correctness-validation/validate_<model>_node<N>.log`.

Return to the original repo directory after the run completes.

### Step 6: Run Validation on Feature Branch

Run the same 15 steps in the current (feature) working directory, for the same model(s).

```bash
cd $REPO_ROOT

# Single-node
bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>

# Multi-node (SLURM)
srun -W 0 bash .claude/skills/correctness-validation/scripts/launch_validate.sh <model>
```

### Step 7: Compare Results

Run the compare script for each model that was validated. Use the node-0 logs (rank 0 emits the metrics). Run `python3 .claude/skills/correctness-validation/scripts/compare.py --help` for full options.

```bash
python3 .claude/skills/correctness-validation/scripts/compare.py \
    $WORKTREE/logging/correctness-validation/validate_<model>_node0.log \
    logging/correctness-validation/validate_<model>_node0.log
```

The compare script parses both logs, extracts per-step metrics, and reports pass/fail. It checks three metrics at every step, each against a relative tolerance:

- **cross-entropy-loss**
- **load-balance-loss**
- **gradient-norm**

Default tolerance is 1e-3 relative difference. Use `--tolerance` to adjust.
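
"Relative tolerance" here can be read as the usual relative-difference test. A sketch of that check (assuming, as the failure message format suggests, that the difference is normalized by the base value; the compare script's exact handling of near-zero metrics may differ):

```python
def within_tolerance(base: float, feature: float, tolerance: float = 1e-3) -> bool:
    """True if |feature - base| / max(|base|, eps) <= tolerance.

    The eps guard keeps near-zero metrics (e.g. load-balance-loss early in
    training) from dividing by zero.
    """
    eps = 1e-12
    return abs(feature - base) / max(abs(base), eps) <= tolerance
```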

Expected output on success:

```
PASS: All metrics within tolerance across all steps.
```

Expected output on failure:

```
FAIL: Metrics diverged beyond tolerance:
  cross-entropy-loss:
    step 003: cross-entropy-loss diverged — base=2.663700, feature=2.680100, rel_diff=6.16e-03 > tolerance=1e-03
```

### Step 8: Clean Up

```bash
git worktree remove $WORKTREE
```

## Log Format

The training scripts emit lines like:

```
2026-04-02 12:32:40 | INFO | step 00000001/00000015 | step-time 110.990 sec | cross-entropy-loss 2.6637 | load-balance-loss 0.001234 | learning-rate 1.000000e-06 | gradient-norm 20.3210 | tokens-per-second 18,895 | peak-gpu-memory 47.20 GB
```

The compare script parses pipe-separated key-value pairs from lines containing `| INFO | step `.
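
That parsing step can be sketched as follows (the field layout is inferred from the sample line above; the actual compare script's implementation may differ):

```python
def parse_step_line(line):
    """Extract (step, metrics) from a '| INFO | step ' training log line.

    Returns None for lines that are not per-step metric lines. Values may
    contain thousands separators (e.g. tokens-per-second), which are stripped.
    """
    if "| INFO | step " not in line:
        return None
    fields = [f.strip() for f in line.split("|")]
    # fields[2] looks like "step 00000001/00000015".
    step = int(fields[2].split()[1].split("/")[0])
    metrics = {}
    for field in fields[3:]:
        parts = field.split()  # "<name> <value> [unit]"
        if len(parts) < 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1].replace(",", ""))
        except ValueError:
            continue  # non-numeric field; ignore
    return step, metrics
```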

## Common Issues

### Setup fails on HuggingFace download

Ensure `HF_TOKEN` is set if the model is gated. DeepSeek-V2-Lite and Qwen3-30B-A3B are public models.

### OOM during validation

DeepSeek-V2-Lite requires 4x B200 GPUs. Qwen3-30B-A3B requires 16x B200 GPUs. If OOM occurs, check that no other processes are using GPU memory.

### Logs show no load-balance-loss

The validation scripts set `moe_load_balance_coef > 0` to ensure this metric is logged. If it is missing, check that the validation script (not an example script) was used.

### Tolerance too strict

FP8 operations and flash attention can introduce small non-determinism. If validation fails with very small differences, try increasing tolerance:

```bash
python3 .claude/skills/correctness-validation/scripts/compare.py \
    base.log feature.log --tolerance 4e-3
```

### Worktree conflicts

If the worktree was not cleaned up from a previous run, use `git worktree list` to find it and `git worktree remove <path> --force` to remove it.