PithTrain is built to be understood — by humans and AI agents alike.
## Installation
NVIDIA Hopper (SM90) or Blackwell (SM100) GPUs are required, along with CUDA 13.0 and Python >= 3.12. We use [uv](https://docs.astral.sh/uv/) to manage project dependencies.
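A minimal setup sketch, assuming a standard uv workflow; the repository URL is left as a placeholder:

```bash
# Clone the repository (placeholder URL) and enter it.
git clone <pithtrain-repo-url>
cd pithtrain

# Create the project virtual environment and install the pinned
# dependencies declared in pyproject.toml / uv.lock.
uv sync
```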
Pretrain Qwen3-30B-A3B from scratch. Datasets and checkpoints are stored in the `workspace` folder by default. Other models like DeepSeek-V2-Lite follow the same steps. See [`examples`](examples) for available configurations.
**1. Prepare the dataset**

Download and tokenize the DCLM pretraining corpus into mmap-friendly packed sequences, written to `workspace/datasets/dclm-baseline/toktxt/qwen3` by default. Each model uses its own tokenizer, so switching to a different model requires running this step again.
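The data-preparation entry point is not reproduced here, so the command below is a hypothetical sketch; the script name is an assumption, and the real one lives under the example directory:

```bash
# Hypothetical invocation; check the example directory for the actual
# data-preparation script. Output is written to
# workspace/datasets/dclm-baseline/toktxt/qwen3 by default.
uv run python examples/pretrain_language_model/qwen3-30b-a3b/data.py
```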
**2. Configure training**
Edit [`examples/pretrain_language_model/qwen3-30b-a3b/script.py`](examples/pretrain_language_model/qwen3-30b-a3b/script.py) to adjust parallelism, batch size, learning rate, and other hyperparameters. The model architecture is defined in the accompanying [`config.json`](examples/pretrain_language_model/qwen3-30b-a3b/config.json).
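To sanity-check the architecture before launching, the config can be pretty-printed with the Python standard library, for example:

```bash
# Pretty-print the model architecture definition (path from this README).
python -m json.tool examples/pretrain_language_model/qwen3-30b-a3b/config.json
```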
**3. Launch training**

The launch script auto-detects GPUs and supports both single-node and multi-node (SLURM) setups. Training resumes from the latest checkpoint automatically, and checkpoints are reshardable across different parallelism configurations.
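As a sketch, a single-node run might look like the following; treat the exact invocation as an assumption and defer to the example's own instructions:

```bash
# Hypothetical single-node launch; the script is said to auto-detect GPUs
# and to resume from the latest checkpoint automatically.
uv run python examples/pretrain_language_model/qwen3-30b-a3b/script.py
```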
**4. Convert checkpoints**

Convert a training checkpoint to standard Hugging Face format for evaluation or inference. The same tool also supports importing Hugging Face checkpoints for continued pretraining.
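A hypothetical conversion sketch; the converter lives in [`examples/convert_checkpoint`](examples/convert_checkpoint/), but the script name, flags, and paths below are assumptions:

```bash
# Hypothetical: export a DCP training checkpoint to Hugging Face safetensors.
uv run python examples/convert_checkpoint/convert.py \
  --input workspace/checkpoints/qwen3-30b-a3b/step_10000 \
  --output workspace/exports/qwen3-30b-a3b-hf
```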
## Architecture
PithTrain is structured in three layers:

- *Training Infrastructure* — `torch.compile`, optimizer and LR scheduling, checkpointing, logging, etc.
Checkpoints can be converted between PyTorch Distributed Checkpoint (DCP) and Hugging Face `safetensors` via [examples/convert_checkpoint](examples/convert_checkpoint/).
The exported checkpoints are Hugging Face-compatible and can be used with evaluation tools like `lm-evaluation-harness` and inference engines like `vLLM` and `SGLang`.
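For example, an exported checkpoint can be scored with `lm-evaluation-harness`; the local path is a placeholder:

```bash
# Evaluate the exported Hugging Face checkpoint on a sample task.
lm_eval --model hf \
  --model_args pretrained=workspace/exports/qwen3-30b-a3b-hf \
  --tasks hellaswag \
  --batch_size 8
```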
## Attribution
PithTrain is developed by contributors from CMU. It is built on top of DeepSeek's [DualPipe](https://github.com/deepseek-ai/DualPipe), which provides the original pipeline parallelism schedule and examples. We thank the [CMU Foundation and Language Model (FLAME) Center](https://www.cmu.edu/flame/) for providing the compute resources to develop PithTrain, and NVIDIA for supporting this work with DGX B200 systems.