
Commit cc44cb3

Revise the README
1 parent 63af8cd commit cc44cb3

1 file changed

Lines changed: 19 additions & 13 deletions

File tree

README.md

@@ -12,7 +12,7 @@ PithTrain is built to be understood — by humans and AI agents alike. At ~10K l

## Installation

-Hopper (SM90) or Blackwell (SM100) GPUs are required. CUDA 13.0 and Python >= 3.12 are required. We use [uv](https://docs.astral.sh/uv/) to manage project dependencies.
+NVIDIA Hopper (SM90) or Blackwell (SM100) GPUs are required. CUDA 13.0 and Python >= 3.12 are required. We use [uv](https://docs.astral.sh/uv/) to manage project dependencies.

```bash
git clone https://github.com/mlc-ai/Pith-Train.git && cd Pith-Train
@@ -33,24 +33,36 @@ uv sync

## Getting Started

-Here is an example of pretraining Qwen3-30B-A3B. Other models like DeepSeek-V2-Lite are also supported; see the [`examples`](examples/) directory for more configurations.
+Pretrain Qwen3-30B-A3B from scratch. Datasets and checkpoints are stored in the `workspace` folder by default. Other models like DeepSeek-V2-Lite follow the same steps. See [`examples`](examples) for available configurations.

-**1. Build a tokenized dataset:**
+**1. Prepare the dataset**

```bash
bash examples/build_tokenized_corpus/launch.sh dclm-qwen3
```

-This downloads and tokenizes the dataset to `workspace/datasets/dclm-baseline/toktxt/qwen3`.
+Download and tokenize the DCLM pretraining corpus into mmap-friendly packed sequences. Each model uses its own tokenizer, so switching to a different model requires running this step again.
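As a quick sanity check, the tokenized output can be mapped into memory with `numpy`. The sketch below is hypothetical: the shard file name, `uint32` dtype, and fixed sequence length are assumptions for illustration, not a format documented by the build script (the earlier README only names the output directory `workspace/datasets/dclm-baseline/toktxt/qwen3`).

```python
# Illustrative only: inspect a tokenized shard as a flat array of token ids.
# The file name, uint32 dtype, and fixed sequence length are assumptions made
# for this sketch; the real layout produced by build_tokenized_corpus may differ.
import numpy as np

SEQ_LEN = 4096  # assumed packed sequence length
shard_path = "workspace/datasets/dclm-baseline/toktxt/qwen3/shard-00000.bin"  # hypothetical file

tokens = np.memmap(shard_path, dtype=np.uint32, mode="r")
num_seqs = len(tokens) // SEQ_LEN
packed = tokens[: num_seqs * SEQ_LEN].reshape(num_seqs, SEQ_LEN)
print(f"{num_seqs} packed sequences of length {SEQ_LEN}")
print("first 10 token ids of sequence 0:", packed[0, :10])
```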

-**2. Review the training config** at [`examples/pretrain_language_model/qwen3-30b-a3b/script.py`](examples/pretrain_language_model/qwen3-30b-a3b/script.py) and adjust parallelism sizes, batch size, or other hyperparameters to match your cluster.
+**2. Configure training**

-**3. Launch training:**
+Edit [`examples/pretrain_language_model/qwen3-30b-a3b/script.py`](examples/pretrain_language_model/qwen3-30b-a3b/script.py) to adjust parallelism, batch size, learning rate, and other hyperparameters. The model architecture is defined in the accompanying [`config.json`](examples/pretrain_language_model/qwen3-30b-a3b/config.json).
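For a rough sense of how these knobs interact, the effective global batch size is the micro-batch size times the gradient-accumulation steps times the data-parallel degree. The values and variable names below are placeholders, not settings read from `script.py`.

```python
# Back-of-the-envelope check of the effective batch size; all values are
# placeholders and are not taken from script.py.
micro_batch_size = 1        # sequences per GPU per forward/backward pass
grad_accum_steps = 8        # micro-batches accumulated per optimizer step
data_parallel_size = 16     # number of data-parallel replicas
seq_len = 4096              # assumed packed sequence length

global_batch_size = micro_batch_size * grad_accum_steps * data_parallel_size
tokens_per_step = global_batch_size * seq_len
print(f"global batch: {global_batch_size} sequences, {tokens_per_step:,} tokens per optimizer step")
```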
+
+**3. Launch training**

```bash
bash examples/pretrain_language_model/launch.sh qwen3-30b-a3b
```

+The launch script auto-detects GPUs and supports both single-node and multi-node (SLURM) setups. Training resumes from the latest checkpoint automatically, and checkpoints are reshardable across different parallelism configurations.
+
+**4. Export checkpoint**
+
+```bash
+bash examples/convert_checkpoint/launch.sh qwen3-30b-a3b
+```
+
+Convert a training checkpoint to standard Hugging Face format for evaluation or inference. The same tool also supports importing Hugging Face checkpoints for continued pretraining.
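The exported directory can then be loaded like any other Hugging Face checkpoint. This is a minimal sketch assuming `transformers` and `accelerate` are installed; the export path below is an assumed location, not one documented here.

```python
# Hypothetical usage of an exported checkpoint via Hugging Face transformers;
# the export directory is an assumed path, not one documented in this README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt_dir = "workspace/checkpoints/qwen3-30b-a3b/hf_export"  # assumed location

tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(
    ckpt_dir, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```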

## Architecture

<p align="center">
@@ -68,15 +80,9 @@ PithTrain is structured in three layers:
- *Training Infrastructure* — `torch.compile`, optimizer and LR scheduling, checkpointing, logging, etc.
- **Operators** — PyTorch (basic ops, NCCL), operator libraries (DeepGEMM, FlashAttention), and Python DSLs (Triton, TileLang).

-## Model Compatibility and Evaluation
-
-Checkpoints can be converted between PyTorch Distributed Checkpoint (DCP) and Hugging Face `safetensors` via [examples/convert_checkpoint](examples/convert_checkpoint/).
-
-The exported checkpoints are Hugging Face-compatible and can be used with evaluation tools like `lm-evaluation-harness` and inference engines like `vLLM` and `SGLang`.
-
## Attribution

-PithTrain is built on top of DeepSeek's [DualPipe](https://github.com/deepseek-ai/DualPipe), which provides the original pipeline parallelism schedule and example code.
+PithTrain is developed by contributors from CMU. It is built on top of DeepSeek's [DualPipe](https://github.com/deepseek-ai/DualPipe), which provides the original pipeline parallelism schedule and examples. We thank the [CMU Foundation and Language Model (FLAME) Center](https://www.cmu.edu/flame/) for providing the compute resources to develop PithTrain. We also acknowledge NVIDIA for its support with DGX B200 systems.

## License