[Docs] Add 5-node MiniMax example files #47
Conversation
Remove internal logging details, environment-specific paths, and FP4-specific wording from the 5-node MiniMax example files so they are safer to publish and easier to adapt. Fix the public usage comments to reference the 5-node example paths.
Pull request overview
Adds a new 5-node MiniMax-M2.5 (Eagle3) example configuration and launcher script to the TorchSpec examples/configs set, intended to document a 40-GPU (5-node) training/inference split workflow.
Changes:
- Added a new example `run.sh` for launching MiniMax-M2.5 5-node training with configurable training/inference GPU allocation.
- Added a new `sglang_minimax_m25_5node.yaml` config for the 5-node setup (model/dataset/training/inference/mooncake settings).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| examples/minimax-m25-5node-h200/run.sh | New launcher script for the 5-node MiniMax-M2.5 setup, including logging and (attempted) resume behavior. |
| configs/sglang_minimax_m25_5node.yaml | New example config defining the MiniMax-M2.5 5-node training + SGLang inference + mooncake parameters. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```sh
if [ -n "$CHECKPOINT_DIR" ]; then
    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
```
The resume logic is effectively always enabled: `CHECKPOINT_DIR` is derived from `output_dir`, so it is non-empty whenever `output_dir` is set. As a result, the script prints "No checkpoint tracker found" on fresh runs and always passes `training.load_path=...`, even when the directory doesn't exist. Gate the resume behavior on the checkpoint directory existing and containing `latest_checkpointed_iteration.txt` (or on an explicit resume flag) before setting `RESUME_ARGS` or printing resume messages.
Suggested change:
```diff
-if [ -n "$CHECKPOINT_DIR" ]; then
-    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
-    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
-        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
-        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+if [ -n "$CHECKPOINT_DIR" ] && [ -d "$CHECKPOINT_DIR" ]; then
+    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
+        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
+        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+        RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
```
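The gated behavior can be exercised in isolation. Below is a hedged, self-contained sketch of the same logic; the function name `build_resume_args` is illustrative and not taken from the script:

```shell
#!/usr/bin/env bash
# Illustrative sketch (not the script's actual code): RESUME_ARGS is
# populated only when the checkpoint directory exists AND contains the
# tracker file, so fresh runs pass no training.load_path override.
build_resume_args() {
    local dir="$1"
    RESUME_ARGS=()
    if [ -n "$dir" ] && [ -d "$dir" ] && [ -f "$dir/latest_checkpointed_iteration.txt" ]; then
        LAST_STEP=$(<"$dir/latest_checkpointed_iteration.txt")
        echo "Resuming from checkpoint: $dir/iter_$(printf '%07d' "$LAST_STEP")"
        RESUME_ARGS=(training.load_path="$dir")
    fi
}
```

On a fresh run `RESUME_ARGS` stays empty, so nothing extra is forwarded to `torchspec.train_entry`; once the tracker file appears, the resume message and the `training.load_path` override are emitted together.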
```sh
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
`PYTORCH_CUDA_ALLOC_CONF` is exported here, but TorchSpec Ray actors only receive a small allowlist of environment variables (via `get_torchspec_env_vars()`), which currently does not include `PYTORCH_CUDA_ALLOC_CONF`. As a result, the allocator setting likely won't apply to the training/inference worker processes where it matters. Either remove the export to avoid a false sense of effect, or plumb it through (e.g., add it to the env allowlist / `train_env_vars`) so it reaches all Ray actors.
```sh
python3 -m torchspec.train_entry \
    --config "$CONFIG_FILE" \
    training.training_num_nodes="$TRAIN_NODES" \
    training.training_num_gpus_per_node="$TRAIN_GPUS" \
    inference.inference_num_gpus="$INFERENCE_GPUS" \
```
This example targets a multi-node (5-node) run, but the script doesn't preflight that it's connected to a Ray cluster. If Ray isn't running or `RAY_ADDRESS` isn't set, TorchSpec may start a local Ray instance and then fail only after waiting for 40 GPUs, which is slow and gives no actionable error. Consider adding a `ray status` preflight check (as in the other multi-node examples) with clear instructions to start/join the cluster or set `RAY_ADDRESS` before invoking `torchspec.train_entry`.
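One possible shape for such a preflight (a sketch only; the function name and messages are illustrative, not taken from the existing examples):

```shell
#!/usr/bin/env bash
# Illustrative preflight: succeed if RAY_ADDRESS is set or `ray status`
# can reach a running cluster; otherwise fail fast with instructions.
preflight_ray() {
    if [ -n "${RAY_ADDRESS:-}" ]; then
        return 0
    fi
    if ray status >/dev/null 2>&1; then
        return 0
    fi
    echo "ERROR: no Ray cluster detected." >&2
    echo "Start the head node with 'ray start --head', join workers with" >&2
    echo "'ray start --address=<head-ip>:6379', or set RAY_ADDRESS, then rerun." >&2
    return 1
}
# Intended usage in run.sh, before launching:
#   preflight_ray || exit 1
```

Calling this before `python3 -m torchspec.train_entry` turns a slow 40-GPU placement timeout into an immediate, actionable error.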