[Docs] Add 5-node MiniMax example files #47
Conversation
Remove internal logging details, environment-specific paths, and FP4-specific wording from the 5-node MiniMax example files so they are safer to publish and easier to adapt. Fix the public usage comments to reference the 5-node example paths.
Pull request overview
Adds a new 5-node MiniMax-M2.5 (Eagle3) example configuration and launcher script to the TorchSpec examples/configs set, intended to document a 40-GPU (5-node) training/inference split workflow.
Changes:
- Added a new example `run.sh` for launching MiniMax-M2.5 5-node training with configurable training/inference GPU allocation.
- Added a new `sglang_minimax_m25_5node.yaml` config for the 5-node setup (model/dataset/training/inference/mooncake settings).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| examples/minimax-m25-5node-h200/run.sh | New launcher script for the 5-node MiniMax-M2.5 setup, including logging and (attempted) resume behavior. |
| configs/sglang_minimax_m25_5node.yaml | New example config defining the MiniMax-M2.5 5-node training + SGLang inference + mooncake parameters. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
```sh
if [ -n "$CHECKPOINT_DIR" ]; then
    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
```
The resume logic is effectively always enabled: `CHECKPOINT_DIR` is derived from `output_dir`, so it is non-empty whenever `output_dir` is set. As a result, the script prints "No checkpoint tracker found" on fresh runs and always passes `training.load_path=...`, even when the directory doesn't exist. Gate the resume behavior on the checkpoint directory existing and containing `latest_checkpointed_iteration.txt` (or on an explicit resume flag) before setting `RESUME_ARGS` or printing resume messages.
Suggested change:
```diff
-if [ -n "$CHECKPOINT_DIR" ]; then
-    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
-    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
-        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
-        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+if [ -n "$CHECKPOINT_DIR" ] && [ -d "$CHECKPOINT_DIR" ]; then
+    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
+        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
+        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+        RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
```
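The gated behavior can be exercised in isolation. Below is a hedged, self-contained sketch of the same logic; the function name `build_resume_args` is illustrative and not taken from the script:

```shell
#!/usr/bin/env bash
# Illustrative sketch (not the script's actual code): RESUME_ARGS is
# populated only when the checkpoint directory exists AND contains the
# tracker file, so fresh runs pass no training.load_path override.
build_resume_args() {
    local dir="$1"
    RESUME_ARGS=()
    if [ -n "$dir" ] && [ -d "$dir" ] && [ -f "$dir/latest_checkpointed_iteration.txt" ]; then
        LAST_STEP=$(<"$dir/latest_checkpointed_iteration.txt")
        echo "Resuming from checkpoint: $dir/iter_$(printf '%07d' "$LAST_STEP")"
        RESUME_ARGS=(training.load_path="$dir")
    fi
}
```

On a fresh run `RESUME_ARGS` stays empty, so nothing extra is forwarded to `torchspec.train_entry`; once the tracker file appears, the resume message and the `training.load_path` override are emitted together.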
```sh
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
`PYTORCH_CUDA_ALLOC_CONF` is exported here, but TorchSpec Ray actors only receive a small allowlist of environment variables (via `get_torchspec_env_vars()`), which currently does not include `PYTORCH_CUDA_ALLOC_CONF`. As a result, the allocator setting likely won't apply to the training/inference worker processes where it matters. Either remove the export to avoid a false sense of effect, or plumb it through (e.g., add it to the env allowlist / `train_env_vars`) so it reaches all Ray actors.
```sh
python3 -m torchspec.train_entry \
    --config "$CONFIG_FILE" \
    training.training_num_nodes="$TRAIN_NODES" \
    training.training_num_gpus_per_node="$TRAIN_GPUS" \
    inference.inference_num_gpus="$INFERENCE_GPUS" \
```
This example targets a multi-node (5-node) run, but the script doesn't preflight that it's connected to a Ray cluster. If Ray isn't running or `RAY_ADDRESS` isn't set, TorchSpec may start a local Ray instance and then fail only after waiting for 40 GPUs, which is slow and gives no actionable error. Consider adding a `ray status` preflight check (as in the other multi-node examples) with clear instructions to start/join the cluster or set `RAY_ADDRESS` before invoking `torchspec.train_entry`.
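One possible shape for such a preflight (a sketch only; the function name and messages are illustrative, not taken from the existing examples):

```shell
#!/usr/bin/env bash
# Illustrative preflight: succeed if RAY_ADDRESS is set or `ray status`
# can reach a running cluster; otherwise fail fast with instructions.
preflight_ray() {
    if [ -n "${RAY_ADDRESS:-}" ]; then
        return 0
    fi
    if ray status >/dev/null 2>&1; then
        return 0
    fi
    echo "ERROR: no Ray cluster detected." >&2
    echo "Start the head node with 'ray start --head', join workers with" >&2
    echo "'ray start --address=<head-ip>:6379', or set RAY_ADDRESS, then rerun." >&2
    return 1
}
# Intended usage in run.sh, before launching:
#   preflight_ray || exit 1
```

Calling this before `python3 -m torchspec.train_entry` turns a slow 40-GPU placement timeout into an immediate, actionable error.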