
[Docs] Add 5-node MiniMax example files #47

Merged
yubofredwang merged 2 commits into main from ywang/minimax-5node-sanitize on Mar 19, 2026

Conversation

@yubofredwang
Collaborator

Add 5-node MiniMax example files

Remove internal logging details, environment-specific paths, and FP4-specific wording from the 5-node MiniMax example files so they are safer to publish and easier to adapt. Fix the public usage comments to reference the 5-node example paths.
@yubofredwang yubofredwang marked this pull request as ready for review March 19, 2026 22:13
Copilot AI review requested due to automatic review settings March 19, 2026 22:13
Contributor

Copilot AI left a comment


Pull request overview

Adds a new 5-node MiniMax-M2.5 (Eagle3) example configuration and launcher script to the TorchSpec examples/configs set, intended to document a 40-GPU (5-node) training/inference split workflow.

Changes:

  • Added a new example run.sh for launching MiniMax-M2.5 5-node training with configurable training/inference GPU allocation.
  • Added a new sglang_minimax_m25_5node.yaml config for the 5-node setup (model/dataset/training/inference/mooncake settings).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Files:

  • examples/minimax-m25-5node-h200/run.sh: new launcher script for the 5-node MiniMax-M2.5 setup, including logging and (attempted) resume behavior.
  • configs/sglang_minimax_m25_5node.yaml: new example config defining the MiniMax-M2.5 5-node training + SGLang inference + mooncake parameters.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 19, 2026 22:17
@yubofredwang yubofredwang merged commit 2cc6512 into main Mar 19, 2026
2 checks passed
@yubofredwang yubofredwang deleted the ywang/minimax-5node-sanitize branch March 19, 2026 22:17

Comment on lines +60 to +64
```sh
if [ -n "$CHECKPOINT_DIR" ]; then
    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
```

Copilot AI Mar 19, 2026


The resume logic is effectively always enabled because CHECKPOINT_DIR is derived from output_dir (so it will be non-empty whenever output_dir is set), which makes the script print "No checkpoint tracker found" on fresh runs and always passes training.load_path=... even when the directory doesn’t exist. Gate resume behavior on the checkpoint directory existing and containing latest_checkpointed_iteration.txt (or another explicit resume flag) before setting RESUME_ARGS / printing resume messages.

Suggested change

```diff
-if [ -n "$CHECKPOINT_DIR" ]; then
-    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
-    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
-        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
-        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+if [ -n "$CHECKPOINT_DIR" ] && [ -d "$CHECKPOINT_DIR" ]; then
+    if [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
+        LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
+        echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
+        RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
```

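The gated resume behavior described in this review can be sketched as a self-contained snippet. This is only a sketch: it reuses the variable names from the example script (CHECKPOINT_DIR, RESUME_ARGS) and omits the surrounding launcher, and the fallback message is illustrative.

```shell
#!/usr/bin/env bash
# Sketch of resume logic gated on the checkpoint actually existing,
# assuming the variable names used in the example run.sh.
set -euo pipefail

# In the real launcher CHECKPOINT_DIR is derived from output_dir;
# default to empty here so the snippet runs standalone.
CHECKPOINT_DIR="${CHECKPOINT_DIR:-}"
RESUME_ARGS=()

# Resume only when the directory exists AND contains the iteration tracker,
# so fresh runs never pass training.load_path for a missing checkpoint.
if [ -n "$CHECKPOINT_DIR" ] && [ -d "$CHECKPOINT_DIR" ] \
    && [ -f "$CHECKPOINT_DIR/latest_checkpointed_iteration.txt" ]; then
    LAST_STEP=$(<"$CHECKPOINT_DIR/latest_checkpointed_iteration.txt")
    echo "Resuming from checkpoint: $CHECKPOINT_DIR/iter_$(printf '%07d' "$LAST_STEP")"
    RESUME_ARGS=(training.load_path="$CHECKPOINT_DIR")
else
    echo "No checkpoint tracker found; starting a fresh run."
fi
```

With this shape, RESUME_ARGS stays empty on a fresh run, so the training invocation can splice it in unconditionally (e.g. `"${RESUME_ARGS[@]}"` with Bash 4.4+) without ever pointing training.load_path at a nonexistent directory.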
Comment on lines +21 to +22
```sh
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```

Copilot AI Mar 19, 2026


PYTORCH_CUDA_ALLOC_CONF is exported here, but TorchSpec Ray actors only receive a small allowlist of environment variables (via get_torchspec_env_vars()), which currently does not include PYTORCH_CUDA_ALLOC_CONF. As a result, the allocator setting likely won’t apply to the training/inference worker processes where it matters. Either remove this export to avoid a false sense of effect, or plumb it through (e.g., add it to the env allowlist / train_env_vars) so it reaches all Ray actors.

Suggested change

```diff
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
Comment on lines +86 to +90
```sh
python3 -m torchspec.train_entry \
    --config "$CONFIG_FILE" \
    training.training_num_nodes="$TRAIN_NODES" \
    training.training_num_gpus_per_node="$TRAIN_GPUS" \
    inference.inference_num_gpus="$INFERENCE_GPUS" \
```

Copilot AI Mar 19, 2026


This example targets a multi-node (5-node) run, but the script doesn’t preflight that it’s connected to a Ray cluster. If Ray isn’t running / RAY_ADDRESS isn’t set, TorchSpec may start a local Ray instance and then fail after waiting for 40 GPUs, which is slow and not very actionable. Consider adding a ray status (as in the other multi-node examples) with clear instructions to start/join the cluster or set RAY_ADDRESS before invoking torchspec.train_entry.

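The preflight this review asks for can be sketched as a small guard function. This is a sketch under assumptions: the function name is hypothetical, and it only checks that the `ray` CLI is present and that `ray status` can reach a cluster; the hint messages use standard Ray commands.

```shell
#!/usr/bin/env bash
# Hedged sketch of a Ray preflight for the 5-node launcher: fail fast with
# actionable instructions instead of letting TorchSpec start a local Ray
# instance and stall waiting for 40 GPUs.
set -euo pipefail

check_ray_cluster() {
    # Succeeds only when the `ray` CLI exists and `ray status` reaches a cluster.
    if ! command -v ray >/dev/null 2>&1 || ! ray status >/dev/null 2>&1; then
        echo "ERROR: no reachable Ray cluster." >&2
        echo "Start the head node with 'ray start --head', join workers with" >&2
        echo "'ray start --address=<head-ip>:6379', or set RAY_ADDRESS." >&2
        return 1
    fi
}

# In the launcher this would gate the training invocation, e.g.:
#   check_ray_cluster || exit 1
#   python3 -m torchspec.train_entry --config "$CONFIG_FILE" ...
```

Running the check before `torchspec.train_entry` turns a slow 40-GPU wait-and-fail into an immediate, self-explanatory error.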
zhubohao911 pushed a commit to zhubohao911/TorchSpec that referenced this pull request Mar 23, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
zhubohao911 pushed a commit to zhubohao911/TorchSpec that referenced this pull request Mar 23, 2026
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>