Group 16 TorchTitan Training on Perlmutter #11

Open
IanHollow wants to merge 89 commits into cornell-sysphotonics:main from IanHollow:torchtitan-perlmutter-train

@IanHollow
Group 16 trained and profiled three models using TorchTitan on Perlmutter.

IanHollow and others added 30 commits December 1, 2025 16:08
Copilot AI review requested due to automatic review settings December 20, 2025 18:16
Copilot AI left a comment

Pull request overview

This PR adds TorchTitan training configurations and collected traces for Llama-3.1-8B across three different parallelism strategies (PP, FSDP+TP, and TP+PP) executed by group 16 on Perlmutter.

  • Implements three workload configurations for Llama-3.1-8B training using different parallelism approaches
  • Includes setup scripts, SLURM batch configurations, and analysis outputs for each workload
  • Provides symbolic links to shared infrastructure files to reduce code duplication
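
The symlink arrangement mentioned in the last bullet can be illustrated with a minimal Python sketch. It assumes one workload directory holds the shared scripts and the others link to them with relative symlinks. The helper `link_shared_files`, the directory names, and the file name `run.sh` are hypothetical and are not taken from the PR.

```python
import os
import tempfile
from pathlib import Path


def link_shared_files(shared_dir: Path, workload_dir: Path, names: list[str]) -> list[Path]:
    """Create relative symlinks in workload_dir that point at files in shared_dir."""
    workload_dir.mkdir(parents=True, exist_ok=True)
    links = []
    for name in names:
        target = shared_dir / name
        if not target.is_file():
            raise FileNotFoundError(target)
        link = workload_dir / name
        # Replace any stale file or link so the helper can be re-run safely.
        if link.is_symlink() or link.exists():
            link.unlink()
        # A relative target keeps the link valid wherever the repository is cloned.
        link.symlink_to(os.path.relpath(target, start=workload_dir))
        links.append(link)
    return links


def demo() -> list[str]:
    """Build a throwaway layout (hypothetical names) and return the link targets."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        shared = root / "fsdp-workload"
        shared.mkdir()
        (shared / "run.sh").write_text("#!/bin/bash\necho shared\n")
        links = link_shared_files(shared, root / "pp-workload", ["run.sh"])
        return [os.readlink(link) for link in links]


if __name__ == "__main__":
    print(demo())
```

Relative targets are used here because absolute ones would break when the repository is checked out at a different path, for example on another user's Perlmutter scratch space.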

Reviewed changes

Copilot reviewed 198 out of 277 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| trace_collection/llama3-8b-torchtitan-tp+pp-perlmutter-group-16/ | TP+PP hybrid workload with symlinks to shared scripts from fsdp workload |
| trace_collection/llama3-8b-torchtitan-pp-perlmutter-group-16/ | Pipeline parallelism (PP=4) workload configuration and generated analysis report |
| trace_collection/llama3-8b-torchtitan-fsdp-perlmutter-group-16/ | Base FSDP workload containing shared infrastructure (install, run scripts, environment config) |
| trace_collection/llama3-8b-torchtitan-fsdp+tp-perlmutter-group-16/ | FSDP+TP hybrid workload with configuration and analysis results |

Comment thread trace_collection/llama3-8b-torchtitan-fsdp-perlmutter-group-16/train_config.toml Outdated
Comment thread trace_collection/llama3-8b-torchtitan-fsdp-perlmutter-group-16/train_config.toml Outdated
Comment thread trace_collection/llama3-8b-torchtitan-fsdp-perlmutter-group-16/train_config.toml Outdated
IanHollow and others added 2 commits December 20, 2025 10:18
…/train_config.toml

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

2 participants