Group 16 TorchTitan Training on Perlmutter #11
Open
IanHollow wants to merge 89 commits into cornell-sysphotonics:main from
…nd improve flexibility
…-torchtitan-train into fix-file-structure
…-torchtitan-train into fix-file-structure
Pull request overview
This PR adds TorchTitan training configurations and collected traces for Llama-3.1-8B across three different parallelism strategies (PP, FSDP+TP, and TP+PP), run by Group 16 on Perlmutter.
- Implements three workload configurations for Llama-3.1-8B training using different parallelism approaches
- Includes setup scripts, SLURM batch configurations, and analysis outputs for each workload
- Provides symbolic links to shared infrastructure files to reduce code duplication
Reviewed changes
Copilot reviewed 198 out of 277 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| trace_collection/llama3-8b-torchtitan-tp+pp-perlmutter-group-16/ | Pipeline parallelism (PP=4) workload configuration and generated analysis report |
| trace_collection/llama3-8b-torchtitan-pp-perlmutter-group-16/ | TP+PP hybrid workload with symlinks to shared scripts from fsdp workload |
| trace_collection/llama3-8b-torchtitan-fsdp-perlmutter-group-16/ | Base FSDP workload containing shared infrastructure (install, run scripts, environment config) |
| trace_collection/llama3-8b-torchtitan-fsdp+tp-perlmutter-group-16/ | FSDP+TP hybrid workload with configuration and analysis results |
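As a rough sanity check on the layouts listed above, the product of the parallelism degrees must equal the number of GPUs the job requests. The sketch below is not taken from this PR: `ParallelLayout` and its field names are hypothetical, and every degree except PP=4 (the only value stated in the table) is a placeholder. The default of 4 GPUs per node matches Perlmutter's GPU nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper: degrees of sharded data, tensor, and pipeline parallelism."""
    dp_shard: int = 1  # FSDP sharding degree
    tp: int = 1        # tensor-parallel degree
    pp: int = 1        # pipeline-parallel degree

    def world_size(self) -> int:
        # Each rank holds exactly one coordinate in the (dp_shard, tp, pp) mesh,
        # so the total rank count is the product of the degrees.
        return self.dp_shard * self.tp * self.pp

    def nodes_needed(self, gpus_per_node: int = 4) -> int:
        # Ceiling division: a partially filled node still has to be allocated.
        return -(-self.world_size() // gpus_per_node)


layouts = {
    "pp": ParallelLayout(pp=4),                   # PP=4 is stated in the review table
    "fsdp+tp": ParallelLayout(dp_shard=2, tp=4),  # placeholder degrees, not from the PR
    "tp+pp": ParallelLayout(tp=2, pp=4),          # placeholder degrees, not from the PR
}

for name, layout in layouts.items():
    print(f"{name}: {layout.world_size()} GPUs on {layout.nodes_needed()} node(s)")
```

A check like this can catch a mismatch between the degrees in a training config and the node and GPU counts requested in the SLURM batch script before the job is submitted.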
…/train_config.toml Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Group 16 trained and profiled three models using TorchTitan on Perlmutter.