
[Dev] Add Llama3 training example and fix cache save #14

Open
wtr0504 wants to merge 3 commits into SandAI-org:main from wtr0504:dev/training

Conversation


@wtr0504 wtr0504 commented Apr 1, 2026

🗂️ PR Category

  • ✨ New Feature
  • 🚀 Optimization (performance, memory, etc.)
  • 💥 Breaking Change
  • 🐛 Bug Fix
  • 🛠️ Development / Refactoring
  • 📚 Documentation
  • 🧹 Chore (Dependencies, CI/CD, Configuration, etc.)
  • 🧪 Testing

📝 Description

Summary

Add end-to-end Llama3 training example (example/training/) with FSDP support, a distributed training script, and an Nsys profiling launch script.
Fix a cache save bug where aot_autograd artifacts were empty, causing compiled graphs to fail to persist correctly.

Changes

example/training/llama3.py — Llama3 model definition adapted to use magi_compile
example/training/train.py — distributed training loop with FSDP and NVTX profiling hooks
example/training/train.sh — torchrun launcher with optional Nsys profiling
magi_compiler/magi_backend/piecewise_compiler.py — workaround for empty aot_autograd artifacts on cache save
magi_compiler/utils/nvtx.py — NVTX-based per-iteration profiling helper
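For reviewers, a minimal sketch of what a per-iteration NVTX helper like magi_compiler/utils/nvtx.py might look like; the names and API here are assumptions, not the PR's actual code, and the helper degrades to a no-op when CUDA (or PyTorch) is unavailable:

```python
# Hypothetical sketch of a per-iteration NVTX helper; the actual
# implementation in magi_compiler/utils/nvtx.py may differ.
from contextlib import contextmanager

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # let the sketch run without PyTorch installed
    torch = None
    _HAS_CUDA = False


@contextmanager
def nvtx_iteration(step: int):
    """Wrap one training step in an NVTX range for the Nsight Systems timeline."""
    if _HAS_CUDA:
        torch.cuda.nvtx.range_push(f"iteration_{step}")
    try:
        yield
    finally:
        if _HAS_CUDA:
            torch.cuda.nvtx.range_pop()


# Usage: annotate each step of the training loop.
for step in range(2):
    with nvtx_iteration(step):
        pass  # forward / backward / optimizer.step() would go here
```

Ranges pushed this way show up as named spans in an `nsys` trace, which is what the train.sh launcher below is meant to capture.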

from dataclasses import dataclass
from typing import Optional, Tuple

import fairscale.nn.model_parallel.initialize as fs_init

Do we need to add this pkg to requirements-test.txt?

device = torch.device("cpu")

# Initialize a small config for testing
config = ModelArgs(n_layers=10, max_batch_size=2, max_seq_len=1024)

Use official config for profiling
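To make the suggestion concrete: the published Llama 3 8B hyperparameters are sketched below. The ModelArgs field names are assumptions based on the snippet above, and the exact dataclass in the PR may differ:

```python
from dataclasses import dataclass


# Minimal stand-in for the ModelArgs dataclass used in example/training/llama3.py
# (field names here are assumptions inferred from the review snippet).
@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: int = 8       # Llama 3 uses grouped-query attention
    vocab_size: int = 128256
    max_batch_size: int = 2
    max_seq_len: int = 8192   # Llama 3 context length


# Official Llama 3 8B sizes, instead of the small 10-layer test config:
config = ModelArgs(n_layers=32, max_batch_size=2, max_seq_len=8192)
```

Profiling against the official shape matters because kernel timings for a 10-layer, 1024-token toy model are not representative of the full 32-layer, 8192-token workload.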

export MAGI_ENABLE_FX_GRAPH_VIZ=${MAGI_ENABLE_FX_GRAPH_VIZ:-false}

$NSYS_CMD torchrun $DISTRIBUTED_ARGS $SCRIPT_DIR/train.py \
$NSYS_ARGS

No NSYS_ARGS provided? Check again and try to simplify this script~
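One possible simplification, sketched here with assumed variable names and paths (the actual train.sh in the PR differs): put the nsys options on the `nsys profile` invocation itself rather than in a trailing `$NSYS_ARGS` after train.py, where torchrun would pass them to the training script.

```shell
#!/usr/bin/env bash
# Hypothetical simplified launcher; variable names and flags are assumptions.
set -euo pipefail

NPROC=${NPROC:-8}
PROFILE=${PROFILE:-0}

NSYS_CMD=""
if [ "$PROFILE" = "1" ]; then
    # nsys options belong on the nsys invocation, not after train.py
    NSYS_CMD="nsys profile --trace=cuda,nvtx,osrt -o llama3_train"
fi

$NSYS_CMD torchrun --nproc_per_node="$NPROC" train.py
```

With this shape there is no separate `$NSYS_ARGS` variable to forget, and the non-profiling path is plain `torchrun`.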

