Add Megatron SFT internal checkpoints#645

Merged
Kovbo merged 3 commits into main from codex/megatron-sft-internal-checkpoints
Apr 8, 2026
Merged

Add Megatron SFT internal checkpoints#645
Kovbo merged 3 commits intomainfrom
codex/megatron-sft-internal-checkpoints

Conversation

@Kovbo Kovbo commented Apr 8, 2026

Since Megatron runs in a subprocess, the serverless backend cannot reliably snapshot live training state by copying files from the parent process during long SFT runs. This PR changes internal checkpointing so the Megatron worker saves its own periodic checkpoints inside ART, and the serverless backend only uploads those intermediate checkpoints as artifacts for resume. This keeps the training logic in one place, makes long-running Megatron SFT jobs resumable after pod restarts or deploys, and leaves the final checkpoint to the normal end-of-training save path.

@Kovbo Kovbo requested a review from FurtherAI on Apr 8, 2026 01:39
@Kovbo Kovbo merged commit 953577a into main on Apr 8, 2026
5 checks passed


2 participants