Make checkpointing better by finbarrtimbers · Pull Request #1647 · allenai/open-instruct

finbarrtimbers · 2026-04-29T21:51:04Z

Summary

Pull the checkpoint-state save into the timed save path and cut redundant work along the way.

Move the inline checkpoint-state save out of run_training and into a new maybe_save_checkpoint_state helper called from one_training_step, so its duration counts toward the time/saving metric and num_total_tokens reflects the just-finished step.
Log checkpoint size (GiB) and average write bandwidth (MiB/s) for the most recent global_step{N} directory after each save, for I/O monitoring.
Stop pickling the dataloader and data-prep-actor state into every rank's mp_rank_*_model_states.pt. Rank 0 now writes a single driver_state.pt next to the DeepSpeed checkpoint; the load path picks it up and merges it into states.
Skip-when-clean for the reference policy: introduce a shared should_save_ref_policy(args, training_step) predicate and use it both at the EMA-update site and the save site, so off-cadence checkpoints don't rewrite an unchanged ref-policy file. An assert in main() enforces checkpoint_state_freq % ref_policy_update_freq == 0 so saves always land on update steps.
Drop the redundant per-device torch_cuda_rng_states dict; torch_cuda_rng_state_all already covers every device.
Restore short WHY comments on the two non-obvious bits: ref_policy bypassing DeepSpeed's saver (DummyOptim has no state_dict) and the mpu detach / all-ranks save_checkpoint contract.
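
For readers following the skip-when-clean item, here is a minimal sketch of how a shared predicate and the divisibility assert could fit together. Only the name should_save_ref_policy(args, training_step) and the assert condition come from this PR; the Args dataclass, the function body, and check_checkpoint_cadence are hypothetical stand-ins, not the actual open-instruct code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Args:
    # Hypothetical stand-in for the training config; only the fields used here.
    load_ref_policy: bool = True
    ref_policy_update_freq: Optional[int] = None
    checkpoint_state_freq: int = -1


def should_save_ref_policy(args: Args, training_step: int) -> bool:
    """Assumed body: True only on steps where the ref policy is updated."""
    if not args.load_ref_policy or args.ref_policy_update_freq is None:
        return False
    return training_step % args.ref_policy_update_freq == 0


def check_checkpoint_cadence(args: Args) -> None:
    """Hypothetical helper mirroring the assert described for main()."""
    if args.checkpoint_state_freq > 0 and args.ref_policy_update_freq is not None:
        assert args.checkpoint_state_freq % args.ref_policy_update_freq == 0, (
            "checkpoint_state_freq must be a multiple of ref_policy_update_freq "
            "so that every checkpoint lands on a ref-policy update step"
        )
```

Because the update site and the save site would call the same predicate, a checkpoint taken on a step where the predicate is False can skip rewriting the ref-policy file without risking a stale copy.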

No changes to checkpoint format on the load side beyond reading driver_state.pt when present.

Test plan

Run a debug GRPO script with --checkpoint_state_freq set and confirm the new size/bandwidth log line appears and that time/saving reflects the checkpoint-state save.
Resume from a fresh checkpoint and verify dataloader / data-prep-actor state restore via driver_state.pt.
Confirm with load_ref_policy=True and a configured ref_policy_update_freq (a divisor of checkpoint_state_freq, as the new assert requires) that the ref-policy file is only rewritten on update steps.
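
As a reference point for the first test step, the size and bandwidth figures could be derived roughly as below. This is a sketch under assumptions: the function names and the exact log wording are hypothetical, and only the units (GiB for size, MiB/s for average write bandwidth) come from the summary.

```python
import os


def checkpoint_size_bytes(checkpoint_dir: str) -> int:
    """Sum the sizes of all files under checkpoint_dir, recursively."""
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(checkpoint_dir)
        for name in files
    )


def format_checkpoint_stats(total_bytes: int, save_seconds: float) -> str:
    """Format size in GiB and average write bandwidth in MiB/s."""
    size_gib = total_bytes / 2**30
    # Guard against a zero duration so the log line never raises.
    if save_seconds > 0:
        bandwidth_mib_s = (total_bytes / 2**20) / save_seconds
    else:
        bandwidth_mib_s = float("nan")
    return (
        f"checkpoint size: {size_gib:.2f} GiB, "
        f"avg write bandwidth: {bandwidth_mib_s:.1f} MiB/s"
    )
```

For example, a 2 GiB checkpoint written in 4 seconds works out to 2048 MiB / 4 s = 512 MiB/s.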

gemini-code-assist

Code Review

This pull request enhances the checkpointing system for GRPO training by introducing driver state persistence (dataloader and actor states) and refactoring reference policy updates. It also adds a training script for Qwen 2.5 0.5B. Key feedback includes addressing a potential backward compatibility issue with RNG state restoration, improving consistency in checkpoint directory identification by using the 'latest' file, and mitigating performance bottlenecks caused by synchronous file system operations during checkpoint logging.

gemini-code-assist · 2026-04-29T21:53:05Z

+                if torch.cuda.is_available():
+                    torch.cuda.set_rng_state_all(rng_states["torch_cuda_rng_state_all"])


The code now assumes torch_cuda_rng_state_all is always present in rng_states when CUDA is available. This will cause a KeyError when attempting to resume from older checkpoints that lack this specific key. It is safer to check for the key's existence to maintain backward compatibility.

Suggested change

-                if torch.cuda.is_available():
-                    torch.cuda.set_rng_state_all(rng_states["torch_cuda_rng_state_all"])
+                if torch.cuda.is_available() and "torch_cuda_rng_state_all" in rng_states:
+                    torch.cuda.set_rng_state_all(rng_states["torch_cuda_rng_state_all"])

gemini-code-assist · 2026-04-29T21:53:05Z

+    step_dirs = [
+        d
+        for d in os.listdir(args.checkpoint_state_dir)
+        if d.startswith("global_step")
+        and d[len("global_step") :].isdigit()
+        and os.path.isdir(os.path.join(args.checkpoint_state_dir, d))
+    ]
+    if step_dirs:
+        latest_dir = os.path.join(
+            args.checkpoint_state_dir, max(step_dirs, key=lambda d: int(d[len("global_step") :]))
+        )


Manually searching for the latest checkpoint directory by parsing global_step* names is redundant and potentially inconsistent with DeepSpeed's latest file mechanism. Since save_driver_state (line 858) already relies on the latest file as the source of truth, this function should ideally do the same to ensure it measures the correct directory.

    latest_file = os.path.join(args.checkpoint_state_dir, "latest")
    latest_dir = None
    if os.path.exists(latest_file):
        with open(latest_file) as f:
            latest_dir = os.path.join(args.checkpoint_state_dir, f.read().strip())
    if latest_dir:
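
The same idea can be factored into a small standalone helper, sketched here under assumptions. The name resolve_latest_checkpoint_dir is hypothetical and not part of the PR; the sketch relies on the `latest` file that DeepSpeed writes in the save directory, which holds the tag of the most recent checkpoint.

```python
import os
from typing import Optional


def resolve_latest_checkpoint_dir(checkpoint_state_dir: str) -> Optional[str]:
    """Return the directory named by the `latest` file, or None if unavailable."""
    latest_file = os.path.join(checkpoint_state_dir, "latest")
    if not os.path.exists(latest_file):
        return None
    with open(latest_file) as f:
        tag = f.read().strip()
    if not tag:
        return None
    latest_dir = os.path.join(checkpoint_state_dir, tag)
    # Only report a directory that is actually on disk.
    return latest_dir if os.path.isdir(latest_dir) else None
```

Returning None for a missing or empty `latest` file lets the caller skip the size log instead of measuring the wrong directory.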

gemini-code-assist · 2026-04-29T21:53:05Z

+        total_bytes = sum(
+            os.path.getsize(os.path.join(root, f)) for root, _, files in os.walk(latest_dir) for f in files
+        )


Calculating the total checkpoint size using os.walk and os.path.getsize on the main thread can introduce significant latency, especially when using distributed filesystems or when dealing with large model checkpoints (e.g., 70B+ parameters). Since this information is only used for logging, consider moving this calculation to a background thread or making it optional to avoid stalling the training loop.
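
One way to act on this suggestion, sketched under assumptions: run the directory walk on a daemon thread and report the result through a callback. The names directory_size_bytes and log_checkpoint_size_async, and the message format, are hypothetical; nothing here is taken from the PR's code.

```python
import os
import threading
from typing import Callable


def directory_size_bytes(path: str) -> int:
    """Recursively sum file sizes, skipping files that vanish mid-walk."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may be removed or rotated while we are walking.
                continue
    return total


def log_checkpoint_size_async(
    path: str, log_fn: Callable[[str], None] = print
) -> threading.Thread:
    """Measure `path` off the calling thread and pass the result to log_fn."""

    def _worker() -> None:
        total = directory_size_bytes(path)
        log_fn(f"{path}: {total / 2**30:.3f} GiB")

    thread = threading.Thread(target=_worker, name="checkpoint-size-logger", daemon=True)
    thread.start()
    return thread
```

The caller gets the thread back and can join() it at shutdown if the final log line matters. The trade-off is that the reported size is no longer guaranteed to be logged within the same training step as the save.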

finbarrtimbers added 9 commits April 28, 2026 16:02

Added fast script
3085ad0

Align qwen2.5_0.5b_gsm8k.sh scaffolding with qwen3_4b_dapo_math.sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
155d805

Remove redundant default flags from qwen2.5_0.5b_gsm8k.sh Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4d683b6

Drop dead --log_train_solve_rate_metrics flag and shift image arg out of $@ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
754ce98

Extract checkpoint-state save into maybe_save_checkpoint_state and log size/bandwidth Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ccae3a5

Cut redundant work in checkpoint save: per-rank driver state, ref_policy rewrites, RNG dup Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
9cac3d8

Drop legacy checkpoint fallbacks for driver_state.pt and per-device CUDA RNG Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ed68389

Replace _ref_policy_dirty with should_save_ref_policy predicate; restore driver_state guard Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
f0d9a21

Assert checkpoint_state_freq divides ref_policy_update_freq so saves land on update steps Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
98f4821

gemini-code-assist Bot reviewed Apr 29, 2026

finbarrtimbers added 2 commits April 29, 2026 16:13

Restore WHY comments around ref_policy raw torch.save and deepspeed mpu/save dance Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
6f3e1b9

Add CHANGELOG entry for #1647 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
c7af0d0

finbarrtimbers wants to merge 11 commits into main from finbarr/check-script-in
