[tinker] Fix checkpointing for multi-node runs #1732

Open
pcmoritz wants to merge 1 commit into NovaSky-AI:main from pcmoritz:tinker-multi-node-checkpoints

@pcmoritz
Collaborator

Before this fix, saving a checkpoint in a multi-node run failed with an error like:

  2026-05-29 20:09:06,716 - ERROR - skyrl: Error saving checkpoint for model model_5b6cb10d, checkpoint <...>: [Errno 2] No such file or directory: '/tmp/tmp6kkqkzrg/model'

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates checkpoint saving and loading in skyrl_train_backend.py to stage temporary files on the same shared filesystem as the target paths, resolving multi-node deployment issues where remote workers and the engine do not share local /tmp. A review comment identifies a potential FileNotFoundError if reference_path is a relative path, and suggests resolving it to an absolute path using os.path.abspath before extracting the directory name.
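To illustrate the staging idea in the summary above, here is a minimal sketch. It is not the code from this PR: the helper names `staging_root_for` and `save_via_staging`, and the file name `model` inside the staging directory, are hypothetical. It assumes the target path lives on a filesystem that every node can see, so a temporary directory created next to it is visible to every node as well, unlike a node-local /tmp.

```python
import os
import tempfile


def staging_root_for(reference_path: str) -> str:
    """Return the directory containing reference_path, creating it if needed.

    Hypothetical helper: staging next to the target keeps temporary files
    on the same filesystem as the final checkpoint.
    """
    staging_root = os.path.dirname(os.path.abspath(reference_path))
    os.makedirs(staging_root, exist_ok=True)
    return staging_root


def save_via_staging(data: bytes, target_path: str) -> None:
    """Write data into a temp dir beside target_path, then move it into place."""
    with tempfile.TemporaryDirectory(dir=staging_root_for(target_path)) as tmp_dir:
        tmp_file = os.path.join(tmp_dir, "model")
        with open(tmp_file, "wb") as f:
            f.write(data)
        # Source and destination share a directory tree, so this is a rename
        # within one filesystem rather than a cross-device copy.
        os.replace(tmp_file, target_path)
```

The only design point the sketch is meant to show is the `dir=` argument: `tempfile.TemporaryDirectory()` without it falls back to the local default temp directory, which is the situation the error message above points at.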

Comment on lines +1127 to +1129
staging_root = str(os.path.dirname(reference_path))
os.makedirs(staging_root, exist_ok=True)
return staging_root

medium

If reference_path has no directory component (e.g., a bare filename like "checkpoint.tar"), os.path.dirname(reference_path) returns the empty string "". Calling os.makedirs("", exist_ok=True) then raises FileNotFoundError: [Errno 2] No such file or directory: ''.

To prevent this and ensure robustness for relative paths, resolve the absolute path using os.path.abspath before extracting the directory name.

Suggested change
- staging_root = str(os.path.dirname(reference_path))
- os.makedirs(staging_root, exist_ok=True)
- return staging_root
+ staging_root = os.path.dirname(os.path.abspath(reference_path))
+ os.makedirs(staging_root, exist_ok=True)
+ return staging_root
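The behavior the review comment describes can be checked with a short standalone sketch. The function names `staging_root_naive` and `staging_root_abs` are hypothetical and only mirror the two variants of the suggested change; they are not taken from skyrl_train_backend.py.

```python
import os


def staging_root_naive(reference_path: str) -> str:
    """Directory part of the path exactly as given (original variant)."""
    return os.path.dirname(reference_path)


def staging_root_abs(reference_path: str) -> str:
    """Directory part after resolving against the current working directory."""
    return os.path.dirname(os.path.abspath(reference_path))


# A bare filename has no directory component, so dirname is empty.
assert staging_root_naive("checkpoint.tar") == ""

# makedirs on the empty string fails even with exist_ok=True.
try:
    os.makedirs(staging_root_naive("checkpoint.tar"), exist_ok=True)
    raised = False
except FileNotFoundError:
    raised = True
assert raised

# Resolving first yields the current working directory instead.
assert staging_root_abs("checkpoint.tar") == os.getcwd()
```

Relative paths that do include a directory, such as "ckpts/checkpoint.tar", already give a non-empty dirname ("ckpts"), so the failure is specific to bare filenames. The abspath variant handles both cases.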
