[tinker] Fix checkpointing for multi-node runs#1732
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates checkpoint saving and loading in skyrl_train_backend.py to stage temporary files on the same shared filesystem as the target paths, resolving multi-node deployment issues where remote workers and the engine do not share local /tmp. A review comment identifies a potential FileNotFoundError if reference_path is a relative path, and suggests resolving it to an absolute path using os.path.abspath before extracting the directory name.
| staging_root = str(os.path.dirname(reference_path)) | ||
| os.makedirs(staging_root, exist_ok=True) | ||
| return staging_root |
There was a problem hiding this comment.
If reference_path is a relative path (e.g., a simple filename like "checkpoint.tar"), os.path.dirname(reference_path) will return an empty string "". Calling os.makedirs("", exist_ok=True) will then raise a FileNotFoundError: [Errno 2] No such file or directory: ''.
To prevent this and ensure robustness for relative paths, resolve the absolute path using os.path.abspath before extracting the directory name.
| staging_root = str(os.path.dirname(reference_path)) | |
| os.makedirs(staging_root, exist_ok=True) | |
| return staging_root | |
| staging_root = os.path.dirname(os.path.abspath(reference_path)) | |
| os.makedirs(staging_root, exist_ok=True) | |
| return staging_root |
Before this fix, there would be an error like