-
Notifications
You must be signed in to change notification settings - Fork 120
Add Kubernetes support via SkyPilot #2083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Kubernetes support via SkyPilot #2083
Conversation
Summary: This update introduces two new optional parameters, `workdir` and `file_mounts`, to the `SkyPilotJob` class. The `workdir` parameter allows users to specify a local directory to sync with the cluster, while `file_mounts` enables additional file mounts by mapping remote paths to local paths. These enhancements improve the flexibility and usability of job configurations in SkyPilot.
- Add SKY_README.md with comprehensive documentation: - Architecture overview - Implementation details - Usage examples - Troubleshooting guide - Networking considerations for Kubernetes - Add python/examples/skypilot_getting_started.py: - Example script demonstrating Monarch actors on SkyPilot - Supports multiple clouds (Kubernetes, AWS, GCP, Azure) - Configurable via command-line arguments - Update skypilot.py: - Add host mesh initialization wait - Improve logging for debugging - Fix worker command environment variable setup
|
Hi @romilbhardwaj! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
@romilbhardwaj - I just saw this; sorry for missing it all day. I'll check it out in the morning and circle back :). |
|
very cool! is this ready for review? (if so push it out of draft state?) |
python/monarch/_src/job/skypilot.py
Outdated
| sudo apt-get update | ||
|
|
||
| # Install system dependencies | ||
| sudo apt-get install -y \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skypilot requires new system level dependencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @colin2328, these are dependencies for monarch. SkyPilot's dependencies are handled by the SkyPilot runtime during setup.
They're needed because this PR currently builds Monarch from source on the workers to keep client/worker versions in sync during development. Once this is merged and a Monarch wheel is released to PyPI, the setup simplifies to just pip install torchmonarch on workers.
The current approach is a temporary workaround for the chicken-and-egg problem: SkyPilotJob lives in Monarch source, but there's no released wheel with it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question on architecture: How should new schedulers extend Monarch's JobTrait?
Should they:
- Live in Monarch source (python/monarch/_src/job/)
- Pro: Clean import path, UX for users (
from monarch.job import SkyPilotJob) - Con: Requires building from source during dev until wheel is released, may need version pinning after release?
- Pro: Clean import path, UX for users (
- Live in a separate package (user code)
- Pro: Don't need to rebuild monarch
- Con: Different import pattern (
from <userpkg> import SkyPilotJob?)
- Something else I'm missing.
Currently I have the SkyPilot implementation in python/monarch/_src/job/skypilot.py, but I can refactor it. Any guidance here would be appreciated 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @zdevito as this is an interesting question.
My intuition is that they should live in python/monarch/_src/job even though writing a job or changing the API will requires building from source for testing purposes. The testing cost is only paid by the developer, not the end user.
IIUC, keeping it in a separate package will still require installation on the workers while the code is in development. However it will not require building rust libraries which are probably the main source of the build latency.
It would be nice for JobTrait implementers to not require to build rust binaries while testing their code. I wonder if simply a cp of the new files into site-packages dir on the workers would work, though it seems like a hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Job implementations should be buildable on top of Monarch's public APIs. So, in general, I would favor the latter pattern, which also has good DevEx. I don't think we should accumulate many job implementations in core Monarch.
Note that this is distinct from the question of whether these are part of the Monarch "ecosystem" packages. In other words, we might maintain them in Monarch itself, but build and distribute them as separate packages.
We have previously talked about using pip flavors to bundle such dependencies for this purpose, e.g.,
$ pip install torchmonarch[skypilot]
etc.
|
Thanks @colin2328 @johnwhumphreys - I'll polish the PR and get it ready for review. Have a question about the architecture here, happy to do whatever aligns with Monarch's recommended design. |
examples/skypilot/skypilot_ddp.py
Outdated
| import os | ||
| import sys | ||
|
|
||
| # Set timeouts before importing monarch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am working on enabling
from monarch.config import configure
configure(
host_spawn_ready_timeout="300s",
message_delivery_timeout="300s",
mesh_proc_spawn_max_idle="300s"
)to replace the use of env-vars here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shayne-fletcher! That'll be super helpful for backends where cold starts can be > 30s.
|
Monarch build was slowing down worker launches, so I've refactored along the lines of option 2 in 20d36e8. Let me know if this is not the right direction :) Tested on Coreweave k8s cluster with H200s. PR is still a draft, will finish up the examples (torchtitan and ddp) + readme and mark ready for review soon! |
|
@romilbhardwaj "the author does not have a valid CLA on file." can you acept the CLA? and publish? |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
| Usage: | ||
| # Run on Kubernetes: | ||
| python getting_started.py --cloud kubernetes --num-hosts 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/getting_started/skypilot_getting_started?
|
@colin2328 has imported this pull request. If you are a Meta employee, you can view this in D88994552. |
examples/skypilot/skypilot_ddp.py
Outdated
| try: | ||
| import sky | ||
| except ImportError: | ||
| print("ERROR: SkyPilot is not installed. Run: pip install skypilot[kubernetes]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add this as an optional dependency in pyproject.toml if we move skypilot_job.py from examples to the job dir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
|
Cleaned up the readme, added an architecture diagram and included a DDP notebook inspired by the slurm example. Tested on coreweave k8s with H200s. This PR is ready for review now! cc @ahmadsharif1 @colin2328 @johnwhumphreys |
Great point. For our release testing, our CI spins up k8s clusters on EKS, GKE, CW and Nebius and runs our smoke test suite which includes model training examples from pytorch, ray, nemo and other popular frameworks. For lightweight CPU-only tests, some parts of our CI uses kind (via |
|
Added some flags to allow the example to run on CPU-only k8s clusters |
ahmadsharif1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for the most part. You can wait for Colin to take a look as well.
I think he wants this to be part of the release, so you can address them and merge if you want
|
Thanks for the reviews @ahmadsharif1! I'll wait for a bit for @colin2328's feedback before working through the comments and doing a round of final testing. Also I noticed the linter was failing - do I need to take any action? Couldn't find instructions to run the linter myself. |
|
@romilbhardwaj , thanks. I'll fix the lints and push for you. please address the comments first, rest is fine to me. |
|
Thanks @colin2328 and @ahmadsharif1 - I think I've addressed all comments. Please lmk if I missed something. This should be ready to go now. Tested on 2x H200:8 on a Coreweave k8s cluster. |
|
@colin2328 merged this pull request in e409178. |
Hi Monarch community 👋
This PR adds support for running Monarch workloads on Kubernetes and cloud VMs via SkyPilot.
Adds a new
SkyPilotJobclass that provisions Monarch workers on K8s (or any of the 20+ cloud providers supported by SkyPilot):With this I was able to run the getting started example from Monarch docs on my multi-node Coreweave Kubernetes cluster with H200s. Full example in
examples/skypilot_getting_started.py.Notes:
This is an early PR - I'm looking for feedback on the direction. I'm a maintainer of the SkyPilot project and was trying to run Monarch on my k8s cluster. Figured an integration with SkyPilot would be a good fit.
Happy to iterate on the implementation if this aligns with where you'd like Monarch to go.