Use NVML for device discovery and health checks #52
Conversation
I should also ask, is there a way to test this in k8s e2e safely?
@mindprince @jiayingz @vishh Based on the failures, I assume this may need to be better mocked or hidden behind a build flag.
Thanks a lot for making this change, @cmluciano! This is very helpful. At a high level, I am curious whether we should try to use the "github.com/NVIDIA/nvidia-docker/src/nvml" pkg instead of "github.com/mindprince/gonvml". The former is used in the Nvidia device plugin and seems to have better device state monitoring. I think that would also help us converge the two device plugins in the future.
I decided to use mindprince/gonvml because it is currently used in cAdvisor. I also noticed that NVIDIA/nvidia-docker is transitioning to a package repo that installs a custom runtime hook, which in turn leverages libnvidia-container. The mindprince/gonvml library is just a Go wrapper around the NVML C API. This feels cleaner, has fewer dependencies, and does not require custom runC hooks or runtime patches.
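(For context, here is a minimal sketch of what device discovery through mindprince/gonvml can look like. This is illustrative only, not code from this PR; it assumes libnvidia-ml.so is loadable at runtime, and error handling is abbreviated.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/mindprince/gonvml"
)

func main() {
	// Initialize dynamically loads the NVML shared library; it fails if the
	// NVIDIA driver (and libnvidia-ml.so) is not present on the machine.
	if err := gonvml.Initialize(); err != nil {
		log.Fatalf("could not initialize NVML: %v", err)
	}
	defer gonvml.Shutdown()

	count, err := gonvml.DeviceCount()
	if err != nil {
		log.Fatalf("could not query device count: %v", err)
	}

	for i := uint(0); i < count; i++ {
		dev, err := gonvml.DeviceHandleByIndex(i)
		if err != nil {
			log.Printf("device %d: %v", i, err)
			continue
		}
		minor, _ := dev.MinorNumber()
		name, _ := dev.Name()
		// /dev/nvidia<minor> is the device node the plugin would advertise.
		fmt.Printf("GPU %d: %s (/dev/nvidia%d)\n", i, name, minor)
	}
}
```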
Agreed, we should avoid the runtime hook part. What I am hoping to leverage from https://github.com/NVIDIA/nvidia-docker/blob/1.0/src/nvml/nvml.go is the critical-error watching; I wonder whether we can re-use it to monitor device health.
I'm fine with adopting this library if it will be supported long-term. But if this branch goes away, we are stuck. I think we can easily extend mindprince/gonvml with whatever functionality we need long-term as well.
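(To make the health-monitoring discussion concrete: nvidia-docker's nvml.go watches for critical XID events, which gonvml does not expose today. A crude polling fallback on top of gonvml might look like the sketch below. The markUnhealthy callback is hypothetical, not part of this PR, and a real implementation would presumably want the event-based approach instead.)

```go
package plugin

import (
	"time"

	"github.com/mindprince/gonvml"
)

// pollHealth marks a GPU unhealthy once a basic NVML query starts failing.
// It is a crude stand-in for NVML's event-based health watching; the
// markUnhealthy callback is a hypothetical hook into the device plugin.
func pollHealth(index uint, interval time.Duration, markUnhealthy func(uint)) {
	for range time.Tick(interval) {
		dev, err := gonvml.DeviceHandleByIndex(index)
		if err != nil {
			markUnhealthy(index)
			return
		}
		// Any cheap per-device query works as a liveness probe here.
		if _, _, err := dev.MemoryInfo(); err != nil {
			markUnhealthy(index)
			return
		}
	}
}
```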
@mindprince Any input on your library vs. the others?
That is a good point. Perhaps we should schedule a meeting with the Nvidia folks to see whether they plan to support the NVML Go pkg in the long term.
Can we use the library in this PR for now, since it is already used by cAdvisor? I can hide it behind a build flag if that is desired.
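(For what it's worth, hiding the NVML path behind a Go build tag could be as small as a pair of files like the sketch below. The file names and the discoverDevices signature are illustrative, not from this PR; building with `go build -tags nvml` selects the real implementation.)

```go
// nvml.go
// +build nvml

package plugin

import "github.com/mindprince/gonvml"

// discoverDevices enumerates GPUs through NVML when built with -tags nvml.
func discoverDevices() (uint, error) {
	if err := gonvml.Initialize(); err != nil {
		return 0, err
	}
	defer gonvml.Shutdown()
	return gonvml.DeviceCount()
}
```

```go
// nvml_stub.go
// +build !nvml

package plugin

// discoverDevices is a no-op stub so the plugin still compiles and runs in
// environments without the NVIDIA driver (e.g. CI).
func discoverDevices() (uint, error) {
	return 0, nil
}
```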
Sorry, I haven't gotten around to reviewing this yet. I am out of office for a couple more weeks and have just looked at the comment thread here. One of the reasons we went with a different library for cAdvisor instead of using github.com/NVIDIA/nvidia-docker/src/nvml was [...]. I am happy to accept PRs to mindprince/gonvml. If we end up using [...]. Whichever library we use, we should hide it behind a build flag, at least initially. One general question about this PR, which I asked in the email thread a while back as well: the device plugin is currently deployed as a Kubernetes add-on, so how do we make NVML available inside the device plugin container? The device plugin injects the NVIDIA library host paths into other GPU containers, but who will make them available to the device plugin itself?
We may want to have a meeting with the Nvidia folks to learn their long-term plan. For now, perhaps we can create a special branch for nvml-related changes so that we can make incremental improvements?
I'm fine with a feature branch. Can you create one so that I can move this PR to it?
@cmluciano I can test this PR on my k8s cluster if it's ready to test.
SGTM @pineking, let me know about any failures.
@mindprince @jiayingz I moved the base branch to the nvml branch. Since this is acting more like a feature branch, do we still require build-flag integration?
I am fine without the build flag. We can re-evaluate this after learning the long-term plan from the Nvidia side. Do you know why the presubmit test fails?
If I get the approval internally, I will soon create a new GitHub repo with the Go NVML bindings.
@cmluciano: this comment from @mindprince
Ohhhh ok, great, thanks!
@jiayingz @mindprince I think the failures may be due to the NVML library not being included in the build image for Travis. It also might require a GPU to pass, but I can try to mock these out and move the tests to an e2e suite instead.
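(One common way to make this mockable, sketched below with hypothetical names rather than code from this PR: put the few NVML calls behind a small interface so unit tests can run without a driver or GPU.)

```go
package plugin

// nvmlInterface captures the NVML calls the plugin needs, so tests can swap
// in a fake that requires no driver or GPU.
type nvmlInterface interface {
	Initialize() error
	Shutdown() error
	DeviceCount() (uint, error)
}

// fakeNVML is a test double that returns canned values.
type fakeNVML struct {
	count uint
	err   error
}

func (f *fakeNVML) Initialize() error          { return f.err }
func (f *fakeNVML) Shutdown() error            { return nil }
func (f *fakeNVML) DeviceCount() (uint, error) { return f.count, f.err }
```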
FYI we moved the NVML bindings to https://github.com/nvidia/gpu-monitoring-tools
cc @guptaNswati
@jiayingz assuming I can get these mocks passing, do you see anything else blocking the merge of this PR? The big open question is whether we will ever be able to test this well without provisioning a machine that has the drivers installed.
@cmluciano I think we are fine to merge it into the nvml branch. Also agree on the testing part; I think it is fine to test the PR with drivers already installed. Then maybe it would be helpful to have a README in the nvml branch documenting what the device plugin built from this branch is suited for?