
Conversation

@cmluciano

Notes for reviewers:

  • I added some TODOs for later PRs
  • Should I hide these changes behind a build flag to preserve the existing behavior?
  • Is there a better way to test some of the health-checking functions?

@cmluciano
Author

I should also ask, is there a way to test this in k8s e2e safely?

@cmluciano
Author

@mindprince @jiayingz @vishh Based on the failures, I assume this may need to be better mocked or hidden behind a build flag.

@jiayingz
Collaborator

jiayingz commented Feb 8, 2018

Thanks a lot for making this change, @cmluciano. This is very helpful!

At a high level, I am curious whether we should try to use the "github.com/NVIDIA/nvidia-docker/src/nvml" package instead of "github.com/mindprince/gonvml". The former is used in the NVIDIA device plugin and seems to have better device state monitoring. I think that would also help us converge the two device plugins in the future.
https://github.com/NVIDIA/nvidia-docker/blob/1.0/src/nvml/nvml.go

@cmluciano
Author

cmluciano commented Feb 8, 2018

I decided to use mindprince/gonvml because it is currently used in cAdvisor. I also noticed that NVIDIA/nvidia-docker is transitioning to a package repo that installs a custom runtime/hook, which in turn leverages libnvidia-container.

The mindprince/gonvml library is just a Go wrapper around the NVML C code. This feels cleaner, has fewer dependencies, and does not require custom runc hooks or runtime patches.
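
For context, here is a minimal sketch of how gonvml could back the device discovery and health checks being discussed. The method names are taken from github.com/mindprince/gonvml as it existed at the time; treat the exact signatures as an assumption rather than a reference.

```go
package main

import (
	"fmt"
	"log"

	"github.com/mindprince/gonvml"
)

func main() {
	// Initialize dlopens libnvidia-ml.so.1 at runtime, so NVML headers and
	// libraries are not needed in the build environment.
	if err := gonvml.Initialize(); err != nil {
		log.Fatalf("could not initialize NVML: %v", err)
	}
	defer gonvml.Shutdown()

	count, err := gonvml.DeviceCount()
	if err != nil {
		log.Fatalf("could not get device count: %v", err)
	}

	for i := uint(0); i < count; i++ {
		dev, err := gonvml.DeviceHandleByIndex(i)
		if err != nil {
			// Treat a failed handle lookup as an unhealthy device.
			fmt.Printf("device %d: unhealthy (%v)\n", i, err)
			continue
		}
		uuid, _ := dev.UUID()
		temp, _ := dev.Temperature() // degrees Celsius
		fmt.Printf("device %d: uuid=%s temp=%d\n", i, uuid, temp)
	}
}
```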

@jiayingz
Collaborator

jiayingz commented Feb 8, 2018

Agreed, we should avoid the runtime hook part. What I am hoping to leverage from https://github.com/NVIDIA/nvidia-docker/blob/1.0/src/nvml/nvml.go is the critical-error-watching part, and I wonder whether we can reuse it to monitor device health.
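
For reference, the health monitoring being discussed is typically built on NVML's critical XID error events (nvmlDeviceRegisterEvents and nvmlEventSetWait in the C API). Below is a rough sketch of the loop a device plugin might run, using a hypothetical xidWatcher interface in place of whichever bindings are eventually chosen.

```go
package healthcheck

import (
	"log"
	"time"
)

// xidWatcher is a hypothetical abstraction over NVML's event API; it is not
// an existing interface in gonvml or in nvidia-docker's nvml package.
type xidWatcher interface {
	// WaitForXid blocks for up to timeout and returns the UUID of a device
	// that reported a critical XID error, or ok == false on timeout.
	WaitForXid(timeout time.Duration) (uuid string, ok bool)
}

// watchHealth loops until stopped, reporting devices as unhealthy when
// critical XID events arrive.
func watchHealth(w xidWatcher, unhealthy chan<- string, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if uuid, ok := w.WaitForXid(5 * time.Second); ok {
			log.Printf("device %s reported a critical XID error", uuid)
			unhealthy <- uuid
		}
	}
}
```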

@cmluciano
Author

I'm fine with adopting this library if it will be supported long term, but if that branch goes away we are stuck. I think we can easily extend mindprince/gonvml with whatever functionality we need long term as well.

@cmluciano
Author

@mindprince Any input on your library vs. others?

@jiayingz
Collaborator

jiayingz commented Feb 9, 2018

That is a good point. Perhaps we should schedule a meeting with the Nvidia folks to see whether they plan to support the nvml Go package long term.

@cmluciano
Author

Can we use the library in this PR for now, since it is already used by cAdvisor? I can hide it behind a build flag if that is desired.

@rohitagarwal003
Contributor

Sorry, I haven't gotten around to reviewing this yet. I am out of office for a couple more weeks. I have just looked at the comment thread here.

One of the reasons we went with a different library for cAdvisor instead of using github.com/NVIDIA/nvidia-docker/src/nvml was that it required that NVML be present in the build environment. Having NVML installed in the build environment was a big no-no for Kubernetes. I would prefer that we not add any NVML requirements to the build environment for this device plugin either, but it may not be as big a deal as it was for Kubernetes core.

I am happy to accept PRs to github.com/mindprince/gonvml if there's any missing functionality there.

If we end up using github.com/NVIDIA/nvidia-docker/src/nvml, we should definitely check with NVIDIA about its future. I also heard from NVIDIA that they were (thinking of?) developing official Go bindings for NVML and DCGM. We should check the status of that with them as well.

Whichever library we use, we should hide that behind a build flag, at least initially.

One general question about this PR, which I also asked in the email thread a while back: the device plugin is currently deployed as a Kubernetes add-on, so how do we make NVML available inside the device plugin container? The device plugin injects the NVIDIA library host paths into other GPU containers, but who makes those libraries available to the device plugin itself?
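
Regarding the build-flag suggestion above, Go build tags are the usual mechanism: the NVML-backed code is compiled only when the tag is passed, and a stub preserves the existing behavior otherwise. A minimal sketch follows, with the tag, package, and function names chosen purely for illustration.

```go
// devices_nvml.go
// +build nvml

package gpuplugin

import "github.com/mindprince/gonvml"

// deviceCount reports the number of GPUs via NVML when built with
// "go build -tags nvml".
func deviceCount() (uint, error) {
	if err := gonvml.Initialize(); err != nil {
		return 0, err
	}
	defer gonvml.Shutdown()
	return gonvml.DeviceCount()
}
```

```go
// devices_stub.go
// +build !nvml

package gpuplugin

// deviceCount is a stub used when the plugin is built without the nvml tag,
// preserving the existing NVML-free behavior.
func deviceCount() (uint, error) {
	return 0, nil
}
```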

@jiayingz
Collaborator

We may want to have a meeting with the Nvidia folks to learn their long-term plan. For now, perhaps we can create a special branch for nvml-related changes so that we can make incremental improvements?

@cmluciano
Author

I'm fine with a feature branch. Can you create one so that I can move this PR to it?

@jiayingz
Collaborator

@pineking

@cmluciano I can test this PR on my k8s cluster if it's ready to test.

@cmluciano
Author

cmluciano commented Feb 13, 2018

SGTM @pineking, let me know about any failures.

cmluciano changed the base branch from master to nvml on February 14, 2018 19:42
@cmluciano
Author

@mindprince @jiayingz I moved the base branch to the nvml branch. Since this is acting more like a feature branch, do we still require build flag integration?

@jiayingz
Collaborator

I am fine without the build flag. We can re-evaluate this after learning the long-term plan from the NVIDIA side.

Do you know why the presubmit test fails?

@flx42

flx42 commented Feb 15, 2018

If I get approval internally, I will soon create a new GitHub repo with the nvml package from nvidia-docker 1.0.
I've taken note of the issue with nvml.h and have asked if we can fix that.

@cmluciano
Author

@flx42 Can you link to the nvml.h issue that you are referring to?

@jiayingz I think it may be the lack of GPU support within Travis. The tests pass on my dev machine with one NVIDIA GPU.

@flx42

flx42 commented Feb 15, 2018

@cmluciano: this

One of the reasons we went with a different library for cAdvisor instead of using github.com/NVIDIA/nvidia-docker/src/nvml was that it required that NVML be present in the build environment. Having NVML installed in the build environment was a big no-no for Kubernetes.

from @mindprince

@cmluciano
Author

Ohhhh ok great thanks!

@cmluciano
Author

@jiayingz @mindprince I think the failures may be due to the NVML library not being included in the Travis build image. The tests also might require a GPU to pass, but I can try to mock them out and move these tests to an e2e suite instead.
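
One common way to get these tests green on GPU-less CI is to hide the NVML calls behind a small interface and substitute a fake in unit tests, leaving the real NVML path to an e2e job on a node with drivers installed. A rough sketch of what such a test file might look like; the package, interface, and test names are illustrative, not taken from this PR.

```go
package gpuplugin

import (
	"fmt"
	"testing"
)

// gpuLister abstracts the few NVML calls the plugin needs, so unit tests can
// run on CI machines without a GPU or libnvidia-ml installed.
type gpuLister interface {
	DeviceCount() (uint, error)
	DeviceUUID(index uint) (string, error)
}

// fakeLister is a test double that pretends two GPUs are present.
type fakeLister struct{}

func (fakeLister) DeviceCount() (uint, error) { return 2, nil }
func (fakeLister) DeviceUUID(i uint) (string, error) {
	return fmt.Sprintf("GPU-fake-%d", i), nil
}

// TestDeviceDiscovery exercises the discovery logic against the fake; the
// NVML-backed implementation would be exercised by an e2e test on a GPU node.
func TestDeviceDiscovery(t *testing.T) {
	var l gpuLister = fakeLister{}
	n, err := l.DeviceCount()
	if err != nil || n != 2 {
		t.Fatalf("expected 2 fake devices, got %d (err: %v)", n, err)
	}
	uuid, err := l.DeviceUUID(0)
	if err != nil || uuid == "" {
		t.Fatalf("expected a fake UUID, got %q (err: %v)", uuid, err)
	}
}
```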

@flx42

flx42 commented Jun 1, 2018

FYI, we moved the NVML bindings to https://github.com/nvidia/gpu-monitoring-tools.
It now ships with nvml.h; we still have improvements to make.

@flx42

flx42 commented Jun 1, 2018

cc @guptaNswati

@cmluciano
Author

@jiayingz Assuming I can get these mocks passing, do you see anything else blocking the merge of this PR? The big open question is whether we will ever be able to test this well without provisioning a machine that has the drivers installed.

@jiayingz
Collaborator

@cmluciano I think we are fine to merge it into the nvml branch. I also agree on the testing part: I think it is fine to test the PR with the drivers already installed. It might then be helpful to add a README to the nvml branch documenting what the device plugin built from this branch is suited for.
