Use NVML for device discovery and health checks #52
Conversation
I should also ask, is there a way to test this in k8s e2e safely?
@mindprince @jiayingz @vishh Based on the failures, I assume this may need to be better mocked or hidden behind a build flag.
Thanks a lot for making this change, @cmluciano! This is very helpful. At a high level, I am curious whether we should try to use the "github.com/NVIDIA/nvidia-docker/src/nvml" pkg instead of "github.com/mindprince/gonvml". The former is used in the Nvidia device plugin and seems to have better device state monitoring. I think that would also help us converge the two device plugins in the future.
I decided to use mindprince/gonvml because it is currently used in cAdvisor. I also noticed that NVIDIA/nvidia-docker is transitioning to a package repo that installs a custom runtime hook, which in turn leverages libnvidia-container. The mindprince/gonvml library is just a Go wrapper around the NVML C API. This feels cleaner, has fewer dependencies, and does not require custom runC hooks or runtime patches.
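(For context, here is a minimal sketch of what device discovery through mindprince/gonvml can look like. This is illustrative only, not code from this PR; it assumes libnvidia-ml.so is loadable at runtime, and error handling is abbreviated.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/mindprince/gonvml"
)

func main() {
	// Initialize dynamically loads the NVML shared library; it fails if the
	// NVIDIA driver (and libnvidia-ml.so) is not present on the machine.
	if err := gonvml.Initialize(); err != nil {
		log.Fatalf("could not initialize NVML: %v", err)
	}
	defer gonvml.Shutdown()

	count, err := gonvml.DeviceCount()
	if err != nil {
		log.Fatalf("could not query device count: %v", err)
	}

	for i := uint(0); i < count; i++ {
		dev, err := gonvml.DeviceHandleByIndex(i)
		if err != nil {
			log.Printf("device %d: %v", i, err)
			continue
		}
		minor, _ := dev.MinorNumber()
		name, _ := dev.Name()
		// /dev/nvidia<minor> is the device node the plugin would advertise.
		fmt.Printf("GPU %d: %s (/dev/nvidia%d)\n", i, name, minor)
	}
}
```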
Agreed, we should avoid the runtime hook part. What I am hoping to leverage from https://github.com/NVIDIA/nvidia-docker/blob/1.0/src/nvml/nvml.go is the critical-error watching; I wonder whether we can re-use it to monitor device health.
I'm fine with adopting this library if it will be supported long-term. But if this branch goes away, we are stuck. I think we can easily extend mindprince/gonvml with whatever functionality we need long-term as well.
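(To make the health-monitoring discussion concrete: nvidia-docker's nvml.go watches for critical XID events, which gonvml does not expose today. A crude polling fallback on top of gonvml might look like the sketch below. The markUnhealthy callback is hypothetical, not part of this PR, and a real implementation would presumably want the event-based approach instead.)

```go
package plugin

import (
	"time"

	"github.com/mindprince/gonvml"
)

// pollHealth marks a GPU unhealthy once a basic NVML query starts failing.
// It is a crude stand-in for NVML's event-based health watching; the
// markUnhealthy callback is a hypothetical hook into the device plugin.
func pollHealth(index uint, interval time.Duration, markUnhealthy func(uint)) {
	for range time.Tick(interval) {
		dev, err := gonvml.DeviceHandleByIndex(index)
		if err != nil {
			markUnhealthy(index)
			return
		}
		// Any cheap per-device query works as a liveness probe here.
		if _, _, err := dev.MemoryInfo(); err != nil {
			markUnhealthy(index)
			return
		}
	}
}
```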
@mindprince Any input on your library vs. the others?
That is a good point. Perhaps we should schedule a meeting with the Nvidia folks to see whether they plan to support the NVML Go pkg in the long term.
Can we use the library in this PR for now, since it is already used by cAdvisor? I can hide it behind a build flag if that is desired.
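(For what it's worth, hiding the NVML path behind a Go build tag could be as small as a pair of files like the sketch below. The file names and the discoverDevices signature are illustrative, not from this PR; building with `go build -tags nvml` selects the real implementation.)

```go
// nvml.go
// +build nvml

package plugin

import "github.com/mindprince/gonvml"

// discoverDevices enumerates GPUs through NVML when built with -tags nvml.
func discoverDevices() (uint, error) {
	if err := gonvml.Initialize(); err != nil {
		return 0, err
	}
	defer gonvml.Shutdown()
	return gonvml.DeviceCount()
}
```

```go
// nvml_stub.go
// +build !nvml

package plugin

// discoverDevices is a no-op stub so the plugin still compiles and runs in
// environments without the NVIDIA driver (e.g. CI).
func discoverDevices() (uint, error) {
	return 0, nil
}
```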
Sorry, I haven't gotten around to reviewing this yet. I am out of office for a couple more weeks and have just looked at the comment thread here. One of the reasons we went with a different library for cAdvisor instead of using github.com/NVIDIA/nvidia-docker/src/nvml was [...]. I am happy to accept PRs to mindprince/gonvml. If we end up using [...]. Whichever library we use, we should hide it behind a build flag, at least initially. One general question about this PR, which I asked in the email thread a while back as well: the device plugin is currently deployed as a Kubernetes add-on, so how do we make NVML available inside the device plugin container? The device plugin injects the NVIDIA library host paths into other GPU containers, but who will make them available to the device plugin itself?
We may want to have a meeting with the Nvidia folks to learn their long-term plan. For now, perhaps we can create a special branch for nvml-related changes so that we can make incremental improvements?
I'm fine with a feature branch. Can you create one so that I can move this PR to it?
@cmluciano I can test this PR on my k8s cluster if it's ready to test.
SGTM @pineking, let me know about any failures.
@mindprince @jiayingz I moved the base branch to the nvml branch. Since this is acting more like a feature branch, do we still require build-flag integration?
I am fine without the build flag. We can re-evaluate this after learning the long-term plan from the Nvidia side. Do you know why the presubmit test fails?
If I get the approval internally, I will soon create a new GitHub repo with the Go NVML bindings.
@cmluciano: this comment from @mindprince
Ohhhh ok, great, thanks!
@jiayingz @mindprince I think the failures may be due to the NVML library not being included in the build image for Travis. It also might require a GPU to pass, but I can try to mock these out and move the tests to an e2e suite instead.
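(One common way to make this mockable, sketched below with hypothetical names rather than code from this PR: put the few NVML calls behind a small interface so unit tests can run without a driver or GPU.)

```go
package plugin

// nvmlInterface captures the NVML calls the plugin needs, so tests can swap
// in a fake that requires no driver or GPU.
type nvmlInterface interface {
	Initialize() error
	Shutdown() error
	DeviceCount() (uint, error)
}

// fakeNVML is a test double that returns canned values.
type fakeNVML struct {
	count uint
	err   error
}

func (f *fakeNVML) Initialize() error          { return f.err }
func (f *fakeNVML) Shutdown() error            { return nil }
func (f *fakeNVML) DeviceCount() (uint, error) { return f.count, f.err }
```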
FYI we moved the NVML bindings to https://github.com/nvidia/gpu-monitoring-tools
cc @guptaNswati
@jiayingz assuming I can get these mocks passing, do you see anything else blocking the merge of this PR? The big open question is whether we will ever be able to test this well without provisioning a machine that has the drivers installed.
@cmluciano I think we are fine to merge it into the nvml branch. Also agree on the testing part; I think it is fine to test the PR with drivers already installed. Then maybe it would be helpful to have a README in the nvml branch documenting what the device plugin built from this branch is suited for?