@konturn konturn commented Feb 28, 2024

I've run into an issue where node maintenance on GPU nodes prevents the driver installer daemonset from starting up again. Specifically, our issue looks like this:

  1. GCP schedules maintenance for our H100 node (we cannot prevent this). We’re using the termination maintenance policy here, so the node gets stopped.
  2. The node gets restarted, and GCP tries to reattach the local SSDs from before but cannot. These local SSDs are used for containerd image storage (via a symlink) and for Nvidia driver storage, so all images are wiped from the node.
  3. The daemonset that exposes GPUs on the node cannot start, since its image no longer exists on the node and the pull policy is set to ‘Never’.

The fix here entails self-managing a modified version of the daemonset that has the adjusted pull policy. The GKE documentation should link to a daemonset that's able to work properly after node maintenance events.
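
As a rough illustration of the pull-policy adjustment described above, here is a minimal Python sketch that rewrites `imagePullPolicy: Never` in a DaemonSet manifest held as a dict. The helper name `adjust_pull_policy`, the trimmed-down example manifest, and the choice of `IfNotPresent` as the replacement value are all assumptions made for illustration; they are not taken from the actual GKE driver installer daemonset.

```python
import copy

# The three values Kubernetes accepts for imagePullPolicy.
VALID_PULL_POLICIES = {"Always", "IfNotPresent", "Never"}


def adjust_pull_policy(daemonset, new_policy="IfNotPresent"):
    """Return a copy of a DaemonSet manifest with 'Never' pull policies replaced.

    Hypothetical helper: the replacement value is an assumption, not
    necessarily what the modified daemonset in this PR uses.
    """
    if new_policy not in VALID_PULL_POLICIES:
        raise ValueError(f"unknown imagePullPolicy: {new_policy!r}")
    patched = copy.deepcopy(daemonset)  # leave the caller's manifest untouched
    pod_spec = patched["spec"]["template"]["spec"]
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            if container.get("imagePullPolicy") == "Never":
                container["imagePullPolicy"] = new_policy
    return patched


# Hypothetical, trimmed-down manifest; names and images are placeholders.
example = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "example-driver-installer"},
    "spec": {"template": {"spec": {
        "initContainers": [
            {"name": "installer",
             "image": "example.invalid/installer:tag",
             "imagePullPolicy": "Never"},
        ],
        "containers": [
            {"name": "pause", "image": "example.invalid/pause:tag"},
        ],
    }}},
}

patched = adjust_pull_policy(example)
print(patched["spec"]["template"]["spec"]["initContainers"][0]["imagePullPolicy"])
```

With a policy other than `Never`, the kubelet is allowed to pull the image again once the local SSD contents are gone, which is the situation in step 3.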


google-cla bot commented Feb 28, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

