If you are new to Kubernetes, we recommend using K9s in addition to kubectl.
In general, whenever there is a problem, ask:
- What do the pod logs say?
- What do the pod events say?
- What do the deployment events say?
- Unique to this operator: what is listed in the AIDeployment CRD?
Feel free to create an issue or reach out to us on Prem's Discord.
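The checks above map to commands along these lines (names are placeholders, and the `aideployment` resource name is an assumption — check `kubectl api-resources` for the exact name):

```shell
kubectl logs <pod>                        # pod logs
kubectl describe pod <pod>                # pod events (Events section at the bottom)
kubectl describe deployment <deployment>  # deployment events and conditions
kubectl get aideployment <name> -o yaml   # the AIDeployment resource
```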
Often this is caused by an unsatisfiable resource requirement, in particular GPU, but also CPU.
Even if you have exactly enough CPU and GPU to run your workload, Kubernetes won't be able to perform rolling updates: the replacement pod must be scheduled while the old pod is still holding the resources. You may have to manually scale a deployment down to stop the old pods and allow new ones to run. Alternatively, you can change the deployment's update strategy and scheduling policies.
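One way to avoid this deadlock is to change the update strategy so the old pod is terminated before the new one is scheduled. A minimal sketch of the relevant Deployment fields (note this trades rolling updates for brief downtime):

```yaml
spec:
  strategy:
    # Kill the old pod first, so its GPU is free when the new pod is scheduled.
    type: Recreate
```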
Errors involving GPU drivers and runtimes can be confusing. Sometimes the host's
kernel log (`dmesg`) can be helpful. For instance, the following message shows an incompatibility between
the runtime and the kernel module/driver:

```
[  +0.056775] NVRM: API mismatch: the client has the version 545.23.08, but
              NVRM: this kernel module has the version 550.54.14. Please
              NVRM: make sure that this kernel module and all NVIDIA driver
              NVRM: components have the same version.
```
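The two version strings in the message must match on a healthy host. As a quick sanity check you can pull them out of the kernel log; here the sample message is embedded inline for illustration, but normally you would pipe `dmesg` instead of the variable:

```shell
# Sample NVRM mismatch message (embedded for illustration).
log='NVRM: API mismatch: the client has the version 545.23.08, but
NVRM: this kernel module has the version 550.54.14.'

# Extract both version strings; they should be identical.
printf '%s\n' "$log" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+'
# prints:
# 545.23.08
# 550.54.14
```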
- Does the container have the correct version of CUDA?
  Check the pod events for the following message or similar:

  ```
  nvidia-container-cli.real: requirement error: invalid expression: unknown
  ```
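This usually means the CUDA requirement baked into the image (the `NVIDIA_REQUIRE_CUDA` environment variable in CUDA base images) is not satisfied by the host driver. The proper fix is to use an image built against a CUDA version your driver supports; as a workaround, the NVIDIA container runtime's check can be disabled. A sketch of the container spec fragment (container and image names are hypothetical):

```yaml
containers:
  - name: model-server         # hypothetical container name
    image: myrepo/model:cuda12 # hypothetical image
    env:
      # Workaround only: skips the NVIDIA runtime's CUDA requirement check.
      - name: NVIDIA_DISABLE_REQUIRE
        value: "true"
```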
- Does the container have a GPU specified?
  If not, this can manifest as commands (e.g. `nvidia-smi`) being absent from the container.
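A GPU is requested through the pod's resource limits; without the entry below, the NVIDIA runtime typically does not expose the GPU or its tooling inside the container. A minimal sketch (the container name is hypothetical; the resource name assumes the NVIDIA device plugin is installed):

```yaml
containers:
  - name: model-server   # hypothetical container name
    resources:
      limits:
        nvidia.com/gpu: 1  # request one GPU from the NVIDIA device plugin
```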