Version: 26.1.2
Installation method: Kubernetes/Helm (NIM Operator mode)
Describe the bug:
The NIMCache templates in the nv-ingest Helm chart (version 26.1.2) do not render nodeSelector or tolerations fields, even when these values are provided in the Helm values file. This causes NIMCache pods to be unschedulable on clusters where GPU nodes have taints or require specific node selection.
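For reference, a values file along these lines should be enough to reproduce the problem. This is a minimal sketch: the `nimOperator.embedqa` keys are taken from the template excerpt in this report, the node pool label and taint key are the ones used in the workaround, and both will differ per cluster. The equivalent keys for the other NIMs are not shown here and are assumed to follow the same pattern.

```yaml
# values.yaml (sketch; label and taint values are cluster-specific examples)
nimOperator:
  embedqa:
    # Rendered into the NIMService, but not into the NIMCache
    nodeSelector:
      cloud.google.com/gke-nodepool: gpu-pool
    tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```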
In templates/llama-3.2-nv-embedqa-1b-v2.yaml, the NIMService section correctly renders nodeSelector and tolerations:
kind: NIMService
spec:
  nodeSelector:
    {{ toYaml .Values.nimOperator.embedqa.nodeSelector | nindent 4 }}
  tolerations:
    {{ toYaml .Values.nimOperator.embedqa.tolerations | nindent 4 }}
However, the NIMCache section in the same template only renders source and storage, omitting nodeSelector and tolerations entirely:
kind: NIMCache
spec:
  source:
    ngc:
      modelPuller: "..."
      pullSecret: "..."
      authSecret: ...
  storage:
    pvc:
      ...
Expected behavior:
The NIMCache templates should include nodeSelector and tolerations fields, similar to NIMService:
kind: NIMCache
spec:
  source:
    ...
  storage:
    ...
  nodeSelector:
    {{ toYaml .Values.nimOperator.embedqa.nodeSelector | nindent 4 }}
  tolerations:
    {{ toYaml .Values.nimOperator.embedqa.tolerations | nindent 4 }}
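One possible shape for the fix is sketched below; it is a suggestion, not the chart's actual code. It wraps each field in a Helm `with` block so that nothing is rendered when the value is unset or empty, and it assumes the NIMCache spec accepts `nodeSelector` and `tolerations` at the top level of `spec`, as the workaround patch implies. Only the embedqa values path is shown; the other affected templates would use their own values keys.

```yaml
kind: NIMCache
spec:
  source:
    ...
  storage:
    ...
  # Emit nodeSelector only when a non-empty value is provided
  {{- with .Values.nimOperator.embedqa.nodeSelector }}
  nodeSelector:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  # Emit tolerations only when a non-empty value is provided
  {{- with .Values.nimOperator.embedqa.tolerations }}
  tolerations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```

The `with` guard avoids rendering an empty `nodeSelector:` or `tolerations:` key (or a literal `null`) for users who do not set these values.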
Workaround:
Currently, each NIMCache resource has to be patched manually after Helm deployment:
kubectl patch nimcache <name> -n nim --type=merge -p '{
  "spec": {
    "nodeSelector": {"cloud.google.com/gke-nodepool": "gpu-pool"},
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]
  }
}'
Affected templates:
- templates/llama-3.2-nv-embedqa-1b-v2.yaml
- templates/nemoretriever-graphic-elements-v1.yaml
- templates/nemoretriever-ocr-v1.yaml
- templates/nemoretriever-page-elements-v3.yaml
- templates/nemoretriever-table-structure-v1.yaml