-
Notifications
You must be signed in to change notification settings - Fork 436
[DRAFT]node-feature-rules: Add 0x2321 as CC-capable device #1798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| - feature: pci.device | ||
| matchExpressions: | ||
| vendor: {op: In, value: ["10de"]} | ||
| device: {op: In, value: ["2321"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we won't add devices piece meal. Can "whitelist" : 0x23**, 0x2b**, GBXXX -- Blackwell, GBXXX -- Hopper. Some may need to be excluded via "blacklist" then: exclude 2b00 TA1090SA [THOR].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing to note is that matchExpressions don't allow wildcards (as far as I am aware). Is there another component that could / should create thes labels instead of a nodefeature rule directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way, we could do something like the following:
- name: "NVIDIA Hopper GPU Family"
labels:
"nvidia.com/gpu.family": "hopper"
matchFeatures:
- feature: pci.device
matchExpressions:
vendor: {op: In, value: ["10de"]}
device: {op: InRegexp, value: ["^23[0-9a-f]{2}$"]}
- name: "NVIDIA Blackwell GPU Family"
labels:
"nvidia.com/gpu.family": "blackwell"
matchFeatures:
- feature: pci.device
matchExpressions:
vendor: {op: In, value: ["10de"]}
device: {op: InRegexp, value: ["^2b[0-9a-f]{2}$"]}
| valueFrom: | ||
| fieldRef: | ||
| fieldPath: spec.nodeName | ||
| - name: CC_CAPABLE_DEVICE_IDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder where this CC_CAPABLE_DEVICE_IDS variable is being referenced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! It looks like we may need a change in the k8s-cc-manager as well then. If we want to allow all 23 and 2b Hopper/Blackwell GPUs, we may rather not want to pass a list of specific GPUs.
| imagePullSecrets: [] | ||
| env: | ||
| - name: CC_CAPABLE_DEVICE_IDS | ||
| value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned offline: The envvars from the values file should probably be removed so that a user can properly override them. The defaults should be specified in the daemonset template instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Offline we had a discussion that the change were some of the dev defaults were removed in values.yaml was the following: #1580 - the ccManager envvars may have potentially been missed to remove.
|
Closing this pull request as we will address this in a more generic way via #1973 |
No description provided.