@manuelh-dev

No description provided.

@copy-pr-bot

copy-pr-bot bot commented Oct 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- feature: pci.device
  matchExpressions:
    vendor: {op: In, value: ["10de"]}
    device: {op: In, value: ["2321"]}
Author

@manuelh-dev manuelh-dev Oct 27, 2025

Ideally we would not add devices piecemeal. Could we "whitelist" whole ranges instead: 0x23**, 0x2b**, GBXXX -- Blackwell, GBXXX -- Hopper? Some devices may then need to be excluded via a "blacklist", for example: exclude 2b00 TA1090SA [THOR].

Member

One thing to note is that matchExpressions don't allow wildcards (as far as I am aware). Is there another component that could or should create these labels instead of a node feature rule directly?

Author

@manuelh-dev manuelh-dev Nov 4, 2025

By the way, we could do something like the following:

- name: "NVIDIA Hopper GPU Family"
  labels:
    "nvidia.com/gpu.family": "hopper"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^23[0-9a-f]{2}$"]}
- name: "NVIDIA Blackwell GPU Family"
  labels:
    "nvidia.com/gpu.family": "blackwell"
  matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["10de"]}
        device: {op: InRegexp, value: ["^2b[0-9a-f]{2}$"]}
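
The two patterns above can be sanity-checked outside the cluster. The following is only a minimal Python sketch, not part of this PR: NFD is written in Go, so the real matching is presumably done by Go's regexp engine, but patterns this simple behave the same way in Python. The `gpu_family` helper and the `EXCLUDED_DEVICE_IDS` set are hypothetical names; the exclusion of 2b00 only mirrors the suggestion made earlier in this thread, and device IDs are assumed to be lowercase hex without a 0x prefix, as in the rules above.

```python
import re

# Patterns copied from the proposed rules in this comment.
FAMILY_PATTERNS = {
    "hopper": re.compile(r"^23[0-9a-f]{2}$"),
    "blackwell": re.compile(r"^2b[0-9a-f]{2}$"),
}

# Hypothetical exclusion list (assumption): 2b00 was named earlier in the
# thread as a device that should not be labeled.
EXCLUDED_DEVICE_IDS = {"2b00"}


def gpu_family(device_id):
    """Return the family label for a PCI device ID, or None if nothing matches."""
    device_id = device_id.lower()
    if device_id in EXCLUDED_DEVICE_IDS:
        return None
    for family, pattern in FAMILY_PATTERNS.items():
        if pattern.match(device_id):
            return family
    return None


if __name__ == "__main__":
    for dev in ("2321", "2b85", "2b00", "1db6"):
        print(dev, gpu_family(dev))
```

Note that the regex rules as written would still label 2b00, so an exclusion like the one sketched here would need its own expression in the rule if that device has to be left out.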

  valueFrom:
    fieldRef:
      fieldPath: spec.nodeName
- name: CC_CAPABLE_DEVICE_IDS
Author

I wonder where this CC_CAPABLE_DEVICE_IDS variable is being referenced.

Member

Author

Thank you! It looks like we may need a change in the k8s-cc-manager as well then. If we want to allow all 23xx and 2bxx Hopper/Blackwell GPUs, we would rather not pass a list of specific device IDs.
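
To illustrate the difference between an exact list and a family-wide match, here is a rough Python sketch. It does not reflect how k8s-cc-manager actually parses CC_CAPABLE_DEVICE_IDS (that code is not shown in this thread); `normalize`, `parse_device_ids`, `is_cc_capable`, and the `prefixes` parameter are hypothetical names used only for illustration.

```python
def normalize(device_id):
    """Lowercase a device ID and drop an optional 0x prefix."""
    device_id = device_id.strip().lower()
    return device_id[2:] if device_id.startswith("0x") else device_id


def parse_device_ids(raw):
    """Parse a comma-separated list such as "0x2339,0x2331" into a set of IDs."""
    return {normalize(item) for item in raw.split(",") if item.strip()}


def is_cc_capable(device_id, exact_ids, prefixes=()):
    """Accept an ID that is listed exactly or that starts with a family prefix."""
    device_id = normalize(device_id)
    # str.startswith accepts a tuple; an empty tuple never matches.
    return device_id in exact_ids or device_id.startswith(tuple(prefixes))


if __name__ == "__main__":
    ids = parse_device_ids("0x2339,0x2331,0x2330,0x2324,0x2322,0x233d")
    print(is_cc_capable("0x2321", ids))                # exact list only
    print(is_cc_capable("0x2321", ids, ("23", "2b")))  # with family prefixes
```

With the exact list alone, a device such as 0x2321 is rejected until someone adds it to the list; a prefix (or regex) based check would cover the whole family without further changes.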

imagePullSecrets: []
env:
  - name: CC_CAPABLE_DEVICE_IDS
    value: "0x2339,0x2331,0x2330,0x2324,0x2322,0x233d"
Member

As mentioned offline: The envvars from the values file should probably be removed so that a user can properly override them. The defaults should be specified in the daemonset template instead.

Author

Thank you! As discussed offline, the change where some of the dev defaults were removed from values.yaml was #1580; the ccManager envvars may have been missed in that cleanup.

@manuelh-dev
Author

Closing this pull request as we will address this in a more generic way via #1973

@manuelh-dev manuelh-dev closed this Jan 6, 2026