MGMT-23230: Add MachineConfig for AMD GPU kernel module blacklist#9920
MGMT-23230: Add MachineConfig for AMD GPU kernel module blacklist#9920leo8a wants to merge 1 commit intoopenshift:masterfrom
Conversation
|
@leo8a: This pull request references MGMT-23230 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #9920 +/- ##
==========================================
- Coverage 44.25% 44.24% -0.01%
==========================================
Files 415 415
Lines 72499 72499
==========================================
- Hits 32082 32079 -3
- Misses 37507 37509 +2
- Partials 2910 2911 +1 🚀 New features to boost your workflow:
|
|
/test edge-e2e-metal-assisted-openshift-ai-4-20 |
|
/hold |
|
working with AMD to remove / clarify whether this requirement is at all needed |
|
@leo8a: This pull request references MGMT-23230 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (2)
WalkthroughAdded a Ginkgo unit test that verifies generated AMD GPU MachineConfig manifests include worker and master blacklist entries, and added a new MachineConfig template that writes Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
/unhold This blacklist is also required for gpu-operator upgrade scenarios, as the in-tree amdgpu module can interfere with loading new driver versions during the upgrade lifecycle of the AMD gpu-operator. /cc @LaVLaS @yevgeny-shnaidman |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/operators/amdgpu/amd_gpu_manifests_test.go (1)
52-55: Strengthen this test to validate rendered intent, not just token presence.Current substring checks can still pass if role labels or blacklist payload drift. Add assertions for role labels and encoded content to make regressions visible.
Proposed test hardening
It("Includes MachineConfig for amdgpu kernel module blacklist", func() { _, customManifest, err := operator.GenerateManifests(cluster) Expect(err).ToNot(HaveOccurred()) - Expect(string(customManifest)).To(ContainSubstring("kind: MachineConfig")) - Expect(string(customManifest)).To(ContainSubstring("99-amdgpu-module-blacklist-worker")) - Expect(string(customManifest)).To(ContainSubstring("99-amdgpu-module-blacklist-master")) - Expect(string(customManifest)).To(ContainSubstring("/etc/modprobe.d/amdgpu-blacklist.conf")) + rendered := string(customManifest) + Expect(rendered).To(ContainSubstring("kind: MachineConfig")) + Expect(rendered).To(ContainSubstring("99-amdgpu-module-blacklist-worker")) + Expect(rendered).To(ContainSubstring("99-amdgpu-module-blacklist-master")) + Expect(rendered).To(ContainSubstring("machineconfiguration.openshift.io/role: worker")) + Expect(rendered).To(ContainSubstring("machineconfiguration.openshift.io/role: master")) + Expect(rendered).To(ContainSubstring("/etc/modprobe.d/amdgpu-blacklist.conf")) + Expect(rendered).To(ContainSubstring("YmxhY2tsaXN0IGFtZGdwdQo=")) // "blacklist amdgpu\n" })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/operators/amdgpu/amd_gpu_manifests_test.go` around lines 52 - 55, The current test only checks token presence in customManifest; strengthen it to validate rendered intent by asserting role labels and the actual blacklist payload: locate the test's customManifest variable in amd_gpu_manifests_test.go and add assertions that the MachineConfig manifests include the correct node selector labels (e.g., "role: worker" and "role: master" or the exact nodeSelector keys used when rendering), and verify the blacklist file content rather than just its path—extract the ConfigMap/Secret data blob from customManifest (or base64-decode the embedded value if rendered encoded) and assert it contains the expected blacklist line (for example "blacklist amdgpu" or the exact payload string used in the generator) so regressions in labels or payload are caught.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@internal/operators/amdgpu/amd_gpu_manifests_test.go`:
- Around line 52-55: The current test only checks token presence in
customManifest; strengthen it to validate rendered intent by asserting role
labels and the actual blacklist payload: locate the test's customManifest
variable in amd_gpu_manifests_test.go and add assertions that the MachineConfig
manifests include the correct node selector labels (e.g., "role: worker" and
"role: master" or the exact nodeSelector keys used when rendering), and verify
the blacklist file content rather than just its path—extract the
ConfigMap/Secret data blob from customManifest (or base64-decode the embedded
value if rendered encoded) and assert it contains the expected blacklist line
(for example "blacklist amdgpu" or the exact payload string used in the
generator) so regressions in labels or payload are caught.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9e1ee47d-d36a-4b19-a85b-b20403691a55
📒 Files selected for processing (2)
internal/operators/amdgpu/amd_gpu_manifests_test.gointernal/operators/amdgpu/templates/custom/amdgpu_module_blacklist.yaml
Add MachineConfig to blacklist amdgpu in-tree kernel module for out-of-tree driver installation. Includes worker and master configs for SNO and multi-node support. This blacklist is also required for gpu-operator upgrade scenarios, as the in-tree amdgpu module can interfere with loading new driver versions during the upgrade lifecycle. Signed-off-by: Leonardo Ochoa-Aday <lochoa@redhat.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: leo8a The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Looks good. This is mostly likely a requirement for all currently supported OCP 4.x clusters. |
|
/retest |
|
@leo8a: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Add MachineConfig to blacklist amdgpu in-tree kernel module, required for AMD GPU out-of-tree driver installation. Includes configurations for both worker and master nodes to support SNO and multi-node deployments.
This blacklist is also required for gpu-operator upgrade scenarios, as the in-tree amdgpu module can interfere with loading new driver versions during the upgrade lifecycle.
Changes:
/etc/modprobe.d/amdgpu-blacklist.conf(blacklist amdgpu)Reference: https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module
Jira: https://issues.redhat.com/browse/MGMT-23230
List all the issues related to this PR
What environments does this code impact?
Template change only. Unit tests validate MachineConfig generation.