MGMT-23230: Add MachineConfig for AMD GPU kernel module blacklist #9920

Open
leo8a wants to merge 1 commit into openshift:master from leo8a:amdgpu-blacklist

@leo8a leo8a commented Feb 23, 2026

Summary

Add MachineConfig to blacklist amdgpu in-tree kernel module, required for AMD GPU out-of-tree driver installation. Includes configurations for both worker and master nodes to support SNO and multi-node deployments.

This blacklist is also required for gpu-operator upgrade scenarios, as the in-tree amdgpu module can interfere with loading new driver versions during the upgrade lifecycle.

Changes:

  • Add amdgpu_module_blacklist.yaml template with worker and master MachineConfigs
  • Blacklist configuration: /etc/modprobe.d/amdgpu-blacklist.conf (blacklist amdgpu)
  • Add test to validate MachineConfig generation
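
As a rough illustration of the blacklist payload listed above, the following Go sketch encodes the modprobe directive the way it would need to appear in an Ignition file entry. The `encodePayload` and `dataURL` helpers and the `data:text/plain;charset=utf-8;base64,` prefix are assumptions made for this example; the PR's actual template is not shown in this thread and may use a different prefix.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// modprobeLine is the single directive written to
// /etc/modprobe.d/amdgpu-blacklist.conf, including the trailing newline.
const modprobeLine = "blacklist amdgpu\n"

// encodePayload returns the standard base64 encoding of the file content.
// (Hypothetical helper, not part of the PR.)
func encodePayload(content string) string {
	return base64.StdEncoding.EncodeToString([]byte(content))
}

// dataURL wraps the encoded payload in a data URL. The media type prefix
// is an assumption; the real template may differ.
func dataURL(content string) string {
	return "data:text/plain;charset=utf-8;base64," + encodePayload(content)
}

func main() {
	fmt.Println(encodePayload(modprobeLine)) // YmxhY2tsaXN0IGFtZGdwdQo=
	fmt.Println(dataURL(modprobeLine))
}
```

The encoded value matches the string asserted in the review suggestion later in this thread.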

Reference: https://instinct.docs.amd.com/projects/gpu-operator/en/latest/installation/openshift-olm.html#create-blacklist-for-installing-out-of-tree-kernel-module

Jira: https://issues.redhat.com/browse/MGMT-23230

List all the issues related to this PR

  • Bug fix
  • New Feature
  • Enhancement
  • Tests
  • Documentation
  • CI/CD

What environments does this code impact?

  • Operator Managed Deployments
  • Automation (CI, tools, etc)
  • Cloud
  • None

Template change only. Unit tests validate MachineConfig generation.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 23, 2026

openshift-ci-robot commented Feb 23, 2026

@leo8a: This pull request references MGMT-23230 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 23, 2026

codecov Bot commented Feb 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.24%. Comparing base (2bcd942) to head (a2b749c).
⚠️ Report is 60 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9920      +/-   ##
==========================================
- Coverage   44.25%   44.24%   -0.01%     
==========================================
  Files         415      415              
  Lines       72499    72499              
==========================================
- Hits        32082    32079       -3     
- Misses      37507    37509       +2     
- Partials     2910     2911       +1     

see 2 files with indirect coverage changes


leo8a commented Feb 24, 2026

/test edge-e2e-metal-assisted-openshift-ai-4-20
/test edge-e2e-metal-assisted-virtualization-4-20

leo8a commented Feb 27, 2026

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2026

leo8a commented Mar 18, 2026

Working with AMD to clarify whether this requirement is needed at all or can be removed.

openshift-ci-robot commented Apr 9, 2026

@leo8a: This pull request references MGMT-23230 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

@leo8a leo8a force-pushed the amdgpu-blacklist branch from e779b31 to 3b60390 Compare April 9, 2026 08:41

coderabbitai Bot commented Apr 9, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f527da6e-f2bf-48be-8312-9e90f36ebb39

📥 Commits

Reviewing files that changed from the base of the PR and between 3b60390 and a2b749c.

📒 Files selected for processing (2)
  • internal/operators/amdgpu/amd_gpu_manifests_test.go
  • internal/operators/amdgpu/templates/custom/amdgpu_module_blacklist.yaml
✅ Files skipped from review due to trivial changes (2)
  • internal/operators/amdgpu/amd_gpu_manifests_test.go
  • internal/operators/amdgpu/templates/custom/amdgpu_module_blacklist.yaml

Walkthrough

Added a Ginkgo unit test that verifies generated AMD GPU MachineConfig manifests include worker and master blacklist entries, and added a new MachineConfig template that writes /etc/modprobe.d/amdgpu-blacklist.conf via Ignition to both worker and master nodes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Test Case: internal/operators/amdgpu/amd_gpu_manifests_test.go | New Ginkgo It test calling operator.GenerateManifests(cluster) and asserting presence of 99-amdgpu-module-blacklist-worker, 99-amdgpu-module-blacklist-master, matching role labels, /etc/modprobe.d/amdgpu-blacklist.conf path, and base64-encoded blacklist amdgpu\n content. |
| MachineConfig Template: internal/operators/amdgpu/templates/custom/amdgpu_module_blacklist.yaml | New template adding two MachineConfig resources (99-amdgpu-module-blacklist-worker, 99-amdgpu-module-blacklist-master) using Ignition 3.2.0 to create /etc/modprobe.d/amdgpu-blacklist.conf with base64-encoded content and overwrite: true. |
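
To make the walkthrough above more concrete, here is a minimal Go sketch that renders one MachineConfig per role from a text template. It is not the template from the PR. The resource names, role label key, Ignition version, file path, `overwrite: true`, and payload come from this thread, while `machineConfigTmpl`, `renderMachineConfigs`, the `mode: 420` value, and the data URL prefix are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// machineConfigTmpl is a hypothetical stand-in for the PR's template.
// mode 420 (octal 0644) and the data URL prefix are assumptions.
const machineConfigTmpl = `apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-amdgpu-module-blacklist-{{ .Role }}
  labels:
    machineconfiguration.openshift.io/role: {{ .Role }}
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modprobe.d/amdgpu-blacklist.conf
          mode: 420
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,YmxhY2tsaXN0IGFtZGdwdQo=
`

// renderMachineConfigs renders one YAML document per role and joins
// them with the YAML document separator.
func renderMachineConfigs(roles []string) (string, error) {
	tmpl, err := template.New("blacklist").Parse(machineConfigTmpl)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	for i, role := range roles {
		if i > 0 {
			out.WriteString("---\n")
		}
		if err := tmpl.Execute(&out, struct{ Role string }{Role: role}); err != nil {
			return "", err
		}
	}
	return out.String(), nil
}

func main() {
	manifest, err := renderMachineConfigs([]string{"worker", "master"})
	if err != nil {
		panic(err)
	}
	fmt.Print(manifest)
}
```

Rendering both roles matches the PR's stated goal: on SNO the single node carries the master role, so a worker-only MachineConfig would not reach it.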

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

leo8a commented Apr 9, 2026

/unhold

This blacklist is also required for gpu-operator upgrade scenarios, as the in-tree amdgpu module can interfere with loading new driver versions during the upgrade lifecycle of the AMD gpu-operator.

/cc @LaVLaS @yevgeny-shnaidman
/assign @pastequo

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 9, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
internal/operators/amdgpu/amd_gpu_manifests_test.go (1)

52-55: Strengthen this test to validate rendered intent, not just token presence.

Current substring checks can still pass if role labels or blacklist payload drift. Add assertions for role labels and encoded content to make regressions visible.

Proposed test hardening
 It("Includes MachineConfig for amdgpu kernel module blacklist", func() {
 	_, customManifest, err := operator.GenerateManifests(cluster)
 	Expect(err).ToNot(HaveOccurred())
-	Expect(string(customManifest)).To(ContainSubstring("kind: MachineConfig"))
-	Expect(string(customManifest)).To(ContainSubstring("99-amdgpu-module-blacklist-worker"))
-	Expect(string(customManifest)).To(ContainSubstring("99-amdgpu-module-blacklist-master"))
-	Expect(string(customManifest)).To(ContainSubstring("/etc/modprobe.d/amdgpu-blacklist.conf"))
+	rendered := string(customManifest)
+	Expect(rendered).To(ContainSubstring("kind: MachineConfig"))
+	Expect(rendered).To(ContainSubstring("99-amdgpu-module-blacklist-worker"))
+	Expect(rendered).To(ContainSubstring("99-amdgpu-module-blacklist-master"))
+	Expect(rendered).To(ContainSubstring("machineconfiguration.openshift.io/role: worker"))
+	Expect(rendered).To(ContainSubstring("machineconfiguration.openshift.io/role: master"))
+	Expect(rendered).To(ContainSubstring("/etc/modprobe.d/amdgpu-blacklist.conf"))
+	Expect(rendered).To(ContainSubstring("YmxhY2tsaXN0IGFtZGdwdQo=")) // "blacklist amdgpu\n"
 })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/operators/amdgpu/amd_gpu_manifests_test.go` around lines 52 - 55,
the current test only checks token presence in customManifest. Strengthen it to
validate rendered intent by asserting the role labels and the actual blacklist
payload. Locate the test's customManifest variable in amd_gpu_manifests_test.go
and add assertions that the MachineConfig manifests carry the correct role
labels (e.g., "machineconfiguration.openshift.io/role: worker" and
"machineconfiguration.openshift.io/role: master", or the exact label keys used
when rendering). Also verify the blacklist file content rather than just its
path: base64-decode the file contents embedded in customManifest and assert
that they contain the expected blacklist line (for example "blacklist amdgpu"
or the exact payload string used in the generator), so that regressions in
labels or payload are caught.
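
One way to act on the suggestion above, sketched outside Ginkgo so that it runs standalone, is to decode the embedded payloads instead of matching the encoded string. The `decodedPayloads` and `checkBlacklist` helpers, the regular expression, and the sample manifest are all hypothetical. The sketch assumes the rendered YAML embeds the content as a `base64,` data URL, which this thread does not confirm.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
	"strings"
)

// base64Source matches the payload of a base64 data URL. The data URL
// shape is an assumption about the rendered manifest.
var base64Source = regexp.MustCompile(`base64,([A-Za-z0-9+/=]+)`)

// decodedPayloads returns every base64 payload found in the manifest, decoded.
func decodedPayloads(manifest string) ([]string, error) {
	var out []string
	for _, m := range base64Source.FindAllStringSubmatch(manifest, -1) {
		raw, err := base64.StdEncoding.DecodeString(m[1])
		if err != nil {
			return nil, fmt.Errorf("decode %q: %w", m[1], err)
		}
		out = append(out, string(raw))
	}
	return out, nil
}

// checkBlacklist returns the first problem found: a missing role label,
// a missing payload, or a payload that is not the expected directive.
func checkBlacklist(manifest string) error {
	for _, role := range []string{"worker", "master"} {
		label := "machineconfiguration.openshift.io/role: " + role
		if !strings.Contains(manifest, label) {
			return fmt.Errorf("missing role label %q", label)
		}
	}
	payloads, err := decodedPayloads(manifest)
	if err != nil {
		return err
	}
	if len(payloads) == 0 {
		return fmt.Errorf("no base64 payload found")
	}
	for _, p := range payloads {
		if p != "blacklist amdgpu\n" {
			return fmt.Errorf("unexpected payload %q", p)
		}
	}
	return nil
}

func main() {
	// Hypothetical stand-in for string(customManifest) in the real test.
	sample := "machineconfiguration.openshift.io/role: worker\n" +
		"source: data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo=\n" +
		"---\n" +
		"machineconfiguration.openshift.io/role: master\n" +
		"source: data:text/plain;base64,YmxhY2tsaXN0IGFtZGdwdQo=\n"
	fmt.Println(checkBlacklist(sample)) // <nil>
}
```

In the Ginkgo test, the same check could be expressed as `Expect(checkBlacklist(string(customManifest))).To(Succeed())`, assuming the helper were added alongside the test.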

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9e1ee47d-d36a-4b19-a85b-b20403691a55

📥 Commits

Reviewing files that changed from the base of the PR and between 2bcd942 and 3b60390.

📒 Files selected for processing (2)
  • internal/operators/amdgpu/amd_gpu_manifests_test.go
  • internal/operators/amdgpu/templates/custom/amdgpu_module_blacklist.yaml

Add MachineConfig to blacklist amdgpu in-tree kernel module for
out-of-tree driver installation. Includes worker and master configs
for SNO and multi-node support.

This blacklist is also required for gpu-operator upgrade scenarios, as
the in-tree amdgpu module can interfere with loading new driver versions
during the upgrade lifecycle.

Signed-off-by: Leonardo Ochoa-Aday <lochoa@redhat.com>
@leo8a leo8a force-pushed the amdgpu-blacklist branch from 3b60390 to a2b749c Compare April 9, 2026 09:09

openshift-ci Bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: leo8a
Once this PR has been reviewed and has the lgtm label, please ask for approval from pastequo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

LaVLaS commented Apr 10, 2026

Looks good. This is most likely a requirement for all currently supported OCP 4.x clusters.

leo8a commented Apr 22, 2026

/retest

openshift-ci Bot commented Apr 22, 2026

@leo8a: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/edge-e2e-metal-assisted-virtualization-4-20 | e779b31 | link | true | /test edge-e2e-metal-assisted-virtualization-4-20 |
| ci/prow/edge-e2e-metal-assisted-virtualization-4-21 | a2b749c | link | true | /test edge-e2e-metal-assisted-virtualization-4-21 |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants