OCPBUGS-78148: [release-4.21] block device plugin until SR-IOV config applied #1178
Add blockDevicePluginUntilConfigured feature gate that prevents the SR-IOV device plugin from starting until the sriov-config-daemon has applied the configuration for the node.

When enabled, the device plugin daemonset runs an init container that sets a wait-for-config annotation on its pod. The init container then waits until the sriov-config-daemon removes this annotation, which happens after the daemon has applied the SR-IOV configuration for the node.

This feature addresses the race condition where the device plugin starts and reports available resources before the configuration is actually applied, which can lead to pods being scheduled prematurely.

Key changes:
- Add wait-for-config subcommand to sriov-network-config-daemon
- Add init container to device plugin daemonset (when feature enabled)
- Add logic in daemon to remove annotation after config is applied
- Add Role/RoleBinding for device plugin pod access

Signed-off-by: Yury Kulazhenkov <ykulazhenkov@nvidia.com>
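For readers who want to see the shape of the handshake described above, here is a minimal sketch of the init container's wait loop. It is not the code from this PR: the annotation key `example.com/wait-for-config`, the `annotationGetter` type, and the function names are all hypothetical, and the Kubernetes API call is replaced by a plain function so the example runs without a cluster.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// waitForConfigAnnotation is a hypothetical key; the real annotation name
// used by the operator is not stated in the PR description.
const waitForConfigAnnotation = "example.com/wait-for-config"

// annotationGetter returns the current annotations of the device plugin pod.
// Assumption: in a real init container this would be a Kubernetes API read.
type annotationGetter func(ctx context.Context) (map[string]string, error)

// isBlocked reports whether the wait-for-config annotation is still present.
func isBlocked(annotations map[string]string) bool {
	_, present := annotations[waitForConfigAnnotation]
	return present
}

// waitUntilAnnotationRemoved polls until the annotation disappears,
// the getter fails, or ctx is cancelled.
func waitUntilAnnotationRemoved(ctx context.Context, get annotationGetter, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		annotations, err := get(ctx)
		if err != nil {
			return fmt.Errorf("reading pod annotations: %w", err)
		}
		if !isBlocked(annotations) {
			return nil // the config daemon has removed the annotation
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

func main() {
	// Simulate the config daemon removing the annotation before the third poll.
	polls := 0
	get := func(_ context.Context) (map[string]string, error) {
		polls++
		if polls < 3 {
			return map[string]string{waitForConfigAnnotation: "true"}, nil
		}
		return map[string]string{}, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	err := waitUntilAnnotationRemoved(ctx, get, 10*time.Millisecond)
	fmt.Println("unblocked:", err == nil,
		"timed out:", errors.Is(err, context.DeadlineExceeded),
		"polls:", polls)
}
```

The point of the sketch is the ordering: the device plugin container cannot start until the init container exits, and the init container only exits once the annotation is gone.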
When the blockDevicePluginUntilConfigured feature gate is enabled and there are no SriovNetworkNodePolicy resources targeting a node, the config-daemon's apply() function calls waitForDevicePluginPodAndTryUnblock which polls for up to 2 minutes waiting for a device plugin pod that will never arrive. The device plugin daemonset is only scheduled on nodes with policies (SriovDevicePluginLabel=Enabled), so this wait always times out when Spec.Interfaces is empty.

Skip the device plugin wait and the periodic unblock API call when the desired node state has no interfaces configured. This matches the existing guard in tryUnblockDevicePlugin() which already checks for empty interfaces before removing the wait-for-config annotation.

Signed-off-by: Sebastian Sch <sebassch@gmail.com>
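The guard described in this commit can be sketched roughly as follows. This is an illustration only, not the change itself: `interfaceSpec`, `nodeStateSpec`, and `shouldWaitForDevicePlugin` are hypothetical, trimmed-down stand-ins for the operator's real types and for the check added to apply().

```go
package main

import "fmt"

// interfaceSpec and nodeStateSpec are hypothetical stand-ins for the
// operator's node state types; only the field needed here is kept.
type interfaceSpec struct {
	PciAddress string
	NumVfs     int
}

type nodeStateSpec struct {
	Interfaces []interfaceSpec
}

// shouldWaitForDevicePlugin reports whether the daemon should wait for the
// device plugin pod. With no interfaces in the desired state, no policy
// targets the node, so no device plugin pod is scheduled there and waiting
// could only end in a timeout.
func shouldWaitForDevicePlugin(featureEnabled bool, desired nodeStateSpec) bool {
	return featureEnabled && len(desired.Interfaces) > 0
}

func main() {
	empty := nodeStateSpec{}
	configured := nodeStateSpec{
		Interfaces: []interfaceSpec{{PciAddress: "0000:3b:00.0", NumVfs: 4}},
	}

	fmt.Println("no interfaces, gate on: ", shouldWaitForDevicePlugin(true, empty))
	fmt.Println("interfaces, gate on:    ", shouldWaitForDevicePlugin(true, configured))
	fmt.Println("interfaces, gate off:   ", shouldWaitForDevicePlugin(false, configured))
}
```

Keeping the check in one predicate makes it easy to apply the same condition to both the initial wait and the periodic unblock call.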
|
/jira cherrypick OCPBUGS-66342 |
|
@zeeke: Jira Issue OCPBUGS-66342 has been cloned as Jira Issue OCPBUGS-78148. Will retitle bug to link to clone.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Force-pushed from 55e0da9 to 4758925
|
Hi @zeeke, can you check this one? It's failing in CI. |
to align to `/deploy/role.yaml` Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
…0.20` Signed-off-by: Andrea Panattoni <apanatto@redhat.com>
Force-pushed from 4758925 to 1390679
|
@zeeke: This pull request references Jira Issue OCPBUGS-78148, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker.
|
/test e2e-telco5g-sriov |
|
@zeeke: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
The failing job is not related to this backport. |
|
@SchSeba please take another look |
|
/jira backport release-4.20 |
|
/jira refresh |
|
@zeeke: This pull request references Jira Issue OCPBUGS-78148, which is invalid:
Comment
|
@SchSeba please take another look |
|
/lgtm |
|
@SchSeba: This PR has been marked as verified by
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: SchSeba, zeeke. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/jira refresh |
|
@zeeke: This pull request references Jira Issue OCPBUGS-78148, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
Requesting review from QA contact:
Merged commit 91d9df4 into openshift:release-4.21
|
@zeeke: Jira Issue OCPBUGS-78148: All pull requests linked via external trackers have merged: All linked pull requests have the
|
/cherrypick release-4.20 |
|
@zeeke: #1178 failed to apply on top of branch "release-4.20":
backport of
Conflicts faced and solved in pkg/daemon/daemon_test.go