
AGENT-1416: Add default NodeDisruptionPolicy for IRI#5683

Open
bfournie wants to merge 1 commit into openshift:main from bfournie:iri-node-disruption-policy


@bfournie
Contributor

@bfournie bfournie commented Feb 23, 2026

Avoid node reboot upon deletion of the IRI resource, as it's not necessary.

Added "None" action policy for the following which are affected by the IRI resource deletion:
Files:

  1. /etc/iri-registry - TLS certificates directory
  2. /usr/local/bin/load-registry-image.sh - Registry loading script
  3. /var/lib/iri-registry - Registry data directory and subdirs
Units:

  1. iri-registry.service - The systemd service that runs the registry

- What I did
Added default NodeDisruptionPolicy for InternalReleaseImage

Added "None" action policy for the following which are affected by the IRI resource deletion:
Files:

  1. /etc/iri-registry - TLS certificates directory
  2. /usr/local/bin/load-registry-image.sh - Registry loading script
  3. /var/lib/iri-registry - Registry data directory and subdirs
Units:

  1. iri-registry.service - The systemd service that runs the registry
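
The entries listed above could be expressed roughly as follows. This is a minimal sketch, not the code from this PR: the types are simplified local stand-ins for the `opv1` status types quoted later in the review, and `iriDefaultPolicies` is a hypothetical helper name.

```go
package main

import "fmt"

// Simplified stand-ins for the operator API status types
// (assumptions for illustration, not the real opv1 definitions).
type ActionType string

const NoneStatusAction ActionType = "None"

type StatusAction struct {
	Type ActionType
}

type StatusFile struct {
	Path    string
	Actions []StatusAction
}

type StatusUnit struct {
	Name    string
	Actions []StatusAction
}

type ClusterPolicies struct {
	Files []StatusFile
	Units []StatusUnit
}

// noneAction returns a fresh single-element action list of type None.
func noneAction() []StatusAction {
	return []StatusAction{{Type: NoneStatusAction}}
}

// iriDefaultPolicies (hypothetical name) builds the IRI entries from the
// PR description: three file paths and one systemd unit, all with "None".
func iriDefaultPolicies() ClusterPolicies {
	return ClusterPolicies{
		Files: []StatusFile{
			{Path: "/etc/iri-registry", Actions: noneAction()},
			{Path: "/usr/local/bin/load-registry-image.sh", Actions: noneAction()},
			{Path: "/var/lib/iri-registry", Actions: noneAction()},
		},
		Units: []StatusUnit{
			{Name: "iri-registry.service", Actions: noneAction()},
		},
	}
}

func main() {
	p := iriDefaultPolicies()
	for _, f := range p.Files {
		fmt.Printf("file %s -> %s\n", f.Path, f.Actions[0].Type)
	}
	for _, u := range p.Units {
		fmt.Printf("unit %s -> %s\n", u.Name, u.Actions[0].Type)
	}
}
```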

- How to verify it
Check that the new policy is added in the nodeDisruptionPolicyStatus
$ oc get MachineConfiguration/cluster -o yaml

Check that an IRI resource exists:
$ oc get internalreleaseimage cluster -n openshift-machine-config-operator

Delete the InternalReleaseImage resource:
$ oc delete internalreleaseimage cluster -n openshift-machine-config-operator

Check logs
$ oc logs -n openshift-machine-config-operator <pod-name> | grep -iE "node disruption|iri-registry"

Confirm

  • Nodes apply the change without rebooting
  • The iri-registry.service is stopped and removed
  • Files under /etc/iri-registry and /var/lib/iri-registry are cleaned up
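
One way to check the first point is to compare each node's boot ID before and after deleting the resource; the Node object reports it in `status.nodeInfo.bootID`. The sketch below assumes the IDs have already been collected into maps keyed by node name. The node names and IDs are made up, and `rebootedNodes` is a hypothetical helper, not part of the MCO.

```go
package main

import (
	"fmt"
	"sort"
)

// rebootedNodes returns the sorted names of nodes that appear in both
// snapshots with a different boot ID, which indicates a reboot in between.
func rebootedNodes(before, after map[string]string) []string {
	var changed []string
	for name, oldID := range before {
		if newID, ok := after[name]; ok && newID != oldID {
			changed = append(changed, name)
		}
	}
	sort.Strings(changed)
	return changed
}

func main() {
	// Hypothetical snapshots taken before and after the delete.
	before := map[string]string{"master-0": "boot-a", "worker-0": "boot-b"}
	after := map[string]string{"master-0": "boot-a", "worker-0": "boot-c"}

	// An empty result would mean no node rebooted.
	fmt.Println(rebootedNodes(before, after))
}
```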

- Description for the changelog
Added default NodeDisruptionPolicy for InternalReleaseImage

Avoid node reboot upon deletion of the IRI resource, as it's not necessary.

Added "None" action policy for the following which are affected by the
IRI resource deletion:
Files:
  1. /etc/iri-registry - TLS certificates directory
  2. /usr/local/bin/load-registry-image.sh - Registry loading script
  3. /var/lib/iri-registry - Registry data directory and subdirs
Units:
  1. iri-registry.service - The systemd service that runs the registry
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 23, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 23, 2026

@bfournie: This pull request references AGENT-1416 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.



@bfournie
Contributor Author

/jira refresh

@openshift-ci-robot
Contributor

openshift-ci-robot commented Feb 23, 2026

@bfournie: This pull request references AGENT-1416 which is a valid jira issue.


@bfournie
Contributor Author

/cc @andfasano

@openshift-ci openshift-ci bot requested a review from andfasano February 23, 2026 17:18
@andfasano
Contributor

> • Nodes apply the change without rebooting
> • The iri-registry.service is stopped and removed
> • Files under /etc/iri-registry and /var/lib/iri-registry are cleaned up

@bfournie I think only the first point is relevant to test as part of this task.

@openshift-ci
Contributor

openshift-ci bot commented Feb 25, 2026

@bfournie: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-op-ocl | 4561e0e | link | false | /test e2e-gcp-op-ocl |
| ci/prow/e2e-gcp-op-part2 | 4561e0e | link | true | /test e2e-gcp-op-part2 |


@bfournie
Contributor Author

bfournie commented Mar 4, 2026

Testing confirmed that the new policies have been added, as shown below:

$ oc get MachineConfiguration/cluster -o yaml
apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  creationTimestamp: "2026-03-03T20:11:46Z"
  generation: 2
  name: cluster
  resourceVersion: "1061210"
  uid: 615b82c4-4898-4c20-8002-1b869061da91
spec:
  logLevel: Debug
  managementState: Managed
  operatorLogLevel: Normal
status:
  managedBootImagesStatus: {}
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/registries.d
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/crio/policies
      - actions:
        - type: None
        path: /etc/nmstate/openshift
      - actions:
        - restart:
            serviceName: coreos-update-ca-trust.service
          type: Restart
        - restart:
            serviceName: crio.service
          type: Restart
        path: /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt
      - actions:
        - type: None
        path: /etc/iri-registry
      - actions:
        - type: None
        path: /usr/local/bin/load-registry-image.sh
      - actions:
        - type: None
        path: /var/lib/iri-registry
      sshkey:
        actions:
        - type: None
      units:
      - actions:
        - type: None
        name: iri-registry.service
  observedGeneration: 2

Applied a test file to /var/lib/iri-registry/ and confirmed no node reboot.
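
The test above suggests that a policy on a directory also covers files beneath it. A minimal sketch of that kind of lookup follows, assuming the most specific ancestor path wins. This is an illustration only: the MCO's actual matching logic may differ, and `actionFor` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
)

// actionFor walks from the changed path up toward "/" and returns the
// action of the first (most specific) policy entry it finds.
func actionFor(policies map[string]string, changed string) (string, bool) {
	for p := path.Clean(changed); ; p = path.Dir(p) {
		if action, ok := policies[p]; ok {
			return action, true
		}
		if p == "/" || p == "." {
			// Reached the root without finding a policy entry.
			return "", false
		}
	}
}

func main() {
	// A subset of the file policies from the status output above.
	policies := map[string]string{
		"/etc/iri-registry":                     "None",
		"/usr/local/bin/load-registry-image.sh": "None",
		"/var/lib/iri-registry":                 "None",
		"/etc/containers/registries.conf":       "Special",
	}

	// The file name under /var/lib/iri-registry is made up.
	for _, changed := range []string{"/var/lib/iri-registry/test-file", "/etc/hosts"} {
		if action, ok := actionFor(policies, changed); ok {
			fmt.Printf("%s -> %s\n", changed, action)
		} else {
			fmt.Printf("%s -> no matching policy\n", changed)
		}
	}
}
```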

@zaneb
Member

zaneb commented Mar 11, 2026

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 11, 2026
@openshift-ci
Contributor

openshift-ci bot commented Mar 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bfournie, zaneb
Once this PR has been reviewed and has the lgtm label, please assign yuqi-zhang for approval. For more information see the Code Review Process.


Contributor

@andfasano andfasano left a comment


Sorry @bfournie, I forgot to submit my previous review and it remained pending. I have some questions.

		},
	},
	{
		Path: "/var/lib/iri-registry",

Q: Since nothing in the IRI machine configs refers to /var/lib/iri-registry (the daemon handles that part directly), is this entry really required?


(I'm not sure whether the SpecialStatusAction may be required instead.)

	Name: "iri-registry.service",
	Actions: []opv1.NodeDisruptionPolicyStatusAction{
		{
			Type: opv1.NoneStatusAction,

I'm not sure this is going to work. I was expecting at least a ReloadStatusAction.

			Type: opv1.NoneStatusAction,
		},
	},
},

What about the IRI TLS CA cert added to the node trust store? IIRC (@djoshy, please keep me honest) that may force a node reboot in any case.

