INTEROP-9230,INTEROP-9231: Add OPP GA-to-nightly upgrade step#81418

Open
amp-rh wants to merge 5 commits into openshift:main from amp-rh:opp-upgrade-step

Conversation

amp-rh commented Jul 2, 2026

Contributor

INTEROP-9230 + INTEROP-9231: OPP GA-to-Nightly Upgrade Step

Adds an OCP upgrade automation step for OPP (OpenShift Platform Plus)
interop testing. The upgrade step provisions at the latest GA release,
installs OPP operators, then upgrades to a nightly build and validates
that operators survive the version transition.

New files

  • ci-operator/step-registry/interop/opp/upgrade/ - Step registry ref
    that performs GA-to-nightly upgrade with stall detection, admin ack,
    platform health checks, and OPP operator CSV validation
  • ci-operator/config/stolostron/policy-collection/stolostron-policy-collection-main__ocp-upgrade.yaml -
    Periodic job config (cron disabled) for the upgrade variant

Testing

  • Cron is set to Feb 31, a date that never occurs, so the job never fires on schedule; it will be enabled after manual validation
  • Rehearsal job validates config schema and step DAG resolution

/cc @mpruitt-rh

Summary by CodeRabbit

This PR extends the stolostron/policy-collection CI infrastructure (via ci-operator) with a new OCP GA-to-nightly upgrade interop scenario focused on OPP (OpenShift Platform Plus) operator persistence.

It adds a new periodic ocp-upgrade variant config (ci-operator/config/stolostron/policy-collection/stolostron-policy-collection-main__ocp-upgrade.yaml) that defines an interop-opp-upgrade-aws workflow. The job is intentionally disabled (cron set to an invalid Feb 31 schedule until manual validation), includes Slack state reporting, and wires a staged workflow (pre/post/setup plus test steps) that provisions an AWS cluster using the GA baseline images, applies upgrade/install release image overrides, installs the targeted OPP operator set, then performs the GA→nightly upgrade and follow-up collection/deprovisioning and issue reporting.

In support of the workflow, it introduces a new interop step-registry command (ci-operator/step-registry/interop/opp/upgrade/) with:

  • A new step reference (interop-opp-upgrade-ref.yaml) supporting upgrade timeouts, polling, stall detection, and OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE, plus a grace period to avoid transient CI issues.
  • An upgrade/validation implementation script (interop-opp-upgrade-commands.sh) that runs oc adm upgrade to the nightly, detects upgrade stalls (no-progress window), performs CVO/cluster health verification, and validates OPP operator survival by checking expected operator CSVs reach Succeeded and that their pods are ready.
  • Updated OWNERS metadata/anchors for correct reviewer/approver ownership and a new metadata JSON binding the step to the above YAML reference.

Overall, the PR provides automated GA-to-nightly upgrade coverage specifically to ensure OPP operators remain healthy across the version transition, with safeguards for upgrade stalls and comprehensive post-upgrade health checks.
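
As a rough illustration of the no-progress window mentioned in the summary, a stall check can compare each polled status with the previous one and report a stall once the status has stayed unchanged for a configurable window. This is a minimal sketch under stated assumptions, not the code in interop-opp-upgrade-commands.sh: the function name `is_stalled`, its arguments, and the two tracking variables are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical state: last status string seen and the time it last changed.
last_status=""
last_change=0

# is_stalled STATUS NOW WINDOW_SECONDS
# Returns 0 (stalled) when STATUS has not changed for at least WINDOW_SECONDS.
# Returns 1 when progress was observed or the window has not elapsed yet.
is_stalled() {
  local status=$1 now=$2 window=$3
  if [ "${status}" != "${last_status}" ]; then
    # Status changed, so record it and restart the window.
    last_status=${status}
    last_change=${now}
    return 1
  fi
  (( now - last_change >= window ))
}
```

In a polling loop, the status argument would presumably come from the ClusterVersion progress message and `NOW` from `date +%s`; both are assumptions here. Call the function directly in the loop (not in a subshell) so the tracking variables persist between polls.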

New step registry ref (interop-opp-upgrade) and ci-operator config
variant for OPP upgrade testing. Provisions at GA, installs OPP
operators, upgrades to nightly, validates platform and operator health.

Cron disabled; to be enabled after manual validation.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 2, 2026
@openshift-ci-robot

openshift-ci-robot commented Jul 2, 2026

Contributor

@amp-rh: This pull request references INTEROP-9230 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

This pull request references INTEROP-9231 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Contributor

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 1d4224a8-4f45-4bb0-a5ee-e58fec5fa877

📥 Commits

Reviewing files that changed from the base of the PR and between 4173f42 and 334cc5a.

⛔ Files ignored due to path filters (1)
  • ci-operator/jobs/stolostron/policy-collection/stolostron-policy-collection-main-periodics.yaml is excluded by !ci-operator/jobs/**
📒 Files selected for processing (3)
  • ci-operator/config/stolostron/policy-collection/stolostron-policy-collection-main__ocp-upgrade.yaml
  • ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh
  • ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-ref.yaml

Walkthrough

This PR adds a new OCP upgrade CI test for the stolostron policy collection and registers a new interop OPP upgrade step. The step script handles upgrade prechecks, upgrade execution, progress monitoring, cluster stabilization, and post-upgrade validation of platform and operator health.

Changes

Interop OPP upgrade flow

| Layer | File(s) | Summary |
|---|---|---|
| CI config and step wiring | ci-operator/config/stolostron/policy-collection/stolostron-policy-collection-main__ocp-upgrade.yaml, ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-ref.yaml, ci-operator/step-registry/interop/opp/upgrade/OWNERS, ci-operator/step-registry/interop/OWNERS, ci-operator/step-registry/interop/opp/OWNERS, ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-ref.metadata.json | Adds the upgrade test definition, release/resource settings, step references, ownership wiring, and step metadata for the new interop OPP upgrade flow. |
| Upgrade setup and initiation | ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh | Sets shell defaults, emits exit diagnostics, resolves and checks the target image, applies admin-ack and CCO annotation updates, and starts the cluster upgrade. |
| Upgrade monitoring and validation | ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh | Monitors upgrade progress with stall and timeout handling, waits for cluster stability, checks platform health, validates OPP operators, and runs the main orchestration flow. |
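
The operator validation described in the last row can be pictured as a filter over CSV name/phase pairs. The sketch below is an assumption about the shape of such a check, not the PR's implementation: `unhealthy_csvs` is a hypothetical helper that reads `<csv-name> <phase>` lines on stdin, prints every CSV whose phase is not `Succeeded`, and returns non-zero if it printed any.

```shell
#!/usr/bin/env bash

# unhealthy_csvs: hypothetical helper, reads "<csv-name> <phase>" lines on stdin.
# Prints the name of each CSV that is not in the Succeeded phase.
# Returns 1 if at least one such CSV was found, 0 otherwise.
unhealthy_csvs() {
  local name phase rc=0
  while read -r name phase; do
    [ -n "${name}" ] || continue        # skip blank lines
    if [ "${phase}" != "Succeeded" ]; then
      echo "${name}"
      rc=1
    fi
  done
  return "${rc}"
}

# Assumed usage against a live cluster (namespace variable is illustrative):
#   oc get csv -n "${ns}" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
#     | unhealthy_csvs
```

Keeping the parsing separate from the `oc` call makes the check easy to exercise with canned input, and printing only CSV names keeps the log output small.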

Estimated code review effort: 4 (Complex) | ~45 minutes

Sequence Diagram(s)

sequenceDiagram
  participant CIConfig as policy-collection config
  participant StepRef as interop-opp-upgrade-ref.yaml
  participant Script as interop-opp-upgrade-commands.sh
  participant OC as oc CLI
  participant Cluster as OpenShift cluster
  CIConfig->>StepRef: schedule interop-opp-upgrade
  StepRef->>Script: run command script
  Script->>OC: registry login
  Script->>OC: read target release and upgrade status
  Script->>Cluster: apply admin-ack / CCO annotation updates
  Script->>OC: start oc adm upgrade
  loop monitor_upgrade
    Script->>OC: poll upgrade status
    Script->>Cluster: snapshot ClusterVersion
  end
  Script->>OC: wait-for-stable-cluster
  Script->>Cluster: validate platform and OPP operator health

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| No-Sensitive-Data-In-Logs | ❌ Error | Failure logs dump node/CO/MCP describes, ClusterVersion YAML, and full CSV lists, which can expose internal hostnames and cluster data. | Redact or remove broad cluster dumps; log only minimal status fields and sanitize node/host identifiers before printing. |
| Single Node Openshift (Sno) Test Compatibility | ⚠️ Warning | The new upgrade job provisions 6 compute nodes and 3 zones, so it assumes a multi-node cluster and would not work on SNO. | Add a SNO skip/guard (or label) if this workflow must be SNO-safe; otherwise document it as intentionally multi-node only. |
| Ipv6 And Disconnected Network Test Compatibility | ⚠️ Warning | The new upgrade step fetches raw.githubusercontent.com and a workers.dev signature URL, so it depends on public internet in disconnected CI. | Mirror/vendor the external config and signature lookup to internal/cluster sources, or mark the test [Skipped:Disconnected] if internet is unavoidable. |
✅ Passed checks (12 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding an OPP GA-to-nightly upgrade step. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo tests were added or modified; the touched files are CI config, OWNERS, and a shell step script, with no It/Describe/Context/When titles present. |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code is added here; the PR only introduces ci-operator config, step-registry scripts, and OWNERS files. |
| Microshift Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added; the PR only adds ci-operator config/step-registry YAML and a shell step script, so MicroShift test compatibility is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Touched files are CI config/step scripts only; no pod specs, affinity, nodeSelectors, spreads, or PDBs were added. |
| Ote Binary Stdout Contract | ✅ Passed | Only CI step/config files were added; no Go main/TestMain/init or OTE binary stdout writers were introduced. |
| No-Weak-Crypto | ✅ Passed | No weak crypto or secret/token comparison was added; the script only parses sha256 digests and checks upgrade status strings. |
| Container-Privileges | ✅ Passed | Reviewed the added config and step-registry files; none introduce privileged=true, hostPID/Network/IPC, SYS_ADMIN, runAsUser:0, or allowPrivilegeEscalation:true. |

@openshift-merge-bot

Contributor

@amp-rh, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

could not determine changed registry steps: could not load step registry: test `interop-opp-upgrade` has `commands` containing `trap` command, but test step is missing grace_period
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

- Add grace_period: 10m to ref.yaml (required when script uses trap)
- Add OWNERS files at interop/ and interop/opp/ parent directories
- Add generated metadata JSON for step registry
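
For context on the `trap` that triggers the grace_period requirement: an exit handler only helps if it is already defined when the trap fires. The sketch below shows that ordering in isolation. The body of `debug_on_exit` here is a hypothetical stand-in (the PR's real handler is not reproduced), and `run_setup` is an illustrative wrapper that runs a command in a subshell so the EXIT trap fires when that subshell ends.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the step's exit handler.
# It captures the exit status first, before any other command overwrites $?.
debug_on_exit() {
  local rc=$?
  echo "exit diagnostics: status=${rc}"
  return "${rc}"
}

# run_setup CMD [ARGS...]
# Runs CMD in a subshell; the trap is installed only after debug_on_exit
# has been defined, so the handler always resolves to a real function.
run_setup() {
  (
    trap debug_on_exit EXIT
    "$@"
  )
}

run_setup true    # handler reports status 0
```

Calling `run_setup false` instead makes the handler report status 1, and `run_setup` itself returns 1, so the original failure is preserved rather than masked.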
@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jul 2, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh`:
- Around line 12-21: The EXIT/TERM trap is being installed before debug_on_exit
is defined, so failures in the setup commands can trigger an undefined function
and hide the real error. Move the trap setup in interop-opp-upgrade-commands.sh
to after the debug_on_exit function definition, or define debug_on_exit first
and then install the trap, so the trap always resolves to a valid function.
- Around line 299-347: The local IFS setting in the OPP upgrade check is leaking
past the operator parsing and breaking the namespace loop in the same function.
Limit the comma IFS change to the `read -ra operators` call in
`interop-opp-upgrade-commands.sh` and restore normal splitting before the `for
ns in ${opp_namespaces}` loop so `opp_namespaces` iterates correctly over each
namespace for the pod readiness check.
- Around line 187-209: The upgrade timeout logic in monitor_upgrade is tied to
poll iterations instead of real elapsed time, so changing POLL_INTERVAL changes
the effective timeout and command runtime is not counted. Use the existing
start_time in monitor_upgrade to compute elapsed wall-clock time on each loop
iteration and stop when elapsed time reaches UPGRADE_TIMEOUT in minutes, rather
than decrementing a counter once per sleep. Keep the remaining/polling logic and
status collection in monitor_upgrade, but base the timeout check on actual time
so the advertised timeout matches behavior regardless of POLL_INTERVAL or oc
command duration.
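
The wall-clock approach suggested in the last comment can be sketched as follows. This is an illustration under assumptions, not the PR's `monitor_upgrade`: `within_deadline` and `monitor_until` are hypothetical names, the third argument to `within_deadline` exists only so the check can be exercised with a fixed clock, and the command passed to `monitor_until` stands in for whatever status check the real script performs.

```shell
#!/usr/bin/env bash

# within_deadline START_EPOCH TIMEOUT_MINUTES [NOW_EPOCH]
# Returns 0 while fewer than TIMEOUT_MINUTES have elapsed since START_EPOCH.
within_deadline() {
  local start_time=$1 timeout_minutes=$2 now=${3:-$(date +%s)}
  local elapsed=$(( now - start_time ))
  (( elapsed < timeout_minutes * 60 ))
}

# monitor_until TIMEOUT_MINUTES POLL_INTERVAL_SECONDS CMD [ARGS...]
# Re-runs CMD until it succeeds or the wall-clock deadline passes.
# Time spent inside CMD counts toward the timeout, unlike a per-iteration counter.
monitor_until() {
  local timeout_minutes=$1 poll_interval=$2
  shift 2
  local start_time
  start_time=$(date +%s)
  while within_deadline "${start_time}" "${timeout_minutes}"; do
    if "$@"; then
      return 0
    fi
    sleep "${poll_interval}"
  done
  echo "timed out after ${timeout_minutes}m" >&2
  return 1
}
```

Because the deadline is derived from `start_time` on every iteration, changing the poll interval alters only how often the check runs, not how long the loop is allowed to run.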

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: fe747f1d-5afa-4e07-ab19-3ecfbeb1e661

📥 Commits

Reviewing files that changed from the base of the PR and between 0b7a0cc and dc43cd7.

📒 Files selected for processing (4)
  • ci-operator/config/stolostron/policy-collection/stolostron-policy-collection-main__ocp-upgrade.yaml
  • ci-operator/step-registry/interop/opp/upgrade/OWNERS
  • ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh
  • ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-ref.yaml

Comment thread ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh Outdated
Comment thread ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh Outdated
Comment thread ci-operator/step-registry/interop/opp/upgrade/interop-opp-upgrade-commands.sh Outdated
amp-rh added 2 commits July 2, 2026 13:23

- Fix ref.yaml dependencies to use correct name/env mapping
  (name=image stream tag, env=variable name)
- Remove ODF_VERSION_MAJOR_MINOR from config (not declared in any step)
- Remove OPENSHIFT_UPGRADE_RELEASE_IMAGE_OVERRIDE from config deps
  (ref.yaml already declares it correctly)

Generated periodic job definition for the new ocp-upgrade config
variant, matching the format of existing periodics.
@openshift-merge-bot openshift-merge-bot Bot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jul 2, 2026
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Contributor

@amp-rh: you cannot LGTM your own PR.

Details

In response to this:

/lgtm
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

- Move trap installation after debug_on_exit definition so early
  failures in setup commands invoke a defined function
- Use wall-clock deadline instead of iteration counter for upgrade
  timeout so behavior is correct regardless of POLL_INTERVAL value
- Scope IFS=',' to the read call only so the namespace loop in
  validate_opp_operators splits correctly on whitespace
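
The IFS fix in the last item relies on a standard bash behavior: a variable assignment placed directly before a command applies to that command only. A minimal sketch of the pattern, with hypothetical operator and namespace values and an illustrative function name (`split_demo`), might look like this:

```shell
#!/usr/bin/env bash

# split_demo OPERATORS_CSV NAMESPACES
# OPERATORS_CSV is comma-separated; NAMESPACES is whitespace-separated.
split_demo() {
  local operators_csv=$1 opp_namespaces=$2
  local -a operators
  local ns

  # IFS=',' is scoped to this single read; the function's IFS is untouched.
  IFS=',' read -ra operators <<< "${operators_csv}"
  printf 'operator=%s\n' "${operators[@]}"

  # Intentionally unquoted: default IFS splits the list on whitespace.
  for ns in ${opp_namespaces}; do
    printf 'namespace=%s\n' "${ns}"
  done
}

split_demo "operator-a,operator-b" "ns-one ns-two"
```

With a function-wide `local IFS=','` instead, the `for` loop would see `ns-one ns-two` as a single word, which matches the symptom described in the review comment.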
@amp-rh

amp-rh commented Jul 2, 2026

Contributor Author

/pj-rehearse ack

@openshift-merge-bot

Contributor

@amp-rh: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jul 2, 2026
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: amp-rh
Once this PR has been reviewed and has the lgtm label, please assign justinkuli for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot removed the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jul 2, 2026
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@amp-rh: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
|---|---|---|---|
| periodic-ci-stolostron-policy-collection-main-ocp-upgrade-interop-opp-upgrade-aws | N/A | periodic | Periodic changed |

@amp-rh

amp-rh commented Jul 2, 2026

Contributor Author

/pj-rehearse periodic-ci-stolostron-policy-collection-main-ocp-upgrade-interop-opp-upgrade-aws

@openshift-merge-bot

Contributor

@amp-rh: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Contributor

@amp-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/rehearse/periodic-ci-stolostron-policy-collection-main-ocp-upgrade-interop-opp-upgrade-aws | 334cc5a | link | unknown | /pj-rehearse periodic-ci-stolostron-policy-collection-main-ocp-upgrade-interop-opp-upgrade-aws |

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

2 participants