Conversation
dgoodwin
left a comment
Sorry for the delay here. I'm proposing a pretty dramatic alteration of the plan here, but one that will make this much simpler, quicker, and easy to get off the ground. In my mind you can start on implementation of the monitortest described below in origin as soon as you like. We can assist to help understand how to gather lots of data while the PR is open without having to merge it. I'd be quite interested to see what the testing turns up.
| Currently, HA implementation is often left to developers’ discretion,
| leading to inconsistent or insufficient HA configurations.
| Although general guidelines exist ([CONVENTIONS.md](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability)).
Is it fair to say this is the set of conventions you want to enforce with this framework? Are there additional items you would like added? If so I would suggest a PR to that linked enhancement. It helps to have agreed conventions before we start enforcing.
| ### Non-Goals
|
| * Strict enforcement of guidelines that block product releases is out of scope.
This will be easy to do when we're ready with the approaches I will spell out below.
| * As an OpenShift Product Manager, I want a clear overview of HA
|   implementation status across components, so I can identify issues
|   from overall HA quality earlier.
Can you define the list of statuses you envision? Is it just compliant and non-compliant? What other levels/statuses do you envision?
| * This proposal targets only all core and infrastructure-related components,
|   and the other components are out of scope.
|
| ## Proposal
I would propose radical simplification of the proposal. I believe we can meet your goals with well established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
We've done this sort of thing many times; the process is as follows:
Establish the Tests
Typically these are implemented as monitortests, they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
I suggest having the tests only flake when they find a problem for now, so that merging the PR does not cause mass failures. Once all the problems are identified, with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
In this case I envision:
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should define health checks
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should have sufficient replicas for HA
etc.
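To make the naming convention concrete, here is a minimal Go sketch of how such a monitortest might emit its junit results. The `JUnitResult` type, `haTestName`, and `flakeResults` are hypothetical stand-ins, not the real types in origin; the one real convention it leans on is that emitting both a failure and a success for the same test name is aggregated as a flake rather than a hard failure.

```go
package main

import "fmt"

// JUnitResult is a minimal stand-in for a junit test case result;
// the real monitortest framework in origin has its own richer types.
type JUnitResult struct {
	Name   string
	Failed bool
}

// haTestName builds a test name following the convention sketched above:
// one test per component namespace, per HA check.
func haTestName(component, namespace, check string) string {
	return fmt.Sprintf("[Monitor:ha-compliance][Jira:%q] pods in ns/%s %s", component, namespace, check)
}

// flakeResults emits both a failure and a success for the same test
// name; in junit aggregation that combination is treated as a flake
// rather than a hard failure, which gives the "flake only" mode.
func flakeResults(name string) []JUnitResult {
	return []JUnitResult{
		{Name: name, Failed: true},
		{Name: name, Failed: false},
	}
}

func main() {
	name := haTestName("console", "openshift-console", "should define health checks")
	for _, r := range flakeResults(name) {
		fmt.Printf("%s failed=%v\n", r.Name, r.Failed)
	}
}
```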
File Bugs for Violations
Sippy provides the dashboard of current state. Example for the monitortest linked above.
As problems are identified, someone will need to file bugs and add exceptions within the test. Typically we'll label the jiras with a specific label to help keep track. For any approved exception the test will usually permanently flake.
In the event a jira is closed as not applicable, or can't be fixed by engineering or PM, the exception should likely transition to a permanently approved whitelist entry with a comment explaining why, or a link to the jira that explains.
Once the test is stable in the wild, new violations will immediately start failing jobs, and we have ample provisions for that to make its way to dev teams. This prevents new components from coming in without the capability unless someone explicitly approves it, as well as regressions in existing components.
It can take time and effort for someone to find all the exceptions to be added and allow the test to start failing on regressions/problems, but in the interim the tests are live, gathering data, and not causing mass failures/panic.
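The exception bookkeeping described above can be as simple as a table inside the test linking each known violation to its tracking jira. This Go sketch is illustrative only: the map layout, key format, and the jira ID shown are all assumptions, not anything that exists in origin today.

```go
package main

import "fmt"

// exceptions links each known violation, keyed as "namespace/check",
// to the jira that tracks it. An entry here causes the corresponding
// test to flake instead of fail. The jira ID below is purely
// illustrative, not a real issue.
var exceptions = map[string]string{
	"openshift-console/should define health checks": "OCPBUGS-0000",
}

// hasException reports whether a violation is covered by a tracked jira.
func hasException(namespace, check string) (string, bool) {
	jira, ok := exceptions[fmt.Sprintf("%s/%s", namespace, check)]
	return jira, ok
}

func main() {
	if jira, ok := hasException("openshift-console", "should define health checks"); ok {
		fmt.Println("flaking, tracked by", jira)
	}
}
```

New violations simply have no entry, so they surface as failures until someone files a jira and adds a row.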
| * Create test cases to collect HA policy information from running OpenShift clusters.
| * Define HA configs to define the type of HA feature to be handled
|   (redundancy and health check in the first proposal).
| * Define the data structure of input and output of "HA level check" process.
Covered by junit test results.
| an HA level check for an HA config.
| * Define the criteria that must be met to pass the HA level check for each
|   component and for each HA config.
| * Define the workflow of how to collect the responses from notified component owners.
Jira collects the responses from component owners.
| #### HA level check
|
| HA level check uses these types of input information to judge whether each
| component properly covers HA configs or not, then the result is output
This storage specifically is a concern, we need this to fit existing processes, and introducing new storage mechanisms and formats is probably beyond what we can undertake and fit into our existing org workflows. The good news however is that with the above we can get you up and running and working towards these goals much more quickly.
| one of the three values: pass, fail, and skip. Each config has its own
| HA implementation status info and component specific info.
|
| This flowchart is essential for HA policy management, so detailed explanations
This would be replaced by the states in the test:
- pass
- flake (because the component is permanently whitelisted)
- flake (because a pending jira is awaiting a response)
- fail (no exception/whitelist entry exists, and the violation appears unapproved)
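The state mapping above is mechanical enough to express directly in the test. A minimal Go sketch, with hypothetical names (`ExceptionKind`, `outcome`) chosen for illustration:

```go
package main

import "fmt"

// ExceptionKind classifies why a violation might be excused,
// mirroring the states listed above (names are hypothetical).
type ExceptionKind int

const (
	NoException        ExceptionKind = iota // nothing covers this violation
	PendingJira                             // an open jira is awaiting a response
	PermanentWhitelist                      // approved, with an explanatory comment
)

// outcome maps "was a violation found?" plus its exception status to
// the junit state the test should report.
func outcome(violation bool, exc ExceptionKind) string {
	if !violation {
		return "pass"
	}
	if exc == PendingJira || exc == PermanentWhitelist {
		return "flake"
	}
	return "fail"
}

func main() {
	fmt.Println(outcome(true, NoException)) // an unapproved violation fails the job
}
```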
| #### How component owners respond?
|
| A component owner whose component failed the HA Level Check will receive a
| notification containing the following data (details are omitted for brevity):
Hoping to avoid any new notification mechanisms, the above outlines how we notify component owners when they have a problem that needs addressing.
| Risk: Development teams bear the burden of responding to notifications
| in a timely manner to prioritize and plan the development of HA features.
| Mitigation: The management process will only issue warnings without
| blocking the actual release process.
Agreed, we can accommodate this while the test is in flake mode only, and cannot fail. In future, once all exceptions look covered, we can make the test official and let it fail for anything new.
While an exception is granted with an open jira, the test will flake and not fail.
Periodic monitoring or automation is required to check the list of exception jiras to see if they were closed, and to take appropriate action (either reopen in disagreement, or move the exception to the permanent whitelist). At this point I would recommend a claude command helper in the origin repo to help maintain this aspect.
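The periodic sweep could be as small as a function mapping each exception's jira status to a suggested action. This is a hedged sketch: the status strings and `nextAction` helper are invented for illustration; a real sweep would read statuses from the Jira API.

```go
package main

import "fmt"

// nextAction suggests what to do with an exception once the status of
// its tracking jira is known. Status strings here are illustrative,
// not real Jira workflow states.
func nextAction(jiraStatus string) string {
	switch jiraStatus {
	case "open":
		return "keep exception; test keeps flaking"
	case "closed-wontfix":
		return "move to permanent whitelist with an explanatory comment, or reopen in disagreement"
	case "closed-done":
		return "drop the exception; new violations will fail"
	default:
		return "review manually"
	}
}

func main() {
	for _, s := range []string{"open", "closed-done", "closed-wontfix"} {
		fmt.Printf("%s -> %s\n", s, nextAction(s))
	}
}
```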
This enhancement improves guideline compliance checks within the CI process (the Red Hat-internal pipeline for OpenShift) to improve overall HA. Specifically, it integrates a mechanism to evaluate HA levels based on implementation status and developers' input. By notifying developers of non-compliant components, the management process encourages developers to follow the guidelines. All data will be stored in a common repository, allowing both developers and partners to grasp the overall HA status early and easily.