Conversation
dgoodwin
left a comment
Sorry for the delay here. I'm proposing a pretty dramatic alteration of the plan here, but one that will make this much simpler, quicker, and easy to get off the ground. In my mind you can start on implementation of the monitortest described below in origin as soon as you like. We can assist to help understand how to gather lots of data while the PR is open without having to merge it. I'd be quite interested to see what the testing turns up.
| Currently, HA implementation is often left to developers’ discretion,
| leading to inconsistent or insufficient HA configurations.
| Although general guidelines exist ([CONVENTIONS.md](https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#high-availability)).
Is it fair to say this is the set of conventions you want to enforce with this framework? Are there additional items you would like added? If so I would suggest a PR to that linked enhancement. It helps to have agreed conventions before we start enforcing.
| ### Non-Goals
|
| * Strict enforcement of guidelines that block product releases is out of scope.
This will be easy to do when we're ready with the approaches I will spell out below.
| * As an OpenShift Product Manager, I want a clear overview of HA
|   implementation status across components, so I can identify issues
|   from overall HA quality earlier.
Can you define the list of statuses you envision? Is it just compliant and non-compliant? What other levels/statuses do you envision?
| * This proposal targets only all core and infrastructure-related components,
|   and the other components are out of scope.
|
| ## Proposal
I would propose radical simplification of the proposal. I believe we can meet your goals with well established precedents, without the need for lots of new processes and systems, and you can get up and running quite quickly.
We've done this sort of thing many times; the process is as follows:
Establish the Tests
Typically these are implemented as monitortests, they will run at the end of most of our hundreds of CI jobs. The monitortests generate junit test results per openshift component namespace, and per HA check you'd like to implement.
Typically these kinds of tests encode exceptions linked to jiras. So you write the tests, do some preliminary testing in the PR (we can help), see what violations it finds, then write a Jira for each. (more below)
I suggest having the tests only flake when they find a problem for now, so that merging the PR does not cause mass failures. Once all the problems are identified, with bugs filed and exceptions added, the test can be moved to a state where it's allowed to fail.
In this case I envision:
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should define health checks
[Monitor:ha-compliance][Jira:"console"] pods in ns/openshift-console should have sufficient replicas for HA
etc.
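To make the naming convention concrete, here is a minimal Go sketch of how such a monitortest might emit its junit results. The `JUnitResult` type, `haTestName`, and `flakeResults` are hypothetical stand-ins, not the real types in origin; the one real convention it leans on is that emitting both a failure and a success for the same test name is aggregated as a flake rather than a hard failure.

```go
package main

import "fmt"

// JUnitResult is a minimal stand-in for a junit test case result;
// the real monitortest framework in origin has its own richer types.
type JUnitResult struct {
	Name   string
	Failed bool
}

// haTestName builds a test name following the convention sketched above:
// one test per component namespace, per HA check.
func haTestName(component, namespace, check string) string {
	return fmt.Sprintf("[Monitor:ha-compliance][Jira:%q] pods in ns/%s %s", component, namespace, check)
}

// flakeResults emits both a failure and a success for the same test
// name; in junit aggregation that combination is treated as a flake
// rather than a hard failure, which gives the "flake only" mode.
func flakeResults(name string) []JUnitResult {
	return []JUnitResult{
		{Name: name, Failed: true},
		{Name: name, Failed: false},
	}
}

func main() {
	name := haTestName("console", "openshift-console", "should define health checks")
	for _, r := range flakeResults(name) {
		fmt.Printf("%s failed=%v\n", r.Name, r.Failed)
	}
}
```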
File Bugs for Violations
Sippy provides the dashboard of current state. Example for the monitortest linked above.
As problems are identified, someone will need to file bugs and add exceptions within the test. Typically we'll label the jiras with a specific label to help keep track. For any approved exception the test will usually permanently flake.
In the event a jira is closed as not applicable, or can't be fixed by engineering or PM, the exception should likely transition to a permanently approved whitelist entry with a comment explaining why, or a link to the jira that explains.
Once the test is stable in the wild, new violations will immediately start failing jobs, and we have ample provisions for that to make its way to dev teams. This prevents new components from coming in without the capability unless someone explicitly approves it, as well as regressions in existing components.
It can take time and effort for someone to find all the exceptions to be added and allow the test to start failing on regressions/problems, but in the interim the tests are live, gathering data, and not causing mass failures/panic.
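The exception bookkeeping described above can be as simple as a table inside the test linking each known violation to its tracking jira. This Go sketch is illustrative only: the map layout, key format, and the jira ID shown are all assumptions, not anything that exists in origin today.

```go
package main

import "fmt"

// exceptions links each known violation, keyed as "namespace/check",
// to the jira that tracks it. An entry here causes the corresponding
// test to flake instead of fail. The jira ID below is purely
// illustrative, not a real issue.
var exceptions = map[string]string{
	"openshift-console/should define health checks": "OCPBUGS-0000",
}

// hasException reports whether a violation is covered by a tracked jira.
func hasException(namespace, check string) (string, bool) {
	jira, ok := exceptions[fmt.Sprintf("%s/%s", namespace, check)]
	return jira, ok
}

func main() {
	if jira, ok := hasException("openshift-console", "should define health checks"); ok {
		fmt.Println("flaking, tracked by", jira)
	}
}
```

New violations simply have no entry, so they surface as failures until someone files a jira and adds a row.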
| * Create test cases to collect HA policy information from running OpenShift clusters.
| * Define HA configs to define the type of HA feature to be handled
|   (redundancy and health check in the first proposal).
| * Define the data structure of input and output of "HA level check" process.
Covered by junit test results.
| an HA level check for an HA config.
| * Define the criteria that must be met to pass the HA level check for each
|   component and for each HA config.
| * Define the workflow of how to collect the responses from notified component owners.
Jira collects the responses from component owners.
| #### HA level check
|
| HA level check uses these types of input information to judge whether each
| component properly covers HA configs or not, then the result is output
This storage specifically is a concern, we need this to fit existing processes, and introducing new storage mechanisms and formats is probably beyond what we can undertake and fit into our existing org workflows. The good news however is that with the above we can get you up and running and working towards these goals much more quickly.
| one of the three values: pass, fail, and skip. Each config has its own
| HA implementation status info and component specific info.
|
| This flowchart is essential for HA policy management, so detailed explanations
This would be replaced by the states in the test:
- pass
- flake (because the component is permanently whitelisted)
- flake (because a pending jira is awaiting a response)
- fail (no exception/whitelist entry exists, and the violation appears unapproved)
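The state mapping above is mechanical enough to express directly in the test. A minimal Go sketch, with hypothetical names (`ExceptionKind`, `outcome`) chosen for illustration:

```go
package main

import "fmt"

// ExceptionKind classifies why a violation might be excused,
// mirroring the states listed above (names are hypothetical).
type ExceptionKind int

const (
	NoException        ExceptionKind = iota // nothing covers this violation
	PendingJira                             // an open jira is awaiting a response
	PermanentWhitelist                      // approved, with an explanatory comment
)

// outcome maps "was a violation found?" plus its exception status to
// the junit state the test should report.
func outcome(violation bool, exc ExceptionKind) string {
	if !violation {
		return "pass"
	}
	if exc == PendingJira || exc == PermanentWhitelist {
		return "flake"
	}
	return "fail"
}

func main() {
	fmt.Println(outcome(true, NoException)) // an unapproved violation fails the job
}
```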
| #### How component owners respond?
|
| A component owner whose component failed the HA Level Check will receive a
| notification containing the following data (details are omitted for brevity):
Hoping to avoid any new notification mechanisms, the above outlines how we notify component owners when they have a problem that needs addressing.
| Risk: Development teams bear the burden of responding to notifications
| in a timely manner to prioritize and plan the development of HA features.
| Mitigation: The management process will only issue warnings without
| blocking the actual release process.
Agreed, we can accommodate this while the test is in flake mode only, and cannot fail. In future, once all exceptions look covered, we can make the test official and let it fail for anything new.
While an exception is granted with an open jira, the test will flake and not fail.
Periodic monitoring or automation is required to check the list of exception jiras to see if they were closed, and to take appropriate action (either reopen in disagreement, or move the exception to the permanent whitelist). At this point I would recommend a claude command helper in the origin repo to help maintain this aspect.
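The periodic sweep could be as small as a function mapping each exception's jira status to a suggested action. This is a hedged sketch: the status strings and `nextAction` helper are invented for illustration; a real sweep would read statuses from the Jira API.

```go
package main

import "fmt"

// nextAction suggests what to do with an exception once the status of
// its tracking jira is known. Status strings here are illustrative,
// not real Jira workflow states.
func nextAction(jiraStatus string) string {
	switch jiraStatus {
	case "open":
		return "keep exception; test keeps flaking"
	case "closed-wontfix":
		return "move to permanent whitelist with an explanatory comment, or reopen in disagreement"
	case "closed-done":
		return "drop the exception; new violations will fail"
	default:
		return "review manually"
	}
}

func main() {
	for _, s := range []string{"open", "closed-done", "closed-wontfix"} {
		fmt.Printf("%s -> %s\n", s, nextAction(s))
	}
}
```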
This enhancement improves guideline compliance checks within the CI process (the Red Hat-internal pipeline for OpenShift) to improve overall HA. Specifically, it integrates a mechanism to evaluate HA levels based on implementation status and developers' input. By notifying developers of non-compliant components, the management process encourages developers to follow the guidelines. All data will be stored in a common repository, allowing both developers and partners to grasp the overall HA status early and easily.