-
Notifications
You must be signed in to change notification settings - Fork 555
new approach to featuregates for coordination in the cluster #1373
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,293 @@ | ||
| --- | ||
| title: FeatureGates, round 2 | ||
| authors: | ||
| - @deads2k | ||
| reviewers: | ||
| - team-leads # so they know how featuregates are changing. The old way will work, but it will be unnecessarily hard. | ||
| approvers: | ||
| - @jspeed # api-approvers are picking up some code maintenance and we have to merge FeatureGate changes | ||
| - staff-eng # so when asked by new team leads, they have some sense of how to do this. | ||
| api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None" | ||
| - @jspeed | ||
| creation-date: 2023-03-29 | ||
| last-updated: 2023-03-29 | ||
| --- | ||
|
|
||
| # FeatureGates, round 2 | ||
|
|
||
| ## Summary | ||
|
|
||
| This enhancement aims to reduce the effort required to add a feature gate to TechPreviewNoUpgrade and to promote | ||
| that feature gate to Default. | ||
| Feature gates in OpenShift are enabled and disabled in a particular FeatureSet, mixing and matching is not allowed. | ||
| Prior to this enhancement, it is necessary to vendor openshift/api into every impacted repository. | ||
| After this enhancement, it will only be necessary to vendor into cluster-config-operator. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Reduce the effort required to add and promote feature gates. | ||
|
|
||
| ### User Stories | ||
|
|
||
| As a staff engineer or release manager, I want to have no-developer-action result in features of unproven reliability | ||
| inaccessible-by-default in upgradeable clusters. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I still don't understand this sentence. Is there some way to say it without the jargon or hyphenated words?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @deads2k Maybe this should've been "accessible-by-default" ? You want all features to go through a tech-preview phase where they're expected to show that their tests pass at historic product wide baseline before they're available without TechPreviewNoUpgrade featureset. |
||
|
|
||
| As a developer, I want to have a low friction way to add a feature gate to TechPreviewNoUpgrade and to promote a feature | ||
| gate to on-by-default. | ||
|
|
||
| ### Goals | ||
|
|
||
| 1. have a low friction way to add a feature gate to TechPreviewNoUpgrade | ||
| 2. have a low friction to promote a feature gate to on-by-default | ||
| 3. allow feature gates to move from TechPreviewNoUpgrade to Default without releases as TechPreviewNoUpgrade first. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| 1. Change thresholds for requiring TechPreviewNoUpgrade. | ||
| This is a possible future goal, but an update here would be required. | ||
| 2. Change thresholds for promoting from TechPreviewNoUpgrade to Default. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. do we feel we have a well defined existing threshold? if not, this non-goal seems to conflict w/ the user story above of "i have sufficient evidence to promote this to on by default" (I don't know how we meet that user story w/o defining those thresholds, so if they aren't well defined then by implication they need to be changed)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's been informal so far. I'm willing to describe in more detail, but to my knowledge, the informal, existing mechanism has been working well and I wasn't inclined to change it. |
||
| This is a possible future goal, but an update here would be required. | ||
|
|
||
| ## Proposal | ||
| ### Phase 1 | ||
| For developers using this, the first phase will look like | ||
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it worth noting that nagging API approvers should be in Should we have a slack alias to use to allow the group to be summoned to a particular thread?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We talked about this in the staff engineer call today and agreed to move ahead with the technical aspects of this to unblock some other work, and have a deeper conversation about delegation of the process portion in the future.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Given these comments are a few months old, we've had an opportunity now to see how people are holding the feature gate tooling and how they're expecting them to work. There have been occasions where I have noticed teams incorrectly holding the feature gate tooling and have caught them early enough to stop themselves either breaking payloads or shooting themselves in the foot. We've been updating the FAQs along the way but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or, have a limited number of knowledgable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature gated features. It still seems, from what I've seen, that more often than not, the PRs to promote features to stable are not coming with proof by default, and that pushing back is required to ask for proof that the promotion of certain features isn't going to break payload. An education point or documentation point maybe? |
||
| 3. Vendor the change into openshift/cluster-config-operator. | ||
| 4. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. | ||
|
|
||
| Notice that this flow eliminates vendoring into multiple repositories and makes it possible to test a feature gate change | ||
| with a single PR, so /payload testing functions properly. | ||
|
|
||
| ### Phase 2 | ||
| 1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117). | ||
| The PR should be confined to just the feature gate change and should include a link to a merged enhancement. | ||
| 2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it. | ||
| 3. Automation vendors openshift/api into cluster-config-operator and opens a PR in a few hours. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We haven't done this bit yet, is anyone prioritising it? |
||
| If policy allows it, machine /lgtm-ing is also desired. | ||
|
|
||
| Notice that we're down to a single PR to add or promote a featuregate. | ||
|
|
||
| ### Mechanics of how it works | ||
| The reason that the current state of determining which feature gates are enabled for which feature sets is handled | ||
| by vendoring code as opposed to API status, is that when feature gates are promoted between releases the vN-1 will | ||
| have a pre-GA version. | ||
| So if vN writes status that indicates a feature gate should be on, the vN-1 must not honor that. | ||
| At the time, we had so few feature gates (see the last 12 releases), that this didn't matter, but since we've added | ||
| TechPreviewNoUpgrade CI jobs, stabilized them, and developed means of inspecting reliability of tests and jobs, there | ||
| is now a benefit of using these capabilities to test our code before impacting real clusters. | ||
|
|
||
|
|
||
| To overcome the original limitation, we will need to key the enabled/disabled feature gates with the payload version | ||
| that they are for. | ||
| This will ensure that older versions do not enable feature gates that were not GA until the later version. | ||
| To do this we will | ||
| 1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed | ||
| in the CVO history. | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ClusterVersion I'm not all that clear on why you'd want such deep history in |
||
| 2. Update cluster-config-operator to have a control loop owned by api-approvers to set the FeatureGateStatus | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. again not really sure why would be owned by api-approvers. I see api review/approval as a distinct need/skillset/requirement that is unrelated to "definition of the maturity of a feature" is it just the lack of another obvious team/set of people to own this code?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes. And avoiding the dumping ground of unowned cruft that was developing six months back or so. Small quantity of code to bootstrap a configuration API owned by the people approving those configuration APIs. |
||
| 3. Update library-go to make it easy to | ||
| 1. wait for FeatureGates to be read from the cluster for this version | ||
| 2. make it easy to function with a development build on a patched cluster | ||
| 3. make it easy to read the current state of FeatureGates | ||
| 4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is going to result in extra+stampeding restarts (of processes, not nodes and not pods, i realize) on upgrades, right? i know we have a semi-common pattern of self-terminating processes to pick up changes, but this seems particularly widespread and likely to happen in close temporal proximity. Maybe not a concern, but felt worthy of note/discussion. wondering if there are optimizations to avoid this. e.g. there's no reason for the process to restart if the only hange in the featuregates is to versions they are ignoring anyway so if we can update the featuregates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same section describes how it could be more gracefully. But the default works. Going from default to upgradable resulting a mass blip of operators isn't disruptive to the cluster as a whole. Remember, this is the operator, not the operand.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators. I agree the default works so maybe this is something we want to trial and then optimise later?
What are the odds that various components bake this directly into the operand? I could see it in machine API for example, if there's an AWS specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator. Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators to avoid the exits/need to gracefully handle changes |
||
| 5. make it easy to wire a hook to handle FeatureGate updates if exit isn't desired | ||
| 4. Update existing operators to use the library-go implementation to consume feature gates in this way | ||
| 5. Update the installer to pass full FeatureGate manifests to the five-ish operators that participate in rendering | ||
| This is needed to ensure that when cluster-config-operator is updated, all the bootstrap processes will have the same | ||
| feature gates enabled. | ||
|
|
||
| #### openshift/api | ||
| The type change will look about like this. | ||
| Minor fitting expected during API review. | ||
| cluster-config-operator will be updated to maintain this status. | ||
| ```go | ||
| type FeatureGateStatus struct { | ||
| // conditions represent the observations of the current state. | ||
| // Known .status.conditions.type are: "DeterminationDegraded" | ||
| // +listType=map | ||
| // +listMapKey=type | ||
| Conditions []metav1.Condition `json:"conditions,omitempty"` | ||
|
|
||
| // featureGates contains a list of enabled and disabled featureGates that are keyed by payloadVersion. | ||
| // Operators other than the CVO and cluster-config-operator, must read the .status.featureGates, locate | ||
| // the version they are managing, find the enabled/disabled featuregates and make the operand and operator match. | ||
| // The enabled/disabled values for a particular version may change during the life of the cluster as various | ||
| // .spec.featureSet values are selected. | ||
| // Operators may choose to restart their processes to pick up these changes, but remembering past enable/disable | ||
| // lists is beyond the scope of this API and is the responsibility of individual operators. | ||
| // Only featureGates with .version in the ClusterVersion.status will be present in this list. | ||
| // +listType=map | ||
| // +listMapKey=version | ||
| FeatureGates []FeatureGateDetails `json:"featureGates"` | ||
| } | ||
|
|
||
| type FeatureGateDetails struct { | ||
| // version matches the version provided by the ClusterVersion and in the ClusterOperator.Status.Versions field. | ||
| // +kubebuilder:validation:Required | ||
| // +required | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Which tools actually still use this? I thought we had a preference to just use |
||
| Version string `json:"version"` | ||
| // enabled is a list of all feature gates that are enabled in the cluster for the named version | ||
| // +optional | ||
| Enabled []FeatureGateAttributes `json:"enabled"` | ||
| // disabled is a list of all feature gates that are disabled in the cluster for the named version | ||
| // +optional | ||
| Disabled []FeatureGateAttributes `json:"disabled"` | ||
| } | ||
|
|
||
| type FeatureGateAttributes struct { | ||
| // name is the name of the FeatureGate | ||
| // +kubebuilder:validation:Pattern=`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$` | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we have any validation at the point that feature gates are added to the API that matches this? I don't see it in the featuregates API but I know there's an admission plugin for that which may be doing the validation |
||
| // +kubebuilder:validation:Required | ||
| // +required | ||
| Name string `json:"name"` | ||
|
|
||
| // possible (probable?) future additions include | ||
| // 1. support level (Stable, ServiceDeliveryOnly, TechPreview, DevPreview) | ||
| // 2. description | ||
| } | ||
| ``` | ||
|
|
||
| #### openshift/library-go | ||
| The existing config observer will be updated to be a drop-in replacement (this covers a dozen or so operators). | ||
| Operators that don't use that pattern can use one like this | ||
|
|
||
| ```go | ||
| featureGateAccessor := NewFeatureGateAccess(args) | ||
| go featureGateAccessor.Run(ctx) | ||
|
|
||
| // wait for feature gates to observed. | ||
| select{ | ||
| case <- featureGateAccessor.InitialFeatureGatesObserved(): | ||
| enabled, disabled, _ := featureGateAccessor.CurrentFeatureGates() | ||
| klog.Infof("FeatureGates initialized: enabled=%v disabled=%v", enabled, disabled) | ||
| case <- time.After(1*time.Minute): | ||
| klog.Errorf("timed out waiting for FeatureGate detection") | ||
| return fmt.Errorf("timed out waiting for FeatureGate detection") | ||
| } | ||
|
|
||
| // continue from here. | ||
| // also open to changes to wire up to k/k FeatureGates if there is demand. | ||
| ``` | ||
|
|
||
| library-go also has code allow a "default" payload version (one that hasn't been substituted) to mean, "give me the | ||
| feature gates for the latest CVO version". | ||
| This will support the development patch use-cases. | ||
|
|
||
| ### Workflow Description | ||
|
|
||
| #### Variation [optional] | ||
|
|
||
| ### API Extensions | ||
|
|
||
| No admission webhooks are needed. | ||
| If complicated validation is required, there is already a FeatureGate admission plugin hardcoded in our openshift/kubernetes | ||
| fork and we'll use that. | ||
|
|
||
| ### Implementation Details/Notes/Constraints [optional] | ||
|
|
||
| ### Risks and Mitigations | ||
|
|
||
| ### Drawbacks | ||
|
|
||
| ## Design Details | ||
|
|
||
| ### Open Questions [optional] | ||
|
|
||
| ### Test Plan | ||
|
|
||
| Unit tests are already added to implementation PRs. | ||
| Clusters will not install without the controller functional and we don't support changing feature gates back from | ||
| TechPreviewNoUpgrade, so e2e tests are impractical. | ||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| #### Dev Preview -> Tech Preview | ||
|
|
||
| #### Tech Preview -> GA | ||
|
|
||
| Since this is actually controlling the gates, it is not practical to pre-test it. | ||
|
bparees marked this conversation as resolved.
|
||
| Once unit tests pass and clusters successfully install equivalently using this mechanism, PRs will be opened against | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am interested how the MCO and conversely the Kubelet would digest the configured FeatureGates at bootstrap. MCO Reference for Features Gates: openshift/machine-config-operator#2668 |
||
| several initial operators, including | ||
| 1. etcd | ||
| 2. kube-apiserver | ||
| 3. kube-controller-manager | ||
| 4. kube-scheduler | ||
| 5. openshift-apiserver | ||
| 6. openshift-controller-manager | ||
| 7. authentication | ||
| 8. ingress | ||
| 9. cloud-credential-operator | ||
| 10. networking | ||
| 11. storage | ||
|
|
||
| There is incentive to use this mechanism when introducing TechPreviewNoUpgrade features because consuming feature gates | ||
| this way is less work. | ||
|
|
||
| We're done when all operators using feature gates are reading from FeatureGate.Status instead of vendoring the list. | ||
|
|
||
| #### Removing a deprecated feature | ||
|
|
||
| - Announce deprecation and support policy of the existing feature | ||
| - Deprecate the feature | ||
|
|
||
| ### Upgrade / Downgrade Strategy | ||
|
|
||
| On upgrade to the first level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this true? I just double checked and the CCO deployment manifest is at run level 30, so I wouldn't expect it to be upgraded until after all of the core control plane components |
||
| be available. | ||
|
|
||
| On downgrade to the last level without this change, the components have hardcoded feature gates, so they don't need | ||
| FeatureGateStatus. | ||
|
|
||
| On upgrade to another level with this change, the cluster cluster-config-operator goes first, so the FeatureGateStatus will | ||
| be available. | ||
|
|
||
| On downgrade to another level with this change, the cluster cluster-config-operator goes first, so the FeatureGateStatus will | ||
| be available. | ||
|
|
||
| ### Version Skew Strategy | ||
|
|
||
| Every version known to CVO will have FeatureGateStatus left in the FeatureGate resource, so all levels installed on a cluster | ||
| will have access to the feature gates they need for their version. | ||
| During version changes (upgrade and downgrade), the cluster-config-operator goes first, so feature gates will be set | ||
| for whatever target level we're after. | ||
|
|
||
| ### Operational Aspects of API Extensions | ||
|
|
||
| No extensions. | ||
|
|
||
| #### Failure Modes | ||
|
|
||
| 1. cluster-config-operator doesn't update the FeatureGate. | ||
| In this case, the operator version never reaches the desired level, so the CVO doesn't progress to the next component. | ||
| The cluster runs skewed (see above for how) and the cluster-config-operator is debugged. | ||
|
|
||
| #### Support Procedures | ||
|
|
||
| The cluster-config-operator will use ClusterOperatorStatus to communicate failures. | ||
|
|
||
| ## Implementation History | ||
|
|
||
| Major milestones in the life cycle of a proposal should be tracked in `Implementation | ||
| History`. | ||
|
|
||
| ## Alternatives | ||
|
|
||
| None identified. | ||
|
|
||
| ## Infrastructure Needed [optional] | ||
|
|
||
| Phase two automation requires a bot similar to the TRT bot to automatically open PRs. | ||
|
|
||
| ## Appendix | ||
|
|
||
| ### Why do we not mix and match feature gates? | ||
| 1. Having a single set vastly reduces the testing matrix. | ||
| This allows us to maintain stability of our TechPreview features and has been largely successful. | ||
| 2. You can't mix and match schemas or other manifests. | ||
| This applies to all manifests, but schemas make it obvious. | ||
| If gate/A requires new-field/one in type/first and gate/B requires new-field/two in type/first, then you logically need | ||
| need manifests for no gates, gate/A only, gate/B only, and gate/A and gate/B. | ||
| This is conceptually the case for resource, leading to a significant increase in the manifests (more likely a new approach | ||
| to manifest creation). | ||
| 3. Makes it impossible to turn on a feature gate that you think is GA, but is actually TechPreview. | ||
Uh oh!
There was an error while loading. Please reload this page.