293 changes: 293 additions & 0 deletions dev-guide/feature-gate.md
---
title: FeatureGates, round 2
authors:
- @deads2k
reviewers:
- team-leads # so they know how featuregates are changing. The old way will work, but it will be unnecessarily hard.
approvers:
- @jspeed # api-approvers are picking up some code maintenance and we have to merge FeatureGate changes
- staff-eng # so when asked by new team leads, they have some sense of how to do this.
api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
- @jspeed
creation-date: 2023-03-29
last-updated: 2023-03-29
---

# FeatureGates, round 2

## Summary

This enhancement aims to reduce the effort required to add a feature gate to TechPreviewNoUpgrade and to promote
that feature gate to Default.
Feature gates in OpenShift are enabled and disabled as part of a particular FeatureSet; mixing and matching is not allowed.
Prior to this enhancement, it is necessary to vendor openshift/api into every impacted repository.
After this enhancement, it will only be necessary to vendor into cluster-config-operator.

## Motivation

Reduce the effort required to add and promote feature gates.

### User Stories

As a staff engineer or release manager, I want features of unproven reliability to be inaccessible by default in
upgradeable clusters, without requiring any developer action to keep them off.

> **Review comment (Contributor):** I still don't understand this sentence. Is there some way to say it without the jargon or hyphenated words?
>
> **Review comment (Member):** @deads2k Maybe this should've been "accessible-by-default"? You want all features to go through a tech-preview phase where they're expected to show that their tests pass at the historic product-wide baseline before they're available without the TechPreviewNoUpgrade featureset.


As a developer, I want to have a low friction way to add a feature gate to TechPreviewNoUpgrade and to promote a feature
gate to on-by-default.

### Goals

1. have a low-friction way to add a feature gate to TechPreviewNoUpgrade
2. have a low-friction way to promote a feature gate to on-by-default
3. allow feature gates to move from TechPreviewNoUpgrade to Default without first spending a full release in TechPreviewNoUpgrade

### Non-Goals

1. Change thresholds for requiring TechPreviewNoUpgrade.
This is a possible future goal, but an update here would be required.
2. Change thresholds for promoting from TechPreviewNoUpgrade to Default.
   This is a possible future goal, but an update here would be required.

> **Review comment (Contributor):** do we feel we have a well defined existing threshold? if not, this non-goal seems to conflict with the user story above of "I have sufficient evidence to promote this to on by default" (I don't know how we meet that user story without defining those thresholds, so if they aren't well defined then by implication they need to be changed).
>
> **Review comment (Contributor, author):** It's been informal so far. I'm willing to describe it in more detail, but to my knowledge the informal, existing mechanism has been working well and I wasn't inclined to change it.

## Proposal
### Phase 1
For developers using this, the first phase will look like this:
1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.

> **Review comment (Contributor):** Is it worth noting that nagging API approvers should happen in #forum-api-review to balance the load, rather than going directly to <insert your favourite API approver here> in a DM? Should we have a slack alias to use to allow the group to be summoned to a particular thread?
>
> **Review comment (Contributor):** We have precedent elsewhere for delegating decisions like this to team leads so we can maintain velocity and team autonomy. Let's do that here, and put appropriate training in place, so we don't have everyone queued up for 2 people to review important changes.
>
> **Review comment (Contributor):** We talked about this in the staff engineer call today and agreed to move ahead with the technical aspects of this to unblock some other work, and to have a deeper conversation about delegation of the process portion in the future.
>
> **Review comment (Contributor):** Given these comments are a few months old, we've now had an opportunity to see how people are holding the feature gate tooling and how they expect it to work. There have been occasions where I noticed teams holding the feature gate tooling incorrectly and caught them early enough to stop them either breaking payloads or shooting themselves in the foot. We've been updating the FAQs along the way, but some hadn't even read those. I think we need to either do a very good job of socialising the expectations and how to hold the tools, or have a limited number of knowledgeable (staff?) engineers who can be the SMEs for the tooling and help others in their pursuit of feature-gated features. It still seems, from what I've seen, that more often than not the PRs to promote features to stable do not come with proof by default, and that pushing back is required to ask for proof that promoting certain features isn't going to break the payload. An education point or documentation point, maybe?


3. Vendor the change into openshift/cluster-config-operator.
4. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.

Notice that this flow eliminates vendoring into multiple repositories and makes it possible to test a feature gate change
with a single PR, so /payload testing functions properly.
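For illustration, the shape of the step-1 change, adding a gate name to the TechPreviewNoUpgrade set, can be sketched as below. The types, map layout, and gate name (`MyNewGate`) are simplified stand-ins for illustration only, not the actual openshift/api structures.

```go
package main

import "fmt"

// FeatureSet and FeatureGateName are simplified stand-ins for the
// openshift/api types; "MyNewGate" is a hypothetical gate name.
type FeatureSet string
type FeatureGateName string

const (
	Default              FeatureSet = "Default"
	TechPreviewNoUpgrade FeatureSet = "TechPreviewNoUpgrade"
)

// featureSetGates maps each FeatureSet to the gates it enables.
// Adding a gate to TechPreviewNoUpgrade is a one-line change here;
// promotion later moves the same name into the Default list.
var featureSetGates = map[FeatureSet][]FeatureGateName{
	Default:              {},
	TechPreviewNoUpgrade: {"MyNewGate"}, // the new, unproven feature
}

// enabled reports whether a gate is on in the given FeatureSet.
func enabled(set FeatureSet, gate FeatureGateName) bool {
	for _, g := range featureSetGates[set] {
		if g == gate {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(enabled(TechPreviewNoUpgrade, "MyNewGate")) // true
	fmt.Println(enabled(Default, "MyNewGate"))              // false
}
```

Because the sets are fixed wholesale, there is no per-cluster mixing and matching: a cluster selects one FeatureSet and gets exactly that list.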

### Phase 2
1. Open PR to openshift/api to add your feature gate to [TechPreviewNoUpgrade](https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L117).
The PR should be confined to just the feature gate change and should include a link to a merged enhancement.
2. Nag api-approvers with link to your PR right away and then every 24h or so until they merge it.
3. Automation vendors openshift/api into cluster-config-operator and opens a PR in a few hours.

> **Review comment (Contributor):** We haven't done this bit yet, is anyone prioritising it?

If policy allows it, machine /lgtm-ing is also desired.

Notice that we're down to a single PR to add or promote a featuregate.

### Mechanics of how it works
The reason that determining which feature gates are enabled for which feature sets is currently handled by vendoring
code, as opposed to API status, is that when feature gates are promoted between releases, vN-1 will still have a
pre-GA version.
So if vN writes status indicating that a feature gate should be on, vN-1 must not honor it.
At the time this was built, we had so few feature gates (see the last 12 releases) that this didn't matter.
Since then, we have added TechPreviewNoUpgrade CI jobs, stabilized them, and developed means of inspecting the
reliability of tests and jobs, so there is now a benefit to using these capabilities to test our code before it
impacts real clusters.


To overcome the original limitation, we will need to key the enabled/disabled feature gates with the payload version
that they are for.
This will ensure that older versions do not enable feature gates that were not GA until the later version.
To do this we will
1. Update FeatureGateStatus to contain a list of enabled and disabled feature gates for up to every version listed
in the CVO history.

> **Review comment (Member):** ClusterVersion status.history is not exhaustive: openshift/cluster-version-operator#791, openshift/cluster-version-operator#805, #1153. It's possible that the fraction of history maintained by #1153 is sufficient for this enhancement. And it's possible that implicit in this enhancement is that status.featureGates will be pruned to keep up with ClusterVersion status.history pruning. I'm not all that clear on why you'd want such deep history in featureGates, though. Wouldn't the most recent Completed version, and any subsequent Partial versions, be sufficient? Or possibly the useful version horizon extends back to the previous minor version, to accommodate possible eventual support for rollbacks within a z-stream? But once you've Completed a 4.y.z, I don't understand why any 4.(y-1) or older feature gate information would matter.

2. Update cluster-config-operator to have a control loop owned by api-approvers to set the FeatureGateStatus

> **Review comment (Contributor):** Again, I'm not really sure why this would be owned by api-approvers. I see API review/approval as a distinct need/skillset/requirement that is unrelated to "definition of the maturity of a feature". Is it just the lack of another obvious team/set of people to own this code?
>
> **Review comment (Contributor, author):** Yes. And avoiding the dumping ground of unowned cruft that was developing six months back or so. It's a small quantity of code to bootstrap a configuration API, owned by the people approving those configuration APIs.

3. Update library-go to make it easy to
   1. wait for FeatureGates to be read from the cluster for this version
   2. make it easy to function with a development build on a patched cluster
   3. make it easy to read the current state of FeatureGates
   4. by default, exit when feature gates change (most processes don't react cleanly to changes at runtime)

> **Review comment (Contributor):** This is going to result in extra, stampeding restarts (of processes, not nodes and not pods, I realize) on upgrades, right? I know we have a semi-common pattern of self-terminating processes to pick up changes, but this seems particularly widespread and likely to happen in close temporal proximity. Maybe not a concern, but it felt worthy of note/discussion. Wondering if there are optimizations to avoid this: e.g. there's no reason for the process to restart if the only change in the feature gates is to versions it is ignoring anyway. So if we can update the feature gates (add the new version entry) before we upgrade the various process images, then when they restart due to the image change, they'll come up, see the latest feature gate values, and we can avoid any extra restarts.
>
> **Review comment (Contributor, author):** The same section describes how it could be handled more gracefully, but the default works. Going from default to upgradeable resulting in a mass blip of operators isn't disruptive to the cluster as a whole. Remember, this is the operator, not the operand.
>
> **Review comment (Contributor):** The config operator today is run level 30, which for context is the same as MAPI and comes after most of the control plane components. If we were to promote the run level, then the new-version feature gate information should be available early enough in the upgrade to avoid the duplicate restarts of all the operators. I agree the default works, so maybe this is something we want to trial and then optimise later? On "this is the operator, not the operand": what are the odds that various components bake this directly into the operand? I could see it in Machine API, for example; if there's an AWS-specific feature, it may be more obvious to put the feature gate processing into the AWS implementation rather than using a flag controlled by the operator. Maybe we want to clarify in the documentation for this process that operands should use feature flags controlled by operators, to avoid the exits/need to gracefully handle changes.


   5. make it easy to wire a hook to handle FeatureGate updates if exit isn't desired
4. Update existing operators to use the library-go implementation to consume feature gates in this way
5. Update the installer to pass full FeatureGate manifests to the five-ish operators that participate in rendering.
This is needed to ensure that when cluster-config-operator is updated, all the bootstrap processes will have the same
feature gates enabled.
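The version-keyed lookup implied by step 1 might look roughly like the sketch below. The types are abbreviated from the API sketch in this proposal and the helper name is hypothetical; this is an illustration, not the library-go implementation.

```go
package main

import "fmt"

// Abbreviated forms of the proposed API types.
type FeatureGateAttributes struct{ Name string }

type FeatureGateDetails struct {
	Version  string
	Enabled  []FeatureGateAttributes
	Disabled []FeatureGateAttributes
}

// gatesForVersion returns the enabled/disabled lists for the payload
// version this operator manages, or an error so the caller can keep
// waiting: the entry may not exist yet during an upgrade, and an older
// operator must never act on a newer version's lists.
func gatesForVersion(status []FeatureGateDetails, version string) (*FeatureGateDetails, error) {
	for i := range status {
		if status[i].Version == version {
			return &status[i], nil
		}
	}
	return nil, fmt.Errorf("no feature gates published for version %q yet", version)
}

func main() {
	status := []FeatureGateDetails{
		{Version: "4.13.0", Enabled: []FeatureGateAttributes{{Name: "MyNewGate"}}},
		{Version: "4.12.5", Disabled: []FeatureGateAttributes{{Name: "MyNewGate"}}},
	}
	details, err := gatesForVersion(status, "4.13.0")
	if err != nil {
		panic(err)
	}
	fmt.Println(details.Enabled[0].Name) // MyNewGate
}
```

The key property is that each version only ever reads its own entry, which is how a pre-GA vN-1 avoids honoring gates that only went GA in vN.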

#### openshift/api
The type change will look roughly like this; minor adjustments are expected during API review.
cluster-config-operator will be updated to maintain this status.
```go
type FeatureGateStatus struct {
	// conditions represent the observations of the current state.
	// Known .status.conditions.type are: "DeterminationDegraded"
	// +listType=map
	// +listMapKey=type
	Conditions []metav1.Condition `json:"conditions,omitempty"`

	// featureGates contains a list of enabled and disabled featureGates that are keyed by payloadVersion.
	// Operators other than the CVO and cluster-config-operator must read .status.featureGates, locate
	// the version they are managing, find the enabled/disabled featuregates, and make the operand and operator match.
	// The enabled/disabled values for a particular version may change during the life of the cluster as various
	// .spec.featureSet values are selected.
	// Operators may choose to restart their processes to pick up these changes, but remembering past enable/disable
	// lists is beyond the scope of this API and is the responsibility of individual operators.
	// Only featureGates with .version in the ClusterVersion.status will be present in this list.
	// +listType=map
	// +listMapKey=version
	FeatureGates []FeatureGateDetails `json:"featureGates"`
}

type FeatureGateDetails struct {
	// version matches the version provided by the ClusterVersion and in the ClusterOperator.Status.Versions field.
	// +kubebuilder:validation:Required
	// +required
	// Review comment (Contributor): Which tools actually still use `+required`? I thought we had a
	// preference to just use +kubebuilder:validation:Required?
	Version string `json:"version"`
	// enabled is a list of all feature gates that are enabled in the cluster for the named version
	// +optional
	Enabled []FeatureGateAttributes `json:"enabled"`
	// disabled is a list of all feature gates that are disabled in the cluster for the named version
	// +optional
	Disabled []FeatureGateAttributes `json:"disabled"`
}

type FeatureGateAttributes struct {
	// name is the name of the FeatureGate
	// +kubebuilder:validation:Pattern=`^([A-Za-z0-9-]+\.)*[A-Za-z0-9-]+\.?$`
	// Review comment (Contributor): Do we have any validation at the point that feature gates are added
	// to the API that matches this pattern? I don't see it in the featuregates API, but I know there's an
	// admission plugin that may be doing the validation.
	// +kubebuilder:validation:Required
	// +required
	Name string `json:"name"`

	// possible (probable?) future additions include
	// 1. support level (Stable, ServiceDeliveryOnly, TechPreview, DevPreview)
	// 2. description
}
```
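For concreteness, a populated status under this schema might render like the following. The versions and gate names are illustrative, not real gates.

```yaml
status:
  featureGates:
  - version: 4.13.0
    enabled:
    - name: MyNewGate
    disabled:
    - name: OlderTechPreviewGate
  - version: 4.12.5
    disabled:
    - name: MyNewGate
    - name: OlderTechPreviewGate
```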

#### openshift/library-go
The existing config observer will be updated to be a drop-in replacement (this covers a dozen or so operators).
Operators that don't use that pattern can use something like this:

```go
featureGateAccessor := NewFeatureGateAccess(args)
go featureGateAccessor.Run(ctx)

// wait for feature gates to be observed
select {
case <-featureGateAccessor.InitialFeatureGatesObserved():
	enabled, disabled, _ := featureGateAccessor.CurrentFeatureGates()
	klog.Infof("FeatureGates initialized: enabled=%v disabled=%v", enabled, disabled)
case <-time.After(1 * time.Minute):
	klog.Errorf("timed out waiting for FeatureGate detection")
	return fmt.Errorf("timed out waiting for FeatureGate detection")
}

// continue from here.
// also open to changes to wire up to k/k FeatureGates if there is demand.
```

library-go also has code to allow a "default" payload version (one that hasn't been substituted) to mean "give me the
feature gates for the latest CVO version".
This will support the development-patch use-cases.
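The exit-on-change default described above amounts to comparing the gate set observed at startup with the current one and terminating when they differ. The sketch below, with a hypothetical `gatesChanged` helper, illustrates the intent; it is not the library-go code.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// gatesChanged reports whether the observed enabled-gate list differs
// from the one the process started with. Order is irrelevant: only a
// change in the set itself should trigger a restart.
func gatesChanged(initial, current []string) bool {
	a := append([]string(nil), initial...)
	b := append([]string(nil), current...)
	sort.Strings(a)
	sort.Strings(b)
	return !reflect.DeepEqual(a, b)
}

func main() {
	initial := []string{"GateA", "GateB"}

	// Same set in a different order: no restart needed.
	fmt.Println(gatesChanged(initial, []string{"GateB", "GateA"})) // false

	// A genuinely new gate appeared: the default behavior is to exit so
	// the process restarts with a consistent view of the gates.
	if gatesChanged(initial, []string{"GateA", "GateB", "GateC"}) {
		fmt.Println("feature gates changed; a real operator would os.Exit(0) here")
	}
}
```

Exiting rather than reconfiguring in place avoids every component needing runtime-reload logic; the hook in sub-item 5 exists for the operators that do want it.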

### Workflow Description

#### Variation [optional]

### API Extensions

No admission webhooks are needed.
If complicated validation is required, there is already a FeatureGate admission plugin hardcoded in our openshift/kubernetes
fork and we'll use that.

### Implementation Details/Notes/Constraints [optional]

### Risks and Mitigations

### Drawbacks

## Design Details

### Open Questions [optional]

### Test Plan

Unit tests are already added to implementation PRs.
Clusters will not install without the controller being functional, and we don't support changing feature gates back
from TechPreviewNoUpgrade, so e2e tests are impractical.

### Graduation Criteria

#### Dev Preview -> Tech Preview

#### Tech Preview -> GA

Since this is actually controlling the gates, it is not practical to pre-test it.
> **Review comment (Contributor):** I am interested in how the MCO, and conversely the Kubelet, would digest the configured FeatureGates at bootstrap. MCO reference for feature gates: openshift/machine-config-operator#2668

Once unit tests pass and clusters successfully install equivalently using this mechanism, PRs will be opened against
several initial operators, including
1. etcd
2. kube-apiserver
3. kube-controller-manager
4. kube-scheduler
5. openshift-apiserver
6. openshift-controller-manager
7. authentication
8. ingress
9. cloud-credential-operator
10. networking
11. storage

There is incentive to use this mechanism when introducing TechPreviewNoUpgrade features because consuming feature gates
this way is less work.

We're done when all operators using feature gates are reading from FeatureGate.Status instead of vendoring the list.

#### Removing a deprecated feature

- Announce deprecation and support policy of the existing feature
- Deprecate the feature

### Upgrade / Downgrade Strategy

On upgrade to the first level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will
be available.

> **Review comment (Contributor):** Is this true? I just double checked and the CCO deployment manifest is at run level 30, so I wouldn't expect it to be upgraded until after all of the core control plane components.

On downgrade to the last level without this change, the components have hardcoded feature gates, so they don't need
FeatureGateStatus.

On upgrade to another level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will
be available.

On downgrade to another level with this change, the cluster-config-operator goes first, so the FeatureGateStatus will
be available.

### Version Skew Strategy

Every version known to the CVO will have FeatureGateStatus left in the FeatureGate resource, so all levels installed on a cluster
will have access to the feature gates they need for their version.
During version changes (upgrade and downgrade), the cluster-config-operator goes first, so feature gates will be set
for whatever target level we're after.

### Operational Aspects of API Extensions

No extensions.

#### Failure Modes

1. cluster-config-operator doesn't update the FeatureGate.
In this case, the operator version never reaches the desired level, so the CVO doesn't progress to the next component.
The cluster runs skewed (see above for how) and the cluster-config-operator is debugged.

#### Support Procedures

The cluster-config-operator will use ClusterOperatorStatus to communicate failures.

## Implementation History

Major milestones in the life cycle of a proposal should be tracked in `Implementation
History`.

## Alternatives

None identified.

## Infrastructure Needed [optional]

Phase two automation requires a bot similar to the TRT bot to automatically open PRs.

## Appendix

### Why do we not mix and match feature gates?
1. Having a single set vastly reduces the testing matrix.
This allows us to maintain stability of our TechPreview features and has been largely successful.
2. You can't mix and match schemas or other manifests.
This applies to all manifests, but schemas make it obvious.
If gate/A requires new-field/one in type/first and gate/B requires new-field/two in type/first, then you logically need
manifests for no gates, gate/A only, gate/B only, and gate/A plus gate/B.
This is conceptually the case for every resource, leading to a significant increase in manifests (more likely, a new
approach to manifest creation).
3. A single set makes it impossible to turn on a feature gate that you think is GA, but is actually TechPreview.
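The manifest growth in item 2 is exponential: n independent gates that each alter a schema imply 2^n manifest variants. The toy enumeration below (gate names hypothetical) makes that concrete.

```go
package main

import "fmt"

// combinations enumerates every on/off assignment of the given gates,
// i.e. the manifest variants you would need if mixing were allowed.
func combinations(gates []string) [][]string {
	result := [][]string{{}}
	for _, g := range gates {
		var next [][]string
		for _, c := range result {
			next = append(next, c) // gate off: keep the combination as-is
			withGate := append(append([]string{}, c...), g)
			next = append(next, withGate) // gate on: copy and extend
		}
		result = next
	}
	return result
}

func main() {
	combos := combinations([]string{"gate/A", "gate/B"})
	fmt.Println(len(combos)) // 4: none, A, B, A+B
	for _, c := range combos {
		fmt.Println(c)
	}
}
```

With fixed FeatureSets, only the combinations actually shipped (Default, TechPreviewNoUpgrade, and the few others) need manifests and testing.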