OPRUN-4099: OLMv1 Deployment Configuration API #1915
oceanc80 wants to merge 7 commits into openshift:master
Conversation
|
@oceanc80: This pull request references OPRUN-4099, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
| - As a cluster extension admin, I want to attach custom storage volumes to operator pods, so that I can provide persistent storage or configuration files to operators. | ||
| - As a cluster extension admin, I want to configure pod affinity rules for operator deployments, so that I can control how operator pods are distributed across cluster nodes. | ||
| - As a cluster extension admin, I want to add custom annotations to operator deployments, so that I can integrate with monitoring and observability tools. | ||
|
|
|
I wonder what the story for the Selector is. I wonder if its to handle changes in the pod label selector in the operator's controller deployment between versions (the label selector in the deployment spec is immutable). This configuration could provide upgrade continuity across this type of breaking change.
|
I could also see it being used for blue/green deployments or other similar deployment strategies.
|
The selector of a deployment is immutable, iirc. I dove into the history of this field, and it looks like it was basically there from the beginning with no real explanation that I could find, and it has never been honored as far as I can tell.
Chalk it up to how fast and loose the early days of OLM were.
|
@oceanc80 I know the PR is in WIP, and I'm also not sure whether the following is in scope for this EP. If it is out of scope, or now is not the right time to raise it, feel free to ignore it. The deploymentConfig API design looks well thought out. I noticed that the proposal currently focuses on initial installation scenarios, and I was wondering if you could clarify the behavior for runtime configuration updates, which I expect will be a common operational workflow. Question 1: Modifying Existing deploymentConfig Values. Scenario: After creating a ClusterExtension with deploymentConfig, a user wants to update some values (e.g., changing memory limits from 256Mi to 512Mi, or adding a new nodeSelector). Could you clarify:
Example:

```yaml
# Initial configuration
deploymentConfig:
  resources:
    limits:
      memory: "256Mi"
```

User updates to:

```yaml
deploymentConfig:
  resources:
    limits:
      memory: "512Mi"   # ← modified
  nodeSelector:         # ← added
    infrastructure: "dedicated"
```

Question 2: Adding deploymentConfig After Creation. Scenario: A user creates a ClusterExtension without defining deploymentConfig initially, then later wants to add deployment configuration. Could you clarify:
Example:

```yaml
# Initial creation (no deploymentConfig)
apiVersion: olm.operatorframework.io/v1
kind: ClusterExtension
metadata:
  name: my-operator
spec:
  source:
    sourceType: Catalog
    catalog:
      packageName: my-operator
  # Note: no config.inline.deploymentConfig
---
# Later, user adds deploymentConfig
spec:
  config:
    inline:
      deploymentConfig:   # ← newly added
        nodeSelector:
          infrastructure: "dedicated"
```
|
static at runtime, dynamic at build time
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing |
|
@kuiwang02 let me try to reply to your questions
From the perspective of OLMv1, bundle configuration is opaque. OLMv1 will take user input, validate it against the configuration schema provided by the bundle, and apply it to generate the final manifests. So, any configuration can
Yes. The Deployment will be regenerated with the new values and applied to the cluster.
I'd say so, yes. Any changes to the pod template should trigger a new ReplicaSet, and the Deployment will transition towards it.
This is a good question. I know there are fields in the Deployment spec that are immutable (e.g. the label selector). That's the only one I can think of. I think the configuration options under the deployment config are mutable.
Yes. For the same reasons in Q1.1
Yes.
Yes.
I don't think so.
Then we are back to the Deployment spec defined in the bundle by the author. The mental model here is really no different than:
AFAIK only the pod label selector is immutable once set. |
@perdasilva Thanks for your great reply. I got it. |
|
@JoelSpeed PTAL, thanks! |
|
|
||
| ## Proposal | ||
|
|
||
| This proposal extends the existing ClusterExtension inline configuration structure to support deployment customization by using [OLMv0's](https://github.com/operator-framework/api/blob/master/pkg/operators/v1alpha1/subscription_types.go#L42-L100) `v1alpha1.SubscriptionConfig`. For clarity in the OLMv1 codebase, a type alias, `DeploymentConfig`, will be introduced, since Subscription is a v0-only concept. OLMv1 will directly alias the `SubscriptionConfig` type from operator-framework/api to define the `DeploymentConfig` type. This ensures feature parity with OLMv0 by using the same type definition, and it reduces our maintenance overhead while we navigate a period of simultaneous maintenance of both OLMv0 and OLMv1. |
|
Sharing types between APIs like this is generally discouraged. Someone may make a change to the SubscriptionConfig that you didn't realise and then you end up either supporting (or breaking) functionality inadvertently in the DeploymentConfig usage.
Is the code shared between v0 and v1 for this part of the deployment strategy, or are you just trying to make sure you support the same set of customizations? Is there anything in the alpha API that you might want to make better from lessons learned as you promote this to v1?
|
@JoelSpeed we had this discussion upstream where I'd proposed a new, completely separate structure for v1, but it was vetoed in favor of keeping v0 and v1 in sync.
|
The purpose of this enhancement is to provide a feature that is needed for the migration pathway from OLMv0-installed operators to OLMv1 ClusterExtensions. We explicitly need all subscription.spec.config fields and behaviors to be supported in OLMv1 to ensure migration from OLMv0 to OLMv1 is not blocked on a diff between the configuration APIs.
The idea here is that a migration tool would be able to take the subscription.spec.config and copy it into the clusterextension.spec.config.inline.deploymentConfig.
So one way or another the OLMv1 bundle-schema for deploymentConfig needs to include everything that OLMv0 includes for equivalent OCP versions.
|
Is there anything we could do downstream to eliminate some of the issues I've raised here? For example, deploying a VAP to prevent someone from leveraging the fields in the API apart from the ones we care about/support?
|
We explicitly need all subscription.spec.config fields and behaviors to be supported in OLMv1 to ensure migration from OLMv0 to OLMv1 is not blocked on a diff between the configuration APIs.
At least one field (Selector) is called out in this EP as not being supported, is the intention to support that long term or is that the only exception we expect?
|
@JoelSpeed this should help with your concern with the Selector field: oceanc80#2
operator-framework/operator-controller#2525
more context in this comment: #1915 (comment)
|
|
||
| #### Validation Failure | ||
|
|
||
| If the deployment configuration fails JSON schema validation: |
|
What do you mean by json schema validation here?
|
| 1. The ClusterExtension controller rejects the configuration during admission or runtime validation | ||
| 2. The ClusterExtension status is updated with a detailed error message indicating which fields failed validation | ||
| 3. The cluster extension admin corrects the configuration based on the error message | ||
| 4. The controller retries the installation with the corrected configuration |
|
Why can't you reject invalid configuration at admission time instead of leaving a controller to check this asynchronously?
|
Hey Joel, this enhancement is related to this previous one, where all of this was discussed. The TL;DR is: OLMv1 aims to get out of the bundle-configuration game, so there won't be a one-size-fits-all configuration schema to validate against at ingress time. We want to shift the configuration surface definition to the authors and allow them to provide content-specific knobs. OLMv1 will only validate configuration against a bundle-provided schema.
|
|
||
| ### API Extensions | ||
|
|
||
| The enhancement does not introduce new APIs, CRDs, webhooks, or aggregated API servers. As the inline configuration structure in the ClusterExtension API accepts any valid JSON object, the API will not be changed. This enhancement modifies the existing configuration schema for registry+v1 bundles to accept a `deploymentConfig` field. |
|
Yes it does? You're adding a new field with significant structure to the ClusterExtension API as far as I can tell from reading this?
Even if this is a new structure within something that is effectively unstructured data, you are still introducing a new API contract, this should be explained here and reviewed
Can you link to the existing schema and show where it is updated to allow this deploymentConfig field?
|
The CRD doesn't change. inline remains as is:
There are two lenses to this (and I feel like a lot of this was already discussed in the original EP):
- OLMv1 conceptually claims no control over schemas provided by bundles. Bundles define the schemas of their APIs, OLM validates the provided configurations using the bundle-provided schema.
- Technically, the registry+v1 schema definition is driven by two things:
- The install modes of the operator (this dictates the schema of the watch namespace configuration)
- OLMv0's subscription.spec.config
So this EP is not technically a new contract. It is establishing the OLMv1 implementation of an existing contract (OLMv0's subscription.spec.config)
Follow-up question: Is OCP okay with every single layered product being able to define its own configuration schema in the future? If so, will every layered product be expected to bring its configuration schema to OpenShift's API review? This API change is essentially the equivalent of that.
If that seems untenable, perhaps we should drop the configuration API from OLMv1 entirely and require that layered products have a one-size-fits-all deployment (similar to core payload operators). But that would have major implications on migration from OLMv0 to OLMv1 that PM would have to weigh in on.
|
Follow-up question: Is OCP okay with every single layered product being able to define their own configuration schema in the future? If so, will every layered product be expected to bring their configuration schema to OpenShift's API review? This API change is essentially the equivalent that.
Interesting question. Are the JSON schemas that the bundles are supposed to provide, perchance, similar to (or the same as) the CRD schemas we generate in o/api? If they were similar enough, I wonder if they could indeed be reviewed and generated in the same way CRDs are.
Perhaps we need to grow an API reviewer from within the OLM team to help the layered products?
|
Perhaps we need to grow an API reviewer from within the OLM team to help the layered products?
In other words, a team like the "Portfolio team" we had for olmv0 (staffed by Lance/Varsha/Brett etc)...coz it's too much for one person being asked to do all of it and just this full time :)
|
And none of those people are working on OLM any more...
| ### API Extensions | ||
|
|
||
| The enhancement does not introduce new APIs, CRDs, webhooks, or aggregated API servers. As the inline configuration structure in the ClusterExtension API accepts any valid JSON object, the API will not be changed. This enhancement modifies the existing configuration schema for registry+v1 bundles to accept a `deploymentConfig` field. | ||
| The configuration is validated using JSON schema generated from Kubernetes core v1 and apps v1 OpenAPI specifications. |
|
These generated schemas are not particularly reliable sources of information, why can you not validate using the internal validation packages within these APIs?
|
They are reliable sources of information in the sense that they're generated from the source itself. ref: operator-framework/operator-controller#2454
I'm not sure what you mean by internal validation packages. I'm guessing the PR makes things more clear?
|
generated from Kubernetes core v1 and apps v1 OpenAPI specifications.
The schemas generated from core v1 and apps v1 Go types are only as good as the markers on the types that the openapi schema generator understands. Except we know (as a K8s community) that these are wildly incomplete right now, hence projects like the declarative validation and linting projects that are aiming to make these validation markers more complete on the upstream types.
If you're leveraging openapi generated from those types, it will not be complete.
By internal validation, I mean functions like ValidatePodSpec. Depending on which core v1 types you use, all of the validation code will exist somewhere around here
|
That's really good to know!
@anik120, let's make sure we capture this for follow-up investigation: see what our exposure is in terms of the diff between the OpenAPI spec that we use vs. the validation functions that the apiserver actually uses.
At worst, we would need to document this gap. At best, perhaps we could call these validation functions as part of our schema validation?
| } | ||
| ``` | ||
|
|
||
| The `Selector` field in the `SubscriptionConfig` is present but is never extracted or used by OLMv0. OLMv1 will maintain this behavior, so the field will be accepted but ignored. |
|
Accepting but ignoring a field is bad practice. Why not create a new type for the deployment config? It doesn't look like it'll be particularly complex to implement
|
Same answer as above:
we had this discussion upstream where I'd proposed a new, completely separate structure for v1, but it was vetoed in favor of keeping v0 and v1 in sync.
It was discussed that reusing the v0 structure would mean carrying over debts, but the cost of it was assessed to be acceptable for long term maintainability
|
I agree with @JoelSpeed 's point here. Even if we re-use the type, we are also in control of the schema generation for that type, right? So at a minimum we could specifically exclude that field from the generated schema.
|
Big +1 to excluding this field from the schema downstream if it's not going to be supported
|
Looks like there's some confusion here: we are NOT exposing the field in the schema, even though it's being carried over because of the usage of v1alpha1.SubscriptionConfig.
Wrote a test to make that more explicit: operator-framework/operator-controller#2525
Also @oceanc80 fyi oceanc80#2
| The inline configuration will be validated using JSON schema-based validation. The JSON schema for `DeploymentConfig` will be generated by introspecting the `v1alpha1.SubscriptionConfig` struct. This approach: | ||
|
|
||
| - **Ensures parity with OLMv0**: The schema is derived from the exact same type definition used in OLMv0 | ||
| - **Automatic updates**: When the github.com/operator-framework/api dependency is updated, the schema generation can be re-run to incorporate any new fields |
|
How do you know that the schema generated by your v1 tooling is the same as the v0 tooling? What if there's drift between the generators?
|
The nature of the generators in v0 and v1 is completely different. In v0 they were part of the API surface; in v1 they are not, i.e. the validation structure differs between v1 and v0, but both are tied together by the underlying SubscriptionConfig structure.
And the v1 generator uses the SubscriptionConfig struct to generate the schema.
In other words, this is why it was decided to use the v0 SubscriptionConfig directly, so that any drift in v0 automatically gets tracked in v1.
Note that the schema is not expected to change in v1 (thereby requiring a tracking back of change to v0), as this effort is solely to support old registry+v1 bundles in v1 so that they can be migrated over.
|
The nature of the generators in v0 and v1 are completely different
Yes, this is my point. You have the same structure in two APIs, but with the validation being generated by two different tool sets.
How will you ensure that an object that is accepted by v0 is accepted by v1, and one that is rejected by v0 is rejected by v1? You either need to ensure the schemas are the same (use the same generation?) or have some test suite to ensure this
|
But our goal isn't to maintain parity between the objects accepted in v0 and v1. Case in point being the selector field. An object with selector field will be accepted in v0 but will be rejected in v1.
Having said that, one thing that needs to be present is some framework through which we test "v0 -> v1 migration" kind of scenarios. Right now that framework is absent. Openshift integration test suites could have been one place to add these kind of tests, but with OTE being the thing now, the tests live in different repositories. We'll have to think about the best place to write tests like these.
|
Once we get around to implementing the actual migration code, I expect that will live upstream and have upstream tests. I think we can cover the scenario of "Subscription had selector, and migration tool simply drops it because OLMv0 ignores it and OLMv1 disallows it"
| curl -sSL https://raw.githubusercontent.com/kubernetes/kubernetes/refs/tags/$(OPENAPI_VERSION)/api/openapi-spec/v3/apis__apps__v1_openapi.json > ./pkg/schema/apis__apps__v1_openapi.json | ||
| ``` | ||
|
|
||
| This ensures that while the first iteration uses a static snapshot, there is an established, low-effort path to update the schema manifests whenever the project's Kubernetes dependencies are bumped. |
|
Will you have checks in place so that people keep this up to date?
|
Yes, upstream runs verify on every change, and this check has been included in it.
| ## Open Questions / Considerations | ||
|
|
||
| ### Track changes to underlying kubernetes corev1 structures? | ||
| SubscriptionConfig uses many kubernetes corev1 structures from the standard kube lib. This means that the OLMv0 Subscription API would track changes to those structures (e.g. if a new Volume type is added to the API etc.). We need to think about whether we want the same behavior here, and if so how we'd like to implement it. E.g. we could have some process downloading and mining the openapi specs for the given kube lib version we have in go.mod, and having make verify fail when that changes. We'd want to think about how we'd handle any CEL expressions in those corev1 structures when doing the validation (and whether we want to handle them?). |
|
Are you doing any processing of these fields, or just setting them directly on the deployment that you're rendering and applying? If you aren't processing them and are just passing them through, then this is probably fine
|
Just passing them through without any processing. See https://docs.google.com/document/d/18O4qBvu5I4WIJgo5KU1opyUKcrfgk64xsI3tyXxmVEU/edit?tab=t.0#bookmark=kix.1y89z5sf1akf for more details.
add selector field clarification
|
@JoelSpeed can we move this along? Our downstreaming efforts are currently blocked by this EP's merging (the feature is planned for TPNU for this release) |
|
None of my comments here are blocking, though I do recommend you double check on #1915 (comment).
/override ci/prow/markdownlint
New sections were added to the template after you started this.
/assign @joelanford
Joe is listed as your approver |
|
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/markdownlint. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@oceanc80: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| - Sufficient time for feedback from Tech Preview users | ||
| - Production deployment validation | ||
| - End-user documentation including best practices and examples for common scenarios | ||
| - Address any issues found during Tech Preview |
|
I would add to this list:
- Robust testing that would verify expected migration scenarios from OLMv0 to OLMv1.
| 1. OLM Team (primary) | ||
| 2. Layered Product Team |
|
Flip this around. Layered product teams are better positioned to answer questions about issues with their specific operator configurations and interplay with their CSV's stated defaults. I would expect:
- Customer escalates to layered product team
- If LP team can't diagnose, LP team escalates to OLM team.
Enhancement extending OLMv1's ClusterExtension API to support deployment configuration in order to provide feature parity with OLMv0's SubscriptionConfig.