OCPNODE-4052: Add enhancement for additional storage configuration in CRI-O #1934
Conversation
@saschagrunert: This pull request references OCPNODE-4052, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: the story was expected to target the "4.22.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
FYI @ktock @GerrySeidman
/lgtm
This is awesome. Thanks for making this happen. This is particularly great for AuriStor's ACA AdditionalLayerStore. We have users who have been waiting for this capability! Again, thank you!

Note regarding "Supported Image formats and Technologies":
yuqi-zhang left a comment:
The general idea lgtm, just some comments/questions inline
> #### Additional Layer Stores Workflow
> 1. Cluster administrator identifies slow pod startup times from large images (>5GB) and installs a storage plugin (e.g., stargz-store) on target nodes via DaemonSet or MachineConfig
Curious how this could be done today. Presumably the only supported way is to use image mode to build an RHCOS image with the plugin installed?
Yeah, they could use either a DaemonSet, a MachineConfig with a systemd unit, or image mode. I clarified that in the statement.
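For the MachineConfig route, a minimal sketch could look like the following. This is illustrative only: the unit name, binary path, and mount point are assumptions, and the plugin binary is presumed to already be present on the host (e.g. via image mode):

```yaml
# Hypothetical MachineConfig running a pre-installed stargz-store binary
# as a systemd unit; binary path and mount point are assumptions.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-stargz-store
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.4.0
    systemd:
      units:
      - name: stargz-store.service
        enabled: true
        contents: |
          [Unit]
          Description=Stargz additional layer store (illustrative)
          Before=crio.service

          [Service]
          ExecStart=/usr/local/bin/stargz-store /var/lib/stargz-store/store
          Restart=always

          [Install]
          WantedBy=multi-user.target
```

The DaemonSet alternative would mount the same host path and run the plugin as a privileged pod instead of a systemd unit.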
> ```
> additionalLayerStores:
> - path: /var/lib/stargz-store
> ```
> 3. MCO generates `storage.conf`, creates a MachineConfig, and applies it to the selected pool; nodes reboot
I'm curious whether these could be drop-in files instead of modifying the main configuration file? I guess the benefit of modifying the main configuration file is that we maintain our overwrite-not-merge management of CRC objects.
(The background is that due to how the CRC->MC rendering is implemented today, we don't do a merge of configs unless they happen to touch different files/units on-disk, which is a bit of a weird behaviour detail. So if someone were to apply all 3 example CRCs to the cluster, only a subset of them would take effect, depending on how they actually render to files. Perhaps we should document it as "merge this into the main containerruntimeconfig object so you only have 1 CRC per pool still").
Support for a drop-in is on the horizon for storage.conf, but I don't think we'll get it until RHEL 10.4 or so (Podman 6 timeframe).
I added a note wrt that in the doc.
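For reference, the generated `storage.conf` change would presumably land under `[storage.options]`, along these lines. The option name matches what containers/storage and the stargz-store documentation use; the exact layout the MCO would render is an assumption:

```toml
# Illustrative sketch of the MCO-generated storage.conf entry;
# the surrounding table layout is an assumption.
[storage]
driver = "overlay"

[storage.options]
# Path derived from the ContainerRuntimeConfig additionalLayerStores entry
additionallayerstores = ["/var/lib/stargz-store/store"]
```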
> - For `additionalArtifactStores`: Update CRI-O configuration with `additional_artifact_stores` array
> 3. **Create MachineConfig**: Bundle generated configuration into a MachineConfig
> 4. **Apply to nodes**: MCO applies configuration to nodes matching the `machineConfigPoolSelector`
> 5. **Trigger reboot**: Nodes reboot to apply new storage/runtime configuration
Is a reboot required for these changes or would a reload/restart of crio be sufficient?
We could add support for reload, but we currently don't, AFAIU.
Updated to clarify that reboot is currently required, but noted that CRI-O reload/restart without reboot may be considered for future releases.
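To illustrate the artifact-store side of step 2, the generated CRI-O configuration could look roughly like this. Only the `additional_artifact_stores` key name comes from the enhancement; the file name, table placement, and example paths are assumptions:

```toml
# Hypothetical CRI-O config fragment, e.g. /etc/crio/crio.conf.d/99-artifact-stores.conf
# Table placement and paths are illustrative assumptions.
[crio.image]
additional_artifact_stores = ["/mnt/ssd-artifacts", "/var/lib/artifact-cache"]
```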
… CRI-O

This enhancement extends the ContainerRuntimeConfig API with three features for improved container storage flexibility:

- additionalLayerStores: Enable lazy image pulling via storage plugins (BYOS approach with stargz-snapshotter, nydus-storage-plugin)
- additionalImageStores: Read-only container image caches on shared or high-performance storage (NFS, SSD) for faster startup and reduced network overhead
- additionalArtifactStores: Configurable OCI artifact storage locations (SSD-backed storage, pre-populated caches, air-gapped deployments)

All features target AI/ML workload performance improvements and will ship as Tech Preview in 4.22 behind the TechPreviewNoUpgrade feature gate. Path-based configuration with graceful fallback to standard behavior on failure. Unified API design pattern across all three features.

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
yuqi-zhang left a comment:
In the spirit of trying to get more enhancements merged before we start work on the API/feature, the general plan LGTM, so I'm leaving an approval from the MCO side.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ktock, yuqi-zhang.
@mrunalp @JoelSpeed PTAL
> pools.operator.machineconfiguration.openshift.io/worker: ""
> containerRuntimeConfig:
>   additionalImageStores:
>   - path: /mnt/nfs-images
Does this only have a path for now? If there were multiple entries in the list, what does CRI-O do? Does it balance across the stores somehow?
Yes, for now it's just a path, but in the future there may be more (distinct) configuration options. That's also one reason I decided not to merge the 3 types into one.
The underlying library does not do any deduplication, meaning we would have access to the configured store multiple times, in the order specified. I don't have a use case in mind where this would make sense, though. 🤷
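Putting that together, a full ContainerRuntimeConfig with multiple image stores would presumably have the stores consulted in the order listed. The field names follow the enhancement's examples; the object name and store paths are made up:

```yaml
# Illustrative ContainerRuntimeConfig; name and paths are assumptions.
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: additional-image-stores
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    additionalImageStores:
    # Stores are consulted in order; entries are not deduplicated.
    - path: /mnt/nfs-images
    - path: /mnt/ssd-images
```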
> // +optional
> // +listType=atomic
> // +kubebuilder:validation:MinItems=1
> // +kubebuilder:validation:MaxItems=5
Out of interest, why is this limited to 5, but the other two fields limited to 10?
Layer-level lookups are more granular than regular image/artifact accesses: when a container starts, CRI-O may need to look up multiple individual layers per image/artifact. With that in mind, there is also a higher risk of corruption and of many sequential lookups.
Besides that, the number of available plugin types for this use case is really limited. It's harder to set up, and I can't imagine a user needing more than 5 distinct additional layer stores. We only have 4 supported technologies right now.
I am not sure what I you mean by "higher risk of corruption and sequential lookups" related to multiple individual layers per image/artifact. Are you saying the ALS-processed layers are tightly bound to the container image?
TL;DR: This is not about the original comment on limits, but about the ALS itself.
Example:
- docker.io/python:3.13-alpine is processed such that all of its layers can be handled by an ALS plugin (for example via stargz conversions, or AuriStorFS Layer Volumes generated per unique Layer ID in the distributed file system)
- Container image FOO is "FROM docker.io/python:3.13-alpine", but FOO itself is not explicitly ALS-processed for the plugin
Expected behavior:
- All the layers from docker.io/python:3.13-alpine inherited by container FOO will be able to leverage the ALS plugin optimization
- The layers of FOO above docker.io/python:3.13-alpine would fall back to the default (non-ALS-optimized) processing
Namely, the ALS FUSE plugin is queried for `<root>/store/<image-ref>/<layer-digest>/diff` paths, but it is entirely up to the plugin how to interpret that path. The same content would be used across multiple .
Note: Unlike stargz, which presumably requires specially built stargz "FROM images", the AuriStor ALS works with any OCI container image manifests and layers without the need for special image builds. Additionally, its plugin is less strict: it can be configured with wildcard container image name equivalencies (i.e. github.com/your-org/* and docker.io/*), all mapped to the same namespace of Layer Volumes in the distributed file system, keyed by Layer ID.
Background: The AuriStor Additional Layer Store leverages the AuriStorFS distributed file system and its local cache manager. Unlike stargz, AuriStorFS does not require a special layer format, but it does check whether the layer content has been made available in the distributed file system (generally done in the CI/CD pipeline after the container is pushed to the registry), with an AuriStor Volume created per layer based on the Layer Digest/ID.
LGTM, will leave the labelling for other reviewers.
/lgtm
@saschagrunert: all tests passed!
> #### Single-node Deployments or MicroShift
>
> - **Single-node OpenShift (SNO)**: Supported. All three features work on SNO deployments, though the BYOS approach for layer stores may be operationally complex on single-node systems with limited resources.
> - **MicroShift**: Not supported. MicroShift does not use the Machine Config Operator, so the API is unavailable. These features are deferred for MicroShift environments.
Since MicroShift now supports multi-node deployments, should we manually configure these features in the future?
Extend the ContainerRuntimeConfig API with two features for AI/ML workload performance improvements.

Tech Preview for 4.22 behind the TechPreviewNoUpgrade feature gate.

Path-based configuration with a FUSE filesystem interface. Graceful fallback to standard behavior on failure.
PTAL @haircommander @QiWang19 @sairameshv @harche @bitoku @yuqi-zhang
API: openshift/api#2681