CFP-41953: add ClusterMesh Service v2 #77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open: MrFreezeex wants to merge 7 commits into cilium:main from MrFreezeex:clustermesh-svc.

Commits:

- 98dced8 CFP-41953: add ClusterMesh Service v2 CFP (MrFreezeex)
- 33465e2 wip: iteration 2 (MrFreezeex)
- 941acfd wip: iteration 3 (MrFreezeex)
- 4db1bd0 wip: revise timeline to cilium 1.20 and shorten transition period (MrFreezeex)
- 7fc2a65 wip: revise conditions details after clustermesh backends fix (MrFreezeex)
- e6f32f7 wip: update with latest lb improvements in main (MrFreezeex)
- de012ce wip: remove ratio with json decoding (MrFreezeex)

# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/services/v2/` and harmonizes backend data insertion
techniques between clustermesh services and Kubernetes services.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and inefficient data format and encoding.

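For reference, an abridged sketch of the shape of the v1 struct (the field
set here is approximate and for illustration only; the linked `store.go` is
the authoritative definition):

```go
// Abridged, illustrative sketch of the v1 ClusterService; see the linked
// source for the real definition.
type ClusterService struct {
	Cluster   string `json:"cluster"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`

	// Frontend and backend addresses (in string form) mapped to their
	// per-port configuration, all encoded as JSON in etcd.
	Frontends map[string]PortConfiguration `json:"frontends"`
	Backends  map[string]PortConfiguration `json:"backends"`

	Labels   map[string]string `json:"labels"`
	Selector map[string]string `json:"selector"`

	ClusterID uint32 `json:"clusterID"`
	Shared    bool   `json:"shared"`
}
```
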
### Missing backend conditions

The current format omits all backend conditions. It directly removes
backends that are not ready and not serving. This means that we cannot
properly perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.

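For context, these are the per-endpoint conditions exposed by the Kubernetes
`discovery/v1` API (and mirrored by Cilium's slim types). The helper below is
only a hedged illustration of how a terminating-but-serving backend could be
retained for graceful termination instead of being dropped; it is not the
actual Cilium logic:

```go
// Per-endpoint conditions as defined by discovery/v1.
type EndpointConditions struct {
	Ready       *bool // endpoint is ready to receive new traffic
	Serving     *bool // endpoint can still serve traffic, even while terminating
	Terminating *bool // endpoint is being drained
}

// keepBackend sketches a retention policy: terminating backends that are
// still serving are kept so established connections can drain, instead of
// being removed outright as the v1 format does.
func keepBackend(c EndpointConditions) bool {
	ready := c.Ready == nil || *c.Ready
	serving := c.Serving != nil && *c.Serving
	return ready || serving
}
```
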
### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh and the
standard loadbalancer k8s reflector. Their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) |
|----------|------------------|------------------------------------|-----------------------|----------------------------------|
| 1        | 44               | 35                                 | 4                     | 9x slower                        |
| 100      | 629              | 271                                | 4                     | 68x slower                       |
| 1,000    | 6,349            | 2,626                              | 7                     | 375x slower                      |
| 5,000    | 36,861           | 15,831                             | 30                    | 528x slower                      |
| 10,000   | 78,810           | 44,289                             | 70                    | 633x slower                      |

These benchmarks are not strictly equivalent, but they give a good idea
of the performance gap, how much JSON decoding contributes to it, and how
clustermesh degrades as the number of backends grows.

### Inefficient data format and network traffic

Over time, the current data format has shown its limits. Backend
information is largely the same across ports, so the total size of the
object tends to grow almost in proportion to the number of ports. In
addition, JSON encoding adds extra overhead for field names and number
formatting compared to a binary representation.

All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic. When combined with the increased churn
from backend conditions, this inefficiency becomes a significant scaling
concern.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to
the current `ClusterService` struct would restrict scalability to fewer
than 10,000 backends per service per cluster. While this is a high limit,
having headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is
even stronger in clustermesh because a mesh can contain many more nodes
than a single Kubernetes cluster.

## Goals

* Reduce network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent, particularly
  in service churn scenarios
* Allow clustermesh to scale to a larger number of backends per service
  per cluster
* Add backend conditions to clustermesh services to allow correct backend
  state handling in the loadbalancer
* Add the `EndpointSlice` name to the clustermesh service data to simplify
  the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
  MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes transitioning to a different format that directly
embeds Kubernetes structs and uses zstd compression.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will also include endpoint conditions, enabling the
loadbalancer to retain backends in various states and to apply the same
logic as for local services, instead of only handling backends in an
active state.

This will allow EndpointSliceSync to include endpoints that were
previously excluded, along with their full conditions, allowing a more
native integration with non-Cilium GW-API implementations. The inclusion
of an `EndpointSlice` name should also significantly simplify the
EndpointSliceSync logic and code base.

While these new backend conditions will increase service churn, we expect
to keep it manageable by following a very similar code path and adopting
the ingestion techniques already used in the loadbalancer k8s reflector,
with significant improvement over v1 for service churn scenarios (about
10-600x when excluding JSON decoding). Additionally, etcd operations are
already rate limited (by default 20 qps), which naturally coalesces multiple
service updates in the workqueue, ensuring good latency and throughput without
requiring explicit export throttling.

Compressing the data with zstd will also dramatically reduce the on-wire size
and etcd object size. We expect around 50x compression ratios or higher for
services with thousands of backends. This significantly reduces control plane
network bandwidth consumption.

### Unifying ingestion pipelines through shared data structures

The clustermesh and loadbalancer k8s reflector currently diverge in both
their data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain feature parity between the two code paths.

We now propose unifying these pipelines by adopting shared Kubernetes data
structures. Specifically, we will align clustermesh and the loadbalancer k8s
reflector directly with Kubernetes slim `EndpointSlice` structs instead of
using Cilium specific intermediate representations.

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions. Using the
Kubernetes format directly also allows the loadbalancer code to use the
same struct whether the data comes from a local Kubernetes cluster or from
clustermesh.

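For reference, the relevant shape of the upstream `discovery/v1`
`EndpointSlice` (abridged; the slim variant trims fields Cilium does not
need but keeps the same structure):

```go
// Abridged shape of a discovery/v1 EndpointSlice, for illustration.
type EndpointSlice struct {
	metav1.ObjectMeta

	// AddressType is IPv4, IPv6, or FQDN.
	AddressType AddressType
	// Endpoints lists the backends with their addresses and conditions.
	Endpoints []Endpoint
	// Ports is shared by all endpoints in the slice, so port information
	// is not duplicated per backend.
	Ports []EndpointPort
}

// Endpoint carries per-backend data, including the ready/serving/terminating
// conditions discussed earlier.
type Endpoint struct {
	Addresses  []string
	Conditions EndpointConditions
	NodeName   *string
	Zone       *string
}
```
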
Cilium currently uses an internal `Endpoints` resource with the
following struct:

```go
type Endpoints struct {
	types.UnserializableObject
	slim_metav1.ObjectMeta

	EndpointSliceID

	// Backends is a map containing all backend IPs and ports. The key to
	// the map is the backend IP in string form. The value defines the list
	// of ports for that backend IP, plus an additional optional node name.
	Backends map[cmtypes.AddrCluster]*Backend
}
```

This struct is a Cilium specific transformation of Kubernetes `EndpointSlice`
data. It was designed to support both the legacy `Endpoints` API and the newer
`EndpointSlice` API. With the introduction of statedb, most consumers now watch
statedb instead of relying on this resource, so the only place actually using
it is `operator/watchers/service_sync.go`, which exports service data for
clustermesh.

Since this resource is now primarily used for clustermesh exports, we propose
changing it to directly expose the slim Kubernetes `EndpointSlice` struct. The
new `ClusterServiceV2` format will embed these `EndpointSlice` objects directly,
avoiding the need for an intermediate Cilium-specific representation. This
approach prevents future divergence that could occur if the internal `Endpoints`
struct is later optimized or changed for other purposes.

The new `ClusterService` v2 struct would thus look like this:

```go
type ClusterServiceV2 struct {
	Cluster   string `json:"cluster"`
	ClusterID uint32 `json:"clusterID"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	// Note that not every field from the EndpointSlice will
	// be populated (for instance, fields from TypeMeta and most
	// fields from ObjectMeta)
	EndpointSlices []*slim_discovery_v1.EndpointSlice `json:"endpointslices"`
}
```

By using the Kubernetes `EndpointSlice` format directly, both the loadbalancer
and clustermesh will consume similar data structures. This enables code reuse
for the agent-side ingestion logic. We can extract most of the code in
`pkg/loadbalancer/reflectors/k8s.go` into a shared package under
`pkg/loadbalancer` that both the loadbalancer and clustermesh can use. This
shared code will handle buffering, coalescing, and statedb updates, ensuring
similar ingestion and optimization for both code paths. This alignment allows
us to introduce an ingestion buffer in the agent for clustermesh services,
similar to the loadbalancer reflector. We might tweak some parameters for
these two different use cases (max buffer size, timeouts, etc.), but the core
logic would be shared.

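A minimal sketch of what this shared coalescing step could look like; the
names, types, and flush wiring are illustrative assumptions rather than the
final API (assumes `sync` is imported and uses the `ClusterServiceV2` struct
from above):

```go
// serviceKey identifies a global service coming from a given remote cluster.
type serviceKey struct {
	Cluster, Namespace, Name string
}

// ingestBuffer coalesces updates: only the latest version of each service is
// kept between flushes, so rapid churn collapses into a single write.
type ingestBuffer struct {
	mu      sync.Mutex
	pending map[serviceKey]*ClusterServiceV2
}

func newIngestBuffer() *ingestBuffer {
	return &ingestBuffer{pending: map[serviceKey]*ClusterServiceV2{}}
}

func (b *ingestBuffer) Upsert(svc *ClusterServiceV2) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending[serviceKey{svc.Cluster, svc.Namespace, svc.Name}] = svc
}

// Flush hands the coalesced batch to a commit function, which would apply it
// in a single statedb write transaction (left abstract here).
func (b *ingestBuffer) Flush(commit func(map[serviceKey]*ClusterServiceV2)) {
	b.mu.Lock()
	batch := b.pending
	b.pending = make(map[serviceKey]*ClusterServiceV2, len(batch))
	b.mu.Unlock()
	if len(batch) > 0 {
		commit(batch)
	}
}
```
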
We should also determine whether a single buffer for all remote clusters or a
per-cluster buffer is more efficient, and what its exact parameters should be.
This could be experimented with during implementation.

We expect performance to be closer to the k8s reflector. According to the
benchmarks in the Motivation section, the k8s reflector is currently about
10-600x faster than clustermesh v1 for updates when excluding JSON decoding.
The added churn from condition changes should thus remain manageable even at
scale, with much of the remaining overhead coming from JSON decoding and
decompression.

### Compressing the clustermesh service data

We propose compressing the service data stored in etcd using zstd. For
large services with thousands of backends, we expect compression ratios
of around 50x or higher. For example, the current `ClusterService` v1
format for a 5,000 backend service with 2 ports compresses from
786.89 KiB down to 14.22 KiB with zstd, a 55x compression ratio.

For instance, consider an 11-cluster mesh where each cluster has 1,000
nodes. With the same `ClusterService` in v1 format with 5,000 backends
and 2 ports, any update from any endpoint in this single service would
result in about 7.5 GiB of uncompressed data propagated across the mesh
(786.89 KiB replicated to the roughly 10,000 nodes in the other clusters).
With zstd compression, this reduces to approximately 139 MiB. This is
especially important when service churn happens during service flaps or
even a simple Deployment rollout, which could potentially trigger these
updates hundreds of times.

The etcd 1.5 MiB per object limit is not a concern for the supported
scale. Kubernetes supports a maximum of 150k endpoints per cluster total
(across all services). Cilium's default configuration limits the number
of backends to 64k (both local cluster and clustermesh combined) via the
`bpf-lb-map-max` setting. These limits are not per service. Even if they
can be increased, individual service backends from a particular cluster
should remain well below the etcd object size limit even without
compression. However, compression is still critical for reducing control
plane network bandwidth needed to propagate service changes to all agents
in the mesh.

All clustermesh v2 service objects will be stored as raw
`zstd(JSON(ClusterServiceV2))` bytes under the `cilium/state/services/v2/`
key prefix. All other objects in etcd (including `ClusterService` v1) will
remain uncompressed. Services are the only objects in etcd that can grow
unbounded with cluster size. This makes them the primary candidate for
compression.

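As a rough sketch of the encode/decode path (using
`github.com/klauspost/compress/zstd` purely as an example library, with an
encoder/decoder created via `zstd.NewWriter(nil)` / `zstd.NewReader(nil)`;
the helper names and wiring are assumptions, not the actual implementation):

```go
// Hypothetical helper illustrating zstd(JSON(ClusterServiceV2)).
func encodeClusterServiceV2(enc *zstd.Encoder, svc *ClusterServiceV2) ([]byte, error) {
	raw, err := json.Marshal(svc)
	if err != nil {
		return nil, err
	}
	// EncodeAll compresses raw and appends it to the second argument.
	return enc.EncodeAll(raw, nil), nil
}

// Hypothetical helper for the agent-side decode path.
func decodeClusterServiceV2(dec *zstd.Decoder, payload []byte) (*ClusterServiceV2, error) {
	raw, err := dec.DecodeAll(payload, nil)
	if err != nil {
		return nil, err
	}
	svc := &ClusterServiceV2{}
	return svc, json.Unmarshal(raw, svc)
}
```

The compressed bytes would then be written under the
`cilium/state/services/v2/` prefix, keyed per cluster and service.
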
In our benchmarks, zstd decompression added minimal overhead compared
to JSON decoding (about 5%). Decompression and decoding are done
concurrently per remote cluster in the agent. This further limits the
performance impact.

### Rollout strategy

To introduce v2 of the clustermesh service data format, this CFP
proposes a global switch rather than per-cluster detection. This keeps
the transition simple. Per-cluster detection would add complexity
because we would need to handle downgrade and upgrade of remote clusters
at runtime.

With this approach, we would add a new temporary option
`clustermesh-service-v2-enabled`. This option will control whether the operator
and agent use the v1 or v2 format. The option will be disabled by default in
Cilium 1.20 and removed in Cilium 1.21. In Cilium 1.20, we would also
unconditionally export both the v1 and v2 format while `KVStoreMesh` mirrors
both. This means "double" etcd storage during the transition period, but since
agents will watch only one version (based on their configuration), the network
traffic will not be doubled.

This gives a balance between keeping the change simple and ensuring
that users can upgrade without traffic disruptions. Cilium 1.20 will
essentially serve as a transition release where both formats are supported.
Users will be able to turn on `clustermesh-service-v2-enabled` early in Cilium
1.20, assuming all the clusters in their mesh already run Cilium 1.20 or higher.

To make this transition easier to understand, we could make Cilium
export its own version in its `CiliumClusterConfig`. We could prevent
connecting to remote clusters running Cilium 1.19 or lower when
`clustermesh-service-v2-enabled` is enabled. This would result in a hard
error when attempting to establish a connection to an incompatible
cluster. We could also add a warning when we connect to remote clusters
that run with more than one minor version difference, since such a
configuration is not officially supported or tested in our CI.

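A hedged sketch of what such a compatibility check could look like; the
version field on `CiliumClusterConfig` does not exist today, and the struct,
helper, and use of `github.com/blang/semver/v4` below are illustrative
assumptions only:

```go
// remoteConfig stands in for the (hypothetical) version information a remote
// cluster would expose through its CiliumClusterConfig.
type remoteConfig struct {
	ClusterName   string
	CiliumVersion semver.Version
}

var minVersionForServiceV2 = semver.MustParse("1.20.0")

// checkServiceV2Compatibility returns a hard error when the v2 format is
// enabled locally but the remote cluster is too old to publish it.
func checkServiceV2Compatibility(cfg remoteConfig, serviceV2Enabled bool) error {
	if serviceV2Enabled && cfg.CiliumVersion.LT(minVersionForServiceV2) {
		return fmt.Errorf("cluster %s runs Cilium %s, which does not support the clustermesh service v2 format",
			cfg.ClusterName, cfg.CiliumVersion)
	}
	return nil
}
```
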
## Impacts / Key Questions

### Impact: Service format breaking change

This change will introduce v2 of service data in etcd. As proposed, this
would introduce an incompatibility between clusters running Cilium 1.18
or lower and Cilium 1.20 or higher by default.

### Impact: Text format readability for debugging

If we compress data with zstd, it will be harder to inspect the content
of a service object in etcd for debugging. We think this is an acceptable
trade-off given the benefits for large services.

### Option 1: Use a slice approach

We could also use a slice approach very similar to what Kubernetes has
done with `EndpointSlice` vs the original `Endpoints`.

#### Pros

* Consistent with the Kubernetes approach

#### Cons

* Requires more objects encoded in etcd and generates more churn that we
  cannot easily coalesce across multiple slice objects
* Compression ratio might be worse than a single-object approach
* As a result of the two previous points, it would most likely be worse in
  terms of network usage

### Option 2: Use protobuf encoding

#### Pros

* More efficient encoding than JSON in terms of size
* Faster decoding than JSON

#### Cons

* Readability for debugging would be worse
* zstd achieves worse compression ratios with protobuf than with JSON,
  which reduces the effective size benefit from switching to protobuf
  (although protobuf is still better)
* Work on `encoding/json/v2` in recent Go versions will reduce the JSON
  decoding gap in the future. Data decoding is also done concurrently
  per cluster, which further limits the impact of decoding speed.

Review discussion:

> I'm slightly worried that using the `EndpointSlice` struct directly might
> limit us in making extensions easily. I suppose we can always embed
> `EndpointSlice` into a new struct later to add new fields while keeping the
> format mostly the same.

> Hmm, what do you mean by extension? Like something not from the Kubernetes
> API? One of the arguments for directly embedding the `EndpointSlice` is to
> present the data (almost) un-transformed, so that we can then change the
> data in any way we want, like we would by directly reading the Kubernetes
> API essentially.

> I was thinking something like cilium/cilium#40190 or other properties we
> might want to set that are not part of `EndpointSlice` directly. Though as
> the issue suggests, this would be done via annotations and we can carry
> those through with this design.

> Ah I see. So I think right now we sync labels and annotations of the Service
> in the ClusterService struct, but those are (I am pretty sure?) actually
> unused afterwards. It's always the service annotations from the local
> kube-apiserver that are taken into account, not the ones from the remote
> cluster.
>
> But sharing service annotations in remote clusters is pretty easy, so we
> could pretty much keep syncing those, as it should be rather small and would
> make it easier to introduce features relying on that later on!