# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/services/v2/` and harmonizes backend data insertion
techniques between clustermesh services and Kubernetes services.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and inefficient data format and encoding.

### Missing backend conditions

The current format omits all backend conditions: backends that are neither
ready nor serving are simply dropped before export. As a result, we cannot
properly perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.
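
To make the intended behavior concrete, the selection logic that backend
conditions enable can be sketched as follows (a minimal, self-contained sketch
with stand-in types, not Cilium's actual loadbalancer code):

```go
package main

import "fmt"

// Backend is a stand-in representation carrying the EndpointSlice-style
// conditions; field names are illustrative.
type Backend struct {
	Addr        string
	Ready       bool
	Serving     bool
	Terminating bool
}

// selectBackends mirrors the graceful-termination behavior described in the
// KPR documentation: prefer ready backends, and only when none are ready,
// fall back to backends that are still serving while terminating.
func selectBackends(backends []Backend) []Backend {
	var ready, terminating []Backend
	for _, b := range backends {
		switch {
		case b.Ready:
			ready = append(ready, b)
		case b.Serving && b.Terminating:
			terminating = append(terminating, b)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return terminating
}

func main() {
	// During a rolling update all backends may be terminating but still
	// serving; with conditions available they can keep receiving traffic.
	backends := []Backend{
		{Addr: "10.0.0.1", Serving: true, Terminating: true},
		{Addr: "10.0.0.2", Serving: true, Terminating: true},
	}
	for _, b := range selectBackends(backends) {
		fmt.Println(b.Addr)
	}
}
```

With the v1 format these terminating-but-serving backends are never exported,
so this fallback is impossible for clustermesh services.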

### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh and the
standard loadbalancer k8s reflector. Their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) |
|----------|------------------|------------------------------------|-----------------------|----------------------------------|
| 1 | 44 | 35 | 4 | 9x slower |
| 100 | 629 | 271 | 4 | 68x slower |
| 1,000 | 6,349 | 2,626 | 7 | 375x slower |
| 5,000 | 36,861 | 15,831 | 30 | 528x slower |
| 10,000 | 78,810 | 44,289 | 70 | 633x slower |

These benchmarks are not strictly equivalent, but they give a good idea
of the performance gap, how much JSON decoding contributes to it, and how
clustermesh degrades as the number of backends grows.

### Inefficient data format and network traffic

Over time, the current data format has shown its limits. Backend
information is largely the same across ports, so the total size of the
object tends to grow almost in proportion to the number of ports. In
addition, JSON encoding adds extra overhead for field names and number
formatting compared to a binary representation.

All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic. When combined with the increased churn
from backend conditions, this inefficiency becomes a significant scaling
concern.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to
the current `ClusterService` struct would restrict scalability to fewer
than 10,000 backends per service per cluster. While this is a high limit,
having headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is
even stronger in clustermesh because a mesh can contain many more nodes
than a single Kubernetes cluster.

## Goals

* Reduce network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent and in particular
related to service churn scenarios
* Allow scaling on the clustermesh level to a larger number of backends per
service per cluster
* Add backend conditions to clustermesh services to allow correct backend
state handling in the loadbalancer
* Add `EndpointSlice` name to the clustermesh service data to simplify
the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes transitioning to a different format that directly
embeds Kubernetes structs and uses zstd compression.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will also include endpoint conditions, enabling the
loadbalancer to retain backends in various states and allowing the
loadbalancer to apply the same logic as for local services instead of only
having backends in an active state.

This will allow EndpointSliceSync to include endpoints that were previously
excluded, along with their full conditions, enabling more native integration
with non-Cilium GW-API implementations. The inclusion of an `EndpointSlice`
name should also significantly simplify the EndpointSliceSync logic and
code base.

While these new backend conditions will increase service churn, we expect
to keep it manageable by following a very similar code path and adopting
the ingestion techniques already used in the loadbalancer k8s reflector,
with significant improvement over v1 for service churn scenarios (about
10-600x when excluding JSON decoding). Additionally, etcd operations are
already rate limited (20 QPS by default), which naturally coalesces multiple
service updates in the workqueue, ensuring good latency and throughput without
requiring explicit export throttling.

Compressing the data with zstd will also dramatically reduce the on-wire size
and etcd object size. We expect around 50x compression ratios or higher for
services with thousands of backends. This significantly reduces control plane
network bandwidth consumption.

### Unifying ingestion pipelines through shared data structures

The clustermesh and loadbalancer k8s reflector currently diverge in both
their data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain feature parity between the two code paths.

We now propose unifying these pipelines by adopting shared Kubernetes data
structures. Specifically, we will align clustermesh and the loadbalancer k8s
reflector directly with Kubernetes slim `EndpointSlice` structs instead of
using Cilium specific intermediate representations.

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions. Using the
Kubernetes format directly also allows the loadbalancer code to use the
same struct whether the data comes from a local Kubernetes cluster or from
clustermesh.

Cilium currently uses an internal `Endpoints` resource with the
following struct:

```go
type Endpoints struct {
	types.UnserializableObject
	slim_metav1.ObjectMeta

	EndpointSliceID

	// Backends is a map containing all backend IPs and ports. The key to
	// the map is the backend IP in string form. The value defines the list
	// of ports for that backend IP, plus an additional optional node name.
	Backends map[cmtypes.AddrCluster]*Backend
}
```

This struct is a Cilium specific transformation of Kubernetes `EndpointSlice`
data. It was designed to support both the legacy `Endpoints` API and the newer
`EndpointSlice` API. With the introduction of statedb and most consumers watching
statedb instead of relying on this resource, the only place actually using it
is `operator/watchers/service_sync.go` to export service data for clustermesh.

Since this resource is now primarily used for clustermesh exports, we propose
changing it to directly expose the slim Kubernetes `EndpointSlice` struct. The
new `ClusterServiceV2` format will embed these `EndpointSlice` objects directly,
avoiding the need for an intermediate Cilium-specific representation. This
approach prevents future divergence that could occur if the internal `Endpoints`
struct is later optimized or changed for other purposes.

The new `ClusterService` v2 struct would thus look like this:

```go
type ClusterServiceV2 struct {
	Cluster   string `json:"cluster"`
	ClusterID uint32 `json:"clusterID"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	// Note that not every field from the EndpointSlice will
	// be populated (for instance, fields from TypeMeta and most
	// fields from ObjectMeta).
	EndpointSlices []*slim_discovery_v1.EndpointSlice `json:"endpointslices"`
}
```

By using the Kubernetes `EndpointSlice` format directly, both the loadbalancer
and clustermesh will consume similar data structures. This enables code reuse
for the agent-side ingestion logic. We can extract most of the code in
`pkg/loadbalancer/reflectors/k8s.go` into a shared package under
`pkg/loadbalancer` that both the loadbalancer and clustermesh can use. This
shared code will handle buffering, coalescing, and statedb updates, ensuring
similar ingestion and optimization for both code paths. This alignment allows
us to introduce an ingestion buffer in the agent for clustermesh services,
similar to the loadbalancer reflector. We might tweak some parameters for
these two different use cases (max buffer size, timeouts, etc.), but the core
logic would be shared.

We should also find out which is the most efficient between having a single
buffer for all remote clusters or having a per-cluster buffer and its exact
parameters. This could be experimented with during implementation.

We expect performance to be closer to the k8s reflector. According to the
benchmarks in the Motivation section, the k8s reflector is currently about
10-600x faster than clustermesh v1 for updates when excluding JSON decoding.
The added churn from condition changes should thus remain manageable even at
scale, with much of the remaining overhead coming from JSON decoding and decompression.

### Compressing the clustermesh service data

We propose compressing the service data stored in etcd using zstd. For
large services with thousands of backends, we expect compression ratios
of around 50x or higher. For example, the current `ClusterService` v1
format for a service with 5,000 backends and 2 ports compresses from
786.89 KiB down to 14.22 KiB with zstd, a 55x compression ratio.

For instance, consider an 11-cluster mesh where each cluster has 1,000
nodes. With the same `ClusterService` in v1 format with 5,000 backends
and 2 ports, any update to any endpoint of this single service would
result in about 7.5 GiB of uncompressed data propagated across the mesh.
With zstd compression, this drops to approximately 139 MiB. This matters
especially when service churn happens during service flaps or even a
simple Deployment rollout, which could potentially trigger these updates
hundreds of times.

The etcd 1.5 MiB per object limit is not a concern for the supported
scale. Kubernetes supports a maximum of 150k endpoints per cluster total
(across all services). Cilium's default configuration limits the number
of backends to 64k (both local cluster and clustermesh combined) via the
`bpf-lb-map-max` setting. These limits are not per service. Even if they
can be increased, individual service backends from a particular cluster
should remain well below the etcd object size limit even without
compression. However, compression is still critical for reducing control
plane network bandwidth needed to propagate service changes to all agents
in the mesh.

All clustermesh v2 service objects will be stored as raw
`zstd(JSON(ClusterServiceV2))` bytes under the `cilium/state/services/v2/`
key prefix. All other objects in etcd (including `ClusterService` v1) will
remain uncompressed. Services are the only objects in etcd that can grow
unbounded with cluster size. This makes them the primary candidate for
compression.
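
The proposed layering can be sketched as follows. This is a self-contained
illustration using a stand-in struct and the standard library's gzip codec in
place of zstd (the Go standard library has no zstd support); the actual
implementation would use a zstd library and the real `ClusterServiceV2`
struct:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// clusterServiceV2 is a minimal stand-in; the real struct embeds slim
// Kubernetes EndpointSlice objects instead of plain strings.
type clusterServiceV2 struct {
	Cluster   string   `json:"cluster"`
	ClusterID uint32   `json:"clusterID"`
	Namespace string   `json:"namespace"`
	Name      string   `json:"name"`
	Backends  []string `json:"endpointslices"` // placeholder field
}

// encode produces compress(JSON(obj)), the layering proposed for values
// under the cilium/state/services/v2/ key prefix.
func encode(svc clusterServiceV2) ([]byte, error) {
	raw, err := json.Marshal(svc)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decode reverses the layering on the agent side: decompress, then unmarshal.
func decode(data []byte) (clusterServiceV2, error) {
	var svc clusterServiceV2
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return svc, err
	}
	raw, err := io.ReadAll(zr)
	if err != nil {
		return svc, err
	}
	return svc, json.Unmarshal(raw, &svc)
}

func main() {
	svc := clusterServiceV2{Cluster: "cluster-1", ClusterID: 1, Namespace: "default", Name: "frontend"}
	blob, _ := encode(svc)
	out, _ := decode(blob)
	fmt.Println(out.Name == svc.Name && out.ClusterID == svc.ClusterID)
}
```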

In our benchmarks, zstd decompression added minimal overhead compared
to JSON decoding (about 5%). Decompression and decoding are done
concurrently per remote cluster in the agent. This further limits the
performance impact.
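
A minimal sketch of the per-cluster concurrent decoding (illustrative only;
the real agent code also decompresses the payloads and writes the results
into statedb):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// service is a stand-in for the decoded per-service data.
type service struct {
	Name string `json:"name"`
}

// decodeClusters decodes each remote cluster's payloads in its own
// goroutine, mirroring the per-cluster concurrency described above, so a
// slow or large cluster does not serialize decoding for the others.
func decodeClusters(payloads map[string][][]byte) map[string][]service {
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string][]service)
	for cluster, blobs := range payloads {
		wg.Add(1)
		go func(cluster string, blobs [][]byte) {
			defer wg.Done()
			var svcs []service
			for _, b := range blobs {
				var s service
				if err := json.Unmarshal(b, &s); err == nil {
					svcs = append(svcs, s)
				}
			}
			mu.Lock()
			out[cluster] = svcs
			mu.Unlock()
		}(cluster, blobs)
	}
	wg.Wait()
	return out
}

func main() {
	payloads := map[string][][]byte{
		"cluster-1": {[]byte(`{"name":"frontend"}`)},
		"cluster-2": {[]byte(`{"name":"backend"}`)},
	}
	res := decodeClusters(payloads)
	fmt.Println(len(res), res["cluster-1"][0].Name)
}
```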

### Rollout strategy

To introduce v2 of the clustermesh service data format, this CFP
proposes a global switch rather than per-cluster detection. This keeps
the transition simple. Per-cluster detection would add complexity
because we would need to handle downgrade and upgrade of remote clusters
at runtime.

With this approach, we would add a new temporary option,
`clustermesh-service-v2-enabled`, controlling whether the operator and
agent use the v1 or v2 format. The option will be disabled by default in
Cilium 1.20 and removed in Cilium 1.21. In Cilium 1.20, we would also
unconditionally export both the v1 and v2 formats while `KVStoreMesh` mirrors
both. This doubles etcd storage during the transition period, but because
agents watch only one version (based on their configuration), network
traffic is not doubled.

This gives a balance between keeping the change simple and ensuring
that users can upgrade without traffic disruptions. Cilium 1.20 will
essentially serve as a transition release where both formats are supported.
Users will be able to turn on `clustermesh-service-v2-enabled` early in Cilium
1.20, assuming all the clusters in their mesh already run Cilium 1.20 or higher.

To make this transition easier to understand, we could make Cilium
export its own version in its `CiliumClusterConfig` and prevent
connecting to remote clusters running Cilium 1.19 or lower when
`clustermesh-service-v2-enabled` is enabled. This would result in a hard
error when attempting to establish a connection to an incompatible
cluster. We could also warn when connecting to remote clusters that
differ by more than one minor version, since that is not officially
supported or tested in our CI.
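
Such a gate could look roughly like this (a hypothetical sketch; the function
names and the exact source of the remote version are assumptions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMinor extracts major and minor from a version string like "1.19.4".
func parseMinor(v string) (int, int, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// checkRemote applies the gate described above: hard error for remotes older
// than 1.20 when v2 is enabled, and a warning beyond one minor version of
// skew. Illustrative only; the real check would live in the clustermesh code
// and read the remote version from its CiliumClusterConfig.
func checkRemote(localVersion, remoteVersion string, v2Enabled bool) error {
	lMaj, lMin, err := parseMinor(localVersion)
	if err != nil {
		return err
	}
	rMaj, rMin, err := parseMinor(remoteVersion)
	if err != nil {
		return err
	}
	if v2Enabled && (rMaj < 1 || (rMaj == 1 && rMin < 20)) {
		return fmt.Errorf("remote cluster runs %s, which does not support service v2", remoteVersion)
	}
	if lMaj == rMaj && abs(lMin-rMin) > 1 {
		fmt.Printf("warning: %s and %s differ by more than one minor version\n", localVersion, remoteVersion)
	}
	return nil
}

func main() {
	// Connecting to a 1.19 remote with v2 enabled yields a hard error.
	fmt.Println(checkRemote("1.20.0", "1.19.4", true) != nil)
}
```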

## Impacts / Key Questions

### Impact: Service format breaking change

This change will introduce v2 of service data in etcd. As proposed, this
would introduce an incompatibility between clusters running Cilium 1.18
or lower and Cilium 1.20 or higher by default.

### Impact: Text format readability for debugging

If we compress data with zstd, it will be harder to inspect the content
of a service object in etcd for debugging. We think this is an acceptable
trade-off given the benefits for large services.

### Option 1: Use a slice approach

We could also use a slice approach very similar to what Kubernetes has
done with `EndpointSlice` vs the original `Endpoints`.

#### Pros

* Consistent with the Kubernetes approach

#### Cons

* Requires more objects encoded in etcd and generates more churn that we
cannot easily coalesce across multiple slice objects
* Compression ratio might be worse than a single-object approach
* As a result of the two previous points, it would most likely be worse in
terms of network usage

### Option 2: Use protobuf encoding

#### Pros

* More efficient encoding than JSON in terms of size
* Faster decoding than JSON

#### Cons

* Readability for debugging would be worse
* zstd achieves worse compression ratios with protobuf than with JSON,
which reduces the effective size benefit from switching to protobuf
(although protobuf is still better)
* Work on `encoding/json/v2` in recent Go versions will reduce the JSON
decoding gap in the future. Data decoding is also done concurrently
per cluster, which further limits the impact of decoding speed.