# CFP-41953: ClusterMesh Service v2

**SIG: SIG-clustermesh**

**Begin Design Discussion:** 2025-10-01

**Cilium Release:** 1.20

**Authors:** Arthur Outhenin-Chalandre <git@mrfreezeex.fr>

**Status:** Implementable

## Summary

This CFP proposes introducing v2 of the clustermesh global service data
format stored in etcd. It transitions from `cilium/state/services/v1/`
to `cilium/state/services/v2/` and harmonizes backend data insertion
techniques between clustermesh services and Kubernetes services.

## Motivation

The current clustermesh global service data is handled with the
[`ClusterService` struct](https://github.com/cilium/cilium/blob/d83cf8ab5e20f8ef6031d9e0f66f577cd095ef89/pkg/clustermesh/store/store.go#L52).
This struct is encoded in JSON format and stored in etcd. While this format
has served well initially, it now faces several limitations that prevent
clustermesh from scaling efficiently and supporting new features. These
limitations fall into three main areas: missing backend conditions, suboptimal
performance for service updates, and inefficient data format and encoding.

### Missing backend conditions

The current format omits all backend conditions: backends that are neither
ready nor serving are simply dropped before export. As a result, we cannot
properly perform graceful termination as described in the
[KPR documentation](https://docs.cilium.io/en/stable/network/kubernetes/kubeproxy-free/#graceful-termination),
most likely resulting in some traffic loss during rolling updates.
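
To make the intended behavior concrete, the selection logic that backend
conditions enable can be sketched as follows (a minimal, self-contained sketch
with stand-in types, not Cilium's actual loadbalancer code):

```go
package main

import "fmt"

// Backend is a stand-in representation carrying the EndpointSlice-style
// conditions; field names are illustrative.
type Backend struct {
	Addr        string
	Ready       bool
	Serving     bool
	Terminating bool
}

// selectBackends mirrors the graceful-termination behavior described in the
// KPR documentation: prefer ready backends, and only when none are ready,
// fall back to backends that are still serving while terminating.
func selectBackends(backends []Backend) []Backend {
	var ready, terminating []Backend
	for _, b := range backends {
		switch {
		case b.Ready:
			ready = append(ready, b)
		case b.Serving && b.Terminating:
			terminating = append(terminating, b)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	return terminating
}

func main() {
	// During a rolling update all backends may be terminating but still
	// serving; with conditions available they can keep receiving traffic.
	backends := []Backend{
		{Addr: "10.0.0.1", Serving: true, Terminating: true},
		{Addr: "10.0.0.2", Serving: true, Terminating: true},
	}
	for _, b := range selectBackends(backends) {
		fmt.Println(b.Addr)
	}
}
```

With the v1 format these terminating-but-serving backends are never exported,
so this fallback is impossible for clustermesh services.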

### Performance gap with the loadbalancer k8s reflector

There is a large performance gap between the clustermesh and the
standard loadbalancer k8s reflector. Their update and ingestion behavior
is not currently on par. The table below shows how large this gap is:

| Backends | clustermesh (µs) | clustermesh w/o JSON decoding (µs) | loadbalancer k8s (µs) | Ratio (clustermesh w/o JSON/k8s) |
|----------|------------------|------------------------------------|-----------------------|----------------------------------|
| 1 | 44 | 35 | 4 | 9x slower |
| 100 | 629 | 271 | 4 | 68x slower |
| 1,000 | 6,349 | 2,626 | 7 | 375x slower |
| 5,000 | 36,861 | 15,831 | 30 | 528x slower |
| 10,000 | 78,810 | 44,289 | 70 | 633x slower |

These benchmarks are not strictly equivalent, but they give a good idea
of the performance gap, how much JSON decoding contributes to it, and how
clustermesh degrades as the number of backends grows.

### Inefficient data format and network traffic

Over time, the current data format has shown its limits. Backend
information is largely the same across ports, so the total size of the
object tends to grow almost in proportion to the number of ports. In
addition, JSON encoding adds extra overhead for field names and number
formatting compared to a binary representation.

All of this data must be replicated to every node in the mesh. A mesh often
has many more nodes than a single cluster, which results in a high volume
of control plane network traffic. When combined with the increased churn
from backend conditions, this inefficiency becomes a significant scaling
concern.

Additionally, etcd imposes a hard limit of 1.5 MiB per object. Without a
breaking change to a more efficient format, adding the missing fields to
the current `ClusterService` struct would restrict scalability to fewer
than 10,000 backends per service per cluster. While this is a high limit,
having headroom beyond this point is useful for future growth.

Even below the limit, keeping objects small is important to reduce network
traffic when backends change. This situation is similar to the Kubernetes
community's move from `Endpoints` to `EndpointSlice`, but the problem is
even stronger in clustermesh because a mesh can contain many more nodes
than a single Kubernetes cluster.

## Goals

* Reduce network bandwidth needed for control plane operations on large services
* Improve clustermesh service ingestion performance in the agent and in particular
related to service churn scenarios
* Allow scaling on the clustermesh level to a larger number of backends per
service per cluster
* Add backend conditions to clustermesh services to allow correct backend
state handling in the loadbalancer
* Add `EndpointSlice` name to the clustermesh service data to simplify
the EndpointSliceSync logic

## Non-Goals

* Changes not specific to clustermesh global services (for example
MCS-API handling)
* Large changes to non-clustermesh loadbalancer logic

## Proposal

### Overview

This CFP proposes transitioning to a different format that directly
embeds Kubernetes structs and uses zstd compression.

This will allow more code reuse between the Kubernetes reflector in the
loadbalancer packages and the clustermesh package.

The new format will also include endpoint conditions, enabling the
loadbalancer to retain backends in various states and allowing the
loadbalancer to apply the same logic as for local services instead of only
having backends in an active state.

This will allow EndpointSliceSync to include endpoints that were previously
excluded, along with their full conditions, enabling more native integration
with non-Cilium GW-API implementations. The inclusion of an `EndpointSlice`
name should also significantly simplify the EndpointSliceSync logic and
code base.

While these new backend conditions will increase service churn, we expect
to keep it manageable by following a very similar code path and adopting
the ingestion techniques already used in the loadbalancer k8s reflector,
with significant improvement over v1 for service churn scenarios (about
10-600x when excluding JSON decoding). Additionally, etcd operations are
already rate limited (20 QPS by default), which naturally coalesces multiple
service updates in the workqueue, ensuring good latency and throughput without
requiring explicit export throttling.

Compressing the data with zstd will also dramatically reduce the on-wire size
and etcd object size. We expect around 50x compression ratios or higher for
services with thousands of backends. This significantly reduces control plane
network bandwidth consumption.

### Unifying ingestion pipelines through shared data structures

The clustermesh and loadbalancer k8s reflector currently diverge in both
their data structures and ingestion logic. This divergence has created the
performance gap described in the Motivation section and makes it harder to
maintain feature parity between the two code paths.

We now propose unifying these pipelines by adopting shared Kubernetes data
structures. Specifically, we will align clustermesh and the loadbalancer k8s
reflector directly with Kubernetes slim `EndpointSlice` structs instead of
using Cilium specific intermediate representations.

The Kubernetes `EndpointSlice` API must remain backward compatible across
Kubernetes versions. This aligns well with Cilium clustermesh's upgrade
requirement to support at least two consecutive minor versions. Using the
Kubernetes format directly also allows the loadbalancer code to use the
same struct whether the data comes from a local Kubernetes cluster or from
clustermesh.

Cilium currently uses an internal `Endpoints` resource with the
following struct:

```go
type Endpoints struct {
	types.UnserializableObject
	slim_metav1.ObjectMeta

	EndpointSliceID

	// Backends is a map containing all backend IPs and ports. The key to
	// the map is the backend IP in string form. The value defines the list
	// of ports for that backend IP, plus an additional optional node name.
	Backends map[cmtypes.AddrCluster]*Backend
}
```

This struct is a Cilium specific transformation of Kubernetes `EndpointSlice`
data. It was designed to support both the legacy `Endpoints` API and the newer
`EndpointSlice` API. With the introduction of statedb and most consumers watching
statedb instead of relying on this resource, the only place actually using it
is `operator/watchers/service_sync.go` to export service data for clustermesh.

Since this resource is now primarily used for clustermesh exports, we propose
changing it to directly expose the slim Kubernetes `EndpointSlice` struct. The
new `ClusterServiceV2` format will embed these `EndpointSlice` objects directly,
avoiding the need for an intermediate Cilium-specific representation. This
approach prevents future divergence that could occur if the internal `Endpoints`
struct is later optimized or changed for other purposes.

The new `ClusterService` v2 struct would thus look like this:

```go
type ClusterServiceV2 struct {
	Cluster   string `json:"cluster"`
	ClusterID uint32 `json:"clusterID"`
	Namespace string `json:"namespace"`
	Name      string `json:"name"`
	// Note that not every field from the EndpointSlice will
	// be populated (for instance, fields from TypeMeta and most
	// fields from ObjectMeta).
	EndpointSlices []*slim_discovery_v1.EndpointSlice `json:"endpointslices"`
}
```

By using the Kubernetes `EndpointSlice` format directly, both the loadbalancer
and clustermesh will consume similar data structures. This enables code reuse
for the agent-side ingestion logic. We can extract most of the code in
`pkg/loadbalancer/reflectors/k8s.go` into a shared package under
`pkg/loadbalancer` that both the loadbalancer and clustermesh can use. This
shared code will handle buffering, coalescing, and statedb updates, ensuring
similar ingestion and optimization for both code paths. This alignment allows
us to introduce an ingestion buffer in the agent for clustermesh services,
similar to the loadbalancer reflector. We might tweak some parameters for
these two different use cases (max buffer size, timeouts, etc.), but the core
logic would be shared.

We should also find out which is the most efficient between having a single
buffer for all remote clusters or having a per-cluster buffer and its exact
parameters. This could be experimented with during implementation.

We expect performance to be closer to the k8s reflector. According to the
benchmarks in the Motivation section, the k8s reflector is currently about
10-600x faster than clustermesh v1 for updates when excluding JSON decoding.
The added churn from condition changes should thus remain manageable even at
scale, with much of the remaining overhead coming from JSON decoding and decompression.

### Compressing the clustermesh service data

We propose compressing the service data stored in etcd using zstd. For
large services with thousands of backends, we expect compression ratios
of around 50x or higher. For example, the current `ClusterService` v1
format for a service with 5,000 backends and 2 ports compresses from
786.89 KiB down to 14.22 KiB with zstd, a 55x compression ratio.

For instance, consider an 11-cluster mesh where each cluster has 1,000
nodes. With the same `ClusterService` in v1 format with 5,000 backends
and 2 ports, any update to any endpoint of this single service would
result in about 7.5 GiB of uncompressed data propagated across the mesh.
With zstd compression, this drops to approximately 139 MiB. This matters
especially when service churn happens during service flaps or even a
simple Deployment rollout, which could potentially trigger these updates
hundreds of times.

The etcd 1.5 MiB per object limit is not a concern for the supported
scale. Kubernetes supports a maximum of 150k endpoints per cluster total
(across all services). Cilium's default configuration limits the number
of backends to 64k (both local cluster and clustermesh combined) via the
`bpf-lb-map-max` setting. These limits are not per service. Even if they
can be increased, individual service backends from a particular cluster
should remain well below the etcd object size limit even without
compression. However, compression is still critical for reducing control
plane network bandwidth needed to propagate service changes to all agents
in the mesh.

All clustermesh v2 service objects will be stored as raw
`zstd(JSON(ClusterServiceV2))` bytes under the `cilium/state/services/v2/`
key prefix. All other objects in etcd (including `ClusterService` v1) will
remain uncompressed. Services are the only objects in etcd that can grow
unbounded with cluster size. This makes them the primary candidate for
compression.
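
The proposed layering can be sketched as follows. This is a self-contained
illustration using a stand-in struct and the standard library's gzip codec in
place of zstd (the Go standard library has no zstd support); the actual
implementation would use a zstd library and the real `ClusterServiceV2`
struct:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// clusterServiceV2 is a minimal stand-in; the real struct embeds slim
// Kubernetes EndpointSlice objects instead of plain strings.
type clusterServiceV2 struct {
	Cluster   string   `json:"cluster"`
	ClusterID uint32   `json:"clusterID"`
	Namespace string   `json:"namespace"`
	Name      string   `json:"name"`
	Backends  []string `json:"endpointslices"` // placeholder field
}

// encode produces compress(JSON(obj)), the layering proposed for values
// under the cilium/state/services/v2/ key prefix.
func encode(svc clusterServiceV2) ([]byte, error) {
	raw, err := json.Marshal(svc)
	if err != nil {
		return nil, err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decode reverses the layering on the agent side: decompress, then unmarshal.
func decode(data []byte) (clusterServiceV2, error) {
	var svc clusterServiceV2
	zr, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return svc, err
	}
	raw, err := io.ReadAll(zr)
	if err != nil {
		return svc, err
	}
	return svc, json.Unmarshal(raw, &svc)
}

func main() {
	svc := clusterServiceV2{Cluster: "cluster-1", ClusterID: 1, Namespace: "default", Name: "frontend"}
	blob, _ := encode(svc)
	out, _ := decode(blob)
	fmt.Println(out.Name == svc.Name && out.ClusterID == svc.ClusterID)
}
```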

In our benchmarks, zstd decompression added minimal overhead compared
to JSON decoding (about 5%). Decompression and decoding are done
concurrently per remote cluster in the agent. This further limits the
performance impact.
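
A minimal sketch of the per-cluster concurrent decoding (illustrative only;
the real agent code also decompresses the payloads and writes the results
into statedb):

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// service is a stand-in for the decoded per-service data.
type service struct {
	Name string `json:"name"`
}

// decodeClusters decodes each remote cluster's payloads in its own
// goroutine, mirroring the per-cluster concurrency described above, so a
// slow or large cluster does not serialize decoding for the others.
func decodeClusters(payloads map[string][][]byte) map[string][]service {
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string][]service)
	for cluster, blobs := range payloads {
		wg.Add(1)
		go func(cluster string, blobs [][]byte) {
			defer wg.Done()
			var svcs []service
			for _, b := range blobs {
				var s service
				if err := json.Unmarshal(b, &s); err == nil {
					svcs = append(svcs, s)
				}
			}
			mu.Lock()
			out[cluster] = svcs
			mu.Unlock()
		}(cluster, blobs)
	}
	wg.Wait()
	return out
}

func main() {
	payloads := map[string][][]byte{
		"cluster-1": {[]byte(`{"name":"frontend"}`)},
		"cluster-2": {[]byte(`{"name":"backend"}`)},
	}
	res := decodeClusters(payloads)
	fmt.Println(len(res), res["cluster-1"][0].Name)
}
```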

### Rollout strategy

To introduce v2 of the clustermesh service data format, this CFP
proposes a global switch rather than per-cluster detection. This keeps
the transition simple. Per-cluster detection would add complexity
because we would need to handle downgrade and upgrade of remote clusters
at runtime.

With this approach, we would add a new temporary option,
`clustermesh-service-v2-enabled`, controlling whether the operator and
agent use the v1 or v2 format. The option will be disabled by default in
Cilium 1.20 and removed in Cilium 1.21. In Cilium 1.20, we would also
unconditionally export both the v1 and v2 formats while `KVStoreMesh` mirrors
both. This doubles etcd storage during the transition period, but because
agents watch only one version (based on their configuration), network
traffic is not doubled.

This gives a balance between keeping the change simple and ensuring
that users can upgrade without traffic disruptions. Cilium 1.20 will
essentially serve as a transition release where both formats are supported.
Users will be able to turn on `clustermesh-service-v2-enabled` early in Cilium
1.20, assuming all the clusters in their mesh already run Cilium 1.20 or higher.

To make this transition easier to understand, we could make Cilium
export its own version in its `CiliumClusterConfig` and prevent
connecting to remote clusters running Cilium 1.19 or lower when
`clustermesh-service-v2-enabled` is enabled. This would result in a hard
error when attempting to establish a connection to an incompatible
cluster. We could also warn when connecting to remote clusters that
differ by more than one minor version, since that is not officially
supported or tested in our CI.
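
Such a gate could look roughly like this (a hypothetical sketch; the function
names and the exact source of the remote version are assumptions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMinor extracts major and minor from a version string like "1.19.4".
func parseMinor(v string) (int, int, error) {
	parts := strings.SplitN(v, ".", 3)
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, err
	}
	return major, minor, nil
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

// checkRemote applies the gate described above: hard error for remotes older
// than 1.20 when v2 is enabled, and a warning beyond one minor version of
// skew. Illustrative only; the real check would live in the clustermesh code
// and read the remote version from its CiliumClusterConfig.
func checkRemote(localVersion, remoteVersion string, v2Enabled bool) error {
	lMaj, lMin, err := parseMinor(localVersion)
	if err != nil {
		return err
	}
	rMaj, rMin, err := parseMinor(remoteVersion)
	if err != nil {
		return err
	}
	if v2Enabled && (rMaj < 1 || (rMaj == 1 && rMin < 20)) {
		return fmt.Errorf("remote cluster runs %s, which does not support service v2", remoteVersion)
	}
	if lMaj == rMaj && abs(lMin-rMin) > 1 {
		fmt.Printf("warning: %s and %s differ by more than one minor version\n", localVersion, remoteVersion)
	}
	return nil
}

func main() {
	// Connecting to a 1.19 remote with v2 enabled yields a hard error.
	fmt.Println(checkRemote("1.20.0", "1.19.4", true) != nil)
}
```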

## Impacts / Key Questions

### Impact: Service format breaking change

This change will introduce v2 of service data in etcd. As proposed, this
would introduce an incompatibility between clusters running Cilium 1.18
or lower and Cilium 1.20 or higher by default.

### Impact: Text format readability for debugging

If we compress data with zstd, it will be harder to inspect the content
of a service object in etcd for debugging. We think this is an acceptable
trade-off given the benefits for large services.

### Option 1: Use a slice approach

We could also use a slice approach very similar to what Kubernetes has
done with `EndpointSlice` vs the original `Endpoints`.

#### Pros

* Consistent with the Kubernetes approach

#### Cons

* Requires more objects encoded in etcd and generates more churn that we
cannot easily coalesce across multiple slice objects
* Compression ratio might be worse than a single-object approach
* As a result of the two previous points, it would most likely be worse in
terms of network usage

### Option 2: Use protobuf encoding

#### Pros

* More efficient encoding than JSON in terms of size
* Faster decoding than JSON

#### Cons

* Readability for debugging would be worse
* zstd achieves worse compression ratios with protobuf than with JSON,
which reduces the effective size benefit from switching to protobuf
(although protobuf is still better)
* Work on `encoding/json/v2` in recent Go versions will reduce the JSON
decoding gap in the future. Data decoding is also done concurrently
per cluster, which further limits the impact of decoding speed.