CFP-41953: add ClusterMesh Service v2#77

Open
MrFreezeex wants to merge 5 commits into cilium:main from MrFreezeex:clustermesh-svc

Conversation

@MrFreezeex
Member

@MrFreezeex MrFreezeex commented Oct 1, 2025

This CFP proposes introducing v2 of the ClusterMesh global service data format stored in etcd and a transition from cilium/state/services/v1/ to cilium/state/services/v2/.

The benchmarks in this CFP are based on this (wip/poc) code: https://github.com/MrFreezeex/cilium/tree/bench-clusterservice/bench

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@ledger.fr>
@MrFreezeex MrFreezeex marked this pull request as ready for review October 1, 2025 14:40
@MrFreezeex MrFreezeex changed the title from "CFP-41953: add ClusterMesh Service v2 CFP" to "CFP-41953: add ClusterMesh Service v2" Oct 1, 2025
@MrFreezeex
Member Author

cc @cilium/sig-clustermesh

@giorio94 giorio94 requested review from brb, giorio94 and marseel and removed request for brb October 1, 2025 15:25
Member

@giorio94 giorio94 left a comment

Thanks @MrFreezeex, the CFP looks very well written to me, and the proposal overall looks reasonable. I've started taking a first look, and left a few initial comments inline.

The main factor which is holding me back about significant changes like this one is the risk of unexpected problems and regressions that may be introduced by the refactoring, and whether the benefits actually justify the associated risks. In this case, it would be probably good to get some feedback from people running Cluster Mesh at significant scale to understand the number of backends per service we could be talking about, to better gauge the overall benefits and how close we are to the current limits. Tagging @antonipp @oblazek @dctrwatson based on previous interactions, in case you have any feedback and/or thoughts in this respect.

I personally also feel the troubleshooting part to be important to clarify before committing to a change like this, to ensure that we can continue to both perform script testing around this area, as well as easily inspect the state to figure out inconsistencies if things don't work as expected. It doesn't look impossible to solve though.

Comment on lines 27 to 29
and there are still no backend conditions (or state) available, which prevents
the Cilium load balancer from excluding/phasing out some backends similarly to regular backends not
coming from a remote cluster.
Member

I agree that this is suboptimal from a functional point of view. However, it has the advantage of reducing the number of events that need to be shared across the clustermesh, which could instead be potentially significantly higher when e.g. including the terminating transition. Not saying that we shouldn't go in that direction, just that it may need some additional scalability considerations.


While we would like to add those fields to the `ClusterService` struct, we have a hard limit
of 1.5 MiB in etcd, and without a breaking change, adding these new fields to the current
`ClusterService` struct might limit the number of backends to fewer than ~10,000 backends
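As a rough, hypothetical back-of-envelope for the estimate above (the JSON shape and field names here are illustrative assumptions, not the actual ClusterService schema):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// portConfig is a stand-in for a per-backend port entry; the real
// Cilium structs differ, this only approximates per-backend JSON cost.
type portConfig struct {
	Protocol string `json:"Protocol"`
	Port     int    `json:"Port"`
}

// jsonSize returns the marshalled size in bytes of n synthetic backends,
// each an address key mapping to a single named port.
func jsonSize(n int) int {
	backends := make(map[string]map[string]portConfig, n)
	for i := 0; i < n; i++ {
		addr := fmt.Sprintf("10.%d.%d.%d", (i/65536)%256, (i/256)%256, i%256)
		backends[addr] = map[string]portConfig{"http": {Protocol: "TCP", Port: 8080}}
	}
	out, err := json.Marshal(backends)
	if err != nil {
		panic(err)
	}
	return len(out)
}

func main() {
	const limit = 1536 * 1024 // etcd's default 1.5 MiB request limit
	total := jsonSize(10000)
	// Each synthetic backend costs a few tens of bytes before any new
	// per-backend fields (conditions, state, ...) are added; those extra
	// fields are what push a ~10k-backend service toward the limit.
	fmt.Printf("10k backends ≈ %d KiB of JSON (limit %d KiB)\n", total/1024, limit/1024)
}
```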
Member

Do we have data whether this could be a problem in real scenarios, where one would want to have a global service with that many backends? My understanding is that very large services in plain K8s come mostly from DaemonSets, as on a 5K nodes cluster that would correspond to 5K backends. But that doesn't look a common use-case for global services, at least to me.

Member Author

@MrFreezeex MrFreezeex Oct 14, 2025

Indeed, I was focusing mostly on the max number of endpoints, but it's true that the maximum number of nodes could be relevant too and could influence the number of backends per service ~.

Maybe it could come from cases where you have many small pods (possibly influenced by languages/architectures where pods cannot use a full node).

Also, technically, if your cluster is dual stack the service could reach this with a DaemonSet. And while I haven't said it in this sentence of the doc (as this is more an estimation), all of this is heavily influenced by the number of ports each service has.

But yes I don't have data besides saying the limit out loud.

Member

Yeah, I think this is one of the aspects where getting feedback from community members that run Cluster Mesh at large scales would be super useful, to be able to gauge how pressing the problem is.

Comment on lines 218 to 222
In order to introduce a "v2" of the ClusterMesh Service data format, this CFP is mainly
proposing to have a global switch rather than a per cluster detection. This is
mainly to keep the transition as simple as possible because a per cluster detection
could introduce more complexity as we would need to handle downgrade/upgrade of remote
clusters at runtime.
Member

I'd be personally in favor of a per-cluster detection, which should be easy enough leveraging capabilities. Basically the source cluster would advertise that it supports v2 services, and in that case the client would switch to watching that prefix (obviously assuming it is capable of doing so). That would ensure that the new prefix is only ever watched if actually exported, and also make the migration require fewer minor versions. We could still have a (hidden) flag as an emergency button in case of unexpected issues though.
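A minimal sketch of this per-cluster detection idea, assuming a hypothetical ServicesV2 capability field (the prefixes match the CFP's paths; the capability wiring is illustrative, not existing Cilium code):

```go
package main

import "fmt"

// Capabilities is a hypothetical slice of the cluster configuration
// advertised by the source cluster; the ServicesV2 field is an assumption.
type Capabilities struct {
	ServicesV2 bool // source cluster exports services in the v2 format
}

// servicesPrefix picks the etcd prefix to watch: v2 only when both the
// remote cluster exports it and the local client can consume it.
func servicesPrefix(remote Capabilities, localSupportsV2 bool) string {
	if localSupportsV2 && remote.ServicesV2 {
		return "cilium/state/services/v2/"
	}
	return "cilium/state/services/v1/"
}

func main() {
	fmt.Println(servicesPrefix(Capabilities{ServicesV2: true}, true))  // v2 prefix
	fmt.Println(servicesPrefix(Capabilities{ServicesV2: false}, true)) // falls back to v1
}
```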

Member Author

@MrFreezeex MrFreezeex Oct 14, 2025

This seems rather hard to implement to me. I think if we were to do that we would most likely advertise it in the cluster config. And because of the kvstore we might want to do this dynamically, meaning watching the cluster config instead of just reading it at the start of the agent/operator. We could perhaps not do it dynamically, but then it would depend on the last re-connection to etcd and/or the last time the agent/operator (re)started, and could differ between clients. If doing it dynamically, there might be some complications during the transition when the cluster has not fully finished exporting all its services in the v2 format though.

One other aspect is that for EndpointSliceSync I would want to use a new algorithm leveraging especially the new EndpointSlice name, and while it's technically not required, it could simplify things a bit to have a global switch for all clusters rather than using the old or new logic depending on what format of data a cluster exports (which might even change dynamically).

So overall I am not certain that it's worth it compared to having a global switch which would mean "expect that all clusters you are connected to export v2, and use v2" (+ possibly adding some guardrails around that with a version check or capability check when the cluster connects). Maybe you have some other considerations that would simplify this though?

Member

And because of the kvstore we might want to do this dynamically, meaning watching the cluster config instead of just reading it at the start of the agent/operator. We could perhaps not do it dynamically, but then it would depend on the last re-connection to etcd and/or the last time the agent/operator (re)started, and could differ between clients. If doing it dynamically, there might be some complications during the transition when the cluster has not fully finished exporting all its services in the v2 format though.

I'd definitely go for a static approach in that case. Basically just checking upon connection and watching either one path or the other based on that. I agree that the trickiest part would be in case of kvstore mode, as there we don't use the sync canary, and the cluster config is written before that synchronization completes the first time. But we could definitely work around that, either by having the clients always wait for the sync canary for v2 services (as it is written in all cases), or by implementing some ad-hoc mechanism.

One other aspect is that for EndpointSliceSync I would want to do a new algorithm leveraging

Ah, yeah, in that case having a global switch would indeed be easier to completely change the logic there. The other possibility could be introducing the new algorithm only after fully migrating to v2, which would be likely similar to the flags approach in terms of timeline (i.e., 1.20 assuming that the initial bits get in for v1.19).

So overall I am not certain that it's worth it compared to have a global switch which would do "expect that all clusters you are connected to export v2 and use v2"

I'd need to think a bit more about this, but if we go for a global switch I wonder whether we shouldn't even have a flag at that point (or have it hidden just as a circuit breaker). Basically in the first version we'd write to both paths, in the next one switch the clients to read from v2, and in the one after that completely drop support for v1. The main problem with an explicit flag is that people may then toggle it for whatever reason, and we risk having to commit to supporting both versions for longer than expected.

}

// ClusterService represents a service definition within a cluster
message ClusterService {
Member

I'd suggest considering adding a free-form field (e.g., string to string map) that could be used by implementations to extend the behavior and perform experiments fine-tuning the behavior, before committing to an actual API change. Basically an equivalent to how annotations are often used in Kubernetes. Another alternative could be leveraging google.protobuf.Any as e.g., done here, but I'm unsure whether the extra complexity is actually worth it, and how that would play with possible troubleshooting tooling.

Member Author

@MrFreezeex MrFreezeex Oct 14, 2025

I think we should just be able to leverage the protobuf tag for experimentation; we would just need to keep a list of reserved tag IDs when removing them, and either never reuse those or keep them unused long enough that it doesn't become a problem IMO.

Member

Yeah, there are definitely multiple possibilities, and the ID space of protobuf is not a problem from that point of view. Mostly wondering whether to proactively provide for a global map field right now, or defer that to later 🤔

Member Author

Hmm, do you have some use case that you would like to address using that? Since all protobuf fields are optional here, it seems to me that should be quite similar 🤔. If you have a use case in mind I would be happy to add that ofc

Second iteration of the ClusterMesh Service CFP. This commit is meant to
be squashed before the PR is merged.

This new iteration changes the following things:
- Adds zstd compression for the existing JSON format
- Renames various fields to EndpointSlice instead of Endpoints
- Changes the address to string. While this is less efficient on
  uncompressed data, it appears to be more efficient when compressing
  with zstd. It would also be more readable when encoding to JSON
- Adds a quick note on the plan to address the debugging point by
  using protojson to allow dumping the data to JSON (and YAML) and
  testing in hive scripts with JSON

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
@MrFreezeex
Member Author

Hi everyone 👋 sorry for the wait, I finally found time to investigate a bit more here and pushed the second iteration of this CFP.

You can find more details in the commit description, but apart from addressing some points raised by @giorio94 (thanks again!), I realized that it is more efficient to encode the address with a string type when compressing via zstd, even though the uncompressed data is heavier. The second important point is about debug-ability: allowing dumping the data as JSON/YAML and having some hive script command to test with the JSON format. Those would most likely require some generic kvstore code to have special conditions for them, which doesn't look that great, but at least the UX would be good. If you have some suggestions about that (and other things ofc) feel free!

@MrFreezeex MrFreezeex requested a review from giorio94 November 2, 2025 18:39
@marseel

marseel commented Dec 2, 2025

side note: I think there are multiple valid concerns that this CFP describes, but it's a bit unclear to me what user story we are trying to solve. Within the motivation section we have:

  • No backend conditions for load balancing - I think this needs a bit more description of why we need them and what case we want to improve/solve with it (maybe there is another way?)
  • Simplification of EndpointSliceSync logic by adding the EndpointSlice name
  • The 1.5 MiB limit in etcd that limits the size of a service to fewer than ~10k backends if we extend ClusterService with the above proposals
  • Reducing in-cluster network traffic between the Cilium agent and ClusterMesh's etcd.

I thought about it a little bit more, and what I am mostly concerned about is the added complexity.
I do agree that the reduction in network bandwidth is nice, and lowering the likelihood of hitting the 1.5 MB limit is an improvement as well.

However, when I think about it from a high-level design perspective, I would start with the following questions:

  • Why do we do it only for services and not all other resources? Having the same mechanism (whatever we choose, like compression or protobuf) for other resources would decrease complexity of the entire solution. I would really prefer not to have protobuf for one resource and json for other resources as that would definitely increase complexity and it would be harder to maintain.
  • I do believe we are mixing (just a little bit) two different problems together. One is the size of events / network bandwidth, while the other is the rate of events (the "Add backend conditions (or state) to ClusterMesh Services" part). I think these two problems may have two different, independent solutions. I do not think your proposed solution fully addresses the problem of an increased rate of events once backend conditions are added. For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing the amount of data transferred is definitely an improvement, but I would actually also be concerned about the Cilium agent's ability to reconcile it for large services.
  • If we are concerned about the 1.5 MB limit, we are improving it, but not solving it. We could actually go a totally different way of capping the total number of endpoints exposed or introducing slicing. Note that the default size of the eBPF map for service backends is 65k entries across all clusters for all services in total. I would expect this limit to affect users more often than single-service size.

For the backend conditions part itself, why can't we have per-service conditions and filter them before writing to etcd?
AFAIK, high-level routing decision for services is:

  • route to ready backends
  • if no ready backends, route to terminating backends
  • if no terminating and ready backends, route to non ready backends

It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔 ?
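The decision order above can be sketched as follows (an illustration of the preference hierarchy only, not Cilium's actual datapath code; the struct and field names are assumptions):

```go
package main

import "fmt"

// Backend is a simplified stand-in for an exported service backend.
type Backend struct {
	Addr        string
	Ready       bool
	Terminating bool
}

// eligible applies the routing preference described in the thread:
// ready backends first, then terminating, then not-ready as a last resort.
func eligible(all []Backend) []Backend {
	var ready, terminating, notReady []Backend
	for _, b := range all {
		switch {
		case b.Ready && !b.Terminating:
			ready = append(ready, b)
		case b.Terminating:
			terminating = append(terminating, b)
		default:
			notReady = append(notReady, b)
		}
	}
	if len(ready) > 0 {
		return ready
	}
	if len(terminating) > 0 {
		return terminating
	}
	return notReady
}

func main() {
	bes := []Backend{
		{Addr: "10.0.0.1", Ready: true},
		{Addr: "10.0.0.2", Terminating: true},
	}
	// As soon as one ready backend exists, terminating ones are ignored
	// for new connections, which is the point made above.
	fmt.Println(len(eligible(bes))) // 1
}
```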

Some alternative proposals:

  • Slice backends just like EndpointSlice + maybe additional compression for all resources. Just slicing without compression reduces the amount of data transferred significantly + reduces the overhead of processing within the Cilium agent (fewer backends to update/process per event etc). Also, if we go with compression, less data to decompress/lower overhead per event etc.
  • Cap the max number of backends per Service (for example ~1k) + maybe additional compression for all resources.
    I actually somehow like capping the max number of backends per Service even though it doesn't sound that great initially. By default, the eBPF maps responsible for Service backends have a size of ~65k entries.

I do not think considering 10k or 50k backends per service makes sense without addressing eBPF map sizing as well 🤔

@MrFreezeex
Member Author

MrFreezeex commented Dec 2, 2025

Hi @marseel 👋 thanks for looking into this!

  • Why do we do it only for services and not all other resources? Having the same mechanism (whatever we choose, like compression or protobuf) for other resources would decrease complexity of the entire solution. I would really prefer not to have protobuf for one resource and json for other resources as that would definitely increase complexity and it would be harder to maintain.

The main reason is that it's the only resource with an unbounded number of entries in it/unbounded size. It's also a resource specific to ClusterMesh; IIRC the only other resource used only for ClusterMesh (and not kvstore in general) is the serviceexport one specific to the MCS-API implementation, so the fact that it's not used by kvstore should simplify the transition a bit. That being said, I agree that having one encoding mechanism would be best in general if we were able to.

  • I do believe we are mixing (just a little bit) two different problems together. One is size of events / network bandwith while other is rate of events (part "Add backend conditions (or state) to ClusterMesh Services"). I think these two problems may have two different solutions that are independent. I do not think your proposed solution fully addresses problem of increased rate of events with added backend conditions. For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing amount of data that is transferred is definitely improvement, but I would be actually also concerned with Cilium Agent ability to reconcile it for large services.

Right, my main point in the CFP regarding the rate of events is that decompressing/preparing for statedb insertion is two times faster with protobuf, but that might indeed not be sufficient. I was thinking about having a "buffer" like the Cilium load balancer code is doing when reading EndpointSlice/Service changes, but on the export side instead. IIRC the existing buffer before importing is 500 events or 500ms before taking those changes into account, essentially coalescing multiple updates. If we were to do something equivalent when exporting the "ClusterService v2" it should make this more approachable I think ~. I thought about this recently and was thinking that it could be done separately from this CFP/later on, but I could add it here too!

EDIT: also note that I haven't benchmarked it, but I think there are ways to further optimize the insertion of cluster services into statedb, which might push the decompression/insertion into statedb to more than two times faster vs currently.

  • If we are concerned about 1.5 MB limit, we are improving it, but not solving it. We could actually go totally different way of capping total number of endpoints exposed or introducing slicing. Note that default size of eBPF map for service backends is 65k entries across all clusters for all services in total. I would expect this limit to be affecting users more often rather than single service size.

It kind of solves it, because the compression makes the object really small and you won't conceivably hit this limit within what Kubernetes/Cilium support. I wasn't aware of the 65k backends limit and was counting more on Kubernetes supporting a max of 150k backends overall. But indeed, accounting for the 65k backends overall, a theoretical limit around 10k could be just ok. My angle is more about how to add the new fields without regressing in a way we don't want to than about adopting protobuf specifically, so I would be more than happy with alternative solutions as well (or, if the limit with the added fields is deemed ok, we could even just add those fields now and optimize later?)!

For backend conditions part itself, why can't we have per service condition and filter that before writing them to Etcd? AFAIK, high-level routing decision for services is:

  • route to ready backends
  • if no ready backends, route to terminating backends
  • if no terminating and ready backends, route to non ready backends

It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔 ?

The high-level routing seems correct; I'm not entirely sure about the last point, "if no terminating and ready backends, route to non ready backends" - I would maybe have thought that it would never route new connections to a non-ready backend, but I don't know whether it does that or not. IIUC there is also something about not removing existing entries in terminating/non-ready for existing connections, which might prevent us from removing non-ready backends. It would also remove those from EndpointSliceSync, but this part might just be ok!

I do not think considering 10k or 50k backends of service makes sense without addressing eBPF map sizing as well 🤔

Indeed! Although it might be simpler to make ClusterMesh almost unbounded so that the eBPF map would be the only limit though. There might be scenarios where one service in one cluster of your mesh goes higher than 10k while the rest of the mesh doesn't go over 65k overall. I agree that this looks unlikely though 😅. It seems that the eBPF map can be easily tuned by the user, so we could just add something in the documentation saying that if you have more than 65k backends across your mesh you should consider bumping this limit 🤔 (and it would most likely even be worthwhile to do this with the current documentation/not as part of this CFP!).

Some alternative proposals:

  • Slice backends just like EndpointSlice + maybe additional compression for all resources. Just slicing without compression reduces the amount of data transferred significantly + reduces the overhead of processing within the Cilium agent (fewer backends to update/process per event etc). Also, if we go with compression, less data to decompress/lower overhead per event etc.
  • Cap the max number of backends per Service (for example ~1k) + maybe additional compression for all resources.
    I actually somehow like capping the max number of backends per Service even though it doesn't sound that great initially. By default, the eBPF maps responsible for Service backends have a size of ~65k entries.

1K might be a bit too small, but we could indeed have a hard, properly defined limit per service per cluster! And I am open to the slice approach too!

@MrFreezeex MrFreezeex closed this Dec 2, 2025
@MrFreezeex MrFreezeex reopened this Dec 2, 2025
@giorio94
Member

giorio94 commented Dec 3, 2025

Thanks @MrFreezeex for iterating on the CFP, and sorry for the long delay.

I think there are multiple valid concerns that this CFP describes, but it's a bit unclear to me what is the user story that we are trying to solve. [...] I am concerned mostly about added complexity

I personally second this. Overall your proposal makes sense to me, but I'm struggling a bit at finding a clear motivation that justifies the increased complexity and risk of regression intrinsic in a significant change like this.

I would really prefer not to have protobuf for one resource and json for other resources as that would definitely increase complexity and it would be harder to maintain.

Agreed. I'm personally not a big fan of protobuf in this context as it introduces the requirement of knowing the schema to be able to decode the content from etcd, which is not the case currently.

IIRC the only other resource used only for ClusterMesh (and not kvstore in general) is the serviceexport one specific to the MCS-API implementation, so the fact that it's not used by kvstore should simplify the transition a bit. That being said, I agree that having one encoding mechanism would be best in general if we were able to.

This is simpler if we only want to switch to compressing the JSON objects, as we could easily support both options, with the plain uncompressed version still being used in kvstore mode and the compressed one by clustermesh-apiserver and kvstoremesh, making use of capabilities to decide the format. The logic to handle compression/decompression should be fairly trivial anyway. I wouldn't change all resources to protobuf instead, as keeping both variants at the same time is likely to add way more complexity (and the risk of divergence would grow over time).

For large services, we can expect constant churn of backends (crashing/restarting pods etc). Reducing amount of data that is transferred is definitely improvement, but I would be actually also concerned with Cilium Agent ability to reconcile it for large services.

Right, my main point in the CFP regarding the rate of events is that decompressing/preparing for statedb insertion is two times faster with protobuf, but that might indeed not be sufficient.

I wonder if a practical way to approach this could be to extend the cmapisrv-mock component to better simulate service and backend churn, and then run a few scale tests with that, similarly to what we already do in https://github.com/cilium/cilium/blob/main/.github/workflows/scale-test-clustermesh.yaml. That would help to get a better understanding of where bottlenecks possibly are, and how much impact the actual parsing of JSON objects has. Maybe we discover that the bottleneck is somewhere else, and simply switching to protobuf would not help. Or maybe we see that it is the actual problem, which would be a stronger motivation for going in that direction.

IIRC the existing buffer before importing is 500 events or 500ms before taking those changes into account and essentially coalescing multiple updates.

This is definitely a possibility that makes sense to explore and validate with the scale tests. We already have some intrinsic coalescing given the back pressure introduced by etcd client rate limiting (both in the clustermesh-apiserver and at the kvstoremesh level), but explicit rate limiting for that may help even further.

@giorio94
Member

giorio94 commented Dec 3, 2025

For backend conditions part itself, why can't we have per service condition and filter that before writing them to Etcd?
AFAIK, high-level routing decision for services is:

  • route to ready backends
  • if no ready backends, route to terminating backends
  • if no terminating and ready backends, route to non-ready backends

It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔 ?

I think (but I'm not totally sure) that in general the reason for keeping the terminating backends in endpointslices is to preserve already existing connections to them until fully drained, while having new connections be load balanced to ready backends only. From this point of view the current behavior of global services is suboptimal, because we don't respect them. OTOH, it is a bit unclear to me whether adding that support would justify the extra churn, considering that the propagation of information may be (possibly significantly) delayed anyways especially in high scale environments. Again, user feedback and/or scale testing may help to provide guidance here.

@MrFreezeex
Member Author

MrFreezeex commented Dec 3, 2025

I personally second this. Overall your proposal makes sense to me, but I'm struggling a bit at finding a clear motivation that justifies the increased complexity and risk of regression intrinsic in a significant change like this.

I think one of the main motivations is that there are some fields that would be quite nice to add, and we are in a situation where we can't really comfortably add those fields (or any per-backend fields in general) without some important regressions: the rate of updates from adding conditions, the increased object size hitting the 1.5 MiB per-object limit, and overall the bandwidth needed, which would be severely impacted by both the rate of updates and the object size. Even without accounting for those new fields, improving some of those parameters would also be a motivation for this!

Also, related to the object format, our current ClusterService struct seems fitted to how Cilium worked internally when ClusterMesh was first introduced (I think?), and we might be better off fitting to something that looks more like the Kubernetes resource structs, so that we would be safe from any future refactoring of the load balancer structs, as those are only used internally. This would ensure we can adopt any important new fields in the EndpointSlice struct while just having a conversion to the Cilium load balancer internal struct similar to what Cilium does when reading from the kube-apiserver.

This is simpler if we only want to switch to compressing the json objects, as we could easily support both options, with the plain uncompressed version being still used in kvstore mode, and the compressed one by clustermesh-apiserver and kvstoremesh, making use of capabilities to decide the format.

Hmm maybe yes, we could even have some logic to detect whether something is compressed or not and somehow transparently decompress it or use it as-is (this is only food for thought, might not be ideal in our case). I am not sure if it's worth compressing any other objects, as I think they are all rather small uncompressed anyway? Maybe for CES, if it makes its way to the kvstore with a format containing multiple endpoints; I don't think there is an initiative started for this (and no idea if it's relevant either).

In case we actually change the service format / do a service v2, it would be relatively easy to have the v2 data compressed (independently of using protobuf or JSON). If we wanted to do the same for the rest of the objects, we would most likely need to check whether it's worth it or not for smaller objects 🤔. It could even be a separate initiative ~.

I wonder if a practical way to approach this could be to extend the cmapisrv-mock component to better simulate service and backend churn, and then run a few scale tests with that, similarly to what we already do in https://github.com/cilium/cilium/blob/main/.github/workflows/scale-test-clustermesh.yaml. That would help to get a better understanding on where bottleneck possibly are, and how much impact the actual parsing of json objects is. Maybe we discover that the bottleneck is somewhere else, and simply switching to protobuf would not help. Or maybe we see that it is the actual problem, which would be a stronger motivation for going in that direction.

Yep indeed, it sounds like it would be worth testing something like this! I am wondering if it won't be simpler to test with two actual clusters, just creating a service with like 5k or 10k endpoints and changing some small things at a high rate to simulate pods failing their readiness check, then checking how the agent behaves, most likely with some profiling info. Integrating something like that sounds better long term, but this might be simpler for me, who doesn't know too much about the existing scale tests, to quickly discover whether there is a bottleneck here. Would be nice to have a test like this integrated in the scale tests long term, of course!

I think (but I'm not totally sure) that in general the reason for keeping the terminating backends in endpointslices is to preserve already existing connections to them until fully drained, while having new connections be load balanced to ready backends only.

Yep, I understood that too! I don't know what ready=false vs terminated=false means exactly though, but this is most likely what @marseel described in their message (or at least very close to that 😅!).

OTOH, it is a bit unclear to me whether adding that support would justify the extra churn, considering that the propagation of information may be (possibly significantly) delayed anyways especially in high scale environments. Again, user feedback and/or scale testing may help to provide guidance here.

Unless I missed some code that would ignore those, the current behavior is that we just do not filter out any ready=false or terminated=true backends and treat them as fully active. So this means that even for a very low scale deployment, you might have a deployment in one cluster where pods never became ready, and we would treat them as ready pods in other clusters. This seems like a pretty big deal and breaks many assumptions around Services handling :/. Ideally, if we were to run the Kubernetes network conformance tests over 2 clusters somehow (which sounds hard and will most likely not happen), we should aim to theoretically pass them, and IMO if our architecture can't scale with those assumptions we should try to reasonably change something to fix that!

I am not convinced that most users are aware of the current limitation at all. We most likely have some users that are not aware of this and would consider this limitation a "deal breaker" for Cluster Mesh!


So action items for me:

  • Discover if having high rate churn of large service JSON object is a problem currently or not
  • If yes check to what point we currently coalesce and determine if export coalescing would help

It seems both of you are not convinced by protobuf, so it would most likely come down to some subset of the following things if we keep all backends in one object:

  • Decide if we want to add conditions in backends (and endpointslice name)
  • Decide if we want to change the format of the ClusterService (most likely something similar to what this CFP is proposing but encoded in JSON instead of protobuf)
  • Decide if we want to compress the objects (and whether we include all other types of resources in the scope or not)
  • Decide if we just hard limit the number of Endpoints to something that would work with our architecture
  • If we change something major (new struct format/compression) decide how to handle the transition

And there is also the slice approach, which would solve some of the points above but introduce other things (mainly buffering when reading the slices).

Let me know your thoughts on that; there are most likely some points above that we can discuss before I get the data on service churn and coalescing!

@giorio94
Member

giorio94 commented Dec 4, 2025

I think one of the main motivations is that there are some fields that would be quite nice to add, and we are in a situation where we can't really comfortably add them (or any per-backend fields in general) without important regressions: adding conditions increases the rate of updates and the object size, risks hitting the 1.5MiB per-object limit, and increases the overall bandwidth needed, which is severely impacted by both the update rate and the object size. Even without accounting for those new fields, improving some of those parameters would also be a motivation for this!

Sure, I agree with all of these reasons in theory. I think we are just lacking enough data at the moment to confirm whether (and which) of these would be a blocker, and how much benefit we could achieve by changing the format; basically, we'd be justifying the extra complexity and risk of regressions with clear advantages. I'd personally suggest focusing mostly on relatively low (e.g., <25 backends), medium (e.g., 25-250 backends) and high (e.g., 250-2500 backends) scale, and not that much on super-high scale (e.g., >5k backends) in a first phase.

I am not sure if it's worth compressing any other objects as I think they are all rather small uncompressed anyway?

I agree there would not be a clear advantage in compressing the other objects, given that they are fairly small (and their size is bounded). If we were to do that, it would be mostly for consistency (depending on the implementation it may be convenient to do the same for all resources), and not really for the size benefits (assuming that the overhead would be negligible, otherwise that would be probably off the table).

Maybe for CES if it makes its way to the kvstore with a format containing multiple endpoints, I don't think there is an initiative started for this [...].

Yep, not that I'm aware of.

Yep indeed, it sounds like it would be worth testing something like this! I am wondering if it won't be simpler to test with two actual clusters, just creating a service with like 5k or 10k endpoints and changing some small things at a high rate to simulate pods failing their readiness check, then checking how the agent behaves, most likely with some profiling info. Integrating something like that sounds better long term, but this might be simpler for me, who doesn't know too much about the existing scale tests, to quickly discover whether there is a bottleneck here. Would be nice to have a test like this integrated in the scale tests long term, of course!

I don't have a strong preference, it could be an extension of the cmapisrv-mock component (it already supports mocking service and backend objects here, but it would likely need some tuning and also require creating the service objects on the main cluster --- we can chat about how this works offline in case), or something else (for instance the loadbalancer package has some benchmarks that could be interesting to look into). IMO the most important part is that they are reproducible so that we can keep using them in the future to track the improvements (or possible regressions), and that they perform the E2E operations to track all bottlenecks (at least in the context of a single component --- agent or clustermesh-apiserver).

the current behavior is that we just do not filter out any ready=false or terminated=true backends and treat them as fully active. [...] This seems like a pretty big deal and breaks many assumptions around Services handling

Ah, right, I didn't consider that one. Yep, I agree this is definitely a bigger limitation compared to the handling of the terminating condition, and we should improve on that. We may be able to get pretty far with filtering at the source as Marcel mentioned though (one case to pay attention to is when the service has publishNotReadyAddresses set). Probably not as perfect as propagating all conditions (which we may still want to do longer term), but it also requires very few changes.

@MrFreezeex
Member Author

MrFreezeex commented Dec 5, 2025

I ran a small, hacky benchmark which is not that easily repeatable (like less easy than running go test or a CI for sure) but helped discover a few things. I created one global service and 5k "dummy" backends/50 full EndpointSlices (with IPs that do not point to anything and are not linked to any pods). Each endpoint has 2 ports and single stack IPv4, which resulted in the ClusterService weighing ~450KiB (note that I don't think there was any zone info in my kind cluster). The setup was a pretty standard kind-clustermesh dev setup on my laptop.

I also had a small script that wrote a specific label with a counter every 100ms to force the ClusterService to be regenerated and for the cilium-agent to watch it, trigger the full JSON decoding/statedb insert, and simulate some churn that could be caused by readiness. I also added some debug prints with the service name and my service label counter in operator/watchers/service_sync.go and in pkg/clustermesh/service_merger.go in MergeExternalServiceUpdate.

I noticed that the agents were processing each update sequentially and that processing was slower than getting the next updates, which means the agent was gradually lagging behind. It looks like we would need to add a queue or buffer somewhere in case we want to coalesce updates (and not only rely on the etcd rate limit).

I collected some pprof from one agent pulling the clustermesh services:

The most significant processing is in statedb. I compared the codepath from the k8s reflector https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/loadbalancer/reflectors/k8s.go#L340 vs clustermesh https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/clustermesh/service_merger.go#L84. The main differences seem to be that the k8s reflector keeps the previous backends outside statedb and cleans them before insertion, while in clustermesh we orphan backends afterwards by "querying" statedb; the k8s reflector also uses sequences, while clustermesh does not (and the previous point seems to help with that). Those two things might help with memory usage (and also CPU-wise, since ~25% of the agent time in my "benchmark" is spent in GC). The k8s reflector also writes a single transaction for multiple service updates via its buffer, which triggers upserts only after 500ms or 500 services updated, but the Commit functions only account for 2.43% of my CPU pprof, so that might not be that important (maybe it affects other parts of the flamegraph though?). So it seems we have some possibilities to improve the statedb usage, I think? I'm not really an expert in this codepath though, and I might not have understood everything. Anyway, this is almost unrelated to this CFP, but at least it helps with understanding where some of our bottlenecks are for very large services. The k8s reflector code also uses the EndpointSlice name to compare the previous and current backend state; it could probably be done a bit differently, but it could mean having more similar (or even common) code, which might thus help in paths other than EndpointSliceSync.

JSON decoding is also directly responsible for about 11.44%, which is about 1/5 of the statedb processing. So while it's true that it isn't the most impactful thing, it isn't small either, and it might become more significant once statedb usage is optimized. I think our current concurrency model is essentially one goroutine per cluster, which should make JSON decoding concurrent per cluster and thus less important to fix; there is also the new encoding/json/v2, which is apparently significantly faster and also makes this less pressing.

About concurrency, IIUC there are locks when writing to statedb, so it might not help that much outside of decoding, unfortunately (and that would explain why the k8s reflector is single-threaded). Without much coalescing, it seems quite feasible to get into a state where you lag behind if multiple large services (~1000 backends) have churn at the same time.

So overall, I think we currently can't really handle large services with a good amount of churn (10 updates per second in my test), and whether we filter out ready=false backends or introduce conditions shouldn't change that churn much, I think? This is not something that can't be fixed, but it seems we would need some optimizations (at least some coalescing) before doing any of that, while having proper conditions would make us fully conformant for draining connections.

On the slice vs non-slice front, I think it is mainly a question of where we want to coalesce. If we have slices, we can't really coalesce meaningfully on the export path, while if we don't, we can easily coalesce on export, which would also directly help with the etcd rate limit and network bandwidth (if we choose to compress, network bandwidth might not be that meaningful though). For both, we can also coalesce on the import path to have a single statedb commit for multiple services; I don't know if that matters a lot or whether the rest of the statedb usage optimizations would already be enough 🤔. Also, in case we coalesce on both export and import, we might have to be careful about not introducing too much latency (waiting 500ms on export plus 500ms on import could be too much). Regarding EndpointSliceSync, queuing every EndpointSlice from a Service in the non-slice approach should be just fine, so it might not be super meaningful there.

EDIT: actually, the export side is fine since the etcd rate limit and the workqueue on the syncstore should effectively coalesce things!

@MrFreezeex
Member Author

Ok! So I made a proper benchmark available here: https://github.com/MrFreezeex/cilium/tree/bench-clustermesh/pkg/clustermesh/benchmark. I based it on the loadbalancer benchmark (thanks @giorio94 for the pointer) so that I can more easily compare the loadbalancer k8s reflector and clustermesh, and also run it more easily than with actual Kubernetes clusters. It confirms some of the points I discovered with the hacky benchmark above!

Following these new insights and our discussion, I rewrote almost the entire CFP; it now focuses a lot more on aligning with the loadbalancer k8s reflector, as there are some major performance differences in our current statedb usage. Let me know what you think about the new approach!

@MrFreezeex MrFreezeex force-pushed the clustermesh-svc branch 6 times, most recently from 0b64939 to f388fb4 Compare December 7, 2025 20:21
@MrFreezeex MrFreezeex force-pushed the clustermesh-svc branch 2 times, most recently from 2bbe250 to 6a09eb4 Compare December 7, 2025 21:00
Rewrote the CFP to remove protobuf and focus on a JSON format based on
the slim EndpointSlice. It also includes statedb performance
considerations after having done end-to-end benchmarks.

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
@giorio94
Member

Ok! So I made a proper benchmark available here: https://github.com/MrFreezeex/cilium/tree/bench-clustermesh/pkg/clustermesh/benchmark. I based it on the loadbalancer benchmark (thanks @giorio94 for the pointer) so that I can more easily compare the loadbalancer k8s reflector and clustermesh, and also run it more easily than with actual Kubernetes clusters. It confirms some of the points I discovered with the hacky benchmark above!

Thanks! I haven't managed to look into all the details yet, but I think it would be valuable to get some profiling data from the benchmarks, and then involve @joamaki in the discussion if the statedb usage turns out being the biggest bottleneck in this context.

The shorter transition period will help with supporting both codepaths
for a shorter time. It also won't affect users that rely on default
settings (and stay within the supported one-minor-version skew).

Now that we are targeting 1.20, it also leaves us more time to smooth
out rough edges in the first version where this is supported (even if
disabled by default).

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
@MrFreezeex
Member Author

MrFreezeex commented Dec 15, 2025

Here are two pprofs collected by the new clustermesh benchmark, for 1000 backends and 5000 backends:

@MrFreezeex
Member Author

MrFreezeex commented Dec 23, 2025

I added a new mode to the benchmark to "skip" JSON decoding (everything is precomputed before the benchmark): since a big part of the CPU profile is GC, it was a bit unclear what the impact of JSON decoding actually was (validation of the ClusterService struct is skipped alongside it).

So the impact of skipping JSON decoding is quite big, but the majority of the cost still does not come from JSON decoding (or, technically, from validating the data, which is skipped alongside). And as the current version of the CFP says, encoding/json/v2 will reduce this in the future, plus it's something we can easily run concurrently.

Still, even with that, if we have the same code as the loadbalancer statedb path, service churn with a small number of endpoint conditions changing (only one in the loadbalancer benchmark I present in the CFP ~) will most likely be dominated by JSON decoding. That is not necessarily a problem, but I am just stating it out loud (keeping JSON has the benefit of reducing some QPS on etcd, plus JSON is easy to debug and similar to the rest of our data).

30 svc 1000 backends with JSON decoding

https://pprof.me/8878f995d8c41fc5e414c6ca3762a114/

Memory statistics from N=10 iterations:
Min: Allocated 236338kB in total,  813933 objects /  45888kB still reachable (per service: 27131 objs, 8067014B alloc, 1566315B in-use)
Avg: Allocated 237801kB in total,  840312 objects /  47651kB still reachable (per service: 28010 objs, 8116954B alloc, 1626520B in-use)
Max: Allocated 247698kB in total, 1015440 objects /  59049kB still reachable (per service: 33848 objs, 8454767B alloc, 2015546B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 944.912023ms (31.497067ms per service /     32 services per second)
Avg: Reconciled 30 objects in 977.09357ms (32.569785ms per service /     31 services per second)
Max: Reconciled 30 objects in 1.014913917s (33.830463ms per service /     30 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 492.612172ms (16.420405ms per service /     61 services per second)
Avg: Reconciled 30 objects in 515.22946ms (17.174315ms per service /     58 services per second)
Max: Reconciled 30 objects in 553.640831ms (18.454694ms per service /     54 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 275.309671ms (9.176989ms per service /    109 services per second)
Avg: Reconciled 30 objects in 298.890577ms (9.963019ms per service /    100 services per second)
Max: Reconciled 30 objects in 313.160542ms (10.438684ms per service /     96 services per second)
30 svc 1000 backends without JSON decoding

https://pprof.me/e8bb2a3984295ddcfc4ea3bac1b4f864/

Memory statistics from N=10 iterations:
Min: Allocated 209487kB in total,  650934 objects /  33921kB still reachable (per service: 21697 objs, 7150501B alloc, 1157864B in-use)
Avg: Allocated 210930kB in total,  687804 objects /  36183kB still reachable (per service: 22926 objs, 7199768B alloc, 1235052B in-use)
Max: Allocated 221036kB in total,  885414 objects /  48677kB still reachable (per service: 29513 objs, 7544716B alloc, 1661540B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 687.289282ms (22.909642ms per service /     44 services per second)
Avg: Reconciled 30 objects in 721.372454ms (24.045748ms per service /     42 services per second)
Max: Reconciled 30 objects in 767.379002ms (25.5793ms per service /     39 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 338.908581ms (11.296952ms per service /     89 services per second)
Avg: Reconciled 30 objects in 361.201401ms (12.040046ms per service /     83 services per second)
Max: Reconciled 30 objects in 407.997581ms (13.599919ms per service /     74 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 172.558172ms (5.751939ms per service /    174 services per second)
Avg: Reconciled 30 objects in 180.943315ms (6.031443ms per service /    166 services per second)
Max: Reconciled 30 objects in 194.16658ms (6.472219ms per service /    155 services per second)
30 svc 5000 backends with JSON decoding

https://pprof.me/28241dd7587d755f977461f353826682/

Memory statistics from N=10 iterations:
Min: Allocated 1275748kB in total, 4778851 objects / 253007kB still reachable (per service: 159295 objs, 43545557B alloc, 8635989B in-use)
Avg: Allocated 1281585kB in total, 4919531 objects / 262808kB still reachable (per service: 163984 objs, 43744774B alloc, 8970533B in-use)
Max: Allocated 1322661kB in total, 5100084 objects / 292074kB still reachable (per service: 170002 objs, 45146835B alloc, 9969462B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 5.355248905s (178.508296ms per service /      6 services per second)
Avg: Reconciled 30 objects in 5.491863163s (183.062105ms per service /      5 services per second)
Max: Reconciled 30 objects in 5.758326306s (191.94421ms per service /      5 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 2.823847269s (94.128242ms per service /     11 services per second)
Avg: Reconciled 30 objects in 2.996277236s (99.875907ms per service /     10 services per second)
Max: Reconciled 30 objects in 3.077540301s (102.584676ms per service /     10 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 1.604537404s (53.48458ms per service /     19 services per second)
Avg: Reconciled 30 objects in 1.632018742s (54.400624ms per service /     18 services per second)
Max: Reconciled 30 objects in 1.672592405s (55.75308ms per service /     18 services per second)
30 svc 5000 backends without JSON decoding

https://pprof.me/fd9556556ce88bd6b497a6f1cf02ab48/

Memory statistics from N=10 iterations:
Min: Allocated 1148560kB in total, 3276330 objects / 170856kB still reachable (per service: 109211 objs, 39204181B alloc, 5831918B in-use)
Avg: Allocated 1154558kB in total, 3431878 objects / 180568kB still reachable (per service: 114395 objs, 39408932B alloc, 6163401B in-use)
Max: Allocated 1196340kB in total, 4452220 objects / 241774kB still reachable (per service: 148407 objs, 40835076B alloc, 8252556B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 4.040384584s (134.679486ms per service /      7 services per second)
Avg: Reconciled 30 objects in 4.12657837s (137.552612ms per service /      7 services per second)
Max: Reconciled 30 objects in 4.325916197s (144.197206ms per service /      7 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 2.068078899s (68.935963ms per service /     15 services per second)
Avg: Reconciled 30 objects in 2.179198776s (72.639959ms per service /     14 services per second)
Max: Reconciled 30 objects in 2.249116659s (74.970555ms per service /     13 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 1.00006234s (33.335411ms per service /     30 services per second)
Avg: Reconciled 30 objects in 1.028080579s (34.269352ms per service /     29 services per second)
Max: Reconciled 30 objects in 1.096082688s (36.536089ms per service /     27 services per second)

@joamaki

joamaki commented Dec 23, 2025

StateDB write performance can be improved a lot by batching the writes, e.g. doing more writes within a single WriteTxn, as that can mutate the internal tree nodes instead of cloning them each time. This would need a similar construction as in pkg/loadbalancer/reflectors/k8s.go to collect a bunch of events into a single buffer which is then processed in a single write transaction.

Here's the difference in throughput seen in StateDB benchmarks:

BenchmarkDB_WriteTxn_1-4                    	  636301	      1828 ns/op	    547104 objects/sec	    1072 B/op	      18 allocs/op
BenchmarkDB_WriteTxn_10-4                   	 1597690	       760.1 ns/op	   1315662 objects/sec	     455 B/op	       8 allocs/op
BenchmarkDB_WriteTxn_100-4                  	 2008012	       622.9 ns/op	   1605298 objects/sec	     424 B/op	       7 allocs/op

So even just 10 objects per WriteTxn gets more than 2x throughput.

There are probably optimizations that could be made for SetBackendsOfCluster if we batch things up, e.g. collecting services whose backends updated and refreshing the frontends of said services only once, or doing the orphan cleanup only once at the end of the batch. Can we also reduce work by checking whether the modification actually changes anything? E.g. updateBackends currently unconditionally does an Insert, but we could check whether we're actually modifying the object; if it's the same we can skip the Insert, and we can also skip refreshing the frontends at the end if we didn't touch any backends.
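A toy illustration of the skip-if-unchanged idea (a map-backed stand-in, not the actual writer.go code; the point is only the equality check before the write):

```go
package main

import (
	"fmt"
	"reflect"
)

// backend is a stand-in for the loadbalancer backend object.
type backend struct {
	Addr  string
	Ready bool
}

// store mimics a table: upsert only counts as a write (which would
// otherwise trigger reconciliation and a frontend refresh) when the
// incoming object actually differs from the stored one.
type store struct {
	objs   map[string]backend
	writes int
}

func (s *store) upsert(key string, b backend) {
	if old, ok := s.objs[key]; ok && reflect.DeepEqual(old, b) {
		return // unchanged: skip Insert and the frontend refresh
	}
	s.objs[key] = b
	s.writes++
}

func main() {
	s := &store{objs: map[string]backend{}}
	s.upsert("a", backend{"10.0.0.1", true})
	s.upsert("a", backend{"10.0.0.1", true}) // identical: no-op
	s.upsert("a", backend{"10.0.0.1", false})
	fmt.Println(s.writes) // 2
}
```

Whether the comparison pays for itself depends on how often incoming updates are actually no-ops, which is exactly what the churn benchmark below ends up measuring.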

@MrFreezeex
Member Author

MrFreezeex commented Dec 23, 2025

StateDB write performance can be improved a lot by batching the writes [...]. So even just 10 objects per WriteTxn gets more than 2x throughput.

Oh nice! I didn't have that data; I was mostly looking at the orphan logic, which is different from the k8s reflector and looked like the most impactful part, but it's nice to know how valuable batching would be here 👀

I think we would love to get a review / a deeper look at the CFP from you @joamaki whenever you have a chance 🙏. One of the goals of the current version of the CFP is also to have some shared logic with the k8s reflector. In the current version I am literally proposing to embed the slim EndpointSlice struct in a future "ClusterServiceV2" struct, which should help a lot with sharing code between the two while preventing future divergence. (There are also other reasons to do that, the most important one being to actually get the condition fields so that we can correctly exclude ready=false backends, or exclude a backend for any other reason.)

@MrFreezeex
Member Author

MrFreezeex commented Jan 16, 2026

Hi! FYI, the fact that we didn't take ready=false / serving=false into account is a regression from 1.19 :(. I submitted cilium/cilium#43807 just in time! It makes things slightly better than what I wrote in this CFP, but IIUC it shouldn't fundamentally change the CFP content: it still seems helpful to retain the conditions to get all the trickier condition behavior working, and to let the datapath phase things out correctly while keeping endpoints whose conditions indicate they won't serve new traffic.

@joamaki

joamaki commented Jan 16, 2026

Hey, I haven't yet had time to dive into the CFP; I'll try to make time next week. I am thinking though that, regardless of the CFP, we should really look into doing equality checks in writer/writer.go to avoid insertions (and reconciliation!) if the frontend or backend did not change. This should speed things up quite a bit already. Though we do need to benchmark whether the additional cost of comparing is worth the trade-off.

@joamaki

joamaki commented Jan 16, 2026

Did a quick experiment to add in the equality checks for service/frontend/backend:
cilium/cilium#43811

Looks like the cost of the equality check isn't that high, and the other unrelated changes only affected the orphan detection. Neither seemed to impact performance too badly.

@MrFreezeex would you be able to benchmark these changes? Would be quite interested in seeing what impact this would have.

@MrFreezeex
Member Author

MrFreezeex commented Jan 16, 2026

@MrFreezeex would you be able to benchmark these changes? Would be quite interested in seeing what impact this would have.

It looks great!

Before your patch I got this from my benchmark with 1000 backends per service:

❯ go run ./cmd -services 30 -backends 1000 -skip-json-decoding
[...]
Memory statistics from N=10 iterations:
Min: Allocated 212339kB in total,  658009 objects /  34395kB still reachable (per service: 21933 objs, 7247868B alloc, 1174026B in-use)
Avg: Allocated 213773kB in total,  692506 objects /  36358kB still reachable (per service: 23083 objs, 7296817B alloc, 1241031B in-use)
Max: Allocated 224037kB in total,  885342 objects /  48676kB still reachable (per service: 29511 objs, 7647138B alloc, 1661485B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 693.383715ms (23.11279ms per service /     43 services per second)
Avg: Reconciled 30 objects in 736.174908ms (24.539163ms per service /     41 services per second)
Max: Reconciled 30 objects in 843.320071ms (28.110669ms per service /     36 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 329.15643ms (10.971881ms per service /     91 services per second)
Avg: Reconciled 30 objects in 364.023099ms (12.134103ms per service /     82 services per second)
Max: Reconciled 30 objects in 449.269775ms (14.975659ms per service /     67 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 174.869765ms (5.828992ms per service /    172 services per second)
Avg: Reconciled 30 objects in 186.177964ms (6.205932ms per service /    161 services per second)
Max: Reconciled 30 objects in 207.065113ms (6.90217ms per service /    145 services per second)

After your patch:

❯ go run ./cmd -services 30 -backends 1000 -skip-json-decoding
[...]
Memory statistics from N=10 iterations:
Min: Allocated 129346kB in total,  605918 objects /  31326kB still reachable (per service: 20197 objs, 4415036B alloc, 1069265B in-use)
Avg: Allocated 130740kB in total,  672742 objects /  35122kB still reachable (per service: 22424 objs, 4462606B alloc, 1198847B in-use)
Max: Allocated 140848kB in total,  874991 objects /  47917kB still reachable (per service: 29166 objs, 4807644B alloc, 1635574B in-use)

Insert statistics from N=10 iterations:
Min: Reconciled 30 objects in 437.382759ms (14.579425ms per service /     69 services per second)
Avg: Reconciled 30 objects in 463.831118ms (15.461037ms per service /     65 services per second)
Max: Reconciled 30 objects in 547.857262ms (18.261908ms per service /     55 services per second)

Churn statistics from N=10 iterations (re-update same ClusterServices):
Min: Reconciled 30 objects in 65.554239ms (2.185141ms per service /    458 services per second)
Avg: Reconciled 30 objects in 76.882548ms (2.562751ms per service /    390 services per second)
Max: Reconciled 30 objects in 103.067485ms (3.435582ms per service /    291 services per second)

Delete statistics from N=10 iterations:
Min: Reconciled 30 objects in 154.563222ms (5.152107ms per service /    194 services per second)
Avg: Reconciled 30 objects in 173.200136ms (5.773337ms per service /    173 services per second)
Max: Reconciled 30 objects in 189.800691ms (6.326689ms per service /    158 services per second)

Here is the pprof in case you want it: https://pprof.me/c737434ef8025fd1d342c79623f64848/.

My "churn" test doesn't actually update any backend so it's probably not that realistic but it's still a 6x time speed up there which looks really good!

ClusterService should have excluded ready=false / serving=false backends
but there was a regression which is now fixed. Also fixing this CFP text
accordingly.

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
@joestringer joestringer requested a review from joamaki February 18, 2026 20:38