Conversation
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@ledger.fr>
cc @cilium/sig-clustermesh
giorio94 left a comment
Thanks @MrFreezeex, the CFP looks very well written to me, and the proposal overall looks reasonable. I've started taking a first look, and left a few initial comments inline.
The main factor holding me back about significant changes like this one is the risk of unexpected problems and regressions that may be introduced by the refactoring, and whether the benefits actually justify the associated risks. In this case, it would probably be good to get some feedback from people running Cluster Mesh at significant scale to understand the number of backends per service we could be talking about, to better gauge the overall benefits and how close we are to the current limits. Tagging @antonipp @oblazek @dctrwatson based on previous interactions, in case you have any feedback and/or thoughts in this respect.
I personally also feel the troubleshooting part to be important to clarify before committing to a change like this, to ensure that we can continue to both perform script testing around this area, as well as easily inspect the state to figure out inconsistencies if things don't work as expected. It doesn't look impossible to solve though.
> and there are still no backend conditions (or state) available, which prevents the Cilium load balancer from excluding/phasing out some backends similarly to regular backends not coming from a remote cluster.
I agree that this is suboptimal from a functional point of view. However, it has the advantage of reducing the number of events that need to be shared across the clustermesh, which could otherwise be significantly higher when e.g. including the terminating transition. Not saying that we shouldn't go in that direction, just that it may need some additional scalability considerations.
> While we would like to add those fields to the `ClusterService` struct, we have a hard limit of 1.5 MiB in etcd, and without a breaking change, adding these new fields to the current `ClusterService` struct might limit the number of backends to fewer than ~10,000 backends
Do we have data on whether this could be a problem in real scenarios, where one would want to have a global service with that many backends? My understanding is that very large services in plain K8s come mostly from DaemonSets, as on a 5K-node cluster that would correspond to 5K backends. But that doesn't look like a common use case for global services, at least to me.
Indeed, I was focusing mostly on the max number of endpoints, but it's true that the maximum number of nodes could be relevant too and could influence the number of backends per service.
Maybe it could come from cases where you have many small pods (possibly influenced by languages/architectures where a pod cannot use a full node).
Also, technically, if your cluster is dual stack the service could reach this with a DaemonSet. And while I haven't said it in this sentence of the doc (as this is more of an estimation), all of this is heavily influenced by the number of ports each service has.
But yes, I don't have data beyond stating the limit out loud.
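As a rough illustration of the kind of estimate being discussed (the per-backend byte count is an assumption for illustration, not a measurement), the ~10,000 figure falls out of dividing etcd's default 1.5 MiB value limit by a plausible per-backend JSON footprint:

```go
package main

import "fmt"

// Back-of-envelope sketch: how many backends fit in a single etcd value,
// given its default 1.5 MiB request size limit. The per-backend footprint
// below is an assumed figure; dual stack roughly doubles the address part,
// and extra ports per service grow it further.
func main() {
	limit := 1.5 * 1024 * 1024 // etcd default max request size, in bytes
	perBackend := 150.0        // assumed JSON bytes for one backend entry
	fmt.Printf("~%d backends per service\n", int(limit/perBackend))
}
```

With these assumed numbers the estimate lands close to the ~10,000 mentioned in the CFP; a larger per-backend footprint (more ports, conditions, dual stack) pushes the limit correspondingly lower.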
Yeah, I think this is one of the aspects where getting feedback from community members that run Cluster Mesh at large scales would be super useful, to be able to gauge how pressing the problem is.
> In order to introduce a "v2" of the ClusterMesh Service data format, this CFP is mainly proposing to have a global switch rather than a per cluster detection. This is mainly to keep the transition as simple as possible, because a per cluster detection could introduce more complexity as we would need to handle downgrade/upgrade of remote clusters at runtime.
I'd personally be in favor of a per-cluster detection, which should be easy enough leveraging capabilities. Basically the source cluster would advertise that it supports v2 services, and in that case the client would switch to watching that prefix (obviously assuming it is capable of doing so). That would ensure that the new prefix is only ever watched if actually exported, and also make the migration require way fewer minor versions. We could still have a (hidden) flag as an emergency button in case of unexpected issues though.
This seems rather hard to implement to me. I think if we were to do that, we would most likely advertise it in the cluster config. And because of the kvstore, we might want to do this dynamically, meaning watching the cluster config instead of just reading it at the start of the agent/operator. We could perhaps not do it dynamically, but that means it would depend on the last re-connection to etcd and/or the last time the agent/operator (re)started, and could differ between clients. If doing it dynamically, there might be some complications during the transition when the cluster has not fully finished exporting all its services in the v2 format though.
One other aspect is that for EndpointSliceSync I would want to introduce a new algorithm, leveraging in particular the new EndpointSlice name. While it's technically not required, it could simplify things a bit to have a global switch for all clusters rather than using the old or new logic depending on what format of data a cluster exports (which might even change dynamically).
So overall I am not certain that it's worth it compared to having a global switch which would mean "expect that all clusters you are connected to export v2, and use v2" (+ possibly adding some guardrails around that by doing a version check or possibly a capability check when the cluster connects). Maybe you have some other considerations that would simplify this though?
> And because of the kvstore we might want to do this dynamically meaning watching the cluster config instead of just getting it at the start of the agent/operator. We could perhaps not do it dynamically but it means that it would be dependent on the last re-connection to etcd and/or last time the agent/operator (re)started and could be different depending on the client. If doing it dynamically there might be some complication during the transition when the cluster has not fully finished exporting all its services in the v2 format though.
I'd definitely go for a static approach in that case. Basically just checking upon connection and watching either one path or the other based on that. I agree that the trickiest part would be kvstore mode, as there we don't use the sync canary, and the cluster config is written before that synchronization completes the first time. But we could definitely work around that, either by having the clients always wait for the sync canary for v2 services (as it is written in all cases), or by implementing some ad-hoc mechanism.
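A minimal sketch of this static per-cluster detection, assuming a hypothetical `ServiceExportsV2` capability flag (the `ClusterConfig`/`Capabilities` field names here are illustrative, not Cilium's actual types): upon connecting, the client inspects the advertised capabilities once and picks the prefix to watch.

```go
package main

import "fmt"

// Capabilities is an illustrative stand-in for the capabilities a remote
// cluster advertises in its cluster config.
type Capabilities struct {
	ServiceExportsV2 bool // hypothetical flag: the cluster exports v2 services
}

// ClusterConfig is an illustrative stand-in for the remote cluster config
// read once upon connection (the static approach discussed above).
type ClusterConfig struct {
	Capabilities Capabilities
}

// servicesWatchPrefix picks the etcd prefix to watch based on the advertised
// capability, falling back to the v1 prefix for older clusters.
func servicesWatchPrefix(cfg ClusterConfig) string {
	if cfg.Capabilities.ServiceExportsV2 {
		return "cilium/state/services/v2/"
	}
	return "cilium/state/services/v1/"
}

func main() {
	fmt.Println(servicesWatchPrefix(ClusterConfig{Capabilities: Capabilities{ServiceExportsV2: true}}))
	fmt.Println(servicesWatchPrefix(ClusterConfig{}))
}
```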
> One other aspect is that for EndpointSliceSync I would want to do a new algorithm leveraging […]
Ah, yeah, in that case having a global switch would indeed make it easier to completely change the logic there. The other possibility could be introducing the new algorithm only after fully migrating to v2, which would likely be similar to the flags approach in terms of timeline (i.e., 1.20 assuming that the initial bits get in for v1.19).
> So overall I am not certain that it's worth it compared to have a global switch which would do "expect that all clusters you are connected to export v2 and use v2"
I'd need to think a bit more about this, but if we go for a global switch I wonder whether we should even have a flag at that point (or have it hidden, just as a circuit breaker). Basically, in the first version we'd write to both paths, in the next one switch the clients to read from v2, and in the one after that completely drop the support for v1. The main problem with an explicit flag is that people may then toggle it for whatever reason, and we risk having to commit to supporting both versions for longer than expected.
> }
>
> // ClusterService represents a service definition within a cluster
> message ClusterService {
I'd suggest considering adding a free-form field (e.g., a string-to-string map) that implementations could use to extend the behavior and run experiments before committing to an actual API change. Basically an equivalent to how annotations are often used in Kubernetes. Another alternative could be leveraging google.protobuf.Any as e.g. done here, but I'm unsure whether the extra complexity is actually worth it, and how that would play with possible troubleshooting tooling.
I think we should just be able to leverage the protobuf tags for experimentation. We would then just need to keep a list of reserved tag IDs when removing them, and either never reuse those or keep them unused long enough that it doesn't become a problem IMO.
Yeah, there are definitely multiple possibilities, and the ID space of protobuf is not a problem from that point of view. Mostly wondering whether to proactively add a global map field right now, or defer that to later 🤔
Hmm, do you have some use case that you would like to address using that? Since all protobuf fields are optional here, it seems to me that it should be quite similar 🤔. If you have a use case in mind I would be happy to add that ofc
Second iteration of the ClusterMesh Service CFP. This commit is meant to be squashed before the PR is merged. This new iteration changes the following things:
- adds zstd compression for the existing JSON format
- renames various fields to EndpointSlice instead of Endpoints
- changes the address to string. While this is less efficient on uncompressed data, it appears to be more efficient when compressing with zstd. It would also be more readable when encoding to JSON
- adds a quick note on the plan to address the debugging point by using protojson to allow dumping the data to JSON (and YAML) and testing in hive scripts with JSON

Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
Hi everyone 👋 sorry for the wait, I finally found time to investigate a bit more here and pushed the second iteration of this CFP. You can find more details in the commit description. Apart from addressing some points raised by @giorio94 (thanks again!), I realized that it is more efficient to encode the address as a string type when compressing via zstd, even though the uncompressed data is heavier. The second important point is about debuggability: allowing the data to be dumped as JSON/YAML, and having some hive script commands to test with the JSON format. Those would most likely require some generic kvstore code to have special conditions for these resources, which doesn't look that great, but at least the UX would be good. If you have suggestions about that (and other things ofc), feel free!
side note: I think there are multiple valid concerns that this CFP describes, but it's a bit unclear to me what user story we are trying to solve. Within the motivation sections we have:
I thought about it a little bit more, and what I am mostly concerned about is added complexity. However, when I think about it from a high-level design perspective, I would start with the following questions:
For the backend conditions part itself, why can't we have per-service conditions and filter them before writing to etcd?
It seems to me that if one cluster exposes both ready and terminating backends, remote clusters would actually only use ready backends, or did I miss something? That would mean we can just expose terminating or ready backends only instead 🤔? Some alternative proposals:
I do not think considering 10k or 50k backends per service makes sense without addressing eBPF map sizing as well 🤔
Hi @marseel 👋 thanks for looking into this!
The main reason is that it's the only resource with an unbounded number of entries/unbounded size. It's also a resource specific to clustermesh; IIRC the only other resource used only for clustermesh (and not kvstore in general) is the serviceexport one, specific to the MCS-API implementation, so the fact that it's not used by kvstore should simplify the transition a bit. That being said, I agree that having one encoding mechanism would be best in general if we could.
Right, my main point in the CFP regarding the rate of events is that decompressing/preparing for statedb insertion is twice as fast with protobuf, but that might indeed not be sufficient. I was thinking about having a "buffer" like the Cilium loadbalancer code does when reading EndpointSlice/Service changes, but on the export side instead. IIRC the existing buffer before importing is 500 events or 500ms before taking those changes into account, essentially coalescing multiple updates. If we were to do something equivalent when exporting the "ClusterService v2", it should make this more approachable I think. I thought about this recently and was thinking that it could be something done separately from this CFP/later on, but I could add it here too! EDIT: also note that I haven't benchmarked it, but I think there are ways to further optimize the insertion of cluster services into statedb, which might push beyond the 2x speedup of decompression/insertion into statedb vs currently.
It kind of solves it, because the compression makes the object really small and you wouldn't conceivably reach this limit within what Kubernetes/Cilium supports. I wasn't aware of the 65k backends limit and was counting more on Kubernetes supporting a max of 150k backends overall. But indeed, accounting for the 65k backends overall, a theoretical limit around 10k could be just ok. My angle is more about how to add the new fields without regressing in ways we don't want than about adopting protobuf specifically, so I would be more than happy with alternative solutions as well (or if the limit with the added fields is deemed ok, we could even just add those fields now and optimize later?)!
The high-level routing seems correct; I'm not entirely sure whether the last point, "if no terminating and ready backends, route to non ready backends", is correct. I would maybe have thought that it would never route new connections to non-ready backends, but I don't know whether it does that or not. IIUC there is also something about not removing existing entries in terminating/non-ready state for existing connections, which might prevent us from removing non-ready backends. It would also remove those from EndpointSliceSync, but this part might just be ok!
Indeed! Although it might be simpler to make the ClusterMesh almost unbounded so that the eBPF map would be the only limit. There might be scenarios where one service in one cluster of your mesh goes higher than 10k while the mesh overall stays under 65k. I agree that this looks unlikely though 😅. It seems the eBPF map size can be easily tuned by the user, so we could just add something in the documentation saying that if you have more than 65k backends across your mesh you should consider bumping this limit 🤔 (and it would most likely even be worthwhile to do this in the current documentation/not as part of this CFP!).
1K might be a bit too small, but we could indeed have a hard and properly defined limit per service per cluster! And I am open to the slice approach too!
Thanks @MrFreezeex for iterating on the CFP, and sorry for the long delay.
I personally second this. Overall your proposal makes sense to me, but I'm struggling a bit to find a clear motivation that justifies the increased complexity and risk of regression intrinsic in a significant change like this.
Agreed. I'm personally not a big fan of protobuf in this context as it introduces the requirement of knowing the schema to be able to decode the content from etcd, which is not the case currently.
This is simpler if we only want to switch to compressing the json objects, as we could easily support both options, with the plain uncompressed version being still used in kvstore mode, and the compressed one by clustermesh-apiserver and kvstoremesh, making use of capabilities to decide the format. The logic to handle compression/decompression should be fairly trivial anyways. I wouldn't change all resources to protobuf instead, as keeping both variants at the same time is likely to add way more complexity (and the risk of divergences would grow over time).
I wonder if a practical way to approach this could be to extend the
This is definitely a possibility that makes sense to explore and validate with the scale tests. We already have some intrinsic coalescing given the back pressure introduced by etcd client rate limiting (both in the clustermesh-apiserver and at the kvstoremesh level), but an explicit rate limiting for that may help even further.
I think (but I'm not totally sure) that in general the reason for keeping the terminating backends in endpointslices is to preserve already existing connections to them until fully drained, while having new connections be load balanced to ready backends only. From this point of view the current behavior of global services is suboptimal, because we don't respect them. OTOH, it is a bit unclear to me whether adding that support would justify the extra churn, considering that the propagation of information may be (possibly significantly) delayed anyways, especially in high scale environments. Again, user feedback and/or scale testing may help to provide guidance here.
I think one of the main motivations is that there are some fields that would be quite nice to add, and we are in a situation where we can't really comfortably add those fields (and any per-backend fields in general) without some important regressions: the rate of updates would grow by adding conditions, the object size would increase and hit the 1.5 MiB per object limit, and overall the bandwidth needed would be severely impacted by both the rate of updates and the object size. Even without accounting for those new fields, improving some of those parameters would also be a motivation for this! Also, related to the object format, our current ClusterService struct seems fitted to how Cilium worked internally when ClusterMesh was first introduced (I think?), and we might be better off aligning with something that looks more like the Kubernetes resource structs, so that we would be safe from any future refactoring of the load balancer structs, as those are only used internally. This would ensure we can adopt any important new fields in the EndpointSlice struct while just having a similar conversion to the Cilium load balancer internal struct as what Cilium is doing when reading from the kube-apiserver.
Hmm, maybe yes, we could even have some logic to detect whether something is compressed or not and transparently decompress it or use it as is (this is only food for thought, it might not be ideal in our case). I am not sure it's worth compressing any other objects, as I think they are all rather small uncompressed anyway? Maybe for CES, if it makes its way to the kvstore with a format containing multiple endpoints; I don't think there is an initiative started for this (and no idea if it's something relevant either). In case we actually change the service format / do a service v2, it would be relatively easy to have the v2 data compressed (independently of using protobuf or JSON). In case we want to do the same for the rest of the objects, we would most likely need to check whether it's worth it or not with smaller objects 🤔. It could even be a separate initiative.
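For illustration, the transparent detection could branch on the zstd frame magic number (0xFD2FB528, stored little-endian as the bytes `28 B5 2F FD` per the Zstandard format spec), since our plain values are JSON objects starting with `{`; a hypothetical sketch:

```go
package main

import (
	"bytes"
	"fmt"
)

// zstdMagic is the Zstandard frame magic number 0xFD2FB528, laid out
// little-endian at the start of every zstd frame.
var zstdMagic = []byte{0x28, 0xB5, 0x2F, 0xFD}

// isZstd reports whether a kvstore value looks like a zstd frame; anything
// else would be treated as a plain (JSON) value and used as is.
func isZstd(value []byte) bool {
	return bytes.HasPrefix(value, zstdMagic)
}

func main() {
	fmt.Println(isZstd([]byte(`{"name":"foo"}`)))          // false: plain JSON
	fmt.Println(isZstd([]byte{0x28, 0xB5, 0x2F, 0xFD, 0})) // true: zstd frame
}
```

This sniffing approach is only food for thought, matching the comment above; versioning the key prefix instead avoids relying on content inspection.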
Yep, indeed that sounds like it would be worth testing! I am wondering if it wouldn't be simpler to test with two actual clusters and just create a service with like 5k or 10k endpoints, change some small things at a high rate to simulate pods failing their readiness check, and watch how the agent behaves, most likely with some profiling info. Integrating something like that sounds better long term, but it might be simpler for me, who doesn't know too much about the existing scale tests, to quickly discover whether there is a bottleneck here. Would be nice to have a test like this integrated in the scale tests long term ofc though!
Yep, I understood that too! I don't know what ready=false vs terminated=false means exactly though, but this is most likely what @marseel described in their message (or at least very close to it 😅!).
If I didn't miss any code that would ignore those, the current behavior is that we just do not filter out any ready=false or terminating=true backends, and we treat them like fully active backends. This means that even in a very low scale deployment, you might have a deployment in one cluster where pods never became ready, and we would treat them as ready pods in other clusters. This seems like a pretty big deal and breaks many assumptions around Services handling :/. Ideally, if we were to run the Kubernetes net conformance tests over 2 clusters somehow (which sounds hard and most likely won't happen), we should aim to theoretically pass them, and IMO if our architecture can't scale under those assumptions we should try to reasonably change something to fix that! I am not convinced that most users are aware of the current limitation at all. We most likely have some users that are not aware of this and would consider this limitation a "deal breaker" for Cluster Mesh! So action items for me:
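A hypothetical sketch of the source-side filtering idea discussed earlier in the thread (the `Backend` struct and its field names are illustrative, not Cilium's actual types): drop backends whose conditions indicate they should not receive new traffic before exporting them to etcd.

```go
package main

import "fmt"

// Backend is an illustrative per-backend record with the condition fields
// under discussion.
type Backend struct {
	Addr        string
	Ready       bool
	Terminating bool
}

// exportable keeps only backends eligible for new traffic, mirroring the
// "filter at the source before writing to etcd" alternative: no conditions
// need to cross the clustermesh, at the cost of losing the terminating
// information for connection draining on the remote side.
func exportable(backends []Backend) []Backend {
	var out []Backend
	for _, b := range backends {
		if b.Ready && !b.Terminating {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	backends := []Backend{
		{Addr: "10.0.0.1", Ready: true},
		{Addr: "10.0.0.2", Ready: false},                   // never became ready
		{Addr: "10.0.0.3", Ready: true, Terminating: true}, // draining
	}
	fmt.Println(len(exportable(backends))) // only the first one survives
}
```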
It seems both of you are not convinced by protobuf so it would most likely come down to some subset of the following things if we keep all backends in one object:
And there is also the slice approach, which would solve some of the points above but introduce other things (mainly buffering when reading the slices). Let me know your thoughts on that; there are most likely some points above that we can discuss before I get the data on service churn and coalescing!
Sure, I agree with all of these reasons in theory. I think we are just lacking enough data at the moment to confirm whether (and which) would be a blocker, and the amount of benefit we could achieve by changing the format. Basically, justifying the extra complexity and risk of regressions with clear advantages. I'd personally suggest focusing mostly on relatively low (e.g., <25 backends), medium (e.g., 25-250 backends) and high (e.g., 250-2500 backends) scales, and not that much on super-high scale (e.g., >5k backends) in a first phase.
I agree there would not be a clear advantage in compressing the other objects, given that they are fairly small (and their size is bounded). If we were to do that, it would be mostly for consistency (depending on the implementation it may be convenient to do the same for all resources), and not really for the size benefits (assuming that the overhead would be negligible, otherwise that would be probably off the table).
Yep, not that I'm aware of.
I don't have a strong preference, it could be an extension of the
Ah, right, I didn't consider that one. Yep, I agree this is definitely a bigger limitation compared to the handling of the terminating condition, and we should improve on that. We may be able to get pretty far with filtering at the source as Marcel mentioned though (one case to pay attention to is if the service has […]
I ran a small, hacky benchmark which is not that easily repeatable (like less easy than running […]). I also had a small script that wrote a specific label with a counter every 100ms to force the ClusterService to be regenerated, and for the cilium-agent to watch it, trigger the full JSON decoding/statedb insert, and simulate some churn that could be caused by readiness. I also added some debug prints with the service name and my service label counter in […]. I noticed that the agents were processing each update sequentially and that the processing was slower than getting the next updates, which means the agent was gradually lagging behind. It looks like we would need to add a queue or buffer somewhere in case we want to coalesce updates (and not only rely on the etcd rate limit). I collected some pprof from one agent pulling the clustermesh services:
The most significant processing is in statedb. I compared the codepath from the k8s reflector https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/loadbalancer/reflectors/k8s.go#L340 vs clustermesh https://github.com/cilium/cilium/blob/9beb3286c4d2a64fe6fbd0c197d2483b9ff76307/pkg/clustermesh/service_merger.go#L84. It seems the main differences are that in the k8s reflector there's something that keeps previous backends outside statedb and cleans them before insertion, while in clustermesh we orphan backends afterwards by "querying" statedb. The k8s reflector also uses sequences, while in clustermesh that's not the case (and it seems the previous point helps with that). Those two things might help with memory usage (and also CPU-wise, since ~25% of the agent time in my "benchmark" is from GC). The k8s reflector also writes a single transaction for multiple service updates with its buffer that triggers upserts only after 500ms or 500 service updates, but the Commit functions only account for 2.43% of my CPU pprof, so it might not be that important (maybe it affects other things in the flamegraph though?). So it seems we have some possibilities to improve the statedb usage, I think? I'm not really an expert in this codepath though, and I might not have understood everything. Anyway, this is almost unrelated to this CFP, but at least it helps with understanding where some of our bottlenecks are for very large services. The k8s reflector code also uses the EndpointSlice name to compare the previous and current backend state; it could probably be done a bit differently, but it could mean having more similar code (or even common code) and might thus help in other paths than EndpointSliceSync. JSON decoding is also directly responsible for about 11.44%, which is about 1/5 of the statedb processing.
So while it's true that it isn't the most impactful thing, it isn't small either, and it might become a bit more significant if we optimize the statedb usage. I think our current concurrency model is that we have essentially one goroutine per cluster, which should make JSON decoding concurrent per cluster and might make it less important to fix, and there is also the new […]. About concurrency, IIUC there are locks when writing to statedb, so it might not help that much outside of decoding unfortunately (and would explain why the k8s reflector is single-threaded). Without much coalescing, it seems quite feasible to get into a state where you lag behind if multiple large (like ~1000 backends) services have some churn at the same time. So overall, I think that we can't really handle large services with a good amount of churn (in my test, 10 per second) currently, and whether we are filtering out ready=false backends or introducing conditions shouldn't change the churn that much, I think? This is not something that can't be fixed though, but it seems that we would need some optimizations before doing any of that (at least some coalescing), while having proper conditions would make us fully conformant in draining connections. On the slice vs non-slice front, I think it would mainly be a question of where we want to coalesce. If we have slices, it means we can't really coalesce meaningfully on the export path. While if we don't have slices, we can easily coalesce on the export path, which would also help with the etcd rate limit and network bandwidth directly (if we choose to compress, network bandwidth might not be that meaningful though). For both, we can also coalesce on the import path to have a single statedb commit for multiple services; I don't know if that matters a lot, or if the rest of the statedb usage optimizations would already be enough 🤔.
Also, in case we want to coalesce on both export and import, we might have to be careful about not introducing too much latency (like waiting 500ms on export and 500ms on import could be too much). Regarding EndpointSliceSync, queuing every EndpointSlice from a Service in a non-slice approach should be just fine, so it might not be super meaningful there. EDIT: actually the export side is fine, since the etcd rate limit and the workqueue on the syncstore should effectively coalesce things!
Ok! So I made a proper benchmark available here: https://github.com/MrFreezeex/cilium/tree/bench-clustermesh/pkg/clustermesh/benchmark. I based it on the loadbalancer benchmark (thanks @giorio94 for the pointer) so that I can more easily compare the loadbalancer k8s reflector and clustermesh, and also run it more easily than with actual Kubernetes clusters. It confirms some of the points I discovered with the hacky benchmark above! Following these new insights and our discussion, I rewrote almost the entire CFP; it now focuses a lot more on aligning with the loadbalancer k8s reflector, as there are some major performance differences in our current statedb usage. Let me know what you think about the new approach!
Rewrote the CFP to remove protobuf and focus on a JSON format based on the slim EndpointSlice. It also includes statedb performance considerations after doing end-to-end benchmarks. Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
Thanks! I haven't managed to look into all the details yet, but I think it would be valuable to get some profiling data from the benchmarks, and then involve @joamaki in the discussion if the statedb usage turns out to be the biggest bottleneck in this context.
The shorter transition period will mean supporting both codepaths for a shorter time. It also won't affect users that rely on the default settings (and stay within the supported bounds of keeping one minor version of skew). Now that we are targeting 1.20, it also leaves us more time to iron out rough edges in the first version where this is supported (even if disabled by default). Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
Here are two pprof collected by the new clustermesh benchmark, for 1000 backends and 5000 backends:
I added a new mode to the benchmark to "skip" JSON decoding (precompute everything before the benchmark). Since a big part of the CPU profile includes GC, it's a bit unclear what the exact impact of JSON decoding was (together with validating the ClusterService struct, which is skipped alongside). So the impact of skipping JSON decoding is quite big, but the majority of the cost is still not from JSON decoding (and also, technically, validating data, since this part is skipped alongside). And also, as the current version of the CFP says, there is […]. Still, even with that, if we have the same code as statedb, service churn with a small number of endpoint conditions changing (only one in the loadbalancer benchmark test that I present in the CFP) will most likely get dominated by JSON decoding. That's not necessarily a problem, but I am just stating it out loud (as keeping JSON has the benefit of reducing some QPS on etcd + JSON is easy to debug/similar to the rest of our data mostly).

30 svc 1000 backends with JSON decoding: https://pprof.me/8878f995d8c41fc5e414c6ca3762a114/
30 svc 1000 backends without JSON decoding: https://pprof.me/e8bb2a3984295ddcfc4ea3bac1b4f864/
30 svc 5000 backends with JSON decoding: https://pprof.me/28241dd7587d755f977461f353826682/
30 svc 5000 backends without JSON decoding: https://pprof.me/fd9556556ce88bd6b497a6f1cf02ab48/
StateDB write performance can be improved a lot by batching the writes, e.g. doing more writes with a single […]. Here's the difference in throughput seen in StateDB benchmarks: […] So even just 10 objects per […]. There's probably optimizations that could be made for […]
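To illustrate why batching amortizes the per-transaction overhead (this toy store is an assumption-laden sketch, not the real StateDB API): each transaction pays the write-lock acquisition and the commit once, regardless of how many objects it inserts, so inserting N objects per transaction divides that fixed cost by N.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy write-transactional map standing in for a StateDB table.
type store struct {
	mu   sync.Mutex
	data map[string]int
}

// txn stages inserts until Commit publishes them.
type txn struct {
	s      *store
	staged map[string]int
}

// WriteTxn pays the fixed per-transaction cost of taking the write lock.
func (s *store) WriteTxn() *txn {
	s.mu.Lock()
	return &txn{s: s, staged: map[string]int{}}
}

// Insert stages a write; it is cheap compared to the transaction overhead.
func (t *txn) Insert(k string, v int) { t.staged[k] = v }

// Commit publishes all staged writes and releases the lock: one commit for
// the whole batch instead of one per object.
func (t *txn) Commit() {
	for k, v := range t.staged {
		t.s.data[k] = v
	}
	t.s.mu.Unlock()
}

func main() {
	s := &store{data: map[string]int{}}
	tx := s.WriteTxn()
	for i := 0; i < 10; i++ { // one lock/commit for ten inserts
		tx.Insert(fmt.Sprintf("backend-%d", i), i)
	}
	tx.Commit()
	fmt.Println(len(s.data))
}
```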
Oh nice! I didn't have the data; I was mostly looking at the orphan logic, which is different from the k8s reflector and looks the most impactful, but it's nice to know how valuable batching would be here 👀 I think we would love to get a review/a deeper look at the CFP from you @joamaki whenever you have a chance 🙏. One of the goals of the current version of the CFP is also to have some shared logic with the k8s reflector. In the current version I am literally proposing to embed the slim EndpointSlice struct in a future "ClusterServiceV2" struct, which should help a lot in getting some shared code between the two while preventing future divergence. (There are also other reasons to do that, the most important one being to actually get the condition fields so that we can correctly exclude ready=false backends, and cover any other reasons you might want to exclude a backend.)
Hi! FYI the fact that we didn't take into account ready=false, serving=false is a regression from 1.19 :(. I submitted this cilium/cilium#43807 just in time! It does make things slightly better than what I wrote in this CFP, but IIUC it shouldn't fundamentally change the CFP content, since it still seems helpful to retain the conditions to get all the trickier condition behaviors working and also let the datapath phase things out correctly, while retaining the endpoints with some conditions indicating that they won't serve new traffic.
Hey, I haven't yet had time to dive into the CFP. I'll try to make time next week. I am thinking though that, regardless of the CFP, we should really look into doing equality checks in […]
Did a quick experiment to add in the equality checks for service/frontend/backend: Looks like the cost of the equality check isn't that high, and the other unrelated changes were only to the orphan detection. Neither seemed to impact performance too badly. @MrFreezeex would you be able to benchmark these changes? Would be quite interested in seeing what impact this would have.
It looks great! Before your patch I got this from my benchmark with 1000 backends per service: […] After your patch: […] Here is the pprof in case you want it: https://pprof.me/c737434ef8025fd1d342c79623f64848/. My "churn" test doesn't actually update any backend, so it's probably not that realistic, but it's still a 6x speedup there, which looks really good!
ClusterService should have excluded ready=false / serving=false backends but there was a regression which is now fixed. Also fixing this CFP text accordingly. Signed-off-by: Arthur Outhenin-Chalandre <git@mrfreezeex.fr>
This CFP proposes introducing v2 of the ClusterMesh global service data format stored in etcd and a transition from `cilium/state/services/v1/` to `cilium/state/services/v2/`. The benchmarks in this CFP are based on this (wip/poc) code: https://github.com/MrFreezeex/cilium/tree/bench-clusterservice/bench