Conversation
fvaleri
left a comment
@ppatierno thanks for the proposal and examples.
I left a few comments, but the approach LGTM.
fvaleri
left a comment
LGTM. Thanks for addressing my comments.
* ... (other reconcile operations) ...
* **KRaft quorum reconciliation**: analyze current quorum state, unregister and register controllers as needed (typically unregisters controllers being scaled down).
* scale down controllers.
* ... (other reconcile operations) ...
should we explain why there are other reconciliations in between so we understand the reason for the order of the reconciliation steps?
which "other reconciliations" are you referring to? This line "(other reconcile operations)" is referring to the usual operations we already have in the current reconciliation process.
I was mostly trying to understand whether there are other existing reconciliation steps that need to come before or after the quorum reconciliation step, aside from the scale down and scale up. I also wonder if there is a way to simplify this flow further, without having to do the full quorum reconciliation twice (for example, doing it once at the beginning or once at the end), or whether that would actually make things more complicated. If it makes things more complicated, I'm happy to go with the current proposal. I understand we do it twice so that we can do registration/unregistration and scale up/down in one reconciliation, but the reconciliation is idempotent and the steps may not complete in one reconciliation anyway if there is any failure/error, so the simplification might be worth doing. I will leave it up to you, as I'm not against how it is currently proposed.
@tinaselenge thanks for the feedback, here is the rationale and story behind this approach ...
At the beginning of prototyping this, I had two different methods: `unregisterControllers` to come before scale down and `registerControllers` to come after scale up. Then I decided to consolidate them into a single KRaft quorum reconciliation because they looked quite similar in terms of logic, but ... the fact remains that unregistration has to be done before scaling down (otherwise you can break the quorum by scaling down before unregistration), and registration has to be done after scale up (because the controllers don't exist yet if you try to register them beforehand).
Now, if we want to have just one place to reconcile, I see:
- at the beginning: it works out of the box for scaling down (unregistration first) but not for scale up. It means that when you are scaling up controllers in one reconciliation, the registration will happen later, in a subsequent reconciliation. It would work, as you mentioned it's idempotent, but it will happen with some delay (depending on the reconciliation period, 2 mins by default), unless some state changes in the Kafka CR and a new reconciliation is triggered immediately.
- at the end: it works out of the box for scaling up (registration of newly added controllers) but it doesn't for scale down. As mentioned, you cannot scale down without unregistering controllers first. To have quorum reconciliation at the end only, we would have to add logic to block the scale down if there are still registered controllers (something similar to what we have to block scale down if brokers are hosting partitions). I can have a think about it.
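Just to make the ordering concrete, here is a minimal sketch of the double-pass approach; the Vert.x-style composition and all method names are illustrative assumptions, not the actual Strimzi reconciler API:

```java
import io.vertx.core.Future;

// Sketch only: illustrates why the quorum reconciliation runs twice.
public class QuorumReconcileOrderSketch {
    Future<Void> reconcile() {
        return kraftQuorumReconciliation()              // pass 1: unregister controllers being scaled down
            .compose(v -> scaleDownControllers())       // safe now: removed nodes are no longer voters
            .compose(v -> otherReconcileOperations())   // the usual existing reconcile steps
            .compose(v -> scaleUpControllers())         // new controller pods are created here
            .compose(v -> kraftQuorumReconciliation()); // pass 2: register the newly created controllers
    }

    // Stubs so the sketch is self-contained; the real logic lives in the operator.
    Future<Void> kraftQuorumReconciliation() { return Future.succeededFuture(); }
    Future<Void> scaleDownControllers()      { return Future.succeededFuture(); }
    Future<Void> otherReconcileOperations()  { return Future.succeededFuture(); }
    Future<Void> scaleUpControllers()        { return Future.succeededFuture(); }
}
```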
@tinaselenge I thought about the idea of having the KRaft quorum reconciliation just at the end, but it can't work easily in case of a metadata disk change. Let's consider this scenario:
- The `KafkaRoller` detects there is a disk change (it has to unregister the old controller and register the new one).
- It runs the unregistration first but ... it FAILS.
- The overall operator reconciliation fails.
Now, with the current implementation:
- On the next reconciliation, at the beginning, the call to `KRaftQuorumReconciler` retries the unregistration of the failed one, then registers the new one.
- WARN: this is needed because otherwise the `KafkaRoller` can't proceed with the next controller to be rolled.
But with `KRaftQuorumReconciler` at the end only:
- On the next reconciliation, at the beginning, there is NO call to `KRaftQuorumReconciler` and the flow goes directly to the `KafkaRoller` without unregistering the failed one.
- The `KafkaRoller` will just start to go through the other controllers to be rolled, but without unregistering the already rolled one first.
NOTE: the same scenario could happen with the unregistration working but the registration failing on the first reconciliation.
When there is a metadata disk change, we have the following constrained flow:
For each controller in the list:
- roll the controller
- the controller starts and the new metadata disk is formatted (needed for reading meta.properties during the KRaft quorum reconciliation)
- KRaft quorum reconciliation on the single node (unregister the old incarnation, register the new incarnation)
We CAN'T proceed with the next controller to be rolled until the previous one has gone through the above flow.
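Expressed as pseudocode (all helper names here are hypothetical, only the ordering matters):

```java
// Hypothetical helpers; only the ordering reflects the constrained flow above.
for (NodeRef controller : controllersWithMetadataDiskChange) {
    rollPod(controller);                        // restart the pod with the new metadata volume
    awaitReady(controller);                     // new disk formatted, meta.properties readable
    Uuid oldDirectoryId = previousDirectoryId(controller);
    Uuid newDirectoryId = readDirectoryIdFromMetaProperties(controller);
    unregisterController(controller.nodeId(), oldDirectoryId); // drop the old incarnation from the voters
    registerController(controller.nodeId(), newDirectoryId);   // add the new incarnation
    // only now move to the next controller; skipping this would leave a stale
    // voter in the quorum and block the KafkaRoller
}
```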
PaulRMellor
left a comment
It’s clear a lot of careful research has gone into this proposal. I’ve left suggestions to help with clarity and reduce any potential ambiguities, with some queries, but the overall direction looks good to me.
im-konge
left a comment
Very thorough proposal, thank you very much, LGTM.
tinaselenge
left a comment
Thanks for the proposal Paolo. Well detailed and thought through. I have just one comment regarding the reconciliation flow to see what you think. Otherwise, I'm happy with this proposal.
see-quick
left a comment
Overall looks good to me. I have verified your PoC (while I had some issues there, I eventually resolved them all by myself; I think nothing was related to your code).
I also tried multiple cases, all with separated roles (I didn't try combined mode):
- scale-up
- scale-down
- scale-down to 1 controller
- scale up from 3 to 5 controllers and immediately go to 3 again (not waiting for readiness of having 5 controllers)
- disk failure (i.e., I also tried to delete the PVC to see if the operator would be able to handle it ...)
I also tried some niche cases:
- operator crash during scale-up (basically, when I saw the `Registering controller` log in the CO, I deleted the CO pod so I could see whether the CO could handle such a failover)
- similarly with scale-down
All went well without any issues. Eventually, it came to my mind to check that there is no data loss in the exchange of messages while scaling controllers up and down (using the perf Kafka client), but I think we can test that when you create a PR (or when I find some more time for this ...).
Anyway thanks for working on this and +1 for me @ppatierno 👍.
scholzj
left a comment
I left some comments. Also, some points not covered:
- It sounds like the dynamic quorum does not allow backup and recovery through volume snapshots unless you manage to recover the original names?
- How does one change the address of the controller node (e.g. to be not `node:9091` anymore but `node:9089`)?
- How does one change the security configuration of the controller nodes?
Overall, I have to think whether I'm -1 or -0.9999 on this. That is not the fault of this proposal. But I think this is not well done in Kafka and has terrible UX that is not ready for any automation. Funny enough, you start the proposal by talking about parity with ZooKeeper. I wish Kafka learned from that. And while implementing this proposal might work for most situations, it creates a terrible UX and it is not clear if there is any way to improve it. So I'm not convinced that it is actually better than the current limitations of the static quorum.
A possible way for scaling controllers is:

* pause the cluster reconciliation.
* delete the controllers' `StrimziPodSet`(s) so that the controller pods are deleted.
* update the number of replicas within the `KafkaNodePool`(s) custom resource(s) related to controllers (increase or decrease).
* unpause the cluster reconciliation.
I don't think this is needed. You can just change it and once it gets stuck rolling the controllers you can just roll the remaining nodes manually.
Yeah, I was mentioning a "possible way", but even what you are describing should do the same without the need to pause/unpause. I will change it if you like; no strong opinion. It's going to be just an example of how it (doesn't) work today.
Actually, scale-up seems to work fine. Scale-down requires manual deletion of one or more pods.
It might also be worth mentioning the node unregistration issues discussed today on Slack, as that is IMHO a much bigger issue here if it cannot be worked around.
> It might also be worth mentioning the node unregistration issues discussed today on Slack, as that is IMHO a much bigger issue here if it cannot be worked around.

The discussion was about a cluster using a static quorum and a user trying to scale controllers, which is officially not supported in Apache Kafka. Controllers registration/unregistration doesn't apply when you have a static quorum, so I can't understand what you are referring to.
You are talking here about what the static quorum cannot do. So my comment is that you make some parts of it seem much more complicated than they really are, but at the same time you are skipping some important parts.
KRaft dynamic controller quorum enables the controllers scaling without downtime.
In general, this operation can be useful for increasing the number of controllers when needed (or, of course, decreasing them).
Also, sometimes, replacing a controller because of a disk or hardware failure is something that can be done by scaling.
I do not think you need scaling when you have a broken disk or bare-metal server:
- You can easily just fix the existing controller (e.g. by having it scheduled on a new node or starting with a new disk)
- Solving this by scaling does not make sense, as you would have the broken node either missing (if you scale down and remove it) or still there, breaking your reconciliation. And it will screw up your quorum.
The KIP seemed to say that to replace the disk of a controller node you would need to load some state onto the disk, so my understanding from the KIP is that if you have a hardware failure the dynamic quorum would allow you to remove that node and add a new one with a fresh disk, without having to load that state somehow. Is that right, @ppatierno?
It's true that a disk failure can be easily fixed by adding a new disk and restarting the controller. KIP-853 explains how to do it here. @katheris, maybe this is the part you are referring to.
But it's also true that you can scale up your quorum with a new controller and then scale down the failed one (actually removing it from the quorum). Of course, it seems too much compared to just replacing the disk, but it's feasible. Anyway, I agree and will update the sentence by not mentioning the scaling but mentioning that the disk replacement works because of the dynamic quorum with the corresponding controller unregistration/registration operations.
So, the point is ...
- Disk replacement works fine with static quorum because you just replace it I guess.
- Disk replacement might work fine with a dynamic quorum, but we should not force users to scale up and down for it. The cloud native way is to fix the disk in the existing Pod. So how will that be handled in this proposal? I assume it is more similar to the JBOD storage handling and to moving metadata between volumes?
I removed the sentence about scaling for fixing disk failures.
> So how will that be handled in this proposal? I assume it is more similar to the JBOD storage handling and to moving metadata between volumes?
I think it's already explained across the proposal. For example, the user updates the `KafkaNodePool` by adding a new JBOD disk and marking it as the one to host metadata (the user also removes the old disk). The change will trigger a controller rolling (as already happens today) handled by the `KafkaRoller`, which will be in charge of unregistering the controller with the old directory ID and registering the same controller with the new directory ID.
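For reference, KIP-853 exposes the two operations through the Kafka Admin API, so the unregister/register step could look roughly like the sketch below; the bootstrap address, listener name, node ID and directory IDs are placeholders, and I'm assuming the `removeRaftVoter`/`addRaftVoter` method shapes from KIP-853:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RaftVoterEndpoint;
import org.apache.kafka.common.Uuid;

import java.util.Map;
import java.util.Set;

public class DirectoryIdSwapSketch {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "my-cluster-kafka-bootstrap:9092"))) {
            int nodeId = 3; // placeholder controller ID
            // Unregister the old incarnation (directory ID of the removed disk) ...
            admin.removeRaftVoter(nodeId, Uuid.fromString("r9VGTiw2QUCoDUadCt6q2g")).all().get();
            // ... then register the same node with the directory ID of the new disk
            // (as read from meta.properties after the roll).
            admin.addRaftVoter(nodeId, Uuid.fromString("Anjr8banTey_LqVZsppzYg"),
                    Set.of(new RaftVoterEndpoint("CONTROLPLANE-9090", "my-cluster-controller-3", 9090)))
                .all().get();
        }
    }
}
```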
Beyond these operational benefits, dynamic quorum support is critical for Apache Kafka's strategic direction and production readiness.
With ZooKeeper support officially removed in Apache Kafka 4.0, KRaft is now the only metadata management option for Kafka clusters.
This makes dynamic quorum capabilities essential rather than optional and it's a fundamental requirement for KRaft to be truly production-ready and not a regression from ZooKeeper-based deployments.
However, it's important to note that while the core dynamic quorum functionality was introduced in Kafka 3.9.0 (via KIP-853), certain bug fixes, improvements and new features like quorum migration from static to dynamic are only available starting from Apache Kafka 4.1.0.
Are you saying that Strimzi has not been production ready from 0.46 on? Are you saying that Kafka was not production ready at least in 3.9 and 4.0 with KRaft when it did not work properly?
I am just saying that Kafka 3.9 added the support for the dynamic quorum, but that version didn't have the support for migration from static to dynamic, which was added in Kafka 4.0. This version also brought some improvements and bug fixes in the dynamic quorum itself.
You are saying that dynamic quorum support is critical for Apache Kafka's production readiness. So yes, I think you are saying that KRaft in Strimzi is not production ready. IIRC the dynamic quorum did not really work in practice in 3.9.0 and 4.0.0. But to be clear, I do not think it is critical for production readiness. So it does not matter that much whether it worked or not.
When upgrading the Strimzi operator to a release supporting the dynamic quorum, existing clusters that use the static quorum will be automatically migrated to use dynamic quorum.
Of course, new Apache Kafka clusters are also deployed by using the dynamic quorum.

### Downgrade
I'm confused. You say that downgrade is not possible. But you pretend that it is fine. That seems inconsistent.
I mean that downgrading to a "static quorum" is not possible. When the cluster is migrated automatically to the "dynamic quorum" and then you downgrade to the previous operator, the Kafka cluster will still continue to use the dynamic quorum, even though the nodes will be configured with a "voters" configuration, because of course an old operator doesn't know about the "controllers bootstrap". But once kraft.version is set to 1 (dynamic quorum), it can't go back to 0 (static quorum). So where do you see I was not clear in the description?
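As a side note, the quorum type currently in use can be checked from the finalized `kraft.version` feature level via the Admin API; a small sketch (the bootstrap address is a placeholder):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.FeatureMetadata;
import org.apache.kafka.clients.admin.FinalizedVersionRange;

import java.util.Map;

public class KRaftVersionCheckSketch {
    public static void main(String[] args) throws Exception {
        try (Admin admin = Admin.create(Map.<String, Object>of("bootstrap.servers", "my-cluster-kafka-bootstrap:9092"))) {
            FeatureMetadata features = admin.describeFeatures().featureMetadata().get();
            FinalizedVersionRange kraftVersion = features.finalizedFeatures().get("kraft.version");
            // 0 = static quorum, 1 = dynamic quorum; once at 1 it cannot go back to 0
            boolean dynamicQuorum = kraftVersion != null && kraftVersion.maxVersionLevel() >= 1;
            System.out.println("Dynamic quorum in use: " + dynamicQuorum);
        }
    }
}
```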
It doesn't work well in a scenario where the Apache Kafka cluster has combined nodes and we want to scale down controllers by removing the "controller" role from one or more of them.
In this case, the node is not actually shut down and removed forever but is rolled with a new configuration (i.e. the "controller" role is removed).
But in a Kubernetes-based environment where the rolling of a pod is driven by the platform, there is no opportunity to execute a step (like the controller unregistration) between the shutdown and the startup.
To overcome this issue, the best approach would be to unregister the controller first and then roll the node as broker only, but as mentioned before it would join again right after the unregistration.
Why can't you:
- Roll the pod to be in the broker role only
- Unregister it from the KRaft voters?
Given it is not a controller anymore, it should not auto-join anymore, or? If it does, that seems like a major bug in Kafka.
> If it does, that seems like a major bug in Kafka.
Kind of ... yes, here is what I opened: https://issues.apache.org/jira/browse/KAFKA-19867
The discussion ended with the need for a re-design of auto-join, which is not planned right now. @showuon can add more details to it.
katheris
left a comment
Thanks @ppatierno I added a bunch of comments, but I don't have any big objections to the proposal as it is, more just clarifying questions.
The two areas that gave me "pause for thought" are needing to wait for multiple reconciliations to fully register or unregister nodes, and needing to have different behaviour in terms of whether we generate the directory IDs for initial startup vs scaling. However, based on my understanding of the Kafka behaviour, I understand why you have made the proposal you have, and I don't have immediate suggestions for alternatives that would be better or easier to follow.
I will take a little longer before approving, just to think it over some more and take a look at the PoC to see how complex the code will be from an understanding and maintenance PoV.
controllers:
- id: 3
  directoryId: "r9VGTiw2QUCoDUadCt6q2g"
- id: 4
  directoryId: "r8E7BKRbR0SQxSIFro6wjg"
- id: 5
  directoryId: "Anjr8banTey_LqVZsppzYg"
I'm not aware of the push-back, are you able to expand or link to some discussion for me to see the context?
* Existing cluster, scale up:
  * broker and controller needs the `--no-initial-controllers` option.

The proposal is for the Strimzi Cluster Operator to manage the list of current controllers, with their corresponding directory IDs, in the `Kafka` custom resource status.
This is a minor point, but in terms of addressing the requirement of passing the controllers list to the nodes, it might make sense to mention the ConfigMap first, and then mention that this data will also be stored in the status, to be used for disaster recovery and for users to more easily understand the set of registered voters.
The thing is that the content of the controllers list for the ConfigMap comes from the "controllers" field within the status if it's not a new Kafka cluster. This is the reason why I mention the status first.
- Split quorum risk: When multiple controllers are scaled up simultaneously and formatted with `--initial-controllers` including all newly added controllers, they can form a separate competing quorum independent of the existing voters if election timeouts expire before they fetch metadata from the active quorum. This violates Raft consensus safety guarantees by creating two independent quorums operating on the same cluster.
- Undocumented behavior: This approach is not documented in the official Apache Kafka documentation or KIP-853. Relying on undocumented behavior creates a risk that future Kafka versions could change or break this functionality without notice.

For these reasons, the proposal follows the official documented approach: using `--initial-controllers` only for initial cluster bootstrap and `--no-initial-controllers` for scale-up operations.
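If I read the proposal right, the decision boils down to something like this sketch (the helper and its parameters are hypothetical; only the two `kafka-storage.sh` flags come from Kafka):

```java
// Hypothetical helper summarizing the rule above; only the flags are real.
static String storageFormatOption(boolean newCluster, boolean controller, String initialControllersSpec) {
    if (newCluster && controller) {
        // initial cluster bootstrap only, e.g. "3@my-cluster-controller-3:9090:<directoryId>,..."
        return "--initial-controllers " + initialControllersSpec;
    }
    // brokers of a new cluster, plus any broker or controller added by a
    // scale-up, join the quorum without being listed as initial voters
    return "--no-initial-controllers";
}
```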
The reasons listed for discarding this option are pretty weak.
- Unnecessary checkpoint file: This is pretty much irrelevant as the file is tiny.
- Split quorum risk: When formatting you should only add up to (n/2)-1 new controllers. That type of limitation is relatively common in distributed systems, so I would not consider it blocking.
- Undocumented behavior: As we discussed privately, this behavior works and we can document it in Apache Kafka if that would help.
Are there other technical reasons?
Yeah, that was pretty much my take as well here.
> Unnecessary checkpoint file: This is pretty much irrelevant as the file is tiny.
I said this several times during my investigation and our discussions.
> Undocumented behavior: As we discussed privately, this behavior works and we can document it in Apache Kafka if that would help.
Tbh it's not exactly what we agreed. We had several discussions; I also sent an email about documenting this usage (no answer for a month, then only Luke replied); we also worked together on fixing a test and adding a new one about this scenario. But offline we seemed to agree with @showuon and his investigation, so my PR was closed and my email was answered. You can find the references here:
https://lists.apache.org/thread/62p9svvdzgpgjd7o3kws3f917w6voqqn
apache/kafka#21507
If things are different or you reached a different conclusion, I don't understand why it wasn't raised during our previous discussions.
Also, what does it mean for the Kafka project? Documenting that you can do it, but with the limitation of adding up to (n/2)-1 only? It looks to me more like stretching a way to do the formatting than documenting a working solution.
Mickael is correct, we can avoid the "split brain" issue by limiting the number of controllers to be scaled up/down at one time. But I think that would also need additional logic to enforce the limitation, and we would end up with the same result here. So I think @ppatierno can add some more wording here to emphasize the reason why you don't want to go this way. You must have some reason, like the UX not being good for users if they want to scale controllers from 1 -> 3, which would be rejected, etc., to persuade readers.

@ppatierno Any chance you have answers to these points? I think they are pretty clear as well.

To be honest, I do not think we have to be a passive receiver of every bad design Kafka comes up with. We need to take a more active role in shaping the future. I think we are completely failing at it as Strimzi by completely ignoring what is going on in Kafka. And it is really sad to see that the things I raised as a problem over a year ago have not changed, and that we just seem to have decided to accept it. (And yes, I can use excuses such as sabbatical, burn-out, and whatever. But this is obviously my personal failure 😞)

And while I understand that the dynamic quorum configuration is an important change, I think the dynamic quorum as designed could also be a major blocker for future Strimzi development. So I think what we should be asking here is not whether this is the best we could do with the Kafka design. We should be asking whether this is a good Strimzi design. And if we don't think it is a good Strimzi design, we should go looking for a better solution. Maybe that means a different approach, maybe that means we need to rework other things first, or maybe it means going to Kafka and changing how the Kafka design works. But saying that we should approve this just because it is the least bad of the options is shortsighted.
I think it is important to feed your discoveries and pain points back to Kafka. For some of these features, Strimzi is one of the first projects to test them to the limit. From the long discussions, here are some of the pain points I noted. I'm sure I missed plenty, and may have misunderstood some, so please correct me and include plenty of details, as I don't have the Strimzi expertise to necessarily understand why something is an issue/pain point.
@scholzj What are the other "bad Kafka designs" you encountered? (Only considering dynamic quorum 😉, we can discuss others at another time)
@mimaison I think it is hard to sum up into a few points:
I obviously do not understand Kafka internals, so I might not understand some motivations. But the Kubernetes landscape is dynamic and is constantly changing. And I'm seriously concerned that these design decisions might hinder Strimzi development because, despite being called dynamic, they will make any changes almost impossible to orchestrate.
Hi @scholzj,
Wdyt?
@ppatierno It is not just about DNS changes but also authentication changes, port changes, etc.
This proposal is about adding the support for dynamic quorum and controllers scaling to the Strimzi Cluster Operator.
It replaces #190.
I have already been working on a PoC to validate what is currently written in this proposal.
I also added some scenarios of dynamic quorum and controller scaling usage with both happy paths and failures.
It is also possible to try it by deploying a Strimzi Cluster Operator using the following images in the Deployment file:
- `quay.io/ppatierno/operator:dynamic-quorum`
- `quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.1.1`
- `quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.2.0`