
Dynamic controller quorum #203

Open
ppatierno wants to merge 21 commits into strimzi:main from ppatierno:dynamic-controller-quorum

Conversation

@ppatierno
Member

This proposal is about adding support for dynamic quorum and controller scaling to the Strimzi Cluster Operator.
It replaces #190.
I have already been working on a POC to validate what is currently written in this proposal.
I also added some scenarios of dynamic quorum and controller scaling usage, with both happy paths and failures.
It is also possible to try it by deploying a Strimzi Cluster Operator that uses the following images in the Deployment file:

  • operator: quay.io/ppatierno/operator:dynamic-quorum
  • Kafka 4.1.1: quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.1.1
  • Kafka 4.2.0: quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.2.0
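For reference, here is roughly where those images plug into the operator Deployment (a sketch based on the standard Strimzi install files; the file name, container name, and env var layout should be double-checked against your install):

```yaml
# 060-Deployment-strimzi-cluster-operator.yaml (illustrative excerpt)
spec:
  template:
    spec:
      containers:
        - name: strimzi-cluster-operator
          image: quay.io/ppatierno/operator:dynamic-quorum
          env:
            # Maps Kafka versions to the Kafka images the operator deploys
            - name: STRIMZI_KAFKA_IMAGES
              value: |
                4.1.1=quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.1.1
                4.2.0=quay.io/ppatierno/kafka:dynamic-quorum-kafka-4.2.0
```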

Signed-off-by: Paolo Patierno <ppatierno@live.com>
Member

@fvaleri fvaleri left a comment

@ppatierno thanks for the proposal and examples.

I left a few comments, but the approach LGTM.

Member

@fvaleri fvaleri left a comment

LGTM. Thanks for addressing my comments.

* ... (other reconcile operations) ...
* **KRaft quorum reconciliation**: analyze current quorum state, unregister and register controllers as needed (typically unregisters controllers being scaled down).
* scale down controllers.
* ... (other reconcile operations) ...
Contributor

Should we explain why there are other reconciliations in between, so we understand the reason for the order of the reconciliation steps?

Member Author

Which "other reconciliations" are you referring to? The "(other reconcile operations)" line refers to the usual operations we already have in the current reconciliation process.

Contributor

I was mostly trying to understand whether there are other existing reconciliation steps that need to come before or after the quorum reconciliation step, aside from the scale down and scale up. I also wonder if there is a way to simplify this flow further, without having to do the full quorum reconciliation twice: for example, doing it once at the beginning or once at the end, or whether that would actually make things more complicated. If it does, I'm happy to go with the current proposal. I understand we do it twice so that we can do registration/unregistration and scale up/down in one reconciliation, but the reconciliation is idempotent and the steps may not complete in one reconciliation anyway if there is any failure/error, so the simplification might be worth doing. I will leave it up to you, as I'm not against how it is currently proposed.

Member Author

@tinaselenge thanks for the feedback, here is the rationale and story behind this approach ...

At the beginning of prototyping this, I had two different methods: unregisterControllers to come before scale down and registerControllers to come after scale up. Then I decided to consolidate them into a single "reconcile KRaft quorum" step because they looked quite similar in terms of logic, but ... the fact remains that unregistration has to be done before scaling down (otherwise you can break the quorum by scaling down before unregistration), and registration has to be done after scale up (because the controllers don't exist yet if you try to register them beforehand).
Now, if we want to have just one place to reconcile, I see two options:

  • at the beginning: it works out of the box for scaling down (unregistration first) but not for scale up. It means that when you are scaling up controllers in one reconciliation, the registration will happen later in a subsequent reconciliation. It would work, as you mentioned it's idempotent, but it will happen with some delay (depending on the reconciliation period, 2 mins by default), unless some state changes in the Kafka CR and a new reconciliation is triggered immediately.
  • at the end: it works out of the box for scaling up (registration of newly added controllers) but not for scale down. As mentioned, you cannot scale down without unregistering controllers first. To have quorum reconciliation at the end only, we would need logic to block the scale down while there are still registered controllers (something similar to what we have to block scale down when brokers are hosting partitions). I can have a think about it.
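As a toy illustration of why the two constraints pull in opposite directions: the voter-set diff itself is symmetric, and only the point in time at which each half may run differs (illustrative Python with made-up names, not the actual Strimzi code):

```python
def reconcile_quorum(desired_ids, registered_ids):
    """Return the voter changes needed, honouring the ordering constraints:
    unregistration must happen BEFORE scale-down (or the quorum can break),
    registration must happen AFTER scale-up (the new nodes must exist first)."""
    to_unregister = sorted(registered_ids - desired_ids)  # run before pods are deleted
    to_register = sorted(desired_ids - registered_ids)    # run after pods are running
    return to_unregister, to_register

# Scale down 5 -> 3: controllers 3 and 4 must be unregistered first.
assert reconcile_quorum({0, 1, 2}, {0, 1, 2, 3, 4}) == ([3, 4], [])

# Scale up 3 -> 5: controllers 3 and 4 are registered only after they start.
assert reconcile_quorum({0, 1, 2, 3, 4}, {0, 1, 2}) == ([], [3, 4])
```

Running the same diff at both ends of the reconciliation is what lets one reconciliation handle both halves.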

Member Author

@tinaselenge I thought about the idea of having the KRaft quorum reconciliation just at the end, but it can't work easily in case of a metadata disk change. Let's consider this scenario:

  • The KafkaRoller detects there is a disk change (it has to unregister the old controller and register the new one)
  • It runs the unregistration first but ... it FAILS
  • The overall operator reconciliation fails

Now, with the current implementation:

  • On the next reconciliation, at the beginning, the call to KRaftQuorumReconciler retries unregistering the failed one, then registers the new one
  • WARN: this is needed because otherwise the KafkaRoller can't proceed with the next controller to be rolled

But with KRaftQuorumReconciler at the end only:

  • On the next reconciliation, at the beginning, there is NO call to KRaftQuorumReconciler and the flow goes directly to KafkaRoller without unregistering the failed one.
  • The KafkaRoller will just start going through the other controllers to be rolled, without unregistering the already rolled one first.

NOTE: the same scenario could happen with the unregistration working but the registration failing on first reconciliation.

When there is a metadata disk change, we have the following constrained flow:

For each controller in the list:

  • roll the controller
  • the controller starts and the new metadata disk is formatted (this is needed for reading meta.properties during KRaft quorum reconciliation)
  • KRaft Quorum reconciliation on single node (unregister old incarnation, register new incarnation)

We CAN'T proceed with the new controller to be rolled until the previous one is gone through the above flow.
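The constrained flow above can be sketched as a tiny model (illustrative Python with made-up names, not the Strimzi implementation): each controller is rolled one at a time, and the quorum entry for its old incarnation is swapped for the new one before the next controller is touched.

```python
from dataclasses import dataclass, field

@dataclass
class Controller:
    node_id: int
    directory_id: str  # from meta.properties; changes when the metadata disk is reformatted

@dataclass
class Quorum:
    # Voters keyed by (node_id, directory_id): a reformatted node is a new incarnation.
    voters: set = field(default_factory=set)

    def remove_voter(self, node_id, directory_id):
        self.voters.discard((node_id, directory_id))

    def add_voter(self, node_id, directory_id):
        self.voters.add((node_id, directory_id))

def replace_metadata_disks(controllers, quorum, new_directory_ids):
    """One controller at a time: roll with the new disk, then swap the quorum
    entry. The next controller is not touched until the previous swap is done."""
    for node, new_dir in zip(controllers, new_directory_ids):
        old_dir = node.directory_id
        node.directory_id = new_dir                        # roll + format new metadata disk
        quorum.remove_voter(node.node_id, old_dir)         # unregister old incarnation
        quorum.add_voter(node.node_id, node.directory_id)  # register new incarnation

quorum = Quorum({(0, "dirA"), (1, "dirB"), (2, "dirC")})
replace_metadata_disks([Controller(0, "dirA")], quorum, ["dirA2"])
assert quorum.voters == {(0, "dirA2"), (1, "dirB"), (2, "dirC")}
```

If the swap fails mid-loop, retrying the same swap on the next pass is what the beginning-of-reconciliation quorum step provides.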

Contributor

@PaulRMellor PaulRMellor left a comment

It’s clear a lot of careful research has gone into this proposal. I’ve left suggestions to help with clarity and reduce any potential ambiguities, with some queries, but the overall direction looks good to me.

Member

@im-konge im-konge left a comment

Very thorough proposal, thank you very much, LGTM.

Contributor

@tinaselenge tinaselenge left a comment

Thanks for the proposal Paolo. Well detailed and thought through. I have just one comment regarding the reconciliation flow to see what you think. Otherwise, I'm happy with this proposal.

Member

@see-quick see-quick left a comment

Overall looks good to me. I have verified your PoC (while I had some issues there, I eventually resolved them all by myself; I think nothing was related to your code).

I also tried multiple cases, all with separated roles (I didn't try combined mode):

  1. scale-up
  2. scale-down
  3. scale-down to 1 controller
  4. scale up from 3 to 5 controllers and immediately go back to 3 (not waiting for all 5 controllers to be ready)
  5. disk failure (i.e., I also tried to delete the PVC to see if the operator would be able to handle it ...)

I also tried some niche cases:

  1. operator crash during scale-up (basically, when I saw the Registering controller log in the CO, I deleted the CO pod to see if the CO could handle such a failover)
  2. similarly with scale-down

All went well without any issues. Eventually, it came to my mind to check that there is no data loss in the exchange of messages while scaling controllers up and down (using the Kafka perf client), but I think we can test that when you create a PR (or when I find some more time for this ...).

Anyway thanks for working on this and +1 for me @ppatierno 👍.

Member

@scholzj scholzj left a comment

I left some comments. Also, some points not covered:

  • It sounds like the dynamic quorum does not allow backup and recovery through volume snapshots unless you manage to recover the original names?
  • How does one change the address of the controller node? (e.g. to be not node:9091 anymore but node:9089)
  • How does one change the security configuration of the controller nodes?

Overall, I have to think whether I'm -1 or -0.9999 on this. That is not the fault of this proposal. But I think this is not well done in Kafka and has terrible UX that is not ready for any automation. Funny enough, you start the proposal by talking about parity with ZooKeeper. I wish Kafka learned from that. And while implementing this proposal might work for most situations, it creates a terrible UX and it is not clear if there is any way to improve it. So I'm not convinced that it is actually better than the current limitations of the static quorum.

Comment on lines +12 to +17
A possible way for scaling controllers is:

* pause the cluster reconciliation.
* delete the controllers' `StrimziPodSet`(s) so that the controller pods are deleted.
* update the number of replicas within the `KafkaNodePool`(s) custom resource(s) related to controllers (increase or decrease).
* unpause the cluster reconciliation.
Member

I don't think this is needed. You can just change it and once it gets stuck rolling the controllers you can just roll the remaining nodes manually.

Member Author

Yeah, I was mentioning a "possible way", but even what you are describing should achieve the same without the need to pause/unpause. I will change it if you like, no strong opinion. It's going to be just an example of how it (doesn't) work today.

Member

Actually, scale-up seems to work fine. Scale-down requires manual deletion of one or more pods.

It might be also worth mentioning the node unregistration issues discussed today on Slack, as that is IMHO a much bigger issue here if it cannot be worked around.

Member Author

It might be also worth mentioning the node unregistration issues discussed today on Slack, as that is IMHO a much bigger issue here if it cannot be worked around.

The discussion was about a cluster using a static quorum and a user trying to scale controllers, which is officially not supported in Apache Kafka. Controller registration/unregistration doesn't apply when you have a static quorum, so I don't understand what you are referring to.

Member

You are talking here about what the static quorum cannot do. So my comment is that you make some parts of it sound much more complicated than they really are, but at the same time you are skipping some important parts.


KRaft dynamic controller quorum enables the controllers scaling without downtime.
In general, this operation can be useful for increasing the number of controllers when needed (or, of course, decreasing them).
Also, sometimes, replacing a controller because of a disk or hardware failure is something that can be done by scaling.
Member

I do not think you need scaling when you have a broken disk or bare-metal server:

  • You can easily just fix the existing controller (e.g. by having it scheduled on a new node or started with a new disk)
  • Solving this by scaling does not make sense, as you would have the broken node either missing (if you scale down and remove it) or still there breaking your reconciliation. And it will screw up your quorum.

Member

The KIP seemed to say that to replace the disk of a controller node you would need to load some state onto the disk, so my understanding from the KIP is that if you have a hardware failure the dynamic quorum would allow you to remove that node and add a new one with a fresh disk without having to load that state somehow, is that right @ppatierno ?

Member Author

It's true that a disk failure can be easily fixed by adding a new disk and restarting the controller. KIP-853 explains how here. @katheris, maybe this is the part you are referring to.
But it's also true that you can scale up your quorum with a new controller and then scale down the failed one (actually removing it from the quorum). Of course, it seems too much compared to just replacing the disk, but it's feasible. Anyway, I agree and will update the sentence to not mention scaling, but to mention that disk replacement works thanks to the dynamic quorum with the corresponding controller unregistration/registration operations.

Member

So, the point is ...

  • Disk replacement works fine with static quorum because you just replace it I guess.
  • Disk replacement might work fine with a dynamic quorum, but we should not force users to scale up and down for it. The cloud native way is to fix the disk in the existing Pod. So how will that be handled in this proposal? I assume it is more similar to the JBOD storage handling and to moving metadata between volumes?

Member Author

I removed the sentence about scaling for fixing disk failures.

So how will that be handled in this proposal? I assume it is more similar to the JBOD storage handling and to moving metadata between volumes?

I think it's already explained across the proposal. For example, the user updates the KafkaNodePool by adding a new JBOD disk and marking it as the one to host metadata (user also removes the old disk). The change will trigger a controller rolling (as it happens already today) handled by the KafkaRoller and it will be in charge of unregistering the controller with the old directory ID and registering the same controller with the new directory ID.
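As a concrete (hypothetical) example of the KafkaNodePool change described here, assuming the `kraftMetadata` marker from the Strimzi JBOD storage API (volume IDs and sizes are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: controllers
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - controller
  storage:
    type: jbod
    volumes:
      # the old volume (e.g. id: 0) has been removed from this list by the user
      - id: 1
        type: persistent-claim
        size: 20Gi
        kraftMetadata: shared   # new volume hosting the KRaft metadata log
```

The roll triggered by this change is where the per-node unregistration of the old directory ID and registration of the new one would happen.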

Comment on lines +32 to +35
Beyond these operational benefits, dynamic quorum support is critical for Apache Kafka's strategic direction and production readiness.
With ZooKeeper support officially removed in Apache Kafka 4.0, KRaft is now the only metadata management option for Kafka clusters.
This makes dynamic quorum capabilities essential rather than optional and it's a fundamental requirement for KRaft to be truly production-ready and not a regression from ZooKeeper-based deployments.
However, it's important to note that while the core dynamic quorum functionality was introduced in Kafka 3.9.0 (via KIP-853), certain bug fixes, improvements and new features like quorum migration from static to dynamic are only available starting from Apache Kafka 4.1.0.
Member

Are you saying that Strimzi was not production ready from 0.46 on? Are you saying that Kafka was not production ready with KRaft, at least in 3.9 and 4.0, when it did not work properly?

Member Author

I am just saying that Kafka 3.9 added support for the dynamic quorum, but that version didn't support migration from static to dynamic, which was added in Kafka 4.0. That version also brought some improvements and bug fixes in the dynamic quorum itself.

Member

You are saying that dynamic quorum support is critical for Apache Kafka's production readiness. So yes, I think you are saying that KRaft in Strimzi is not production ready. IIRC the dynamic quorum did not really work in practice in 3.9.0 and 4.0.0. But to be clear, I do not think it is critical for production readiness. So it does not matter that much whether it worked or not.

When upgrading the Strimzi operator to a release supporting the dynamic quorum, existing clusters that use the static quorum will be automatically migrated to use dynamic quorum.
Of course, new Apache Kafka clusters are also deployed by using the dynamic quorum.

### Downgrade
Member

I'm confused. You say that downgrade is not possible. But you pretend that it is fine. That seems inconsistent.

Member Author

I mean that downgrading to "static quorum" is not possible. When the cluster is migrated automatically to "dynamic quorum" but then you downgrade to the previous operator, the Kafka cluster will still continue to use the dynamic quorum, even though the nodes will be configured with a "voters" configuration, because of course an old operator doesn't know about the "controllers bootstrap". But once the kraft.version is set to 1 (dynamic quorum) it can't go back to 0 (static quorum). So where do you see I was not clear in the description?

Comment on lines +1363 to +1366
It doesn't work well in a scenario where the Apache Kafka cluster has combined-nodes and we want to scale down controllers by removing the "controller" role from one or more of them.
In this case, the node is not actually shut down and removed forever but it's rolled with a new configuration (i.e. the "controller" role is removed).
But in a Kubernetes-based environment where the rolling of a pod is driven by the platform, there is no opportunity to execute a step (like the controller unregistration) between the shutdown and startup.
To overcome this issue, the best approach would be unregister the controller first and then rolling the node as broker only, but as mentioned before it would join again right after the unregistration.
Member

Why can't you:

  • Roll the pod to be in broker role only
  • Unregister it from KRaft voters?

Given it is not controller anymore, it should not auto-join anymore, or? If it does, that seems like a major bug in Kafka.

Member Author

If it does, that seems like a major bug in Kafka.

Kind of ... yes, here is what I opened: https://issues.apache.org/jira/browse/KAFKA-19867
The discussion ended with the need for a re-design of auto-join, which is not planned right now. @showuon can add more details to it.

Member

@katheris katheris left a comment

Thanks @ppatierno I added a bunch of comments, but I don't have any big objections to the proposal as it is, more just clarifying questions.

The two areas that gave me "pause for thought" are in needing to wait for multiple reconciliations to fully register or unregister nodes, and needing to have different behaviour in terms of whether we generate the directory ids for initial startup vs scaling. However based on my understanding of the Kafka behaviour I understand why you have made the proposal you have and don't have immediate suggestions for alternatives that would be better or easier to follow.

I will take a little longer before approving just to think it over some more and take a look at the PoC for how complex the code will be from an understanding and maintenance PoV.

Comment on lines +180 to +186
controllers:
- id: 3
directoryId: "r9VGTiw2QUCoDUadCt6q2g"
- id: 4
directoryId: "r8E7BKRbR0SQxSIFro6wjg"
- id: 5
directoryId: "Anjr8banTey_LqVZsppzYg"
Member

I'm not aware of the push-back, are you able to expand or link to some discussion for me to see the context?

* Existing cluster, scale up:
* broker and controller needs the `--no-initial-controllers` option.

The proposal is for the Strimzi Cluster Operator to manage the list of current controllers, with their corresponding directory IDs, in the `Kafka` custom resource status.
Member

This is a minor point, but in terms of addressing the requirement of passing the controllers list to the nodes it might make sense to mention the ConfigMap first, then mention that this data will also be stored in the status to be used for disaster recovery and for users to understand the set of registered voters more easily.

Member Author

The thing is that the content of the controllers list for the ConfigMap comes from the "controller" field within the status if it's not a new Kafka cluster. This is the reason why I mention that status first.

- Split quorum risk: When multiple controllers are scaled up simultaneously and formatted with `--initial-controllers` including all newly added controllers, they can form a separate competing quorum independent of the existing voters if election timeouts expire before they fetch metadata from the active quorum. This violates Raft consensus safety guarantees by creating two independent quorums operating on the same cluster.
- Undocumented behavior: This approach is not documented in the official Apache Kafka documentation or KIP-853. Relying on undocumented behavior creates a risk that future Kafka versions could change or break this functionality without notice.

For these reasons, the proposal follows the official documented approach: using `--initial-controllers` only for initial cluster bootstrap and `--no-initial-controllers` for scale-up operations.
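For context, the two documented formatting modes look roughly like this (a sketch based on KIP-853; the cluster ID, hostnames, ports, and directory IDs are illustrative, and the exact flags should be verified against your Kafka version):

```shell
# Initial cluster bootstrap: every node is formatted with the full
# initial voter set (id@host:port:directoryId).
bin/kafka-storage.sh format \
  --cluster-id q1Sh-9_ISia_zwGINzRvyQ \
  --initial-controllers "0@controller-0:9090:JEXY6aqzQY-32P5TStzaFg,1@controller-1:9090:MvDxzVmcRsaTz33bUuRU6A,2@controller-2:9090:07R5amHmR32VDA6jHkGbTA" \
  --config controller.properties

# Scale-up: a newly added controller must NOT claim an initial voter set;
# it starts as an observer and is added to the quorum afterwards.
bin/kafka-storage.sh format \
  --cluster-id q1Sh-9_ISia_zwGINzRvyQ \
  --no-initial-controllers \
  --config controller.properties
```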
Contributor

The reasons listed for discarding this option are pretty weak.

  • Unnecessary checkpoint file: This is pretty much irrelevant as the file is tiny.

  • Split quorum risk: When formatting, you should only add up to (n/2)-1 new controllers. That type of limitation is relatively common in distributed systems, so I would not consider it blocking.

  • Undocumented behavior: As we discussed privately, this behavior works and we can document it in Apache Kafka if that would help.

Are there other technical reasons?

Member

Yeah, that was pretty much my take as well here.

Member Author

@ppatierno ppatierno Apr 1, 2026

Unnecessary checkpoint file: This is pretty much irrelevant as the file is tiny.

I said this several times during my investigation and our discussions.

Undocumented behavior: As we discussed privately, this behavior works and we can document it in Apache Kafka if that would help.

Tbh it's not exactly what we agreed. We had several discussions, I also sent an email about documenting this usage (no answer for a month, then only Luke replied), and we also worked together on fixing a test and adding a new one about this scenario. But offline we seemed to agree with @showuon and his investigation, which is why my PR was closed and my email was answered. You can find references here:

https://lists.apache.org/thread/62p9svvdzgpgjd7o3kws3f917w6voqqn
apache/kafka#21507

If things are different or you reached a different conclusion, I don't understand why you didn't raise it during our previous discussions.

Also, what does it mean for the Kafka project? Documenting that you can do it, but with the limitation of adding up to (n/2)-1 only? It looks to me more like stretching a way to do the formatting than documenting a working solution.

Member


Mickael is correct, we can avoid the "split brain" issue by limiting the number of nodes scaled up/down at one time. But I think we would also need to add other logic to enforce that limitation, and we would end up with the same result here. So I think @ppatierno can add some more wording here to emphasize the reasons why you don't want to go this way. You must have some reasons, like bad UX for users who want to scale controllers from 1 -> 3 (which would be rejected), etc., to persuade readers.

@scholzj
Member

scholzj commented Apr 1, 2026

  • It sounds like the dynamic quorum does not allow backup and recovery through volume snapshots unless you manage to recover the original names?
  • How does one change the address of a controller node? (e.g. to no longer be node:9091 but node:9089)
  • How does one change the security configuration of the controller nodes?

@ppatierno Any chance you have answers to these points? I think they are pretty clear as well.

The two areas that gave me "pause for thought" are the need to wait for multiple reconciliations to fully register or unregister nodes, and the need for different behaviour in generating the directory IDs for initial startup vs scaling. However, based on my understanding of the Kafka behaviour, I understand why you have made the proposal you have, and I don't have immediate suggestions for alternatives that would be better or easier to follow.

To be honest, I do not think we have to be a passive receiver of every bad design Kafka comes up with. We need to take a more active role in shaping the future. I think we are completely failing at it as Strimzi by completely ignoring what is going on in Kafka. And it is really sad to see that the things I raised as problems over a year ago have not changed, and that we just seem to have decided to accept them. (And yes, I can use excuses such as sabbatical, burn-out, and whatever. But this is obviously my personal failure 😞)

And while I understand that the dynamic quorum configuration is an important change, I think the dynamic quorum as designed could also be a major blocker for future Strimzi development. So I think what we should be asking here is not whether this is the best we could do with the Kafka design. We should be asking whether this is a good Strimzi design. And if we don't think it is a good Strimzi design, we should go looking for a better solution. Maybe that means a different approach, maybe that means we need to rework other things first, or maybe it means going to Kafka and changing how the Kafka design works. But saying that we should approve this just because it is the least bad of the bad options is shortsighted.

@mimaison
Contributor

mimaison commented Apr 1, 2026

To be honest, I do not think we have to be a passive receiver of every bad design Kafka comes up with. We need to take a more active role in shaping the future.

I think it is important to feed your discoveries and pain points back to Kafka. For some of these features, Strimzi is one of the first projects to test them to the limit.

From the long discussions, here are some of the pain points I noted. I'm sure I missed plenty, and may have misunderstood some, so please correct me and include plenty of details, as I don't have the Strimzi expertise to necessarily understand why something is an issue or pain point.

  • Different commands to format controllers depending on whether it's the initial quorum or joining an existing quorum
  • Inability to rely on the current autojoin feature for new controllers
  • Difficulty keeping track of all the directory ids
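
To make the first bullet concrete, this is roughly what the asymmetry looks like with the KIP-853 tooling (a sketch; the cluster ID, config path, host names, and directory IDs below are placeholders, not values from the proposal):

```shell
# Founding controllers: format with the full initial voter set,
# including each voter's directory ID (placeholders shown).
bin/kafka-storage.sh format \
  --cluster-id "$CLUSTER_ID" \
  --config config/controller.properties \
  --initial-controllers "0@controller-0:9090:<dir-id-0>,1@controller-1:9090:<dir-id-1>,2@controller-2:9090:<dir-id-2>"

# A controller joining an existing quorum: format WITHOUT the initial
# voter set, then add it to the quorum once it has caught up.
bin/kafka-storage.sh format \
  --cluster-id "$CLUSTER_ID" \
  --config config/controller.properties \
  --no-initial-controllers
```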

@scholzj What are the other "bad Kafka designs" you encountered? (Only considering dynamic quorum 😉, we can discuss others at another time)

@scholzj
Member

scholzj commented Apr 7, 2026

@mimaison I think it is hard to sum up into a few points:

  • Lack of idempotency is a major issue from my perspective (i.e. different commands to format controllers depending on whether it's the initial quorum or joining an existing quorum; the whole storage formatting seems a bit strange as well, although at least with a static quorum it has a flag to make it idempotent)
  • Not being ready for remotely orchestrated ephemeral container-based infrastructure
    • This includes the directory IDs, which manifest as a bad user experience, but the root cause I think is not expecting that there is no human operating things manually and there are no servers that are there all the time and can be accessed at any point
    • But I think it also includes overreliance on a proprietary commit log instead of a simple configuration file that anyone can edit
    • As well as overreliance on the Kafka protocol to bootstrap a Kafka cluster. (Paolo did not answer my questions, maybe because he does not know. But how do I unpack snapshots of an old cluster after a disaster in a new environment with new addresses? From how it was described here, it seems like a pretty hard task as well.) I wonder why you can't just load it from a file.
    • I do not think these work well on legacy infrastructure either, because things like datacenter re-architecture or renaming whole DNS trees are not that uncommon. And those will be hard as well, compared to, for example, just changing the hostnames in the config file and restarting things.
    • To a large extent, I think this is a repeat of the general issues with dynamic configuration and how it struggles with multiple sources of truth.
  • Frankly, I do wonder if the dynamic quorum isn't too dynamic. I do not think the real problem is changing everything without a single restart. We just need to add and remove controller nodes in an existing cluster. And if it requires me to roll every controller twice when scaling from 3 -> 5 nodes, because I have to go from 3 -> 4 and then from 4 -> 5, each time with a slightly different list of nodes à la ZooKeeper, I would be happy to take it, because I do not think this is an everyday operation.

I obviously do not understand Kafka internals. So I might not understand some motivations. But the Kubernetes landscape is dynamic and is constantly changing. And I'm seriously concerned that these design decisions might hinder Strimzi development because despite being called dynamic, they will make any changes almost impossible to orchestrate.

@ppatierno
Member Author

Hi @scholzj,
thank you for explaining the issues you see with the current status of the dynamic quorum within Apache Kafka and its impact on the current proposal.
I was finally able to test the scenarios with DNS changes (this is why I hadn't replied yet ;-)) and it turned out that it currently doesn't work. I opened a JIRA for that in the upstream Kafka project. You can find details here: https://issues.apache.org/jira/browse/KAFKA-20427.
Of course, I would agree it can be considered a blocker for this proposal.
That said, before moving forward with additional changes in both Kafka and Strimzi (within my POC), I would like to understand what your main concerns and hard stops are for moving forward with the proposal.
Am I right to think that if the following bullet points are covered we could move forward with this work within Strimzi?

  • fixing the issue described in KAFKA-20427 about controllers not recovering the quorum on hostname changes
  • using the same formatting command without distinguishing between a new cluster and a scaling operation (always relying on the --initial-controllers flag)
  • because of the previous point, limiting scaling to one controller at a time (to avoid a split quorum). Also reflecting in the Kafka CR that a controller, despite running as a pod, might not be a voter yet because it is not registered (it's still catching up). Finally, rolling the controllers when the bootstrap servers configuration is updated (even if that doesn't really matter).
  • removing the usage of the directory ID (we are still investigating this and are not sure it's possible to remove it. It could take time anyway, and I would discuss keeping the workaround within the proposal for now)
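
As a sketch of the third bullet (surfacing "running but not yet a voter" and promoting one controller at a time), the KIP-853 tooling already exposes the needed signal; the host names, ports, and config path below are placeholders:

```shell
# A freshly formatted controller shows up as an observer until it has
# caught up and been added as a voter; the operator could poll this.
bin/kafka-metadata-quorum.sh --bootstrap-controller controller-0:9090 \
  describe --status     # compare CurrentVoters vs CurrentObservers

# Once caught up, promote it to a voter (run with the new controller's config).
bin/kafka-metadata-quorum.sh --command-config config/controller.properties \
  --bootstrap-controller controller-0:9090 \
  add-controller
```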

Wdyt?

@scholzj
Member

scholzj commented Apr 10, 2026

@ppatierno It is not just about DNS changes but also authentication changes, port changes, etc.
