fix: Correctly handle consumer errors and removal due to Pulsar cluster restarts by unconditionally calling update_topics in MultiTopicConsumer #384
Conversation
@BewareMyPower - You reviewed my previous PR over this issue. I believe this is the real, proper fix.
@mdeltito - FYI
@chamons Excellent find! This clears up my confusion as well - the removal of topics on fatal errors makes a lot more sense when there is a consistent, out-of-band mechanism for ensuring a connection/consumer exists for all topics. I suppose this does mean that for the (likely much more common) case of static topics, we end up hitting the broker a lot more often to fetch the partitioned topic metadata. Unclear whether that is a "problem" or not; there is not much to go on in the PR or commit that introduced the original change, as I'm sure you saw.
It is concerning that this was originally removed for what appear to be performance reasons (?). However, if any broker issue can cause topics to be removed, then I think we strictly need to do those refreshes, or else you can drop some (or, in one case locally, all) topics out of a multi-topic consumer. A once-every-60-seconds query for all of the topics can't be that bad, right? (Famous last words)
Totally agree, this behavior and erring on the side of being chatty with the broker seem preferable to the current state.
I too want to believe 😄 This also handles the case where a topic for a static subscription has its number of partitions increased. For our use case we are currently either forced to use a regex subscription and leverage that refresh interval to pick up the new partition(s), or restart consumers with a static subscription.
That is a very good point: a partition-count increase can happen at an arbitrary point in time, is not an error, and we need to handle it as well.
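To make the partition case concrete, here is a minimal sketch, not the actual pulsar-rs code: `partition_count`, `known_partitions`, and `subscribe` are hypothetical stand-ins for client state and API. It shows how a refresh tick could notice that a partitioned topic grew and subscribe to the new partitions.

```rust
// Hedged sketch: detect a grown partitioned topic on a refresh tick and
// subscribe to the partitions we do not have consumers for yet.
use std::collections::HashSet;

fn partition_topic_name(base: &str, index: u32) -> String {
    // Pulsar names the partitions of a partitioned topic
    // `<topic>-partition-<index>`.
    format!("{base}-partition-{index}")
}

fn refresh_partitions(
    base_topic: &str,
    partition_count: u32, // hypothetical: reported by the broker's metadata lookup
    known_partitions: &mut HashSet<String>,
    mut subscribe: impl FnMut(&str),
) {
    for index in 0..partition_count {
        let topic = partition_topic_name(base_topic, index);
        if known_partitions.insert(topic.clone()) {
            // A partition we have no consumer for yet: either it is new
            // (partition count was increased) or its consumer was dropped.
            subscribe(&topic);
        }
    }
}

fn main() {
    let base = "persistent://public/default/orders";
    let mut known: HashSet<String> =
        (0..3u32).map(|i| partition_topic_name(base, i)).collect();

    // Suppose the next metadata lookup reports 5 partitions instead of 3.
    refresh_partitions(base, 5, &mut known, |topic| {
        println!("would create a consumer for {topic}");
    });
}
```

In a real client the partition count would come from the broker's partitioned-topic metadata on each refresh, which is exactly the extra chatter discussed above.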
FYI, the Pulsar Java client has two separate configs for such periodic tasks, though it seems reasonable to use a single configuration for it here.
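For illustration only (these are hypothetical names, not existing pulsar-rs options), a single knob covering both regex topic discovery and consumer/partition refresh might look something like this:

```rust
// Illustrative only: one refresh-interval setting for both regex topic
// discovery and partition/consumer updates, as discussed above.
use std::time::Duration;

#[derive(Clone, Debug)]
struct MultiTopicConsumerOptions {
    /// How often to re-resolve topics (regex matches and partition counts)
    /// and recreate any consumers that have been dropped.
    topic_refresh_interval: Duration,
}

impl Default for MultiTopicConsumerOptions {
    fn default() -> Self {
        Self {
            topic_refresh_interval: Duration::from_secs(60),
        }
    }
}

fn main() {
    let options = MultiTopicConsumerOptions::default();
    println!("refresh every {:?}", options.topic_refresh_interval);
}
```

A single interval keeps the consumer loop simple; the trade-off is that regex discovery and partition metadata refresh cannot be tuned independently, which is what the Java client's pair of settings allows.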
While testing #379, it was found that in cases where the entire Pulsar broker was restarted, we'd hit a similar issue:
- Instead of hitting Some(Err(e)) at https://github.com/streamnative/pulsar-rs/blob/a14e8a15144a48d7d97b20c6a7fc637cbf5d780b/src/consumer/multi.rs#L381-L391
- We hit None at https://github.com/streamnative/pulsar-rs/blob/a14e8a15144a48d7d97b20c6a7fc637cbf5d780b/src/consumer/multi.rs#L377-L379
This means we still rip the topic out of the MultiTopicConsumer forever.
I originally was going to apply a fix similar to the one in #379, but reviewing the code made me realize that there was existing code at:
https://github.com/streamnative/pulsar-rs/blob/a14e8a15144a48d7d97b20c6a7fc637cbf5d780b/src/consumer/multi.rs#L145C12-L145C25
that actually handles keeping the initial topics around by recreating them.
However, it wasn't being called here, because the call was guarded by a topic_regex existence check.
I believe it is necessary for this code to be called unconditionally, because errors in consumers can drop them from a MultiTopicConsumer, and we need to recreate them.
Local testing confirmed that it works: 5 full restarts of our staging Pulsar cluster while under load did not hang it up once.
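For context, here is a minimal sketch of the shape of the change, not the actual pulsar-rs code: `Registry`, `fetch_desired_topics`, and `create_consumer` below are hypothetical stand-ins. The point is that the periodic update_topics-style refresh runs on every tick, rather than only when a topic_regex is configured, so consumers dropped after fatal errors or a full broker restart get recreated.

```rust
// Hedged sketch of the refresh loop described above. `Registry`,
// `fetch_desired_topics`, and `create_consumer` are hypothetical
// stand-ins, not the real pulsar-rs types.
use std::collections::{HashMap, HashSet};
use std::time::Duration;

#[derive(Default)]
struct Registry {
    // Topic name -> a live consumer handle (simplified to `()` here).
    consumers: HashMap<String, ()>,
    // Topics the MultiTopicConsumer was originally created with.
    initial_topics: Vec<String>,
    // Optional pattern for dynamic topic discovery (matching elided).
    topic_regex: Option<String>,
}

impl Registry {
    // Stand-in for the broker lookup: expand the initial topics (and any
    // regex matches) into the concrete set of topics we should consume.
    async fn fetch_desired_topics(&self) -> HashSet<String> {
        self.initial_topics.iter().cloned().collect()
    }

    async fn create_consumer(&mut self, topic: String) {
        println!("recreating consumer for {topic}");
        self.consumers.insert(topic, ());
    }

    // The key behavior: recreate any consumer that has been dropped,
    // e.g. because it hit a fatal error during a broker restart.
    async fn update_topics(&mut self) {
        let desired = self.fetch_desired_topics().await;
        for topic in desired {
            if !self.consumers.contains_key(&topic) {
                self.create_consumer(topic).await;
            }
        }
    }
}

#[tokio::main]
async fn main() {
    let mut registry = Registry {
        initial_topics: vec!["persistent://public/default/events".into()],
        ..Default::default()
    };

    let mut ticker = tokio::time::interval(Duration::from_secs(60));
    loop {
        ticker.tick().await;
        // Before this fix, the refresh was effectively guarded by
        // `if registry.topic_regex.is_some() { ... }`, so consumers for
        // static topics were never recreated once dropped.
        registry.update_topics().await;
    }
}
```

In this sketch the desired set only contains the initial topics; in the real consumer it would also include regex matches and per-partition topics, which is what makes the unconditional call cover both the restart case described here and the partition-increase case discussed above.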