We recently had a production incident with Ozone.

## Summary

Our Ozone cluster runs on Kubernetes. We have a number of kube nodes, each running one S3 Gateway (S3g) and one DataNode (DN); some nodes additionally run one OM or one SCM instance. We have three OMs: om0, om1, and om2. At some point, the kube node hosting S3g, DN, and om1 (the leader) went into a non-Ready state for a few minutes, so om1 was still running but did not serve any traffic. That caused om2 to take over leadership. A few seconds later, om1 returned to the cluster.

## Investigation

The investigation later showed the following:
I believe our suboptimal configuration triggered these race conditions; however, they may still occur even with the default configuration. To reproduce the production issue, I created a small tool (actually a test) that runs mock OMs (om0, om1, om2), mimics the production failover from om1 to om2, and then bombards the cluster with requests, printing the results to the console. As a result, om2 (the actual leader) is never tried at all. Could anyone please take a look at my points above and comment on them?
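A minimal sketch of the kind of harness described above: three mock OM endpoints, a forced om1 -> om2 leadership change, and a record of which OMs a client actually contacted. All class and method names here are invented for illustration; this is not the actual test tool.

```java
// Hypothetical harness sketch (invented names), loosely mirroring the
// described reproduction: mock OMs where only the current leader answers.
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class MockFailoverHarness {
    static final List<String> OMS = List.of("om0", "om1", "om2");

    // Current leader; starts as om1, flipped to om2 to mimic the failover.
    final AtomicReference<String> leader = new AtomicReference<>("om1");

    // Records every OM a request was attempted against, so the test can
    // assert whether the new leader (om2) was ever tried at all.
    final Set<String> tried = Collections.synchronizedSet(new HashSet<>());

    // A mock OM call: followers reject the request, the leader answers.
    String call(String om) {
        tried.add(om);
        if (!om.equals(leader.get())) {
            throw new RuntimeException(om + ": not the leader");
        }
        return "OK from " + om;
    }

    public static void main(String[] args) {
        MockFailoverHarness h = new MockFailoverHarness();
        h.leader.set("om2"); // mimic the om1 -> om2 failover
        for (String om : OMS) {
            try {
                System.out.println(h.call(om));
            } catch (RuntimeException e) {
                System.out.println(e.getMessage());
            }
        }
        System.out.println("tried: " + h.tried);
    }
}
```

With a correct failover provider, `tried` would eventually contain om2; the bug described above manifests as om2 never appearing in that set.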
The maxFailovers limit of the retry policy should apply to a single request. After roughly going through GrpcOmTransport.java and GrpcOMFailoverProxyProvider.java, I believe there is room for improvement in the retry behavior here, just as @greenwich mentioned. @greenwich, would you like to submit a PR if you have the fix?
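To make the "per-request" point concrete, here is a minimal sketch of failover accounting where the counter is local to each request rather than shared across requests. The class and method names are hypothetical; this is not Ozone's actual implementation.

```java
// Hypothetical sketch (not Ozone code): the failover budget is a local
// variable inside submit(), so each request gets its own maxFailovers
// allowance instead of draining a counter shared by all requests.
import java.util.List;
import java.util.function.Function;

public class PerRequestFailover {
    private final List<String> oms;   // e.g. ["om0", "om1", "om2"]
    private final int maxFailovers;   // budget applied to EACH request

    public PerRequestFailover(List<String> oms, int maxFailovers) {
        this.oms = oms;
        this.maxFailovers = maxFailovers;
    }

    // Try the request against each OM in turn. The counter is per-call,
    // so earlier failed requests cannot exhaust the budget of later ones.
    public <R> R submit(Function<String, R> request) {
        int failovers = 0;            // local, not a shared field
        int idx = 0;
        while (true) {
            try {
                return request.apply(oms.get(idx));
            } catch (RuntimeException e) {
                if (++failovers > maxFailovers) {
                    throw new RuntimeException(
                        "exhausted " + maxFailovers + " failovers for this request", e);
                }
                idx = (idx + 1) % oms.size(); // round-robin to the next OM
            }
        }
    }
}
```

With a shared counter instead, a burst of requests during a leadership change could burn through the budget before the proxy list ever rotates to the real leader, which matches the "om2 is never tried" symptom above.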
cc @rakeshadr
Thanks @greenwich for the fix. Please feel free to reopen in case anything else arises pertaining to this.
@ChenSammi, PR: #9546