We recently had a production incident with Ozone.

## Summary

Our Ozone cluster runs on Kubernetes. We have a number of kube nodes, each running one S3 Gateway (S3g) and one DataNode (DN); some nodes additionally run one OM or one SCM instance. We have three OMs: om0, om1, and om2. At some point, the kube node hosting S3g, DN, and om1 (the leader) went into a non-Ready state for a few minutes, so om1 was still running but did not serve any traffic. That caused om2 to take over leadership. A few seconds later, om1 returned to the cluster.

## Investigation

The investigation later showed the following:
I believe our suboptimal configuration triggered these race conditions; however, they may still occur even with the default configuration. To reproduce the production issue, I created a small tool (actually a test) that runs mock OMs (om0, om1, om2), mimics the production failover from om1 to om2, and then bombards the cluster with requests, printing the results to the console. As a result, om2 (the actual leader) is never tried at all. Could anyone please take a look at my points above and comment on them?
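A minimal sketch of the kind of harness described above: three mock OM endpoints, a forced om1 -> om2 leadership change, and a record of which OMs a client actually contacted. All class and method names here are invented for illustration; this is not the actual test tool.

```java
// Hypothetical harness sketch (invented names), loosely mirroring the
// described reproduction: mock OMs where only the current leader answers.
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

public class MockFailoverHarness {
    static final List<String> OMS = List.of("om0", "om1", "om2");

    // Current leader; starts as om1, flipped to om2 to mimic the failover.
    final AtomicReference<String> leader = new AtomicReference<>("om1");

    // Records every OM a request was attempted against, so the test can
    // assert whether the new leader (om2) was ever tried at all.
    final Set<String> tried = Collections.synchronizedSet(new HashSet<>());

    // A mock OM call: followers reject the request, the leader answers.
    String call(String om) {
        tried.add(om);
        if (!om.equals(leader.get())) {
            throw new RuntimeException(om + ": not the leader");
        }
        return "OK from " + om;
    }

    public static void main(String[] args) {
        MockFailoverHarness h = new MockFailoverHarness();
        h.leader.set("om2"); // mimic the om1 -> om2 failover
        for (String om : OMS) {
            try {
                System.out.println(h.call(om));
            } catch (RuntimeException e) {
                System.out.println(e.getMessage());
            }
        }
        System.out.println("tried: " + h.tried);
    }
}
```

With a correct failover provider, `tried` would eventually contain om2; the bug described above manifests as om2 never appearing in that set.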
The maxFailovers limit of the retry policy should apply to a single request. After roughly going through GrpcOmTransport.java and GrpcOMFailoverProxyProvider.java, I believe there is room for improvement in the retry behavior here, just as @greenwich mentioned. @greenwich, would you like to submit a PR if you have the fix?
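To make the "per-request" point concrete, here is a minimal sketch of failover accounting where the counter is local to each request rather than shared across requests. The class and method names are hypothetical; this is not Ozone's actual implementation.

```java
// Hypothetical sketch (not Ozone code): the failover budget is a local
// variable inside submit(), so each request gets its own maxFailovers
// allowance instead of draining a counter shared by all requests.
import java.util.List;
import java.util.function.Function;

public class PerRequestFailover {
    private final List<String> oms;   // e.g. ["om0", "om1", "om2"]
    private final int maxFailovers;   // budget applied to EACH request

    public PerRequestFailover(List<String> oms, int maxFailovers) {
        this.oms = oms;
        this.maxFailovers = maxFailovers;
    }

    // Try the request against each OM in turn. The counter is per-call,
    // so earlier failed requests cannot exhaust the budget of later ones.
    public <R> R submit(Function<String, R> request) {
        int failovers = 0;            // local, not a shared field
        int idx = 0;
        while (true) {
            try {
                return request.apply(oms.get(idx));
            } catch (RuntimeException e) {
                if (++failovers > maxFailovers) {
                    throw new RuntimeException(
                        "exhausted " + maxFailovers + " failovers for this request", e);
                }
                idx = (idx + 1) % oms.size(); // round-robin to the next OM
            }
        }
    }
}
```

With a shared counter instead, a burst of requests during a leadership change could burn through the budget before the proxy list ever rotates to the real leader, which matches the "om2 is never tried" symptom above.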
cc @rakeshadr
Thanks @greenwich for the fix. Please feel free to reopen in case anything else arises pertaining to this.
@ChenSammi, PR: #9546