The Problem
One of the steps of GracefulMasterTakeover is making the master instance read-only (orchestrator/go/logic/topology_recovery.go, lines 2170 to 2173 in 181f94a):

    log.Infof("GracefulMasterTakeover: Will set %+v as read_only", clusterMaster.Key)
    if clusterMaster, err = inst.SetReadOnly(&clusterMaster.Key, true); err != nil {
        return nil, nil, err
    }

If the process fails right after that, for example here, the replica set stays intact, but the master remains read-only.
The Proposed Solution
- Ensure that the following code, or something similar, is executed before any return nil, nil, err that comes after the master is set to read-only (orchestrator/go/logic/topology_recovery.go, lines 2192 to 2197 in 181f94a):

      if topologyRecovery.SuccessorKey == nil {
          // Promotion fails.
          // Undo setting read-only on original master.
          inst.SetReadOnly(&clusterMaster.Key, false)
          return nil, nil, fmt.Errorf("GracefulMasterTakeover: Recovery attempted yet no replica promoted; err=%+v", err)
      }

- Add a PostUnsuccessfulGracefulTakeoverProcesses config entry and execute it if the graceful takeover was not successful, similar to other takeover/failover processes. This would allow users to add their own hooks to check the master status and update it if needed.
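To make the two options concrete, here is a minimal sketch of how they could be combined. It is not orchestrator code: the instance type, setReadOnly, gracefulTakeover, the promote callback and the onFailure hooks are all hypothetical stand-ins. It assumes the rollback can live in a single deferred function, so that every error return after the master becomes read-only is covered. The failure hooks are modeled as plain Go callbacks standing in for a PostUnsuccessfulGracefulTakeoverProcesses entry.

```go
package main

import (
	"errors"
	"fmt"
)

// instance is a hypothetical stand-in for a managed MySQL instance.
type instance struct {
	name     string
	readOnly bool
}

// setReadOnly is a hypothetical stand-in for inst.SetReadOnly.
func setReadOnly(i *instance, readOnly bool) error {
	i.readOnly = readOnly
	return nil
}

// gracefulTakeover sketches the proposed control flow. promote stands in
// for the promotion steps; onFailure stands in for the hooks that a
// PostUnsuccessfulGracefulTakeoverProcesses entry would run.
func gracefulTakeover(
	master *instance,
	promote func() (*instance, error),
	onFailure []func(*instance),
) (successor *instance, err error) {
	if err = setReadOnly(master, true); err != nil {
		return nil, err
	}

	// From here on, any error return leaves the master read-only unless
	// it is undone. The deferred function inspects the named return value
	// err, so it covers every return path below.
	defer func() {
		if err == nil {
			return
		}
		// Option 1: undo setting read-only on the original master.
		if undoErr := setReadOnly(master, false); undoErr != nil {
			err = fmt.Errorf("%w; also failed to undo read_only: %v", err, undoErr)
		}
		// Option 2: run the user-supplied failure hooks.
		for _, hook := range onFailure {
			hook(master)
		}
	}()

	successor, err = promote()
	if err != nil {
		return nil, err
	}
	if successor == nil {
		return nil, errors.New("recovery attempted yet no replica promoted")
	}
	return successor, nil
}

func main() {
	master := &instance{name: "master"}
	// Simulate a takeover in which no replica gets promoted.
	_, err := gracefulTakeover(master, func() (*instance, error) { return nil, nil }, nil)
	fmt.Println("error:", err)
	fmt.Println("master read-only after failure:", master.readOnly)
}
```

With this shape, a new early return added later would be covered by the same rollback without having to repeat the undo call at each return site.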
Could you please suggest which of the two solutions (or both) would be better to implement, or propose another way to work around the issue?