Skip to content

Conversation

@ita-sammann
Copy link

@ita-sammann ita-sammann commented Dec 5, 2025

  • Implement "safe" mode to prevent writing data to wrong replicaset when vshard rebalance is in progress.
  • Auto switch to safe mode when rebalance process starts.
  • Manual return to fast mode.

I didn't forget about

  • Tests
  • Changelog
  • Documentation

ita-sammann and others added 5 commits December 5, 2025 13:42
Co-authored-by: Satbek Turganbayev <s.turganbayev@corp.mail.ru>
In fast mode ref/unref do nothing.
In safe mode they call bucket_refro,bucket_refrw,
bucket_unrefo,bucket_unrefrw.

Also added ref error handle from storages.
On ref error router will reset bucket, change replicaset
for single operations and retry request.
* add access to _bucket space in privillage tests
* fix select_readview_test
* enable double buckets test
* add safe_mode_disable for cartirdge reload test: test_storage
* add safe_mode_disable for test_any_vshard_call_timeout
--- on fast mode it does nothing.
--- default is fast mode.

--- bucket_refw and bucket_unrefrw must be called in one transaction in order to prevent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
--- bucket_refw and bucket_unrefrw must be called in one transaction in order to prevent
--- bucket_ref/bucket_unref must be called in one transaction in order to prevent

--- set safe mode func
--- from rebalance.safe_mode_status
function M.set_safe_mode_status(safe_mode_status)
M.safe_mode_status = safe_mode_status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be this can be removed because there are rebalance.register_safe_mode_enable_hook, rebalance.register_safe_mode_disable_hook functions

call_on_storage = call_on_storage_safe

for fb_id, fb in pairs(fiber.info()) do
if string.find(fb.name, CRUD_CALL_FUNC_NAME .. '/fast/') then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string.startswith might be faster, or even

if safe_mode_allowed_fast_methods[fb.name] == true then

Comment on lines +60 to +90
local function safe_mode_enable()
if not box.info.ro then
box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', true }
for _, trig in pairs(box.space._bucket:on_replace()) do
if trig == safe_mode_trigger then
box.space._bucket:on_replace(nil, safe_mode_trigger)
end
end
end
M.safe_mode = true

for hook, _ in pairs(M.safe_mode_enable_hooks) do
hook()
end

log.info('Rebalance safe mode enabled')
end

local function safe_mode_disable()
if not box.info.ro then
box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', false }
box.space._bucket:on_replace(safe_mode_trigger)
end
M.safe_mode = false

for hook, _ in pairs(M.safe_mode_disable_hooks) do
hook()
end

log.info('Rebalance safe mode disabled')
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be add some comments on trigger add and delete. Why we adding and deleting it.
Someone might decide that the trigger needs to be enabled in safe_mode_enable.

Comment on lines +194 to +198
M.router = {
cache_clear = router_cache_clear,
cache_length = router_cache_length,
cache_last_clear_ts = router_cache_last_clear_ts,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need tests for cache_clear scenario

pgroup_duplicates.test_duplicates = function(g)
t.skip_if(
not (
utils.tarantool_version_at_least(3, 1) and (g.params.backend == "config")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo: implement test for other backend types. 2.11 and cartridge

},
get = {
call = function (g, key)
return g.router:call('crud.get', { 'customers', key })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mode='write' in options. to make call to the master

end
end

if g.params and g.params.safe_mode ~= nil then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should check bootstrap finish (etalon buckets count on storages, with active statuses) before applying fast, slow mode.

Copy link
Contributor

@vakhov vakhov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I’ve gone through the changes and left a few small questions.

Comment on lines +49 to +59
if string.find(fb.name, CRUD_CALL_FUNC_NAME .. '/fast/') then
local do_kill = true
for _, allowed_method in ipairs(safe_mode_allowed_fast_methods) do
if fb.name == allowed_method then
do_kill = false
break
end
end
if do_kill then
fiber.kill(fb_id)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could make this check a bit stricter, e.g. by checking for a prefix (something like fb.name:startswith(CRUD_CALL_FIBER_NAME .. '/fast/')), to minimize the risk of accidental matches? We could also adding a small debug log with the number of killed fibers to make debugging.

Comment on lines +34 to +36
if (op == 'INSERT' and new.status == vshard_consts.BUCKET.RECEIVING) or
(op == 'REPLACE' and new.status == vshard_consts.BUCKET.SENDING) then
box.broadcast(SAFE_MOD_ENABLE_EVENT, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safe mode is enabled only on RECEIVING/SENDING. Could there be other transitions we should handle too? Could there be a situation where a bucket is initially created in SENDING or goes through other statuses that we don’t catch here?

Comment on lines +23 to +27
local err = M.BucketRefError:new(M.BucketRefError:new(
"failed bucket_ref: %s, bucket_id: %s",
vshard_ref_err.name,
bucket_id
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like a single call would be enough:

Suggested change
local err = M.BucketRefError:new(M.BucketRefError:new(
"failed bucket_ref: %s, bucket_id: %s",
vshard_ref_err.name,
bucket_id
))
local err = M.BucketRefError:new(
"failed bucket_ref: %s, bucket_id: %s",
vshard_ref_err.name,
bucket_id
)

local ref_ok, bucket_refrw_err = safe_methods.bucket_refrw(bucket_id)
if not ref_ok then

table.insert(bucket_ref_errs, bucket_refrw_err.bucket_ref_errs[1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the inner error has no bucket_ref_errs, this will hit a nil index. Maybe worth guarding?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants