-
Notifications
You must be signed in to change notification settings - Fork 15
[TNTP-2109] Switch to safe mode on vshard rebalance #467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Co-authored-by: Satbek Turganbayev <s.turganbayev@corp.mail.ru>
In fast mode ref/unref do nothing. In safe mode they call bucket_refro,bucket_refrw, bucket_unrefo,bucket_unrefrw. Also added ref error handle from storages. On ref error router will reset bucket, change replicaset for single operations and retry request.
* add access to _bucket space in privillage tests * fix select_readview_test * enable double buckets test * add safe_mode_disable for cartirdge reload test: test_storage * add safe_mode_disable for test_any_vshard_call_timeout
| --- on fast mode it does nothing. | ||
| --- default is fast mode. | ||
|
|
||
| --- bucket_refw and bucket_unrefrw must be called in one transaction in order to prevent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| --- bucket_refw and bucket_unrefrw must be called in one transaction in order to prevent | |
| --- bucket_ref/bucket_unref must be called in one transaction in order to prevent |
| --- set safe mode func | ||
| --- from rebalance.safe_mode_status | ||
| function M.set_safe_mode_status(safe_mode_status) | ||
| M.safe_mode_status = safe_mode_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be this can be removed because there are rebalance.register_safe_mode_enable_hook, rebalance.register_safe_mode_disable_hook functions
| call_on_storage = call_on_storage_safe | ||
|
|
||
| for fb_id, fb in pairs(fiber.info()) do | ||
| if string.find(fb.name, CRUD_CALL_FUNC_NAME .. '/fast/') then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string.startswith might be faster, or even
if safe_mode_allowed_fast_methods[fb.name] == true then| local function safe_mode_enable() | ||
| if not box.info.ro then | ||
| box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', true } | ||
| for _, trig in pairs(box.space._bucket:on_replace()) do | ||
| if trig == safe_mode_trigger then | ||
| box.space._bucket:on_replace(nil, safe_mode_trigger) | ||
| end | ||
| end | ||
| end | ||
| M.safe_mode = true | ||
|
|
||
| for hook, _ in pairs(M.safe_mode_enable_hooks) do | ||
| hook() | ||
| end | ||
|
|
||
| log.info('Rebalance safe mode enabled') | ||
| end | ||
|
|
||
| local function safe_mode_disable() | ||
| if not box.info.ro then | ||
| box.space[SETTINGS_SPACE_NAME]:replace{ 'safe_mode', false } | ||
| box.space._bucket:on_replace(safe_mode_trigger) | ||
| end | ||
| M.safe_mode = false | ||
|
|
||
| for hook, _ in pairs(M.safe_mode_disable_hooks) do | ||
| hook() | ||
| end | ||
|
|
||
| log.info('Rebalance safe mode disabled') | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be add some comments on trigger add and delete. Why we adding and deleting it.
Someone might decide that the trigger needs to be enabled in safe_mode_enable.
| M.router = { | ||
| cache_clear = router_cache_clear, | ||
| cache_length = router_cache_length, | ||
| cache_last_clear_ts = router_cache_last_clear_ts, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need tests for cache_clear scenario
| pgroup_duplicates.test_duplicates = function(g) | ||
| t.skip_if( | ||
| not ( | ||
| utils.tarantool_version_at_least(3, 1) and (g.params.backend == "config") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: implement test for other backend types. 2.11 and cartridge
| }, | ||
| get = { | ||
| call = function (g, key) | ||
| return g.router:call('crud.get', { 'customers', key }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mode='write' in options. to make call to the master
| end | ||
| end | ||
|
|
||
| if g.params and g.params.safe_mode ~= nil then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should check bootstrap finish (etalon buckets count on storages, with active statuses) before applying fast, slow mode.
vakhov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I’ve gone through the changes and left a few small questions.
| if string.find(fb.name, CRUD_CALL_FUNC_NAME .. '/fast/') then | ||
| local do_kill = true | ||
| for _, allowed_method in ipairs(safe_mode_allowed_fast_methods) do | ||
| if fb.name == allowed_method then | ||
| do_kill = false | ||
| break | ||
| end | ||
| end | ||
| if do_kill then | ||
| fiber.kill(fb_id) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could make this check a bit stricter, e.g. by checking for a prefix (something like fb.name:startswith(CRUD_CALL_FIBER_NAME .. '/fast/')), to minimize the risk of accidental matches? We could also adding a small debug log with the number of killed fibers to make debugging.
| if (op == 'INSERT' and new.status == vshard_consts.BUCKET.RECEIVING) or | ||
| (op == 'REPLACE' and new.status == vshard_consts.BUCKET.SENDING) then | ||
| box.broadcast(SAFE_MOD_ENABLE_EVENT, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Safe mode is enabled only on RECEIVING/SENDING. Could there be other transitions we should handle too? Could there be a situation where a bucket is initially created in SENDING or goes through other statuses that we don’t catch here?
| local err = M.BucketRefError:new(M.BucketRefError:new( | ||
| "failed bucket_ref: %s, bucket_id: %s", | ||
| vshard_ref_err.name, | ||
| bucket_id | ||
| )) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like a single call would be enough:
| local err = M.BucketRefError:new(M.BucketRefError:new( | |
| "failed bucket_ref: %s, bucket_id: %s", | |
| vshard_ref_err.name, | |
| bucket_id | |
| )) | |
| local err = M.BucketRefError:new( | |
| "failed bucket_ref: %s, bucket_id: %s", | |
| vshard_ref_err.name, | |
| bucket_id | |
| ) |
| local ref_ok, bucket_refrw_err = safe_methods.bucket_refrw(bucket_id) | ||
| if not ref_ok then | ||
|
|
||
| table.insert(bucket_ref_errs, bucket_refrw_err.bucket_ref_errs[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the inner error has no bucket_ref_errs, this will hit a nil index. Maybe worth guarding?
I didn't forget about