DAOS-18487 rebuild: don't wait for discard #17621
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/1/execution/node/301/log
- pool_discard doesn't wait for completion of discard anymore
- Make sure no concurrent discards

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
dd4a7f9 to df0f726
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/2/execution/node/304/log
Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/2/execution/node/320/log
Test stage Build on EL 9.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/2/execution/node/314/log
Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/2/execution/node/408/log
	D_INFO(DF_UUID " XXX: discard is already in progress, \n", DP_UUID(arg->pool_uuid));
	ds_pool_put(pool);
	D_GOTO(out, rc = -DER_BUSY);
}
Need to query if rebuild is already running as well.
src/pool/srv_target.c
Outdated
}

pool->sp_discard_status = 0;
rc = dss_ult_execute(ds_pool_tgt_discard_ult, arg, NULL, NULL, DSS_XS_SYS, 0, 0);
Changing this to dss_ult_create() looks more straightforward than creating a ULT in the ds_pool_collective() callback. Or simply call ds_pool_collective() here instead of creating a new ULT.
I was inclined to do that, but I wanted to minimize the change for now: using dss_pool_collective() requires changing dss_pool_collective() to count the ULTs being created, and adding an eventual too.
Let's just do this for validation, and I will clean it up if it helps.
Hmm, I found another issue, so I might have to change it now.
The minimal change is simply to replace the dss_ult_execute() above with dss_ult_create(), isn't it?
src/pool/srv_target.c
Outdated
struct d_backoff_seq backoff_seq;
int rc;

D_ASSERTF(!ds_pool_is_rebuilding(pool), DF_UUID " is already being reintegrated!\n",
This isn't from this patch, but writing 'sp_rebuild_scan' from one xstream and reading it from another xstream as a barrier does not look correct.
Same as above: I would be very cautious about adding an assertion, but this is for debugging & validation.
-static int
+static void
 pool_child_discard(void *data)
It's better to add D_INFO messages before & after the discard, so that we can tell whether there is any unexpected discard in the following tests.
src/pool/srv_target.c
Outdated
 }

-pool->sp_need_discard = 1;
+if (atomic_fetch_add(&pool->sp_need_discard, 1) > 1) {
atomic_fetch_add()
Atomically replaces the value pointed to by obj with the result of adding arg to the old value of obj, and returns the value obj held previously.
So should this be if (atomic_fetch_add(, 1) > 0)?
Hmm, you are right, let me fix it. There is no call like test_and_set; otherwise I'd just use that.
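To illustrate the point about the return value, here is a minimal standalone sketch using C11 <stdatomic.h>. It is not the code from this patch: need_discard and discard_try_begin() are hypothetical names standing in for sp_need_discard and the guard in srv_target.c, and undoing the increment on failure is an assumption of this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for pool->sp_need_discard; starts at 0. */
static atomic_int need_discard;

/*
 * Try to become the only discard in flight.
 * Returns true if the caller may start the discard, false otherwise.
 */
static bool
discard_try_begin(void)
{
	/*
	 * atomic_fetch_add() returns the value held BEFORE the addition,
	 * so the first caller sees 0 and any concurrent caller sees >= 1.
	 * Comparing against "> 1" would let a second discard through.
	 */
	if (atomic_fetch_add(&need_discard, 1) > 0) {
		/* Assumption of this sketch: roll back our own increment. */
		atomic_fetch_sub(&need_discard, 1);
		return false;
	}
	return true;
}
```

With this shape, a first call to discard_try_begin() succeeds and a second call made before the counter is released is rejected.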
src/pool/srv_target.c
Outdated
 */
ds_pool_collective(arg->pool_uuid, ex_status, pool_child_discard_async, arg, 0, true);

ref = atomic_fetch_sub(&pool->sp_need_discard, 1);
atomic_fetch_sub() atomically subtracts a value from an atomic object and returns the previous value, so should this be assert(ref >= 1)?
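The same reasoning applies on the release side. The sketch below, again plain C11 rather than the patch itself, shows why the assertion should be on the previous value being at least 1; discard_put() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Drop one count from a discard counter and return what remains.
 * "counter" is a hypothetical stand-in for &pool->sp_need_discard.
 */
static int
discard_put(atomic_int *counter)
{
	/* atomic_fetch_sub() returns the value held BEFORE the subtraction. */
	int ref = atomic_fetch_sub(counter, 1);

	/* A valid release must have seen at least one outstanding count. */
	assert(ref >= 1);

	return ref - 1;
}
```

If the counter was 1 before the call, ref is 1 and the function returns 0, which is the state a later discard would expect to find.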
src/pool/srv_target.c
Outdated
addr.pta_target = dss_get_module_info()->dmi_tgt_id;
if (!pool_target_addr_found(&arg->tgt_list, &addr)) {
	D_DEBUG(DB_TRACE, "skip discard %u/%u.\n", addr.pta_rank,
		addr.pta_target);
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
src/pool/srv_target.c
Outdated
/* XXX just return EAGAIN/EPERM? */
D_ASSERTF(!ds_pool_is_rebuilding(pool), DF_UUID " is already being reintegrated!\n",
	  DP_UUID(arg->pool_uuid));
This assert could be triggered in the current implementation. I think we should just return an error when a rebuild is in progress.
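A rough sketch of what "return an error instead of asserting" could look like, written as standalone C so it can be compiled on its own. Everything here is hypothetical: struct ex_pool, ex_discard_start() and EX_BUSY are illustrative stand-ins, not DAOS types or error codes, and the real check would presumably go through ds_pool_is_rebuilding() and -DER_BUSY as in the snippets above.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder error value for this sketch only, not the real DAOS code. */
#define EX_BUSY 1

/* Hypothetical, simplified view of the pool state relevant to discard. */
struct ex_pool {
	bool rebuilding; /* a rebuild/reintegration is running */
	bool discarding; /* a discard is already in flight */
};

/* Start a discard, or report "busy" instead of crashing the engine. */
static int
ex_discard_start(struct ex_pool *pool)
{
	/* Reachable at runtime, so treat it as an error rather than an assert. */
	if (pool->rebuilding)
		return -EX_BUSY;

	/* Reject concurrent discards the same way. */
	if (pool->discarding)
		return -EX_BUSY;

	pool->discarding = true;
	return 0;
}
```

The caller can then retry or propagate the error, which also covers the earlier request to check whether a rebuild is already running.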
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/4/execution/node/1352/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17621/5/testReport/
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/5/execution/node/1035/log