
DAOS-18487 rebuild: don't wait for discard#17621

Draft
gnailzenh wants to merge 4 commits into master from liang/rebuild/b_discard_serialize

Conversation

@gnailzenh
Contributor

  • pool_discard doesn't wait for completion of discard anymore
  • Make sure no concurrent discards

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@github-actions

Error: Unable to load ticket data
https://daosio.atlassian.net/browse/DAOS-18487

@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/1/execution/node/301/log

- pool_discard doesn't wait for completion of discard anymore
- Make sure no concurrent discards

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@gnailzenh gnailzenh force-pushed the liang/rebuild/b_discard_serialize branch from dd4a7f9 to df0f726 on February 28, 2026 11:54
@daosbuild3
Collaborator

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/2/execution/node/304/log


D_INFO(DF_UUID " XXX: discard is already in progress\n", DP_UUID(arg->pool_uuid));
ds_pool_put(pool);
D_GOTO(out, rc = -DER_BUSY);
}
Contributor


Need to query if rebuild is already running as well.

}

pool->sp_discard_status = 0;
rc = dss_ult_execute(ds_pool_tgt_discard_ult, arg, NULL, NULL, DSS_XS_SYS, 0, 0);
Contributor


Changing this to dss_ult_create() looks more straightforward than creating a ULT in the ds_pool_collective() callback. Or simply call ds_pool_collective() here instead of creating a new ULT.

Contributor Author


I intended to do that but wanted to minimize the change for now; using ds_pool_collective() would require changing it to count the ULTs being created, and to add an eventual as well.
Let's keep this for validation, and I will clean it up if it helps.

Contributor Author


Hmm, I found another issue; I might have to change it now.

Contributor


The minimal change is to simply replace the dss_ult_execute() above with dss_ult_create(), isn't it?

struct d_backoff_seq backoff_seq;
int rc;

D_ASSERTF(!ds_pool_is_rebuilding(pool), DF_UUID " is already being reintegrated!\n",
Contributor


It's not from this patch, but writing 'sp_rebuild_scan' from one xstream while reading it from another xstream for barrier purposes does not look correct.

Contributor Author


Same as above; I would normally be very cautious about adding an assertion, but this one is for debugging & validation.


-static int
+static void
 pool_child_discard(void *data)
Contributor


It's better to add D_INFO messages before & after the discard, so that we can tell if there is any unexpected discard in the following tests.

Contributor Author


Sure, will add.

}

-pool->sp_need_discard = 1;
+if (atomic_fetch_add(&pool->sp_need_discard, 1) > 1) {
Contributor


atomic_fetch_add()
Atomically replaces the value pointed to by obj with the result of adding arg to the old value of obj, and returns the value obj held previously.
So should this be if (atomic_fetch_add(..., 1) > 0)?

Contributor Author


Hmm, you are right, let me fix it. There is no call like test_and_set, otherwise I'd just use that.

*/
ds_pool_collective(arg->pool_uuid, ex_status, pool_child_discard_async, arg, 0, true);

ref = atomic_fetch_sub(&pool->sp_need_discard, 1);
Contributor


atomic_fetch_sub() atomically subtracts a value from an atomic object and returns the previous value,
so should we assert(ref >= 1)?

Contributor Author


agreed

addr.pta_target = dss_get_module_info()->dmi_tgt_id;
if (!pool_target_addr_found(&arg->tgt_list, &addr)) {
D_DEBUG(DB_TRACE, "skip discard %u/%u.\n", addr.pta_rank,
addr.pta_target);
Contributor


ds_pool_child_put

Contributor Author


right, thanks

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@gnailzenh gnailzenh requested a review from wangshilong March 1, 2026 12:00
Signed-off-by: Liang Zhen <gnailzenh@gmail.com>

/* XXX just return EAGAIN/EPERM? */
D_ASSERTF(!ds_pool_is_rebuilding(pool), DF_UUID " is already being reintegrated!\n",
DP_UUID(arg->pool_uuid));
Contributor


This assert could be triggered in the current implementation. I think we should just return an error when rebuild is in progress.

Signed-off-by: Liang Zhen <gnailzenh@gmail.com>
@daosbuild3
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17621/4/execution/node/1352/log

