DAOS-19036 dtx: handle DTX race issues - b28#18458

Open
Nasf-Fan wants to merge 1 commit into
release/2.8 from
Nasf-Fan/DAOS-19036_1_b28
@Nasf-Fan Nasf-Fan commented Jun 8, 2026

Contributor

This patch mainly includes the following fixes:

  1. When the DTX leader switches, the old DTX leader may have started to abort a DTX without completing the abort before it was evicted. The new DTX leader may then re-execute the related modification successfully and try to commit that DTX. Without additional control, an in-flight DTX ABORT RPC from the old DTX leader can abort the DTX that the new DTX leader is about to commit, which breaks DTX semantics.

    The patch adds a @Version parameter to DTX abort: when the new DTX leader handles an RPC resent by the client, it refreshes the version of the related DTX if that DTX was already prepared by the old DTX leader. Whenever a DTX is aborted locally, the logic compares the version carried by the ABORT request with the version of the related DTX and skips a stale ABORT RPC.

  2. vos_dtx_load_mbs() may be triggered before the related DTX has been prepared locally. In that case the related MBS information is empty, and the case must be handled to avoid a segmentation fault.

  3. Handle the race between DTX resync and the IO handler for a resent RPC.
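
For readers who want a concrete picture of fixes 1 and 2, the following is a minimal sketch and not the code in this patch. Every name in it (`sketch_dtx_entry`, `sketch_dtx_refresh`, `sketch_dtx_abort`, `sketch_load_mbs`) is hypothetical, the structures are simplified stand-ins for the real DAOS ones, and the `-ENOENT` return value for an empty MBS is an assumption chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified DTX states; the real DAOS state machine is richer. */
enum sketch_dtx_state {
	SKETCH_DTX_PREPARED,
	SKETCH_DTX_ABORTED,
};

/* Hypothetical stand-in for the MBS (member set) information of a DTX. */
struct sketch_mbs {
	uint32_t member_count;
};

/* Hypothetical stand-in for a local DTX entry. */
struct sketch_dtx_entry {
	uint32_t                 version; /* version of the last prepare or refresh */
	enum sketch_dtx_state    state;
	const struct sketch_mbs *mbs;     /* NULL until the DTX is prepared locally */
};

/*
 * New leader handling a resent RPC: if the DTX was already prepared
 * (by the old leader), move its version forward to the new leader's.
 */
static void
sketch_dtx_refresh(struct sketch_dtx_entry *e, uint32_t leader_version)
{
	assert(e != NULL);
	if (e->state == SKETCH_DTX_PREPARED && e->version < leader_version)
		e->version = leader_version;
}

/*
 * Local abort: compare the version carried by the ABORT request with the
 * version of the DTX entry. Returns true if the DTX was aborted, false if
 * the request was skipped as stale (or the DTX is no longer prepared).
 */
static bool
sketch_dtx_abort(struct sketch_dtx_entry *e, uint32_t abort_version)
{
	assert(e != NULL);
	if (e->state != SKETCH_DTX_PREPARED || abort_version < e->version)
		return false;
	e->state = SKETCH_DTX_ABORTED;
	return true;
}

/*
 * Loading MBS before the DTX is prepared locally: report the empty case
 * through the return value so the caller never dereferences a NULL MBS.
 * The -ENOENT code is an assumption made for this sketch.
 */
static int
sketch_load_mbs(const struct sketch_dtx_entry *e, const struct sketch_mbs **out)
{
	assert(e != NULL && out != NULL);
	*out = e->mbs;
	return e->mbs == NULL ? -ENOENT : 0;
}
```

In this sketch, an ABORT sent by the old leader with version 1 is skipped once the new leader has refreshed the entry to version 2, while an ABORT carrying version 2 still takes effect.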

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

github-actions Bot commented Jun 8, 2026

Ticket title is 'Argonne Daos_user : Engine ranks 590, 593, and 596 entered Errored state unexpectedly'
Status is 'In Progress'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19036

@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1_b28 branch from 14bee27 to 69b2720 Compare June 8, 2026 06:14
Signed-off-by: Fan Yong <fan.yong@hpe.com>
@Nasf-Fan Nasf-Fan force-pushed the Nasf-Fan/DAOS-19036_1_b28 branch from 69b2720 to 33f4da0 Compare June 9, 2026 05:22
@Nasf-Fan Nasf-Fan marked this pull request as ready for review June 9, 2026 05:23
@Nasf-Fan Nasf-Fan requested review from a team as code owners June 9, 2026 05:23
@daosbuild3

Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18458/3/testReport/

@Nasf-Fan

Contributor Author

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18458/3/testReport/

test_dfuse_daos_build_wb failed because of DAOS-19005, which is not related to this patch.
