
Fix late selective_channel retry after EndRPC #3359

Open
hjwsm1989 wants to merge 1 commit into apache:master from hjwsm1989:codex/selective-channel-late-subdone-fix


@hjwsm1989


What problem does this PR solve?

Issue Number: #3358

Problem Summary:
This fixes a selective_channel race where a late SubDone::Run() may re-enter
retry/backup after the main RPC has already entered EndRPC().

The crash report in #3358 shows that a late sub-call callback can still flow into:

  • Controller::OnVersionedRPCReturned()
  • Controller::IssueRPC()
  • schan::Sender::IssueRPC()

after the main controller has already started tearing down its state.

That leaves selective_channel vulnerable to retrying on partially torn-down
state, including the previously observed null balancer path.

What is changed and the side effects?

Changed:

  • mark the controller as "ending RPC" at the beginning of EndRPC()
  • ignore late SubDone callbacks once the main RPC is already ending
  • keep a defensive null check in schan::Sender::IssueRPC()

This keeps the retry/backup state machine from re-entering after teardown has
started, and also preserves a hard guard at the selective_channel boundary.
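
The guard described above can be illustrated with a small standalone model. This is a minimal single-threaded sketch, not the code in this PR: `MainCall`, `EndRpc`, `IssueRetry`, and `OnSubCallDone` are hypothetical stand-ins for the main controller, `EndRPC()`, `schan::Sender::IssueRPC()`, and `SubDone::Run()`, and the real change has to work with brpc's own synchronization, which is not modeled here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the main RPC's controller; not brpc's real class.
struct MainCall {
    std::atomic<bool> ending{false};  // set once teardown has started
    void* balancer = nullptr;         // stand-in for state released by teardown
    int retries_issued = 0;

    // Models "mark the controller as ending RPC at the beginning of EndRPC()".
    void EndRpc() {
        ending.store(true, std::memory_order_release);
        balancer = nullptr;  // teardown drops the state a retry would need
    }

    // Models the defensive null check kept at the selective_channel boundary.
    bool IssueRetry() {
        if (balancer == nullptr) {
            return false;  // refuse to retry on torn-down state
        }
        ++retries_issued;
        return true;
    }
};

// Models a sub-call completion callback: ignore it once the main RPC is ending.
bool OnSubCallDone(MainCall& main_call) {
    if (main_call.ending.load(std::memory_order_acquire)) {
        return false;  // late callback, do not re-enter retry/backup
    }
    return main_call.IssueRetry();
}

// Usage: a callback before teardown may retry, a late one may not.
void DemoLateCallback() {
    int fake_balancer = 0;
    MainCall call;
    call.balancer = &fake_balancer;
    assert(OnSubCallDone(call));   // normal path: retry is issued
    call.EndRpc();
    assert(!OnSubCallDone(call));  // late path: callback is ignored
    assert(call.retries_issued == 1);
}
```

The two checks are deliberately redundant in this model: the flag stops the state machine early, and the null check still holds if some other path reaches the retry function after teardown.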

Test:
Added a regression test that:

  • uses SelectiveChannel
  • enables backup request and retry
  • lets the main RPC time out first
  • lets delayed sub-calls finish later

This reproduces the late callback window and verifies it no longer re-enters
retry/backup after timeout.
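
The ordering the test relies on can be modeled without brpc as a tiny event simulation. This is only a sketch of the scenario, not the regression test added in this PR: `Event`, `SimResult`, and `Simulate` are hypothetical names, the timestamps are made up, and the model assumes every delayed sub-call completion would otherwise trigger a retry.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class EventKind { kBackupTimer, kMainTimeout, kSubCallDone };

struct Event {
    int at_ms;       // simulated time at which the event fires
    EventKind kind;
};

struct SimResult {
    int sub_calls_issued = 0;         // initial call plus backups and retries
    int reentries_after_timeout = 0;  // retry/backup entered after the timeout
};

// Replays events in time order. With guard_enabled, anything that would issue
// another sub-call after the main timeout is ignored.
SimResult Simulate(std::vector<Event> events, bool guard_enabled) {
    std::stable_sort(events.begin(), events.end(),
                     [](const Event& a, const Event& b) { return a.at_ms < b.at_ms; });
    SimResult result;
    result.sub_calls_issued = 1;  // the first sub-call is sent at time 0
    bool ending = false;
    for (const Event& e : events) {
        switch (e.kind) {
        case EventKind::kMainTimeout:
            ending = true;  // the main RPC starts tearing down
            break;
        case EventKind::kBackupTimer:
        case EventKind::kSubCallDone:
            // Both events try to issue another sub-call (backup or retry).
            if (ending && guard_enabled) break;
            if (ending) ++result.reentries_after_timeout;
            ++result.sub_calls_issued;
            break;
        }
    }
    return result;
}

// Usage: backup fires at 10 ms, the main RPC times out at 50 ms, and two
// delayed sub-calls complete at 80 ms and 90 ms.
void DemoScenario() {
    std::vector<Event> events = {{80, EventKind::kSubCallDone},
                                 {10, EventKind::kBackupTimer},
                                 {50, EventKind::kMainTimeout},
                                 {90, EventKind::kSubCallDone}};
    assert(Simulate(events, true).reentries_after_timeout == 0);
    assert(Simulate(events, false).reentries_after_timeout == 2);
}
```

Running the same event list with the guard disabled shows the window the test is meant to cover: both late completions re-enter retry/backup after the timeout.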

Also re-ran related selective/backup request tests.

Side effects:

  • Performance effects:

  • Breaking backward compatibility:


Check List:

Ignore late selective_channel SubDone callbacks once the main RPC enters
EndRPC, and keep a defensive balancer null check in Sender::IssueRPC.
Add a regression test covering timeout plus delayed sub-call completion.
@hjwsm1989 hjwsm1989 force-pushed the codex/selective-channel-late-subdone-fix branch from c2c7d03 to e27a7e9 Compare June 25, 2026 06:01
@hjwsm1989


@chenBright Hello, could you please help review this?
