Describe the bug
We hit a crash in selective_channel on a path where a late SubDone::Run() re-enters the retry/backup flow after Controller::_lb has already been reset.
This report is self-contained on purpose because the original investigation notes are on an internal link that is not publicly accessible.
The crash is not a use-after-free of Controller, Sender, SelectiveChannel, or SubDone themselves. The immediate problem is that _main_cntl->_lb becomes null, while a late callback still goes into IssueRPC() and dereferences the null ChannelBalancer*.
Stack shape
The stack we observed is effectively:
SubDone::Run
-> Controller::OnVersionedRPCReturned
-> Controller::IssueRPC
-> schan::Sender::IssueRPC
-> static_cast<ChannelBalancer*>(_main_cntl->_lb.get())
-> ChannelBalancer::SelectChannel(this=0x0)
-> SharedLoadBalancer::SelectServer(this=0x0)
To Reproduce
Why this looks like a real race in current master
Based on current master behavior in controller.cpp and selective_channel.cpp:
Controller::EndRPC() resets _lb before _done->Run()
Controller::OnVersionedRPCReturned() may directly call IssueRPC() when it decides to continue retry/backup
Controller::IssueRPC() checks _sender, but does not check _lb
schan::Sender::IssueRPC() directly reads _main_cntl->_lb.get() and uses it as ChannelBalancer*
This creates a race window:
One sub-RPC finishes and drives the main RPC into EndRPC()
EndRPC() does _lb.reset()
Another sub-RPC finishes slightly later
Its SubDone::Run() still enters OnVersionedRPCReturned()
Retry/backup is still considered
IssueRPC() is called
Sender::IssueRPC() reads _main_cntl->_lb.get() == nullptr
Null dereference follows
Trigger conditions
The crash becomes much easier to hit when all of the following are true:
all sub-channels fail to establish connections
backup_request_ms > 0
max_retry > 0
multiple sub-RPCs fail or time out nearly at the same time
In our production case this happened under a network failure window where all candidate peers became unreachable, while backup requests and retries were both enabled.
Related issues / PRs checked
I checked the existing public reports and found related-but-different items:
#2454: DynamicPartitionChannel thrift协议使用backup request产生coredump
#1656: Crashed when retry many times
PR #3294: fix(selective_channel): isolate backup responses and add race regression test
Those appear to be different failure modes. In particular, PR #3294 looks like a backup response isolation fix, not a fix for _lb lifetime vs late SubDone retry/backup re-entry.
Suggested fixes
The lowest-risk defensive fix seems to be:
In schan::Sender::IssueRPC(), check _main_cntl->_lb.get() immediately after loading it, and fail gracefully if it is null
Possible additional fixes
In Controller::OnVersionedRPCReturned(), short-circuit retry/backup if _lb == nullptr
Reconsider whether _lb.reset() in Controller::EndRPC() should happen later, e.g. after _done->Run() or together with final teardown
Expected behavior
Late SubDone callbacks should not be able to re-enter retry/backup in a way that dereferences a null ChannelBalancer.
Versions
OS:rocky 9.1
Compiler:gcc 12
brpc: 1.17.0
protobuf:
Additional context/screenshots
Describe the bug
We hit a crash in
selective_channelon a path where a lateSubDone::Run()re-enters the retry/backup flow afterController::_lbhas already been reset.This report is self-contained on purpose because the original investigation notes are on an internal link that is not publicly accessible.
The crash is not a use-after-free of
Controller,Sender,SelectiveChannel, orSubDonethemselves. The immediate problem is that_main_cntl->_lbbecomes null, while a late callback still goes intoIssueRPC()and dereferences the nullChannelBalancer*.Stack shape
The stack we observed is effectively:
To Reproduce
Why this looks like a real race in current master
Based on current master behavior in controller.cpp and selective_channel.cpp:
Controller::EndRPC() resets _lb before _done->Run()
Controller::OnVersionedRPCReturned() may directly call IssueRPC() when it decides to continue retry/backup
Controller::IssueRPC() checks _sender, but does not check _lb
schan::Sender::IssueRPC() directly reads _main_cntl->_lb.get() and uses it as ChannelBalancer*
This creates a race window:
One sub-RPC finishes and drives the main RPC into EndRPC()
EndRPC() does _lb.reset()
Another sub-RPC finishes slightly later
Its SubDone::Run() still enters OnVersionedRPCReturned()
Retry/backup is still considered
IssueRPC() is called
Sender::IssueRPC() reads _main_cntl->_lb.get() == nullptr
Null dereference follows
Trigger conditions
The crash becomes much easier to hit when all of the following are true:
all sub-channels fail to establish connections
backup_request_ms > 0
max_retry > 0
multiple sub-RPCs fail or time out nearly at the same time
In our production case this happened under a network failure window where all candidate peers became unreachable, while backup requests and retries were both enabled.
Related issues / PRs checked
I checked the existing public reports and found related-but-different items:
#2454: DynamicPartitionChannel thrift协议使用backup request产生coredump
#1656: Crashed when retry many times
PR #3294: fix(selective_channel): isolate backup responses and add race regression test
Those appear to be different failure modes. In particular, PR #3294 looks like a backup response isolation fix, not a fix for _lb lifetime vs late SubDone retry/backup re-entry.
Suggested fixes
The lowest-risk defensive fix seems to be:
In schan::Sender::IssueRPC(), check _main_cntl->_lb.get() immediately after loading it, and fail gracefully if it is null
Possible additional fixes
In Controller::OnVersionedRPCReturned(), short-circuit retry/backup if _lb == nullptr
Reconsider whether _lb.reset() in Controller::EndRPC() should happen later, e.g. after _done->Run() or together with final teardown
Expected behavior
Late SubDone callbacks should not be able to re-enter retry/backup in a way that dereferences a null ChannelBalancer.
Versions
OS:rocky 9.1
Compiler:gcc 12
brpc: 1.17.0
protobuf:
Additional context/screenshots