selective_channel: late SubDone may re-enter retry/backup after _lb.reset() and crash on null ChannelBalancer

**Describe the bug**
We hit a crash in `selective_channel` on a path where a late `SubDone::Run()` re-enters the retry/backup flow after `Controller::_lb` has already been reset.

This report is self-contained on purpose because the original investigation notes are on an internal link that is not publicly accessible.

The crash is not a use-after-free of `Controller`, `Sender`, `SelectiveChannel`, or `SubDone` themselves. The immediate problem is that `_main_cntl->_lb` becomes null, while a late callback still goes into `IssueRPC()` and dereferences the null `ChannelBalancer*`.

### Stack shape

The stack we observed is effectively:

```text
SubDone::Run
  -> Controller::OnVersionedRPCReturned
    -> Controller::IssueRPC
      -> schan::Sender::IssueRPC
        -> static_cast<ChannelBalancer*>(_main_cntl->_lb.get())
          -> ChannelBalancer::SelectChannel(this=0x0)
            -> SharedLoadBalancer::SelectServer(this=0x0)
```

**To Reproduce**
Why this looks like a real race in current master
Based on current master behavior in controller.cpp and selective_channel.cpp:
Controller::EndRPC() resets _lb before _done->Run()
Controller::OnVersionedRPCReturned() may directly call IssueRPC() when it decides to continue retry/backup
Controller::IssueRPC() checks _sender, but does not check _lb
schan::Sender::IssueRPC() directly reads _main_cntl->_lb.get() and uses it as ChannelBalancer*

This creates a race window:
One sub-RPC finishes and drives the main RPC into EndRPC()
EndRPC() does _lb.reset()
Another sub-RPC finishes slightly later
Its SubDone::Run() still enters OnVersionedRPCReturned()
Retry/backup is still considered
IssueRPC() is called
Sender::IssueRPC() reads _main_cntl->_lb.get() == nullptr
Null dereference follows

**Trigger conditions**
The crash becomes much easier to hit when all of the following are true:
all sub-channels fail to establish connections
backup_request_ms > 0
max_retry > 0
multiple sub-RPCs fail or time out nearly at the same time
In our production case this happened under a network failure window where all candidate peers became unreachable, while backup requests and retries were both enabled.

**Related issues / PRs checked**
I checked the existing public reports and found related-but-different items:
#2454: DynamicPartitionChannel thrift协议使用backup request产生coredump
#1656: Crashed when retry many times
PR #3294: fix(selective_channel): isolate backup responses and add race regression test
Those appear to be different failure modes. In particular, PR #3294 looks like a backup response isolation fix, not a fix for _lb lifetime vs late SubDone retry/backup re-entry.

**Suggested fixes**
The lowest-risk defensive fix seems to be:
In schan::Sender::IssueRPC(), check _main_cntl->_lb.get() immediately after loading it, and fail gracefully if it is null

**Possible additional fixes**
In Controller::OnVersionedRPCReturned(), short-circuit retry/backup if _lb == nullptr
Reconsider whether _lb.reset() in Controller::EndRPC() should happen later, e.g. after _done->Run() or together with final teardown

**Expected behavior**
Late SubDone callbacks should not be able to re-enter retry/backup in a way that dereferences a null ChannelBalancer.


**Versions**
OS:rocky 9.1
Compiler:gcc 12
brpc: 1.17.0
protobuf: 

**Additional context/screenshots**



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

selective_channel: late SubDone may re-enter retry/backup after _lb.reset() and crash on null ChannelBalancer #3358

Stack shape

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

selective_channel: late SubDone may re-enter retry/backup after _lb.reset() and crash on null ChannelBalancer #3358

Description

Stack shape

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions