Skip to content

selective_channel: late SubDone may re-enter retry/backup after _lb.reset() and crash on null ChannelBalancer #3358

Description

@hjwsm1989

Describe the bug
We hit a crash in selective_channel on a path where a late SubDone::Run() re-enters the retry/backup flow after Controller::_lb has already been reset.

This report is self-contained on purpose because the original investigation notes are on an internal link that is not publicly accessible.

The crash is not a use-after-free of Controller, Sender, SelectiveChannel, or SubDone themselves. The immediate problem is that _main_cntl->_lb becomes null, while a late callback still goes into IssueRPC() and dereferences the null ChannelBalancer*.

Stack shape

The stack we observed is effectively:

SubDone::Run
  -> Controller::OnVersionedRPCReturned
    -> Controller::IssueRPC
      -> schan::Sender::IssueRPC
        -> static_cast<ChannelBalancer*>(_main_cntl->_lb.get())
          -> ChannelBalancer::SelectChannel(this=0x0)
            -> SharedLoadBalancer::SelectServer(this=0x0)

To Reproduce
Why this looks like a real race in current master
Based on current master behavior in controller.cpp and selective_channel.cpp:
Controller::EndRPC() resets _lb before _done->Run()
Controller::OnVersionedRPCReturned() may directly call IssueRPC() when it decides to continue retry/backup
Controller::IssueRPC() checks _sender, but does not check _lb
schan::Sender::IssueRPC() directly reads _main_cntl->_lb.get() and uses it as ChannelBalancer*

This creates a race window:
One sub-RPC finishes and drives the main RPC into EndRPC()
EndRPC() does _lb.reset()
Another sub-RPC finishes slightly later
Its SubDone::Run() still enters OnVersionedRPCReturned()
Retry/backup is still considered
IssueRPC() is called
Sender::IssueRPC() reads _main_cntl->_lb.get() == nullptr
Null dereference follows

Trigger conditions
The crash becomes much easier to hit when all of the following are true:
all sub-channels fail to establish connections
backup_request_ms > 0
max_retry > 0
multiple sub-RPCs fail or time out nearly at the same time
In our production case this happened under a network failure window where all candidate peers became unreachable, while backup requests and retries were both enabled.

Related issues / PRs checked
I checked the existing public reports and found related-but-different items:
#2454: DynamicPartitionChannel thrift协议使用backup request产生coredump
#1656: Crashed when retry many times
PR #3294: fix(selective_channel): isolate backup responses and add race regression test
Those appear to be different failure modes. In particular, PR #3294 looks like a backup response isolation fix, not a fix for _lb lifetime vs late SubDone retry/backup re-entry.

Suggested fixes
The lowest-risk defensive fix seems to be:
In schan::Sender::IssueRPC(), check _main_cntl->_lb.get() immediately after loading it, and fail gracefully if it is null

Possible additional fixes
In Controller::OnVersionedRPCReturned(), short-circuit retry/backup if _lb == nullptr
Reconsider whether _lb.reset() in Controller::EndRPC() should happen later, e.g. after _done->Run() or together with final teardown

Expected behavior
Late SubDone callbacks should not be able to re-enter retry/backup in a way that dereferences a null ChannelBalancer.

Versions
OS:rocky 9.1
Compiler:gcc 12
brpc: 1.17.0
protobuf:

Additional context/screenshots

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions