Skip to content

fix(protocol): fix panic caused by race condition during invoker destruction#3184

Open
HGD-coder wants to merge 5 commits intoapache:developfrom
HGD-coder:fix-3153
Open

fix(protocol): fix panic caused by race condition during invoker destruction#3184
HGD-coder wants to merge 5 commits intoapache:developfrom
HGD-coder:fix-3153

Conversation

@HGD-coder
Copy link

Description

Fixes #3153

1. The Issue (Race Condition)

As described in #3153, a panic (runtime error: invalid memory address) occurs under high concurrency when ReferenceConfig.Destroy() is triggered:

  1. Destroy() sets bi.url = nil to release resources.
  2. Concurrent calls to GetURL() return nil.
  3. Upper layers (Router, LoadBalance, Cluster, Filter) access methods on the returned nil URL (e.g., url.GetParam()), causing the panic.

2. Solution: Null Object Pattern

This PR introduces a Null Object Pattern to fix the panic safely:

  • Global Sentinel: Adds a thread-safe emptyURL initialized via common.NewURLWithOptions.
  • Safe Access: Modifies GetURL() to return emptyURL instead of nil when bi.url is destroyed.

3. Why we chose this solution (Comparison of Alternatives)

We evaluated three potential solutions. Here is why the Null Object Pattern is the best choice:

Option A: Remove bi.url = nil in Destroy()

  • Idea: Keep the old URL even after destruction.
  • Problem: This creates "Zombie Invokers". Upper layers would still see a valid Protocol/Host/Port and continue routing traffic to a closed client, causing use of closed network connection errors. It also prevents the URL (and its large Map) from being GC'd, causing potential memory leaks.

Option B: Add if url != nil checks at every call site

  • Idea: Modify all callers (Router, Cluster, LoadBalance, etc.) to check for nil.
  • Problem (Unfeasible):
    • "Whac-A-Mole" Risk: GetURL() is a fundamental method used in 50+ files. Missing just one spot means the panic persists.
    • Maintenance Nightmare: It forces every future developer to remember "Check for nil after GetURL", increasing cognitive load.

Option C: Null Object Pattern (This PR)

  • Result: GetURL() never returns nil.
  • Benefit: Callers don't need to change. The emptyURL returns empty parameters, so upper layers (like Router) will naturally filter out this destroyed invoker (e.g., matching params fails gracefully).
  • Cost: Negligible memory (~429 Bytes singleton, verified by benchmark).

4. Verification

Approach Bug #3153 Reproduction Result
Null Object (This PR) Passed (No Panic) Safe & Clean. Logic remains correct (invoker is treated as empty/unavailable).
Remove url=nil Passed (No Panic) Unsafe. Risk of traffic routing to closed connections (Zombies).

5. Request for Review

I have verified that this change passes the reproduction script for #3153.
Could reviewers please help check if introducing this emptyURL (Null Object) might have any unexpected side effects on other modules (e.g., Service Discovery or Metrics)?

Checklist

  • I confirm the target branch is develop
  • Code has passed local testing
  • I have added tests that prove my fix is effective or that my feature works

@codecov-commenter
Copy link

codecov-commenter commented Jan 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 47.83%. Comparing base (489ee6a) to head (d70ed40).
⚠️ Report is 23 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3184      +/-   ##
===========================================
+ Coverage    47.80%   47.83%   +0.02%     
===========================================
  Files          460      463       +3     
  Lines        33011    33726     +715     
===========================================
+ Hits         15782    16133     +351     
- Misses       15963    16295     +332     
- Partials      1266     1298      +32     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Alanxtl Alanxtl linked an issue Jan 29, 2026 that may be closed by this pull request
1 task
@Alanxtl Alanxtl added ☢️ Bug 3.3.2 version 3.3.2 labels Jan 29, 2026
@Alanxtl
Copy link
Contributor

Alanxtl commented Jan 29, 2026

加一下bug发生场景的test case

@Alanxtl
Copy link
Contributor

Alanxtl commented Jan 29, 2026

我觉得这个方法还是太不优雅了,因为上游对于url为nil时候的处理是不一样的,简单粗暴的返回一个空链接我个人觉得不好

只在发生panic的那个上游加一个对于nil的检测可以吗,GetURL这个函数这样改

@HGD-coder
Copy link
Author

HGD-coder commented Jan 29, 2026

image image

我觉得这个方法还是太不优雅了,因为上游对于url为nil时候的处理是不一样的,简单粗暴的返回一个空链接我个人觉得不好

只在发生panic的那个上游加一个对于nil的检测可以吗,GetURL这个函数这样改

1.太多地方用到GetURL()了,它们都是在Destory()增加url=nil前写的。只修改部分未来是否还有风险?
2.这一次panic是一个调用链中多处偶发(protocol/base/rpc_status.go内的invoker.GetURL().Key()和cluster/router/chain/chain.go的invoker.GetURL().ServiceKey()等),一路上用到的GetURL有点多,并不是修改一个上游,我之前尝试过增加url判空。但是碰到下面这种代码,或者说其他我本来就不清楚机制的代码,并不是加一层过滤就能解决,我自己擅自修改,可能引入其他的bug,我感觉改的越多,错的概率越大。

// cluster/router/chain/chain.go
// Route Loop routers in RouterChain and call Route method to determine the target invokers list.
func (c *RouterChain) Route(url *common.URL, invocation base.Invocation) []base.Invoker {
	finalInvokers := make([]base.Invoker, 0, len(c.invokers))
	// multiple invoker may include different methods, find correct invoker otherwise
	// will return the invoker without methods
	for _, invoker := range c.invokers {
		//假设在这里加了url判空
		if invoker.GetURL().ServiceKey() == url.ServiceKey() {
			finalInvokers = append(finalInvokers, invoker)
		}
	}

	// 【危险】因为上面没选出人来,系统决定“死马当活马医”
	// 强制把原始列表(包含那个url==nil的invoker)赋值回来
	if len(finalInvokers) == 0 {
		finalInvokers = c.invokers
	}

	for _, r := range c.copyRouters() {
		finalInvokers = r.Route(finalInvokers, url, invocation)
	}
	return finalInvokers
}

@Alanxtl
Copy link
Contributor

Alanxtl commented Jan 29, 2026

你举的这个例子,如果返回一个“empty”不也是一样的结果吗

@HGD-coder
Copy link
Author

image ![Uploading image.png…]() 我想着用空对象,就是原来的代码应该都有对于url内存在各种内容的匹配。这个示例代码只是说明url=nil的判断可能需要额外的代原始逻辑变更,或者过滤需要在每一层实现;但是空对象在这个情况下和正常对象一样到了下一层,它在下一层不会因为再调用url的方法报panic。但是具体的底层我并不清楚,因为我不知道引入空对象内部的空值会不会和之前的defaultvalue重合判断,所以我希望有经验或者了解底层的人帮忙看看。

@HGD-coder
Copy link
Author

image image 我想着用空对象,就是原来的代码应该都有对于url内存在各种内容的匹配。这个示例代码只是说明url=nil的判断可能需要额外的代原始逻辑变更,或者过滤需要在每一层实现;但是空对象在这个情况下和正常对象一样到了下一层,它在下一层不会因为再调用url的方法报panic。但是具体的底层我并不清楚,因为我不知道引入空对象内部的空值会不会和之前的defaultvalue重合判断,所以我希望有经验或者了解底层的人帮忙看看。

Uploading image.png…

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a panic caused by Invoker.GetURL() returning nil during concurrent invoker destruction by introducing a non-nil placeholder URL.

Changes:

  • Introduces a package-level emptyURL placeholder.
  • Updates BaseInvoker.GetURL() to return emptyURL when the internal URL has been cleared.
  • Updates BaseInvoker tests to assert GetURL() is safe after Destroy().

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
protocol/base/base_invoker.go Adds emptyURL and changes GetURL() to return a non-nil placeholder when destroyed.
protocol/base/base_invoker_test.go Updates assertions to reflect the new non-nil behavior of GetURL() after destroy.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@No-SilverBullet No-SilverBullet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@AlexStocks
Copy link
Contributor

这个 PR 确实比 #3191 更优雅一些,但是需要更广泛的考虑下其他地方也跟着修改下

@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 9, 2026

Copy link
Contributor

@Alanxtl Alanxtl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@AlexStocks
Copy link
Contributor

lgtm

你 LGTM,那就是先把这个 PR 合并,其他受影响的地方后面再提 PR 吗?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3.3.2 version 3.3.2 ☢️ Bug

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] 多次重新构建ReferenceConfig 会发生panic

6 participants