fix(scheduler): add health-based escape to session_hash sticky so a slow account no longer monopolizes user sessions #2872
Open
wucm667 wants to merge 1 commit into
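Background
The OpenAI advanced scheduler currently selects an account in three tiers: `previous_response_id` -> `session_hash` sticky -> load balance. `previous_response_id` must stay hard-sticky to remain compatible with upstream response history, so this PR leaves that path unchanged.

The problem is in `session_hash` sticky: on a hit it returns the sticky account directly, without checking runtime health, and it keeps returning that account's `WaitPlan` even when the account's concurrency is already full. When an account's TTFT degrades noticeably, its error rate rises, or its concurrency is saturated, the user stays bound to that slow or degraded account for a long time and cannot naturally escape to a healthier one.

Changes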
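After `selectBySessionHash(...)` hits a sticky account and before it actually returns, it now consults the existing runtime statistics via `snapshot(accountID)`, adding a health gate to session sticky.

- Configuration under `gateway.openai_scheduler`:
  - `sticky_escape_enabled`: default `true`; operators can set it to `false` to revert to the old behavior in one step.
  - `sticky_escape_ttft_ms`: default `15000`.
  - `sticky_escape_error_rate`: default `0.5`.
- Escape conditions:
  - `hasTTFT && ttft > sticky_escape_ttft_ms`
  - `errorRate > sticky_escape_error_rate`
- A `sticky_escape_triggered` log entry carries `account_id`, `reason`, `error_rate`, and `ttft`, so operators can locate the affected account.
- The persistent `session_hash -> account_id` binding is left as-is and is not rewritten after this request's load balance succeeds; the gate only skips the sticky account for the current request.
- The `previous_response_id` path is unchanged, and the load balance scoring logic is unchanged.

Rationale for the default thresholds: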
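- The `15000` ms TTFT threshold is loose enough that minor fluctuations do not cause frequent escapes, yet it promptly releases sessions stuck behind long tails of tens of seconds.
- The `0.5` error-rate threshold means the EWMA must show clear degradation before the gate triggers, so it does not overreact to brief jitter.

Performance impact:
`snapshot()` only performs atomic loads plus `math.Float64frombits`, a nanosecond-scale read, so it raises no additional performance concern.

Compatibility:
- The hard-sticky `previous_response_id` path is completely unchanged.
- With `sticky_escape_enabled=false`, the scheduler fully reverts to the old behavior.

Tests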
Executed:
- `cd backend && go test -tags=unit ./internal/service/... -run OpenAIAccountScheduler`
- `cd backend && go test -tags=unit ./...`
- `cd backend && golangci-lint run ./...`

New coverage:
- The old behavior is preserved when `sticky_escape_enabled=false`

No concurrent PR working in the same direction ("sticky health escape / session_hash sticky escape") was found.
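For reviewers, here is a minimal sketch of the health gate this PR describes, not the actual implementation. It assumes that either escape condition alone triggers the escape. The names `StickyEscapeConfig`, `AccountStats`, `shouldEscapeSticky`, and `escapeLogLine` are hypothetical, as is the log line layout; only the option names, the defaults, the two conditions, and the log fields come from the description. The full-concurrency case is left out.

```go
package main

import "fmt"

// StickyEscapeConfig holds the three options listed under
// gateway.openai_scheduler. The Go field names are hypothetical.
type StickyEscapeConfig struct {
	Enabled   bool    // sticky_escape_enabled (default true)
	TTFTMs    float64 // sticky_escape_ttft_ms (default 15000)
	ErrorRate float64 // sticky_escape_error_rate (default 0.5)
}

// AccountStats is a hypothetical stand-in for the values read
// from snapshot(accountID).
type AccountStats struct {
	ErrorRate float64 // EWMA error rate in [0, 1]
	TTFTMs    float64 // time to first token, in milliseconds
	HasTTFT   bool    // false until a TTFT sample has been recorded
}

// shouldEscapeSticky reports whether the sticky account should be skipped
// for the current request, and which condition fired.
// Assumption: either condition alone is sufficient.
func shouldEscapeSticky(cfg StickyEscapeConfig, s AccountStats) (bool, string) {
	if !cfg.Enabled {
		return false, "" // old behavior: always return the sticky account
	}
	if s.HasTTFT && s.TTFTMs > cfg.TTFTMs {
		return true, "ttft"
	}
	if s.ErrorRate > cfg.ErrorRate {
		return true, "error_rate"
	}
	return false, ""
}

// escapeLogLine renders the fields named in the description.
// The key=value layout is an assumption for illustration.
func escapeLogLine(accountID int64, reason string, s AccountStats) string {
	return fmt.Sprintf("sticky_escape_triggered account_id=%d reason=%s error_rate=%.2f ttft=%.0f",
		accountID, reason, s.ErrorRate, s.TTFTMs)
}

func main() {
	cfg := StickyEscapeConfig{Enabled: true, TTFTMs: 15000, ErrorRate: 0.5}
	slow := AccountStats{ErrorRate: 0.1, TTFTMs: 32000, HasTTFT: true}

	if escape, reason := shouldEscapeSticky(cfg, slow); escape {
		// Skip the sticky account for this request only and fall through
		// to load balance; the session_hash binding is not modified.
		fmt.Println(escapeLogLine(42, reason, slow))
	}
}
```

Both comparisons are strict, so an account sitting exactly at a threshold stays sticky. An account with no TTFT sample yet can only escape through the error-rate condition.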
Fixes #2859.
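As a note on the performance claim, the sketch below shows one way a lock-free read of this kind can look: each `float64` is stored as raw bits in an atomic integer and decoded with `math.Float64frombits`. The type `runtimeStat`, its methods, and the EWMA smoothing factor `ewmaAlpha = 0.2` are assumptions for illustration and are not taken from the scheduler's code.

```go
package main

import (
	"fmt"
	"math"
	"sync/atomic"
)

// ewmaAlpha is a hypothetical smoothing factor, not the scheduler's value.
const ewmaAlpha = 0.2

// runtimeStat is a hypothetical per-account record. float64 values are kept
// as raw IEEE 754 bits so they can be read and written atomically.
type runtimeStat struct {
	errorRateBits atomic.Uint64
	ttftBits      atomic.Uint64
	hasTTFT       atomic.Bool
}

// statSnapshot is the decoded view returned by snapshot.
type statSnapshot struct {
	ErrorRate float64
	TTFTMs    float64
	HasTTFT   bool
}

func (v statSnapshot) String() string {
	return fmt.Sprintf("error_rate=%.3f ttft_ms=%.0f has_ttft=%t", v.ErrorRate, v.TTFTMs, v.HasTTFT)
}

// observeResult folds one request outcome into the EWMA error rate using a
// compare-and-swap loop, so concurrent writers never lose an update.
func (s *runtimeStat) observeResult(failed bool) {
	sample := 0.0
	if failed {
		sample = 1.0
	}
	for {
		oldBits := s.errorRateBits.Load()
		next := ewmaAlpha*sample + (1-ewmaAlpha)*math.Float64frombits(oldBits)
		if s.errorRateBits.CompareAndSwap(oldBits, math.Float64bits(next)) {
			return
		}
	}
}

// recordTTFT stores the latest TTFT sample in milliseconds.
func (s *runtimeStat) recordTTFT(ms float64) {
	s.ttftBits.Store(math.Float64bits(ms))
	s.hasTTFT.Store(true)
}

// snapshot performs only atomic loads and bit decoding; it takes no lock.
// The fields are loaded independently, so they may come from slightly
// different moments in time.
func (s *runtimeStat) snapshot() statSnapshot {
	return statSnapshot{
		ErrorRate: math.Float64frombits(s.errorRateBits.Load()),
		TTFTMs:    math.Float64frombits(s.ttftBits.Load()),
		HasTTFT:   s.hasTTFT.Load(),
	}
}

func main() {
	var st runtimeStat
	st.recordTTFT(1800)
	for i := 0; i < 4; i++ {
		st.observeResult(true)
	}
	fmt.Println(st.snapshot())
}
```

With this hypothetical factor, an account starting from a zero error rate needs four consecutive failures before its EWMA exceeds `0.5` (0.2, 0.36, 0.488, 0.5904). This illustrates why such a threshold is insensitive to a single failed request.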