fix(proxy): distinguish an upstream no-content stream from a first-byte timeout #186

Merged
g1331 merged 2 commits into
master from
fix/upstream-no-content-stream-error
May 24, 2026

@g1331 g1331 commented May 24, 2026

Summary

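Fixes a semantic bug in the waitForFirstStreamContent error message, which reported the configured threshold as if it were the actual wait time. Previously the streamDoneBeforeContentPromise branch shared FirstByteTimeoutError(timeoutMs) with the real timeout branch. When an upstream SSE stream closed normally within 2–5 seconds without producing a content-bearing chunk, the UI and logs always showed Upstream first byte timed out after 30s, even though the actual wait was only about 3 seconds.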

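  • Adds UpstreamNoContentStreamError. Its constructor takes both elapsedMs (actual elapsed time) and firstByteTimeoutMs (configured threshold), so the message reports what actually happened: Upstream closed SSE stream after 3.58s without producing any content-bearing chunk (first-byte timeout config: 30s)
  • Adds upstream_no_content_stream to FailoverErrorType. getErrorType and isFailoverableError recognize the new class separately; failover behavior is the same as before.
  • Also fills in the first_byte_timeout / stream_idle_timeout / stream_error entries (Chinese and English) that were missing from the retryErrorType dictionary, avoiding a runtime fallback to the raw enum name.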
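
As a rough illustration of the first bullet, the sketch below builds an error whose message carries both numbers. Only the class name, the two constructor inputs, and the message wording come from this PR; the formatSeconds helper and the field layout are assumptions, and the real class may differ.

```typescript
// Hypothetical sketch, not the project's implementation.

/** Render milliseconds as seconds with at most two decimals, e.g. 3580 -> "3.58s". */
function formatSeconds(ms: number): string {
  return `${Number((ms / 1000).toFixed(2))}s`;
}

class UpstreamNoContentStreamError extends Error {
  readonly elapsedMs: number; // how long the proxy actually waited
  readonly firstByteTimeoutMs: number; // the configured threshold, shown for context only

  constructor(elapsedMs: number, firstByteTimeoutMs: number) {
    super(
      `Upstream closed SSE stream after ${formatSeconds(elapsedMs)} ` +
        `without producing any content-bearing chunk ` +
        `(first-byte timeout config: ${formatSeconds(firstByteTimeoutMs)})`,
    );
    this.name = "UpstreamNoContentStreamError";
    this.elapsedMs = elapsedMs;
    this.firstByteTimeoutMs = firstByteTimeoutMs;
  }
}

const sample = new UpstreamNoContentStreamError(3580, 30000);
console.log(sample.message);
```

Keeping both values as fields, rather than only inside the message string, would let callers log or chart the real elapsed time without parsing text.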

Investigation evidence

Measured durations versus error text for six 503s on the production rc-cx-pro upstream:

| reqId | Upstream response | Elapsed | Error message |
| --- | --- | --- | --- |
| b19e6722 | 200 / text/event-stream | 2.23 s | timed out after 30s |
| 44eac652 | 200 / text/event-stream | 3.63 s | timed out after 30s |
| 6bff51e5 | 200 / text/event-stream | 4.55 s | timed out after 30s |
| acfcc4c9 | 200 / text/event-stream | 3.58 s | timed out after 30s |
| 0b0f13f5 | 200 / text/event-stream | 2.58 s | timed out after 30s |
| 2ef5ecd3 | 200 / text/event-stream | 2.68 s | timed out after 30s |

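firstByteTimeout was left at its default of 30 seconds, yet every request finished within 2–5 seconds. The setTimeout timer never fired; the requests hit the streamDoneBeforeContentPromise branch, meaning the upstream completed the SSE stream normally but sent only metadata events.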
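
The two exits described above can be sketched as a three-way race. This is a simplified stand-in under stated assumptions: waitForFirstContentSketch, TimeoutSketchError, and NoContentSketchError are hypothetical names, and the real waitForFirstStreamContent may be structured quite differently.

```typescript
// Hypothetical sketch of "timer fired" versus "stream closed before any content".

class TimeoutSketchError extends Error {
  constructor(timeoutMs: number) {
    super(`Upstream first byte timed out after ${timeoutMs / 1000}s`);
    this.name = "TimeoutSketchError";
  }
}

class NoContentSketchError extends Error {
  readonly elapsedMs: number;
  constructor(elapsedMs: number, timeoutMs: number) {
    super(
      `Upstream closed SSE stream after ${(elapsedMs / 1000).toFixed(2)}s ` +
        `without content (first-byte timeout config: ${timeoutMs / 1000}s)`,
    );
    this.name = "NoContentSketchError";
    this.elapsedMs = elapsedMs;
  }
}

async function waitForFirstContentSketch<T>(
  firstContent: Promise<T>, // resolves with the first content-bearing chunk
  streamDone: Promise<void>, // resolves when the upstream closes the stream
  timeoutMs: number,
): Promise<T> {
  const startedAt = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Exit 1: the timer really fired, so the threshold is the wait time.
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutSketchError(timeoutMs)), timeoutMs);
  });

  // Exit 2: the stream ended first, so report the measured elapsed time.
  const closedEarly = streamDone.then((): never => {
    throw new NoContentSketchError(Date.now() - startedAt, timeoutMs);
  });

  try {
    return await Promise.race([firstContent, timedOut, closedEarly]);
  } finally {
    clearTimeout(timer); // avoid a dangling timer once any branch wins
  }
}
```

In this sketch the elapsed time is measured at the moment the stream closes, so a stream that ends after about 3 seconds can never be reported as a 30-second wait.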

Behavior changes

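  • For the new scenario, error_type changes from first_byte_timeout to upstream_no_content_stream. Failure rules that previously matched first_byte_timeout must add upstream_no_content_stream to keep covering this scenario.
  • Historical request_logs rows keep their original first_byte_timeout value; no migration is performed.
  • No change to recordFailure calls (circuit-breaker behavior stays as is). Counting this error toward the circuit breaker's failure count should be evaluated in a separate PR.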
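
To make the first bullet concrete, here is a minimal sketch of a rule that stops matching after this change and an updated rule that keeps matching. Only the two error_type strings come from this PR; the rule shape, the rule names, and the ruleMatches helper are hypothetical.

```typescript
// Hypothetical sketch: the real failure-rule format is not shown in this PR.

type ErrorTypeSketch = "first_byte_timeout" | "upstream_no_content_stream";

interface FailureRuleSketch {
  name: string;
  matchErrorTypes: ErrorTypeSketch[];
}

/** True when the rule lists the given error type. */
function ruleMatches(rule: FailureRuleSketch, errorType: ErrorTypeSketch): boolean {
  return rule.matchErrorTypes.some((t) => t === errorType);
}

// A rule written before this PR: it only knows the old value.
const oldRule: FailureRuleSketch = {
  name: "slow-upstream",
  matchErrorTypes: ["first_byte_timeout"],
};

// The same rule after adding the new value, so it covers both scenarios.
const updatedRule: FailureRuleSketch = {
  name: "slow-upstream",
  matchErrorTypes: ["first_byte_timeout", "upstream_no_content_stream"],
};

console.log(ruleMatches(oldRule, "upstream_no_content_stream"));
console.log(ruleMatches(updatedRule, "upstream_no_content_stream"));
```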

Test plan

  • pnpm exec tsc --noEmit
  • pnpm lint
  • pnpm format:check
  • pnpm test:run (147 files / 2487 test cases, all green; the path B case in tests/unit/services/proxy-client.test.ts now expects the new error class)

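Previously the streamDoneBeforeContentPromise branch of
waitForFirstStreamContent also threw FirstByteTimeoutError(timeoutMs). When an
upstream SSE stream closed normally within 2 to 5 seconds without producing a
content-bearing chunk, the log always showed "Upstream first byte timed out
after 30s", rendering the configured threshold as the actual wait time.
Production logs reproduced this misleading message repeatedly.

Adds UpstreamNoContentStreamError to express the real semantics, "the stream
ended but carried no content"; its constructor takes both the actual elapsed
time and the configured threshold. FailoverErrorType gains
upstream_no_content_stream, and getErrorType / isFailoverableError recognize
the class separately, so failover behavior stays the same.

Also fills in the first_byte_timeout, stream_idle_timeout, and stream_error
entries (Chinese and English) that were missing from the retryErrorType
dictionary, avoiding a runtime fallback to the raw enum name.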
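
The dictionary fallback mentioned in the last paragraph can be pictured with the sketch below. The three keys come from this PR; the label strings, the retryErrorTypeLabels name, and the labelFor helper are assumptions, since the project's i18n code is not shown here.

```typescript
// Hypothetical sketch of a label lookup that falls back to the raw key.

const retryErrorTypeLabels: Record<string, string> = {
  first_byte_timeout: "First-byte timeout",
  stream_idle_timeout: "Stream idle timeout",
  stream_error: "Stream error",
};

/** Return the translated label, or the raw enum name when no entry exists. */
function labelFor(errorType: string, dict: Record<string, string>): string {
  return dict[errorType] ?? errorType;
}

// With the entry present, the reader sees a readable label.
console.log(labelFor("stream_idle_timeout", retryErrorTypeLabels));

// With the entry missing (the situation before this commit), the raw name leaks through.
console.log(labelFor("stream_idle_timeout", {}));
```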

codecov Bot commented May 24, 2026

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 74.15%. Comparing base (d565097) to head (a4b3d72).
⚠️ Report is 1 commits behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #186      +/-   ##
==========================================
+ Coverage   74.14%   74.15%   +0.01%     
==========================================
  Files         145      145              
  Lines       11043    11048       +5     
  Branches     3832     3832              
==========================================
+ Hits         8188     8193       +5     
  Misses       1657     1657              
  Partials     1198     1198              
Flag Coverage Δ
verify 74.15% <83.33%> (+0.01%) ⬆️

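routing-decision-timeline.tsx now groups upstream_no_content_stream with the
timeout types, using the same Clock icon and warning color, so it does not sit
in the same family as first_byte_timeout / stream_idle_timeout with a
different color.

failover-circuit.md adds UpstreamNoContentStreamError to the list of exception
classes, along with a paragraph explaining how its semantics differ from
FirstByteTimeoutError. It also corrects the line-number range cited for
isFailoverableError, which shifted when the previous commit expanded the
function body.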
@g1331 g1331 merged commit c203793 into master May 24, 2026
14 checks passed
@g1331 g1331 deleted the fix/upstream-no-content-stream-error branch May 24, 2026 12:33
g1331 added a commit that referenced this pull request May 25, 2026
…volume mount status (#167, #188)

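A self-review spot check found four factual statements that do not match the current state of the repository; this commit corrects all of them: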

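1. database.md / upgrade-rollback.md: the statements that "the container does not run migrations automatically" and "the deployer must manually trigger pnpm db:migrate" contradict the current scripts/docker-entrypoint.sh.
   Before the application starts, that entrypoint automatically runs an embedded migration runner (it does not depend on drizzle-kit, applies drizzle/*.sql in filename order, and deduplicates via the __drizzle_migrations table). The docs now describe the automatic apply behavior, and the destructive-migration section is rewritten as "the entrypoint still forward-applies; rollback must rely on pg_dump".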

2. database.md / upgrade-rollback.md: the suggested `docker compose exec autorouter node node_modules/drizzle-kit/bin.cjs migrate` cannot run inside the production image.
   The Dockerfile's standalone runner stage copies only the postgres package from node_modules, and drizzle-kit is a devDependency that does not ship in the image. The docs now recommend either restarting autorouter so the entrypoint runs again, or using a throwaway container that decouples the entrypoint from server.js: "docker run --rm --entrypoint /app/docker-entrypoint.sh ghcr.io/g1331/autorouter:vN.N.N true".

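3. persistence-backup.md: the statement that "RECORDER_FIXTURES_DIR is usually mounted into the autorouter-data named volume (as in the default compose setup)" is wrong.
   In docker-compose.yml the default value of RECORDER_FIXTURES_DIR is `tests/fixtures`, which is relative to /app/ inside the container. Data is therefore written to /app/tests/fixtures, which is not on any named volume and is lost when the container is rebuilt. Adds a ::: danger ::: warning container and spells out the fix: explicitly point RECORDER_FIXTURES_DIR at /app/data/....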

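4. contributing.md: the statement that "squash merge is recommended" conflicts with the repository's actual merge-commit history (PRs #184/#185/#186 all landed as "Merge pull request" commits). It now states that "recent history consists mainly of merge commits, and cliff.toml explicitly skips such commits", leaving the choice of strategy to the reviewer.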

The source cross-reference section also adds scripts/docker-entrypoint.sh and the Dockerfile as supporting evidence.
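
For item 1, the "apply in filename order, skip what is already recorded" behavior can be sketched as a pure selection step. This is an illustration only: pendingMigrations is a hypothetical name, and how the real runner identifies an applied migration in __drizzle_migrations (by name, hash, or timestamp) is not stated here, so the sketch assumes file names.

```typescript
// Hypothetical sketch of choosing which migration files still need to run.

/**
 * Return the .sql files that have not been applied yet, in filename order.
 * `applied` stands in for whatever the tracking table records.
 */
function pendingMigrations(files: string[], applied: Set<string>): string[] {
  return files
    .filter((f) => f.endsWith(".sql") && !applied.has(f))
    .sort(); // default sort compares UTF-16 code units, which orders zero-padded prefixes correctly
}

// Illustrative file names, not the project's actual migrations.
const onDisk = ["0002_b.sql", "0000_a.sql", "0001_c.sql", "README.md"];
const alreadyApplied = new Set(["0000_a.sql"]);

console.log(pendingMigrations(onDisk, alreadyApplied));
```

Because the selection only ever moves forward through the list, a sketch like this has no notion of undoing a migration, which is consistent with the note that rollback has to rely on pg_dump.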