
fix(ws): Improve Responses WebSocket session lifecycle and test coverage #1154

Merged
ding113 merged 22 commits into ding113:dev from tesgth032:fix/responses-ws-close-paths-e2e-20260504
May 4, 2026

Conversation

@tesgth032 tesgth032 commented May 3, 2026

Summary

Completes the WebSocket close path fixes for /v1/responses inbound bridge by standardizing all close operations through the requestClose() helper. Adds regression tests covering binary frame handling, pipelined frame dropping, and pending queue overflow scenarios. Includes an opt-in E2E transport probe for Codex CLI to document and validate transport behavior.

Related Issues

Problem

PR #1153 introduced requestClose() to ensure proper WebSocket closing handshake, but several error paths still used inline ws.close() try/catch blocks:

  1. drain().catch() error handler called ws.close() directly, bypassing the synchronous closed=true and pending queue cleanup
  2. Binary frame rejection paths had inconsistent close handling
  3. Queue overflow paths could race with pending frame processing

Additionally, there was no way to verify whether Codex CLI actually uses WebSocket or HTTP transport when feature flags are enabled.

Solution

Core Changes

  • Standardized all close paths in server.js to use requestClose() helper:
    • Binary frame rejection (code 1003)
    • drain().catch() error handler (code 1011)
    • Queue overflow (code 1008)
    • Request processing errors (code 1011)
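
The close paths listed above can be sketched as one helper that every caller goes through. This is a minimal illustration, not the code in server.js: `createConnectionState` and the stand-in socket are hypothetical, while `requestClose`, `closed`, the pending queue, and `currentInternalReq` are names taken from this PR's discussion. The in-flight abort reflects what review feedback on this PR asked for.

```javascript
// Hypothetical sketch of a centralized close helper (assumed shape, not server.js).
function createConnectionState(ws) {
  const state = {
    closed: false,
    pending: [],              // queued response.create frames
    currentInternalReq: null, // in-flight internal HTTP request, if any
  };

  function requestClose(code, reason) {
    if (state.closed) return false; // idempotent: later calls are no-ops
    state.closed = true;            // set synchronously so queued frames get dropped
    state.pending.length = 0;       // clear the pending queue in place
    if (state.currentInternalReq) {
      // Abort in-flight work so it stops once the client can no longer receive it.
      state.currentInternalReq.destroy();
      state.currentInternalReq = null;
    }
    try {
      ws.close(code, reason); // start the closing handshake
    } catch {
      // the socket may already be closing; nothing left to do
    }
    return true;
  }

  return { state, requestClose };
}

// Usage with a stand-in socket object.
const closeCalls = [];
const fakeWs = { close: (code, reason) => closeCalls.push([code, reason]) };
const conn = createConnectionState(fakeWs);
let destroyed = false;
conn.state.pending.push("frame-1", "frame-2");
conn.state.currentInternalReq = { destroy: () => { destroyed = true; } };
const first = conn.requestClose(1008, "too_many_requests");
const second = conn.requestClose(1011, "internal_error");
```

Keeping the flag, the queue cleanup, and the abort in one place means a new error path cannot forget one of the three steps.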

Testing Improvements

  • Added regression test: drops pipelined frames after binary protocol close
  • Added regression test: clears pending queue synchronously on overflow close
  • Fixed test environment cleanup (restore PORT, HOSTNAME, NODE_ENV)

E2E Transport Probe

  • New opt-in E2E test: tests/e2e/responses-ws-codex-cli-transport.test.ts
  • Records whether Codex CLI uses HTTP or WebSocket transport
  • Configurable via CCH_CODEX_E2E_EXPECT_TRANSPORT env var (any, http, websocket)
  • Documents current behavior: Codex CLI 0.128.0 uses HTTP even with WebSocket feature flags enabled
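
A rough sketch of how the `CCH_CODEX_E2E_EXPECT_TRANSPORT` setting could be interpreted. The function names are hypothetical, and treating an unset or empty value as `any` is an assumption rather than something the PR states.

```javascript
// Hypothetical helpers; only the env var name and its three values come from the PR.
const ALLOWED_TRANSPORTS = new Set(["any", "http", "websocket"]);

function parseExpectedTransport(raw) {
  // Assumption: an unset or empty variable means "any".
  const value = (raw ?? "").trim().toLowerCase() || "any";
  if (!ALLOWED_TRANSPORTS.has(value)) {
    throw new Error(`Unsupported CCH_CODEX_E2E_EXPECT_TRANSPORT value: ${raw}`);
  }
  return value;
}

function transportMatches(expected, observed) {
  // "any" only records what was observed and never fails the run.
  return expected === "any" || expected === observed;
}

const expectedTransport = parseExpectedTransport(process.env.CCH_CODEX_E2E_EXPECT_TRANSPORT);
```

With `any`, the probe stays a recording tool; `http` or `websocket` turns the observation into a hard assertion.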

Changes

File | Change
server.js | Standardize 5 close paths to use requestClose() helper
tests/unit/server-ws-close-handshake.test.ts | +2 regression tests, improved test cleanup
tests/e2e/responses-ws-codex-cli-transport.test.ts | +410 lines, new opt-in E2E transport probe

Verification

# Lint and typecheck
npx biome check server.js tests/unit/server-ws-close-handshake.test.ts tests/e2e/responses-ws-codex-cli-transport.test.ts
npx vitest run src/app/v1/_lib/responses-ws/__tests__/*.ts tests/unit/server-ws-close-handshake.test.ts
npm run typecheck
npm run build

# E2E probe (opt-in, skipped by default)
npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts

# Force HTTP expectation
CCH_CODEX_E2E=1 CCH_CODEX_E2E_EXPECT_TRANSPORT=http npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts

Test Results

  • Unit tests: 5843 passed (13 skipped)
  • 2 pre-existing k8s shell helper failures unrelated to this change
  • New regression tests cover binary frame rejection and queue overflow scenarios

E2E Observations

Local testing with Codex CLI 0.128.0 shows that even with responses_websockets and responses_websockets_v2 feature flags enabled, Codex still sends HTTP POST /v1/responses instead of WebSocket upgrade. The E2E probe preserves this evidence and allows future validation when Codex builds support true WebSocket transport.

Checklist

  • Code follows project conventions (Biome formatting)
  • Self-review completed
  • Unit tests added for regression scenarios
  • E2E test added (opt-in, does not affect CI)
  • TypeScript type checking passes
  • Build succeeds

Description enhanced by Claude AI

Greptile Summary

This PR completes the WebSocket close-path standardization from #1153 by routing all ws.close() call sites through requestClose(), and introduces the larger feature of persistent upstream WebSocket sessions keyed by a per-client UUID (x-cch-responses-ws-session). Terminal events (success and error) no longer close the client connection; the upstream WS is pooled and reused across response.create turns for the same client. The forwarder.ts eligibility check is correctly moved to session.headers (the inbound headers that carry the internal secret) rather than processedHeaders (the filtered outgoing headers).

Confidence Score: 4/5

Safe to merge with minor follow-ups; no P0/P1 bugs found beyond already-tracked previous review threads.

All critical close paths are now routed through requestClose(); the persistent session pool logic handles concurrent-busy, fingerprint-mismatch, and eviction cases correctly. The forwarder.ts eligibility bug fix (session.headers vs processedHeaders) is a meaningful correctness improvement. Remaining findings are all P2.

server.js (non-spec status field, missing ws_turn_completed log for JSON path); src/app/v1/_lib/responses-ws/upstream-adapter.ts (permanent close/error listeners on pooled sockets)

Important Files Changed

Filename | Overview
server.js | Major refactor: all 5 close paths standardized to requestClose(); connection is now persistent (no close after terminal events); new per-connection UUID session header; res.on("close") guard added; minor non-spec status field added to error events.
src/app/v1/_lib/responses-ws/upstream-adapter.ts | Large new feature: per-client persistent upstream WS session pool (max 512), connection fingerprinting, idle timers, session lifecycle via globalThis hook. Concurrent-busy case handled correctly with canRetainFreshSession=false. finishRequest idempotency looks correct. Minor arm+cancel timer inefficiency on error-terminal completions.
tests/unit/server-ws-close-handshake.test.ts | Tests updated to reflect persistent WS semantics; two new regression tests for binary-close drain-dropping and overflow abort. Previously-flagged fixed-delay timing issues remain in the new tests, deferred from prior review.
tests/e2e/responses-ws-codex-cli-transport.test.ts | New opt-in E2E transport probe for Codex CLI; correctly skipped by default; previously-flagged sawCleanWsClose always-false bug remains unresolved.
src/app/v1/_lib/proxy/forwarder.ts | Fixes eligibility check to use session.headers instead of processedHeaders (correct since internal secret is in inbound headers, not outgoing proxy headers); passes sessionId and endpointId to tryResponsesWebsocketUpstream for persistent session reuse.
src/app/v1/_lib/responses-ws/eligibility.ts | New getResponsesWsSessionId() helper with length and charset validation; UUID values correctly pass the /^[\w.-]+$/ check.
src/app/v1/_lib/responses-ws/internal-secret.ts | Adds RESPONSES_WS_SESSION_HEADER constant and includes it in RESERVED_INTERNAL_HEADERS, ensuring the session header is stripped before forwarding to upstream.

Sequence Diagram

sequenceDiagram
    participant C as Codex CLI (client WS)
    participant S as server.js
    participant N as Next.js app
    participant U as Upstream WS (OpenAI)

    C->>S: WS upgrade /v1/responses
    S->>S: generate responsesWsSessionId (UUID)

    C->>S: response.create (turn 1)
    S->>N: POST /v1/responses + x-cch-responses-ws-session
    N->>U: WS upgrade (new persistent session)
    U-->>N: SSE events (response.created … response.completed)
    N-->>S: SSE stream
    S-->>C: forward events (WS stays OPEN)
    N->>N: persistentSessions.set(sessionId, entry)

    C->>S: response.create (turn 2)
    S->>N: POST /v1/responses + x-cch-responses-ws-session
    N->>U: send frame (REUSED upstream WS)
    U-->>N: SSE events
    N-->>S: SSE stream
    S-->>C: forward events (WS stays OPEN)

    C->>S: WS close(1000)
    S->>S: finalize() → abortCurrentInternalReq + cleanupUpstreamWsSession
    S->>U: WS close(1000) via globalThis.__cchCleanupResponsesWsSession

Comments Outside Diff (1)

  1. src/app/v1/_lib/responses-ws/upstream-adapter.ts, line 724-731 (link)

    P1 Busy-session branch silently evicts in-flight map entry on next request

    When existing.active && !isWsClosingOrClosed(existing.ws) is true, the code only logs a warning and falls through so a fresh WS is opened. Later, registerPersistentSession calls persistentSessions.set(sessionId, entry), which overwrites the map while the old persistentEntry is still active.

    When the old request's finishRequest() runs, persistentSessions.get(sessionId) === persistentEntry is now false (retainedForReuse = false), so closeDetachedEntry = !options?.closeCode becomes true and the old WS is closed with code 1000 mid-stream. From the upstream's perspective the connection was closed cleanly during a live response, which can cause a truncated stream rather than a clean terminal event.
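
    The identity check described in this comment can be reproduced with a plain Map. The sketch below is illustrative only: `isRetainedForReuse` and `registerIfIdle` are hypothetical names, the entries are stand-in objects, and the guard at the end is one possible mitigation rather than the fix adopted in upstream-adapter.ts.

```javascript
const persistentSessions = new Map();

function registerPersistentSession(sessionId, entry) {
  persistentSessions.set(sessionId, entry);
}

// True only while the map still points at this exact entry object.
function isRetainedForReuse(sessionId, entry) {
  return persistentSessions.get(sessionId) === entry;
}

const oldEntry = { ws: "upstream-ws-1", active: true };
registerPersistentSession("session-a", oldEntry);
const retainedBefore = isRetainedForReuse("session-a", oldEntry);

// A second request arrives while the first is still streaming and
// registers a fresh entry under the same session id.
const freshEntry = { ws: "upstream-ws-2", active: true };
registerPersistentSession("session-a", freshEntry);
const retainedAfter = isRetainedForReuse("session-a", oldEntry);

// One possible guard (assumption): never replace an entry that is still active.
function registerIfIdle(sessionId, entry) {
  const existing = persistentSessions.get(sessionId);
  if (existing && existing.active && existing !== entry) return false;
  persistentSessions.set(sessionId, entry);
  return true;
}
const thirdAccepted = registerIfIdle("session-a", { ws: "upstream-ws-3", active: true });
```

    Once `retainedAfter` is false, the old request's cleanup treats its socket as detached, which is the mid-stream close the comment warns about.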

Prompt To Fix All With AI
Fix the following 4 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 4
server.js:466-473
**Non-SSE success path never closes the persistent connection**

After a successful non-SSE JSON response (`!isHttpError`), `settleResponse()` resolves the drain promise and allows the next pending frame to be dispatched — but the WS never closes. This is intentional for persistent mode, however the `ws_turn_completed` telemetry log that exists for the SSE path is absent here, making it harder to correlate non-SSE turns in observability tooling.

### Issue 2 of 4
server.js:452-460
**`status` field in error event is non-standard**

`status: res.statusCode` is appended to the error event frame sent to the client WebSocket. This field is absent from the OpenAI Responses API WebSocket event schema. Strictly-validating clients (including Codex CLI, which this PR specifically targets) may reject or silently drop the event if they deserialize into a typed struct. If this field should stay, a comment explaining the intentional schema extension would prevent it from being removed as a mistake in future cleanup.

### Issue 3 of 4
src/app/v1/_lib/responses-ws/upstream-adapter.ts:641-660
**`armPersistentIdleTimer` called then immediately cancelled**

In `finishRequest`, when `retainedForReuse === true` (session still in map), `armPersistentIdleTimer(persistentEntry)` is called unconditionally. If the caller also passes `forgetSession: true`, the very next statement calls `forgetPersistentSession(sessionId, ws)` which cancels and nullifies the timer that was just armed. The timer fires at 65 minutes so there is no real consequence, but the dead arm+cancel cycle runs on every error-terminal completion. Consider checking `options?.forgetSession` before calling `armPersistentIdleTimer`.

### Issue 4 of 4
src/app/v1/_lib/responses-ws/upstream-adapter.ts:294-310
**Permanent session listeners accumulate on pooled sockets**

In `registerPersistentSession`, permanent `close` and `error` listeners are attached via `ws.on(...)` and are never removed. At 512 sessions × 2 listeners = 1024 permanent listeners accumulate on their respective sockets. Node.js's default `EventEmitter` max-listeners threshold is 10; this could surface `MaxListenersExceededWarning` for sockets that are reused and re-registered (e.g., after reconnect in tests calling `clearResponsesWsSessionsForTests`). Consider using `ws.once("close", ...)` or removing the listeners inside `forgetPersistentSession`.
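
For Issue 4, one way to keep listeners from piling up on a reused socket is to remember the handler references and detach them when the session is forgotten. This is a sketch under assumptions: `registerSessionListeners` and `forgetSessionListeners` are hypothetical names, and a Node.js `EventEmitter` stands in for the upstream WebSocket.

```javascript
const { EventEmitter } = require("node:events");

// Handler references per socket, so they can be removed later.
const sessionListeners = new WeakMap();

function registerSessionListeners(ws, onGone) {
  const handlers = {
    close: () => onGone("close"),
    error: () => onGone("error"),
  };
  ws.on("close", handlers.close);
  ws.on("error", handlers.error);
  sessionListeners.set(ws, handlers);
}

function forgetSessionListeners(ws) {
  const handlers = sessionListeners.get(ws);
  if (!handlers) return;
  ws.off("close", handlers.close);
  ws.off("error", handlers.error);
  sessionListeners.delete(ws);
}

// Register, forget, and register again on the same stand-in socket.
const socket = new EventEmitter();
const seen = [];
registerSessionListeners(socket, (kind) => seen.push(kind));
forgetSessionListeners(socket);
registerSessionListeners(socket, (kind) => seen.push(kind));
socket.emit("close");
```

After the second registration the socket still carries exactly one `close` and one `error` listener, so repeated re-registration does not approach the max-listeners threshold.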

Reviews (20): Last reviewed commit: "ci: restore Codex automation default model"

coderabbitai Bot commented May 3, 2026

Walkthrough

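Introduces and propagates a per-connection Responses WS session ID (x-cch-responses-ws-session), exports isNextDevMode(nodeEnv), and reworks the WebSocket close/cleanup and internal forwarding semantics in server.js. In upstream-adapter, it adds persistent, reusable upstream WS sessions and their cleanup interface. It also adjusts the forwarding blacklist and session extraction, adds or migrates a large number of tests and test utilities, and standardizes the model used in examples and docs on gpt-5.4.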

Changes

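Responses WS edge and internal forwarding (server.js)

Layer / File(s) | Summary
Exports / environment (server.js) | Adds and exports isNextDevMode(nodeEnv), replacing the previous dev-mode check that read process.env.NODE_ENV directly.
Connection-level helpers (server.js) | Generates a responsesWsSessionId (UUID) for each connection; adds connection lifecycle helpers such as abortCurrentInternalReq(), dropPendingFrames(), and cleanupUpstreamWsSession().
Unified close/error paths (server.js) | Refactors finalize() and requestClose(code, reason): on every close/error path they synchronously mark closed, abort/destroy the current internal request, drop pending frames, and run upstream session cleanup.
Frame rejection and queue limits (server.js) | Replaces scattered ws.close(...) calls with requestClose(...) for rejecting non-text/binary frames, queue overflow, queue processing failures, and internal drain errors.
Internal forwarding changes (server.js) | forwardToInternalHttp gains a responsesWsSessionId parameter; when it is present, the x-cch-responses-ws-session header is added while building the upstream request.
Response settlement semantics (server.js) | Adds a responseSettled/settleResponse() guard to avoid double resolution; the non-SSE (non-streaming JSON) path sends a single terminal error/response.completed event and then resolves without starting a close handshake; the SSE path introduces failIfUnsettled(...) and keeps the connection open after a terminal event (no automatic close).
Export changes (server.js) | Adds the isNextDevMode export to module.exports.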

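Upstream WebSocket adapter: persistent sessions and reuse (upstream-adapter.ts)

Layer / File(s) | Summary
Persistent session state (src/app/v1/_lib/responses-ws/upstream-adapter.ts) | Adds the persistentSessions map, idle timers, a maximum number of retained entries, and test hooks (set/clear/get).
Connection fingerprint (src/app/v1/_lib/responses-ws/upstream-adapter.ts) | Adds buildConnectionFingerprint(...) (a SHA-256 over provider/endpoint/upstreamUrl/normalized headers) for compatibility checks.
Lifecycle primitives (src/app/v1/_lib/responses-ws/upstream-adapter.ts) | Introduces primitives such as isWsOpen/isWsClosingOrClosed, closeWs/terminateWs, closeAndForget, and forgetPersistentSession, and exports cleanupResponsesWsSession(sessionId) plus several test interfaces (clearResponsesWsSessionsForTests and others).
Reuse decision (src/app/v1/_lib/responses-ws/upstream-adapter.ts) | tryResponsesWebsocketUpstream accepts optional sessionId/endpointId, reuses the upstream WS when the fingerprint matches and the session is idle and compatible, and returns the new UpstreamWsOutcome.reused flag.
Finalization and retention policy (src/app/v1/_lib/responses-ws/upstream-adapter.ts) | After a terminal event is parsed, a policy decides whether to retain the persistent session or close and forget it; finishRequest() refreshes last-used, arms the idle timer, or performs the close-and-forget.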
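
The connection fingerprint described above (a SHA-256 over provider, endpoint, upstream URL, and normalized headers) could look roughly like this. The parameter names and the exact normalization (lower-cased names, trimmed values, sorted by name) are assumptions for illustration; only the function name and the use of SHA-256 come from the walkthrough.

```javascript
const { createHash } = require("node:crypto");

// Sketch: a stable digest over the inputs that decide whether an upstream
// socket is compatible with a new request. Parameter names are assumed.
function buildConnectionFingerprint({ providerId, endpointId, upstreamUrl, headers }) {
  const normalizedHeaders = Object.entries(headers ?? {})
    .map(([name, value]) => [name.toLowerCase(), String(value).trim()])
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  const payload = JSON.stringify([
    providerId ?? null,
    endpointId ?? null,
    upstreamUrl,
    normalizedHeaders,
  ]);
  return createHash("sha256").update(payload).digest("hex");
}

const fingerprint = buildConnectionFingerprint({
  providerId: 1,
  endpointId: 2,
  upstreamUrl: "wss://example.invalid/v1/responses",
  headers: { Authorization: "Bearer placeholder", "OpenAI-Beta": "example" },
});
```

Hashing a normalized form means header order and casing do not force a new upstream connection, while any change to the URL or credentials does.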

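Forwarding layer and session extraction (forwarder / eligibility / internal-secret)

Layer / File(s) | Summary
Internal header constant (src/app/v1/_lib/responses-ws/internal-secret.ts) | Adds the exported constant RESPONSES_WS_SESSION_HEADER = "x-cch-responses-ws-session" and includes it in RESERVED_INTERNAL_HEADERS (so it is stripped at the WS edge).
Session extraction helper (src/app/v1/_lib/responses-ws/eligibility.ts) | Adds getResponsesWsSessionId(headers): accepts Headers or a plain record, trims and validates the value (non-empty, ≤128 characters, matching /^[\w.-]+$/), and returns string.
Forwarding blacklist and call sites (src/app/v1/_lib/proxy/forwarder.ts) | Merges RESERVED_INTERNAL_HEADERS into OUTBOUND_TRANSPORT_HEADER_BLACKLIST; passes sessionId: getResponsesWsSessionId(session.headers) when calling the upstream WS, and evaluates WS eligibility against the original session.headers.
Unit assertions (tests/unit/proxy/proxy-forwarder.test.ts, src/app/v1/_lib/responses-ws/__tests__/eligibility.test.ts) | New tests: assert that internal tunnel/session/secret headers are stripped from upstream requests; add boundary tests for getResponsesWsSessionId covering trimming, over-length values, and illegal characters.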
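
The validation rules listed for getResponsesWsSessionId (trim, non-empty, at most 128 characters, matching /^[\w.-]+$/) can be sketched as follows. The body is an assumption and not the code in eligibility.ts; in particular, returning `null` for a missing or invalid value is a choice made for this example.

```javascript
const RESPONSES_WS_SESSION_HEADER = "x-cch-responses-ws-session";
const MAX_SESSION_ID_LENGTH = 128;
const SESSION_ID_PATTERN = /^[\w.-]+$/;

// Accepts either a Headers-like object (with .get) or a plain record.
function getResponsesWsSessionId(headers) {
  let raw;
  if (headers && typeof headers.get === "function") {
    raw = headers.get(RESPONSES_WS_SESSION_HEADER);
  } else if (headers) {
    raw = headers[RESPONSES_WS_SESSION_HEADER];
  }
  if (Array.isArray(raw)) raw = raw[0]; // Node may expose repeated headers as arrays
  if (typeof raw !== "string") return null;
  const value = raw.trim();
  if (value.length === 0 || value.length > MAX_SESSION_ID_LENGTH) return null;
  return SESSION_ID_PATTERN.test(value) ? value : null; // assumption: null when invalid
}

const fromRecord = getResponsesWsSessionId({
  "x-cch-responses-ws-session": " 123e4567-e89b-12d3-a456-426614174000 ",
});
```

A UUID passes because hyphens and hexadecimal characters are all inside the allowed character class.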

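E2E and CCH edge harness (tests/e2e/responses-ws-codex-cli-transport.test.ts)

Layer / File(s) | Summary
Local probe server (tests/e2e/responses-ws-codex-cli-transport.test.ts) | Adds an optional local model-provider probe: implements GET /v1/models, POST /v1/responses (SSE), and WS /v1/responses, records HTTP/WS interactions, and supports injected behavior.
CCH edge harness (tests/e2e/...) | Adds an isolated harness that starts and controls server.js: captures internal /v1/responses request bytes and header/session metadata, and supports fault injection (stream cut-off, delay, forced destroy).
Codex CLI probing (tests/e2e/...) | Adds an optional Codex CLI run with assertions: detects the transport type (websocket/SSE), warmup/previous_response_id behavior, and, under fault modes, process exit and internal-request abort signals.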

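Test utilities: Bash script runner (tests/helpers/bash.ts) and migration

Layer / File(s) | Summary
bash helper (tests/helpers/bash.ts) | Adds runBashScript(scriptBody, options): resolves and caches a suitable bash/wsl executable path, builds an isolated shell environment, supports requiredFunctions assertions and setup injection, runs the script, and wraps diagnostics on failure.
Test migration (tests/unit/k8s-cch-update-flow.test.ts, tests/unit/k8s-deploy-shell-helpers.test.ts) | Migrates tests that used execFileSync/bash -lc to runBashScript, unifies setup and requiredFunctions, and sets Vitest timeouts.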

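Docs, localization, and test fixtures: model name and K8s timeout example updates

Layer / File(s) | Summary
K8s / reverse proxy examples (docs/k8s-deployment.md, deploy/k8s/ingress/ingress.yaml) | Raises the Nginx/Ingress timeout from 600 to 3700 (example value/annotation string), and updates the examples to handle Upgrade/Connection explicitly and to raise timeout and connection settings.
usage-doc / workflows / presets (src/app/[locale]/usage-doc/page.tsx, .github/workflows/*, src/lib/provider-testing/presets.ts) | Updates example models and workflow defaults from older values to gpt-5.4; syncs test presets and fixture default models.
Localization and test fixtures (messages/*, tests/**, src/lib/provider-testing/**) | Many localized strings, test fixtures, snapshots, and unit/E2E test inputs now use gpt-5.4 in the model field (text and test data changes only).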

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested reviewers

  • ding113

@coderabbitai coderabbitai Bot requested a review from ding113 May 3, 2026 17:02
@github-actions github-actions Bot added bug Something isn't working area:core javascript Pull requests that update javascript code size/L Large PR (< 1000 lines) labels May 3, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 45a5071873

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server.js
} catch {
// ignore
}
requestClose(1008, "too_many_requests");

P1: Abort active upstream request when closing on overflow

Switching this path from ws.close(...) to requestClose(...) changes teardown semantics: requestClose sets closed = true immediately, so the later ws.on("close", finalize) becomes a no-op and never destroys currentInternalReq. In the overflow/binary-close scenario where one response.create is already in flight, the internal /v1/responses HTTP request now continues running after the client is closed, which can burn provider tokens/work the client can no longer receive. Please ensure the active internal request is explicitly aborted when initiating these protocol/error closes.


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors WebSocket connection handling in server.js by centralizing the closing logic into a requestClose function. It also introduces a new end-to-end test for Codex CLI transport probing and expands unit tests for WebSocket close-handshake scenarios. Feedback highlights a potential resource leak in the new requestClose implementation where active internal requests are not terminated, and a possible TypeError in the unit tests if environment restoration occurs before initialization.

Comment thread server.js
} catch {
// ignore
}
requestClose(1003, "binary_not_supported");

medium

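Calling requestClose here does unify the close paths, but the current requestClose implementation (lines 158-179) carries a resource leak risk: it only sets closed = true and clears the pending queue, and does not actively destroy currentInternalReq the way finalize does. Because finalize returns immediately when closed is true, the in-flight internal HTTP request cannot be aborted promptly and keeps holding upstream connections and resources. Consider adding the currentInternalReq destroy logic to requestClose as well.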

Comment on lines +165 to +167
restoreEnvVar("PORT", originalEnv.PORT);
restoreEnvVar("HOSTNAME", originalEnv.HOSTNAME);
restoreEnvVar("NODE_ENV", originalEnv.NODE_ENV);

medium

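When afterAll restores the environment variables, originalEnv may not have been initialized yet if pickFreePort fails in beforeAll. afterAll would then throw a TypeError and mask the real cause of the test failure. Consider adding a null check before restoring.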

    if (originalEnv) {
      restoreEnvVar("PORT", originalEnv.PORT);
      restoreEnvVar("HOSTNAME", originalEnv.HOSTNAME);
      restoreEnvVar("NODE_ENV", originalEnv.NODE_ENV);
    }
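
A minimal sketch of an environment restore helper that pairs with the guard above. `restoreEnvVar` is named in the diff, but its body here is an assumption. The detail worth showing is that assigning `undefined` to a `process.env` key stores the string "undefined", so an originally unset variable has to be deleted instead.

```javascript
// Assumed body for a helper named restoreEnvVar; not the code from the test file.
function restoreEnvVar(name, originalValue) {
  if (originalValue === undefined) {
    delete process.env[name]; // assignment would store the string "undefined"
  } else {
    process.env[name] = originalValue;
  }
}

// Snapshot at declaration time so cleanup is safe even if setup throws early.
const originalEnv = {
  PORT: process.env.PORT,
  HOSTNAME: process.env.HOSTNAME,
  NODE_ENV: process.env.NODE_ENV,
};

// Example: a variable that was unset before the test ends up unset again.
process.env.CCH_SKETCH_PORT = "12345";
restoreEnvVar("CCH_SKETCH_PORT", undefined);
const unsetAgain = !("CCH_SKETCH_PORT" in process.env);
```

Taking the snapshot where `originalEnv` is declared is the alternative to a null check: cleanup then always has something valid to read.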

@github-actions github-actions Bot left a comment

Code Review Summary

This PR improves WebSocket close handling by centralizing close operations through requestClose() and adds comprehensive test coverage including an E2E Codex CLI transport probe. The changes are well-structured and follow existing patterns in the codebase.

PR Size: L

  • Lines changed: 536 (510 additions + 26 deletions)
  • Files changed: 3

Note: This PR is borderline M/L at 536 lines. Consider splitting future PRs by separating the core fix (server.js changes) from test additions for easier review.

Files Changed

  1. server.js - Refactors WebSocket close paths to use centralized requestClose()
  2. tests/e2e/responses-ws-codex-cli-transport.test.ts - New opt-in E2E test (410 lines)
  3. tests/unit/server-ws-close-handshake.test.ts - Additional unit tests for close scenarios

Review Coverage

  • Logic and correctness - Clean. The refactoring to requestClose() removes code duplication and standardizes error handling paths.
  • Security (OWASP Top 10) - Clean. No security concerns identified in WebSocket handling changes.
  • Error handling - Acceptable. Test files use empty catch patterns consistent with existing codebase conventions.
  • Type safety - Clean. TypeScript types are well-defined in the new E2E test file.
  • Documentation accuracy - Clean. Comments accurately describe the opt-in E2E test behavior.
  • Test coverage - Good. New tests cover binary frame rejection, queue overflow, and transport detection.
  • Code clarity - Good. The requestClose() abstraction improves readability over inline try/catch blocks.

Observations (Non-blocking)

  1. E2E Test Design: The responses-ws-codex-cli-transport.test.ts E2E test is well-designed with:

    • Environment-gated execution (CCH_CODEX_E2E=1)
    • Flexible transport assertion (any/http/websocket)
    • Proper resource cleanup with probe.close()
  2. Test Coverage: New unit tests validate:

    • Binary frame protocol errors (1003 close code)
    • Queue overflow handling (1008 close code)
    • Synchronous queue clearing on overflow
  3. Consistency: The refactoring from inline try/catch { ws.close() } to requestClose() improves maintainability by centralizing the close logic.

No significant issues identified. PR is ready for merge pending any additional reviewer feedback.


Automated review by Claude AI

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@server.js`:
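- Around line 291-292: after requestClose(1008, ...) is called on this close path, closed is
set to true, so finalize() in the ws.on("close") callback returns early and skips
currentInternalReq.destroy(); any in-flight request must be aborted synchronously inside
requestClose. Change requestClose to check for and immediately call
currentInternalReq.destroy() (or a helper responsible for aborting) right before or right
after setting closed = true, and make sure it does not conflict with the logic in finalize()
so that a double destroy is safe (for example, destroy only after confirming
currentInternalReq exists and has not already been destroyed). Reference symbols:
requestClose, closed, ws.on("close"), finalize, currentInternalReq.destroy().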

In `@tests/unit/server-ws-close-handshake.test.ts`:
- Around line 31-35: The variable originalEnv is declared but not initialized,
so if beforeAll throws the afterAll cleanup (which reads
originalEnv.PORT/HOSTNAME/NODE_ENV) can throw a secondary error; initialize
originalEnv at declaration with a safe snapshot (e.g., { PORT: process.env.PORT,
HOSTNAME: process.env.HOSTNAME, NODE_ENV: process.env.NODE_ENV } or explicit
undefined strings) so afterAll can always restore safely, and ensure the
afterAll restoration logic (which reads originalEnv.PORT/ HOSTNAME/ NODE_ENV)
uses those initialized values.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e7ccb61a-b1e4-487a-b3fc-f228a6f6ef2d

📥 Commits

Reviewing files that changed from the base of the PR and between 9948c70 and 45a5071.

📒 Files selected for processing (3)
  • server.js
  • tests/e2e/responses-ws-codex-cli-transport.test.ts
  • tests/unit/server-ws-close-handshake.test.ts

Comment thread server.js
Comment thread tests/unit/server-ws-close-handshake.test.ts
Comment on lines +390 to +396
observedTransport: transport,
events: probe.events,
stderrTail: result.stderr.slice(-2000),
},
null,
2
)

P1 sawCleanWsClose reason check is always false

The ws.on('close', (code, reason)) event on the probe's server-side WebSocket fires with the code/reason from the client's echoed close frame — not from the server's ws.close(1000, "response_completed") call. Codex CLI (tungstenite-rs or similar) will echo back code 1000 with an empty reason string, so event.reason === "response_completed" will never be true. This means sawCleanWsClose is always false, causing expect(sawCleanWsClose).toBe(true) to fail whenever transport actually is "websocket" — precisely the case this assertion is meant to guard.

Suggested change
observedTransport: transport,
events: probe.events,
stderrTail: result.stderr.slice(-2000),
},
null,
2
)
const sawCleanWsClose = probe.events.some(
(event) => event.type === "ws_close" && event.code === 1000
);

Comment on lines +357 to +358
expect(upstreamCalls).toBe(0);
});

P2 Fixed-delay setTimeout makes the assertion timing-sensitive

Waiting a hard-coded 50 ms after the close event to assert upstreamCalls === 0 can fail on an overloaded or slow CI runner where the pipelined text frame is still in-flight through Node's event loop at assertion time. Consider using a deterministic signal — e.g., a server-side close listener — instead of a fixed delay. The same pattern appears in the overflow test (150 ms window).
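
One way to replace a fixed sleep with a deterministic signal is to await the event itself and keep a timeout only as a safety net. The helper name `waitForEvent` and the 2000 ms default are assumptions. `events.once` with an abort signal is a standard Node.js API.

```javascript
const { EventEmitter, once } = require("node:events");

// Resolves with the event arguments when `emitter` fires `eventName`;
// rejects with an AbortError if nothing arrives within `timeoutMs`.
async function waitForEvent(emitter, eventName, timeoutMs = 2000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await once(emitter, eventName, { signal: controller.signal });
  } finally {
    clearTimeout(timer); // never leave the safety timer running
  }
}

// Stand-in for a server-side socket that closes a little later.
const serverSideSocket = new EventEmitter();
setImmediate(() => serverSideSocket.emit("close", 1008, "too_many_requests"));
const closeSeen = waitForEvent(serverSideSocket, "close");
```

In a test, the assertion on `upstreamCalls` would run after `await closeSeen`, so it no longer depends on how fast the runner is.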


@tesgth032

Contributor Author

Addressed this round of review feedback:

  • requestClose() now reuses the close cleanup logic, synchronously aborting currentInternalReq and clearing pending frames, so the internal /v1/responses request no longer keeps running after an overflow / binary close.
  • originalEnv is initialized at its declaration, so an early beforeAll failure no longer makes afterAll throw a second error that masks the root cause.
  • The Codex E2E WebSocket clean-close assertion now checks only close code 1000 and no longer misuses the echoed reason received by the server-side close event.
  • The overflow regression test now waits for the server-side WS close and internal response close signals to confirm that the in-flight request was aborted; the fixed 50ms / 150ms sleeps were removed.

Local verification:

  • npx biome check server.js tests/unit/server-ws-close-handshake.test.ts tests/e2e/responses-ws-codex-cli-transport.test.ts
  • npx vitest run src/app/v1/_lib/responses-ws/__tests__/eligibility.test.ts src/app/v1/_lib/responses-ws/__tests__/internal-secret.test.ts src/app/v1/_lib/responses-ws/__tests__/server-helpers.test.ts src/app/v1/_lib/responses-ws/__tests__/unsupported-cache.test.ts src/app/v1/_lib/responses-ws/__tests__/upstream-adapter.test.ts tests/unit/server-ws-close-handshake.test.ts: 6 files / 52 tests passed
  • npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts: passed (skipped by default)
  • CCH_CODEX_E2E=1 CCH_CODEX_E2E_EXPECT_TRANSPORT=http npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts: passed; the current Codex CLI is still observed using HTTP POST /v1/responses
  • npm run typecheck: passed
  • npm run lint / npm run lint:fix: passed; only pre-existing Biome schema / hook dependency / unused suppression warnings remain, and the files in this PR were not modified
  • npm run build: passed; the pre-existing Edge Runtime / DB fallback warnings remain
  • npm run test: 5843 passed / 13 skipped / 10 failed; the failures are still limited to the pre-existing tests/unit/k8s-cch-update-flow.test.ts and tests/unit/k8s-deploy-shell-helpers.test.ts and are unrelated to this PR's WS changes

@tesgth032

Contributor Author

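Added a model ID cleanup commit: 5b7fe3ac test: unify GPT-5 test models to gpt-5.4

  • Updated the older GPT-5 series models in tests, provider testing probes, API test defaults, and WS/E2E probes to gpt-5.4.
  • Per the latest requirement, the gpt-4.1 / gpt-4o series is left unchanged; the repo-wide rescan excluded the historical CHANGELOG.md, and the current code and copy contain no gpt-5* references below 5.4.
  • Also updated the OpenCode usage docs and multilingual placeholders so current examples no longer point to older GPT-5 series models.

Verification:

  • npx biome check $(git diff --name-only) (equivalent to the set of files changed here) passed.
  • npm run typecheck passed.
  • Affected non-E2E test set: 21 passed / 1 skipped, 252 passed / 2 skipped.
  • npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts passed (opt-in, skipped by default).
  • WS fixed suite: 6 files passed, 52 tests passed.
  • npm run build passed; the pre-existing Edge Runtime / fallback settings warnings appeared, with no new failures.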

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b7fe3acb7

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +209 to +210
await new Promise<void>((resolve) => wss.close(() => resolve()));
await new Promise<void>((resolve) => server.close(() => resolve()));

P2 Badge Terminate probe WS clients before awaiting server close

probe.close() awaits wss.close() directly, but with noServer mode the ws docs state that existing connections are not closed automatically by server.close([callback]). If runCodex() times out and kills the child while a WebSocket is still open, this callback may never fire, so the finally block hangs and the test fails on the outer timeout with a non-diagnostic error. Explicitly terminating wss.clients before awaiting wss.close() avoids this stuck cleanup path.
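
A minimal sketch of the cleanup order suggested above, assuming the probe only needs the `clients` set and `close(cb)` callback that a `ws` server exposes. The `WsServerLike` and `TerminableClient` interfaces and the `closeProbeWss` helper are hypothetical names used for illustration, not code from this PR.

```typescript
// Structural stand-ins so the sketch does not import the `ws` package.
interface TerminableClient {
  terminate(): void;
}

interface WsServerLike {
  clients: Set<TerminableClient>;
  close(cb?: (err?: Error) => void): void;
}

// Hypothetical helper: drop every open client first, then await server close.
async function closeProbeWss(wss: WsServerLike): Promise<void> {
  for (const client of wss.clients) {
    // terminate() tears the socket down without a closing handshake,
    // so a killed or hung peer cannot keep the server close pending.
    client.terminate();
  }
  await new Promise<void>((resolve) => wss.close(() => resolve()));
}
```

With this ordering the `finally` block no longer depends on the child process having closed its socket cleanly.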

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/app/`[locale]/usage-doc/page.tsx:
- Around line 1254-1255: The locale entries droid.configuration.switching.steps
still reference the old model name ("GPT-5-Codex [cch]") while the configuration
example changed model_display_name to "GPT-5.4 [cch]"; update the step text
(step 3) in all five locale files (messages/zh-CN/usage.json,
messages/zh-TW/usage.json, messages/en/usage.json, messages/ja/usage.json,
messages/ru/usage.json) so the string mentions "GPT-5.4 [cch]" (and keeps Sonnet
4.5 [cch] as appropriate) to match the new model_display_name under the
droid.configuration.switching.steps key.

In `@tests/e2e/responses-ws-codex-cli-transport.test.ts`:
- Around line 345-351: The observedTransport function currently treats a
ws_message with frameType === "response.create" as the sole indicator of
WebSocket usage; change it to detect WebSocket transport by checking for probe
events with type "ws_connection" or "ws_upgrade" that target the "/v1/responses"
path (use the event.path or equivalent field) and treat frameType only as
diagnostic info—if such ws_connection/ws_upgrade events exist return
"websocket", else if any event.type === "http_responses" return "http",
otherwise "none"; update references to ProbeEvent properties (type, path,
frameType) in observedTransport so the decision is based on
ws_connection/ws_upgrade targeting "/v1/responses" rather than frameType ===
"response.create".
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7a2ef977-2d5f-45e9-a961-c69e85759b1f

📥 Commits

Reviewing files that changed from the base of the PR and between fdb9460 and 5b7fe3a.

📒 Files selected for processing (49)
  • messages/en/settings/prices.json
  • messages/en/settings/providers/form/modelSelect.json
  • messages/en/usage.json
  • messages/ja/settings/prices.json
  • messages/ja/settings/providers/form/modelSelect.json
  • messages/ja/settings/providers/form/strings.json
  • messages/ja/usage.json
  • messages/ru/settings/prices.json
  • messages/ru/settings/providers/form/modelSelect.json
  • messages/ru/settings/providers/form/strings.json
  • messages/ru/usage.json
  • messages/zh-CN/settings/prices.json
  • messages/zh-CN/settings/providers/form/modelSelect.json
  • messages/zh-CN/settings/providers/form/strings.json
  • messages/zh-CN/usage.json
  • messages/zh-TW/settings/prices.json
  • messages/zh-TW/settings/providers/form/modelSelect.json
  • messages/zh-TW/settings/providers/form/strings.json
  • messages/zh-TW/usage.json
  • src/actions/providers.ts
  • src/app/[locale]/dashboard/sessions/[sessionId]/messages/_components/session-messages-client-actions.test.tsx
  • src/app/[locale]/dashboard/sessions/[sessionId]/messages/_components/session-messages-client.test.tsx
  • src/app/[locale]/settings/providers/_components/forms/api-test-button.tsx
  • src/app/[locale]/usage-doc/page.tsx
  • src/app/v1/_lib/responses-ws/__tests__/server-helpers.test.ts
  • src/app/v1/_lib/responses-ws/__tests__/upstream-adapter.test.ts
  • src/lib/model-vendor-icons.test.ts
  • src/lib/provider-testing/data/cx_base.json
  • src/lib/provider-testing/data/cx_codex_basic.json
  • src/lib/provider-testing/presets.ts
  • src/lib/provider-testing/test-service.test.ts
  • src/lib/provider-testing/utils/test-prompts.ts
  • src/lib/session-manager-detail-snapshots.test.ts
  • tests/api/v1/providers/providers.read.test.ts
  • tests/e2e/responses-ws-codex-cli-transport.test.ts
  • tests/integration/non-chat-endpoint-fallback-observability.test.ts
  • tests/unit/actions/active-sessions-detail-snapshots.test.ts
  • tests/unit/actions/model-prices.test.ts
  • tests/unit/codex/session-completer.test.ts
  • tests/unit/proxy/actual-response-model.test.ts
  • tests/unit/proxy/codex-provider-overrides.test.ts
  • tests/unit/proxy/non-chat-endpoint-fallback.test.ts
  • tests/unit/proxy/non-chat-endpoint-session-context.test.ts
  • tests/unit/proxy/proxy-forwarder-large-chunked-response.test.ts
  • tests/unit/proxy/proxy-forwarder-nonok-body-hang.test.ts
  • tests/unit/proxy/proxy-forwarder-raw-passthrough-regression.test.ts
  • tests/unit/server-ws-close-handshake.test.ts
  • tests/unit/settings/providers/api-test-button.test.tsx
  • tests/unit/usage-doc/opencode-usage-doc.test.tsx
✅ Files skipped from review due to trivial changes (34)
  • src/lib/provider-testing/data/cx_base.json
  • src/lib/provider-testing/data/cx_codex_basic.json
  • tests/integration/non-chat-endpoint-fallback-observability.test.ts
  • messages/zh-CN/settings/providers/form/strings.json
  • src/app/[locale]/settings/providers/_components/forms/api-test-button.tsx
  • tests/unit/settings/providers/api-test-button.test.tsx
  • messages/ja/settings/providers/form/strings.json
  • messages/ja/settings/prices.json
  • messages/en/usage.json
  • messages/ru/settings/providers/form/strings.json
  • messages/zh-TW/settings/prices.json
  • messages/ru/usage.json
  • tests/unit/proxy/non-chat-endpoint-session-context.test.ts
  • src/lib/provider-testing/presets.ts
  • messages/zh-TW/settings/providers/form/strings.json
  • messages/zh-CN/settings/prices.json
  • messages/zh-CN/settings/providers/form/modelSelect.json
  • messages/ru/settings/providers/form/modelSelect.json
  • messages/zh-TW/settings/providers/form/modelSelect.json
  • messages/ru/settings/prices.json
  • messages/en/settings/providers/form/modelSelect.json
  • messages/ja/usage.json
  • tests/unit/proxy/proxy-forwarder-large-chunked-response.test.ts
  • tests/unit/proxy/proxy-forwarder-nonok-body-hang.test.ts
  • src/actions/providers.ts
  • tests/unit/proxy/actual-response-model.test.ts
  • src/lib/model-vendor-icons.test.ts
  • messages/ja/settings/providers/form/modelSelect.json
  • tests/api/v1/providers/providers.read.test.ts
  • src/app/v1/_lib/responses-ws/__tests__/server-helpers.test.ts
  • messages/en/settings/prices.json
  • messages/zh-TW/usage.json
  • src/lib/session-manager-detail-snapshots.test.ts
  • tests/unit/codex/session-completer.test.ts

Comment on lines +1254 to +1255
"model_display_name": "GPT-5.4 [cch]",
"model": "gpt-5.4",

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

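The Droid configuration's model_display_name has been updated to "GPT-5.4 [cch]", but the switching steps in each locale file still point to the old name.

droid.configuration.switching.steps in each locale file (e.g. messages/zh-CN/usage.json line 621) reads "选择 GPT-5-Codex [cch] 或 Sonnet 4.5 [cch]", while this PR changes model_display_name in the configuration example to "GPT-5.4 [cch]". After following the guide to configure Droid, users will see GPT-5.4 [cch] in the actual model list, but the switching steps tell them to look for GPT-5-Codex [cch], which causes confusion.

Recommend updating step 3 of droid.configuration.switching.steps in all 5 locale files (zh-CN, zh-TW, en, ja, ru) accordingly, for example: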

- "选择 GPT-5-Codex [cch] 或 Sonnet 4.5 [cch]",
+ "选择 GPT-5.4 [cch] 或 Sonnet 4.5 [cch]",
Comment thread tests/e2e/responses-ws-codex-cli-transport.test.ts
@tesgth032

Contributor Author

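Addressed the 2 actionable comments from CodeRabbit's latest round in a follow-up commit: 535903cf test: fix WS E2E transport detection and doc model name

  • The multilingual droid.configuration.switching.steps entries have been changed from GPT-5-Codex [cch] to GPT-5.4 [cch], matching model_display_name in the Droid configuration example.
  • In responses-ws-codex-cli-transport.test.ts, observedTransport() now determines WebSocket transport from ws_connection / ws_upgrade events on the /v1/responses path; ws_message.frameType is kept only as diagnostic information and no longer drives the transport decision.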
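
The decision rule in the second bullet can be sketched roughly as follows. The `ProbeEvent` shape here is an assumption: only the `type`, `path`, and `frameType` fields and the event names are taken from the review text, and the query-string handling is an illustrative guess rather than the test's actual code.

```typescript
// Assumed event shape for illustration only.
type ProbeEvent = {
  type: "ws_connection" | "ws_upgrade" | "ws_message" | "http_responses";
  path?: string;
  frameType?: string; // diagnostic only, never used for the decision
};

type Transport = "websocket" | "http" | "none";

function observedTransport(events: ProbeEvent[]): Transport {
  const sawResponsesWs = events.some(
    (e) =>
      (e.type === "ws_connection" || e.type === "ws_upgrade") &&
      // Compare the pathname only, ignoring any query string (assumption).
      (e.path ?? "").split("?")[0] === "/v1/responses",
  );
  if (sawResponsesWs) return "websocket";
  if (events.some((e) => e.type === "http_responses")) return "http";
  return "none";
}
```

A lone `ws_message` with `frameType === "response.create"` therefore yields `"none"`, since no upgrade or connection event on `/v1/responses` was recorded.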

Re-verification:

  • npx biome check messages/en/usage.json messages/zh-CN/usage.json messages/zh-TW/usage.json messages/ja/usage.json messages/ru/usage.json tests/e2e/responses-ws-codex-cli-transport.test.ts passed.
  • npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts passed (opt-in, skipped by default).
  • npx vitest run tests/unit/usage-doc/opencode-usage-doc.test.tsx passed.
  • npm run typecheck passed.
  • Repo-wide scan for lower-version GPT-5 models (excluding the historical CHANGELOG.md): no hits.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/responses-ws-codex-cli-transport.test.ts`:
- Around line 248-256: The Windows branch uses execFileSync("where.exe",
["codex.cmd"]) which throws when codex.cmd is not found so the friendly check
never runs; wrap the execFileSync call in a try/catch (or use spawnSync and test
.status) inside the process.platform === "win32" block, and on error or no
result throw the clearer Error("Cannot find codex.cmd on PATH. Install Codex CLI
or set CCH_CODEX_E2E_BIN.") before returning nodeInvocationForWindowsCmd;
reference execFileSync, codex.cmd, nodeInvocationForWindowsCmd and the
process.platform === "win32" branch to locate where to add the try/catch and
error translation.
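
A minimal sketch of the error translation described above. To stay self-contained it takes the lookup as a parameter; in the real test this would presumably be a call such as `execFileSync("where.exe", ["codex.cmd"], { encoding: "utf8" })`. The `Lookup` type and `findCodexCmdOnWindows` name are hypothetical.

```typescript
// `lookup` stands in for the synchronous where.exe invocation (assumption).
type Lookup = (file: string, args: string[]) => string;

function findCodexCmdOnWindows(lookup: Lookup): string {
  const notFound = () =>
    new Error("Cannot find codex.cmd on PATH. Install Codex CLI or set CCH_CODEX_E2E_BIN.");
  let output: string;
  try {
    output = lookup("where.exe", ["codex.cmd"]);
  } catch {
    // where.exe exits non-zero when nothing matches, which makes a
    // synchronous exec call throw; translate that into the clearer error.
    throw notFound();
  }
  // where.exe prints one match per line; take the first non-empty one.
  const first = output
    .split(/\r?\n/)
    .map((line) => line.trim())
    .find((line) => line.length > 0);
  if (!first) throw notFound();
  return first;
}
```

The resolved path can then be handed to nodeInvocationForWindowsCmd as before.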
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1f79f74d-84ff-4f19-aa30-75f4475b3aef

📥 Commits

Reviewing files that changed from the base of the PR and between 5b7fe3a and 535903c.

📒 Files selected for processing (6)
  • messages/en/usage.json
  • messages/ja/usage.json
  • messages/ru/usage.json
  • messages/zh-CN/usage.json
  • messages/zh-TW/usage.json
  • tests/e2e/responses-ws-codex-cli-transport.test.ts
✅ Files skipped from review due to trivial changes (1)
  • messages/ja/usage.json
🚧 Files skipped from review as they are similar to previous changes (4)
  • messages/ru/usage.json
  • messages/en/usage.json
  • messages/zh-TW/usage.json
  • messages/zh-CN/usage.json

Comment thread tests/e2e/responses-ws-codex-cli-transport.test.ts
@tesgth032

Contributor Author

Addressed CodeRabbit's latest comment in a follow-up commit: 165ce231 test: improve the Codex E2E missing-binary message on Windows

  • On the Windows branch, a failing where.exe codex.cmd is now converted into a clear error: Cannot find codex.cmd on PATH. Install Codex CLI or set CCH_CODEX_E2E_BIN., instead of surfacing the raw execFileSync exception.
  • Once codex.cmd is found, the nodeInvocationForWindowsCmd() compatibility path is still used.

Re-verification:

  • npx biome check tests/e2e/responses-ws-codex-cli-transport.test.ts passed.
  • npx vitest run --config tests/configs/e2e.config.ts tests/e2e/responses-ws-codex-cli-transport.test.ts passed (opt-in, skipped by default).
  • npm run typecheck passed.

@tesgth032

Contributor Author

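This round continues to round out Codex /v1/responses WebSocket support, cross-checking against the official OpenAI documentation, the current implementation at github.com/openai/codex, and the related issues in this repository.

References checked:

Implementation:

  • server.js now generates an x-cch-responses-ws-session for each client WS. After a normal terminal event it no longer closes the client WS, so the same connection can send the next response.create; fatal paths still close the connection, clear the queue, and abort the internal request.
  • responses-ws/upstream-adapter.ts adds upstream WS reuse per client session. The fingerprint covers the provider, endpoint, upstream URL, and filtered headers; the upstream connection is kept after a normal terminal event, while error / websocket_connection_limit_reached / client disconnect close it and clean up.
  • Internal tunnel headers now stay inside CCH: eligibility reads the internal markers from the original session headers, and x-cch-* internal headers are stripped before anything is sent upstream.
  • The Codex CLI E2E probe remains opt-in and now reflects real WS behavior: a single mock WS connection supports multiple response.create frames, and the test asserts a generate:false warmup first, followed by the real request carrying previous_response_id.
  • The Nginx / k8s ingress docs and templates raise the WS timeout to 3700s and add notes on Upgrade, the Connection map, and buffering/request-buffering, in line with the 60-minute limit and common reverse-proxy setups.
  • Old GPT-5 series model defaults in tests/workflows have been updated to gpt-5.4; gpt-4.1 / gpt-4o are unchanged.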
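
As a rough illustration of the header stripping and reuse fingerprint described in the list above: the field names (`providerId`, `endpoint`, `upstreamUrl`) and the JSON-string fingerprint format are assumptions made for this sketch, and the actual adapter may hash or structure these differently.

```typescript
type HeaderMap = Record<string, string>;

// Drop CCH-internal tunnel headers (x-cch-*) before anything goes upstream.
function stripInternalHeaders(headers: HeaderMap): HeaderMap {
  const out: HeaderMap = {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!lower.startsWith("x-cch-")) out[lower] = value;
  }
  return out;
}

// Hypothetical fingerprint: a stable string over provider, endpoint,
// upstream URL, and the filtered headers in sorted order.
function upstreamFingerprint(input: {
  providerId: string;
  endpoint: string;
  upstreamUrl: string;
  headers: HeaderMap;
}): string {
  const filtered = stripInternalHeaders(input.headers);
  const sortedHeaders = Object.keys(filtered)
    .sort()
    .map((name) => [name, filtered[name]]);
  return JSON.stringify([input.providerId, input.endpoint, input.upstreamUrl, sortedHeaders]);
}
```

Two requests on the same client session would reuse the upstream WS only when their fingerprints match; a changed provider, URL, or forwarded header produces a different string and therefore a new connection.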

Verification results:

  • Targeted unit tests: npx vitest run src\app\v1\_lib\responses-ws\__tests__\eligibility.test.ts src\app\v1\_lib\responses-ws\__tests__\internal-secret.test.ts src\app\v1\_lib\responses-ws\__tests__\upstream-adapter.test.ts tests\unit\proxy\proxy-forwarder.test.ts tests\unit\server-ws-close-handshake.test.ts passed, 5 files / 78 tests.
  • E2E: the Codex CLI transport probe with CCH_CODEX_E2E=1 + CCH_CODEX_E2E_EXPECT_TRANSPORT=websocket passed; within a single WS, a generate:false frame was observed first, followed by the real turn carrying previous_response_id:"resp_cch_ws_e2e_1".
  • npx biome check on the target files passed.
  • bun run typecheck passed.
  • bun run lint passed, with only pre-existing warnings.
  • bun run build passed, with only the pre-existing Edge Runtime/Node API warnings and the build-time fallback log for an unconfigured DSN.
  • git diff --check passed.
  • bun run test ran the full suite: 661 files passed / 2 files failed / 2 skipped; all 10 failures are confined to the pre-existing Windows/Git Bash k8s shell helper tests:
    • tests/unit/k8s-cch-update-flow.test.ts
    • tests/unit/k8s-deploy-shell-helpers.test.ts
      The failures are command-resolution errors for generate_random, build_image_ref_with_digest, and the kubectl/k3s stubs in the shell helper source/stub scenarios, unrelated to the WS changes here.

No new WSL2 VM was created: the local opt-in E2E already covers the full Codex CLI -> CCH WS tunnel -> upstream WS mock chain, and since both existing WSL2 VMs are running production workloads, they were left undisturbed this round.

@tesgth032

Contributor Author

@coderabbitai review

@codex review

@coderabbitai

coderabbitai Bot commented May 3, 2026

@tesgth032 I'll kick off a fresh review of the latest changes now, including the persistent WebSocket connection support, upstream session reuse in upstream-adapter.ts, Nginx/k8s documentation updates, and all the related test updates.

[review]

@coderabbitai

coderabbitai Bot commented May 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24c52e2826

Comment thread server.js
Comment on lines +35 to 37
const dev = isNextDevMode(process.env.NODE_ENV);
const hostname = process.env.HOSTNAME || "0.0.0.0";
const port = parseInt(process.env.PORT || (dev ? "13500" : "3000"), 10);

P1 Badge Preserve non-production default port behavior

Changing dev to only NODE_ENV === "development" also changes the default PORT fallback from 13500 to 3000 whenever NODE_ENV is unset or set to values like test/staging. In this repo, npm/bun run start does not set PORT, while docs/tests and helper defaults still target localhost:13500, so this introduces a silent runtime mismatch where the server starts on a different port than the rest of the project expects.
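
A minimal sketch of a dev check that keeps the port fallback this comment describes, assuming the earlier behavior was that any NODE_ENV other than "production" counted as dev. The `resolvePort` helper is a hypothetical name; the defaults 13500 and 3000 come from the quoted server.js lines.

```typescript
// Treat everything except an explicit "production" as dev mode,
// so unset, "test", and "staging" keep the 13500 default.
function isNextDevMode(nodeEnv: string | undefined): boolean {
  return nodeEnv !== "production";
}

// Hypothetical helper mirroring the port line quoted above.
function resolvePort(env: { NODE_ENV?: string; PORT?: string }): number {
  const dev = isNextDevMode(env.NODE_ENV);
  return parseInt(env.PORT || (dev ? "13500" : "3000"), 10);
}
```

An explicit PORT still wins in every mode; only the fallback depends on NODE_ENV.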

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/responses-ws-codex-cli-transport.test.ts`:
- Around line 1399-1402: The test incorrectly assumes the raw ws_message
ordering when detecting warmup/follow-up; instead filter probe.events for
ws_message frames that are actual response.create events and then inspect those
frames' generate and previousResponseId properties to decide warmup vs
follow-up. Specifically, replace uses of wsMessages (from probe.events.filter(e
=> e.type === "ws_message")) with a filtered list of response.create frames
(e.g., filter where event.type === "ws_message" && event.name ===
"response.create" or where payload/type === "response.create"), then use that
list to check generate and previousResponseId; apply the same change to the
similar block around the other occurrence mentioned (1435-1440).
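- Around line 201-227: The current implementation replies to every inbound WS
frame with a full response.created/response.completed sequence, so a non-request
control frame (such as session.update) also incorrectly marks the probe as
successful. Change the logic in ws.on("message", ...): based on the parsed frame
type (variable frameType), increment responseSeq and call
responseEvents(...)/ws.send(...) to write back the mock response only when
frameType === "response.create", while keeping the check for generate (variable
generate) being false to control includeOutput, so that only real
response.create requests receive the mock response.created/response.completed
reply.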
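
Both findings can be sketched together as below. The frame shape, the `resp_mock_N` ids, and the single text-delta event are illustrative assumptions rather than the probe's real payloads; the gating on `response.create` and the `generate === false` check are what the comments ask for.

```typescript
// Assumed inbound frame shape for illustration.
type InboundFrame = {
  type?: string;
  generate?: boolean;
  previous_response_id?: string;
};

// Returns the mock events to send back, or [] for non-request frames.
function mockReplyFor(frame: InboundFrame, state: { responseSeq: number }): string[] {
  if (frame.type !== "response.create") return []; // e.g. session.update gets no reply
  state.responseSeq += 1;
  const id = `resp_mock_${state.responseSeq}`;
  const includeOutput = frame.generate !== false; // warmup frames skip output
  const events = [JSON.stringify({ type: "response.created", response: { id } })];
  if (includeOutput) {
    events.push(JSON.stringify({ type: "response.output_text.delta", delta: "ok" }));
  }
  events.push(JSON.stringify({ type: "response.completed", response: { id } }));
  return events;
}

// Select only response.create frames before checking warmup / follow-up order,
// so interleaved control frames cannot shift the indices.
function responseCreateFrames(frames: InboundFrame[]): InboundFrame[] {
  return frames.filter((f) => f.type === "response.create");
}
```

The warmup and follow-up assertions would then inspect `generate` and `previous_response_id` on the filtered list instead of on every `ws_message` event.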
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ec803e56-296e-4c11-8ca0-afe9985e2a0e

📥 Commits

Reviewing files that changed from the base of the PR and between 734751b and 24c52e2.

📒 Files selected for processing (1)
  • tests/e2e/responses-ws-codex-cli-transport.test.ts

Comment thread tests/e2e/responses-ws-codex-cli-transport.test.ts
Comment thread tests/e2e/responses-ws-codex-cli-transport.test.ts Outdated
Comment thread server.js
Comment on lines 26 to +32
const { randomUUID } = require("node:crypto");
const { parse } = require("node:url");

const dev = process.env.NODE_ENV !== "production";
function isNextDevMode(nodeEnv) {
return nodeEnv === "development";
}

P1 Behavioral regression: dev is now false when NODE_ENV is unset or "test"

The old guard was process.env.NODE_ENV !== "production", so any value other than "production" (including undefined, "test", "staging", etc.) enabled dev mode (Next compiler, port 13500). The new guard nodeEnv === "development" silently flips those envs to production mode (port 3000, expects built artifacts).

Running node server.js without setting NODE_ENV used to serve the dev HMR server on 13500; it now tries to serve a prebuilt bundle on 3000. Any deployment that relied on the old default — Docker images that don't set NODE_ENV=production, CI run scripts, or local npm start — will either fail to start (missing .next build output) or unexpectedly move to port 3000.

@tesgth032

Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Comment on lines +362 to +368
persistentEntry.lastUsedAt = Date.now();
if (persistentEntry.idleTimer) {
clearTimeout(persistentEntry.idleTimer);
persistentEntry.idleTimer = null;
}
ws = existing.ws;
reused = true;

P1 Active-session close in else branch can abort an in-flight upstream request

The else branch fires for three distinct cases: fingerprint mismatch, WS already closing/closed, and existing.active === true. The first two are safe to close eagerly; the third is not. When existing.active is true a request is mid-flight on that upstream WS, and closePersistentEntry(existing, 1000) will trigger onClose → socketClosed = true → resolve(null) in the message-loop, causing the in-progress SSE stream to end with stream_ended_without_terminal and a 1011 close to the client.

The sequential drain queue normally prevents the same sessionId from being active twice, but the condition can be reached if an abort signal or a prior forgetPersistentSession call lands between active = true being set and finishRequest resetting it. Splitting the condition makes the invariant explicit and prevents accidentally closing in-flight work.

// suggested split:
if (
  existing.fingerprint === fingerprint &&
  !existing.active &&
  !isWsClosingOrClosed(existing.ws)
) {
  // reuse
} else if (existing.active) {
  // Safety: drain queue guarantees sequential execution per session,
  // so active===true here indicates a programming error. Log and fall
  // through to creating a fresh connection rather than closing in-flight work.
  logger.warn("[ResponsesWsAdapter] unexpected active session on new request – creating fresh", { sessionId });
  forgetPersistentSession(sessionId, existing.ws); // unregister without closing
} else {
  closePersistentEntry(existing, 1000);
}
@tesgth032

Contributor Author

Follow-up on this round's review fixes and final verification:

  • Addressed the valid Greptile/Codex comment on the dev check in server.js: isNextDevMode restores the previous NODE_ENV !== "production" semantics, so unset, test, staging, and similar environments no longer unexpectedly fall into production mode on port 3000.
  • Addressed the valid CodeRabbit comments on the Codex WS probe: the probe server now writes back mock response.* events only for inbound response.create request frames, and the warmup/follow-up assertions are likewise based only on response.create frames, so interleaved control frames such as session.update no longer cause misdetection.
  • Compatibility preserved: observedTransport still identifies the transport from the WS upgrade/connection on /v1/responses; frameType is used only for request-semantics assertions and diagnostics.

Local verification:

  • npx biome check server.js src\app\v1\_lib\responses-ws\__tests__\server-helpers.test.ts tests\e2e\responses-ws-codex-cli-transport.test.ts
  • npx vitest run src\app\v1\_lib\responses-ws\__tests__\server-helpers.test.ts tests\e2e\responses-ws-codex-cli-transport.test.ts
  • npx vitest run --config tests\configs\e2e.config.ts tests\e2e\responses-ws-codex-cli-transport.test.ts
  • npx vitest run src\app\v1\_lib\responses-ws\__tests__\server-helpers.test.ts tests\unit\server-ws-close-handshake.test.ts
  • npx tsgo -p tsconfig.json --noEmit
  • bun run lint:fix
  • bun run lint
  • bun run typecheck
  • bun run build
  • bun run test: 663 files passed / 5864 tests passed / 13 skipped
  • Real Codex CLI + WS fault E2E:
    CCH_CODEX_E2E=1 CCH_CODEX_E2E_EXPECT_TRANSPORT=websocket CCH_CODEX_E2E_FAULTS=1 npx vitest run --config tests\configs\e2e.config.ts tests\e2e\responses-ws-codex-cli-transport.test.ts, 16/16 passed

Remote checks passed: Code Quality Check, Docker Build Test, Unit Tests, API Tests, Integration Tests, Test Summary, CodeRabbit, and Greptile Review are all green.

@tesgth032

Contributor Author

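This round addresses the one Greptile P1 suggestion from the AI reviews that is directly related to this PR's changes and worth acting on:

  • When an upstream WS in the same Codex/Responses session is already active, a subsequent concurrent request no longer closes that active connection; it opens a fresh upstream WS instead, so the in-flight stream is not interrupted.
  • Added cleanup for the detached old entry that a new request has replaced: the old stream closes itself once it naturally receives its terminal event, without deleting the new connection kept in the map and without leaking an old connection that has no idle timer.
  • Added a regression test for concurrent active connections in the same session, confirming that the second request does not close the first and that the first closes normally with 1000 after its terminal event.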
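
The detached-entry behavior in the list above can be modeled with a small in-memory sketch. `SessionPool`, `UpstreamEntry`, and the `closedWith` field are hypothetical stand-ins for the adapter's real map and sockets, and fingerprint checks and idle timers are left out for brevity.

```typescript
interface UpstreamEntry {
  id: number;
  active: boolean;
  closedWith: number | null; // close code once closed, else null
}

class SessionPool {
  private readonly sessions = new Map<string, UpstreamEntry>();
  private nextId = 1;

  // Acquire an upstream entry for a new request on `sessionId`.
  acquire(sessionId: string): UpstreamEntry {
    const existing = this.sessions.get(sessionId);
    if (existing && !existing.active && existing.closedWith === null) {
      existing.active = true; // idle and open: reuse it
      return existing;
    }
    // Nothing reusable. If the old entry is mid-flight it is NOT closed here;
    // replacing the map slot simply leaves it detached.
    const fresh: UpstreamEntry = { id: this.nextId++, active: true, closedWith: null };
    this.sessions.set(sessionId, fresh);
    return fresh;
  }

  // Called when the request running on `entry` reaches a terminal event.
  finish(sessionId: string, entry: UpstreamEntry): void {
    entry.active = false;
    if (this.sessions.get(sessionId) !== entry) {
      // Detached: close itself with 1000 and leave the map's entry alone.
      entry.closedWith = 1000;
    }
  }

  current(sessionId: string): UpstreamEntry | undefined {
    return this.sessions.get(sessionId);
  }
}
```

The identity check in `finish` is what prevents the old stream from deleting or closing the connection that replaced it.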

Re-verified:

  • npx biome check src\app\v1\_lib\responses-ws\upstream-adapter.ts src\app\v1\_lib\responses-ws\__tests__\upstream-adapter.test.ts
  • npx vitest run src\app\v1\_lib\responses-ws\__tests__\upstream-adapter.test.ts
  • npx vitest run --config tests\configs\e2e.config.ts tests\e2e\responses-ws-codex-cli-transport.test.ts
  • npx vitest run src\app\v1\_lib\responses-ws\__tests__\upstream-adapter.test.ts src\app\v1\_lib\responses-ws\__tests__\server-helpers.test.ts tests\unit\server-ws-close-handshake.test.ts
  • bun run lint
  • bun run typecheck
  • CCH_CODEX_E2E=1 CCH_CODEX_E2E_EXPECT_TRANSPORT=websocket CCH_CODEX_E2E_FAULTS=1 npx vitest run --config tests\configs\e2e.config.ts tests\e2e\responses-ws-codex-cli-transport.test.ts
  • bun run build

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 3, 2026

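@tesgth032, the fix is clear: concurrent requests no longer close the active upstream WS and open a new connection instead; the old connection cleans up after its natural terminal event without deleting the new entry in the map and without leaking an idle timer. The core race condition from the Greptile P1 is handled correctly. Triggering a new round of review.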

[review]

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db67f8293e

return persistentSessions.size;
}

globalThis.__cchCleanupResponsesWsSession = cleanupResponsesWsSession;

P2 Badge Keep session cleanup hook stable across module reloads

Assigning globalThis.__cchCleanupResponsesWsSession directly to this module’s closure means any hot-reload/re-import swaps the hook to a new persistentSessions map, while sessions created by the previous module instance become unreachable by server.js cleanup calls. In Next.js dev/test reload scenarios, client disconnects will no longer close those older upstream sockets immediately, so they linger until the 65-minute idle timer, which can accumulate leaked upstream WS connections.
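
One possible way to keep the hook stable, sketched under the assumption that the session map itself can live on `globalThis` so every module instance shares it. `__cchResponsesWsSessions` and the `SessionCloser` type are hypothetical names; only `__cchCleanupResponsesWsSession` appears in the PR.

```typescript
type SessionCloser = () => void;

type ResponsesWsGlobals = {
  __cchResponsesWsSessions?: Map<string, SessionCloser>; // hypothetical key
  __cchCleanupResponsesWsSession?: (sessionId: string) => boolean;
};

const g = globalThis as typeof globalThis & ResponsesWsGlobals;

// Reuse the map created by an earlier module instance, if any, so sessions
// opened before a reload stay reachable by the cleanup hook.
const persistentSessions: Map<string, SessionCloser> =
  g.__cchResponsesWsSessions ?? (g.__cchResponsesWsSessions = new Map());

function cleanupResponsesWsSession(sessionId: string): boolean {
  const close = persistentSessions.get(sessionId);
  if (!close) return false;
  persistentSessions.delete(sessionId);
  close();
  return true;
}

// Reassigning the hook on reload is now harmless: old and new closures
// both operate on the same shared map.
g.__cchCleanupResponsesWsSession = cleanupResponsesWsSession;
```

This trades module-level encapsulation for reload safety; whether that fits the adapter depends on how its entries hold sockets and timers.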

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29d254121b

Comment on lines +613 to +625
async function pickFreePort(): Promise<number> {
return new Promise((resolve, reject) => {
const probe = http.createServer();
probe.listen(0, "127.0.0.1", () => {
const address = probe.address();
if (!address || typeof address !== "object") {
probe.close();
reject(new Error("failed to allocate local port"));
return;
}
const port = address.port;
probe.close(() => resolve(port));
});

P2: Avoid TOCTOU when choosing test server ports

pickFreePort() binds an ephemeral port, closes the probe server, and later reuses that numeric port for startIsolatedCchEdgeHarness/fault harness startup. Another process (or another test worker) can claim the port in the gap, causing intermittent EADDRINUSE failures that make this E2E suite flaky under concurrent CI load. Allocate with listen(0) on the actual server you are about to use (or keep the probe socket open until handoff) so reservation and use are atomic.
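A minimal sketch of the first suggestion, assuming the harness can hand its own `http.Server` to a helper: bind the real server to port 0 and read the assigned port back, so there is no gap between reserving and using the port. `listenOnEphemeralPort` is a hypothetical name; how `startIsolatedCchEdgeHarness` creates its server is not shown in this thread.

```typescript
import * as http from "node:http";

// Hypothetical helper: binds the server that will actually serve traffic
// to an OS-assigned port and resolves with that port number.
function listenOnEphemeralPort(
  server: http.Server,
  host: string = "127.0.0.1"
): Promise<number> {
  return new Promise<number>((resolve, reject) => {
    // Surface bind failures instead of hanging forever.
    server.once("error", reject);
    server.listen(0, host, () => {
      server.off("error", reject);
      const address = server.address();
      if (!address || typeof address === "string") {
        reject(new Error("failed to read bound address"));
        return;
      }
      // The socket stays open, so no other process can claim this port.
      resolve(address.port);
    });
  });
}
```

The caller then builds URLs from the returned port instead of passing a pre-picked number into the server startup.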

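@tesgth032 tesgth032 changed the title from "fix(ws): complete the close paths and keep the Codex E2E probe [unfinished]" to "fix(ws): complete the close paths and keep the Codex E2E probe" May 3, 2026
@tesgth032 tesgth032 changed the title from "fix(ws): complete the close paths and keep the Codex E2E probe" to "fix(ws): improve Responses WebSocket session lifecycle and test coverage" May 3, 2026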
Comment on lines +475 to +481
    await withTimeout(
      serverConnection.close,
      3000,
      "server WebSocket did not close after binary protocol close"
    );
    expect(upstreamCalls).toBe(0);
  });

P1 Race condition in binary-close test: upstreamCalls === 0 asserted without synchronizing on frame processing

The test awaits serverConnection.close to confirm the server-side WebSocket closed, then immediately checks upstreamCalls === 0. However, serverConnection.close resolves as soon as the ws.on('close') event fires — this happens before handleWebSocketConnection has necessarily finished draining its pending queue. The pipelined text frame is still inside the message event handler's void drain().catch(...) microtask at this point, so upstreamCalls could be 0 for the wrong reason (drain hasn't run yet), and the assertion is not verifying the intended invariant. A deterministic synchronization point (e.g., a server-side signal after drain settles) is needed.
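As an illustration of what such a synchronization point could look like, here is a small sketch of a frame queue that exposes a promise settling once draining has finished. `FrameQueue`, `whenIdle`, and the handler signature are hypothetical and are not the actual `handleWebSocketConnection` internals; only the idea of a synchronous `requestClose()` that sets a closed flag and clears the pending queue comes from the PR description.

```typescript
// Hypothetical queue: frames are handled serially, and close is synchronous.
class FrameQueue {
  private pending: string[] = [];
  private closed = false;
  private draining: Promise<void> = Promise.resolve();

  constructor(private readonly handle: (frame: string) => Promise<void>) {}

  push(frame: string): void {
    if (this.closed) return; // frames arriving after close are dropped
    this.pending.push(frame);
    // Chain each drain so whenIdle() covers all work scheduled so far.
    this.draining = this.draining
      .then(() => this.drain())
      .catch(() => this.requestClose()); // treat a handler failure as a close
  }

  // Synchronously mark closed and drop anything still queued.
  requestClose(): void {
    this.closed = true;
    this.pending.length = 0;
  }

  // Deterministic sync point: settles after every scheduled drain has run.
  whenIdle(): Promise<void> {
    return this.draining;
  }

  private async drain(): Promise<void> {
    while (!this.closed && this.pending.length > 0) {
      const frame = this.pending.shift() as string;
      await this.handle(frame);
    }
  }
}
```

A test built on this would await `whenIdle()` before asserting on the upstream call count, so a count of zero means the drain ran and dropped the frame, not that the drain had yet to run.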

@ding113 ding113 merged commit 6cf7e44 into ding113:dev May 4, 2026
9 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Claude Code Hub Roadmap May 4, 2026
@tesgth032 tesgth032 deleted the fix/responses-ws-close-paths-e2e-20260504 branch June 7, 2026 09:53

Labels

area:core, bug, javascript, size/L

Projects

Status: Done
