fix(chat): survive socket reconnects — thread-key session/cancel + thread-room stream#2493

Merged
senamakel merged 3 commits into tinyhumansai:main from sanil-23:fix/thread-keyed-session-and-stream on May 22, 2026

Conversation

@sanil-23 sanil-23 commented May 22, 2026

Summary

  • The web channel keyed durable, thread-scoped runtime state by the ephemeral per-socket client_id instead of the stable thread_id. Since client_id is regenerated on every Socket.IO reconnect, a reconnect silently orphaned that state.
  • One root cause, three symptoms: conversation amnesia, a dead Cancel button, and the in-flight stream vanishing after a reconnect.
  • Fix: key the session + in-flight/cancel registry by thread_id, and mirror the chat stream to a per-thread room (with a thread:subscribe re-subscribe on (re)connect). Event delivery still uses client_id.

Problem

key_for(client_id, thread_id) produced "{client_id}::{thread_id}", and that was the key for both THREAD_SESSIONS (the in-memory conversation/transcript cache) and IN_FLIGHT (the cancel registry). The chat stream was likewise emitted only to room = client_id (emit_web_channel_event).

client_id is the per-connection engine.io id — a new one on every reconnect. So when a webview's socket dropped and reconnected mid-session:

  1. Amnesia: THREAD_SESSIONS.get(newClientId::thread) missed, so the turn started with history_len=0 (the on-disk transcript is keyed by thread_id, so the durable record was intact; only the in-memory lookup failed).
  2. Dead Cancel: cancel_chat looked up IN_FLIGHT under the new client_id, found nothing, returned Ok (logged -> ok), and aborted nothing; the turn ran on to the iteration cap.
  3. Lost stream: the in-flight turn kept emitting to the old client_id room, which the reconnected socket was no longer in.

The persistence layer already keys by thread_id (turn-state snapshots, transcripts), so the runtime was simply inconsistent with its own durable identity.
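
The keying mismatch described above can be modelled in a few lines. This is a minimal TypeScript sketch, not the project's Rust code: `legacyKeyFor`, `historyLenAfterReconnect`, the `Session` shape, and the sample ids are hypothetical names introduced for illustration; only the `"{client_id}::{thread_id}"` format and the thread-only key come from the PR text.

```typescript
// Hypothetical model of the two keying schemes (illustration only).
type Session = { history: string[] };

// Old scheme: the key embeds the per-connection client_id.
function legacyKeyFor(clientId: string, threadId: string): string {
  return `${clientId}::${threadId}`;
}

// New scheme: the key is the stable thread_id alone.
function keyFor(threadId: string): string {
  return threadId;
}

// Returns the history length a turn would start with after a reconnect.
function historyLenAfterReconnect(useThreadKey: boolean): number {
  const sessions = new Map<string, Session>();
  const threadId = "thread-abc";
  const oldClient = "client-1";
  const newClient = "client-2"; // regenerated by the reconnect

  // The session is cached while the first socket is connected.
  const writeKey = useThreadKey ? keyFor(threadId) : legacyKeyFor(oldClient, threadId);
  sessions.set(writeKey, { history: ["hello", "hi there"] });

  // The next turn looks the session up from the reconnected socket.
  const readKey = useThreadKey ? keyFor(threadId) : legacyKeyFor(newClient, threadId);
  return sessions.get(readKey)?.history.length ?? 0;
}
```

Under these assumptions the composite key misses after the reconnect and the thread-only key hits, which is the same shape as the amnesia and dead-Cancel symptoms: both registries were looked up with a key the new socket could not reproduce.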

Solution

  • key_for(thread_id): THREAD_SESSIONS and IN_FLIGHT now key by thread_id alone. A reconnect (new client_id) resolves to the same session (cache hit, history retained) and the same in-flight handle (Cancel works). Event delivery still routes by client_id.
  • Per-thread stream room: emit_web_channel_event now emits to both the client_id room and thread:<id>. A new thread:subscribe socket event joins thread:<id>; the frontend (socketService) re-emits it for the active thread on every connect. As a result, a reconnected socket re-joins the thread room and the in-flight stream re-attaches. socket.io de-duplicates a socket present in multiple target rooms, so no double-render.

Verified live (with socketio debug logging on): after a mid-turn reload, the new socket logged joined room 'thread:<id>' and the turn's stream events emitted to rooms=["<newClientId>", "thread:<id>"] and were received; cache hits retained history_len; and Cancel aborted the in-flight turn across a reconnect (vs. the prior no-op).
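
The dual-room delivery can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the server implementation: the `Rooms` class, `targetRooms`, and `demoReconnect` are hypothetical, and the de-duplication is modelled as a set union over room members, which matches the behaviour the PR attributes to socket.io.

```typescript
// Hypothetical in-memory model of room-based delivery (illustration only).
type SocketId = string;

class Rooms {
  private members = new Map<string, Set<SocketId>>();

  join(room: string, socket: SocketId): void {
    if (!this.members.has(room)) this.members.set(room, new Set<SocketId>());
    this.members.get(room)!.add(socket);
  }

  // Union of sockets across all target rooms: each socket appears once.
  recipients(rooms: string[]): SocketId[] {
    const out = new Set<SocketId>();
    for (const room of rooms) {
      this.members.get(room)?.forEach((s) => out.add(s));
    }
    return Array.from(out);
  }
}

// Client room plus an optional per-thread room, as described in the PR.
function targetRooms(clientId: string, threadId?: string): string[] {
  const rooms = [clientId];
  const trimmed = threadId?.trim();
  if (trimmed) rooms.push(`thread:${trimmed}`);
  return rooms;
}

// A turn started by "client-1" keeps emitting after that socket is gone.
function demoReconnect(): { before: SocketId[]; after: SocketId[] } {
  const rooms = new Rooms();
  rooms.join("client-2", "client-2"); // the reconnected socket's own room
  const before = rooms.recipients(targetRooms("client-1", "thread-abc"));
  rooms.join("thread:thread-abc", "client-2"); // effect of thread:subscribe
  const after = rooms.recipients(targetRooms("client-1", "thread-abc"));
  return { before, after };
}
```

In this model the reconnected socket receives nothing until it joins the thread room, and a socket that sits in both its client room and the thread room is still delivered to once.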

Submission Checklist

  • Tests added or updated — key_for_is_thread_scoped_not_client_scoped (regression guard, replaces key_for_combines_client_id_and_thread_id) and two socketService.events.test.ts cases covering connect → thread:subscribe (active thread emits; no active thread does not). Existing cancel_chat tests still pass.
  • Diff coverage ≥ 80% — the Coverage Gate (diff-cover ≥ 80%) check passes on this PR.
  • Coverage matrix updated — N/A: behaviour-only bug fix.
  • No new external network dependencies — N/A: none.
  • Manual smoke checklist — N/A: no release-cut surface beyond chat reconnect (covered above).
  • Linked issue closed — N/A: no tracking issue filed.

Impact

  • Desktop/web chat: any socket reconnect mid-conversation previously wiped context, broke Cancel, and dropped the live stream. This makes reconnects transparent. No migration; no security/compat implications. Tool dispatch and event delivery are unchanged.
  • Note: the frontend half (socketService.ts re-subscribe) requires an app bundle rebuild to take effect; the backend halves are independent.

Related

  • Closes: N/A (no issue filed)
  • Follow-ups: (1) optional unit test for the per-thread room emit; (2) separate — the frontend's reconnectionAttempts is low, so a long drop needs a manual reload to reconnect (not addressed here).

AI Authored PR Metadata

Linear Issue

  • Key: N/A
  • URL: N/A

Commit & Branch

  • Branch: fix/thread-keyed-session-and-stream
  • Commit SHAs: ce0f1c8f (session/cancel keying), 95c314da (thread-room stream)

Validation Run

  • pnpm --filter openhuman-app format:check — frontend change is a small socketService.ts addition; typechecked on the equivalent source.
  • pnpm typecheck — passed on the equivalent source (socketService.ts).
  • Focused tests: key_for_is_thread_scoped_not_client_scoped, cancel_chat_validates_required_fields pass.
  • Rust fmt/check: cargo fmt clean; the identical code builds in release (cargo build --release --bin openhuman-core) and was run live, confirming the fixes.
  • Tauri fmt/check: N/A: no src-tauri changes.

Behavior Changes

  • Intended: chat session, cancel, and live stream are keyed/routed by thread_id, surviving socket reconnects.
  • User-visible: reconnect mid-chat no longer forgets history, Cancel works after a reconnect, and the in-flight stream re-attaches.

Parity Contract

  • Legacy behavior preserved: event delivery still uses client_id; the per-thread room is additive (dual-emit), so non-reconnect streaming is unchanged.
  • On-disk persistence (already thread_id-keyed) is now matched by the in-memory runtime.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none
  • Canonical PR: this
  • Resolution: N/A

🤖 Generated with Claude Code

@sanil-23 sanil-23 requested a review from a team May 22, 2026 14:11

coderabbitai Bot commented May 22, 2026

Walkthrough

Frontend socket service re-subscribes to active threads on reconnect. Backend adds a thread:subscribe handler, emits events to multiple rooms (client and thread), and keys session/task caches by thread_id so sessions survive Socket.IO reconnects.

Changes

Thread-stable session persistence across socket reconnects

| Layer / File(s) | Summary |
| --- | --- |
| Thread subscription protocol: `app/src/services/socketService.ts`, `app/src/services/__tests__/socketService.events.test.ts`, `src/core/socketio.rs` | Frontend emits `thread:subscribe` on socket connect with the active thread ID; tests mock `getState()` to assert behavior. Backend adds `ThreadSubscribePayload` and an authenticated handler that trims/validates `thread_id` and joins `thread:<id>` rooms. |
| Multi-room event emission: `src/core/socketio.rs` | `emit_web_channel_event` now builds a list of target rooms (client room plus optional `thread:<id>`) and uses `emit_rooms_with_aliases` to broadcast the event (and any alias) to all rooms in one call. |
| Thread-keyed session and task management: `src/openhuman/channels/providers/web.rs`, `src/openhuman/channels/providers/web_tests.rs` | Introduces `key_for(thread_id)` and switches `start_chat`, `cancel_chat`, and `run_chat_task` to use thread-only keys so cached agent sessions and abortable in-flight handles persist across reconnects. Tests verify key derivation is thread-scoped and stable. |
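
The "trims/validates thread_id and joins thread:<id>" step might look roughly like the following. This is a hedged TypeScript sketch of the validation only: the real `ThreadSubscribePayload` is a Rust struct whose fields are not shown in this PR page, authentication is omitted, and `roomForSubscribe` is a hypothetical helper name.

```typescript
// Hypothetical payload shape; the real ThreadSubscribePayload lives in Rust.
interface ThreadSubscribePayload {
  thread_id?: unknown;
}

// Returns the room to join, or null when the payload is unusable.
function roomForSubscribe(payload: ThreadSubscribePayload): string | null {
  // Reject missing or non-string ids rather than coercing them.
  if (typeof payload.thread_id !== "string") return null;
  // Trim so that whitespace-only ids never create a room.
  const threadId = payload.thread_id.trim();
  return threadId.length > 0 ? `thread:${threadId}` : null;
}
```

Returning `null` instead of throwing keeps a malformed subscribe from affecting the rest of the connection; whether the actual handler ignores or reports such payloads is not stated in the source.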

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client Socket
  participant SocketIO as Socket.IO Server
  participant SessionCache as Session & Task Cache

  Note over Client,SocketIO: Initial connection
  Client->>SocketIO: connect (client_id_1)
  Client->>SocketIO: emit 'thread:subscribe' {thread_id: 'ABC'}
  SocketIO->>SessionCache: lookup/ensure key_for('ABC')
  SessionCache-->>SocketIO: cached session/task handles
  SocketIO->>SocketIO: join room 'thread:ABC'

  Note over Client,SocketIO: Disconnect and reconnect
  Client->>SocketIO: connect (client_id_2)
  Client->>SocketIO: emit 'thread:subscribe' {thread_id: 'ABC'}
  SocketIO->>SessionCache: lookup key_for('ABC')
  SessionCache-->>SocketIO: same cached session/task handles
  SocketIO->>SocketIO: join room 'thread:ABC'

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hop where threads persist and stay,
Through reconnects, I find my way;
No vanished tasks, no scattered roam,
Thread keys keep every session home. 🥕✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(chat): survive socket reconnects — thread-key session/cancel + thread-room stream' directly and precisely summarizes the main objective: fixing a reconnection bug by changing from client+thread keying to thread-only keying and adding thread-room subscription logic. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/channels/providers/web.rs`:
- Around line 134-145: invalidate_thread_sessions is still deleting entries
using the old "::<thread_id>" suffix while key_for(thread_id) now returns just
thread_id, so removals no longer match; update invalidate_thread_sessions to
remove entries from THREAD_SESSIONS and IN_FLIGHT using key_for(thread_id)
(i.e., the plain thread_id) instead of the legacy suffix, or delete both
patterns for backward compatibility (call key_for(thread_id) and also attempt
the old format) so thread session and in-flight handles are actually cleared.
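
The "delete both patterns" option from the comment above can be sketched as follows. This is a TypeScript illustration of the idea, not the Rust fix: `invalidateThread` is a hypothetical helper, and a plain `Map` stands in for THREAD_SESSIONS and IN_FLIGHT.

```typescript
// Removes entries stored under the plain thread_id key and, for backward
// compatibility, any legacy "<client_id>::<thread_id>" keys.
// Returns the number of entries removed.
function invalidateThread<V>(store: Map<string, V>, threadId: string): number {
  const legacySuffix = `::${threadId}`;
  let removed = 0;
  // Copy the keys first so deleting while iterating is safe.
  for (const key of Array.from(store.keys())) {
    if (key === threadId || key.endsWith(legacySuffix)) {
      store.delete(key);
      removed += 1;
    }
  }
  return removed;
}
```

Matching on the `::` separator rather than on the bare id avoids removing a different thread whose id merely ends with the same characters.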

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e1771923-32e8-41a6-8020-374429ab4382

📥 Commits

Reviewing files that changed from the base of the PR and between ed3e453 and 95c314d.

📒 Files selected for processing (4)
  • app/src/services/socketService.ts
  • src/core/socketio.rs
  • src/openhuman/channels/providers/web.rs
  • src/openhuman/channels/providers/web_tests.rs

Comment thread src/openhuman/channels/providers/web.rs
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
app/src/services/__tests__/socketService.events.test.ts (2)

125-145: ⚡ Quick win

Add test coverage for the activeThreadId fallback path.

The production code falls back to activeThreadId when selectedThreadId is absent:

const activeThreadId = threadState?.selectedThreadId ?? threadState?.activeThreadId;

This test only verifies the selectedThreadId path. Add a companion test case where selectedThreadId is null but activeThreadId is set to ensure the fallback logic works correctly.

🧪 Suggested test case for activeThreadId fallback
it('re-subscribes using activeThreadId when selectedThreadId is null', async () => {
  const { handlers, mockSocket } = buildMockSocket();

  vi.doMock('socket.io-client', () => ({ io: vi.fn(() => mockSocket) }));
  getCoreRpcUrlMock.mockResolvedValue('http://127.0.0.1:7788/rpc');
  storeMock.getState.mockReturnValue({
    thread: { selectedThreadId: null, activeThreadId: 'thread-abc' },
  });

  const { socketService } = await import('../socketService');
  socketService.connect('jwt-test-thread-sub-fallback');

  await pollUntil(() => expect(handlers['connect']).toBeDefined());

  handlers['connect']!();

  expect((mockSocket as { emit: ReturnType<typeof vi.fn> }).emit).toHaveBeenCalledWith(
    'thread:subscribe',
    { thread_id: 'thread-abc' }
  );
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/services/__tests__/socketService.events.test.ts` around lines 125 -
145, The existing test only covers the selectedThreadId path; add a new test
that mocks storeMock.getState to return thread: { selectedThreadId: null,
activeThreadId: 'thread-abc' } and otherwise mirrors the original test flow so
that after importing socketService and calling socketService.connect(...) the
connect handler triggers and the mock socket's emit is asserted to have been
called with 'thread:subscribe' and { thread_id: 'thread-abc' } (use same
helpers: buildMockSocket, getCoreRpcUrlMock, vi.doMock, pollUntil and inspect
handlers/connect and mockSocket.emit).

147-164: ⚡ Quick win

Add explicit test coverage for a completely absent thread slice.

The PR objectives mention "avoid errors when the thread slice is absent," and the production code uses optional chaining (threadState?.) to handle this. However, this test uses a present thread slice with null IDs rather than testing a completely absent slice.

Consider adding a test case where getState() returns {} or { thread: undefined } to explicitly verify the guard handles a missing thread slice without errors.

🧪 Suggested test case for absent thread slice
it('does not emit thread:subscribe when thread slice is absent from state', async () => {
  const { handlers, mockSocket } = buildMockSocket();

  vi.doMock('socket.io-client', () => ({ io: vi.fn(() => mockSocket) }));
  getCoreRpcUrlMock.mockResolvedValue('http://127.0.0.1:7788/rpc');
  storeMock.getState.mockReturnValue({}); // No thread slice at all

  const { socketService } = await import('../socketService');
  socketService.connect('jwt-test-no-thread-slice');

  await pollUntil(() => expect(handlers['connect']).toBeDefined());

  handlers['connect']!();

  const emitMock = (mockSocket as { emit: ReturnType<typeof vi.fn> }).emit;
  const threadSub = emitMock.mock.calls.find(([ev]) => ev === 'thread:subscribe');
  expect(threadSub).toBeUndefined();
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/services/__tests__/socketService.events.test.ts` around lines 147 -
164, Add a new test that simulates a completely absent thread slice by making
storeMock.getState return {} (or { thread: undefined }) before importing and
calling socketService.connect; reuse buildMockSocket, mockSocket and the
existing connect handler check (handlers['connect']) and assert that
mockSocket.emit was never called with 'thread:subscribe' to verify the
optional-chaining guard handles a missing thread slice without throwing. Ensure
the test name and setup mirror the existing test ('does not emit
thread:subscribe when thread slice is absent from state') and keep
getCoreRpcUrlMock and vi.doMock('socket.io-client') setup identical to the
current tests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 51c93e42-b983-47f3-a3ec-182369663d73

📥 Commits

Reviewing files that changed from the base of the PR and between 95c314d and 5e2bb0d.

📒 Files selected for processing (2)
  • app/src/services/__tests__/socketService.events.test.ts
  • app/src/services/socketService.ts

sanil-23 and others added 2 commits May 22, 2026 20:36
THREAD_SESSIONS (conversation history cache) and IN_FLIGHT (cancel registry) were keyed by key_for(client_id, thread_id), where client_id is the ephemeral Socket.IO connection id that is regenerated on every reconnect. So when the webview's socket reconnected with a new client_id, the same thread mapped to a fresh empty session (conversation amnesia) and cancel_chat could no longer find the running turn (Cancel became a silent no-op).

The persistence layer already keys by thread_id (turn-state snapshots, transcripts); this realigns the in-memory runtime with that stable identity. Event delivery still routes by client_id (the live socket); only the thread-owned runtime state now keys off thread_id.

NOTE: the live-stream Socket.IO room is still client_id-scoped (room = event.client_id); re-attaching an in-flight turn's stream to a reconnected socket needs a paired thread-room + client re-subscribe change, tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… reconnects

Chat-stream events were emitted only to the initiating socket's client_id room. Since client_id is regenerated on every Socket.IO reconnect, a webview that reconnected mid-turn stopped receiving the in-flight turn's stream (it went to the dead old-client_id room).

Now emit to BOTH the client_id room and a per-thread room (thread:<id>); a socket re-subscribes to its active thread room on (re)connect via a new thread:subscribe event, so the stream re-attaches to the new connection. socket.io de-duplicates a socket across multiple target rooms, so no double-render.

Pairs with the thread-keyed session/cancel fix (63ded2f); together they make a mid-turn reconnect fully transparent. Frontend: socketService re-emits thread:subscribe for the active thread on connect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sanil-23 sanil-23 force-pushed the fix/thread-keyed-session-and-stream branch from 5e2bb0d to 4dcf15f Compare May 22, 2026 18:37
coderabbitai Bot previously approved these changes May 22, 2026
…test thread:subscribe

The thread:subscribe re-subscribe added to the connect handler did store.getState().thread.selectedThreadId, which threw under the existing socketService.events.test.ts mock (storeMock had no getState) and would throw in any state where the thread slice is absent. Read the thread slice once and optional-chain it; add getState to the test mock; add two cases covering connect→thread:subscribe (active thread emits, no active thread does not).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
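
The guarded re-subscribe described in the commit message above can be sketched like this. It is a minimal illustration, assuming a store whose `thread` slice may be missing: the `ThreadSlice`, `AppState`, and `EmitterLike` interfaces and the `resubscribeActiveThread` name are hypothetical; only the `selectedThreadId ?? activeThreadId` fallback and the `thread:subscribe` event with a `{ thread_id }` payload are taken from the PR.

```typescript
// Hypothetical minimal shapes for the store state and the socket.
interface ThreadSlice {
  selectedThreadId?: string | null;
  activeThreadId?: string | null;
}
interface AppState {
  thread?: ThreadSlice;
}
interface EmitterLike {
  emit(event: string, payload: unknown): void;
}

// Intended to run from the socket's connect handler.
// Returns true when a subscribe was emitted.
function resubscribeActiveThread(socket: EmitterLike, state: AppState): boolean {
  // Read the slice once; optional chaining tolerates an absent slice.
  const threadState = state.thread;
  const threadId = threadState?.selectedThreadId ?? threadState?.activeThreadId;
  if (!threadId) return false; // no active thread: nothing to re-join
  socket.emit("thread:subscribe", { thread_id: threadId });
  return true;
}
```

Passing the state in as a parameter, rather than reading a global store, is a choice made here so that the three cases (selected thread, fallback to active thread, absent slice) are easy to exercise in isolation.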
@sanil-23 sanil-23 force-pushed the fix/thread-keyed-session-and-stream branch from 4dcf15f to 1194afe Compare May 22, 2026 18:45

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
app/src/services/__tests__/socketService.events.test.ts (1)

130-169: ⚡ Quick win

Add a fallback-branch test for activeThreadId when selectedThreadId is null.

Line 250 in app/src/services/socketService.ts uses selectedThreadId ?? activeThreadId, but current tests only cover selectedThreadId present and both missing. Add one case with selectedThreadId: null and activeThreadId: 'thread-abc' to lock the fallback behavior.

Proposed test addition
+  it('re-subscribes using activeThreadId when selectedThreadId is null', async () => {
+    const { handlers, mockSocket } = buildMockSocket();
+
+    vi.doMock('socket.io-client', () => ({ io: vi.fn(() => mockSocket) }));
+    getCoreRpcUrlMock.mockResolvedValue('http://127.0.0.1:7788/rpc');
+    storeMock.getState.mockReturnValue({
+      thread: { selectedThreadId: null, activeThreadId: 'thread-abc' },
+    });
+
+    const { socketService } = await import('../socketService');
+    socketService.connect('jwt-test-active-thread-fallback');
+
+    await pollUntil(() => expect(handlers['connect']).toBeDefined());
+    handlers['connect']!();
+
+    expect((mockSocket as { emit: ReturnType<typeof vi.fn> }).emit).toHaveBeenCalledWith(
+      'thread:subscribe',
+      { thread_id: 'thread-abc' }
+    );
+  });
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/src/services/__tests__/socketService.events.test.ts` around lines 130 -
169, Add a new test that verifies the fallback to activeThreadId when
selectedThreadId is null: use buildMockSocket() and
vi.doMock('socket.io-client') like the other tests, set storeMock.getState to
return { thread: { selectedThreadId: null, activeThreadId: 'thread-abc' } },
import socketService and call socketService.connect(...), wait for
handlers['connect'] and invoke it, then assert the mock socket emit was called
with 'thread:subscribe' and payload { thread_id: 'thread-abc' } to exercise the
selectedThreadId ?? activeThreadId branch in socketService.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c44c42fc-fd65-4379-b0a8-7b6119aa5caa

📥 Commits

Reviewing files that changed from the base of the PR and between 4dcf15f and 1194afe.

📒 Files selected for processing (2)
  • app/src/services/__tests__/socketService.events.test.ts
  • app/src/services/socketService.ts

@senamakel senamakel merged commit e854a07 into tinyhumansai:main May 22, 2026
30 of 32 checks passed
senamakel pushed a commit to aqilaziz/openhuman that referenced this pull request May 23, 2026
…read-room stream (tinyhumansai#2493)

Co-authored-by: sanil-23 <sanil@alphahuman.xyz>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>