realtime(observability): forward diagnostics + stats over the signaling WS #148
nagar-decart wants to merge 6 commits into
…ng WS
Pipes client-side WebRTC / ICE / networking observability events through
the existing realtime WebSocket as `{type: "observability", data}`
messages. Bouncer logs them under the session's structured-log context
in Datadog (DecartAI/api#1882), so SDK-side observations land
correlated with our server-side LiveKit / inference traces without any
new endpoints, auth, or transport.
What flows over the WS now:
- Every diagnostic emitted from `RealtimeObservability.diagnostic()`
(e.g. `client-session-connection-breakdown`, `reconnect`, `videoStall`)
- Every periodic WebRTC stats snapshot collected by the in-process
stats collector
Wiring:
- `SignalingChannel.sendObservability(data)` — new public, fire-and-forget
method that writes `{type: "observability", data}` (drops silently if
the socket isn't open; never throws).
- `RealtimeObservability.setObservabilityForwarder(fn | null)` — sink
set by `StreamSession.createTransport()` after the signaling channel
is constructed, and cleared on `tearDown()`.
- `ObservabilityMessage` added to `OutgoingRealtimeMessage` union.
This is intentionally additive to existing observability: the local
`onDiagnostic`/`onStats` callbacks and the platform telemetry POSTs
keep working as before.
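A minimal sketch of the fire-and-forget contract described under Wiring, assuming a hypothetical `SocketLike` shape in place of the SDK's real socket type. The class name `SignalingChannelSketch` and the boolean return value are illustrative assumptions only (the real method presumably returns nothing); the return value just makes the drop behaviour visible.

```typescript
// Hypothetical minimal socket shape; the SDK's real socket type is not shown here.
interface SocketLike {
  readyState: number;
  send(payload: string): void;
}

const SOCKET_OPEN = 1; // same numeric value as WebSocket.OPEN

type ObservabilityMessage = { type: "observability"; data: unknown };

class SignalingChannelSketch {
  private readonly socket: SocketLike | null;

  constructor(socket: SocketLike | null) {
    this.socket = socket;
  }

  // Returns true if a frame was written, false if it was dropped.
  // The boolean is an assumption added for illustration.
  sendObservability(data: unknown): boolean {
    try {
      if (this.socket === null || this.socket.readyState !== SOCKET_OPEN) {
        return false; // socket not open: drop silently
      }
      const message: ObservabilityMessage = { type: "observability", data };
      this.socket.send(JSON.stringify(message));
      return true;
    } catch {
      return false; // serialization or send failure is swallowed, never rethrown
    }
  }
}
```

Wrapping both the serialization and the `send` call in one `try` means a non-serializable payload cannot break the session either.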
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
commit: |
The existing observability sink only emitted three diagnostic types (client-session-connection-breakdown, reconnect, videoStall) plus periodic WebRTC stats: useful for metrics, useless for actually debugging an ICE failure. This adds raw event-level instrumentation that mirrors the iOS PR DecartAI/decart-ios#28's DecartLiveKitLogger + NetworkPathObserver coverage.

New file: observability/network-instrumentation.ts

Attaches addEventListener-based listeners to:
- The LiveKit Room (Connected/Disconnected/Reconnecting/Reconnected, SignalReconnecting, ConnectionStateChanged, ConnectionQualityChanged, MediaDevicesError, LocalTrackPublished, ParticipantConnected, TrackSubscribed/Muted/Unmuted).
- The underlying publisher + subscriber RTCPeerConnections (private access via room.engine.pcManager, with addEventListener so we never displace LiveKit's own handlers): icecandidate, icecandidateerror, iceconnectionstatechange, connectionstatechange, icegatheringstatechange, signalingstatechange, negotiationneeded, datachannel, track. ICE candidates carry full address/port/type/priority/foundation. Selected candidate pair (with RTT + addresses) is snapshotted from getStats() when ICE settles.
- Browser network state (initial snapshot of navigator.connection + online/offline + visibilitychange + NetworkInformation 'change').

New methods:
- RealtimeObservability.emitInstrumentationEvent(name, data) bypasses the strict DiagnosticEvents type union so per-candidate events can flow without bloating the public type surface.

Signaling traffic:
- SignalingChannel.writeMessage emits 'signaling-sent' (skips 'observability' to avoid recursing).
- SignalingChannel.handleMessage emits 'signaling-received'.

All events flow through the same observabilityForwarder -> realtime WS -> bouncer -> Datadog @event.* path that the previous change established. No new transport, no new endpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
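The recursion guard on 'signaling-sent' can be illustrated roughly as follows. This is a sketch under assumptions: `InstrumentedWriter`, `OutgoingMessage`, and the `messageType` field are hypothetical names, not the SDK's actual types. The point is only that an instrumentation sink which itself writes `observability` frames would loop forever without the early return.

```typescript
// Hypothetical message and sink shapes, for illustration only.
type OutgoingMessage = { type: string; [key: string]: unknown };
type InstrumentationSink = (name: string, data: Record<string, unknown>) => void;

class InstrumentedWriter {
  private readonly write: (frame: string) => void;
  private readonly emit: InstrumentationSink;

  constructor(write: (frame: string) => void, emit: InstrumentationSink) {
    this.write = write;
    this.emit = emit;
  }

  writeMessage(message: OutgoingMessage): void {
    this.write(JSON.stringify(message));
    // Observability frames are themselves produced by the sink below, so
    // instrumenting them would trigger another frame, and so on.
    if (message.type === "observability") return;
    this.emit("signaling-sent", { messageType: message.type });
  }
}
```

With a sink that forwards each event back through `writeMessage` as an `observability` frame, one outbound `prompt` message produces exactly two frames: the prompt itself and a single 'signaling-sent' report.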
}

// Re-export for convenience so consumers know what's available.
export { Track };
Unused Track import and re-export
Low Severity
Track is imported from livekit-client and re-exported at line 370, but it's never used within this file, and no other file imports Track from network-instrumentation. This is dead code — the only import from this module is attachRoomInstrumentation.
Reviewed by Cursor Bugbot for commit f58f9c7.
…ebug signal

Earlier instrumentation pass dumped everything over the WS: periodic WebRTC stats (every 1s), every signaling frame, every signalingstatechange, every track/visibility/connection-quality event. In a 15-min Datadog window with 3 successful sessions, stats alone produced 141 of 250 events (56%) while actual ICE-level data was ~7%. That signal-to-noise ratio makes the stream useless for debugging connection failures.

Forward only what helps diagnose WHY a connection fails or stalls:

DROPPED (noise during a healthy session, not useful for failure debug):
- Periodic WebRTC stats (still local via onStats + platform telemetry)
- signaling-sent for every outbound frame
- signaling-received for routine generation_tick / prompt_ack / set_image_ack
- signaling-state on PC (SDP renegotiation chatter)
- negotiation-needed, track-received, data-channel-opened on PC
- track-subscribed/-muted/-unmuted, local-track-published, participant-connected, connection-quality on Room
- page-visibility

KEPT (the ICE / connection debug picture):
- ice-candidate, ice-candidate-past, ice-candidate-error
- ice-connection-state, ice-gathering-state, pc-connection-state
- pc-attached (with snapshot of current state)
- selected-candidate-pair (with local/remote addresses + RTT)
- connected-address (LiveKit's getter for the chosen address)
- room-connected/-disconnected/-reconnecting/-signal-reconnecting/-reconnected
- room-connection-state
- media-devices-error
- network-state (initial + on change), browser-online/-offline
- signaling-received for livekit_room_info / session_id (the only signaling messages that matter for connection setup)
- client-session-connection-breakdown (the breakdown diagnostic)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
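One way to express the 'signaling-received' rule above is an allowlist, sketched here. The message type strings come from the commit message; the constant and function names are hypothetical, and the real implementation may be structured differently.

```typescript
// Only the signaling messages that matter for connection setup are forwarded.
const FORWARDED_SIGNALING_TYPES: ReadonlySet<string> = new Set([
  "livekit_room_info",
  "session_id",
]);

// Hypothetical helper: decides whether an inbound signaling message should
// produce a 'signaling-received' observability event.
function shouldForwardSignalingReceived(messageType: string): boolean {
  return FORWARDED_SIGNALING_TYPES.has(messageType);
}
```

An allowlist fails closed: a new routine message type added to the protocol later stays out of the stream until someone opts it in, which keeps the signal-to-noise ratio from regressing silently.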
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 4518087.
this.options.onStats?.(stats);
this.telemetryReporter.addStats(stats);
this.detectVideoStall(stats);
// Stats intentionally not forwarded over the WS. They fire every 1s
stop() doesn't clear observabilityForwarder like other fields
Low Severity
The stop() method resets every other private field (statsCollector, telemetryReporter, connectionBreakdown, etc.) but does not clear observabilityForwarder. After stop() is called, diagnostic() and emitInstrumentationEvent() still forward payloads over the WebSocket. In client.ts, observability.stop() is called before session.disconnect(), creating a window where the "stopped" observability instance continues sending data.
Reviewed by Cursor Bugbot for commit 4518087.
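A possible shape for the fix, sketched against a stripped-down stand-in for the class. `ObservabilitySketch` is hypothetical and omits every other field the real `stop()` resets; only the forwarder handling is shown.

```typescript
type Forwarder = (data: unknown) => void;

// Stand-in for RealtimeObservability, reduced to the forwarder path.
class ObservabilitySketch {
  private observabilityForwarder: Forwarder | null = null;

  setObservabilityForwarder(fn: Forwarder | null): void {
    this.observabilityForwarder = fn;
  }

  diagnostic(name: string, data: unknown): void {
    // Forwarding is a no-op once the forwarder has been cleared.
    this.observabilityForwarder?.({ name, data });
  }

  stop(): void {
    // The real method also resets statsCollector, telemetryReporter, etc.
    this.observabilityForwarder = null;
  }
}
```

Clearing the forwarder inside `stop()` closes the window between `observability.stop()` and `session.disconnect()` without relying on the caller's teardown order.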
- pc-attached now carries the PC's iceServers config (URLs only; credentials redacted to presence-bools) + iceTransportPolicy. Directly answers "did the JoinResponse give us STUN/TURN at all?", which was the core unknown in the 29CM ICE-failure investigation.
- room-disconnected now decodes the numeric DisconnectReason enum into a readable string (signal_close, connection_timeout, media_failure, etc.). Mirrored in the test mock.
- livekit_room_info signaling event now carries livekitUrl / roomName / sessionId, which tells us which SFU node owns the room without needing to grep server-side logs.
- Dropped room-connection-state (duplicated room-connected / -disconnected / -reconnecting).
- Dropped connected-address (redundant with selected-candidate-pair.remote.address).
- ice-candidate trimmed from 15 fields to 12: drop raw SDP `candidate` string, sdpMid, sdpMLineIndex, usernameFragment. Keep type, protocol, address, port, priority, foundation, component, tcpType, relatedAddress/Port, networkType, url.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
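The "credentials redacted to presence-bools" idea might look roughly like this. It is a sketch only: `IceServerLike` is a local structural stand-in for the browser's `RTCIceServer` dictionary, and the output field names `hasUsername` / `hasCredential` are assumptions, not the names used in the PR.

```typescript
// Structural stand-in for RTCIceServer, so the sketch needs no DOM types.
interface IceServerLike {
  urls: string | string[];
  username?: string;
  credential?: string;
}

// Hypothetical redacted shape: URLs survive, secrets become booleans.
interface RedactedIceServer {
  urls: string[];
  hasUsername: boolean;
  hasCredential: boolean;
}

function redactIceServers(servers: IceServerLike[]): RedactedIceServer[] {
  return servers.map((server) => ({
    // `urls` may be a single string or an array; normalize to an array copy.
    urls: Array.isArray(server.urls) ? [...server.urls] : [server.urls],
    hasUsername: Boolean(server.username),
    hasCredential: Boolean(server.credential),
  }));
}
```

The booleans are enough to tell "no TURN server was offered" apart from "a TURN server was offered without credentials", while keeping secrets out of the log pipeline.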
The TS lib used in CI typecheck (different from the DOM types vitest uses locally) doesn't expose RTCIceCandidateStats. The RTCIceCandidatePairStats reference also tripped because the same lib exposes it under a slightly different shape: narrowing failed and field accesses landed on 'never'.

Replaced with two minimal structural types (IceCandidateStat, IceCandidatePairStat) covering only the fields we actually read from getStats(). Functional behavior is unchanged.

Also cast the session_id signaling case through `unknown`: it's sent by the bouncer but not in the typed IncomingRealtimeMessage union, so TS narrowed the comparison to never.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
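For illustration, minimal structural types of this kind could look like the sketch below. The type names match the commit message, but the field lists are assumptions drawn from the standard WebRTC stats dictionaries, and `summarizeSelectedPair` with its "nominated and succeeded" heuristic is hypothetical rather than the PR's actual selection logic.

```typescript
type IceCandidateStat = {
  id: string;
  type: "local-candidate" | "remote-candidate";
  address?: string;
  port?: number;
  candidateType?: string;
};

type IceCandidatePairStat = {
  id: string;
  type: "candidate-pair";
  state?: string;
  nominated?: boolean;
  localCandidateId: string;
  remoteCandidateId: string;
  currentRoundTripTime?: number; // seconds in the WebRTC stats model
};

type IceStat = IceCandidateStat | IceCandidatePairStat;

type SelectedPairSummary = {
  local: IceCandidateStat | null;
  remote: IceCandidateStat | null;
  rttMs: number | null;
};

// Takes the entries of a stats report (already copied into an array) and
// resolves the nominated, succeeded pair to its two candidates.
function summarizeSelectedPair(stats: IceStat[]): SelectedPairSummary | null {
  const candidates = new Map<string, IceCandidateStat>();
  let selected: IceCandidatePairStat | null = null;
  for (const stat of stats) {
    if (stat.type === "candidate-pair") {
      if (stat.nominated === true && stat.state === "succeeded") selected = stat;
    } else {
      candidates.set(stat.id, stat);
    }
  }
  if (selected === null) return null;
  const rtt = selected.currentRoundTripTime;
  return {
    local: candidates.get(selected.localCandidateId) ?? null,
    remote: candidates.get(selected.remoteCandidateId) ?? null,
    rttMs: rtt === undefined ? null : rtt * 1000,
  };
}
```

Because the `type` field is a literal discriminant, the compiler narrows each entry without depending on whichever DOM lib version happens to be on the typecheck path.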


Summary
Pipes client-side WebRTC / ICE / networking observability events through the existing realtime WebSocket as `{type: "observability", data}` messages. Bouncer logs them under the session's structured-log context in Datadog (DecartAI/api#1882), so SDK-side observations land correlated with our server-side LiveKit / inference traces without any new endpoints, auth, or transport.

Context: while debugging the 29CM connection failures we had rich server-side LiveKit / ICE logs but no symmetric visibility into what the SDK's own WebRTC engine was producing on the client. This closes the gap.

What flows over the WS now
- `RealtimeObservability.diagnostic()`:
  - `client-session-connection-breakdown` (with per-phase timings + error)
  - `reconnect`
  - `videoStall`

Wiring
- `SignalingChannel.sendObservability(data)`: new public, fire-and-forget method that writes `{type: "observability", data}` (drops silently if socket isn't open, never throws).
- `RealtimeObservability.setObservabilityForwarder(fn | null)`: sink set by `StreamSession.createTransport()` after the signaling channel is constructed, and cleared on `tearDown()`.
- `ObservabilityMessage` added to the `OutgoingRealtimeMessage` union.

Intentionally additive to existing observability: the local `onDiagnostic`/`onStats` callbacks and the platform telemetry POSTs keep working as before.

Diff size
4 files, +49 / -1 lines. No new dependencies, no new config.
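As a rough illustration of the forwarder lifecycle from the Wiring section, the sketch below sets the sink when the transport is created and clears it on teardown. `ObservabilityStub` and `SessionSketch` are hypothetical stand-ins for `RealtimeObservability` and `StreamSession`, and the `send` callback stands in for `SignalingChannel.sendObservability`.

```typescript
type ObservabilitySink = (data: unknown) => void;

// Stand-in for RealtimeObservability: holds at most one forwarder.
class ObservabilityStub {
  private forwarder: ObservabilitySink | null = null;

  setObservabilityForwarder(fn: ObservabilitySink | null): void {
    this.forwarder = fn;
  }

  diagnostic(name: string, data: unknown): void {
    this.forwarder?.({ name, data });
  }
}

// Stand-in for StreamSession: owns the wiring between the two objects.
class SessionSketch {
  private readonly observability: ObservabilityStub;
  private readonly send: ObservabilitySink;

  constructor(observability: ObservabilityStub, send: ObservabilitySink) {
    this.observability = observability;
    this.send = send;
  }

  createTransport(): void {
    // Registered only once the signaling channel exists.
    this.observability.setObservabilityForwarder((data) => this.send(data));
  }

  tearDown(): void {
    this.observability.setObservabilityForwarder(null);
  }
}
```

Keeping the forwarder as a nullable function means the observability object never holds a direct reference to the signaling channel, so the local callbacks and telemetry POSTs are unaffected when no session is active.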
Test plan
- `pnpm -F @decartai/sdk build`: passes
- `pnpm -F @decartai/sdk test`: 206 tests pass
- `service:bouncer-realtime "RTAPI: client observability"` log lines in Datadog us5

🤖 Generated with Claude Code
Note
Medium Risk
Touches live session transport and relies on LiveKit private APIs for peer-connection hooks; failures are mostly best-effort but could add listener/polling overhead during connect.
Overview
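Adds client-side connection observability on the realtime SDK path: diagnostics and ICE/WebRTC debug events are sent on the existing signaling WebSocket as `{ type: "observability", data }`, so bouncer can log them next to server traces (e.g. Datadog).

Wiring: `StreamSession` registers an observability forwarder on `RealtimeObservability` that calls `SignalingChannel.sendObservability` (best-effort, no throw). Typed `ObservabilityMessage` is added to outgoing WS messages. Forwarder is cleared on teardown.

What gets forwarded: All `diagnostic()` payloads (connection breakdown, reconnect, video stall, etc.) plus free-form `emitInstrumentationEvent` streams. Periodic WebRTC stats are not forwarded over the WS (still go to local callbacks and telemetry POST) to avoid noise.

Deep ICE instrumentation: New `attachRoomInstrumentation` runs after LiveKit `room.connect()` from `MediaChannel`, reaches private `room.engine.pcManager` peer connections, and emits candidates, ICE/PC state, selected pairs, browser online/network-change, and selected room/signaling events (`livekit_room_info` without token). Routine prompt/generation ack traffic is suppressed on the signaling side.

Reviewed by Cursor Bugbot for commit deeaca7. Bugbot is set up for automated code reviews on this repo.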
attachRoomInstrumentationruns after LiveKitroom.connect()fromMediaChannel, reaches privateroom.engine.pcManagerpeer connections, and emits candidates, ICE/PC state, selected pairs, browser online/network-change, and selected room/signaling events (livekit_room_infowithout token). Routine prompt/generation ack traffic is suppressed on the signaling side.Reviewed by Cursor Bugbot for commit deeaca7. Bugbot is set up for automated code reviews on this repo. Configure here.