Problem
We are trying to reduce LiveKit browser time-to-first-generated-frame toward <1s. Server-side work in DecartAI/api PR #1870 moved output track publication off the critical path by eager-publishing model-output-video after the server receives the client's input track_subscribed event.
That change helps, but it exposes two SDK/client-side risks:
- The SDK currently attaches the LiveKit remote track internally, then emits a MediaStream for the app to assign to its own <video>. This splits hidden playback from the actual visible video element, so SDK-side playback events are not the same as the user-visible first frame.
- A warm/pre-published server output track can already exist by the time the browser finishes room.connect(). RoomEvent.TrackPublished is not reliable for tracks that were published before the local participant joined, so the SDK needs a post-connect sweep over existing remote publications.
Data points
Production rVFC probe before the eager-publish server change showed p50 connect() -> first decoded frame still around 2.0-2.9s across regions:
| Region | Setup p50 | TTFF p50 | Derived connect->first decoded |
| --- | --- | --- | --- |
| us-west-2 | 1082ms | 3972ms | 2890ms |
| us-east-1 | 1583ms | 3688ms | 2105ms |
| eu-west-1 | 2224ms | 4338ms | 2114ms |
| ap-northeast-1 | 2004ms | 4174ms | 2170ms |
| ap-southeast-1 | 2715ms | 4758ms | 2043ms |
Receiver jitter buffer was populated at roughly 80-110ms, so the 2-3s post-connect delay was not explained by jitter buffer alone.
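For reference, the derived column can be reproduced from the other two. The sketch below assumes the derived value is simply TTFF p50 minus Setup p50, which matches every row of the table; the type and function names are illustrative only. Note that a difference of two medians is not in general the median of the per-session differences, so the derived figures are best read as estimates.

```typescript
// Rows copied from the table above; all values are p50 milliseconds.
interface RegionTiming {
  region: string;
  setupP50Ms: number;
  ttffP50Ms: number;
}

const rows: RegionTiming[] = [
  { region: "us-west-2", setupP50Ms: 1082, ttffP50Ms: 3972 },
  { region: "us-east-1", setupP50Ms: 1583, ttffP50Ms: 3688 },
  { region: "eu-west-1", setupP50Ms: 2224, ttffP50Ms: 4338 },
  { region: "ap-northeast-1", setupP50Ms: 2004, ttffP50Ms: 4174 },
  { region: "ap-southeast-1", setupP50Ms: 2715, ttffP50Ms: 4758 },
];

// Assumption: derived connect -> first decoded frame = TTFF p50 - Setup p50.
function derivedConnectToFirstDecodedMs(row: RegionTiming): number {
  return row.ttffP50Ms - row.setupP50Ms;
}

const derived = rows.map((r) => ({
  region: r.region,
  ms: derivedConnectToFirstDecodedMs(r),
}));
```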
Server PR #1870 local Fast-4G bit-invert A/B showed the lazy output-track critical path was real:
- lazy baseline median first-frame: 2843ms, n=5, range 2656-3314
- eager publish median first-frame: 1667ms, n=5, range 1160-1808
- median improvement: -1176ms / -41%
That still leaves a large browser/media-path tail. The SDK needs to make the receive path deterministic and measurable.
Exact SDK observations
From local package inspection:
- TrackSubscribed is the first event with a usable RemoteTrack; it does not mark the first decoded frame.
- TrackPublished is documented/implemented as a publication event after join; initial participant/publication state can be applied before ConnectionState.Connected. A pre-published server output track therefore needs an explicit post-connect sweep.
- track.attach() only sets up playback wiring; it does not signal first decoded/rendered video. Use HTMLVideoElement.requestVideoFrameCallback for user-visible first-frame timing.
- Current SDK behavior calls track.attach() without the user's visible <video>, then separately emits a MediaStream. That is weak for both TTFF and adaptive-stream behavior.
- autoSubscribe should stay true for TTFF; turning it off adds a manual subscribe signaling round trip.
- Local publish currently happens after connect and publishes stream tracks in stream.getTracks() order. If audio is first, video publish can be delayed.
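The publish-order observation could be addressed with a small reordering step before the publish loop. This is a minimal sketch, not the SDK's current code: TrackLike and orderVideoFirst are hypothetical names, and the only property relied on is kind, which MediaStreamTrack exposes as "audio" or "video".

```typescript
// Structural stand-in for MediaStreamTrack; only `kind` is used.
interface TrackLike {
  kind: string;
}

// Returns a new array with video tracks first, preserving the relative
// order within each group. The input array is not mutated.
function orderVideoFirst<T extends TrackLike>(tracks: readonly T[]): T[] {
  const video = tracks.filter((t) => t.kind === "video");
  const rest = tracks.filter((t) => t.kind !== "video");
  return [...video, ...rest];
}

// Possible use in the publish path (assumes a connected livekit-client Room):
//   for (const t of orderVideoFirst(stream.getTracks())) {
//     await room.localParticipant.publishTrack(t);
//   }
```

Whether audio should be deferred further, until after the first generated output, depends on whether the server needs audio to start generation.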
Recommended SDK changes
- Add an SDK option for the actual output HTMLVideoElement, or expose onRemoteTrack(track, publication, participant) so apps/probes can attach synchronously.
- On TrackSubscribed for the inference-server video track:

      videoEl.muted = true;
      videoEl.autoplay = true;
      videoEl.playsInline = true;
      // Register the frame callback before attaching so the first frame is not missed.
      videoEl.requestVideoFrameCallback(onFirstFrame);
      track.attach(videoEl);
      // play() can reject under autoplay policy; how to recover is left to the app.
      void videoEl.play().catch((err) => console.warn("video.play() rejected", err));

- After room.connect(...), sweep room.remoteParticipants and their publications/tracks to attach any already-present inference-server video. This is required for server pre-publish/warm-output tracks.
- Publish local video before audio, or publish video first and defer audio until after first generated output if audio is not needed to start generation.
- Keep autoSubscribe: true and keep adaptiveStream: false for the first TTFF experiments. If enabling adaptive stream later, attach the real visible element and set pauseVideoInBackground: false; otherwise LiveKit may suppress/downscale because it cannot observe the actual element.
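The post-connect sweep might look roughly like the sketch below. It is written against minimal structural types rather than the real livekit-client classes; the property names (remoteParticipants, trackPublications, kind, trackName, track) are assumed to match livekit-client v2 and should be checked against the version the SDK pins. sweepExistingOutputVideo and the isOutput predicate are hypothetical names.

```typescript
// Minimal structural types, assumed to mirror the livekit-client v2 shapes.
interface PublicationLike<TTrack> {
  kind: string;      // "audio" | "video" in livekit-client
  trackName: string;
  track?: TTrack;    // undefined until the publication is subscribed
}
interface ParticipantLike<TTrack> {
  identity: string;
  trackPublications: Map<string, PublicationLike<TTrack>>;
}
interface RoomLike<TTrack> {
  remoteParticipants: Map<string, ParticipantLike<TTrack>>;
}

// Walks every remote publication that already exists after connect and hands
// each subscribed output video track to `onTrack`. Returns how many tracks
// were handed over.
function sweepExistingOutputVideo<TTrack>(
  room: RoomLike<TTrack>,
  isOutput: (pub: PublicationLike<TTrack>, p: ParticipantLike<TTrack>) => boolean,
  onTrack: (track: TTrack, pub: PublicationLike<TTrack>, p: ParticipantLike<TTrack>) => void,
): number {
  let handled = 0;
  room.remoteParticipants.forEach((participant) => {
    participant.trackPublications.forEach((pub) => {
      if (pub.kind !== "video" || !isOutput(pub, participant)) return;
      // Published but not yet subscribed: TrackSubscribed will deliver it later.
      if (pub.track === undefined) return;
      onTrack(pub.track, pub, participant);
      handled++;
    });
  });
  return handled;
}
```

A caller could match on the model-output-video track name mentioned above, for example isOutput = (pub) => pub.trackName === "model-output-video"; whether name matching is the right selector is an assumption. Because the same track can arrive through both the sweep and TrackSubscribed, onTrack should be idempotent.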
Measurement requirements
The probe/SDK should separately report:
- connect_start
- room_connect_done
- local video publish start/done
- server output TrackPublished if observed
- server output TrackSubscribed
- visible element attach/play resolved
- first rVFC any frame
- first rVFC after generation_started / first generated frame
- first inbound stats framesDecoded > 0
- first inbound stats keyFramesDecoded > 0
- pliCount / firCount around startup
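The inbound-stats items in the list above could be tracked with a small pure reducer fed from each getStats() sample. This is a sketch under assumptions: the field names (framesDecoded, keyFramesDecoded, pliCount, firCount) are the standard inbound-rtp video stats fields, while DecodeMilestones and updateDecodeMilestones are hypothetical names. Timestamps from stats polling are only as precise as the polling interval, so this complements rVFC rather than replacing it.

```typescript
// Subset of an inbound-rtp stats entry; structural so it also works in tests.
interface InboundVideoStatsLike {
  type: string;
  kind?: string;
  framesDecoded?: number;
  keyFramesDecoded?: number;
  pliCount?: number;
  firCount?: number;
}

interface DecodeMilestones {
  firstFramesDecodedMs?: number;   // first sample with framesDecoded > 0
  firstKeyFrameDecodedMs?: number; // first sample with keyFramesDecoded > 0
  pliCount: number;                // latest observed cumulative PLI count
  firCount: number;                // latest observed cumulative FIR count
}

// Folds one stats sample, taken at `nowMs`, into the milestones.
// Entries that are not inbound-rtp video are ignored.
function updateDecodeMilestones(
  prev: DecodeMilestones,
  stats: InboundVideoStatsLike,
  nowMs: number,
): DecodeMilestones {
  if (stats.type !== "inbound-rtp" || stats.kind !== "video") return prev;
  const next: DecodeMilestones = {
    ...prev,
    pliCount: stats.pliCount ?? prev.pliCount,
    firCount: stats.firCount ?? prev.firCount,
  };
  if (next.firstFramesDecodedMs === undefined && (stats.framesDecoded ?? 0) > 0) {
    next.firstFramesDecodedMs = nowMs;
  }
  if (next.firstKeyFrameDecodedMs === undefined && (stats.keyFramesDecoded ?? 0) > 0) {
    next.firstKeyFrameDecodedMs = nowMs;
  }
  return next;
}
```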
Important: after PR #1870 the server intentionally pushes an initial black warmup frame. A plain first-rVFC latch will measure that black frame and overstate the generated-content win. We need both first decoded any frame and first decoded generated frame.
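One way to keep the two measurements apart is a two-stage latch driven by the rVFC callback. The sketch below is illustrative: FirstFrameLatch is a hypothetical name, and it uses "first rVFC after generation_started" as the proxy for the first generated frame, as listed in the measurement requirements. If the warmup frame can still be presented after generation_started, a content check would also be needed.

```typescript
// Latches two timestamps: the first presented frame of any kind (possibly the
// black warmup frame) and the first frame presented at or after the
// generation_started signal.
class FirstFrameLatch {
  firstAnyFrameMs?: number;
  firstGeneratedFrameMs?: number;
  private generationStartedMs?: number;

  // Call when the generation_started signal is observed.
  markGenerationStarted(nowMs: number): void {
    if (this.generationStartedMs === undefined) this.generationStartedMs = nowMs;
  }

  // Call from every requestVideoFrameCallback invocation with its `now` value.
  onVideoFrame(nowMs: number): void {
    if (this.firstAnyFrameMs === undefined) this.firstAnyFrameMs = nowMs;
    if (
      this.firstGeneratedFrameMs === undefined &&
      this.generationStartedMs !== undefined &&
      nowMs >= this.generationStartedMs
    ) {
      this.firstGeneratedFrameMs = nowMs;
    }
  }

  // True once both timestamps are latched and rVFC no longer needs re-arming.
  isDone(): boolean {
    return this.firstAnyFrameMs !== undefined && this.firstGeneratedFrameMs !== undefined;
  }
}

// Possible wiring (browser only):
//   const latch = new FirstFrameLatch();
//   const tick = (now: number) => {
//     latch.onVideoFrame(now);
//     if (!latch.isDone()) videoEl.requestVideoFrameCallback(tick);
//   };
//   videoEl.requestVideoFrameCallback(tick);
```

Since requestVideoFrameCallback fires once per registration, the callback has to re-register itself until both values are latched.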
Acceptance criteria
- Warm/pre-published server output track is handled even if no TrackPublished event fires after connect.
- The app-visible <video> receives the attached track directly.
- SDK/probe can distinguish black warmup first frame from first generated frame.
- TTFF instrumentation uses rVFC, not 1Hz framesReceived polling.
- A browser test covers the warm pre-published output-track path.
Target
Server PR #1870 appears to remove roughly 1.2s median in local A/B by moving output publication/SFU subscription off the first real frame path. The remaining target is to reduce the post-connect generated-frame tail from ~1.6-2.1s toward <1s by removing hidden-element ambiguity, missed pre-published tracks, publish-order delay, and keyframe/decode wait.