LiveKit startup: attach real output video element and preserve warm pre-published tracks #149

@eilon-decart

Problem

We are trying to reduce LiveKit browser time-to-first-generated-frame toward <1s. Server-side work in DecartAI/api PR #1870 moved output track publication off the critical path by eager-publishing model-output-video after the server receives the client's input track_subscribed event.

That change helps, but it exposes two SDK/client-side risks:

  1. The SDK currently attaches the LiveKit remote track internally, then emits a MediaStream for the app to assign to its own <video>. This splits hidden playback from the actual visible video element, so SDK-side playback events are not the same as user-visible first frame.
  2. A warm/pre-published server output track can already exist by the time the browser finishes room.connect(). RoomEvent.TrackPublished is not reliable for tracks that were published before the local participant joined, so the SDK needs a post-connect sweep over existing remote publications.

Data points

Production rVFC probe before the eager-publish server change showed p50 connect() -> first decoded frame still around 2.0-2.9s across regions:

Region           Setup p50   TTFF p50   Derived connect -> first decoded
us-west-2        1082ms      3972ms     2890ms
us-east-1        1583ms      3688ms     2105ms
eu-west-1        2224ms      4338ms     2114ms
ap-northeast-1   2004ms      4174ms     2170ms
ap-southeast-1   2715ms      4758ms     2043ms
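
For reference, the derived column is consistent with a plain subtraction of the two p50 values. The sketch below assumes that is how it was computed; the `RegionRow` shape and function name are illustrative only. Note that a difference of medians is not necessarily the median of per-session differences.

```typescript
// Rows copied from the table above; all values are p50 milliseconds.
interface RegionRow {
  region: string;
  setupMs: number;
  ttffMs: number;
}

const rows: RegionRow[] = [
  { region: "us-west-2", setupMs: 1082, ttffMs: 3972 },
  { region: "us-east-1", setupMs: 1583, ttffMs: 3688 },
  { region: "eu-west-1", setupMs: 2224, ttffMs: 4338 },
  { region: "ap-northeast-1", setupMs: 2004, ttffMs: 4174 },
  { region: "ap-southeast-1", setupMs: 2715, ttffMs: 4758 },
];

// Assumption: derived connect -> first decoded = TTFF p50 minus setup p50.
function derivedConnectToFirstDecoded(row: RegionRow): number {
  return row.ttffMs - row.setupMs;
}

const derived = rows.map((row) => ({
  region: row.region,
  derivedMs: derivedConnectToFirstDecoded(row),
}));
```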

Receiver jitter buffer was populated at roughly 80-110ms, so the 2-3s post-connect delay was not explained by jitter buffer alone.

Server PR #1870 local Fast-4G bit-invert A/B showed the lazy output-track critical path was real:

  • lazy baseline median first-frame: 2843ms, n=5, range 2656-3314
  • eager publish median first-frame: 1667ms, n=5, range 1160-1808
  • median improvement: -1176ms / -41%

That still leaves a large browser/media-path tail. The SDK needs to make the receive path deterministic and measurable.

Exact SDK observations

From local package inspection:

  • TrackSubscribed is the first event with a usable RemoteTrack; it is not first decoded frame.
  • TrackPublished is documented/implemented as a publication event after join; initial participant/publication state can be applied before ConnectionState.Connected. A pre-published server output track therefore needs an explicit post-connect sweep.
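  • track.attach() only wires the track to a media element; it does not indicate that video has been decoded or rendered. Use HTMLVideoElement.requestVideoFrameCallback for user-visible first-frame timing.
  • Current SDK behavior calls track.attach() without the user's visible <video>, then separately emits a MediaStream. This hurts both TTFF measurement (playback events fire on a hidden element, not the visible one) and adaptive-stream behavior (LiveKit cannot observe the element the user actually sees).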
  • autoSubscribe should stay true for TTFF; turning it off adds a manual subscribe signaling round trip.
  • Local publish currently happens after connect and publishes stream tracks in stream.getTracks() order. If audio is first, video publish can be delayed.
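
The publish-order observation in the last bullet could be addressed along these lines. This is a minimal sketch, not the SDK's current code: `TrackLike` is a structural stand-in for MediaStreamTrack, and the `publish` callback is a hypothetical wrapper around something like room.localParticipant.publishTrack(track).

```typescript
// Minimal structural stand-in for MediaStreamTrack; only `kind` matters here.
interface TrackLike {
  kind: string; // "audio" or "video"
  id: string;
}

// Return tracks with video first, preserving relative order within each kind.
function videoFirst<T extends TrackLike>(tracks: readonly T[]): T[] {
  const video = tracks.filter((t) => t.kind === "video");
  const rest = tracks.filter((t) => t.kind !== "video");
  return [...video, ...rest];
}

// Hypothetical publish loop: `publish` is supplied by the caller, so this
// sketch does not depend on a specific SDK call signature.
async function publishInOrder<T extends TrackLike>(
  tracks: readonly T[],
  publish: (track: T) => Promise<unknown>,
): Promise<string[]> {
  const publishedIds: string[] = [];
  for (const track of videoFirst(tracks)) {
    await publish(track); // video finishes publishing before audio starts
    publishedIds.push(track.id);
  }
  return publishedIds;
}
```

Calling `publishInOrder(stream.getTracks(), publish)` would then publish video first regardless of the order stream.getTracks() returns.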

Recommended SDK changes

  1. Add an SDK option for the actual output HTMLVideoElement, or expose onRemoteTrack(track, publication, participant) so apps/probes can attach synchronously.

  2. On TrackSubscribed for the inference-server video track:

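     // Configure the visible element so autoplay is allowed before attaching.
     videoEl.muted = true;
     videoEl.autoplay = true;
     videoEl.playsInline = true;
     // Register the first-frame callback before attach so the first presented frame is not missed.
     videoEl.requestVideoFrameCallback(onFirstFrame);
     track.attach(videoEl);
     // play() can reject (for example under autoplay policy); log instead of leaving it unhandled.
     void videoEl.play().catch((err) => console.warn("video play() rejected", err));
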
  3. After room.connect(...), sweep room.remoteParticipants and their publications/tracks to attach any already-present inference-server video. This is required for server pre-publish/warm-output tracks.

  4. Publish local video before audio, or publish video first and defer audio until after first generated output if audio is not needed to start generation.

  5. Keep autoSubscribe: true and keep adaptiveStream: false for the first TTFF experiments. If enabling adaptive stream later, attach the real visible element and set pauseVideoInBackground: false; otherwise LiveKit may suppress/downscale because it cannot observe the actual element.
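
A rough sketch of the post-connect sweep combined with an idempotent attach, so the TrackSubscribed handler and the sweep can share one code path. The interfaces are structural stand-ins rather than the real livekit-client types. The property names (remoteParticipants, trackPublications, publication.track, trackName) are assumed to match the SDK version in use, and matching on the name "model-output-video" is an assumption based on the server PR description.

```typescript
// Structural stand-ins for livekit-client and DOM types (assumptions, see above).
interface VideoElementLike {
  muted: boolean;
  autoplay: boolean;
  playsInline: boolean;
}
interface RemoteTrackLike {
  kind: string;
  attach(element: VideoElementLike): unknown;
}
interface PublicationLike {
  trackName: string;
  track?: RemoteTrackLike; // only set once the track is subscribed
}
interface ParticipantLike {
  trackPublications: Map<string, PublicationLike>;
}
interface RoomLike {
  remoteParticipants: Map<string, ParticipantLike>;
}

// Assumption: the server output track can be identified by this name.
const OUTPUT_TRACK_NAME = "model-output-video";

type AttachFn = (track: RemoteTrackLike, publication: PublicationLike) => boolean;

// Returns an attach function that wires the output track to the visible
// element at most once, whether it is called from an event or from the sweep.
function createOutputAttacher(videoEl: VideoElementLike): AttachFn {
  let attached = false;
  return (track, publication) => {
    if (attached) return false;
    if (track.kind !== "video" || publication.trackName !== OUTPUT_TRACK_NAME) return false;
    attached = true;
    videoEl.muted = true;
    videoEl.autoplay = true;
    videoEl.playsInline = true;
    track.attach(videoEl);
    return true;
  };
}

// Walk publications that already exist after connect and attach any
// subscribed output track that no event handler has picked up yet.
function sweepExistingPublications(room: RoomLike, attach: AttachFn): number {
  let attachedCount = 0;
  room.remoteParticipants.forEach((participant) => {
    participant.trackPublications.forEach((publication) => {
      if (publication.track && attach(publication.track, publication)) {
        attachedCount += 1;
      }
    });
  });
  return attachedCount;
}
```

The intended use is to register the same attach function on TrackSubscribed before calling room.connect(...), then run the sweep once connect resolves. Whichever path sees the track first wins, and the other becomes a no-op. Publications that exist but are not yet subscribed are left to the TrackSubscribed handler.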

Measurement requirements

The probe/SDK should separately report:

  • connect_start
  • room_connect_done
  • local video publish start/done
  • server output TrackPublished if observed
  • server output TrackSubscribed
  • visible element attach/play resolved
  • first rVFC any frame
  • first rVFC after generation_started / first generated frame
  • first inbound stats framesDecoded > 0
  • first inbound stats keyFramesDecoded > 0
  • pliCount / firCount around startup
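
The stats-based entries in this list could be collected roughly as follows. The field names (framesDecoded, keyFramesDecoded, pliCount, firCount on an "inbound-rtp" video entry) come from the WebRTC statistics API. The surrounding types and the latch class are illustrative, and how the probe obtains the RTCStatsReport from the SDK is left open. Timestamps latched this way are only as precise as the polling interval, so they complement the rVFC marks rather than replace them.

```typescript
// Subset of RTCInboundRtpStreamStats used during startup.
interface InboundStatsEntry {
  type: string;
  kind?: string;
  framesDecoded?: number;
  keyFramesDecoded?: number;
  pliCount?: number;
  firCount?: number;
}

// Structural stand-in for RTCStatsReport (a Map-like with forEach).
interface StatsReportLike {
  forEach(callback: (entry: InboundStatsEntry) => void): void;
}

interface StartupStatsSample {
  framesDecoded: number;
  keyFramesDecoded: number;
  pliCount: number;
  firCount: number;
}

// Pick the first inbound video RTP entry and default missing counters to 0.
function readInboundVideoStats(report: StatsReportLike): StartupStatsSample | undefined {
  const samples: StartupStatsSample[] = [];
  report.forEach((entry) => {
    if (entry.type === "inbound-rtp" && entry.kind === "video") {
      samples.push({
        framesDecoded: entry.framesDecoded ?? 0,
        keyFramesDecoded: entry.keyFramesDecoded ?? 0,
        pliCount: entry.pliCount ?? 0,
        firCount: entry.firCount ?? 0,
      });
    }
  });
  return samples[0];
}

// Latches the first poll time at which each counter became non-zero.
class InboundStatsLatch {
  firstFramesDecodedMs: number | undefined = undefined;
  firstKeyFrameDecodedMs: number | undefined = undefined;

  observe(sample: StartupStatsSample, nowMs: number): void {
    if (this.firstFramesDecodedMs === undefined && sample.framesDecoded > 0) {
      this.firstFramesDecodedMs = nowMs;
    }
    if (this.firstKeyFrameDecodedMs === undefined && sample.keyFramesDecoded > 0) {
      this.firstKeyFrameDecodedMs = nowMs;
    }
  }
}
```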

Important: after PR #1870 the server intentionally pushes an initial black warmup frame. A plain first-rVFC latch will measure that black frame and overstate the generated-content win. We need both first decoded any frame and first decoded generated frame.
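
One possible shape for the two latches is sketched below. It assumes the client learns about generation_started through some signal (how that signal is delivered is not specified here) and that the caller timestamps it on the same clock as the rVFC `now` argument, for example performance.now(). Treating "first rVFC after generation_started" as the first generated frame is an approximation: a warmup frame still in flight could be presented just after the signal, so a stricter probe may need a pixel check or a server-side frame marker.

```typescript
// Latches "first decoded any frame" and "first frame after generation_started".
class FirstFrameLatch {
  firstAnyFrameMs: number | undefined = undefined;
  generationStartedMs: number | undefined = undefined;
  firstGeneratedFrameMs: number | undefined = undefined;

  // Call when the (assumed) generation_started signal is observed.
  markGenerationStarted(nowMs: number): void {
    if (this.generationStartedMs === undefined) {
      this.generationStartedMs = nowMs;
    }
  }

  // Call from every requestVideoFrameCallback invocation.
  onVideoFrame(nowMs: number): void {
    if (this.firstAnyFrameMs === undefined) {
      this.firstAnyFrameMs = nowMs; // may be the black warmup frame
    }
    if (
      this.firstGeneratedFrameMs === undefined &&
      this.generationStartedMs !== undefined &&
      nowMs >= this.generationStartedMs
    ) {
      this.firstGeneratedFrameMs = nowMs;
    }
  }
}

// Structural stand-in for the rVFC part of HTMLVideoElement.
interface FrameCallbackSource {
  requestVideoFrameCallback(callback: (nowMs: number) => void): number;
}

// rVFC is one-shot, so re-arm it until the generated-frame latch has fired.
function watchFrames(video: FrameCallbackSource, latch: FirstFrameLatch): void {
  const onFrame = (nowMs: number): void => {
    latch.onVideoFrame(nowMs);
    if (latch.firstGeneratedFrameMs === undefined) {
      video.requestVideoFrameCallback(onFrame);
    }
  };
  video.requestVideoFrameCallback(onFrame);
}
```

Reporting both latched values keeps the black-frame timing available as a transport/decode signal while the generated-frame timing tracks what the user actually sees.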

Acceptance criteria

  • Warm/pre-published server output track is handled even if no TrackPublished event fires after connect.
  • The app-visible <video> receives the attached track directly.
  • SDK/probe can distinguish black warmup first frame from first generated frame.
  • TTFF instrumentation uses rVFC, not 1Hz framesReceived polling.
  • A browser test covers the warm pre-published output-track path.

Target

Server PR #1870 appears to remove roughly 1.2s median in local A/B by moving output publication/SFU subscription off the first real frame path. The remaining target is to reduce the post-connect generated-frame tail from ~1.6-2.1s toward <1s by removing hidden-element ambiguity, missed pre-published tracks, publish-order delay, and keyframe/decode wait.
