LiveKit startup: attach real output video element and preserve warm pre-published tracks #149

## Problem

We are trying to reduce LiveKit browser time-to-first-generated-frame toward <1s. Server-side work in `DecartAI/api` PR #1870 moved output track publication off the critical path by eager-publishing `model-output-video` after the server receives the client's input `track_subscribed` event.

That change helps, but it exposes two SDK/client-side risks:

1. The SDK currently attaches the LiveKit remote track internally, then emits a `MediaStream` for the app to assign to its own `<video>`. Playback therefore starts on a hidden element rather than the visible one, so SDK-side playback events do not correspond to the user-visible first frame.
2. A warm/pre-published server output track can already exist by the time the browser finishes `room.connect()`. `RoomEvent.TrackPublished` is not reliable for tracks that were published before the local participant joined, so the SDK needs a post-connect sweep over existing remote publications.

## Data points

A production rVFC (`requestVideoFrameCallback`) probe, run before the eager-publish server change, showed p50 `connect() -> first decoded frame` still at roughly 2.0-2.9s across regions:

| Region | Setup p50 | TTFF p50 | Derived connect->first decoded (TTFF - setup) |
|---|---:|---:|---:|
| us-west-2 | 1082ms | 3972ms | 2890ms |
| us-east-1 | 1583ms | 3688ms | 2105ms |
| eu-west-1 | 2224ms | 4338ms | 2114ms |
| ap-northeast-1 | 2004ms | 4174ms | 2170ms |
| ap-southeast-1 | 2715ms | 4758ms | 2043ms |

The receiver jitter buffer was populated at roughly 80-110ms, so the jitter buffer alone does not explain the 2-3s post-connect delay.
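
For reference, the derived column in the table above is just TTFF p50 minus setup p50 per region. The sketch below recomputes it from the table values; the type and function names are illustrative and not part of the SDK. Note that a difference of two p50s is only an approximation of the p50 of the per-session difference.

```ts
// Per-region p50 values copied from the table above (milliseconds).
interface RegionP50 {
  region: string;
  setupMs: number;
  ttffMs: number;
}

const rows: RegionP50[] = [
  { region: "us-west-2", setupMs: 1082, ttffMs: 3972 },
  { region: "us-east-1", setupMs: 1583, ttffMs: 3688 },
  { region: "eu-west-1", setupMs: 2224, ttffMs: 4338 },
  { region: "ap-northeast-1", setupMs: 2004, ttffMs: 4174 },
  { region: "ap-southeast-1", setupMs: 2715, ttffMs: 4758 },
];

// Derived connect -> first decoded = TTFF p50 - setup p50.
function deriveConnectToFirstDecoded(row: RegionP50): number {
  return row.ttffMs - row.setupMs;
}

const derived = rows.map((row) => ({
  region: row.region,
  derivedMs: deriveConnectToFirstDecoded(row),
}));
```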

A local Fast-4G bit-invert A/B of server PR #1870 confirmed that lazy output-track publication was on the critical path:

- lazy baseline median first-frame: 2843ms, n=5, range 2656-3314
- eager publish median first-frame: 1667ms, n=5, range 1160-1808
- median improvement: -1176ms / -41%

That still leaves a large browser/media-path tail. The SDK needs to make the receive path deterministic and measurable.

## Exact SDK observations

From local package inspection:

- `TrackSubscribed` is the first event with a usable `RemoteTrack`; it is not first decoded frame.
- `TrackPublished` is documented/implemented as a publication event after join; initial participant/publication state can be applied before `ConnectionState.Connected`. A pre-published server output track therefore needs an explicit post-connect sweep.
- `track.attach()` only wires up playback; it does not signal the first decoded/rendered video frame. Use `HTMLVideoElement.requestVideoFrameCallback` for user-visible first-frame timing.
- Current SDK behavior calls `track.attach()` without the user's visible `<video>`, then separately emits a `MediaStream`. That is weak for both TTFF and adaptive-stream behavior, because neither the SDK's playback events nor LiveKit's element observation involve the element the user actually sees.
- `autoSubscribe` should stay true for TTFF; turning it off adds a manual subscribe signaling round trip.
- Local publish currently happens after connect and publishes stream tracks in `stream.getTracks()` order. If audio is first, video publish can be delayed.
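
The publish-order observation can be addressed with a small reordering step before the publish loop. The sketch below is a minimal illustration, not the SDK's current code: `TrackLike`, `videoFirst`, and `publishVideoFirst` are hypothetical names, and `publish` stands in for whatever call the SDK uses to publish a single track.

```ts
// Minimal structural type: the only MediaStreamTrack field this sketch reads.
interface TrackLike {
  kind: string; // "audio" or "video"
}

// Video tracks first, everything else after; original order is kept within each group.
function videoFirst<T extends TrackLike>(tracks: readonly T[]): T[] {
  const video = tracks.filter((t) => t.kind === "video");
  const rest = tracks.filter((t) => t.kind !== "video");
  return [...video, ...rest];
}

// Hypothetical publish loop: publishes sequentially so video is never queued behind audio.
async function publishVideoFirst<T extends TrackLike>(
  tracks: readonly T[],
  publish: (track: T) => Promise<unknown>,
): Promise<void> {
  for (const track of videoFirst(tracks)) {
    await publish(track);
  }
}
```

In the SDK this would be called with `stream.getTracks()` and a callback wrapping the local participant's publish call. Deferring audio until after the first generated output would be a further variation on the same idea.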

## Recommended SDK changes

1. Add an SDK option for the actual output `HTMLVideoElement`, or expose `onRemoteTrack(track, publication, participant)` so apps/probes can attach synchronously.

2. On `TrackSubscribed` for the inference-server video track:

```ts
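// Muted + playsInline lets autoplay start without a user gesture.
videoEl.muted = true;
videoEl.autoplay = true;
videoEl.playsInline = true;
// Register the first-frame callback before attaching so no frame is missed.
videoEl.requestVideoFrameCallback(onFirstFrame);
track.attach(videoEl);
// play() can still reject under autoplay policy; log instead of leaving it unhandled.
void videoEl.play().catch((err) => console.warn("output video play() rejected", err));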
```

3. After `room.connect(...)`, sweep `room.remoteParticipants` and their publications/tracks to attach any already-present inference-server video. This is required for server pre-publish/warm-output tracks.

4. Publish local video before audio, or publish video first and defer audio until after first generated output if audio is not needed to start generation.

5. Keep `autoSubscribe: true` and keep `adaptiveStream: false` for the first TTFF experiments. If adaptive stream is enabled later, attach the real visible element and set `pauseVideoInBackground: false`. Without the visible element attached, LiveKit cannot observe the actual element and may suppress or downscale the stream.
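
The `onRemoteTrack` callback and the post-connect sweep from the list above could share one deduplicating handler, roughly as sketched here. This is a sketch under assumptions: the `*Like` interfaces are hypothetical structural stand-ins whose field names follow livekit-client v2 (`remoteParticipants`, `trackPublications`, `trackSid`, `trackName`), and the output track is assumed to be identifiable by its track name (for example `model-output-video`).

```ts
// Hypothetical structural stand-ins for the LiveKit objects this sketch touches.
interface TrackLike {
  kind: string;
}
interface PublicationLike<T extends TrackLike> {
  trackSid: string;
  trackName: string;
  track?: T; // only set once the track is subscribed
}
interface ParticipantLike<T extends TrackLike> {
  identity: string;
  trackPublications: Map<string, PublicationLike<T>>;
}
interface RoomLike<T extends TrackLike> {
  remoteParticipants: Map<string, ParticipantLike<T>>;
}

type OnRemoteTrack<T extends TrackLike> = (
  track: T,
  publication: PublicationLike<T>,
  participant: ParticipantLike<T>,
) => void;

// `handle` is meant for the TrackSubscribed listener, `sweep` for right after connect.
// Both go through the same sid-based dedupe, so a track is delivered at most once.
function createOutputTrackHandler<T extends TrackLike>(
  outputTrackName: string,
  onRemoteTrack: OnRemoteTrack<T>,
) {
  const seenSids = new Set<string>();

  function handle(
    track: T | undefined,
    publication: PublicationLike<T>,
    participant: ParticipantLike<T>,
  ): boolean {
    if (!track || track.kind !== "video") return false;
    if (publication.trackName !== outputTrackName) return false;
    if (seenSids.has(publication.trackSid)) return false;
    seenSids.add(publication.trackSid);
    onRemoteTrack(track, publication, participant);
    return true;
  }

  // Walks publications that already exist; returns how many were newly delivered.
  function sweep(room: RoomLike<T>): number {
    let attached = 0;
    room.remoteParticipants.forEach((participant) => {
      participant.trackPublications.forEach((publication) => {
        if (handle(publication.track, publication, participant)) attached += 1;
      });
    });
    return attached;
  }

  return { handle, sweep };
}
```

The intended wiring is to register `handle` on `RoomEvent.TrackSubscribed` before calling `room.connect(...)`, then call `sweep(room)` once connect resolves. Publications that exist but are not yet subscribed are skipped by the sweep and picked up later by the event path.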

## Measurement requirements

The probe/SDK should separately report:

- `connect_start`
- `room_connect_done`
- local video publish start/done
- server output `TrackPublished` if observed
- server output `TrackSubscribed`
- visible element attach/play resolved
- first rVFC (any frame)
- first rVFC after `generation_started` (first generated frame)
- first inbound stats `framesDecoded > 0`
- first inbound stats `keyFramesDecoded > 0`
- `pliCount` / `firCount` around startup
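
The inbound stats items in this list map onto fields of the standard WebRTC `inbound-rtp` stats entry. A minimal sketch of extracting them from a stats report follows; the interface and function names are hypothetical, and how the probe obtains the report (for example from the receiver's `getStats()`) is left to the caller.

```ts
// Subset of the standard inbound-rtp stats fields that the startup probe reports.
interface InboundVideoStatsLike {
  type: string;
  kind?: string;
  framesDecoded?: number;
  keyFramesDecoded?: number;
  pliCount?: number;
  firCount?: number;
}

interface StartupStatsSnapshot {
  framesDecoded: number;
  keyFramesDecoded: number;
  pliCount: number;
  firCount: number;
}

// RTCStatsReport is map-like; only forEach is needed here, which also makes it easy to fake.
interface StatsReportLike {
  forEach(callback: (value: InboundVideoStatsLike) => void): void;
}

// Returns counters from the first inbound video stream, or undefined if none exists yet.
// Fields a browser does not report are treated as 0.
function readInboundVideoStats(report: StatsReportLike): StartupStatsSnapshot | undefined {
  const found: StartupStatsSnapshot[] = [];
  report.forEach((entry) => {
    if (entry.type !== "inbound-rtp" || entry.kind !== "video") return;
    found.push({
      framesDecoded: entry.framesDecoded ?? 0,
      keyFramesDecoded: entry.keyFramesDecoded ?? 0,
      pliCount: entry.pliCount ?? 0,
      firCount: entry.firCount ?? 0,
    });
  });
  return found[0];
}
```

Because stats are polled, the timestamps for `framesDecoded > 0` and `keyFramesDecoded > 0` are only as precise as the polling interval, so they complement the rVFC milestones rather than replace them.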

Important: after PR #1870 the server intentionally pushes an initial black warmup frame. A plain first-rVFC latch will measure that black frame and overstate the generated-content win. We need both `first decoded any frame` and `first decoded generated frame`.
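
One way to record both milestones is a two-stage latch driven by repeated rVFC callbacks, sketched below. `FirstFrameLatch`, `VideoLike`, and `watchFrames` are hypothetical names. The sketch assumes the `generation_started` timestamp is taken on the same clock as the rVFC `now` argument, and it treats "first frame presented after `generation_started`" as the generated-frame proxy; a frame already in flight could still be the warmup frame, so a stricter probe might also inspect pixel content.

```ts
type Milestone = "first_frame_any" | "first_frame_generated";

class FirstFrameLatch {
  readonly marks: Partial<Record<Milestone, number>> = {};
  private generationStartedAt: number | undefined = undefined;

  // Call once when the generation_started signal arrives.
  markGenerationStarted(nowMs: number): void {
    if (this.generationStartedAt === undefined) this.generationStartedAt = nowMs;
  }

  // Call on every rVFC tick. Returns true while more frames still need to be observed.
  onFrame(nowMs: number): boolean {
    if (this.marks.first_frame_any === undefined) {
      this.marks.first_frame_any = nowMs; // may be the black warmup frame
    }
    if (
      this.marks.first_frame_generated === undefined &&
      this.generationStartedAt !== undefined &&
      nowMs >= this.generationStartedAt
    ) {
      this.marks.first_frame_generated = nowMs;
    }
    return this.marks.first_frame_generated === undefined;
  }
}

// Structural subset of HTMLVideoElement so the wiring can be exercised without a DOM.
interface VideoLike {
  requestVideoFrameCallback(callback: (now: number, metadata: unknown) => void): number;
}

// rVFC callbacks are one-shot, so re-register until both milestones are latched.
function watchFrames(video: VideoLike, latch: FirstFrameLatch): void {
  const tick = (now: number): void => {
    if (latch.onFrame(now)) video.requestVideoFrameCallback(tick);
  };
  video.requestVideoFrameCallback(tick);
}
```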

## Acceptance criteria

- Warm/pre-published server output track is handled even if no `TrackPublished` event fires after connect.
- The app-visible `<video>` receives the attached track directly.
- SDK/probe can distinguish black warmup first frame from first generated frame.
- TTFF instrumentation uses rVFC, not 1Hz `framesReceived` polling.
- A browser test covers the warm pre-published output-track path.

## Target

Server PR #1870 appears to remove roughly 1.2s median in local A/B by moving output publication/SFU subscription off the first real frame path. The remaining target is to reduce the post-connect generated-frame tail from ~1.6-2.1s toward <1s by removing hidden-element ambiguity, missed pre-published tracks, publish-order delay, and keyframe/decode wait.
