
Refactor transcript pipeline: direct Vexa WS passthrough + frontend TranscriptManager #139

Merged
jspada200 merged 6 commits into AcademySoftwareFoundation:main from DmitriyG228:transcript-passthrough-135
May 5, 2026

Conversation

@DmitriyG228
Collaborator

Summary

Closes #135.

Replaces the re-hash-and-flatten transcript pipeline with a passthrough over Vexa's new WS contract ({type:"transcript", confirmed, pending}). Backend preserves Vexa's stable segment_id all the way to MongoDB; frontend delegates dedup to @vexaai/transcript-rendering's TranscriptManager.

Backend

  • models/stored_segment.py — carry Vexa's segment_id; add start_time, end_time, completed; remove the hashed generate_segment_id.
  • transcription_service.on_transcription_updated — upsert confirmed segments by Vexa's segment_id; broadcast the flat {type:"transcript", speaker, confirmed, pending, playlist_id, version_id, ts} shape verbatim.
  • transcription_providers/vexa.py::_handle_ws_message — accept the new type:"transcript" frame; forward confirmed/pending/speaker/ts raw.
  • storage_providers/mongodb.py::upsert_segment — drop duplicate segment_id from $setOnInsert (it already lives in $set via the model dump); MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
  • main.py — new dev-only endpoint POST /test/broadcast-transcript, 404 unless DNA_TESTING_ENABLED=true. Used by end-to-end WS shape assertions; must never be enabled in production.
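
The $set/$setOnInsert split behind the ConflictingUpdateOperators fix can be sketched in Python. `build_upsert` is a hypothetical helper (not the actual mongodb.py code) showing why `segment_id` must live in only one operator:

```python
# Sketch of the upsert update document, assuming a pymongo-style
# update_one(query, update, upsert=True) call. The fix described above:
# segment_id arrives via the model dump in $set, so it must NOT also
# appear in $setOnInsert, or MongoDB raises ConflictingUpdateOperators.

def build_upsert(segment: dict) -> tuple[dict, dict]:
    """Return (filter, update) for collection.update_one(..., upsert=True)."""
    query = {
        "segment_id": segment["segment_id"],
        "playlist_id": segment["playlist_id"],
        "version_id": segment["version_id"],
    }
    update = {
        "$set": dict(segment),                       # full model dump, includes segment_id
        "$setOnInsert": {"created_at": segment["ts"]},  # insert-only fields, no segment_id
    }
    # Invariant behind the fix: no key may appear in both operators.
    assert not set(update["$set"]) & set(update["$setOnInsert"])
    return query, update
```

The `created_at` insert-only field is illustrative; only the disjointness of the two operators is the point.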

Frontend

  • Add @vexaai/transcript-rendering@^0.4.0 — the single dedup authority for rendering.
  • @dna/core gains TranscriptEventPayload and a 'transcript' EventType; DNAEventClient.handleMessage forwards raw transcript messages without the {type, payload} envelope so TranscriptManager.handleMessage consumes them directly.
  • useSegments — createTranscriptManager<StoredSegment>() per (playlist, version); REST bootstrap and WS ticks both flow through it.
  • useDNAEvents — new useTranscriptEvents hook.
  • useAISuggestion — switched from useSegmentEvents to useTranscriptEvents.
  • StoredSegment interface — gains start_time, end_time, completed.
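
The dedup TranscriptManager performs can be illustrated with a hypothetical Python analog (the real library is TypeScript; all names here are illustrative, not the @vexaai/transcript-rendering API): confirmed segments upsert into a map keyed by the stable segment_id, while pending drafts are replaced wholesale on every tick.

```python
# Hypothetical analog of the frontend dedup: confirmed keyed by Vexa's
# stable segment_id (so repeated ticks for one segment collapse to one
# entry), pending replaced per tick (drafts pass through as-is).

class TranscriptState:
    def __init__(self) -> None:
        self.confirmed: dict[str, dict] = {}
        self.pending: list[dict] = []

    def handle_message(self, msg: dict) -> None:
        if msg.get("type") != "transcript":
            return
        for seg in msg.get("confirmed", []):
            self.confirmed[seg["segment_id"]] = seg  # upsert: same id -> one entry
        self.pending = list(msg.get("pending", []))   # drafts replaced wholesale

    def render_order(self) -> list[dict]:
        # confirmed in time order, then the current pending drafts
        return sorted(self.confirmed.values(), key=lambda s: s["start_time"]) + self.pending
```

Both the REST bootstrap and WS ticks can feed `handle_message`, which is the convergence property the PR relies on.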

Test plan

Exercised locally (mocks) and on a fresh Linode VM against real Vexa Cloud.

  • Backend: upsert uses Vexa's segment_id verbatim (not a hash)
  • Backend: 2 WS ticks for the same segment_id with shifted absolute_start_time → 1 MongoDB document (the #135 bug that kicked this off)
  • Backend: StoredSegmentCreate carries segment_id, start_time, end_time, completed, language, vexa_updated_at
  • Backend: DNA WS broadcast is flat {type, speaker, confirmed, pending, playlist_id, version_id, ts} — no envelope wrapping
  • Frontend: StoredSegment interface has start_time, end_time, completed
  • Frontend: useSegments.ts imports + calls createTranscriptManager, .bootstrap(...), .handleMessage(...)
  • Frontend: 3-tick sequence through TranscriptManager — confirmed duplicates collapse to 1; pending drafts pass through
  • VM: backend /health, frontend / both 200
  • VM: /transcription/bot/.../status round-trips to Vexa Cloud with the configured key
  • VM: /ws stays open
  • VM: live WS receives the flat shape verbatim when a synthetic transcript is broadcast through the dev endpoint
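
The flat-shape assertion exercised above can be written as a small predicate; this is a sketch, with key names taken from the broadcast description, not the actual test code:

```python
# Sketch: check a WS frame is the flat passthrough shape with no
# {type, payload} envelope wrapping. Key names follow the PR description.
REQUIRED_KEYS = {"type", "speaker", "confirmed", "pending",
                 "playlist_id", "version_id", "ts"}

def is_flat_transcript_frame(frame: dict) -> bool:
    return (
        frame.get("type") == "transcript"
        and REQUIRED_KEYS <= frame.keys()
        and "payload" not in frame  # envelope field must be absent
    )
```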

Follow-ups (not in this PR)

  • backend/tests/test_transcription_service.py + test_legacy_models.py still reference the removed generate_segment_id and the old segments: [...] payload shape — they'll fail pytest until rewritten against the new contract.
  • Consider adding a compound MongoDB index on {segment_id, playlist_id, version_id} to back the upsert query.
  • SEGMENT_CREATED / SEGMENT_UPDATED EventTypes and useSegmentEvents are no longer emitted/called and can be removed in a follow-up cleanup.
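
The compound-index follow-up could look like this, assuming a pymongo-style create_index API; the helper name is illustrative:

```python
# Sketch of the proposed follow-up index; key names mirror the upsert
# query {segment_id, playlist_id, version_id}. `create_index` follows the
# pymongo signature; `collection` is whatever handle mongodb.py holds.
UPSERT_KEY = [("segment_id", 1), ("playlist_id", 1), ("version_id", 1)]

def ensure_upsert_index(collection) -> str:
    # unique: two documents can never share the same upsert key;
    # create_index is idempotent, so this is safe on every start.
    return collection.create_index(
        UPSERT_KEY, name="segments_upsert_key", unique=True
    )
```

Whether the index should also be unique depends on whether historical data could already contain duplicates; a plain index would back the query without the constraint.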

🤖 Generated with Claude Code

DmitriyG228 and others added 6 commits May 4, 2026 20:35
Closes AcademySoftwareFoundation#135.

Replaces the re-hash-and-flatten transcript pipeline with a passthrough over
Vexa's new WS contract (`{type:"transcript", confirmed, pending}`).

Backend:
- models/stored_segment.py: carry Vexa's stable `segment_id`; add
  `start_time`, `end_time`, `completed`; remove hashed `generate_segment_id`.
- transcription_service.on_transcription_updated: upsert `confirmed` by
  Vexa's `segment_id`; broadcast the flat `{type:"transcript", speaker,
  confirmed, pending, playlist_id, version_id, ts}` shape verbatim.
- transcription_providers/vexa.py::_handle_ws_message: accept the new
  `type:"transcript"` frame; forward `confirmed/pending/speaker/ts` raw.
- storage_providers/mongodb.py::upsert_segment: drop duplicate `segment_id`
  from `$setOnInsert` — it already lives in `$set` via the model dump, and
  MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
- main.py: add `/test/broadcast-transcript` endpoint gated by env var
  `DNA_TESTING_ENABLED` for end-to-end WS shape assertions (404 otherwise).

Frontend:
- Add `@vexaai/transcript-rendering@^0.4.0`; it is now the single dedup
  authority for transcript rendering.
- `@dna/core` gains `TranscriptEventPayload` + a `'transcript'` EventType;
  `DNAEventClient.handleMessage` forwards raw transcript messages without
  the `{type, payload}` envelope so `TranscriptManager.handleMessage`
  consumes them directly.
- useSegments: `createTranscriptManager<StoredSegment>()` per (playlist,
  version); REST bootstrap and WS ticks both flow through it.
- useDNAEvents: new `useTranscriptEvents` hook.
- useAISuggestion: switched off `useSegmentEvents`.
- StoredSegment: gains `start_time`, `end_time`, `completed`.

Follow-ups (out of scope for this PR):
- backend/tests/test_transcription_service.py + test_legacy_models.py still
  reference the removed `generate_segment_id` and the old `segments:[...]`
  payload shape; they will need to be rewritten against the new contract.

Signed-off-by: DmitryG228 <2280905@gmail.com>
Follow-up cleanup to the transcript passthrough:

Dead-code removal
- EventType.SEGMENT_CREATED / SEGMENT_UPDATED — never emitted after the
  passthrough lands; drop from the enum. Update /ws docstring in main.py.
- SegmentEventPayload, useSegmentEvents, subscribeToSegmentEvents — the
  frontend reads segments via the new flat `transcript` event only; drop
  the dead wrappers.

Tests
- Delete TestOnTranscriptionUpdated + TestSegmentIdGeneration classes in
  test_transcription_service.py (they referenced the old segments:[...]
  payload shape and the removed generate_segment_id).
- Update `test_forwards_transcript_updated` to use the new confirmed/pending
  payload shape.
- Update test_storage_providers.py: StoredSegmentCreate now requires
  segment_id.
- Swap SEGMENT_CREATED/UPDATED sample event types for
  TRANSCRIPTION_UPDATED/COMPLETED in test_event_publisher.py and
  test_websocket.py (sample types used to verify subscribe/publish wiring —
  behavior unchanged, just picked live EventType values).

Full pytest: 256 passing. tests-local Gate still GREEN (7/7).

Signed-off-by: DmitryG228 <2280905@gmail.com>
`upsert_segment` keys on {segment_id, playlist_id, version_id}. Without an
index, each upsert does a full collection scan, which only matters once the
collection grows — but growth is fast (refine-heavy write rate on a live
meeting), so the fix belongs in the same ship as the upsert refactor.

- mongodb.py: `ensure_indexes()` creates a unique compound index
  segments_upsert_key {segment_id, playlist_id, version_id} and a supporting
  index segments_list_by_version {playlist_id, version_id, absolute_start_time}
  for the REST bootstrap query.
- main.py: call `storage.ensure_indexes()` in the FastAPI startup hook
  (guarded by hasattr so tests that mock the provider stay happy).

Idempotent — safe to call on every container start.

Signed-off-by: DmitryG228 <2280905@gmail.com>
Mirrors the Vexa dashboard's two render-time polish cues now that
`useSegments` yields the raw Vexa shape (including `completed` per
segment and contiguous same-speaker runs via `TranscriptManager`):

- Pending segments (`completed === false`, i.e. draft ticks arriving in
  the WS `pending[]` array) render with muted color, italic, and 0.75
  opacity — same visual semantics as Vexa's `text-muted-foreground/70
  italic` in services/dashboard/src/components/transcript/transcript-segment.tsx.
- Consecutive segments from the same speaker no longer repeat the name +
  timestamp header. The first segment of a run carries the header; the
  rest pad tightly to read as one block. Mirrors `showSpeakerHeader` in
  Vexa's transcript-viewer.

Render-time only — no changes to the manager, backend, schema, or the
raw WS envelope.

Signed-off-by: DmitryG228 <2280905@gmail.com>
…pdated

Two issues reported from the live deploy:
- Sometimes only fresh WS transcripts showed after a version switch; the
  REST-loaded historical segments appeared to vanish.
- Every Vexa tick produced two WS frames (`transcription.updated` wrapped +
  `transcript` flat); nothing subscribes to the wrapped one.

useSegments — fix the race
- `manager.bootstrap(rest)` in the queryFn CLEARS confirmed + pending maps;
  any WS tick that landed during the REST fetch was wiped. Swap it for the
  additive tick path (`manager.handleMessage({type:'transcript', confirmed:
  rest, pending: []})`) so REST + WS converge on the same confirmed map
  keyed by Vexa's `segment_id`.
- Seed the manager from React Query's cache during the version-change
  effect. Without this, a revisit to a version within `staleTime` skipped
  the queryFn entirely — the cleared manager never got the REST data, so
  new WS ticks sat alone on an empty base. Now the cached REST list is
  tick-merged into the manager on mount, before WS ticks start.
- Capture the queryKey's playlist/version at queryFn start; if the user
  switched while we were fetching, return the raw REST (valid cache for
  the OLD queryKey) instead of contaminating it with the current manager
  state (which belongs to the NEW version).

Dead-code cleanup — `transcription.updated`
- `_on_vexa_event` no longer publishes `TRANSCRIPTION_UPDATED`; the flat
  `{type:"transcript", ...}` broadcast from `on_transcription_updated`
  carries the full payload and is what the frontend consumes. Nothing
  subscribed to the wrapped event.
- Trim `EventType` enum to the 3 values actually emitted: BOT_STATUS_CHANGED,
  TRANSCRIPTION_COMPLETED, TRANSCRIPTION_ERROR. Remove unused
  TRANSCRIPTION_SUBSCRIBE, TRANSCRIPTION_STARTED, TRANSCRIPTION_UPDATED,
  PLAYLIST_UPDATED, VERSION_UPDATED, DRAFT_NOTE_UPDATED.
- Trim frontend `EventType` union to match (removes `'playlist.updated'`,
  `'version.updated'`).
- `test_forwards_transcript_updated` rewritten to verify
  `on_transcription_updated` is called (the actual behaviour).
- Sample event types in test_event_publisher + test_websocket retargeted
  to live enum values so the subscribe/publish mechanism stays exercised.

Full pytest: 256 passing. Local Gate: GREEN.

Signed-off-by: DmitryG228 <2280905@gmail.com>
…iption_updated

- test_vexa_provider: replace stale `transcript.mutable` payloads with
  the new flat `{type:"transcript", confirmed, pending, speaker, ts}`
  shape; add coverage for the empty-defaults path.
- test_transcription_service: add TestOnTranscriptionUpdated covering
  the upsert + flat broadcast happy path, every early-return branch
  (missing providers / no playlist mapping / no metadata / paused /
  in_review None), the resumed_at filter (aware + naive) including
  the ValueError fall-through on bad timestamps, missing-required-field
  skips, top-level-speaker fallback, and upsert exception swallowing.
- Bumps coverage on transcription_service.py from 67% to 92%, restoring
  the suite above the 90% gate. Black-formatted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: DmitryG228 <2280905@gmail.com>
@DmitriyG228 DmitriyG228 force-pushed the transcript-passthrough-135 branch from d1b1167 to 09c157d on May 4, 2026 17:35
existing = await self.segments_collection.find_one(query)
is_new = existing is None

# `segment_id` is already in `data.model_dump()` — MongoDB rejects an

Comment seems unneeded


class StoredSegmentCreate(BaseModel):
"""Model for creating a stored segment."""
"""Model for creating/upserting a stored segment (raw Vexa passthrough)."""

Although Vexa is a first class dependency now, I would like to try to keep this part of the codebase generalized as much as we can. Nit picky, but change where we say "Vexa" to "transcriptProvider".

absolute_end_time: str = Field(
..., description="UTC timestamp (ISO 8601) of segment end"
)
vexa_updated_at: Optional[str] = Field(

Same here


@jspada200 jspada200 left a comment


This is looking and working great! Small thing with the variable naming on the model in Mongo, but besides that this is good to merge. Once fixed, feel free to merge this in.

@jspada200 jspada200 merged commit 26f1486 into AcademySoftwareFoundation:main on May 5, 2026
4 checks passed