
Refactor transcript pipeline: direct Vexa WS passthrough + frontend TranscriptManager #139

Merged
jspada200 merged 6 commits into AcademySoftwareFoundation:main from DmitriyG228:transcript-passthrough-135
May 5, 2026

Conversation

@DmitriyG228
Collaborator

Summary

Closes #135.

Replaces the re-hash-and-flatten transcript pipeline with a passthrough over Vexa's new WS contract ({type:"transcript", confirmed, pending}). Backend preserves Vexa's stable segment_id all the way to MongoDB; frontend delegates dedup to @vexaai/transcript-rendering's TranscriptManager.

Backend

  • models/stored_segment.py — carry Vexa's segment_id; add start_time, end_time, completed; remove the hashed generate_segment_id.
  • transcription_service.on_transcription_updated — upsert confirmed segments by Vexa's segment_id; broadcast the flat {type:"transcript", speaker, confirmed, pending, playlist_id, version_id, ts} shape verbatim.
  • transcription_providers/vexa.py::_handle_ws_message — accept the new type:"transcript" frame; forward confirmed/pending/speaker/ts raw.
  • storage_providers/mongodb.py::upsert_segment — drop duplicate segment_id from $setOnInsert (it already lives in $set via the model dump); MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
  • main.py — new dev-only endpoint POST /test/broadcast-transcript, 404 unless DNA_TESTING_ENABLED=true. Used by end-to-end WS shape assertions; must never be enabled in production.
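
The $set/$setOnInsert split behind the ConflictingUpdateOperators fix can be sketched in Python. `build_upsert` is a hypothetical helper (not the actual mongodb.py code) showing why `segment_id` must live in only one operator:

```python
# Sketch of the upsert update document, assuming a pymongo-style
# update_one(query, update, upsert=True) call. The fix described above:
# segment_id arrives via the model dump in $set, so it must NOT also
# appear in $setOnInsert, or MongoDB raises ConflictingUpdateOperators.

def build_upsert(segment: dict) -> tuple[dict, dict]:
    """Return (filter, update) for collection.update_one(..., upsert=True)."""
    query = {
        "segment_id": segment["segment_id"],
        "playlist_id": segment["playlist_id"],
        "version_id": segment["version_id"],
    }
    update = {
        "$set": dict(segment),                       # full model dump, includes segment_id
        "$setOnInsert": {"created_at": segment["ts"]},  # insert-only fields, no segment_id
    }
    # Invariant behind the fix: no key may appear in both operators.
    assert not set(update["$set"]) & set(update["$setOnInsert"])
    return query, update
```

The `created_at` insert-only field is illustrative; only the disjointness of the two operators is the point.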

Frontend

  • Add @vexaai/transcript-rendering@^0.4.0 — the single dedup authority for rendering.
  • @dna/core gains TranscriptEventPayload and a 'transcript' EventType; DNAEventClient.handleMessage forwards raw transcript messages without the {type, payload} envelope so TranscriptManager.handleMessage consumes them directly.
  • useSegments — createTranscriptManager<StoredSegment>() per (playlist, version); REST bootstrap and WS ticks both flow through it.
  • useDNAEvents — new useTranscriptEvents hook.
  • useAISuggestion — switched from useSegmentEvents to useTranscriptEvents.
  • StoredSegment interface — gains start_time, end_time, completed.
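
The dedup TranscriptManager performs can be illustrated with a hypothetical Python analog (the real library is TypeScript; all names here are illustrative, not the @vexaai/transcript-rendering API): confirmed segments upsert into a map keyed by the stable segment_id, while pending drafts are replaced wholesale on every tick.

```python
# Hypothetical analog of the frontend dedup: confirmed keyed by Vexa's
# stable segment_id (so repeated ticks for one segment collapse to one
# entry), pending replaced per tick (drafts pass through as-is).

class TranscriptState:
    def __init__(self) -> None:
        self.confirmed: dict[str, dict] = {}
        self.pending: list[dict] = []

    def handle_message(self, msg: dict) -> None:
        if msg.get("type") != "transcript":
            return
        for seg in msg.get("confirmed", []):
            self.confirmed[seg["segment_id"]] = seg  # upsert: same id -> one entry
        self.pending = list(msg.get("pending", []))   # drafts replaced wholesale

    def render_order(self) -> list[dict]:
        # confirmed in time order, then the current pending drafts
        return sorted(self.confirmed.values(), key=lambda s: s["start_time"]) + self.pending
```

Both the REST bootstrap and WS ticks can feed `handle_message`, which is the convergence property the PR relies on.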

Test plan

Exercised locally (mocks) and on a fresh Linode VM against real Vexa Cloud.

  • Backend: upsert uses Vexa's segment_id verbatim (not a hash)
  • Backend: 2 WS ticks for the same segment_id with shifted absolute_start_time → 1 MongoDB document (the #135 bug that kicked this off)
  • Backend: StoredSegmentCreate carries segment_id, start_time, end_time, completed, language, vexa_updated_at
  • Backend: DNA WS broadcast is flat {type, speaker, confirmed, pending, playlist_id, version_id, ts} — no envelope wrapping
  • Frontend: StoredSegment interface has start_time, end_time, completed
  • Frontend: useSegments.ts imports + calls createTranscriptManager, .bootstrap(...), .handleMessage(...)
  • Frontend: 3-tick sequence through TranscriptManager — confirmed duplicates collapse to 1; pending drafts pass through
  • VM: backend /health, frontend / both 200
  • VM: /transcription/bot/.../status round-trips to Vexa Cloud with the configured key
  • VM: /ws stays open
  • VM: live WS receives the flat shape verbatim when a synthetic transcript is broadcast through the dev endpoint
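
The flat-shape assertion exercised above can be written as a small predicate; this is a sketch, with key names taken from the broadcast description, not the actual test code:

```python
# Sketch: check a WS frame is the flat passthrough shape with no
# {type, payload} envelope wrapping. Key names follow the PR description.
REQUIRED_KEYS = {"type", "speaker", "confirmed", "pending",
                 "playlist_id", "version_id", "ts"}

def is_flat_transcript_frame(frame: dict) -> bool:
    return (
        frame.get("type") == "transcript"
        and REQUIRED_KEYS <= frame.keys()
        and "payload" not in frame  # envelope field must be absent
    )
```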

Follow-ups (not in this PR)

  • backend/tests/test_transcription_service.py + test_legacy_models.py still reference the removed generate_segment_id and the old segments: [...] payload shape — they'll fail pytest until rewritten against the new contract.
  • Consider adding a compound MongoDB index on {segment_id, playlist_id, version_id} to back the upsert query.
  • SEGMENT_CREATED / SEGMENT_UPDATED EventTypes and useSegmentEvents are no longer emitted/called and can be removed in a follow-up cleanup.
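
The compound-index follow-up could look like this, assuming a pymongo-style create_index API; the helper name is illustrative:

```python
# Sketch of the proposed follow-up index; key names mirror the upsert
# query {segment_id, playlist_id, version_id}. `create_index` follows the
# pymongo signature; `collection` is whatever handle mongodb.py holds.
UPSERT_KEY = [("segment_id", 1), ("playlist_id", 1), ("version_id", 1)]

def ensure_upsert_index(collection) -> str:
    # unique: two documents can never share the same upsert key;
    # create_index is idempotent, so this is safe on every start.
    return collection.create_index(
        UPSERT_KEY, name="segments_upsert_key", unique=True
    )
```

Whether the index should also be unique depends on whether historical data could already contain duplicates; a plain index would back the query without the constraint.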

🤖 Generated with Claude Code

DmitriyG228 and others added 6 commits May 4, 2026 20:35
Closes AcademySoftwareFoundation#135.

Replaces the re-hash-and-flatten transcript pipeline with a passthrough over
Vexa's new WS contract (`{type:"transcript", confirmed, pending}`).

Backend:
- models/stored_segment.py: carry Vexa's stable `segment_id`; add
  `start_time`, `end_time`, `completed`; remove hashed `generate_segment_id`.
- transcription_service.on_transcription_updated: upsert `confirmed` by
  Vexa's `segment_id`; broadcast the flat `{type:"transcript", speaker,
  confirmed, pending, playlist_id, version_id, ts}` shape verbatim.
- transcription_providers/vexa.py::_handle_ws_message: accept the new
  `type:"transcript"` frame; forward `confirmed/pending/speaker/ts` raw.
- storage_providers/mongodb.py::upsert_segment: drop duplicate `segment_id`
  from `$setOnInsert` — it already lives in `$set` via the model dump, and
  MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
- main.py: add `/test/broadcast-transcript` endpoint gated by env var
  `DNA_TESTING_ENABLED` for end-to-end WS shape assertions (404 otherwise).

Frontend:
- Add `@vexaai/transcript-rendering@^0.4.0`; it is now the single dedup
  authority for transcript rendering.
- `@dna/core` gains `TranscriptEventPayload` + a `'transcript'` EventType;
  `DNAEventClient.handleMessage` forwards raw transcript messages without
  the `{type, payload}` envelope so `TranscriptManager.handleMessage`
  consumes them directly.
- useSegments: `createTranscriptManager<StoredSegment>()` per (playlist,
  version); REST bootstrap and WS ticks both flow through it.
- useDNAEvents: new `useTranscriptEvents` hook.
- useAISuggestion: switched off `useSegmentEvents`.
- StoredSegment: gains `start_time`, `end_time`, `completed`.

Follow-ups (out of scope for this PR):
- backend/tests/test_transcription_service.py + test_legacy_models.py still
  reference the removed `generate_segment_id` and the old `segments:[...]`
  payload shape; they will need to be rewritten against the new contract.

Signed-off-by: DmitryG228 <2280905@gmail.com>
Follow-up cleanup to the transcript passthrough:

Dead-code removal
- EventType.SEGMENT_CREATED / SEGMENT_UPDATED — never emitted after the
  passthrough lands; drop from the enum. Update /ws docstring in main.py.
- SegmentEventPayload, useSegmentEvents, subscribeToSegmentEvents — the
  frontend reads segments via the new flat `transcript` event only; drop
  the dead wrappers.

Tests
- Delete TestOnTranscriptionUpdated + TestSegmentIdGeneration classes in
  test_transcription_service.py (they referenced the old segments:[...]
  payload shape and the removed generate_segment_id).
- Update `test_forwards_transcript_updated` to use the new confirmed/pending
  payload shape.
- Update test_storage_providers.py: StoredSegmentCreate now requires
  segment_id.
- Swap SEGMENT_CREATED/UPDATED sample event types for
  TRANSCRIPTION_UPDATED/COMPLETED in test_event_publisher.py and
  test_websocket.py (sample types used to verify subscribe/publish wiring —
  behavior unchanged, just picked live EventType values).

Full pytest: 256 passing. tests-local Gate still GREEN (7/7).

Signed-off-by: DmitryG228 <2280905@gmail.com>
`upsert_segment` keys on {segment_id, playlist_id, version_id}. Without an
index, each upsert does a full collection scan, which only matters once the
collection grows — but growth is fast (refine-heavy write rate on a live
meeting), so the fix belongs in the same ship as the upsert refactor.

- mongodb.py: `ensure_indexes()` creates a unique compound index
  segments_upsert_key {segment_id, playlist_id, version_id} and a supporting
  index segments_list_by_version {playlist_id, version_id, absolute_start_time}
  for the REST bootstrap query.
- main.py: call `storage.ensure_indexes()` in the FastAPI startup hook
  (guarded by hasattr so tests that mock the provider stay happy).

Idempotent — safe to call on every container start.

Signed-off-by: DmitryG228 <2280905@gmail.com>
Mirrors the Vexa dashboard's two render-time polish cues now that
`useSegments` yields the raw Vexa shape (including `completed` per
segment and contiguous same-speaker runs via `TranscriptManager`):

- Pending segments (`completed === false`, i.e. draft ticks arriving in
  the WS `pending[]` array) render with muted color, italic, and 0.75
  opacity — same visual semantics as Vexa's `text-muted-foreground/70
  italic` in services/dashboard/src/components/transcript/transcript-segment.tsx.
- Consecutive segments from the same speaker no longer repeat the name +
  timestamp header. The first segment of a run carries the header; the
  rest pad tightly to read as one block. Mirrors `showSpeakerHeader` in
  Vexa's transcript-viewer.

Render-time only — no changes to the manager, backend, schema, or the
raw WS envelope.

Signed-off-by: DmitryG228 <2280905@gmail.com>
…pdated

Two issues reported from the live deploy:
- Sometimes only fresh WS transcripts showed after a version switch; the
  REST-loaded historical segments appeared to vanish.
- Every Vexa tick produced two WS frames (`transcription.updated` wrapped +
  `transcript` flat); nothing subscribes to the wrapped one.

useSegments — fix the race
- `manager.bootstrap(rest)` in the queryFn CLEARS confirmed + pending maps;
  any WS tick that landed during the REST fetch was wiped. Swap it for the
  additive tick path (`manager.handleMessage({type:'transcript', confirmed:
  rest, pending: []})`) so REST + WS converge on the same confirmed map
  keyed by Vexa's `segment_id`.
- Seed the manager from React Query's cache during the version-change
  effect. Without this, a revisit to a version within `staleTime` skipped
  the queryFn entirely — the cleared manager never got the REST data, so
  new WS ticks sat alone on an empty base. Now the cached REST list is
  tick-merged into the manager on mount, before WS ticks start.
- Capture the queryKey's playlist/version at queryFn start; if the user
  switched while we were fetching, return the raw REST (valid cache for
  the OLD queryKey) instead of contaminating it with the current manager
  state (which belongs to the NEW version).

Dead-code cleanup — `transcription.updated`
- `_on_vexa_event` no longer publishes `TRANSCRIPTION_UPDATED`; the flat
  `{type:"transcript", ...}` broadcast from `on_transcription_updated`
  carries the full payload and is what the frontend consumes. Nothing
  subscribed to the wrapped event.
- Trim `EventType` enum to the 3 values actually emitted: BOT_STATUS_CHANGED,
  TRANSCRIPTION_COMPLETED, TRANSCRIPTION_ERROR. Remove unused
  TRANSCRIPTION_SUBSCRIBE, TRANSCRIPTION_STARTED, TRANSCRIPTION_UPDATED,
  PLAYLIST_UPDATED, VERSION_UPDATED, DRAFT_NOTE_UPDATED.
- Trim frontend `EventType` union to match (removes `'playlist.updated'`,
  `'version.updated'`).
- `test_forwards_transcript_updated` rewritten to verify
  `on_transcription_updated` is called (the actual behaviour).
- Sample event types in test_event_publisher + test_websocket retargeted
  to live enum values so the subscribe/publish mechanism stays exercised.

Full pytest: 256 passing. Local Gate: GREEN.

Signed-off-by: DmitryG228 <2280905@gmail.com>
…iption_updated

- test_vexa_provider: replace stale `transcript.mutable` payloads with
  the new flat `{type:"transcript", confirmed, pending, speaker, ts}`
  shape; add coverage for the empty-defaults path.
- test_transcription_service: add TestOnTranscriptionUpdated covering
  the upsert + flat broadcast happy path, every early-return branch
  (missing providers / no playlist mapping / no metadata / paused /
  in_review None), the resumed_at filter (aware + naive) including
  the ValueError fall-through on bad timestamps, missing-required-field
  skips, top-level-speaker fallback, and upsert exception swallowing.
- Bumps coverage on transcription_service.py from 67% to 92%, restoring
  the suite above the 90% gate. Black-formatted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: DmitryG228 <2280905@gmail.com>
@DmitriyG228 DmitriyG228 force-pushed the transcript-passthrough-135 branch from d1b1167 to 09c157d on May 4, 2026 17:35
existing = await self.segments_collection.find_one(query)
is_new = existing is None

# `segment_id` is already in `data.model_dump()` — MongoDB rejects an

Comment seems unneeded


class StoredSegmentCreate(BaseModel):
"""Model for creating a stored segment."""
"""Model for creating/upserting a stored segment (raw Vexa passthrough)."""

Although Vexa is a first class dependency now, I would like to try to keep this part of the codebase generalized as much as we can. Nit picky, but change where we say "Vexa" to "transcriptProvider".

absolute_end_time: str = Field(
..., description="UTC timestamp (ISO 8601) of segment end"
)
vexa_updated_at: Optional[str] = Field(

Same here


@jspada200 jspada200 left a comment


This is looking and working great! Small thing with the variable naming on the model in Mongo, but besides that this is good to merge. Once fixed, feel free to merge this in.

@jspada200 jspada200 merged commit 26f1486 into AcademySoftwareFoundation:main on May 5, 2026
4 checks passed