Refactor transcript pipeline: direct Vexa WS passthrough + frontend TranscriptManager#139
Merged
jspada200 merged 6 commits into AcademySoftwareFoundation:main on May 5, 2026
Conversation
This was referenced Apr 20, 2026
Closes AcademySoftwareFoundation#135. Replaces the re-hash-and-flatten transcript pipeline with a passthrough over Vexa's new WS contract (`{type:"transcript", confirmed, pending}`).

Backend:
- models/stored_segment.py: carry Vexa's stable `segment_id`; add `start_time`, `end_time`, `completed`; remove hashed `generate_segment_id`.
- transcription_service.on_transcription_updated: upsert `confirmed` by Vexa's `segment_id`; broadcast the flat `{type:"transcript", speaker, confirmed, pending, playlist_id, version_id, ts}` shape verbatim.
- transcription_providers/vexa.py::_handle_ws_message: accept the new `type:"transcript"` frame; forward `confirmed`/`pending`/`speaker`/`ts` raw.
- storage_providers/mongodb.py::upsert_segment: drop the duplicate `segment_id` from `$setOnInsert` — it already lives in `$set` via the model dump, and MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
- main.py: add a `/test/broadcast-transcript` endpoint gated by the env var `DNA_TESTING_ENABLED` for end-to-end WS shape assertions (404 otherwise).

Frontend:
- Add `@vexaai/transcript-rendering@^0.4.0`; it is now the single dedup authority for transcript rendering.
- `@dna/core` gains `TranscriptEventPayload` plus a `'transcript'` EventType; `DNAEventClient.handleMessage` forwards raw transcript messages without the `{type, payload}` envelope so `TranscriptManager.handleMessage` consumes them directly.
- useSegments: `createTranscriptManager<StoredSegment>()` per (playlist, version); REST bootstrap and WS ticks both flow through it.
- useDNAEvents: new `useTranscriptEvents` hook.
- useAISuggestion: switched off `useSegmentEvents`.
- StoredSegment: gains `start_time`, `end_time`, `completed`.

Follow-ups (out of scope for this PR):
- backend/tests/test_transcription_service.py + test_legacy_models.py still reference the removed `generate_segment_id` and the old `segments:[...]` payload shape; they will need to be rewritten against the new contract.
Signed-off-by: DmitryG228 <2280905@gmail.com>
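The ConflictingUpdateOperators fix above comes down to one invariant: an update document must not name the same field path in both `$set` and `$setOnInsert`. A minimal sketch of that invariant, without a live MongoDB — the `build_upsert_update` helper and its field names are illustrative, not the PR's actual code:

```python
def build_upsert_update(data: dict, now: str) -> dict:
    """Build a MongoDB upsert update document.

    MongoDB raises ConflictingUpdateOperators when the same field path
    appears in both $set and $setOnInsert, so insert-only fields must be
    disjoint from the model-dump fields that flow into $set.
    """
    update = {
        "$set": {**data, "updated_at": now},
        # Insert-only bookkeeping: deliberately excludes segment_id,
        # which already arrives in $set via the model dump.
        "$setOnInsert": {"created_at": now},
    }
    overlap = set(update["$set"]) & set(update["$setOnInsert"])
    if overlap:
        raise ValueError(f"conflicting update operators on: {overlap}")
    return update


segment = {"segment_id": "vx-42", "text": "hello", "completed": True}
update = build_upsert_update(segment, "2026-05-05T00:00:00Z")
assert "segment_id" in update["$set"]
assert "segment_id" not in update["$setOnInsert"]
```

Guarding the invariant in Python is optional — MongoDB enforces it server-side — but it makes the failure mode visible in unit tests instead of at upsert time.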
Follow-up cleanup to the transcript passthrough:

Dead-code removal
- EventType.SEGMENT_CREATED / SEGMENT_UPDATED — never emitted after the passthrough lands; drop from the enum. Update the /ws docstring in main.py.
- SegmentEventPayload, useSegmentEvents, subscribeToSegmentEvents — the frontend reads segments via the new flat `transcript` event only; drop the dead wrappers.

Tests
- Delete the TestOnTranscriptionUpdated + TestSegmentIdGeneration classes in test_transcription_service.py (they referenced the old `segments:[...]` payload shape and the removed `generate_segment_id`).
- Update `test_forwards_transcript_updated` to use the new confirmed/pending payload shape.
- Update test_storage_providers.py: StoredSegmentCreate now requires `segment_id`.
- Swap SEGMENT_CREATED/UPDATED sample event types for TRANSCRIPTION_UPDATED/COMPLETED in test_event_publisher.py and test_websocket.py (sample types used to verify subscribe/publish wiring — behavior unchanged, just picked live EventType values).

Full pytest: 256 passing. tests-local Gate still GREEN (7/7).

Signed-off-by: DmitryG228 <2280905@gmail.com>
`upsert_segment` keys on {segment_id, playlist_id, version_id}. Without an
index, each upsert does a full collection scan, which only matters once the
collection grows — but growth is fast (refine-heavy write rate on a live
meeting), so the fix belongs in the same ship as the upsert refactor.
- mongodb.py: `ensure_indexes()` creates a unique compound index
segments_upsert_key {segment_id, playlist_id, version_id} and a supporting
index segments_list_by_version {playlist_id, version_id, absolute_start_time}
for the REST bootstrap query.
- main.py: call `storage.ensure_indexes()` in the FastAPI startup hook
(guarded by hasattr so tests that mock the provider stay happy).
Idempotent — safe to call on every container start.
Signed-off-by: DmitryG228 <2280905@gmail.com>
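The two indexes above can be sketched as data plus an idempotent creation loop. This is a hedged sketch of what `ensure_indexes()` plausibly looks like with Motor/PyMongo — the index names and key fields follow the commit message, but the helper shape is an assumption, not the PR's code:

```python
# Index specs as (name, keys, unique). PyMongo/Motor's create_index
# accepts key lists in this (field, direction) form; 1 = ascending.
SEGMENT_INDEXES = [
    ("segments_upsert_key",
     [("segment_id", 1), ("playlist_id", 1), ("version_id", 1)], True),
    ("segments_list_by_version",
     [("playlist_id", 1), ("version_id", 1), ("absolute_start_time", 1)], False),
]


async def ensure_indexes(collection) -> None:
    """Idempotent: create_index is a no-op when an identical index
    already exists, so this is safe on every container start."""
    for name, keys, unique in SEGMENT_INDEXES:
        await collection.create_index(keys, name=name, unique=unique)
```

Note the unique flag on `segments_upsert_key` does double duty: it speeds up the upsert filter and hard-guarantees one document per (segment, playlist, version) even under concurrent writers.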
Mirrors the Vexa dashboard's two render-time polish cues now that `useSegments` yields the raw Vexa shape (including `completed` per segment and contiguous same-speaker runs via `TranscriptManager`):
- Pending segments (`completed === false`, i.e. draft ticks arriving in the WS `pending[]` array) render with muted color, italic, and 0.75 opacity — the same visual semantics as Vexa's `text-muted-foreground/70 italic` in services/dashboard/src/components/transcript/transcript-segment.tsx.
- Consecutive segments from the same speaker no longer repeat the name + timestamp header. The first segment of a run carries the header; the rest pad tightly to read as one block. Mirrors `showSpeakerHeader` in Vexa's transcript-viewer.

Render-time only — no changes to the manager, backend, schema, or the raw WS envelope.

Signed-off-by: DmitryG228 <2280905@gmail.com>
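The header-collapsing rule is simple enough to state outside React: a segment shows its speaker/timestamp header only when its speaker differs from the previous segment's. A language-agnostic sketch in Python (field names assumed; the real logic lives in the TSX component):

```python
def with_header_flags(segments: list[dict]) -> list[tuple[bool, dict]]:
    """Pair each segment with show_header: True only for the first
    segment of a contiguous same-speaker run (the showSpeakerHeader
    decision). Later segments in the run render headerless."""
    out = []
    prev_speaker = None
    for seg in segments:
        out.append((seg["speaker"] != prev_speaker, seg))
        prev_speaker = seg["speaker"]
    return out


segs = [
    {"speaker": "alice", "text": "hi", "completed": True},
    {"speaker": "alice", "text": "there", "completed": False},  # pending: muted/italic
    {"speaker": "bob", "text": "hey", "completed": True},
]
assert [show for show, _ in with_header_flags(segs)] == [True, False, True]
```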
…pdated
Two issues reported from the live deploy:
- Sometimes only fresh WS transcripts showed after a version switch; the
REST-loaded historical segments appeared to vanish.
- Every Vexa tick produced two WS frames (`transcription.updated` wrapped +
`transcript` flat); nothing subscribes to the wrapped one.
useSegments — fix the race
- `manager.bootstrap(rest)` in the queryFn CLEARS confirmed + pending maps;
any WS tick that landed during the REST fetch was wiped. Swap it for the
additive tick path (`manager.handleMessage({type:'transcript', confirmed:
rest, pending: []})`) so REST + WS converge on the same confirmed map
keyed by Vexa's `segment_id`.
- Seed the manager from React Query's cache during the version-change
effect. Without this, a revisit to a version within `staleTime` skipped
the queryFn entirely — the cleared manager never got the REST data, so
new WS ticks sat alone on an empty base. Now the cached REST list is
tick-merged into the manager on mount, before WS ticks start.
- Capture the queryKey's playlist/version at queryFn start; if the user
switched while we were fetching, return the raw REST (valid cache for
the OLD queryKey) instead of contaminating it with the current manager
state (which belongs to the NEW version).
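The bootstrap-vs-tick race can be modeled with a toy manager: bootstrap replaces the confirmed map wholesale, while a tick merges into it keyed by `segment_id`. A Python sketch of the semantics only — the real `TranscriptManager` lives in `@vexaai/transcript-rendering`, and this class is illustrative:

```python
class ToyTranscriptManager:
    """Shows why bootstrap() loses concurrent WS ticks while the
    additive handle_message() path does not."""

    def __init__(self) -> None:
        self.confirmed: dict[str, dict] = {}

    def bootstrap(self, segments: list[dict]) -> None:
        # Destructive: clears everything, then loads the REST list.
        self.confirmed = {s["segment_id"]: s for s in segments}

    def handle_message(self, msg: dict) -> None:
        # Additive: upsert by segment_id, preserving prior entries.
        for s in msg.get("confirmed", []):
            self.confirmed[s["segment_id"]] = s


ws_tick = {"type": "transcript",
           "confirmed": [{"segment_id": "b", "text": "new"}], "pending": []}
rest = [{"segment_id": "a", "text": "old"}]

# Bug: a WS tick lands during the REST fetch, then bootstrap wipes it.
m = ToyTranscriptManager()
m.handle_message(ws_tick)
m.bootstrap(rest)
assert set(m.confirmed) == {"a"}  # "b" was lost

# Fix: feed the REST list through the same additive tick path.
m = ToyTranscriptManager()
m.handle_message(ws_tick)
m.handle_message({"type": "transcript", "confirmed": rest, "pending": []})
assert set(m.confirmed) == {"a", "b"}
```

Because both paths key on Vexa's stable `segment_id`, replaying the REST list through the tick path is idempotent: a segment present in both sources collapses to one entry instead of clobbering the other.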
Dead-code cleanup — `transcription.updated`
- `_on_vexa_event` no longer publishes `TRANSCRIPTION_UPDATED`; the flat
`{type:"transcript", ...}` broadcast from `on_transcription_updated`
carries the full payload and is what the frontend consumes. Nothing
subscribed to the wrapped event.
- Trim `EventType` enum to the 3 values actually emitted: BOT_STATUS_CHANGED,
TRANSCRIPTION_COMPLETED, TRANSCRIPTION_ERROR. Remove unused
TRANSCRIPTION_SUBSCRIBE, TRANSCRIPTION_STARTED, TRANSCRIPTION_UPDATED,
PLAYLIST_UPDATED, VERSION_UPDATED, DRAFT_NOTE_UPDATED.
- Trim frontend `EventType` union to match (removes `'playlist.updated'`,
`'version.updated'`).
- `test_forwards_transcript_updated` rewritten to verify
`on_transcription_updated` is called (the actual behaviour).
- Sample event types in test_event_publisher + test_websocket retargeted
to live enum values so the subscribe/publish mechanism stays exercised.
Full pytest: 256 passing. Local Gate: GREEN.
Signed-off-by: DmitryG228 <2280905@gmail.com>
…iption_updated
- test_vexa_provider: replace stale `transcript.mutable` payloads with
the new flat `{type:"transcript", confirmed, pending, speaker, ts}`
shape; add coverage for the empty-defaults path.
- test_transcription_service: add TestOnTranscriptionUpdated covering
the upsert + flat broadcast happy path, every early-return branch
(missing providers / no playlist mapping / no metadata / paused /
in_review None), the resumed_at filter (aware + naive) including
the ValueError fall-through on bad timestamps, missing-required-field
skips, top-level-speaker fallback, and upsert exception swallowing.
- Bumps coverage on transcription_service.py from 67% to 92%, restoring
the suite above the 90% gate. Black-formatted.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: DmitryG228 <2280905@gmail.com>
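The resumed_at filter described above (aware vs naive timestamps, ValueError fall-through) hinges on normalizing both sides before comparing. A hedged sketch — the function name and keep-on-failure behavior are inferred from the test description, not copied from the service:

```python
from datetime import datetime, timezone


def is_before_resume(segment_ts: str, resumed_at: datetime) -> bool:
    """True if the segment predates the resume point and should be skipped.

    Naive datetimes are treated as UTC; an unparseable timestamp falls
    through to False (keep the segment) rather than raising.
    """
    try:
        ts = datetime.fromisoformat(segment_ts)
    except ValueError:
        return False  # bad timestamp: don't drop the segment
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)
    if resumed_at.tzinfo is None:
        resumed_at = resumed_at.replace(tzinfo=timezone.utc)
    return ts < resumed_at


resume = datetime(2026, 5, 5, 12, 0, tzinfo=timezone.utc)
assert is_before_resume("2026-05-05T11:59:00+00:00", resume) is True
assert is_before_resume("2026-05-05T12:01:00", resume) is False  # naive, treated as UTC
assert is_before_resume("not-a-timestamp", resume) is False      # ValueError fall-through
```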
Force-pushed from d1b1167 to 09c157d
jspada200 reviewed May 5, 2026
existing = await self.segments_collection.find_one(query)
is_new = existing is None

# `segment_id` is already in `data.model_dump()` — MongoDB rejects an
jspada200 reviewed May 5, 2026
class StoredSegmentCreate(BaseModel):
-    """Model for creating a stored segment."""
+    """Model for creating/upserting a stored segment (raw Vexa passthrough)."""
jspada200 (Collaborator) commented:
Although Vexa is a first-class dependency now, I would like to keep this part of the codebase as generalized as we can. Nitpicky, but change where we say "Vexa" to "transcriptProvider".
jspada200 reviewed May 5, 2026
absolute_end_time: str = Field(
    ..., description="UTC timestamp (ISO 8601) of segment end"
)
vexa_updated_at: Optional[str] = Field(
jspada200 approved these changes May 5, 2026
jspada200 (Collaborator) left a comment:
This is looking and working great! Small thing with the variable naming on the model in Mongo, but besides that this is good to merge! Once fixed, feel free to merge this in.
Summary
Closes #135.
Replaces the re-hash-and-flatten transcript pipeline with a passthrough over Vexa's new WS contract (`{type:"transcript", confirmed, pending}`). Backend preserves Vexa's stable `segment_id` all the way to MongoDB; frontend delegates dedup to `@vexaai/transcript-rendering`'s `TranscriptManager`.

Backend
- `models/stored_segment.py` — carry Vexa's `segment_id`; add `start_time`, `end_time`, `completed`; remove the hashed `generate_segment_id`.
- `transcription_service.on_transcription_updated` — upsert `confirmed` segments by Vexa's `segment_id`; broadcast the flat `{type:"transcript", speaker, confirmed, pending, playlist_id, version_id, ts}` shape verbatim.
- `transcription_providers/vexa.py::_handle_ws_message` — accept the new `type:"transcript"` frame; forward `confirmed`/`pending`/`speaker`/`ts` raw.
- `storage_providers/mongodb.py::upsert_segment` — drop the duplicate `segment_id` from `$setOnInsert` (it already lives in `$set` via the model dump); MongoDB rejects the same field in both operators (ConflictingUpdateOperators).
- `main.py` — new dev-only endpoint `POST /test/broadcast-transcript`, 404 unless `DNA_TESTING_ENABLED=true`. Used by end-to-end WS shape assertions; must never be enabled in production.

Frontend
- `@vexaai/transcript-rendering@^0.4.0` — the single dedup authority for rendering.
- `@dna/core` gains `TranscriptEventPayload` and a `'transcript'` EventType; `DNAEventClient.handleMessage` forwards raw transcript messages without the `{type, payload}` envelope so `TranscriptManager.handleMessage` consumes them directly.
- `useSegments` — `createTranscriptManager<StoredSegment>()` per `(playlist, version)`; REST bootstrap and WS ticks both flow through it.
- `useDNAEvents` — new `useTranscriptEvents` hook.
- `useAISuggestion` — switched off `useSegmentEvents` onto `useTranscriptEvents`.
- `StoredSegment` interface — gains `start_time`, `end_time`, `completed`.

Test plan
Exercised locally (mocks) and on a fresh Linode VM against real Vexa Cloud.
- `segment_id` stored verbatim (not a hash)
- Same `segment_id` with shifted `absolute_start_time` → 1 MongoDB document (the bug in #135 that kicked this off)
- `StoredSegmentCreate` carries `segment_id`, `start_time`, `end_time`, `completed`, `language`, `vexa_updated_at`
- `{type, speaker, confirmed, pending, playlist_id, version_id, ts}` — no envelope wrapping
- `StoredSegment` interface has `start_time`, `end_time`, `completed`
- `useSegments.ts` imports + calls `createTranscriptManager`, `.bootstrap(...)`, `.handleMessage(...)`
- `TranscriptManager` — confirmed duplicates collapse to 1; pending drafts pass through
- `/health` and the frontend `/` both return 200
- `/transcription/bot/.../status` round-trips to Vexa Cloud with the configured key
- `/ws` stays open
- `backend/tests/test_transcription_service.py` + `test_legacy_models.py` still reference the removed `generate_segment_id` and the old `segments: [...]` payload shape — they'll fail `pytest` until rewritten against the new contract.
- Add a compound index on `{segment_id, playlist_id, version_id}` to back the upsert query.
- `SEGMENT_CREATED`/`SEGMENT_UPDATED` EventTypes and `useSegmentEvents` are no longer emitted/called and can be removed in a follow-up cleanup.

🤖 Generated with Claude Code