Refactor transcript pipeline: direct Vexa WS passthrough + frontend TranscriptManager #139
Merged: jspada200 merged 6 commits into AcademySoftwareFoundation:main from DmitriyG228:transcript-passthrough-135 (May 5, 2026)
Commits (all by DmitriyG228):
708d63d Refactor transcript pipeline: Vexa WS passthrough + TranscriptManager
3b3409e Remove dead segment-event surface + fix/update tests
3131bf5 Add compound unique MongoDB index on segments upsert key
688c3a6 TranscriptPanel: render pending segments subtle + dedup speaker labels
b68cbcb Fix WS-vs-REST race on version switch + remove legacy transcription.u…
09c157d Fix CI: update WS tests to new transcript contract + cover on_transcr…
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,17 +1,16 @@ | ||
| """Event type definitions.""" | ||
| """Event type definitions. | ||
|
|
||
| Only events that are actually emitted AND consumed live here. The flat | ||
| `transcript` frame doesn't go through this enum — it's broadcast directly | ||
| via `EventPublisher.ws_manager.broadcast(...)` because its envelope is | ||
| shaped by the Vexa contract, not by the `{type, payload}` wrapper this | ||
| enum drives. | ||
| """ | ||
|
|
||
| from enum import Enum | ||
|
|
||
|
|
||
| class EventType(str, Enum): | ||
| TRANSCRIPTION_SUBSCRIBE = "transcription.subscribe" | ||
| TRANSCRIPTION_STARTED = "transcription.started" | ||
| TRANSCRIPTION_UPDATED = "transcription.updated" | ||
| TRANSCRIPTION_COMPLETED = "transcription.completed" | ||
| TRANSCRIPTION_ERROR = "transcription.error" | ||
| SEGMENT_CREATED = "segment.created" | ||
| SEGMENT_UPDATED = "segment.updated" | ||
| BOT_STATUS_CHANGED = "bot.status_changed" | ||
| PLAYLIST_UPDATED = "playlist.updated" | ||
| VERSION_UPDATED = "version.updated" | ||
| DRAFT_NOTE_UPDATED = "draft_note.updated" |
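The docstring in this hunk distinguishes two WebSocket envelope shapes: enum-driven events wrapped as `{type, payload}`, and the flat `transcript` frame that bypasses the enum. A minimal sketch of the difference follows. The helper names `wrap_event` and `transcript_frame` are hypothetical, the field layout of the flat frame is an assumption (the Vexa contract itself is not shown in this diff), and the single enum member is copied from the listing above without knowing whether it survives the PR.

```python
import json
from enum import Enum
from typing import Any


class EventType(str, Enum):
    # One member copied from the listing above, for illustration only.
    BOT_STATUS_CHANGED = "bot.status_changed"


def wrap_event(event_type: EventType, payload: dict[str, Any]) -> str:
    """Enum-driven events travel inside a {type, payload} wrapper."""
    return json.dumps({"type": event_type.value, "payload": payload})


def transcript_frame(segments: list[dict[str, Any]]) -> str:
    """Flat transcript frame: no payload wrapper.

    The "type" and "segments" field names are assumptions, not the
    documented Vexa contract.
    """
    return json.dumps({"type": "transcript", "segments": segments})


wrapped = json.loads(wrap_event(EventType.BOT_STATUS_CHANGED, {"status": "active"}))
flat = json.loads(
    transcript_frame([{"segment_id": "9b914779:speaker-1:72", "text": "hi"}])
)
```

A consumer can tell the two apart by the presence of a `payload` key, which is one plausible reason the flat frame is kept out of the enum.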
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,45 +1,45 @@ | ||
| """Stored Segment Models. | ||
|
|
||
| Pydantic models for transcription segments stored in MongoDB. | ||
|
|
||
| Backend operates as a passthrough for Vexa's transcript stream: | ||
| - `segment_id` is Vexa's stable id (e.g. "9b914779:speaker-1:72"), not a hash. | ||
| - Upsert key in MongoDB is `{segment_id, playlist_id, version_id}`. | ||
| - All Vexa fields (start_time, end_time, completed, language, ...) are preserved. | ||
| """ | ||
|
|
||
| import hashlib | ||
| from datetime import datetime, timezone | ||
| from typing import Optional | ||
|
|
||
| from pydantic import BaseModel, ConfigDict, Field | ||
|
|
||
|
|
||
| def generate_segment_id( | ||
| playlist_id: int, | ||
| version_id: int, | ||
| absolute_start_time: str, | ||
| ) -> str: | ||
| """Generate a unique segment ID based on version and start time. | ||
|
|
||
| Note: Speaker is intentionally excluded from the key because Vexa's mutable | ||
| transcription can reassign speakers as it refines the transcript. Using only | ||
| the start time ensures updates to the same moment are treated as updates | ||
| rather than new segments. | ||
| """ | ||
| key = f"{playlist_id}:{version_id}:{absolute_start_time}" | ||
| return hashlib.sha256(key.encode()).hexdigest()[:16] | ||
|
|
||
|
|
||
| class StoredSegmentCreate(BaseModel): | ||
| """Model for creating a stored segment.""" | ||
| """Model for creating/upserting a stored segment (raw Vexa passthrough).""" | ||
|
|
||
| segment_id: str = Field( | ||
| ..., description="Vexa's stable segment id (e.g. '9b914779:speaker-1:72')" | ||
| ) | ||
| text: str = Field(..., description="Transcript text content") | ||
| speaker: Optional[str] = Field(default=None, description="Speaker identifier") | ||
| language: Optional[str] = Field(default=None, description="Language code") | ||
| start_time: Optional[float] = Field( | ||
| default=None, description="Relative start time in seconds" | ||
| ) | ||
| end_time: Optional[float] = Field( | ||
| default=None, description="Relative end time in seconds" | ||
| ) | ||
| completed: Optional[bool] = Field( | ||
| default=True, description="Whether the segment is confirmed (vs draft)" | ||
| ) | ||
| absolute_start_time: str = Field( | ||
| ..., description="UTC timestamp (ISO 8601) of segment start" | ||
| ) | ||
| absolute_end_time: str = Field( | ||
| ..., description="UTC timestamp (ISO 8601) of segment end" | ||
| ) | ||
| vexa_updated_at: Optional[str] = Field( | ||
|
Collaborator (review comment): Same here |
||
| default=None, description="Vexa's updated_at timestamp for deduplication" | ||
| default=None, description="Vexa's updated_at timestamp" | ||
| ) | ||
|
|
||
|
|
||
|
|
@@ -49,12 +49,15 @@ class StoredSegment(BaseModel): | |
| model_config = ConfigDict(populate_by_name=True) | ||
|
|
||
| id: str = Field(alias="_id") | ||
| segment_id: str = Field(..., description="Unique segment ID") | ||
| segment_id: str | ||
| playlist_id: int | ||
| version_id: int | ||
| text: str | ||
| speaker: Optional[str] = None | ||
| language: Optional[str] = None | ||
| start_time: Optional[float] = None | ||
| end_time: Optional[float] = None | ||
| completed: Optional[bool] = True | ||
| absolute_start_time: str | ||
| absolute_end_time: str | ||
| vexa_updated_at: Optional[str] = None | ||
|
|
||
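The upsert rule from the module docstring (key `{segment_id, playlist_id, version_id}`) can be modeled without a database. The sketch below uses a plain dict as a hypothetical stand-in for the `segments` collection; `InMemorySegments` and its methods are illustrative and not part of the PR. It assumes the transcript provider reuses the same `segment_id` when it refines a segment.

```python
from typing import Any, Optional

# (segment_id, playlist_id, version_id), mirroring the upsert key in the docstring.
UpsertKey = tuple[str, int, int]


class InMemorySegments:
    """Hypothetical dict-backed stand-in for the `segments` collection."""

    def __init__(self) -> None:
        self._docs: dict[UpsertKey, dict[str, Any]] = {}

    def upsert(
        self,
        segment_id: str,
        playlist_id: int,
        version_id: int,
        fields: dict[str, Any],
    ) -> bool:
        """Insert or update one segment; return True if it was newly created."""
        key = (segment_id, playlist_id, version_id)
        is_new = key not in self._docs
        doc = self._docs.setdefault(
            key,
            {
                "segment_id": segment_id,
                "playlist_id": playlist_id,
                "version_id": version_id,
            },
        )
        doc.update(fields)  # later refinements overwrite earlier drafts
        return is_new

    def get(
        self, segment_id: str, playlist_id: int, version_id: int
    ) -> Optional[dict[str, Any]]:
        return self._docs.get((segment_id, playlist_id, version_id))

    def __len__(self) -> int:
        return len(self._docs)


store = InMemorySegments()
sid = "9b914779:speaker-1:72"
created = store.upsert(sid, 1, 7, {"text": "hello wor", "completed": False})
refined = store.upsert(sid, 1, 7, {"text": "hello world", "completed": True})
other_version = store.upsert(sid, 1, 8, {"text": "hello world", "completed": True})
```

Under that assumption, the draft and its confirmed refinement land on the same key, while the same `segment_id` under a different `version_id` is stored separately.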
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,28 @@ class MongoDBStorageProvider(StorageProviderBase): | |
|
|
||
| def __init__(self) -> None: | ||
| self._client: Optional[AsyncMongoClient[Any]] = None | ||
| self._indexes_ensured = False | ||
|
|
||
| async def ensure_indexes(self) -> None: | ||
| """Create collection indexes. Idempotent; safe to call on every startup. | ||
|
|
||
| The compound unique index on the `segments` upsert key makes | ||
| `upsert_segment` O(log n) instead of a full-collection scan — at | ||
| Vexa's refine-heavy write rate, scans become user-visible at ~100k | ||
| segments and timeouts at ~1M. | ||
| """ | ||
| if self._indexes_ensured: | ||
| return | ||
| await self.segments_collection.create_index( | ||
| [("segment_id", 1), ("playlist_id", 1), ("version_id", 1)], | ||
| unique=True, | ||
| name="segments_upsert_key", | ||
| ) | ||
| await self.segments_collection.create_index( | ||
| [("playlist_id", 1), ("version_id", 1), ("absolute_start_time", 1)], | ||
| name="segments_list_by_version", | ||
| ) | ||
| self._indexes_ensured = True | ||
|
|
||
| @property | ||
| def client(self) -> AsyncMongoClient[Any]: | ||
|
|
@@ -239,14 +261,17 @@ async def upsert_segment( | |
| existing = await self.segments_collection.find_one(query) | ||
| is_new = existing is None | ||
|
|
||
| # `segment_id` is already in `data.model_dump()` — MongoDB rejects an | ||
|
Collaborator (review comment): Comment seems unneeded |
||
| # update that lists the same field in both `$set` and `$setOnInsert`. | ||
| # `playlist_id`/`version_id` stay in `$setOnInsert` because they aren't | ||
| # part of `StoredSegmentCreate` (they come from the enclosing context). | ||
| update: dict[str, Any] = { | ||
| "$set": { | ||
| **data.model_dump(), | ||
| "updated_at": now, | ||
| }, | ||
| "$setOnInsert": { | ||
| "created_at": now, | ||
| "segment_id": segment_id, | ||
| "playlist_id": playlist_id, | ||
| "version_id": version_id, | ||
| }, | ||
|
|
||
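The code comment in this hunk relies on a MongoDB rule: a single update may not list the same field under both `$set` and `$setOnInsert`. Because the scraped diff has lost its `+`/`-` markers, the sketch below spells out the shape the comment describes, with `segment_id` supplied only through the dumped model fields. `build_upsert_update` is a hypothetical helper, not code from the PR, and the overlap check is an added safeguard for illustration.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_upsert_update(
    segment_fields: dict[str, Any],
    playlist_id: int,
    version_id: int,
    now: Optional[datetime] = None,
) -> dict[str, Any]:
    """Build a MongoDB-style update document for a segment upsert.

    `segment_fields` plays the role of `data.model_dump()` and already
    carries `segment_id`, so that key must not be repeated in `$setOnInsert`.
    """
    now = now or datetime.now(timezone.utc)
    update: dict[str, Any] = {
        "$set": {**segment_fields, "updated_at": now},
        "$setOnInsert": {
            "created_at": now,
            "playlist_id": playlist_id,
            "version_id": version_id,
        },
    }
    # MongoDB rejects an update whose operators target the same path.
    overlap = update["$set"].keys() & update["$setOnInsert"].keys()
    if overlap:
        raise ValueError(f"fields in both $set and $setOnInsert: {sorted(overlap)}")
    return update


update = build_upsert_update(
    {"segment_id": "9b914779:speaker-1:72", "text": "hello", "completed": True},
    playlist_id=1,
    version_id=7,
)
```

Failing early in Python keeps the conflict out of the database round trip; whether that guard is worth carrying in production code is a design choice left to the maintainers.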
Although Vexa is a first-class dependency now, I would like to keep this part of the codebase as generalized as we can. Nitpicky, but please change "Vexa" to "transcriptProvider" wherever we say it.