# support timed transcripts from tts #2580
The PR adds a changeset marking patch releases for the affected packages:

```md
---
"livekit-agents": patch
"livekit-plugins-cartesia": patch
"livekit-plugins-elevenlabs": patch
---

support aligned transcripts with timestamps from tts (#2580)
```
A new example file demonstrates consuming the timed transcript:

```python
import asyncio
import logging
from collections.abc import AsyncGenerator, AsyncIterable

from dotenv import load_dotenv

from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.agents.voice.agent import ModelSettings
from livekit.agents.voice.io import TimedString
from livekit.plugins import cartesia, deepgram, openai, silero

logger = logging.getLogger("my-worker")
logger.setLevel(logging.INFO)

load_dotenv()


# This example shows how to obtain the timed transcript from the TTS.
# It is currently supported for Cartesia and ElevenLabs TTS (word-level timestamps)
# and for non-streaming TTS wrapped in a StreamAdapter (sentence-level timestamps).


class MyAgent(Agent):
    def __init__(self):
        super().__init__(instructions="You are a helpful assistant.")

        self._closing_task: asyncio.Task[None] | None = None

    async def transcription_node(
        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncGenerator[str | TimedString, None]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                logger.info(f"TimedString: '{chunk}' ({chunk.start_time} - {chunk.end_time})")
            yield chunk


async def entrypoint(ctx: JobContext):
    session = AgentSession(
        stt=deepgram.STT(),
        llm=openai.LLM(),
        tts=cartesia.TTS(),
        vad=silero.VAD.load(),
        # enable the TTS-aligned transcript; this can also be configured at the Agent level
        use_tts_aligned_transcript=True,
    )

    await session.start(agent=MyAgent(), room=ctx.room)

    session.generate_reply(instructions="say hello to the user")


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```
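Beyond logging, the same `transcription_node` hook can collect the timed words for later use, such as rendering captions. A minimal sketch building on the example above; the `CaptioningAgent` name and `captions` list are illustrative, and only the `TimedString` fields shown in the example (`start_time`, `end_time`) are assumed:

```python
from collections.abc import AsyncGenerator, AsyncIterable

from livekit.agents import Agent
from livekit.agents.voice.agent import ModelSettings
from livekit.agents.voice.io import TimedString


class CaptioningAgent(Agent):
    """Illustrative agent that records (text, start, end) tuples as they stream in."""

    def __init__(self):
        super().__init__(instructions="You are a helpful assistant.")
        self.captions: list[tuple[str, float, float]] = []

    async def transcription_node(
        self, text: AsyncIterable[str | TimedString], model_settings: ModelSettings
    ) -> AsyncGenerator[str | TimedString, None]:
        async for chunk in text:
            if isinstance(chunk, TimedString):
                # keep the aligned word/sentence with its timing for later rendering
                self.captions.append((str(chunk), chunk.start_time, chunk.end_time))
            # always forward the chunk so downstream transcription keeps working
            yield chunk
```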
In `StreamAdapter`, the capability flag and the sentence tokenizer are updated:

```diff
@@ -29,14 +29,14 @@ def __init__(
         sentence_tokenizer: NotGivenOr[tokenize.SentenceTokenizer] = NOT_GIVEN,
     ) -> None:
         super().__init__(
-            capabilities=TTSCapabilities(
-                streaming=True,
-            ),
+            capabilities=TTSCapabilities(streaming=True, aligned_transcript=True),
             sample_rate=tts.sample_rate,
             num_channels=tts.num_channels,
         )
         self._wrapped_tts = tts
-        self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer()
+        self._sentence_tokenizer = sentence_tokenizer or tokenize.blingfire.SentenceTokenizer(
+            retain_format=True
+        )

         @self._wrapped_tts.on("metrics_collected")
         def _forward_metrics(*args: Any, **kwargs: Any) -> None:
```

Review thread on the `retain_format` change:

> **Member:** Oh, actually we were not using `retain_format` for the StreamAdapter before, since it is only used to generate a sentence. In the PR I did, I was keeping the `basic.SentenceTokenizer` inside the transcription synchronization code.
>
> **Author:** It was used in
>
> **Author:** Or maybe I added it in this PR; we need to format if we use the timed transcript from the StreamAdapter.
>
> **Member:** Ah, OK, so the synchronizer also needs the exact same formatting?
>
> **Author:** No, it can be different. They process the sentences separately.
>
> **Member:** I see, but I thought the aligned transcripts returned by the TTSs did not include newlines/special characters, so I assumed `retain_format` was not needed.
>
> **Member:** When using the StreamAdapter with OpenAI, the `transcription_node` input is coming from the `llm_node`, right?
>
> **Author:** If `use_tts_aligned_transcript` is enabled, the input of the `transcription_node` comes from the TTS.
>
> **Author:** What do you mean by "we shouldn't wait for the TTS when using the StreamAdapter"?
>
> **Member:** OK, that makes sense. So by default, even if we use the StreamAdapter, it'll use the LLM output for the `transcription_node`.
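For context on the thread above, here is a self-contained illustration (plain `re`, not the actual tokenizer code) of why stripping inter-sentence whitespace breaks re-assembly of the original text, which is presumably what `retain_format=True` avoids:

```python
import re

TEXT = "Hello!\n\nHow are you?"

# stripped tokens lose the blank line between sentences
stripped = [s.strip() for s in re.split(r"(?<=[.!?])\s+", TEXT)]
assert "".join(stripped) == "Hello!How are you?"  # formatting lost

# format-retaining tokens keep the whitespace with each sentence,
# so concatenating them reproduces the input exactly
retained = re.split(r"(?<=[.!?])(?=\s)", TEXT)
assert "".join(retained) == TEXT
```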
In the synthesis task, a sentence-level `TimedString` is emitted before each sentence is synthesized, with the accumulated audio duration as its start time:

```diff
@@ -91,12 +91,19 @@ async def _forward_input() -> None:
             self._sent_stream.end_input()

         async def _synthesize() -> None:
+            from ..voice.io import TimedString
+
+            duration = 0.0
             async for ev in self._sent_stream:
+                output_emitter.push_timed_transcript(
+                    TimedString(text=ev.token, start_time=duration)
+                )
                 async with self._tts._wrapped_tts.synthesize(
-                    ev.token, conn_options=self._wrapped_tts_conn_options
+                    ev.token.strip(), conn_options=self._wrapped_tts_conn_options
                 ) as tts_stream:
                     async for audio in tts_stream:
                         output_emitter.push(audio.frame.data.tobytes())
+                        duration += audio.frame.duration
                 output_emitter.flush()

         tasks = [
```
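The resulting sentence-level timestamps come from the audio itself: each sentence's `start_time` is the total duration of everything synthesized before it. A toy trace of that accumulation (the sentences and frame durations below are made up):

```python
sentences = ["Hello there. ", "How can I help? "]
audio_durations = [[0.42, 0.38], [0.51, 0.47, 0.12]]  # frame durations per sentence, seconds

duration = 0.0
for sent, frames in zip(sentences, audio_durations):
    print(f"start={duration:.2f}s  {sent!r}")  # timestamp emitted before synthesis
    for frame_duration in frames:
        duration += frame_duration  # accumulate as audio frames arrive
# start=0.00s  'Hello there. '
# start=0.80s  'How can I help? '
```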