diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4a69a6a6b..8cb09a7da 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -94,9 +94,14 @@ Install with: `uv add "vision-agents[redis]"`
 `py.typed` markers added to `vision_agents.core` and `vision_agents.testing` for downstream type checking support. (#378)
 
+### Inworld TTS v2
+
+`inworld-tts-2` added to the model `Literal` and used as the default for `inworld.TTS()`. (#531)
+
 ## Bug Fixes
 
 - **EventManager**: fix crash when event handlers have return type annotations (#381)
 - **RedisSessionKVStore**: fix import error when `redis` package is not installed (#384)
 - **Agent metrics**: fix metrics storage and serialization in session registry (#387)
+- **Inworld TTS**: fix garbled / failed playback for replies that span multiple stream chunks by forcing `LINEAR16` audio encoding (#531)
 - **MCPServerRemote**: fix cancel-scope leak in which closing an MCP session left a half-cancelled anyio scope that pegged the event loop. The transport lifecycle now runs inside a dedicated supervisor task so `__aenter__` / `__aexit__` task-identity holds regardless of which caller drives `connect()` and `disconnect()`. (#529)
diff --git a/plugins/inworld/README.md b/plugins/inworld/README.md
index 4265f0dbe..c1af29df6 100644
--- a/plugins/inworld/README.md
+++ b/plugins/inworld/README.md
@@ -16,18 +16,23 @@ Get your API key from the [Inworld Portal](https://studio.inworld.ai/) and set
 
 ## TTS
 
-High-quality text-to-speech with streaming support.
+High-quality text-to-speech with streaming support. The plugin now defaults
+to Inworld's **TTS-2** model (currently in research preview), which adds
+natural-language steering, 100+ languages (15 GA, 90+ experimental), and
+high-quality instant voice cloning over the previous `inworld-tts-1.5-*`
+generation.
 
 ```python
 from vision_agents.plugins import inworld
 
+# Defaults to model_id="inworld-tts-2", voice_id="Sarah"
 tts = inworld.TTS()
 
 # Or specify explicitly
 tts = inworld.TTS(
     api_key="your_inworld_api_key",
-    voice_id="Dennis",
-    model_id="inworld-tts-1.5-max",
+    voice_id="Ashley",
+    model_id="inworld-tts-2",
     temperature=1.1,
 )
 ```
@@ -35,10 +40,46 @@ tts = inworld.TTS(
 
 ### TTS options
 
 - `api_key`: Inworld AI API key (default: reads from `INWORLD_API_KEY`)
-- `voice_id`: Voice to use (default: `"Dennis"`)
-- `model_id`: `"inworld-tts-1.5-max"`, `"inworld-tts-1.5-mini"`, `"inworld-tts-1"`, `"inworld-tts-1-max"` (default: `"inworld-tts-1.5-max"`)
+- `voice_id`: Voice to use (default: `"Sarah"`; `"Dennis"`, `"Ashley"`, `"Olivia"`, `"Clive"` and custom/cloned voices also supported)
+- `model_id`: `"inworld-tts-2"` (default), `"inworld-tts-1.5-max"`, `"inworld-tts-1.5-mini"`. `"inworld-tts-1"` and `"inworld-tts-1-max"` are deprecated by Inworld — migrate to `inworld-tts-2` or `inworld-tts-1.5-*`.
 - `temperature`: 0–2 (default: 1.1)
 
+The plugin requests `LINEAR16` (16-bit PCM WAV) chunks from Inworld, so each
+streamed chunk is self-contained and decodes cleanly during streaming
+playback; no extra configuration is needed.
+
+### Steering (TTS-2)
+
+TTS-2 takes natural-language stage directions inline with your text. Place
+the instruction in square brackets before the segment it should apply to:
+
+```python
+text = (
+    "[whisper in a hushed style] I have to tell you something. "
+    "[laugh] Just kidding! [say with force] Now let's get to work."
+)
+async for chunk in await tts.stream_audio(text):
+    ...
+```
+
+Steering covers articulation, intonation, volume, pitch, range, speed, and
+vocal style — and supports non-verbal sounds like `[laugh]`, `[breathe]`,
+`[clear throat]`, `[sigh]`, `[cough]`, `[yawn]`. Combining dimensions
+(`[whisper in a hushed style]`, `[say playfully and very fast]`) produces
+better results than bare single-word tags. See Inworld's
+[steering docs](https://docs.inworld.ai/tts/capabilities/steering) and
+[prompting guide](https://docs.inworld.ai/tts/best-practices/prompting-for-tts-2)
+for the full reference.
+
+### Agent example
+
+A complete example wiring `inworld.TTS()` into a Stream-edge agent with
+Deepgram STT and a Gemini LLM lives at
+[`example/inworld_tts_example.py`](example/inworld_tts_example.py). The
+companion [`example/inworld-audio-guide.md`](example/inworld-audio-guide.md)
+is loaded as the agent's system prompt and teaches the LLM how to emit
+TTS-2 steering tags so replies sound expressive out of the box.
+
 ## Realtime (WebRTC)
 
 Low-latency speech-to-speech via Inworld's Realtime API. This transport uses
diff --git a/plugins/inworld/example/inworld-audio-guide.md b/plugins/inworld/example/inworld-audio-guide.md
index e82333e93..6bed3164a 100644
--- a/plugins/inworld/example/inworld-audio-guide.md
+++ b/plugins/inworld/example/inworld-audio-guide.md
@@ -1,90 +1,109 @@
-## Audio Markup Rules
-
-### Emotion and Delivery Style Tags
-Place these at the BEGINNING of text segments to control how the following text is spoken:
-- `[happy]` - Use for positive, enthusiastic, or joyful responses
-- `[sad]` - Use for empathetic, disappointing, or melancholic content
-- `[angry]` - Use for firm corrections or expressing frustration
-- `[surprised]` - Use for unexpected discoveries or amazement
-- `[fearful]` - Use for warnings or expressing concern
-- `[disgusted]` - Use for expressing strong disapproval
-- `[laughing]` - Use when text should be delivered with laughter
-- `[whispering]` - Use for secrets, quiet emphasis, or intimate tone
-
-### Non-Verbal Vocalization Tags
-Insert these EXACTLY WHERE the sound should occur in your text:
-- `[breathe]` - Add between thoughts or before important statements
-- `[clear_throat]` - Use before corrections or important announcements
-- `[cough]` - Use sparingly for realism
-- `[laugh]` - Insert after humor or when expressing amusement
-- `[sigh]` - Use to express resignation, relief, or empathy
-- `[yawn]` - Use when expressing tiredness or boredom
+## Audio Markup Rules (Inworld TTS-2)
 
-## Response Generation Rules
+Inworld TTS-2 takes **natural-language stage directions** in square brackets,
+not fixed enum tags. Treat each bracket like a note to a voice actor: the
+more vividly you describe how a line should be performed, the better the
+output. A direction stays in effect for the following sentences until you
+introduce a new one.
+
+### Steering directions (place at the start of a segment)
+
+Combine these dimensions inside one bracket — layered instructions outperform
+single words:
+
+- **Emotion** — `[say excitedly]`, `[say with concern]`, `[sound terrified]`
+- **Articulation** — `[say with force]`, `[say crisply with deliberate pauses]`
+- **Intonation** — `[say with a falling pitch]`, `[rising pitch through the phrase]`
+- **Volume** — `[very quiet]`, `[very loud]`
+- **Pitch** — `[say in a low tone]`, `[say in a high tone]`
+- **Range** — `[say playfully]`, `[say in a flat delivery]`
+- **Speed** — `[very fast]`, `[very slow]`
+- **Vocal style** — `[whisper in a hushed style]`, `[say in a nasal voice]`
 
-1. **Always start responses with appropriate emotion tags** based on the user's query and your response tone.
+Layered example:
+```
+[say sadly with deliberate pauses in a low voice and hushed style] I'm sorry, that didn't work.
+```
 
-2. **Insert non-verbal sounds naturally** where a human would naturally pause, breathe, or react.
+### Non-verbal sounds (insert exactly where the sound should occur)
 
-3. **Match emotions to content**:
-   - Technical explanations: Start neutral or `[happy]` if being helpful
-   - Bad news or errors: Start with `[sad]` or concerned tone
-   - Exciting discoveries: Use `[surprised]` or `[happy]`
-   - Clarifications after misunderstanding: `[clear_throat]` before correcting
+The supported set is:
 
-4. **Use this frequency**:
-   - One emotion tag per response paragraph
-   - 0-2 non-verbal sounds per response
-   - Never use more than 3 total tags in a short response
 
-5. **Natural placement patterns**:
-   - `[breathe]` before listing items or explaining complex topics
-   - `[sigh]` when acknowledging difficulties
-   - `[laugh]` only after genuinely amusing content
-   - `[clear_throat]` before important corrections
+- `[laugh]` — after genuinely amusing content
+- `[sigh]` — to express resignation, relief, or empathy
+- `[breathe]` — between thoughts or before important statements
+- `[clear throat]` — before corrections or important announcements
+- `[cough]` — sparingly, for realism
+- `[yawn]` — when expressing tiredness or boredom
+
+## Response Generation Rules
+
+1. **Lead with one steering direction** when the line has a clear emotional
+   or delivery shift. A single tag scopes across the following sentences
+   until you change it — don't repeat it on every sentence.
+2. **Insert non-verbal sounds inline** at the exact moment they should
+   occur. 0–2 per response is plenty.
+3. **Match the direction to the content** — happy news gets an excited or
+   playful steer; bad news gets a sad, slow, or hushed steer; corrections
+   start with `[clear throat]`.
+4. **Combine dimensions** for nuance. `[say sadly]` is okay; `[say sadly
+   with deliberate pauses in a low voice]` is much better.
+5. **Keep it sparse** — never more than 3 total tags in a short reply.
 
 ## Example Response Patterns
 
-For a helpful response:
+Helpful response:
+```
+[say warmly and a little excited] I'd be glad to help with that. [breathe] Here's what you need to know...
+```
+
+Delivering bad news:
 ```
-[happy] I'd be glad to help you with that! [breathe] Here's what you need to know...
+[say sadly with deliberate pauses in a low voice] Unfortunately, that's not possible. [sigh] Let me explain why...
 ```
 
-For delivering bad news:
+Exciting information:
 ```
-[sad] Unfortunately, that's not possible. [sigh] Let me explain why...
+[say excitedly with a high pitch and fast pace] Oh, that's fascinating — I just realized something important.
 ```
 
-For exciting information:
+Thinking through a problem:
 ```
-[surprised] Oh, that's fascinating! I just realized something important...
+[say slowly and thoughtfully] Let me think about this... [breathe] Yes, I believe the solution is...
 ```
 
-For thinking through problems:
+Correcting yourself:
 ```
-Let me think about this... [breathe] Yes, I believe the solution is...
+[clear throat] [say crisply with a measured pace] Actually, there's been a misunderstanding. Let me clarify...
 ```
 
-For corrections:
+Conspiratorial aside:
 ```
-[clear_throat] Actually, there's been a misunderstanding. Let me clarify...
+[whisper in a hushed style] Between you and me, the real answer is simpler than it looks.
 ```
 
 ## Critical Rules
 
-- **NEVER use multiple emotion tags in the same text segment** - only one at the beginning
-- **NEVER place non-verbal tags at the beginning** - they go where the sound occurs
-- **ALWAYS consider the emotional context** of the user's message
-- **KEEP usage natural** - if unsure whether to add a tag, don't
-- **REMEMBER these are experimental** and only work in English
+- **Use natural-language directions, not fixed enums.** `[happy]` /
+  `[sad]` / `[whispering]` are TTS-1 conventions and won't steer TTS-2 the
+  way you want — write `[say happily]`, `[say sadly]`, `[whisper in a
+  hushed style]` instead.
+- **Don't combine opposing directions** (`[whisper]` + `[very loud]`,
+  `[very fast]` + `[very slow]`). The result is unpredictable.
+- **Don't pick a direction that contradicts the content** — `[say
+  excitedly]` over a condolence reads as sarcasm.
+- **Avoid non-verbal sounds in professional contexts.** Save `[laugh]`,
+  `[yawn]`, `[cough]` for casual or expressive replies.
+- **Keep usage natural.** If you're unsure whether to add a tag, don't.
 
 ## Decision Framework
 
-When generating each response:
-1. Analyze the user's emotional state and query type
-2. Choose ONE appropriate emotion tag for the beginning (if needed)
-3. Identify 0-2 natural places for non-verbal sounds
-4. Write your response with tags embedded
-5. Verify tags feel natural when read aloud
+For each response:
+1. Read the user's message — what emotional register fits?
+2. Pick **one** layered steering direction for the opening segment, if any.
+3. Identify 0–2 places where a non-verbal sound would land naturally.
+4. Write the reply with tags embedded.
+5. Read it aloud (mentally) — if a tag feels theatrical or redundant, cut it.
 
-Your responses should feel like natural human speech when processed through TTS, not robotic or over-acted.
\ No newline at end of file
+Your replies should feel like natural human speech through TTS-2, not
+robotic and not over-acted.
diff --git a/plugins/inworld/example/inworld_tts_example.py b/plugins/inworld/example/inworld_tts_example.py
index f21084ec6..73f463bd9 100644
--- a/plugins/inworld/example/inworld_tts_example.py
+++ b/plugins/inworld/example/inworld_tts_example.py
@@ -21,7 +21,7 @@ from dotenv import load_dotenv
 
 from vision_agents.core import Agent, Runner, User
 from vision_agents.core.agents import AgentLauncher
-from vision_agents.plugins import deepgram, gemini, getstream, inworld, smart_turn
+from vision_agents.plugins import deepgram, gemini, getstream, inworld
 
 logger = logging.getLogger(__name__)
 
@@ -34,10 +34,9 @@ async def create_agent(**kwargs) -> Agent:
         edge=getstream.Edge(),
         agent_user=User(name="Friendly AI", id="agent"),
         instructions="Read @inworld-audio-guide.md",
-        tts=inworld.TTS(voice_id="Ashley"),
+        tts=inworld.TTS(voice_id="Sarah"),
         stt=deepgram.STT(),
-        llm=gemini.LLM(),
-        turn_detection=smart_turn.TurnDetection(),
+        llm=gemini.LLM(model="gemini-3.1-flash-lite-preview"),
     )
     return agent
 
@@ -55,7 +54,7 @@ async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> Non
     logger.info("LLM ready")
     await asyncio.sleep(5)
 
-    await agent.llm.simple_response(text="Tell me a story about a dragon.")
+    await agent.simple_response(text="Tell me a story about a dragon.")
 
     await agent.finish()  # Run till the call ends
diff --git a/plugins/inworld/vision_agents/plugins/inworld/tts.py b/plugins/inworld/vision_agents/plugins/inworld/tts.py
index ff0395919..fffef3f3a 100644
--- a/plugins/inworld/vision_agents/plugins/inworld/tts.py
+++ b/plugins/inworld/vision_agents/plugins/inworld/tts.py
@@ -33,13 +33,14 @@ class TTS(tts.TTS):
     def __init__(
         self,
         api_key: Optional[str] = None,
-        voice_id: str = "Dennis",
+        voice_id: str = "Sarah",
         model_id: Literal[
             "inworld-tts-1.5-max",
             "inworld-tts-1.5-mini",
             "inworld-tts-1",
             "inworld-tts-1-max",
-        ] = "inworld-tts-1",
+            "inworld-tts-2",
+        ] = "inworld-tts-2",
         temperature: float = 1.1,
     ):
         """
@@ -47,9 +48,9 @@ def __init__(
 
         Args:
             api_key: Inworld AI API key. If not provided, the INWORLD_API_KEY
                 environment variable will be used.
-            voice_id: The voice ID to use for synthesis (default: "Dennis").
-            model_id: The model ID to use for synthesis. Options: "inworld-tts-1.5-max",
-                "inworld-tts-1.5-mini" (default: "inworld-tts-1.5-max").
+            voice_id: The voice ID to use for synthesis (default: "Sarah").
+            model_id: The model ID to use for synthesis. Options: "inworld-tts-2",
+                "inworld-tts-1.5-max", "inworld-tts-1.5-mini" (default: "inworld-tts-2").
             temperature: Determines the degree of randomness when sampling audio tokens.
                 Accepts values between 0 and 2. Default: 1.1.
         """
@@ -92,6 +93,7 @@ async def stream_audio(self, text: str, *_, **__) -> AsyncIterator[PcmData]:
                 "modelId": self.model_id,
                 "audioConfig": {
                     "temperature": self.temperature,
+                    "audioEncoding": "LINEAR16",
                 },
             }
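
The `audioEncoding: "LINEAR16"` request parameter above is what makes the multi-chunk playback fix work: every LINEAR16 chunk is plain 16-bit PCM (delivered as a WAV payload), so a consumer can decode any chunk in isolation, whereas compressed encodings can split a frame across chunk boundaries and garble playback. A stdlib-only sketch of that property (the helper names here are illustrative, not part of the plugin):

```python
# Illustrative sketch (not plugin code): why LINEAR16 streams chunk cleanly.
# Each chunk below is a complete 16-bit PCM WAV payload, so it decodes in
# isolation; no frame state carries over from one chunk to the next.
import io
import math
import struct
import wave


def make_linear16_chunk(samples: list[float], rate: int = 16000) -> bytes:
    """Encode float samples in [-1, 1] as a self-contained LINEAR16 WAV chunk."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample == 16-bit PCM == LINEAR16
        w.setframerate(rate)
        w.writeframes(
            b"".join(
                struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
                for s in samples
            )
        )
    return buf.getvalue()


def decode_chunk(chunk: bytes) -> tuple[int, int]:
    """Decode one chunk in isolation; return (sample_rate, frame_count)."""
    with wave.open(io.BytesIO(chunk), "rb") as w:
        return w.getframerate(), w.getnframes()


# Two independent "stream" chunks, each decodable on its own.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
c1 = make_linear16_chunk(tone)
c2 = make_linear16_chunk(tone[:80])
assert decode_chunk(c1) == (16000, 160)
assert decode_chunk(c2) == (16000, 80)
```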