5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -94,9 +94,14 @@ Install with: `uv add "vision-agents[redis]"`

`py.typed` markers added to `vision_agents.core` and `vision_agents.testing` for downstream type checking support. (#378)

### Inworld TTS v2

`inworld-tts-2` added to the model `Literal` and used as the default for `inworld.TTS()`. (#531)

## Bug Fixes

- **EventManager**: fix crash when event handlers have return type annotations (#381)
- **RedisSessionKVStore**: fix import error when `redis` package is not installed (#384)
- **Agent metrics**: fix metrics storage and serialization in session registry (#387)
- **Inworld TTS**: fix garbled / failed playback for replies that span multiple stream chunks by forcing `LINEAR16` audio encoding (#531)
- **MCPServerRemote**: fix cancel-scope leak in which closing an MCP session left a half-cancelled anyio scope that pegged the event loop. The transport lifecycle now runs inside a dedicated supervisor task so `__aenter__` / `__aexit__` task-identity holds regardless of which caller drives `connect()` and `disconnect()`. (#529)
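
A rough sketch of the supervisor-task pattern that entry describes (hypothetical names, not the plugin's actual API): the transport's context manager is entered and exited inside one dedicated task, so anyio's cancel-scope task-identity check holds no matter which task calls `connect()` or `disconnect()`.

```python
import anyio
from anyio.abc import TaskGroup


class SessionSupervisor:
    """Hypothetical sketch: owns the transport context manager inside one task."""

    def __init__(self, open_transport):
        # open_transport is any async-context-manager factory (illustrative).
        self._open_transport = open_transport
        self._ready = anyio.Event()
        self._stop = anyio.Event()
        self.session = None

    async def _run(self) -> None:
        # __aenter__ and __aexit__ both execute in this one task, so the
        # anyio cancel scope is always exited by the task that entered it.
        async with self._open_transport() as session:
            self.session = session
            self._ready.set()
            await self._stop.wait()

    async def connect(self, tg: TaskGroup) -> None:
        tg.start_soon(self._run)
        await self._ready.wait()

    async def disconnect(self) -> None:
        # Any caller, from any task, may request shutdown; the actual
        # __aexit__ still runs inside the supervisor task above.
        self._stop.set()
```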
51 changes: 46 additions & 5 deletions plugins/inworld/README.md
@@ -16,29 +16,70 @@ Get your API key from the [Inworld Portal](https://studio.inworld.ai/) and set

## TTS

High-quality text-to-speech with streaming support.
High-quality text-to-speech with streaming support. The plugin now defaults
to Inworld's **TTS-2** model (currently in research preview), which adds
natural-language steering, 100+ languages (15 GA, 90+ experimental), and
high-quality instant voice cloning on top of the previous `inworld-tts-1.5-*`
generation.

```python
from vision_agents.plugins import inworld

# Defaults to model_id="inworld-tts-2", voice_id="Sarah"
tts = inworld.TTS()

# Or specify explicitly
tts = inworld.TTS(
api_key="your_inworld_api_key",
voice_id="Dennis",
model_id="inworld-tts-1.5-max",
voice_id="Ashley",
model_id="inworld-tts-2",
temperature=1.1,
)
```

### TTS options

- `api_key`: Inworld AI API key (default: reads from `INWORLD_API_KEY`)
- `voice_id`: Voice to use (default: `"Dennis"`)
- `model_id`: `"inworld-tts-1.5-max"`, `"inworld-tts-1.5-mini"`, `"inworld-tts-1"`, `"inworld-tts-1-max"` (default: `"inworld-tts-1.5-max"`)
- `voice_id`: Voice to use (default: `"Sarah"`; `"Dennis"`, `"Ashley"`, `"Olivia"`, `"Clive"` and custom/cloned voices also supported)
- `model_id`: `"inworld-tts-2"` (default), `"inworld-tts-1.5-max"`, `"inworld-tts-1.5-mini"`. `"inworld-tts-1"` and `"inworld-tts-1-max"` are deprecated by Inworld — migrate to `inworld-tts-2` or `inworld-tts-1.5-*`.
- `temperature`: 0–2 (default: 1.1)

The plugin requests `LINEAR16` (16-bit PCM WAV) chunks from Inworld so each
streamed chunk is self-contained and decodes cleanly during streaming playback;
no extra configuration is needed.
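
A minimal sketch of what that buys a consumer, assuming each streamed chunk arrives as a complete LINEAR16 WAV payload (`decode_linear16_chunk` is a hypothetical helper, not part of the plugin):

```python
import io
import wave


def decode_linear16_chunk(chunk: bytes) -> tuple[bytes, int]:
    """Decode one self-contained LINEAR16 WAV chunk into raw PCM frames."""
    with wave.open(io.BytesIO(chunk), "rb") as wav:
        assert wav.getsampwidth() == 2  # LINEAR16 == 16-bit signed samples
        return wav.readframes(wav.getnframes()), wav.getframerate()
```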

### Steering (TTS-2)

TTS-2 takes natural-language stage directions inline with your text. Place
the instruction in square brackets before the segment it should apply to:

```python
text = (
"[whisper in a hushed style] I have to tell you something. "
"[laugh] Just kidding! [say with force] Now let's get to work."
)
async for chunk in await tts.stream_audio(text):
...
```

Steering covers articulation, intonation, volume, pitch, range, speed, and
vocal style — and supports non-verbal sounds like `[laugh]`, `[breathe]`,
`[clear throat]`, `[sigh]`, `[cough]`, `[yawn]`. Combining dimensions
(`[whisper in a hushed style]`, `[say playfully and very fast]`) produces
better results than bare single-word tags. See Inworld's
[steering docs](https://docs.inworld.ai/tts/capabilities/steering) and
[prompting guide](https://docs.inworld.ai/tts/best-practices/prompting-for-tts-2)
for the full reference.

### Agent example

A complete example wiring `inworld.TTS()` into a Stream-edge agent with
Deepgram STT and a Gemini LLM lives at
[`example/inworld_tts_example.py`](example/inworld_tts_example.py). The
companion [`example/inworld-audio-guide.md`](example/inworld-audio-guide.md)
is loaded as the agent's system prompt and teaches the LLM how to emit
TTS-2 steering tags so replies sound expressive out of the box.

## Realtime (WebRTC)

Low-latency speech-to-speech via Inworld's Realtime API. This transport uses
139 changes: 79 additions & 60 deletions plugins/inworld/example/inworld-audio-guide.md
@@ -1,90 +1,109 @@
## Audio Markup Rules

### Emotion and Delivery Style Tags
Place these at the BEGINNING of text segments to control how the following text is spoken:
- `[happy]` - Use for positive, enthusiastic, or joyful responses
- `[sad]` - Use for empathetic, disappointing, or melancholic content
- `[angry]` - Use for firm corrections or expressing frustration
- `[surprised]` - Use for unexpected discoveries or amazement
- `[fearful]` - Use for warnings or expressing concern
- `[disgusted]` - Use for expressing strong disapproval
- `[laughing]` - Use when text should be delivered with laughter
- `[whispering]` - Use for secrets, quiet emphasis, or intimate tone

### Non-Verbal Vocalization Tags
Insert these EXACTLY WHERE the sound should occur in your text:
- `[breathe]` - Add between thoughts or before important statements
- `[clear_throat]` - Use before corrections or important announcements
- `[cough]` - Use sparingly for realism
- `[laugh]` - Insert after humor or when expressing amusement
- `[sigh]` - Use to express resignation, relief, or empathy
- `[yawn]` - Use when expressing tiredness or boredom
## Audio Markup Rules (Inworld TTS-2)

## Response Generation Rules
Inworld TTS-2 takes **natural-language stage directions** in square brackets,
not fixed enum tags. Treat each bracket like a note to a voice actor: the
more vividly you describe how a line should be performed, the better the
output. A direction stays in effect for following sentences until you
introduce a new one.

### Steering directions (place at the start of a segment)

Combine these dimensions inside one bracket — layered instructions outperform
single words:

1. **Always start responses with appropriate emotion tags** based on the user's query and your response tone.
- **Emotion** — `[say excitedly]`, `[say with concern]`, `[sound terrified]`
- **Articulation** — `[say with force]`, `[say crisply with deliberate pauses]`
- **Intonation** — `[say with a falling pitch]`, `[rising pitch through the phrase]`
- **Volume** — `[very quiet]`, `[very loud]`
- **Pitch** — `[say in a low tone]`, `[say in a high tone]`
- **Range** — `[say playfully]`, `[say in a flat delivery]`
- **Speed** — `[very fast]`, `[very slow]`
- **Vocal style** — `[whisper in a hushed style]`, `[say in a nasal voice]`

2. **Insert non-verbal sounds naturally** where a human would naturally pause, breathe, or react.
Layered example:
```
[say sadly with deliberate pauses in a low voice and hushed style] I'm sorry, that didn't work.
```
CodeRabbit review comment on lines +24 to +26:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add language identifiers to fenced code blocks.

These fences trigger markdownlint MD040. Add a language (for example `text`) to each fenced block.

Proposed fix: add a language identifier to each fenced example block, changing each opening `` ``` `` to `` ```text `` (the same change applies to every steering example in this guide).
 


Also applies to: 56-58, 61-63, 66-68, 71-73, 76-78, 81-83


3. **Match emotions to content**:
- Technical explanations: Start neutral or `[happy]` if being helpful
- Bad news or errors: Start with `[sad]` or concerned tone
- Exciting discoveries: Use `[surprised]` or `[happy]`
- Clarifications after misunderstanding: `[clear_throat]` before correcting
### Non-verbal sounds (insert exactly where the sound should occur)

4. **Use this frequency**:
- One emotion tag per response paragraph
- 0-2 non-verbal sounds per response
- Never use more than 3 total tags in a short response
The supported set is:

5. **Natural placement patterns**:
- `[breathe]` before listing items or explaining complex topics
- `[sigh]` when acknowledging difficulties
- `[laugh]` only after genuinely amusing content
- `[clear_throat]` before important corrections
- `[laugh]` — after genuinely amusing content
- `[sigh]` — to express resignation, relief, or empathy
- `[breathe]` — between thoughts or before important statements
- `[clear throat]` — before corrections or important announcements
- `[cough]` — sparingly, for realism
- `[yawn]` — when expressing tiredness or boredom

## Response Generation Rules

1. **Lead with one steering direction** when the line has a clear emotional
or delivery shift. A single tag scopes across the following sentences
until you change it — don't repeat it on every sentence.
2. **Insert non-verbal sounds inline** at the exact moment they should
occur. 0–2 per response is plenty.
3. **Match the direction to the content** — happy news gets an excited or
playful steer; bad news gets a sad, slow, or hushed steer; corrections
start with `[clear throat]`.
4. **Combine dimensions** for nuance. `[say sadly]` is okay; `[say sadly
with deliberate pauses in a low voice]` is much better.
5. **Keep it sparse** — never more than 3 total tags in a short reply.

## Example Response Patterns

For a helpful response:
Helpful response:
```
[say warmly and a little excited] I'd be glad to help with that. [breathe] Here's what you need to know...
```

Delivering bad news:
```
[happy] I'd be glad to help you with that! [breathe] Here's what you need to know...
[say sadly with deliberate pauses in a low voice] Unfortunately, that's not possible. [sigh] Let me explain why...
```

For delivering bad news:
Exciting information:
```
[sad] Unfortunately, that's not possible. [sigh] Let me explain why...
[say excitedly with a high pitch and fast pace] Oh, that's fascinating — I just realized something important.
```

For exciting information:
Thinking through a problem:
```
[surprised] Oh, that's fascinating! I just realized something important...
[say slowly and thoughtfully] Let me think about this... [breathe] Yes, I believe the solution is...
```

For thinking through problems:
Correcting yourself:
```
Let me think about this... [breathe] Yes, I believe the solution is...
[clear throat] [say crisply with a measured pace] Actually, there's been a misunderstanding. Let me clarify...
```

For corrections:
Conspiratorial aside:
```
[clear_throat] Actually, there's been a misunderstanding. Let me clarify...
[whisper in a hushed style] Between you and me, the real answer is simpler than it looks.
```

## Critical Rules

- **NEVER use multiple emotion tags in the same text segment** - only one at the beginning
- **NEVER place non-verbal tags at the beginning** - they go where the sound occurs
- **ALWAYS consider the emotional context** of the user's message
- **KEEP usage natural** - if unsure whether to add a tag, don't
- **REMEMBER these are experimental** and only work in English
- **Use natural-language directions, not fixed enums.** `[happy]` /
`[sad]` / `[whispering]` are TTS-1 conventions and won't steer TTS-2 the
way you want — write `[say happily]`, `[say sadly]`, `[whisper in a
hushed style]` instead.
- **Don't combine opposing directions** (`[whisper]` + `[very loud]`,
`[very fast]` + `[very slow]`). The result is unpredictable.
- **Don't pick a direction that contradicts the content** — `[say
excitedly]` over a condolence reads as sarcasm.
- **Avoid non-verbal sounds in professional contexts.** Save `[laugh]`,
`[yawn]`, `[cough]` for casual or expressive replies.
- **Keep usage natural.** If you're unsure whether to add a tag, don't.

## Decision Framework

When generating each response:
1. Analyze the user's emotional state and query type
2. Choose ONE appropriate emotion tag for the beginning (if needed)
3. Identify 0-2 natural places for non-verbal sounds
4. Write your response with tags embedded
5. Verify tags feel natural when read aloud
For each response:
1. Read the user's message — what emotional register fits?
2. Pick **one** layered steering direction for the opening segment, if any.
3. Identify 0–2 places where a non-verbal sound would land naturally.
4. Write the reply with tags embedded.
5. Read it aloud (mentally) — if a tag feels theatrical or redundant, cut it.

Your responses should feel like natural human speech when processed through TTS, not robotic or over-acted.
Your replies should feel like natural human speech when rendered through
TTS-2, not robotic and not over-acted.
9 changes: 4 additions & 5 deletions plugins/inworld/example/inworld_tts_example.py
@@ -21,7 +21,7 @@
from dotenv import load_dotenv
from vision_agents.core import Agent, Runner, User
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import deepgram, gemini, getstream, inworld, smart_turn
from vision_agents.plugins import deepgram, gemini, getstream, inworld

logger = logging.getLogger(__name__)

@@ -34,10 +34,9 @@ async def create_agent(**kwargs) -> Agent:
edge=getstream.Edge(),
agent_user=User(name="Friendly AI", id="agent"),
instructions="Read @inworld-audio-guide.md",
tts=inworld.TTS(voice_id="Ashley"),
tts=inworld.TTS(voice_id="Sarah"),
stt=deepgram.STT(),
llm=gemini.LLM(),
turn_detection=smart_turn.TurnDetection(),
llm=gemini.LLM(model="gemini-3.1-flash-lite-preview"),
)
return agent

@@ -55,7 +54,7 @@ async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None
logger.info("LLM ready")

await asyncio.sleep(5)
await agent.llm.simple_response(text="Tell me a story about a dragon.")
await agent.simple_response(text="Tell me a story about a dragon.")

await agent.finish() # Run till the call ends

12 changes: 7 additions & 5 deletions plugins/inworld/vision_agents/plugins/inworld/tts.py
@@ -33,23 +33,24 @@ class TTS(tts.TTS):
def __init__(
self,
api_key: Optional[str] = None,
voice_id: str = "Dennis",
voice_id: str = "Sarah",
model_id: Literal[
"inworld-tts-1.5-max",
"inworld-tts-1.5-mini",
"inworld-tts-1",
"inworld-tts-1-max",
] = "inworld-tts-1",
"inworld-tts-2",
] = "inworld-tts-2",
temperature: float = 1.1,
):
"""
Initialize the Inworld AI TTS service.
Args:
api_key: Inworld AI API key. If not provided, the INWORLD_API_KEY
environment variable will be used.
voice_id: The voice ID to use for synthesis (default: "Dennis").
model_id: The model ID to use for synthesis. Options: "inworld-tts-1.5-max",
"inworld-tts-1.5-mini" (default: "inworld-tts-1.5-max").
voice_id: The voice ID to use for synthesis (default: "Sarah").
model_id: The model ID to use for synthesis. Options: "inworld-tts-2",
"inworld-tts-1.5-max", "inworld-tts-1.5-mini" (default: "inworld-tts-2").
temperature: Determines the degree of randomness when sampling audio tokens.
Accepts values between 0 and 2. Default: 1.1.
"""
@@ -92,6 +93,7 @@ async def stream_audio(self, text: str, *_, **__) -> AsyncIterator[PcmData]:
"modelId": self.model_id,
"audioConfig": {
"temperature": self.temperature,
"audioEncoding": "LINEAR16",
},
}
