fix(inworld): Default to inworld-tts-2 #531
Merged
8 commits
7e5b13e fix(inworld): force LINEAR16 audio encoding for streaming TTS (Nash0x7E2)
aabb621 feat(inworld): add inworld-tts-2 to model literal and use as default (Nash0x7E2)
1c668db chore(inworld): switch example LLM to gemini-3.1-flash-lite-preview (Nash0x7E2)
0439f5e docs(changelog): add inworld TTS v2 + LINEAR16 fix entries (Nash0x7E2)
2d4b1a4 docs(inworld): document TTS-2 capabilities, switch default voice to S… (Nash0x7E2)
00fb5dd Fix turn detection in example (Nash0x7E2)
38ef975 remove agent.llm.simple_response (Nash0x7E2)
0b460a4 Merge branch 'main' into nash/inworld-2 (Nash0x7E2)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,90 +1,109 @@ | ||
| ## Audio Markup Rules | ||
|
|
||
| ### Emotion and Delivery Style Tags | ||
| Place these at the BEGINNING of text segments to control how the following text is spoken: | ||
| - `[happy]` - Use for positive, enthusiastic, or joyful responses | ||
| - `[sad]` - Use for empathetic, disappointing, or melancholic content | ||
| - `[angry]` - Use for firm corrections or expressing frustration | ||
| - `[surprised]` - Use for unexpected discoveries or amazement | ||
| - `[fearful]` - Use for warnings or expressing concern | ||
| - `[disgusted]` - Use for expressing strong disapproval | ||
| - `[laughing]` - Use when text should be delivered with laughter | ||
| - `[whispering]` - Use for secrets, quiet emphasis, or intimate tone | ||
|
|
||
| ### Non-Verbal Vocalization Tags | ||
| Insert these EXACTLY WHERE the sound should occur in your text: | ||
| - `[breathe]` - Add between thoughts or before important statements | ||
| - `[clear_throat]` - Use before corrections or important announcements | ||
| - `[cough]` - Use sparingly for realism | ||
| - `[laugh]` - Insert after humor or when expressing amusement | ||
| - `[sigh]` - Use to express resignation, relief, or empathy | ||
| - `[yawn]` - Use when expressing tiredness or boredom | ||
| ## Audio Markup Rules (Inworld TTS-2) | ||
|
|
||
| ## Response Generation Rules | ||
| Inworld TTS-2 takes **natural-language stage directions** in square brackets, | ||
| not fixed enum tags. Treat each bracket like a note to a voice actor: the | ||
| more vividly you describe how a line should be performed, the better the | ||
| output. A direction stays in effect for following sentences until you | ||
| introduce a new one. | ||
|
|
||
| ### Steering directions (place at the start of a segment) | ||
|
|
||
| Combine these dimensions inside one bracket — layered instructions outperform | ||
| single words: | ||
|
|
||
| 1. **Always start responses with appropriate emotion tags** based on the user's query and your response tone. | ||
| - **Emotion** — `[say excitedly]`, `[say with concern]`, `[sound terrified]` | ||
| - **Articulation** — `[say with force]`, `[say crisply with deliberate pauses]` | ||
| - **Intonation** — `[say with a falling pitch]`, `[rising pitch through the phrase]` | ||
| - **Volume** — `[very quiet]`, `[very loud]` | ||
| - **Pitch** — `[say in a low tone]`, `[say in a high tone]` | ||
| - **Range** — `[say playfully]`, `[say in a flat delivery]` | ||
| - **Speed** — `[very fast]`, `[very slow]` | ||
| - **Vocal style** — `[whisper in a hushed style]`, `[say in a nasal voice]` | ||
|
|
||
| 2. **Insert non-verbal sounds naturally** where a human would naturally pause, breathe, or react. | ||
| Layered example: | ||
| ``` | ||
| [say sadly with deliberate pauses in a low voice and hushed style] I'm sorry, that didn't work. | ||
| ``` | ||
|
|
||
| 3. **Match emotions to content**: | ||
| - Technical explanations: Start neutral or `[happy]` if being helpful | ||
| - Bad news or errors: Start with `[sad]` or concerned tone | ||
| - Exciting discoveries: Use `[surprised]` or `[happy]` | ||
| - Clarifications after misunderstanding: `[clear_throat]` before correcting | ||
| ### Non-verbal sounds (insert exactly where the sound should occur) | ||
|
|
||
| 4. **Use this frequency**: | ||
| - One emotion tag per response paragraph | ||
| - 0-2 non-verbal sounds per response | ||
| - Never use more than 3 total tags in a short response | ||
| The supported set is: | ||
|
|
||
| 5. **Natural placement patterns**: | ||
| - `[breathe]` before listing items or explaining complex topics | ||
| - `[sigh]` when acknowledging difficulties | ||
| - `[laugh]` only after genuinely amusing content | ||
| - `[clear_throat]` before important corrections | ||
| - `[laugh]` — after genuinely amusing content | ||
| - `[sigh]` — to express resignation, relief, or empathy | ||
| - `[breathe]` — between thoughts or before important statements | ||
| - `[clear throat]` — before corrections or important announcements | ||
| - `[cough]` — sparingly, for realism | ||
| - `[yawn]` — when expressing tiredness or boredom | ||
|
|
||
| ## Response Generation Rules | ||
|
|
||
| 1. **Lead with one steering direction** when the line has a clear emotional | ||
| or delivery shift. A single tag scopes across the following sentences | ||
| until you change it — don't repeat it on every sentence. | ||
| 2. **Insert non-verbal sounds inline** at the exact moment they should | ||
| occur. 0–2 per response is plenty. | ||
| 3. **Match the direction to the content** — happy news gets an excited or | ||
| playful steer; bad news gets a sad, slow, or hushed steer; corrections | ||
| start with `[clear throat]`. | ||
| 4. **Combine dimensions** for nuance. `[say sadly]` is okay; `[say sadly | ||
| with deliberate pauses in a low voice]` is much better. | ||
| 5. **Keep it sparse** — never more than 3 total tags in a short reply. | ||
|
|
||
| ## Example Response Patterns | ||
|
|
||
| For a helpful response: | ||
| Helpful response: | ||
| ``` | ||
| [say warmly and a little excited] I'd be glad to help with that. [breathe] Here's what you need to know... | ||
| ``` | ||
|
|
||
| Delivering bad news: | ||
| ``` | ||
| [happy] I'd be glad to help you with that! [breathe] Here's what you need to know... | ||
| [say sadly with deliberate pauses in a low voice] Unfortunately, that's not possible. [sigh] Let me explain why... | ||
| ``` | ||
|
|
||
| For delivering bad news: | ||
| Exciting information: | ||
| ``` | ||
| [sad] Unfortunately, that's not possible. [sigh] Let me explain why... | ||
| [say excitedly with a high pitch and fast pace] Oh, that's fascinating — I just realized something important. | ||
| ``` | ||
|
|
||
| For exciting information: | ||
| Thinking through a problem: | ||
| ``` | ||
| [surprised] Oh, that's fascinating! I just realized something important... | ||
| [say slowly and thoughtfully] Let me think about this... [breathe] Yes, I believe the solution is... | ||
| ``` | ||
|
|
||
| For thinking through problems: | ||
| Correcting yourself: | ||
| ``` | ||
| Let me think about this... [breathe] Yes, I believe the solution is... | ||
| [clear throat] [say crisply with a measured pace] Actually, there's been a misunderstanding. Let me clarify... | ||
| ``` | ||
|
|
||
| For corrections: | ||
| Conspiratorial aside: | ||
| ``` | ||
| [clear_throat] Actually, there's been a misunderstanding. Let me clarify... | ||
| [whisper in a hushed style] Between you and me, the real answer is simpler than it looks. | ||
| ``` | ||
|
|
||
| ## Critical Rules | ||
|
|
||
| - **NEVER use multiple emotion tags in the same text segment** - only one at the beginning | ||
| - **NEVER place non-verbal tags at the beginning** - they go where the sound occurs | ||
| - **ALWAYS consider the emotional context** of the user's message | ||
| - **KEEP usage natural** - if unsure whether to add a tag, don't | ||
| - **REMEMBER these are experimental** and only work in English | ||
| - **Use natural-language directions, not fixed enums.** `[happy]` / | ||
| `[sad]` / `[whispering]` are TTS-1 conventions and won't steer TTS-2 the | ||
| way you want — write `[say happily]`, `[say sadly]`, `[whisper in a | ||
| hushed style]` instead. | ||
| - **Don't combine opposing directions** (`[whisper]` + `[very loud]`, | ||
| `[very fast]` + `[very slow]`). The result is unpredictable. | ||
| - **Don't pick a direction that contradicts the content** — `[say | ||
| excitedly]` over a condolence reads as sarcasm. | ||
| - **Avoid non-verbal sounds in professional contexts.** Save `[laugh]`, | ||
| `[yawn]`, `[cough]` for casual or expressive replies. | ||
| - **Keep usage natural.** If you're unsure whether to add a tag, don't. | ||
|
|
||
| ## Decision Framework | ||
|
|
||
| When generating each response: | ||
| 1. Analyze the user's emotional state and query type | ||
| 2. Choose ONE appropriate emotion tag for the beginning (if needed) | ||
| 3. Identify 0-2 natural places for non-verbal sounds | ||
| 4. Write your response with tags embedded | ||
| 5. Verify tags feel natural when read aloud | ||
| For each response: | ||
| 1. Read the user's message — what emotional register fits? | ||
| 2. Pick **one** layered steering direction for the opening segment, if any. | ||
| 3. Identify 0–2 places where a non-verbal sound would land naturally. | ||
| 4. Write the reply with tags embedded. | ||
| 5. Read it aloud (mentally) — if a tag feels theatrical or redundant, cut it. | ||
|
|
||
| Your responses should feel like natural human speech when processed through TTS, not robotic or over-acted. | ||
| Your replies should feel like natural human speech through TTS-2, not | ||
| robotic and not over-acted. | ||
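
The rules in the updated prompt (at most 3 tags in a short reply, 0–2 non-verbal sounds, no TTS-1 enum tags, no opposing directions) can be checked mechanically. The sketch below is a hypothetical helper, not part of this PR or of any Inworld API. `check_reply` and its constant names are made up for illustration, the legacy tag list is copied from the removed TTS-1 section of the diff, and the whole reply is treated as one scope, which is a simplification.

```python
import re

# Non-verbal sounds listed as the supported set in the TTS-2 prompt.
NON_VERBAL = {"laugh", "sigh", "breathe", "clear throat", "cough", "yawn"}

# Fixed enum tags from the removed TTS-1 prompt (assumed to be the full legacy set).
LEGACY_TAGS = {
    "happy", "sad", "angry", "surprised", "fearful",
    "disgusted", "laughing", "whispering", "clear_throat",
}

# Pairs the prompt names as opposing; matched as substrings of the directions.
OPPOSING = [("whisper", "very loud"), ("very fast", "very slow")]

# Anything inside square brackets counts as a tag.
TAG_RE = re.compile(r"\[([^\[\]]+)\]")


def check_reply(text: str) -> list[str]:
    """Return a list of rule violations found in one reply (empty if none)."""
    tags = [t.strip().lower() for t in TAG_RE.findall(text)]
    sounds = [t for t in tags if t in NON_VERBAL]
    steers = [t for t in tags if t not in NON_VERBAL]

    problems = []
    if len(tags) > 3:
        problems.append(f"too many tags: {len(tags)}")
    if len(sounds) > 2:
        problems.append(f"too many non-verbal sounds: {len(sounds)}")
    for t in steers:
        if t in LEGACY_TAGS:
            problems.append(f"legacy TTS-1 tag: [{t}]")
    # Simplification: look for opposing phrases anywhere in the reply's directions.
    joined = " ".join(steers)
    for a, b in OPPOSING:
        if a in joined and b in joined:
            problems.append(f"opposing directions: {a} + {b}")
    return problems
```

Run against the "Helpful response" example from the diff, the function returns an empty list, while a reply that opens with `[happy]` is reported as a legacy tag. A check like this only covers the countable rules. Whether a direction contradicts the content still needs a human or model judgment.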
Add language identifiers to fenced code blocks.
These fences trigger markdownlint MD040. Add a language (for example `text`) to each fenced block.

Proposed fix:
@@
-```
+```text
 [say warmly and a little excited] I'd be glad to help with that. [breathe] Here's what you need to know...
@@
-```
+```text
 [say excitedly with a high pitch and fast pace] Oh, that's fascinating — I just realized something important.
@@
-```
+```text
 [clear throat] [say crisply with a measured pace] Actually, there's been a misunderstanding. Let me clarify...
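
If the same fix is needed across many blocks, it can be scripted. The following is a minimal sketch, assuming only plain triple-backtick fences (no tildes, no longer fences, no nesting). `add_fence_language` is a hypothetical helper name, not something from this repository or from markdownlint.

```python
FENCE = "`" * 3  # a triple-backtick fence marker


def add_fence_language(markdown: str, lang: str = "text") -> str:
    """Add `lang` to every opening fence that has no language identifier."""
    out = []
    in_block = False
    for line in markdown.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith(FENCE):
            # Only a bare opening fence is rewritten; closing fences and
            # fences that already carry a language are left untouched.
            if not in_block and stripped == FENCE:
                line = line.replace(FENCE, FENCE + lang, 1)
            in_block = not in_block
        out.append(line)
    return "".join(out)
```

Any line that starts with a fence marker toggles the in-block state, so a document with unbalanced fences would be handled incorrectly. Re-running the linter after the rewrite is a sensible check.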