
Commit 8810ecc

chore: restructurize T2S & add docs (#738)
## Description

Introduces significant changes to the native Text to Speech implementation for the Kokoro model:

- Adapted the T2S logic to the dynamic input shape version of the Kokoro model
- Restructured and simplified the native implementation
- Simplified the TypeScript API

Additionally, it adds docs for the Text to Speech module and the useTextToSpeech hook.

### Introduces a breaking change?

- [x] Yes
- [ ] No

### Type of change

- [ ] Bug fix (change which fixes an issue)
- [ ] New feature (change which adds functionality)
- [x] Documentation update (improves or adds clarity to existing documentation)
- [x] Other (chores, tests, code style improvements etc.)

### Tested on

- [x] iOS
- [x] Android

### Testing instructions

<!-- Provide step-by-step instructions on how to test your changes. Include setup details if necessary. -->

### Screenshots

<!-- Add screenshots here, if applicable -->

### Related issues

<!-- Link related issues here using #issue-number -->

### Checklist

- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have updated the documentation accordingly
- [x] My changes generate no new warnings

### Additional notes

<!-- Include any additional information, assumptions, or context that reviewers might need to understand this PR. -->
1 parent 9efd924 commit 8810ecc

33 files changed

Lines changed: 975 additions & 670 deletions

File tree

.cspell-wordlist.txt

Lines changed: 1 addition & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -88,6 +88,7 @@ kokoro
8888
phonemizer
8989
phonemizers
9090
phonemis
91+
phonemizing
9192
Español
9293
Français
9394
Português

apps/speech/screens/Quiz.tsx

Lines changed: 5 additions & 5 deletions
@@ -18,8 +18,8 @@ import Animated, {
1818
} from 'react-native-reanimated';
1919
import { SafeAreaProvider, SafeAreaView } from 'react-native-safe-area-context';
2020
import {
21-
KOKORO_EN,
22-
KOKORO_VOICE_AF_HEART,
21+
KOKORO_MEDIUM,
22+
KOKORO_VOICE_AM_SANTA,
2323
useTextToSpeech,
2424
} from 'react-native-executorch';
2525
import {
@@ -61,8 +61,8 @@ const createAudioBufferFromVector = (
6161
export const Quiz = ({ onBack }: { onBack: () => void }) => {
6262
// --- Hooks & State ---
6363
const model = useTextToSpeech({
64-
model: KOKORO_EN,
65-
voice: KOKORO_VOICE_AF_HEART,
64+
model: KOKORO_MEDIUM,
65+
voice: KOKORO_VOICE_AM_SANTA,
6666
});
6767

6868
const [shuffledQuestions] = useState(() => shuffleArray(QUESTIONS));
@@ -153,7 +153,7 @@ export const Quiz = ({ onBack }: { onBack: () => void }) => {
153153
});
154154
};
155155

156-
await model.stream({ text, onNext, onEnd: async () => {} });
156+
await model.stream({ text, speed: 0.9, onNext, onEnd: async () => {} });
157157
} catch (e) {
158158
console.error(e);
159159
} finally {

apps/speech/screens/TextToSpeechScreen.tsx

Lines changed: 2 additions & 8 deletions
@@ -10,7 +10,7 @@ import {
1010
} from 'react-native';
1111
import { SafeAreaProvider, SafeAreaView } from 'react-native-safe-area-context';
1212
import {
13-
KOKORO_EN,
13+
KOKORO_MEDIUM,
1414
KOKORO_VOICE_AF_HEART,
1515
useTextToSpeech,
1616
} from 'react-native-executorch';
@@ -49,14 +49,8 @@ const createAudioBufferFromVector = (
4949

5050
export const TextToSpeechScreen = ({ onBack }: { onBack: () => void }) => {
5151
const model = useTextToSpeech({
52-
model: KOKORO_EN,
52+
model: KOKORO_MEDIUM,
5353
voice: KOKORO_VOICE_AF_HEART,
54-
options: {
55-
// This allows to minimize the memory usage by utilizing only one of the models.
56-
// However, it either increases the latency (in case of the largest model) or
57-
// decreases the quality of the results (in case of the smaller models).
58-
// fixedModel: "large"
59-
},
6054
});
6155

6256
const [inputText, setInputText] = useState('');

docs/docs/02-benchmarks/inference-time.md

Lines changed: 11 additions & 0 deletions
@@ -66,6 +66,8 @@ The values below represent the averages across all runs for the benchmark image.
6666

6767
❌ - Insufficient RAM.
6868

69+
## Speech to Text
70+
6971
### Encoding
7072

7173
Average time for encoding audio of given length over 10 runs. For `Whisper` model we only list 30 sec audio chunks since `Whisper` does not accept other lengths (for shorter audio the audio needs to be padded to 30sec with silence).
@@ -82,6 +84,15 @@ Average time for decoding one token in sequence of approximately 100 tokens, wit
8284
| ------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: |
8385
| Whisper-tiny (30s) | 23 | 25 | 121 | 92 | 115 |
8486

87+
## Text to Speech
88+
89+
Average time to synthesize speech from an input text of approximately 60 tokens, resulting in 2 to 5 seconds of audio depending on the input and selected voice.
90+
91+
| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |
92+
| ------------- | :--------------------------: | :-----------------------: |
93+
| Kokoro-small | 2051 | 1548 |
94+
| Kokoro-medium | 2124 | 1625 |
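One way to read these numbers is as a real-time factor: synthesis time divided by the duration of the produced audio. The sketch below is only an illustration of that arithmetic; `realTimeFactor` is a hypothetical helper, not part of the library, and the 2 s and 5 s durations are the bounds quoted in the paragraph above rather than measured values.

```typescript
// Hypothetical helper: ratio of synthesis time to audio duration.
// Values below 1 mean the audio is produced faster than it plays back.
function realTimeFactor(synthesisMs: number, audioSeconds: number): number {
  if (audioSeconds <= 0) {
    throw new RangeError('audioSeconds must be positive');
  }
  return synthesisMs / 1000 / audioSeconds;
}

// Kokoro-small on iPhone 17 Pro (2051 ms), for the shortest and longest
// audio lengths mentioned above (2 s and 5 s).
const worstCase = realTimeFactor(2051, 2); // roughly 1.03
const bestCase = realTimeFactor(2051, 5); // roughly 0.41
console.log(worstCase.toFixed(2), bestCase.toFixed(2));
```

Under these assumptions, a full `forward` call for such an input takes somewhere between about 0.4 and 1.0 times the playback length of the audio.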
95+
8596
## Text Embeddings
8697

8798
| Model | iPhone 17 Pro (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] |

docs/docs/02-benchmarks/memory-usage.md

Lines changed: 11 additions & 0 deletions
@@ -56,6 +56,17 @@ All the below benchmarks were performed on iPhone 17 Pro (iOS) and OnePlus 12 (A
5656
| ------------ | :--------------------: | :----------------: |
5757
| WHISPER_TINY | 410 | 375 |
5858

59+
## Text to Speech
60+
61+
| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |
62+
| ------------- | :--------------------: | :----------------: |
63+
| KOKORO_SMALL | 820 | 820 |
64+
| KOKORO_MEDIUM | 1140 | 1100 |
65+
66+
:::info
67+
The reported memory usage values include the memory footprint of the Phonemis package, which is used for phonemizing input text. Currently, this can range from 100 to 150 MB depending on the device.
68+
:::
69+
5970
## Text Embeddings
6071

6172
| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] |

docs/docs/02-benchmarks/model-size.md

Lines changed: 7 additions & 0 deletions
@@ -63,6 +63,13 @@ title: Model Size
6363
| WHISPER_SMALL_EN | 968 |
6464
| WHISPER_SMALL | 968 |
6565

66+
## Text to Speech
67+
68+
| Model | XNNPACK [MB] |
69+
| ------------- | :----------: |
70+
| KOKORO_SMALL | 329.6 |
71+
| KOKORO_MEDIUM | 334.4 |
72+
6673
## Text Embeddings
6774

6875
| Model | XNNPACK [MB] |
Lines changed: 217 additions & 0 deletions
@@ -0,0 +1,217 @@
1+
---
2+
title: useTextToSpeech
3+
keywords: [
4+
text to speech,
5+
tts,
6+
voice synthesizer,
7+
transcription,
8+
kokoro,
9+
react native,
10+
executorch,
11+
ai,
12+
machine learning,
13+
on-device,
14+
mobile ai,
15+
]
16+
description: "Learn how to use text-to-speech models in your React Native applications with React Native ExecuTorch's useTextToSpeech hook."
17+
---
18+
19+
Text to speech is a task that transforms written text into spoken language. It is commonly used to implement features such as voice assistants, accessibility tools, or audiobooks.
20+
21+
:::warning
22+
We recommend using the models we provide, which are available in our [Hugging Face repository](https://huggingface.co/software-mansion/react-native-executorch-kokoro). You can also use the [constants](https://github.com/software-mansion/react-native-executorch/blob/main/packages/react-native-executorch/src/constants/modelUrls.ts) shipped with our library.
23+
:::
24+
25+
## Reference
26+
27+
You can play the generated waveform in whatever way suits you best; the snippet below uses the react-native-audio-api library to play the synthesized speech.
28+
29+
```typescript
30+
import {
31+
useTextToSpeech,
32+
KOKORO_MEDIUM,
33+
KOKORO_VOICE_AF_HEART,
34+
} from 'react-native-executorch';
35+
import { AudioContext } from 'react-native-audio-api';
36+
37+
const model = useTextToSpeech({
38+
model: KOKORO_MEDIUM,
39+
voice: KOKORO_VOICE_AF_HEART,
40+
});
41+
42+
const audioContext = new AudioContext({ sampleRate: 24000 });
43+
44+
const handleSpeech = async (text: string) => {
45+
const speed = 1.0;
46+
const waveform = await model.forward(text, speed);
47+
48+
const audioBuffer = audioContext.createBuffer(1, waveform.length, 24000);
49+
audioBuffer.getChannelData(0).set(waveform);
50+
51+
const source = audioContext.createBufferSource();
52+
source.buffer = audioBuffer;
53+
source.connect(audioContext.destination);
54+
source.start();
55+
};
56+
```
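The snippet above passes 24000 as the sample rate in two places. As a small sketch, assuming the waveform is mono and sampled at 24 kHz as in that snippet, the same value converts a sample count into a playback duration. `KOKORO_SAMPLE_RATE` and `waveformDurationSeconds` are hypothetical names introduced here, not library exports.

```typescript
// Assumption: mono output at 24 kHz, matching the sample rate used in the snippet above.
const KOKORO_SAMPLE_RATE = 24000;

// Returns the playback length of a mono waveform in seconds.
function waveformDurationSeconds(
  waveform: Float32Array,
  sampleRate: number = KOKORO_SAMPLE_RATE
): number {
  if (sampleRate <= 0) {
    throw new RangeError('sampleRate must be positive');
  }
  return waveform.length / sampleRate;
}

// 36000 samples at 24 kHz correspond to 1.5 seconds of audio.
console.log(waveformDurationSeconds(new Float32Array(36000))); // 1.5
```

Keeping the rate in one constant also avoids a mismatch between the `AudioContext` and the buffers created for it.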
57+
58+
### Arguments
59+
60+
**`model`** (`KokoroConfig`) - Object specifying the source files for the Kokoro TTS model (duration predictor, synthesizer).
61+
62+
**`voice`** (`VoiceConfig`) - Object specifying the voice data and phonemizer assets (tagger and lexicon).
63+
64+
**`preventLoad?`** - Boolean that prevents the model from loading automatically when the hook runs.
65+
66+
For more information on loading resources, see the [loading models](../../01-fundamentals/02-loading-models.md) page.
67+
68+
### Returns
69+
70+
| Field | Type | Description |
71+
| ------------------ | --------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
72+
| `forward` | `(text: string, speed?: number) => Promise<Float32Array>` | Synthesizes a full text into speech. Returns a promise resolving to the full audio waveform as a `Float32Array`. |
73+
| `stream`           | `(input: TextToSpeechStreamingInput) => Promise<void>`    | Starts a streaming synthesis session. Takes a text input and callbacks to handle audio chunks as they are generated. Ideal for reducing the "time to first audio" for long sentences. |
74+
| `streamStop`       | `() => void`                                              | Stops the ongoing streaming process, if any.                                                                                                                                         |
75+
| `error` | `RnExecutorchError \| null` | Contains the error message if the model failed to load or synthesis failed. |
76+
| `isGenerating` | `boolean` | Indicates whether the model is currently processing a synthesis. |
77+
| `isReady` | `boolean` | Indicates whether the model has successfully loaded and is ready for synthesis. |
78+
| `downloadProgress` | `number` | Tracks the progress of the model and voice assets download process. |
79+
80+
<details>
81+
<summary>Type definitions</summary>
82+
83+
```typescript
84+
interface TextToSpeechStreamingInput {
85+
text: string;
86+
speed?: number;
87+
onBegin?: () => void | Promise<void>;
88+
onNext?: (chunk: Float32Array) => Promise<void> | void;
89+
onEnd?: () => Promise<void> | void;
90+
}
91+
92+
interface KokoroConfig {
93+
durationSource: ResourceSource;
94+
synthesizerSource: ResourceSource;
95+
}
96+
97+
interface VoiceConfig {
98+
voiceSource: ResourceSource;
99+
extra: {
100+
taggerSource: ResourceSource;
101+
lexiconSource: ResourceSource;
102+
};
103+
}
104+
```
105+
106+
</details>
107+
108+
## Running the model
109+
110+
The module provides two ways to generate speech:
111+
112+
1. **`forward(text, speed)`**: Generates the complete audio waveform at once. Returns a promise resolving to a `Float32Array`.
113+
114+
:::note
115+
Since it processes the entire text at once, it might take a significant amount of time to produce audio for long text inputs.
116+
:::
117+
118+
2. **`stream({ text, speed, onNext })`**: Synthesizes the audio chunk by chunk and passes each chunk to the `onNext` callback as soon as it is computed. Returns a promise resolving to `void`.
119+
This is ideal for reducing the "time to first audio" for long sentences.
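To illustrate the callback-based flow of `stream`, the sketch below gathers streamed chunks and joins them into a single waveform. It assumes that the promise returned by `stream` resolves only after the last chunk has been delivered. `StreamingInput` mirrors `TextToSpeechStreamingInput` from the type definitions, while `StreamFn`, `concatChunks`, `collectStream` and `fakeStream` are hypothetical names; `fakeStream` merely stands in for `tts.stream` so the snippet runs without a device.

```typescript
// Local copy of the TextToSpeechStreamingInput shape from the type definitions.
interface StreamingInput {
  text: string;
  speed?: number;
  onBegin?: () => void | Promise<void>;
  onNext?: (chunk: Float32Array) => Promise<void> | void;
  onEnd?: () => Promise<void> | void;
}

type StreamFn = (input: StreamingInput) => Promise<void>;

// Joins audio chunks into one contiguous waveform.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.length;
  }
  return out;
}

// Runs one streaming session and resolves with the complete waveform.
async function collectStream(
  stream: StreamFn,
  text: string,
  speed: number = 1.0
): Promise<Float32Array> {
  const chunks: Float32Array[] = [];
  await stream({
    text,
    speed,
    onNext: (chunk) => {
      chunks.push(chunk);
    },
  });
  return concatChunks(chunks);
}

// Hypothetical stand-in for tts.stream: emits two fixed chunks.
const fakeStream: StreamFn = async ({ onBegin, onNext, onEnd }) => {
  await onBegin?.();
  await onNext?.(new Float32Array([0.5, 0.25]));
  await onNext?.(new Float32Array([0.125]));
  await onEnd?.();
};

collectStream(fakeStream, 'Hello').then((waveform) => {
  console.log(waveform.length); // 3
});
```

In an app, `tts.stream` would be passed in place of `fakeStream`; collecting everything first gives up the latency benefit of streaming, so this pattern mainly suits caching or saving the result.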
120+
121+
## Example
122+
123+
### Speech Synthesis
124+
125+
```tsx
126+
import React from 'react';
127+
import { Button, View } from 'react-native';
128+
import {
129+
useTextToSpeech,
130+
KOKORO_MEDIUM,
131+
KOKORO_VOICE_AF_HEART,
132+
} from 'react-native-executorch';
133+
import { AudioContext } from 'react-native-audio-api';
134+
135+
export default function App() {
136+
const tts = useTextToSpeech({
137+
model: KOKORO_MEDIUM,
138+
voice: KOKORO_VOICE_AF_HEART,
139+
});
140+
141+
const generateAudio = async () => {
142+
const audioData = await tts.forward(
143+
'Hello world! This is a sample text.'
144+
);
145+
146+
// Playback example
147+
const ctx = new AudioContext({ sampleRate: 24000 });
148+
const buffer = ctx.createBuffer(1, audioData.length, 24000);
149+
buffer.getChannelData(0).set(audioData);
150+
151+
const source = ctx.createBufferSource();
152+
source.buffer = buffer;
153+
source.connect(ctx.destination);
154+
source.start();
155+
};
156+
157+
return (
158+
<View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
159+
<Button title="Speak" onPress={generateAudio} disabled={!tts.isReady} />
160+
</View>
161+
);
162+
}
163+
```
164+
165+
### Streaming Synthesis
166+
167+
```tsx
168+
import React, { useRef } from 'react';
169+
import { Button, View } from 'react-native';
170+
import {
171+
useTextToSpeech,
172+
KOKORO_MEDIUM,
173+
KOKORO_VOICE_AF_HEART,
174+
} from 'react-native-executorch';
175+
import { AudioContext } from 'react-native-audio-api';
176+
177+
export default function App() {
178+
const tts = useTextToSpeech({
179+
model: KOKORO_MEDIUM,
180+
voice: KOKORO_VOICE_AF_HEART,
181+
});
182+
183+
const contextRef = useRef(new AudioContext({ sampleRate: 24000 }));
184+
185+
const generateStream = async () => {
186+
const ctx = contextRef.current;
187+
188+
await tts.stream({
189+
text: "This is a longer text, which is being streamed chunk by chunk. Let's see how it works!",
190+
onNext: async (chunk) => {
191+
return new Promise((resolve) => {
192+
const buffer = ctx.createBuffer(1, chunk.length, 24000);
193+
buffer.getChannelData(0).set(chunk);
194+
195+
const source = ctx.createBufferSource();
196+
source.buffer = buffer;
197+
source.connect(ctx.destination);
198+
source.onEnded = () => resolve();
199+
source.start();
200+
});
201+
},
202+
});
203+
};
204+
205+
return (
206+
<View style={{ flex: 1, justifyContent: 'center', alignItems: 'center' }}>
207+
<Button title="Stream" onPress={generateStream} disabled={!tts.isReady} />
208+
</View>
209+
);
210+
}
211+
```
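The example above waits for each chunk to finish playing before `onNext` resolves. If you would rather queue chunks back to back on the audio clock, one possible approach is to compute a start time for each chunk and pass it to `source.start(when)`. The sketch below covers only that bookkeeping; `ChunkScheduler` is a hypothetical helper, and it assumes 24 kHz mono chunks as in the examples and a `start` method that accepts a start time in seconds, as in the Web Audio API.

```typescript
// Hypothetical helper: tracks where the previously queued chunk ends so that
// the next one can start exactly there (or immediately, if playback has caught up).
class ChunkScheduler {
  private nextStart = 0;
  private readonly sampleRate: number;

  constructor(sampleRate: number = 24000) {
    this.sampleRate = sampleRate;
  }

  // Returns the start time (in seconds) for a chunk with `chunkLength` samples,
  // given the current audio clock time `now`, and reserves its playback slot.
  schedule(chunkLength: number, now: number): number {
    const startAt = Math.max(now, this.nextStart);
    this.nextStart = startAt + chunkLength / this.sampleRate;
    return startAt;
  }
}

const scheduler = new ChunkScheduler();
console.log(scheduler.schedule(24000, 0)); // 0 (first chunk starts right away)
console.log(scheduler.schedule(12000, 0.2)); // 1 (queued right after the first, 1 s long chunk)
```

Inside `onNext`, this could be used as `source.start(scheduler.schedule(chunk.length, ctx.currentTime))` instead of waiting for `onEnded`.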
212+
213+
## Supported models
214+
215+
| Model | Language |
216+
| -------------------------------------------------------------------------------- | :------: |
217+
| [Kokoro](https://huggingface.co/software-mansion/react-native-executorch-kokoro) | English |
