# Speech Recognition on iOS with Wav2Vec2

## Introduction

Facebook AI's [wav2vec 2.0](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec) is one of the leading models in speech recognition. It is also available in the [Huggingface Transformers](https://github.com/huggingface/transformers) library, which is used in another PyTorch iOS demo app as well, [Question Answering](https://github.com/pytorch/ios-demo-app/tree/master/QuestionAnswering).

In this demo app, we'll show how to quantize, trace, and optimize the wav2vec2 model for mobile, and how to use the converted model in an iOS demo app to perform speech recognition.

## Prerequisites

* PyTorch 1.8.0/1.8.1 (optional; only needed if you create the model yourself)
* Python 3.8 or above (optional; only needed if you create the model yourself)
* iOS PyTorch pod library 1.8.0
* Xcode 12 or later

## Quick Start

### 1. Prepare the Model

First, run the following commands in a Terminal:
```
git clone https://github.com/pytorch/ios-demo-app
cd ios-demo-app/SpeechRecognition
```

If you don't have PyTorch 1.8.1 installed or want a quick try of the demo app, you can download the quantized, scripted wav2vec2 model file [here](https://drive.google.com/file/d/1RcCy3K3gDVN2Nun5IIdDbpIDbrKD-XVw/view?usp=sharing), drag and drop it into the project, and continue to Step 2.

Be aware that the downloadable model file was created with PyTorch 1.8, matching the iOS LibTorch library 1.8.0 specified in the `Podfile`. If you use a different version of PyTorch to create your model by following the instructions below, make sure you specify the same iOS LibTorch version in the `Podfile` to avoid errors caused by a version mismatch. Furthermore, if you want to use the latest prototype features in the PyTorch master branch to create the model, follow the steps at [Building PyTorch iOS Libraries from Source](https://pytorch.org/mobile/ios/#build-pytorch-ios-libraries-from-source) to learn how to use such a model in iOS.
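
For reference, the relevant `Podfile` entry looks like the snippet below; the target name and version shown here are assumptions and should match what the demo's `Podfile` already contains:

```ruby
# Keep the LibTorch pod version in sync with the PyTorch version
# used to create wav2vec2.pt.
target 'SpeechRecognition' do
  pod 'LibTorch', '~>1.8.0'
end
```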

With PyTorch 1.8.1 installed, first install the `soundfile` package by running `pip install pysoundfile`, then install the Huggingface `transformers` library by running `pip install transformers` (version 4.4.2 has been tested). Finally, run `python create_wav2vec2.py`, which creates `wav2vec2.pt` in the project folder. [Dynamic quantization](https://pytorch.org/tutorials/intermediate/dynamic_quantization_bert_tutorial.html) is used to quantize the model to reduce its size.
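
If you're curious what the conversion involves, the sketch below shows the general flow of quantizing, tracing, and optimizing the model. It is a minimal illustration, not the exact contents of `create_wav2vec2.py`; the wrapper module and the `facebook/wav2vec2-base-960h` checkpoint name are assumptions:

```python
# A minimal sketch of the conversion flow; see create_wav2vec2.py in the
# repo for the actual script. The wrapper and checkpoint are assumptions.
import soundfile as sf
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile
from transformers import Wav2Vec2ForCTC

class Wav2Vec2Wrapper(torch.nn.Module):
    # Wrap the Huggingface model so tracing returns a plain logits tensor
    # instead of a model-output object.
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, waveform):
        return self.model(waveform).logits

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# Dynamic quantization: Linear layer weights are stored as int8 and
# dequantized on the fly, shrinking the saved model considerably.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# Trace with the ~6-second sample clip, which fixes the input length
# the traced model expects on device.
audio, _ = sf.read("scent_of_a_woman_future.wav")
waveform = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)
traced = torch.jit.trace(Wav2Vec2Wrapper(quantized), waveform)

optimize_for_mobile(traced).save("wav2vec2.pt")
```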

Note that the sample `scent_of_a_woman_future.wav` file used to trace the model is about 6 seconds long, so 6 seconds is the limit of the recorded audio for speech recognition in the demo app. If your speech is shorter than 6 seconds, padding is applied in the iOS code to make the model work correctly.
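
The actual padding happens in the app's Swift code; the Python snippet below only illustrates the idea, and the 16 kHz sample rate is an assumption based on what wav2vec2 expects:

```python
# Illustration of the padding idea (the demo does this in Swift).
# 16 kHz is assumed, matching wav2vec2's expected sample rate.
import torch

SAMPLE_RATE = 16000
TARGET_LEN = 6 * SAMPLE_RATE  # the model was traced on a ~6-second clip

def pad_to_six_seconds(waveform):
    # Zero-pad shorter recordings on the right up to the traced length.
    shortfall = TARGET_LEN - waveform.size(-1)
    return torch.nn.functional.pad(waveform, (0, max(shortfall, 0)))
```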

### 2. Use LibTorch

Run the commands below:

```
cd SpeechRecognition
pod install
open SpeechRecognition.xcworkspace/
```

### 3. Build and run with Xcode

After the app runs, tap the Start button and start saying something; after 6 seconds, the model runs inference to recognize your speech. Only basic decoding of the recognition result, an array of floating-point logits, into a list of tokens is provided in this demo app, but even without further post-processing it is easy to see whether the model can recognize your utterances.
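
The decoding in the app is done in Swift; the Python snippet below sketches the same greedy CTC idea. The blank token `<pad>` and the word delimiter `|` are assumptions based on the wav2vec2-base-960h vocabulary, not necessarily what the demo's code uses:

```python
# A rough sketch of greedy CTC decoding from logits to text; the blank
# token "<pad>" and delimiter "|" follow the wav2vec2-base-960h
# vocabulary convention, which is an assumption here.
import torch

def greedy_decode(logits, vocab):
    # logits: (time, vocab_size). Pick the best token per frame,
    # collapse consecutive repeats, then drop CTC blank tokens.
    ids = torch.argmax(logits, dim=-1).tolist()
    collapsed = [cur for cur, prev in zip(ids, [None] + ids) if cur != prev]
    tokens = [vocab[i] for i in collapsed if vocab[i] != "<pad>"]
    return "".join(tokens).replace("|", " ").strip()
```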

Some example results are as follows: