1 | | -import soundfile as sf |
2 | 1 | import torch |
3 | | -from transformers import Wav2Vec2ForCTC, Wav2Vec2Tokenizer |
| 2 | +from torch import Tensor |
4 | 3 | from torch.utils.mobile_optimizer import optimize_for_mobile |
| 4 | +import torchaudio |
| 5 | +from torchaudio.models.wav2vec2.utils.import_huggingface import import_huggingface_model |
| 6 | +from transformers import Wav2Vec2ForCTC |
5 | 7 |
6 | | -tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h") |
| 8 | +# The Wav2Vec2 model emits a sequence of probability distributions (logits) over the characters.
| 9 | +# The following class wraps it and adds greedy (best-path) decoding to produce the transcript.
| 10 | +class SpeechRecognizer(torch.nn.Module):
| 11 | +    def __init__(self, model):
| 12 | +        super().__init__()
| 13 | +        self.model = model
| 14 | +        self.labels = [
| 15 | +            "<s>", "<pad>", "</s>", "<unk>", "|", "E", "T", "A", "O", "N", "I", "H", "S",
| 16 | +            "R", "D", "L", "U", "M", "W", "C", "F", "G", "Y", "P", "B", "V", "K", "'", "X",
| 17 | +            "J", "Q", "Z"]
| 18 | +
| 19 | +    def forward(self, waveforms: Tensor) -> str:
| 20 | +        """Given single-channel speech data, return the transcription.
| 21 | +
| 22 | +        Args:
| 23 | +            waveforms (Tensor): Speech tensor. Shape `[1, num_frames]`.
| 24 | +
| 25 | +        Returns:
| 26 | +            str: The resulting transcript
| 27 | +        """
| 28 | +        logits, _ = self.model(waveforms)  # [batch, num_seq, num_label]
| 29 | +        best_path = torch.argmax(logits[0], dim=-1)  # [num_seq,]
| 30 | +        prev = ''
| 31 | +        hypothesis = ''
| 32 | +        for i in best_path:
| 33 | +            char = self.labels[i]
| 34 | +            if char == prev:  # collapse consecutive repeats of the same label
| 35 | +                continue
| 36 | +            if char == '<s>':  # label 0 doubles as the CTC blank; drop it and reset prev
| 37 | +                prev = ''
| 38 | +                continue
| 39 | +            hypothesis += char
| 40 | +            prev = char
| 41 | +        return hypothesis.replace('|', ' ')  # '|' marks word boundaries
| 42 | + |
| 43 | + |
| 44 | +# Load Wav2Vec2 pretrained model from Hugging Face Hub |
7 | 45 | model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h") |
8 | | -model.eval() |
9 | | - |
10 | | -audio_input, _ = sf.read("scent_of_a_woman_future.wav") |
11 | | -input_values = tokenizer(audio_input, return_tensors="pt").input_values |
12 | | -logits = model(input_values).logits |
13 | | -predicted_ids = torch.argmax(logits, dim=-1) |
14 | | -transcription = tokenizer.batch_decode(predicted_ids)[0] |
15 | | - |
16 | | -model_dynamic_quantized = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8) |
17 | | -traced_quantized_model = torch.jit.trace(model_dynamic_quantized, input_values, strict=False) |
18 | | -optimized_traced_quantized_model = optimize_for_mobile(traced_quantized_model) |
19 | | -optimized_traced_quantized_model.save("wav2vec2.pt") |
| 46 | +# Convert the model to torchaudio format, which supports TorchScript. |
| 47 | +model = import_huggingface_model(model) |
| 48 | +# Remove weight normalization, which is not supported by quantization.
| 49 | +model.encoder.transformer.pos_conv_embed.__prepare_scriptable__() |
| 50 | +model = model.eval() |
| 51 | +# Attach decoder |
| 52 | +model = SpeechRecognizer(model) |
| 53 | + |
| 54 | +# Apply quantization, script, and optimize for mobile
| 55 | +quantized_model = torch.quantization.quantize_dynamic( |
| 56 | +    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
| 57 | +scripted_model = torch.jit.script(quantized_model) |
| 58 | +optimized_model = optimize_for_mobile(scripted_model) |
| 59 | + |
| 60 | +# Sanity check |
| 61 | +waveform, _ = torchaudio.load('scent_of_a_woman_future.wav')
| 62 | +print(waveform.size()) |
| 63 | +print('Result:', optimized_model(waveform)) |
| 64 | + |
| 65 | +optimized_model.save("SpeechRecognition/wav2vec2.pt") |
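
For a quick end-to-end check outside this script, the exported TorchScript file can be reloaded and run on an arbitrary clip. The lines below are a minimal sketch, not part of the commit: they assume the export above succeeded, reuse the same sample file name, and resample only if the clip is not already 16 kHz (the rate wav2vec2-base-960h was trained on).

# Sketch: reload the exported model and transcribe a clip (assumes the export above ran).
import torch
import torchaudio

loaded = torch.jit.load("SpeechRecognition/wav2vec2.pt")
waveform, sample_rate = torchaudio.load('scent_of_a_woman_future.wav')
if sample_rate != 16000:
    # The model expects 16 kHz audio, so resample anything else.
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
print('Reloaded result:', loaded(waveform[:1]))  # keep one channel: shape [1, num_frames]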