This project began as part of the finals of the Defense Innovation Challenge at Shaastra 2020, IIT Madras.
Click here to see the problem statement
The competition is over, and I have continued this as a personal project. All the neural network architectures in this project are written from scratch using NumPy and PyTorch.
The given problem of translating an incoming speech signal into another language can be broken down into two subtasks:
- Speech Recognition – takes the input audio and outputs text in the same language in which it was spoken.
- Machine Translation – takes the output text of the speech recognition model and translates it into the required output language.
- I am using data-driven deep learning methods to build these systems; suitable speech-to-text and text-to-text datasets are available online.
- For the ASR system, I have implemented an attention-based encoder-decoder network, based on the paper Listen, Attend and Spell.
- The audio input is converted to a mel spectrogram, which is fed to the model.
- It is a character-based model, so the decoder outputs character sequences.
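The mel spectrogram conversion mentioned above can be sketched in plain NumPy. The window size, hop length, and 40-mel configuration below are illustrative assumptions, not the project's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Frame the signal, apply a Hann window, take the power FFT,
    # then project onto the mel filterbank and take the log.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # (frames, n_fft//2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T  # (frames, n_mels)
    return np.log(mel + 1e-10)

# One second of a 440 Hz tone sampled at 16 kHz.
t = np.linspace(0, 1, 16000, endpoint=False)
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (98, 40): 98 frames, 40 mel bands
```

In practice a library routine (e.g. torchaudio's `MelSpectrogram`) would be used instead; the sketch just shows what the model actually receives.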
- The translation model is also based on an encoder-decoder architecture with attention.
- It takes the output text from the ASR system and translates it into the selected output language.
- Before being fed to the model, the sentences are pre-processed, normalized and tokenized.
- The attention mechanism here helps the model translate longer sentences.
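The pre-processing step above can be sketched as follows. The `normalize` and `tokenize` helpers and the `<sos>`/`<eos>` markers are illustrative assumptions, not the project's actual code:

```python
import re
import unicodedata

def normalize(sentence):
    # Lowercase, strip accents, separate punctuation, collapse whitespace.
    s = unicodedata.normalize("NFD", sentence.lower().strip())
    s = "".join(c for c in s if unicodedata.category(c) != "Mn")
    s = re.sub(r"([.!?,])", r" \1", s)
    return re.sub(r"\s+", " ", s).strip()

def tokenize(sentence):
    # Wrap with start/end tokens so the decoder knows when to stop.
    return ["<sos>"] + normalize(sentence).split() + ["<eos>"]

print(tokenize("How are you?"))
# ['<sos>', 'how', 'are', 'you', '?', '<eos>']
```

The resulting tokens would then be mapped to integer indices via a vocabulary before being embedded by the encoder.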
- Implement CTC model
- Implement attention-based encoder-decoder model
- Add language model
- Add SpecAugment on input spectrograms
- Add beam search
- Implement joint CTC-Attention model
- Implement word- and subword-level models
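Among the items above, SpecAugment (Park et al.) amounts to zeroing out random frequency bands and time steps of the input spectrogram during training. A minimal NumPy sketch, with illustrative mask counts and sizes:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=2, F=8, T=10, rng=None):
    # Mask random frequency bands and time spans of a (time, n_mels) spectrogram.
    if rng is None:
        rng = np.random.default_rng(0)
    spec = spec.copy()  # leave the caller's array untouched
    n_t, n_f = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)               # mask width, up to F bins
        f0 = rng.integers(0, max(n_f - f, 1))    # mask start
        spec[:, f0:f0 + f] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, T + 1)               # mask length, up to T frames
        t0 = rng.integers(0, max(n_t - t, 1))
        spec[t0:t0 + t, :] = 0.0
    return spec

x = np.ones((100, 40))          # stand-in for a log-mel spectrogram
aug = spec_augment(x)
print(aug.shape)                # same shape as the input
```

Because it operates directly on the spectrogram, it plugs into the input pipeline without changing the model itself.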
- Thanks to Yash Patel for guiding me during this project.
- Alexander's End-to-end-ASR-Pytorch repository has been a great help while developing this project.
- Listen, Attend and Spell, W Chan et al.
- SpecAugment: A Simple Data Augmentation Method for Speech Recognition, Park et al.
- Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, A Graves et al.
- Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning, S Kim et al.

