Hidden Markov Models (HMMs) and deep neural network models are used to convert audio into text.
The SpeechRecognition package converts speech into text and supports several speech recognition APIs; here I used the Google Speech Recognition API. For more details, please check https://pypi.org/project/SpeechRecognition/.
Audio file formats supported by SpeechRecognition: WAV, AIFF, AIFF-C, FLAC. I used a ‘wav’ file here.
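To make this concrete, here is a minimal sketch of transcribing an audio file with the SpeechRecognition package and the Google Speech Recognition API. The helper names `is_supported_audio` and `transcribe`, and the extension set, are my own assumptions for illustration (in particular, mapping AIFF and AIFF-C to `.aiff`, `.aif` and `.aifc`); they are not part of the library. It also assumes the package is installed (`pip install SpeechRecognition`) and that an internet connection is available.

```python
from pathlib import Path

# Assumed file extensions for the formats listed above (WAV, AIFF, AIFF-C, FLAC).
SUPPORTED_EXTENSIONS = {".wav", ".aiff", ".aif", ".aifc", ".flac"}


def is_supported_audio(path):
    """Return True if the file extension looks like a supported format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


def transcribe(path):
    """Transcribe an audio file to text using the Google Speech Recognition API."""
    # Imported here so the extension check above works without the package.
    import speech_recognition as sr

    if not is_supported_audio(path):
        raise ValueError(f"Unsupported audio format: {path}")

    recognizer = sr.Recognizer()

    # Read the whole file into an AudioData object.
    with sr.AudioFile(str(path)) as source:
        audio = recognizer.record(source)

    try:
        # Sends the audio to Google's web API and returns the transcript.
        return recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        # The speech could not be understood.
        return ""
    except sr.RequestError as exc:
        # The API was unreachable or rejected the request.
        raise RuntimeError(f"Speech recognition request failed: {exc}") from exc
```

With a file such as `sample.wav` (a placeholder name) in the working directory, `print(transcribe("sample.wav"))` should print the recognized text, or an empty string if the speech could not be understood.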