Automatic speech recognition with speaker diarisation.
Based on:
- NVIDIA NeMo Parakeet TDT 0.6b V3: Multilingual Speech-to-Text Model for automatic speech recognition
- NVIDIA NeMo Sortformer Diarizer 4spk v1 for speaker diarisation
Linux:
sudo apt install ffmpeg
conda create -n nemoasr python=3.12
conda activate nemoasr
pip install git+https://github.com/HanBnrd/NeMoASR.git

macOS:
brew install ffmpeg
conda create -n nemoasr python=3.12
conda activate nemoasr
pip install git+https://github.com/HanBnrd/NeMoASR.git

To upgrade to the latest version:
pip install --upgrade git+https://github.com/HanBnrd/NeMoASR.git

To transcribe a WAV or MPEG file:
nemoasr myfile.mp3
Note: the first run may take a while because the models need to be downloaded.
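
For several recordings, the same command can be run once per file. The following is a minimal sketch in Python, assuming `nemoasr` is available on the PATH of the active conda environment. The helper names (`find_audio`, `build_command`, `transcribe_all`) and the list of file extensions are hypothetical and not part of NeMoASR.

```python
import subprocess
import sys
from pathlib import Path

# Extensions to pick up; a hypothetical choice, adjust to the files you have.
AUDIO_SUFFIXES = {".wav", ".mp3"}


def find_audio(folder):
    """Return the audio files found directly in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    )


def build_command(path):
    """Build the same command line as `nemoasr myfile.mp3`."""
    return ["nemoasr", str(path)]


def transcribe_all(folder):
    """Run nemoasr on each audio file in `folder`, one after the other."""
    for path in find_audio(folder):
        # check=True stops the loop at the first file that fails.
        subprocess.run(build_command(path), check=True)


if __name__ == "__main__":
    # Each command-line argument is treated as a folder of recordings.
    for folder in sys.argv[1:]:
        transcribe_all(folder)
```

Saved under a name of your choice (for example `batch.py`), it could be run as `python batch.py path/to/recordings`.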
The default configuration cuts long audio files into 7-minute chunks, which should work well on machines with limited RAM or VRAM. The chunk duration can be adjusted if needed. For example, with more RAM or VRAM:
nemoasr myfile.mp3 --max-duration=12
This cuts a long audio file into chunks of at most 12 minutes.
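
To get a feel for what the maximum chunk duration means for a given file, the arithmetic can be sketched as below. This is only an illustration under a simple assumption (consecutive chunks of the maximum length, with a shorter last chunk). NeMoASR may place its boundaries differently, and `chunk_bounds` is a hypothetical helper, not a function of the package.

```python
import math


def chunk_bounds(total_seconds, max_minutes=7):
    """Split [0, total_seconds] into consecutive chunks of at most max_minutes.

    Returns a list of (start, end) pairs in seconds.
    Assumption: fixed-length chunks with a shorter final chunk.
    """
    if total_seconds <= 0 or max_minutes <= 0:
        raise ValueError("duration and chunk length must be positive")
    max_len = max_minutes * 60
    n_chunks = math.ceil(total_seconds / max_len)
    bounds = []
    for i in range(n_chunks):
        start = i * max_len
        # The last chunk ends at the end of the file, so it may be shorter.
        end = min(start + max_len, total_seconds)
        bounds.append((start, end))
    return bounds


if __name__ == "__main__":
    half_hour = 30 * 60
    print(len(chunk_bounds(half_hour)))       # default of 7 minutes
    print(len(chunk_bounds(half_hour, 12)))   # as with --max-duration=12
```

Under this assumption, a 30-minute file gives 5 chunks with the default of 7 minutes and 3 chunks with a maximum of 12 minutes.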