Description
It would be really cool to process the audio in real time using something like this https://github.com/NVIDIA/NeMo/blob/stable/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py
At the moment, even though the /transcribe endpoint is marked as async, it blocks the event loop until the self.transcribe() call completes. It also only accepts an entire file rather than a stream of audio bytes, so we can only process a whole file at once, and we probably need extra resources to read the full file into memory and hand it to the model.

If we could stream, say, 10 s at a time, we could probably transcribe longer files, and with something like the cache-aware approach linked above we might be able to do that without a significant degradation in performance.
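As a low-risk first step (independent of real streaming), the blocking call could be moved off the event loop. A minimal sketch, assuming the service is a FastAPI app and that `transcribe()` is a blocking method on some model wrapper; the names below are placeholders, not the project's actual classes:

```python
import asyncio
import tempfile

from fastapi import FastAPI, UploadFile

app = FastAPI()


class DummyTranscriber:
    """Stand-in for the real model wrapper; transcribe() is assumed to be blocking."""

    def transcribe(self, path: str) -> str:
        return f"(transcript of {path})"


transcriber = DummyTranscriber()


@app.post("/transcribe")
async def transcribe_endpoint(file: UploadFile):
    # Persist the upload to a temp file, since the current code expects a whole file.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        path = tmp.name

    # Offload the blocking inference call to a worker thread so the event
    # loop can keep serving other requests while the model runs.
    text = await asyncio.to_thread(transcriber.transcribe, path)
    return {"text": text}
```

This doesn't reduce memory use or enable streaming by itself, but it stops one long transcription from holding up every other request.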
Does this require changing the model to accept a stream of audio bytes instead of an entire file?
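One option that might avoid changing the model's input type is to cut the incoming byte stream into ~10 s windows on the server and hand each window to the model as an ordinary audio array. A rough sketch of the windowing side, assuming 16 kHz, 16-bit mono PCM; the format and the `iter_windows` helper are assumptions, not anything the repo currently has:

```python
from typing import Iterable, Iterator

import numpy as np

SAMPLE_RATE = 16_000          # assumed input sample rate
BYTES_PER_SAMPLE = 2          # 16-bit PCM
WINDOW_SECONDS = 10
WINDOW_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * WINDOW_SECONDS


def iter_windows(chunks: Iterable[bytes]) -> Iterator[np.ndarray]:
    """Accumulate arbitrary-size byte chunks and yield ~10 s float32 windows."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= WINDOW_BYTES:
            window, buf = buf[:WINDOW_BYTES], buf[WINDOW_BYTES:]
            pcm = np.frombuffer(bytes(window), dtype=np.int16)
            yield pcm.astype(np.float32) / 32768.0
    if buf:
        # Trailing partial window at end of stream.
        pcm = np.frombuffer(bytes(buf), dtype=np.int16)
        yield pcm.astype(np.float32) / 32768.0
```

Each yielded window could then be passed to whatever per-chunk inference we end up with; with the cache-aware approach from the linked NeMo script, the encoder state would be carried between windows, which is what should keep the chunk boundaries from costing much accuracy.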