Hello,
I would be interested in training an audio-only model (or perhaps a bimodal audio-text one) on CMU-MOSEI data.
I would be recomputing the audio embeddings.
So I would only need the links to the videos, plus the timestamps and the annotated emotions for each timestamp range.
How would I go about extracting this information?
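To make the question concrete, below is the kind of extraction I have in mind, using the CMU Multimodal SDK (mmsdk). The sequence name `CMU_MOSEI_Labels`, the label layout, and the assumption that the ids are YouTube video ids are guesses on my part from the docs, so please correct me if this is off:

```python
# Sketch only: pull timestamps and emotion annotations from the
# CMU-MOSEI labels computational sequence via the CMU Multimodal SDK.
from mmsdk import mmdatasdk

# Download just the labels sequence; I don't need the precomputed features,
# since I will recompute the audio embeddings myself.
recipe = dict(mmdatasdk.cmu_mosei.labels)  # assumed: {'CMU_MOSEI_Labels': <csd url>}
dataset = mmdatasdk.mmdataset(recipe, "cmumosei_labels/")

labels = dataset.computational_sequences["CMU_MOSEI_Labels"].data
for key, seq in labels.items():
    video_id = key.split("[")[0]  # assumed key format: "<video_id>" or "<video_id>[<segment>]"
    url = f"https://www.youtube.com/watch?v={video_id}"  # assuming the ids are YouTube ids
    for (start, end), feats in zip(seq["intervals"], seq["features"]):
        # assumed label layout: [sentiment, happy, sad, anger, surprise, disgust, fear]
        sentiment, emotions = feats[0], feats[1:]
        print(url, start, end, emotions)
```

Is this roughly the right approach, or is there a simpler way to get the video links and the per-segment emotion annotations?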
Thanks,
Ed