Practical notebooks for shrinking, optimizing, and customizing audio AI models with the Hugging Face ecosystem.
- Inference with Perception Encoder for Audio-Video (PE-AV)
- Fine-tune Audio Flamingo 3
- Granite Speech 4.0 1b ASR
> [!NOTE]
> GitHub doesn't always render notebooks well. If you have trouble viewing them, try opening them in Colab using the links below.

| Category | Notebook | Description |
|---|---|---|
| ASR Fine-tuning | Fine-tune Whisper | Fine-tune Whisper on a custom language/domain using transformers + datasets |
| ASR Fine-tuning | Fine-tune Granite Speech Italian | Fine-tune IBM Granite Speech for Italian ASR with the YODAS-Granary dataset |
| Audio Captioning | Fine-tune Audio Flamingo 3 | Fine-tune Audio Flamingo 3 for audio captioning (full + LoRA) |
| ASR Fine-tuning | Fine-tune Parakeet | Fine-tune NVIDIA Parakeet CTC for speech recognition (full + LoRA) |
| ASR Fine-tuning | Fine-tune Voxtral ASR | Fine-tune Voxtral for ASR with prompt masking (full + LoRA) |
| Multimodal | Inference with PE-AV-Base | Zero-shot video classification and audio↔text retrieval (AudioCaps) with Meta's Perception Encoder for Audio-Video |
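
The "prompt masking" mentioned in the Voxtral row generally means excluding the instruction/prompt tokens from the training loss so that the model is only penalized on the transcript. The sketch below illustrates the idea in plain Python and is not taken from the notebook itself: the helper name `mask_prompt_labels`, its parameters, and the toy token ids are all hypothetical. The only fixed convention it relies on is `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def mask_prompt_labels(input_ids, prompt_len, pad_token_id=None):
    """Copy input_ids into labels, ignoring prompt and padding positions.

    Hypothetical helper for illustration; the notebook may do this differently.
    """
    if not 0 <= prompt_len <= len(input_ids):
        raise ValueError("prompt_len must be between 0 and len(input_ids)")

    labels = []
    for position, token_id in enumerate(input_ids):
        is_prompt = position < prompt_len
        is_padding = pad_token_id is not None and token_id == pad_token_id
        # Only transcript tokens keep their id and contribute to the loss.
        labels.append(IGNORE_INDEX if is_prompt or is_padding else token_id)
    return labels


# Toy sequence (made-up ids): 3 prompt tokens, 3 transcript tokens, 2 pad tokens.
example_ids = [101, 7, 8, 42, 43, 44, 0, 0]
example_labels = mask_prompt_labels(example_ids, prompt_len=3, pad_token_id=0)
print(example_labels)  # [-100, -100, -100, 42, 43, 44, -100, -100]
```

In a real training loop the same masking would typically happen inside a data collator, on batched tensors rather than Python lists.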
