A project-driven proseminar on applying machine learning techniques to audio and visual data. Topics include optimization, template matching, face detection, audio feature extraction, CNNs, and end-to-end multimedia analysis projects.
This course provided hands-on experience with modern ML workflows for analyzing audio and video data. Key technical components included:
- Data acquisition & preprocessing (images, audio, video)
- Optimization methods for model training
- Template matching and feature extraction
- Computer vision (OpenCV, face detection/recognition, CNNs)
- Audio analysis (spectrograms, MFCCs, librosa)
- Deep learning (PyTorch Lightning, pre-trained models)
- Weeks 1-4: Data pipelines, optimization fundamentals, text classification
- Weeks 5-8: Image processing (OpenCV), template matching, face detection/recognition
- Weeks 9-12: Audio analysis (librosa), spectrogram features, neural networks for audio
- Convolutional Neural Networks (CNNs) for image classification
- Transfer learning with pre-trained models (e.g., ResNet)
- Image captioning and multimodal analysis
- Audio classification pipelines
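
To make the CNN topics above concrete, here is a minimal sketch of what a single convolutional filter computes, written in plain NumPy rather than a deep learning framework. The names `conv2d_valid`, `toy_image`, and `edge_kernel` are illustrative and not taken from the course material, and the kernel is hand-picked, whereas a trained CNN would learn its kernels.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise product of the kernel with the patch under it, summed.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Keep positive responses and zero out the rest."""
    return np.maximum(x, 0.0)

# Toy 5x5 image: dark on the left, bright on the right (a vertical edge).
toy_image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hand-picked vertical-edge kernel (assumption for illustration only).
edge_kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

feature_map = relu(conv2d_valid(toy_image, edge_kernel))
print(feature_map)
```

The filter responds where the window straddles the dark-to-bright transition and gives zero where the window is uniformly bright. Deep learning libraries implement the same sliding-window operation far more efficiently and stack many such filters per layer.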
Machine Learning
- Model optimization (loss functions, gradient descent)
- Feature engineering for images/audio
- CNN architectures & transfer learning
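
As a small illustration of the loss-function and gradient-descent item above, the sketch below fits a line to synthetic data by minimizing mean squared error. The data, the learning rate, and the names `mse_loss` and `gradient_descent` are assumptions chosen for the example, not course code.

```python
import numpy as np

def mse_loss(w, b, x, y):
    """Mean squared error of the linear model w * x + b."""
    return np.mean((w * x + b - y) ** 2)

def gradient_descent(x, y, lr=0.1, steps=500):
    """Plain (full-batch) gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = w * x + b - y
        # Analytic gradients of the MSE with respect to w and b.
        grad_w = 2.0 * np.mean(residual * x)
        grad_b = 2.0 * np.mean(residual)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, noise-free data from a known line (slope 3.0, intercept 0.5).
x = np.linspace(-1.0, 1.0, 50)
y = 3.0 * x + 0.5

w_fit, b_fit = gradient_descent(x, y)
print(w_fit, b_fit, mse_loss(w_fit, b_fit, x, y))
```

Training a neural network follows the same loop, except that the gradients come from automatic differentiation and the updates are usually computed on mini-batches.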
Tools
- Python (NumPy, pandas, scikit-learn)
- TensorFlow/Keras, PyTorch Lightning
- OpenCV, librosa, ffmpeg
- Google Colab, Jupyter Notebooks
Data Types
- Image datasets (MNIST, facial recognition datasets)
- Audio waveforms & spectrograms
- Video processing fundamentals
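
The relationship between a waveform and its spectrogram can be sketched with a short-time Fourier transform in NumPy. This is a simplified stand-in for what audio libraries such as librosa provide. The frame length, hop size, test tone, and the function name `magnitude_spectrogram` are assumptions made for the example.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return STFT magnitudes with shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Cut the waveform into overlapping frames and taper each one.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # One real-input FFT per frame; keep only the magnitudes.
    return np.abs(np.fft.rfft(frames, axis=1))

sample_rate = 8000                           # Hz (assumed for the example)
t = np.arange(sample_rate) / sample_rate     # one second of samples
tone = np.sin(2 * np.pi * 1000 * t)          # synthetic 1 kHz sine wave

spec = magnitude_spectrogram(tone)
freqs = np.fft.rfftfreq(256, d=1 / sample_rate)
peak_hz = freqs[spec.mean(axis=0).argmax()]  # strongest frequency bin on average
print(spec.shape, peak_hz)
```

Each row of `spec` is one time frame and each column one frequency bin. Mel scaling and MFCCs are further transformations applied on top of a spectrogram like this one.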
| Project | Description | Technologies |
|---|---|---|
| Face Detection | Implemented Haar cascades & deep learning models for face detection and recognition | OpenCV, CNN |
| Audio Feature Extraction | Analyzed MFCCs & spectral features for audio classification | librosa, scipy |
| Image Classification | Built CNN pipelines for digit recognition | TensorFlow, Keras |
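
Haar cascades, as used in the Face Detection project, are built on Haar-like features: differences between pixel sums of adjacent rectangles, evaluated quickly through an integral image. The sketch below shows that building block in NumPy. It is not the project's code and not OpenCV's implementation, and the helper names `integral_image`, `rect_sum`, and `two_rect_feature` are hypothetical.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of img[top:top+height, left:left+width] using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window; `width` must be even."""
    half = width // 2
    return (rect_sum(ii, top, left, height, half)
            - rect_sum(ii, top, left + half, height, half))

# Toy 4x4 patch: bright left half, dark right half.
patch = np.zeros((4, 4))
patch[:, :2] = 1.0

ii = integral_image(patch)
feature = two_rect_feature(ii, top=0, left=0, height=4, width=4)
print(feature)
```

Because every rectangle sum costs four lookups regardless of its size, a detector can evaluate many such features over many windows cheaply, which is what makes cascade-based face detection fast.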
- Data-Centric ML: Preprocessing pipelines for unstructured data
- Model Deployment: Optimized inference for real-world constraints
- Multimodal Analysis: Combined audio/visual features in joint systems
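
One common way to combine audio and visual features is early fusion: scale each modality's feature vectors, then concatenate them before a shared classifier. The sketch below assumes that approach for illustration; the source does not state which fusion strategy the course projects used. The feature dimensions and random data are placeholders.

```python
import numpy as np

def standardize(features):
    """Scale each feature column to zero mean and unit variance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Guard against constant columns to avoid division by zero.
    return (features - mean) / np.where(std > 0, std, 1.0)

def early_fusion(audio_feats, visual_feats):
    """Concatenate per-sample audio and visual features after scaling."""
    return np.concatenate(
        [standardize(audio_feats), standardize(visual_feats)], axis=1
    )

rng = np.random.default_rng(0)
# Placeholder data: 8 clips, 13 audio features and 32 visual features each.
audio_feats = rng.normal(size=(8, 13))
visual_feats = rng.normal(loc=5.0, scale=20.0, size=(8, 32))

fused = early_fusion(audio_feats, visual_feats)
print(fused.shape)
```

Standardizing first keeps the modality with the larger numeric range from dominating the joint representation.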