A collection of deep learning implementations covering transformer architectures, multimodal retrieval-augmented generation, parameter-efficient fine-tuning, and LLM-assisted dataset curation.
Building a modern Transformer-based language model from the ground up. This project implements all core components of the Qwen3 architecture (see the sketch after this list), including:
- Grouped Query Attention mechanism
- Root Mean Square Layer Normalization
- Feed Forward networks
- Key-Value caching for efficient inference
- Complete Transformer blocks
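For a flavor of the implementation, here is a minimal PyTorch sketch, not the repository's exact code, of two of these pieces: RMSNorm, and the key/value-head repetition at the heart of Grouped Query Attention.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root Mean Square Layer Normalization (no mean-centering, unlike LayerNorm)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-channel scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square over the feature dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Core of Grouped Query Attention: each key/value head is shared by a
    group of n_rep query heads, shrinking the KV cache by that factor."""
    batch, n_kv_heads, seq_len, head_dim = x.shape
    x = x[:, :, None, :, :].expand(batch, n_kv_heads, n_rep, seq_len, head_dim)
    return x.reshape(batch, n_kv_heads * n_rep, seq_len, head_dim)
```

With, say, 16 query heads sharing 4 KV heads, `n_rep` is 4 and the cached keys and values take a quarter of the memory that full multi-head attention would need.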
An end-to-end Retrieval-Augmented Generation pipeline that processes PDF documents containing both text and images. Features (an indexing sketch follows the list):
- Multimodal embeddings with Jina-CLIP
- Vector database storage with ChromaDB
- Image and text extraction from PDFs
- Question-answering with Phi-3-Vision
- Interactive chat interface
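The indexing step might look roughly like the sketch below. It assumes the `jinaai/jina-clip-v1` checkpoint's `encode_text`/`encode_image` helpers (loaded with `trust_remote_code=True`) and an in-memory ChromaDB client; file names and the collection name are placeholders.

```python
import chromadb
from transformers import AutoModel

# Sketch only: file names and the collection name are illustrative.
model = AutoModel.from_pretrained("jinaai/jina-clip-v1", trust_remote_code=True)

client = chromadb.Client()  # in-memory; PersistentClient(path=...) writes to disk
collection = client.create_collection(name="pdf_chunks")

texts = ["Figure 1 compares revenue across quarters."]  # text chunks from the PDF
image_paths = ["extracted/page3_img1.png"]              # images extracted from the PDF

# Jina-CLIP maps text and images into a shared embedding space.
text_embs = model.encode_text(texts)
image_embs = model.encode_image(image_paths)

collection.add(ids=["text-0"], embeddings=text_embs.tolist(), documents=texts)
collection.add(ids=["img-0"], embeddings=image_embs.tolist(),
               metadatas=[{"source": image_paths[0]}])

# At question time, embed the query and retrieve the nearest chunks of
# either modality; the retrieved text and images then go to Phi-3-Vision.
query_embs = model.encode_text(["What does the chart on page 3 show?"])
hits = collection.query(query_embeddings=query_embs.tolist(), n_results=3)
```

Because both modalities live in one vector space, a single query can surface a relevant figure even when the answer never appears in the document's text.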
Parameter-efficient fine-tuning of Vision Transformer models using Low-Rank Adaptation (LoRA) for food image classification. Features (a configuration sketch follows the list):
- LoRA integration reducing trainable parameters by 98.56%
- Vision Transformer (ViT) architecture
- Food101 dataset with 101 food categories
- Data augmentation pipeline
- Mixed precision training
- Experiment tracking with Weights & Biases
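A typical LoRA setup with Hugging Face PEFT looks like the sketch below; the rank, alpha, and target modules shown are illustrative choices, not necessarily the exact values behind the 98.56% figure.

```python
from peft import LoraConfig, get_peft_model
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=101,  # Food101 has 101 classes
)

config = LoraConfig(
    r=16,                               # rank of the low-rank update matrices
    lora_alpha=16,                      # scaling applied to the LoRA updates
    target_modules=["query", "value"],  # adapt the attention projections only
    lora_dropout=0.1,
    modules_to_save=["classifier"],     # the new head is trained in full
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
```

Freezing the base weights and training only the small adapter matrices (plus the new classification head) is what drives the reduction in trainable parameters.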
A pipeline for curating validation datasets from 216,930 Jeopardy questions to evaluate Named Entity Recognition (NER) algorithms. Features (a sampling sketch follows the list):
- LLM-based classification using Qwen3-4B-Instruct
- Stratified sampling maintaining category distribution
- Three linguistic challenge categories (numbers, non-English words, unusual proper nouns)
- GPU-accelerated batch processing with checkpointing
- Statistical analysis across the full dataset
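The stratified sampling step can be expressed compactly with pandas; the file name, column name, and target size below are hypothetical stand-ins for the pipeline's actual outputs.

```python
import pandas as pd

# Hypothetical file and column names for illustration.
df = pd.read_parquet("jeopardy_classified.parquet")  # 216,930 LLM-labeled rows

TARGET = 1_000  # desired validation-set size

# Draw from each challenge category in proportion to its share of the full
# dataset, so the validation split mirrors the original distribution.
validation = (
    df.groupby("category", group_keys=False)
      .apply(lambda g: g.sample(n=round(len(g) / len(df) * TARGET),
                                random_state=42))
)

print(validation["category"].value_counts(normalize=True))
```

Fixing `random_state` keeps the sample reproducible across runs, which matters when the upstream LLM labels are regenerated from checkpoints.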