
AddictivelyRecursive/lightweight-multimodal-transformer-pipeline


Lightweight Multimodal Transformer Pipeline

This repository provides a comprehensive pipeline for building a high-performance, resource-efficient image understanding system. By fine-tuning state-of-the-art lightweight Vision Transformers (ViTs), specifically MobileViT and DeiT-Tiny, this project bridges the gap between accuracy and computational efficiency.

The resulting system goes beyond simple classification, offering a multi-faceted approach to image understanding including attribute prediction and cross-modal retrieval.

Key Capabilities

The fine-tuned models in this repository are capable of three distinct downstream tasks:

  1. Object Classification: Accurate categorization of test images into 10 specific everyday object classes.
  2. Visual Attribute Prediction: Extraction of semantic details and relevant visual attributes from images (e.g., color, texture, shape).
  3. Text-to-Image Retrieval: Retrieving the most relevant images based on short textual queries.
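At inference time, the retrieval task reduces to nearest-neighbor search in a shared embedding space: embed the query text, embed each image, and rank by cosine similarity. A minimal sketch of that ranking step (the embeddings below are toy values, not actual model outputs):

```python
import numpy as np

def retrieve(text_embedding, image_embeddings, top_k=3):
    """Rank images by cosine similarity to a text query embedding.

    `text_embedding` is a (d,) vector; `image_embeddings` is an (n, d)
    matrix. Returns the indices of the top_k most similar images.
    """
    t = text_embedding / np.linalg.norm(text_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = imgs @ t                       # cosine similarity per image
    return np.argsort(scores)[::-1][:top_k]

# Toy example: four "image" embeddings in a 3-dimensional space.
images = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, images, top_k=2))    # → [0 2]
```

In the actual pipeline, these vectors would come from the fine-tuned image encoder and a matching text encoder; only the similarity-ranking logic is shown here.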

Datasets

The models were rigorously trained and evaluated on two distinct datasets to ensure robustness:

  • Primary Dataset (Open Source): A curated collection of ~600 self-collected images representing everyday objects.
  • Pooled Large-Scale Dataset: An aggregation of more than 11,000 images used to pre-train and stabilize the model weights (proprietary/unreleased).

Repository Structure

| Directory | Description |
| --- | --- |
| `./checkpoints` | Pre-trained weights: download and use the trained model checkpoints directly for inference without re-training. |
| `./assets` | Visual results and analysis: sample images illustrating classification outputs and attribute predictions on the validation set, along with embedding-space visualizations. |
| `./notebooks` | Training pipeline: complete Jupyter notebooks for fine-tuning MobileViT and DeiT-Tiny, including data loading, augmentation strategies, training loops, and hyperparameter configurations. |
| `./interface.ipynb` | Inference demo: a user-friendly, interactive interface for testing the classification model on new images. |

Performance & Evaluation

The ./notebooks directory contains detailed training logs, including:

  • Accuracy Charts: Visualization of training and validation accuracy over epochs.
  • Loss Curves: Tracking convergence stability.
  • Evaluation Metrics: Precision, Recall, and F1-score breakdowns for the fine-tuned models.
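The reported metrics follow the standard per-class definitions. A minimal sketch of how precision, recall, and F1 are computed for one class (the labels here are hypothetical, not the repository's actual class names):

```python
def precision_recall_f1(y_true, y_pred, cls):
    """Compute precision, recall, and F1 for one class from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != cls and t == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: two classes, five validation samples.
y_true = ["mug", "mug", "pen", "mug", "pen"]
y_pred = ["mug", "pen", "pen", "mug", "mug"]
print(precision_recall_f1(y_true, y_pred, "mug"))  # → (0.666..., 0.666..., 0.666...)
```

The notebooks report these per class and aggregated across the 10 object classes.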

Getting Started

To start using the classification interface:

  1. Clone the repository.
  2. Install dependencies (ensure torch, torchvision, and transformers are installed).
  3. Open interface.ipynb:
    jupyter notebook interface.ipynb
  4. Load a checkpoint from ./checkpoints and start classifying!
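Step 4 amounts to restoring a saved `state_dict` and running a forward pass. A hedged sketch using an in-memory buffer and a stand-in linear head, since the real checkpoints in `./checkpoints` hold full MobileViT/DeiT-Tiny weights and exact loading code lives in `interface.ipynb`:

```python
import io
import torch
import torch.nn as nn

# Hypothetical tiny classification head standing in for the real model.
model = nn.Linear(8, 10)                    # 10 everyday-object classes

# Save and reload a state_dict (here via an in-memory buffer rather
# than a file under ./checkpoints).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

restored = nn.Linear(8, 10)
restored.load_state_dict(torch.load(buf))

# Inference: argmax over class logits for a dummy input feature vector.
x = torch.randn(1, 8)
pred = restored(x).argmax(dim=1)
print(int(pred))                            # predicted class index, 0–9
```

With a real checkpoint, the buffer would be replaced by the checkpoint path and the linear head by the fine-tuned ViT.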

This project demonstrates the power of lightweight transformers for edge-ready computer vision applications.

About

Lightweight multimodal transformer pipeline comparing MobileViT and DeiT-Tiny for image classification, attribute prediction, and image–text retrieval. Part of CS F425: Deep Learning, undertaken during our 7th Semester (Fall 2025) at BITS Pilani.
