
AddictivelyRecursive/lightweight-multimodal-transformer-pipeline


Lightweight Multimodal Transformer Pipeline

This repository provides a comprehensive pipeline for building a high-performance, resource-efficient image understanding system. By fine-tuning state-of-the-art lightweight Vision Transformers (ViTs), specifically MobileViT and DeiT-Tiny, this project bridges the gap between accuracy and computational efficiency.

The resulting system goes beyond simple classification, offering a multi-faceted approach to image understanding including attribute prediction and cross-modal retrieval.

Key Capabilities

The fine-tuned models in this repository are capable of three distinct downstream tasks:

  1. Object Classification: Accurate categorization of test images into 10 specific everyday object classes.
  2. Visual Attribute Prediction: Extraction of semantic details and relevant visual attributes from images (e.g., color, texture, shape).
  3. Text-to-Image Retrieval: Retrieving the most relevant images based on short textual queries.
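At inference time, the retrieval task reduces to nearest-neighbor search in a shared embedding space: embed the query text, embed each image, and rank by cosine similarity. A minimal sketch of that ranking step (the embeddings below are toy values, not actual model outputs):

```python
import numpy as np

def retrieve(text_embedding, image_embeddings, top_k=3):
    """Rank images by cosine similarity to a text query embedding.

    `text_embedding` is a (d,) vector; `image_embeddings` is an (n, d)
    matrix. Returns the indices of the top_k most similar images.
    """
    t = text_embedding / np.linalg.norm(text_embedding)
    imgs = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = imgs @ t                       # cosine similarity per image
    return np.argsort(scores)[::-1][:top_k]

# Toy example: four "image" embeddings in a 3-dimensional space.
images = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
print(retrieve(query, images, top_k=2))    # → [0 2]
```

In the actual pipeline, these vectors would come from the fine-tuned image encoder and a matching text encoder; only the similarity-ranking logic is shown here.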

Datasets

The models were rigorously trained and evaluated on two distinct datasets to ensure robustness:

  • Primary Dataset (Open Source): A curated collection of ~600 self-collected images representing everyday objects.
  • Pooled Large-Scale Dataset: An aggregation of more than 11,000 images used to pre-train and stabilize the model weights (proprietary/unreleased).

Repository Structure

| Directory | Description |
| --- | --- |
| `./checkpoints` | Pre-trained weights: download and use the trained model checkpoints directly for inference without re-training. |
| `./assets` | Visual results and analysis: sample images illustrating classification outputs and attribute predictions on the validation set, along with embedding-space visualizations. |
| `./notebooks` | Training pipeline: complete Jupyter notebooks for fine-tuning MobileViT and DeiT-Tiny, including data loading, augmentation strategies, training loops, and hyperparameter configurations. |
| `./interface.ipynb` | Inference demo: a user-friendly, interactive interface for testing the classification model on new images. |

Performance & Evaluation

The ./notebooks directory contains detailed training logs, including:

  • Accuracy Charts: Visualization of training and validation accuracy over epochs.
  • Loss Curves: Tracking convergence stability.
  • Evaluation Metrics: Precision, Recall, and F1-score breakdowns for the fine-tuned models.
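The reported metrics follow the standard per-class definitions. A minimal sketch of how precision, recall, and F1 are computed for one class (the labels here are hypothetical, not the repository's actual class names):

```python
def precision_recall_f1(y_true, y_pred, cls):
    """Compute precision, recall, and F1 for one class from label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == cls and t != cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != cls and t == cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: two classes, five validation samples.
y_true = ["mug", "mug", "pen", "mug", "pen"]
y_pred = ["mug", "pen", "pen", "mug", "mug"]
print(precision_recall_f1(y_true, y_pred, "mug"))  # → (0.666..., 0.666..., 0.666...)
```

The notebooks report these per class and aggregated across the 10 object classes.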

Getting Started

To start using the classification interface:

  1. Clone the repository.
  2. Install dependencies (ensure torch, torchvision, and transformers are installed).
  3. Open interface.ipynb:
    jupyter notebook interface.ipynb
  4. Load a checkpoint from ./checkpoints and start classifying!
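Step 4 amounts to restoring a saved `state_dict` and running a forward pass. A hedged sketch using an in-memory buffer and a stand-in linear head, since the real checkpoints in `./checkpoints` hold full MobileViT/DeiT-Tiny weights and exact loading code lives in `interface.ipynb`:

```python
import io
import torch
import torch.nn as nn

# Hypothetical tiny classification head standing in for the real model.
model = nn.Linear(8, 10)                    # 10 everyday-object classes

# Save and reload a state_dict (here via an in-memory buffer rather
# than a file under ./checkpoints).
buf = io.BytesIO()
torch.save(model.state_dict(), buf)
buf.seek(0)

restored = nn.Linear(8, 10)
restored.load_state_dict(torch.load(buf))

# Inference: argmax over class logits for a dummy input feature vector.
x = torch.randn(1, 8)
pred = restored(x).argmax(dim=1)
print(int(pred))                            # predicted class index, 0–9
```

With a real checkpoint, the buffer would be replaced by the checkpoint path and the linear head by the fine-tuned ViT.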

This project demonstrates the power of lightweight transformers for edge-ready computer vision applications.

About

Lightweight multimodal transformer pipeline comparing MobileViT and DeiT-Tiny for image classification, attribute prediction, and image–text retrieval. Part of CS F425: Deep Learning, undertaken during our 7th Semester (Fall 2025) at BITS Pilani.
