
princ3kr/VQAModel


🖼️ Visual Question Answering (VQA) Model


A deep learning model for Visual Question Answering (VQA) that combines visual and textual understanding to answer questions about images. Built with PyTorch, it uses ResNet50 for image encoding and BERT for text encoding, and joins the two with a novel gated fusion mechanism.


📊 Model Performance

| Dataset    | Hard Accuracy | Soft Accuracy (VQA Standard) |
|------------|---------------|------------------------------|
| Validation | 49.90%        | 58.56%                       |

Note: Soft Accuracy uses the official VQA evaluation metric: min(#humans_who_gave_answer / 3, 1)


🏗️ Architecture

┌─────────────────┐     ┌─────────────────┐
│   Input Image   │     │  Input Question │
└────────┬────────┘     └────────┬────────┘
         │                       │
         ▼                       ▼
┌─────────────────┐     ┌─────────────────┐
│   ResNet50      │     │  BERT Encoder   │
│ (Image Encoder) │     │ (Text Encoder)  │
└────────┬────────┘     └────────┬────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────▼──────┐
              │  Attention  │
              │   Module    │
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │   Gated     │
              │   Fusion    │
              └──────┬──────┘
                     │
              ┌──────▼──────┐
              │ Classifier  │
              │  (FC Layer) │
              └──────┬──────┘
                     │
                     ▼
              ┌─────────────┐
              │   Answer    │
              └─────────────┘

Key Components

| Component     | Description                                 |
|---------------|---------------------------------------------|
| Image Encoder | ResNet50 (pretrained on ImageNet)           |
| Text Encoder  | BERT-base-uncased                           |
| Attention     | Visual attention guided by text features    |
| Fusion        | Gated fusion mechanism with dropout         |
| Classifier    | Fully connected layer (1000 answer classes) |
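
The contents of `models/attention.py` and `models/fusion.py` are not reproduced in this README, so the following is only a minimal sketch of one common way to build text-guided attention and a gated fusion in PyTorch. The class names, the 7x7 region grid, the gating formula, and the use of random tensors in place of real encoder outputs are all assumptions for illustration. The 2048 and 768 feature widths are the standard ResNet50 and BERT-base sizes.

```python
import torch
import torch.nn as nn


class TextGuidedAttention(nn.Module):
    """Weights image regions by relevance to the question (hypothetical design)."""

    def __init__(self, img_dim: int, txt_dim: int, hidden: int):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # regions: (batch, num_regions, img_dim); question: (batch, txt_dim)
        joint = torch.tanh(self.img_proj(regions) + self.txt_proj(question).unsqueeze(1))
        weights = torch.softmax(self.score(joint), dim=1)  # (batch, num_regions, 1)
        return (weights * regions).sum(dim=1)              # (batch, img_dim)


class GatedFusion(nn.Module):
    """Mixes both modalities with a learned sigmoid gate (hypothetical design)."""

    def __init__(self, img_dim: int, txt_dim: int, hidden: int, dropout: float = 0.5):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.gate = nn.Linear(2 * hidden, hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, image: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        v = torch.tanh(self.img_proj(image))
        q = torch.tanh(self.txt_proj(question))
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        # The gate decides, per feature, how much to take from each modality.
        return self.dropout(g * v + (1.0 - g) * q)


# Shape check with random tensors standing in for encoder outputs.
attention = TextGuidedAttention(img_dim=2048, txt_dim=768, hidden=1024)
fusion = GatedFusion(img_dim=2048, txt_dim=768, hidden=1024)
classifier = nn.Linear(1024, 1000)  # 1000 answer classes

regions = torch.randn(4, 49, 2048)  # a flattened 7x7 ResNet50 feature map
question = torch.randn(4, 768)      # one BERT sentence embedding per question
logits = classifier(fusion(attention(regions, question), question))
```

The hidden size of 1024 and dropout of 0.5 mirror the `fusion` values in `configs/default_config.yaml`; everything else may differ from the actual modules.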

📁 Project Structure

VQAModel/
├── configs/
│   └── default_config.yaml    # Hyperparameters & paths
├── data/
│   ├── vqa_dataset.py         # Dataset loader
│   └── answer_to_idx.json     # Answer vocabulary
├── models/
│   ├── vqa_model.py           # Main VQA model
│   ├── encoders.py            # Image & Text encoders
│   ├── fusion.py              # Gated fusion module
│   └── attention.py           # Attention mechanism
├── scripts/
│   ├── train.py               # Training script
│   ├── evaluate.py            # Evaluation script
│   ├── generate_submission.py # EvalAI submission generator
│   └── build_vocab.py         # Answer vocabulary builder
├── utils/
│   └── metrics.py             # VQA soft accuracy metric
├── checkpoints/               # Saved model weights
├── notebooks/                 # Jupyter notebooks
└── dataset/                   # VQA v2.0 dataset

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • CUDA-compatible GPU (recommended)
  • 16GB+ RAM

Installation

  1. Clone the repository

    git clone https://github.com/princ3kr/VQAModel.git
    cd VQAModel
  2. Install dependencies

    pip install -r requirements.txt
  3. Download VQA v2.0 Dataset

    Download the following from the VQA v2.0 website:

    • Training images (COCO 2014)
    • Validation images (COCO 2014)
    • Training questions
    • Validation questions
    • Training annotations
    • Validation annotations

    Place them in the dataset/coco2014/ directory following this structure:

    dataset/coco2014/
    ├── images/
    │   ├── train2014/
    │   ├── val2014/
    │   └── test2014/
    ├── questions/
    │   ├── OpenEnded_mscoco_train2014_questions.json
    │   ├── OpenEnded_mscoco_val2014_questions.json
    │   └── OpenEnded_mscoco_test2015_questions.json
    └── annotations/
        ├── mscoco_train2014_annotations.json
        └── mscoco_val2014_annotations.json
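
Before training, it can help to confirm that the layout matches the tree above. The following is a small sketch using only the standard library; the helper name `find_missing` is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_PATHS = [
    "images/train2014",
    "images/val2014",
    "images/test2014",
    "questions/OpenEnded_mscoco_train2014_questions.json",
    "questions/OpenEnded_mscoco_val2014_questions.json",
    "questions/OpenEnded_mscoco_test2015_questions.json",
    "annotations/mscoco_train2014_annotations.json",
    "annotations/mscoco_val2014_annotations.json",
]


def find_missing(root: str = "dataset/coco2014") -> list[str]:
    """Return the expected entries that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED_PATHS if not (base / rel).exists()]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing dataset entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("Dataset layout looks complete.")
```

Run it from the repository root so that the default `dataset/coco2014` path resolves correctly.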
    

🏋️ Training

Build Answer Vocabulary (First Time Only)

python scripts/build_vocab.py
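
`scripts/build_vocab.py` is not reproduced here. As a rough sketch of what such a step could look like, the snippet below assumes the vocabulary keeps the most frequent `multiple_choice_answer` values from the training annotations (1000 of them, to match `output_size: 1000`) and maps each to a class index, as the file name `answer_to_idx.json` suggests. The function name and the toy data are hypothetical.

```python
import json
from collections import Counter


def build_answer_vocab(annotations: list[dict], size: int = 1000) -> dict[str, int]:
    """Map the `size` most frequent answers to consecutive class indices."""
    counts = Counter(ann["multiple_choice_answer"] for ann in annotations)
    return {answer: idx for idx, (answer, _) in enumerate(counts.most_common(size))}


# Tiny in-memory stand-in for the "annotations" list of a VQA v2.0 annotation file.
sample = [
    {"multiple_choice_answer": "yes"},
    {"multiple_choice_answer": "no"},
    {"multiple_choice_answer": "yes"},
    {"multiple_choice_answer": "2"},
    {"multiple_choice_answer": "no"},
    {"multiple_choice_answer": "yes"},
]
vocab = build_answer_vocab(sample, size=2)
print(json.dumps(vocab))  # {"yes": 0, "no": 1}
```

Answers outside the vocabulary ("2" in this toy case) cannot be predicted by the classifier, which is one reason hard accuracy has an upper bound below 100%.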

Train the Model

python scripts/train.py

Configuration

Edit configs/default_config.yaml to customize training:

model:
  image_encoder:
    model_name: "resnet50"
    frozen: true
  text_encoder:
    model_name: "bert-base-uncased"
    frozen: false
  fusion:
    hidden_size: 1024
    dropout: 0.5
  output_size: 1000

training:
  batch_size: 16
  epochs: 10
  learning_rate: 0.0001
  save_dir: "checkpoints/"
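
To illustrate what the `frozen` flags and `learning_rate` typically mean in PyTorch, here is a minimal sketch. It assumes that `frozen: true` disables gradient updates for an encoder and that Adam (listed under Training Details) receives only the trainable parameters. The helper `set_frozen` and the small linear layers standing in for the real encoders are hypothetical; `scripts/train.py` may handle this differently.

```python
import torch
import torch.nn as nn


def set_frozen(module: nn.Module, frozen: bool) -> None:
    """Enable or disable gradient updates for every parameter of `module`."""
    for param in module.parameters():
        param.requires_grad = not frozen


# Small linear layers stand in for the real encoders to keep the sketch light.
image_encoder = nn.Linear(2048, 1024)
text_encoder = nn.Linear(768, 1024)

set_frozen(image_encoder, frozen=True)   # model.image_encoder.frozen: true
set_frozen(text_encoder, frozen=False)   # model.text_encoder.frozen: false

# Only parameters that still require gradients are handed to the optimizer.
trainable = [
    param
    for module in (image_encoder, text_encoder)
    for param in module.parameters()
    if param.requires_grad
]
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # training.learning_rate: 0.0001
```

Freezing ResNet50 while fine-tuning BERT reduces memory use and training time, at the cost of image features that cannot adapt to the VQA task.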

📈 Evaluation

Evaluate on Validation Set

python scripts/evaluate.py

Generate EvalAI Submission (Test Set)

python scripts/generate_submission.py

The submission file will be saved to checkpoints/vqa_submission.json.
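
The VQA challenge expects results as a JSON array of objects, each with a `question_id` and an `answer`. The sketch below shows how predictions could be serialized in that shape. The function names and the sample predictions are hypothetical, and the actual `generate_submission.py` may be organized differently.

```python
import json
import tempfile
from pathlib import Path


def build_records(predictions: dict[int, str]) -> list[dict]:
    """Turn {question_id: answer} into the list-of-records results format."""
    return [
        {"question_id": int(qid), "answer": answer}
        for qid, answer in sorted(predictions.items())
    ]


def write_submission(predictions: dict[int, str], path: Path) -> None:
    """Serialize the records as a single JSON array."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_records(predictions)), encoding="utf-8")


# Hypothetical predictions, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "vqa_submission.json"
    write_submission({1: "yes", 2: "2"}, target)
    loaded = json.loads(target.read_text(encoding="utf-8"))
```

Answers are always strings, even for counting questions, so a predicted count is written as `"2"` and not as the number `2`.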


📋 Results Interpretation

The model is evaluated with two accuracy metrics:

  • Hard Accuracy: Exact match with the most common ground-truth answer
  • Soft Accuracy (VQA Standard): min(#annotators_who_gave_answer / 3, 1)
    • If 0 annotators gave the predicted answer: 0%
    • If 1 annotator gave the predicted answer: 33.3%
    • If 2 annotators gave the predicted answer: 66.7%
    • If 3+ annotators gave the predicted answer: 100%
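
Both metrics can be written in a few lines. This is a simplified sketch rather than the code in `utils/metrics.py`: it assumes answers are already normalized strings and skips the answer preprocessing that the official evaluation script applies. The function names and the example answers are hypothetical.

```python
from collections import Counter


def soft_accuracy(prediction: str, human_answers: list[str]) -> float:
    """min(#annotators who gave `prediction` / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == prediction)
    return min(matches / 3.0, 1.0)


def hard_accuracy(prediction: str, human_answers: list[str]) -> float:
    """1.0 if `prediction` equals the most common human answer, else 0.0."""
    most_common_answer, _ = Counter(human_answers).most_common(1)[0]
    return float(prediction == most_common_answer)


# Ten annotator answers for a single question.
answers = ["red"] * 6 + ["dark red"] * 2 + ["maroon"] * 2

print(soft_accuracy("red", answers))       # 1.0 (six annotators agree)
print(soft_accuracy("dark red", answers))  # 0.666... (two annotators agree)
print(hard_accuracy("dark red", answers))  # 0.0 (most common answer is "red")
```

The example shows why soft accuracy is never lower than hard accuracy: a prediction that misses the majority answer can still earn partial credit when some annotators agree with it.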

🔧 Technical Details

Hardware Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM (training)
  • RAM: 16GB+ recommended
  • Storage: ~25GB for dataset

Training Details

  • Optimizer: Adam
  • Learning Rate: 1e-4
  • Batch Size: 16-32
  • Image Size: 224×224
  • Max Question Length: 30 tokens



📄 License

This project is licensed under the MIT License. See the LICENSE file for details.


🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📧 Contact

For questions or feedback, please open an issue on GitHub.


Made with ❤️ using PyTorch
