A deep learning model for Visual Question Answering that combines visual and textual understanding to answer questions about images. Built with PyTorch, leveraging ResNet50 for image encoding and BERT for text encoding with a novel gated fusion mechanism.
| Dataset | Hard Accuracy | Soft Accuracy (VQA Standard) |
|---|---|---|
| Validation | 49.90% | 58.56% |
Note: Soft Accuracy uses the official VQA evaluation metric:
min(#humans_who_gave_answer / 3, 1)
    ┌─────────────────┐     ┌─────────────────┐
    │   Input Image   │     │ Input Question  │
    └────────┬────────┘     └────────┬────────┘
             │                       │
             ▼                       ▼
    ┌─────────────────┐     ┌─────────────────┐
    │    ResNet50     │     │  BERT Encoder   │
    │ (Image Encoder) │     │ (Text Encoder)  │
    └────────┬────────┘     └────────┬────────┘
             │                       │
             └───────────┬───────────┘
                         │
                  ┌──────▼──────┐
                  │  Attention  │
                  │   Module    │
                  └──────┬──────┘
                         │
                  ┌──────▼──────┐
                  │    Gated    │
                  │   Fusion    │
                  └──────┬──────┘
                         │
                  ┌──────▼──────┐
                  │ Classifier  │
                  │ (FC Layer)  │
                  └──────┬──────┘
                         │
                         ▼
                  ┌─────────────┐
                  │   Answer    │
                  └─────────────┘
| Component | Description |
|---|---|
| Image Encoder | ResNet50 (pretrained on ImageNet) |
| Text Encoder | BERT-base-uncased |
| Attention | Visual attention guided by text features |
| Fusion | Gated fusion mechanism with dropout |
| Classifier | Fully connected layer (1000 answer classes) |
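
The table does not spell out how the gate is computed. As a minimal sketch, assuming one common formulation (a sigmoid gate that interpolates between projected image and text features), the fusion and classifier stages might look like the following. The class name `GatedFusion`, the projection layers, and the input dimensions (2048 for pooled ResNet50 features, 768 for BERT-base) are illustrative assumptions, not the code in models/fusion.py, and the attention stage is left out.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Hypothetical gated fusion: a learned gate mixes image and text features."""

    def __init__(self, image_dim=2048, text_dim=768, hidden_size=1024, dropout=0.5):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden_size)
        self.text_proj = nn.Linear(text_dim, hidden_size)
        self.gate = nn.Linear(2 * hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, image_feat, text_feat):
        v = torch.tanh(self.image_proj(image_feat))  # projected image features
        q = torch.tanh(self.text_proj(text_feat))    # projected text features
        # Gate values in (0, 1) decide, per dimension, how much of each modality to keep.
        g = torch.sigmoid(self.gate(torch.cat([v, q], dim=-1)))
        fused = g * v + (1.0 - g) * q
        return self.dropout(fused)


fusion = GatedFusion()
classifier = nn.Linear(1024, 1000)   # 1000 answer classes, as in the table above

image_feat = torch.randn(4, 2048)    # stand-in for pooled ResNet50 output
text_feat = torch.randn(4, 768)      # stand-in for a BERT sentence embedding
logits = classifier(fusion(image_feat, text_feat))
print(logits.shape)                  # torch.Size([4, 1000])
```

The gate lets the model lean on the question for some feature dimensions and on the image for others, instead of simply concatenating or summing the two.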
    VQAModel/
    ├── configs/
    │   └── default_config.yaml      # Hyperparameters & paths
    ├── data/
    │   ├── vqa_dataset.py           # Dataset loader
    │   └── answer_to_idx.json       # Answer vocabulary
    ├── models/
    │   ├── vqa_model.py             # Main VQA model
    │   ├── encoders.py              # Image & Text encoders
    │   ├── fusion.py                # Gated fusion module
    │   └── attention.py             # Attention mechanism
    ├── scripts/
    │   ├── train.py                 # Training script
    │   ├── evaluate.py              # Evaluation script
    │   ├── generate_submission.py   # EvalAI submission generator
    │   └── build_vocab.py           # Answer vocabulary builder
    ├── utils/
    │   └── metrics.py               # VQA soft accuracy metric
    ├── checkpoints/                 # Saved model weights
    ├── notebooks/                   # Jupyter notebooks
    └── dataset/                     # VQA v2.0 dataset
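
The answer vocabulary in data/answer_to_idx.json maps each answer string to a class index. As a rough sketch of how such a mapping could be built, assuming the vocabulary keeps the most frequent `multiple_choice_answer` values from the VQA v2.0 annotations (the function name `build_answer_vocab` and the toy data are hypothetical, and scripts/build_vocab.py may work differently):

```python
import json
from collections import Counter


def build_answer_vocab(annotations, top_k=1000):
    """Map the top_k most frequent answers to consecutive class indices."""
    counts = Counter(a["multiple_choice_answer"] for a in annotations)
    return {answer: idx for idx, (answer, _) in enumerate(counts.most_common(top_k))}


# Toy annotations standing in for the "annotations" list of a VQA v2.0 file.
toy_annotations = (
    [{"multiple_choice_answer": "yes"}] * 3
    + [{"multiple_choice_answer": "no"}] * 2
    + [{"multiple_choice_answer": "2"}]
)

answer_to_idx = build_answer_vocab(toy_annotations, top_k=2)
print(json.dumps(answer_to_idx))  # {"yes": 0, "no": 1}
```

Answers outside the top `top_k` get no index, so questions whose ground truth falls outside the vocabulary cannot be answered correctly by the classifier.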
- Python 3.10+
- CUDA-compatible GPU (recommended)
- 16GB+ RAM
- Clone the repository

      git clone https://github.com/princ3kr/VQAModel.git
      cd VQAModel

- Install dependencies

      pip install -r requirements.txt

- Download VQA v2.0 Dataset

  Download the following from the VQA v2.0 website:
  - Training images (COCO 2014)
  - Validation images (COCO 2014)
  - Training questions
  - Validation questions
  - Training annotations
  - Validation annotations

  Place them in the dataset/coco2014/ directory following this structure:

      dataset/coco2014/
      ├── images/
      │   ├── train2014/
      │   ├── val2014/
      │   └── test2014/
      ├── questions/
      │   ├── OpenEnded_mscoco_train2014_questions.json
      │   ├── OpenEnded_mscoco_val2014_questions.json
      │   └── OpenEnded_mscoco_test2015_questions.json
      └── annotations/
          ├── mscoco_train2014_annotations.json
          └── mscoco_val2014_annotations.json
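
Before training, it can help to confirm that the dataset layout above is in place. The following is a small sketch of such a check; the helper name `missing_dataset_files` is hypothetical and not part of the repository, and the paths are the ones listed above.

```python
from pathlib import Path

# Paths expected under dataset/coco2014/, as listed in the structure above.
EXPECTED = [
    "images/train2014",
    "images/val2014",
    "images/test2014",
    "questions/OpenEnded_mscoco_train2014_questions.json",
    "questions/OpenEnded_mscoco_val2014_questions.json",
    "questions/OpenEnded_mscoco_test2015_questions.json",
    "annotations/mscoco_train2014_annotations.json",
    "annotations/mscoco_val2014_annotations.json",
]


def missing_dataset_files(root="dataset/coco2014"):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


missing = missing_dataset_files()
if missing:
    print("Missing dataset entries:")
    for rel in missing:
        print("  " + rel)
else:
    print("Dataset layout looks complete.")
```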
Build the answer vocabulary:

    python scripts/build_vocab.py

Train the model:

    python scripts/train.py

Edit configs/default_config.yaml to customize training:

    model:
      image_encoder:
        model_name: "resnet50"
        frozen: true
      text_encoder:
        model_name: "bert-base-uncased"
        frozen: false
      fusion:
        hidden_size: 1024
        dropout: 0.5
        output_size: 1000
    training:
      batch_size: 16
      epochs: 10
      learning_rate: 0.0001
      save_dir: "checkpoints/"

Evaluate the model:

    python scripts/evaluate.py

Generate a submission file for EvalAI:

    python scripts/generate_submission.py

The submission file will be saved to checkpoints/vqa_submission.json.
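
For orientation, VQA challenge submissions are normally a JSON list of `question_id`/`answer` pairs. The sketch below writes predictions in that shape; it assumes this standard results format, and the helper `write_submission` is hypothetical and not taken from scripts/generate_submission.py.

```python
import json
import tempfile
from pathlib import Path


def write_submission(predictions, out_path):
    """Write {question_id: answer} predictions as a list of result records."""
    results = [
        {"question_id": int(qid), "answer": str(answer)}
        for qid, answer in sorted(predictions.items())
    ]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as f:
        json.dump(results, f)
    return results


# Example with made-up question ids, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "checkpoints" / "vqa_submission.json"
    write_submission({42: "yes", 7: "2"}, target)
    written = json.loads(target.read_text(encoding="utf-8"))

print(written[0])  # {'question_id': 7, 'answer': '2'}
```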
The model is evaluated with two accuracy metrics:
- Hard Accuracy: Exact match with the most common ground-truth answer
- Soft Accuracy (VQA Standard): min(#annotators_who_gave_answer / 3, 1)
  - If 0 annotators gave the predicted answer: 0%
  - If 1 annotator gave the predicted answer: 33.3%
  - If 2 annotators gave the predicted answer: 66.7%
  - If 3+ annotators gave the predicted answer: 100%
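
Both metrics can be sketched in a few lines of Python. This follows the simplified formula stated above and skips the answer normalization that official evaluation tooling applies; the function names are illustrative and need not match utils/metrics.py.

```python
from collections import Counter


def soft_accuracy(predicted, human_answers):
    """min(#annotators who gave the predicted answer / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


def hard_accuracy(predicted, human_answers):
    """1.0 if the prediction equals the most common human answer, else 0.0."""
    most_common_answer = Counter(human_answers).most_common(1)[0][0]
    return 1.0 if predicted == most_common_answer else 0.0


# Ten annotator answers for one question: two say "cat", eight say "dog".
answers = ["cat", "cat"] + ["dog"] * 8

print(round(soft_accuracy("cat", answers), 3))  # 0.667
print(soft_accuracy("dog", answers))            # 1.0
print(hard_accuracy("cat", answers))            # 0.0
```

The example shows why soft accuracy is usually higher than hard accuracy: a minority answer such as "cat" still earns partial credit.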
- GPU: NVIDIA GPU with 8GB+ VRAM (training)
- RAM: 16GB+ recommended
- Storage: ~25GB for dataset
- Optimizer: Adam
- Learning Rate: 1e-4
- Batch Size: 16-32
- Image Size: 224×224
- Max Question Length: 30 tokens
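
As a minimal sketch of the optimizer settings above, the snippet below freezes one encoder and passes only the trainable parameters to Adam with a learning rate of 1e-4. The two `nn.Linear` layers are tiny stand-ins for the real encoders, and the wiring is an assumption, not the code in scripts/train.py.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the real encoders (hypothetical; the actual modules differ).
image_encoder = nn.Linear(8, 4)
text_encoder = nn.Linear(8, 4)

# Freeze the image encoder so its weights are not updated during training.
for param in image_encoder.parameters():
    param.requires_grad = False

model = nn.ModuleDict({"image": image_encoder, "text": text_encoder})

# Hand only the trainable parameters to Adam.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

num_trainable = sum(p.numel() for p in trainable)
print(num_trainable)  # 36 (text encoder only: 8*4 weights + 4 biases)
```

Freezing the image encoder keeps memory use and training time down, which matters on an 8GB GPU.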
- VQA: Visual Question Answering - Agrawal et al.
- Making the V in VQA Matter - Goyal et al.
- BERT: Pre-training of Deep Bidirectional Transformers - Devlin et al.
- Deep Residual Learning for Image Recognition - He et al.
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
For questions or feedback, please open an issue on GitHub.
Made with ❤️ using PyTorch