SC4001 Group Project: Qwen3-VL Fine-Grained Visual Recognition

This project investigates multimodal models' fine-grained visual recognition and classification capabilities using the Flowers102 dataset. We explore how supervised fine-tuning (SFT) and custom classification heads can progressively improve classification accuracy on challenging visual recognition tasks.

Full paper: ./report/paper.pdf

Research Overview

Primary Research Question: How can we enhance multimodal vision-language models' fine-grained visual classification capabilities?

Research Hypotheses:

  1. SFT Hypothesis: Supervised fine-tuning can improve a general-purpose multimodal model's fine-grained classification accuracy
  2. Classification Head Hypothesis: Adding a custom classification head can further improve classification accuracy beyond SFT alone
  3. Specialization Hypothesis: A more specialized (fine-tuned) base model will achieve better performance when combined with a classification head compared to using the general base model

Dataset: Flowers102 (102 flower categories, 7,169 training + 1,020 test samples) - A challenging fine-grained visual recognition benchmark

Experimental Results

Our experiments validate all three hypotheses, showing progressive improvements in classification accuracy:

| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-8B-Instruct (baseline) | 16.08% | - |
| Qwen3-VL-4B-Instruct (baseline) | 20.78% | +4.70% |
| InstructBLIP-Flan-T5-XL (baseline) | 21.18% | +0.40% |
| Idefics2-8B (baseline) | 22.65% | +1.47% |
| Qwen3-VL-4B + Classification Head | 64.60% | +43.82%* |
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | +8.92%* |
| Qwen3-VL-4B-SFT + Classification Head | 95.19% | +21.67%* |
| ResNet50 (baseline) | 93.24% | -1.95% |

*Starred values are measured against the preceding Qwen3-VL-4B configuration; unstarred values against the row above.

Key Findings:

  • βœ… Hypothesis 1 Validated: SFT dramatically improved accuracy from 20.78% to 73.52% (+254% relative improvement)
  • βœ… Hypothesis 2 Validated: Classification heads provide consistent improvements (base: +43.82%, SFT: +21.67%)
  • βœ… Hypothesis 3 Validated: Specialized model + classifier (95.19%) significantly outperforms base + classifier (64.60%)
  • CNN Comparison: The specialized model + classifier (95.19%) also outperforms a traditional CNN baseline, a fine-tuned ResNet50 (93.24%)
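The relative gains quoted above follow directly from the accuracy figures in the results table; a quick sanity check:

```python
# Accuracies (%) from the results table above.
baseline, sft, sft_head = 20.78, 73.52, 95.19

def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100

print(f"SFT vs baseline:  {relative_gain(sft, baseline):+.0f}%")   # ~ +254%
print(f"SFT+head vs SFT:  {relative_gain(sft_head, sft):+.0f}%")
```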

Models and Dataset

All trained models and processed datasets are available on Hugging Face for reproducing the experimental results:

Models

| Model | Description | Hugging Face ID | Accuracy |
|---|---|---|---|
| Base SFT Model | Fine-tuned Qwen3-VL on the flowers domain | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA | 73.52% |
| Base + Classifier | Base model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier | 64.60% |
| Fine-tuned + Classifier | Fine-tuned model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier | 95.19% |
| ResNet50 Baseline | Fine-tuned ResNet50 on the Flowers102 dataset | sukinggg/resnet50-flowers102-classifier | 93.24% |

Dataset

| Resource | Description | Hugging Face ID |
|---|---|---|
| Flowers102 Dataset | Processed dataset with prompts for all tasks (open-qa, closed-qa, closed-negative-qa, open-qa-mixcut) | oscarqjh/SC4001-flowers102 |

Quick Start

Prerequisites

git clone --recurse-submodules https://github.com/oscarqjh/SC4001-Group-Project.git
cd SC4001-Group-Project

# Install dependencies
uv venv -p 3.11
source .venv/bin/activate
uv pip install -e . -e ./extern/lmms-engine -e ./extern/lmms-eval

Option 1: Use Pre-trained Models (Recommended)

Skip to Step 7: Evaluate Trained Models to use our pre-trained models from Hugging Face.

Option 2: Reproduce Full Experiment

Follow Steps 1-7 to reproduce the complete training pipeline.

Step-by-Step Reproduction Guide

Step 1: Dataset Setup

Download and process the Flowers102 dataset:

# Download dataset
python ./scripts/download_dataset.py --output-dir ./data/flowers102

# Process and resize images
python ./scripts/process_dataset.py --resize 448 --output-dir ./data/flowers102

# Generate prompts for different tasks
python ./scripts/generate_prompt.py \
    --task all \
    --input "data/flowers102/flowers102.jsonl" \
    --output "data/flowers102/prompts" \
    --data_dir "data/flowers102"

# Split into train/test sets
./scripts/bash/split_all_datasets.sh

# Optional: Offline MixUp/CutMix Augmentation for ablation study
python scripts/apply_data_augmentation.py \
--technique both \
--input data/flowers102/prompts/train/flower-raw-open-qa.jsonl \
--output data/flowers102/prompts/train/flower-raw-open-qa-mixup-cutmix.jsonl \
--alpha 0.2 \
--sample-ratio 0.22 \
--seed 42 \
--combine-original \
--shuffle

# Convert to SFT message format
./scripts/bash/convert_all_to_messages.sh --formatter lmms_engine

Expected Result: Processed dataset with train/test splits in data/flowers102/prompts/

πŸ“– Detailed Guide: See docs/download_dataset.md, docs/process_dataset.md, and docs/generate_prompt.md for comprehensive documentation.
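The offline augmentation step above blends pairs of training images and their labels. A minimal NumPy sketch of the two underlying techniques (not the project's `apply_data_augmentation.py` implementation, which may differ in details such as sampling and label handling):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.2, rng=None):
    """MixUp: blend two images and their one-hot labels with a Beta-sampled weight."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * img_a + (1 - lam) * img_b, lam * label_a + (1 - lam) * label_b

def cutmix(img_a, img_b, label_a, label_b, alpha=0.2, rng=None):
    """CutMix: paste a random rectangle from img_b into img_a; labels mix by area."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h, w = img_a.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    out = img_a.copy()
    out[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]
    lam_adj = 1 - (y1 - y0) * (x1 - x0) / (h * w)  # label weight = kept area
    return out, lam_adj * label_a + (1 - lam_adj) * label_b
```

In both cases the mixed label remains a valid probability vector, since the blend weights sum to 1.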

Step 2: Baseline Evaluation

Evaluate the frozen base model performance:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh

Expected Result: Baseline performance metrics for comparison

πŸ“– Detailed Guide: See docs/eval_qwen3vl.md

Step 3: Supervised Fine-Tuning (SFT)

Fine-tune the base model on flowers domain to test Hypothesis 1:

CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh

Expected Result: Fine-tuned model saved to output/qwen3_vl_4b_open_qa_sft/ with significantly improved accuracy

πŸ“– Detailed Guide: See docs/multi_gpu_training.md
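The SFT stage consumes chat-style message records produced by the conversion script in Step 1. A hypothetical example of what one record might look like (the exact schema emitted by `convert_all_to_messages.sh` for lmms-engine may differ; the image path and wording here are illustrative):

```python
# Hypothetical shape of one SFT training record in a chat-messages format.
example = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "data/flowers102/images/image_00001.jpg"},
                {"type": "text", "text": "What species of flower is shown in this image?"},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "This is a pink primrose."}],
        },
    ]
}
```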

Step 4: Train Classification Heads

Train custom classification heads to test Hypotheses 2 & 3:

# Base model + custom classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_classifier_distributed.sh

# Fine-tuned model + custom classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/train_qwen3_vl_classifier_finetuned_distributed.sh

Expected Results:

  • Base classifier: output/qwen3_vl_4b_instruct_classifier/ (validates Hypothesis 2)
  • Fine-tuned classifier: output/qwen3_vl_finetuned_base_classifier/ (validates Hypothesis 3)

πŸ“– Detailed Guide: See docs/train_qwen3_vl_classifier_distributed.md and docs/ablation_study_guide.md
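Conceptually, the classifier configurations freeze the VLM backbone and train only a small head on top of its hidden states. A minimal PyTorch sketch of that pattern (the `backbone`, pooling choice, and `hidden_dim` here are assumptions, standing in for the actual Qwen3-VL encoder and the project's head implementation):

```python
import torch
import torch.nn as nn

class VLMClassifier(nn.Module):
    """Sketch: frozen feature backbone + trainable linear classification head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int, num_classes: int = 102):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the head is trained
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)   # (batch, seq, hidden)
        pooled = feats.mean(dim=1)     # mean-pool over the token dimension
        return self.head(pooled)       # (batch, num_classes) logits

# Toy stand-in backbone mapping 16-d token features to 32-d hidden states.
model = VLMClassifier(backbone=nn.Linear(16, 32), hidden_dim=32)
logits = model(torch.randn(4, 10, 16))
```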

Step 5: Fine-tune ResNet-50 CNN Baseline

Train the CNN classifier with the various augmentation modes by running the consolidated wrapper from the repository root:

bash scripts/bash/training/train_resnet_classifier.sh <aug_mode>

πŸ“– Detailed Guide: See docs/train_resnet_classifier.md

Step 6: Model Validation (Optional)

Validate your trained models before evaluation:

python scripts/diagnostic_test.py

πŸ“– Detailed Guide: See docs/diagnostic_test.md

Step 7: Evaluate Trained Models

Compare the performance to validate all three research hypotheses:

Using Local Models

# Base model + classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_4b_instruct_classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_finetuned_base_classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA

# Evaluate ResNet classifier (edit the configurations in this shell script)
bash ./scripts/bash/evaluation/eval_resnet_classifier.sh

Using Huggingface Models (Recommended)

# Base model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA

Expected Results: Progressive accuracy improvements validating the research hypotheses

πŸ“– Detailed Guide: See docs/evaluate_qwen3_vl_classifier_distributed.md
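The headline metric throughout is top-1 accuracy over the test split. A small stdlib sketch of computing it from a JSONL predictions file (the `prediction`/`label` field names are hypothetical; the actual evaluation output format is defined by the scripts above):

```python
import json

def top1_accuracy(pred_path: str) -> float:
    """Top-1 accuracy from a JSONL file of {"prediction": ..., "label": ...} records."""
    correct = total = 0
    with open(pred_path) as f:
        for line in f:
            rec = json.loads(line)
            correct += rec["prediction"] == rec["label"]
            total += 1
    return correct / total if total else 0.0
```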

Step 8 (Optional): Ablation Study on MixUp/CutMix Augmented SFT Dataset

# Fine-tune Qwen3-VL-4B on MixUp/CutMix augmented dataset
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh \
--dataset_path training-configs/mixcut_config.yaml \
--run_name qwen3_vl_4b_open_qa_mixcut_sft

# Build Model Locally
./scripts/bash/push_to_hf.sh \
--checkpoint-dir output/qwen3_vl_4b_open_qa_mixcut_sft \
--training-config training-configs/mixcut_config.yaml \
--local-deploy ./checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200 \
--use-latest 

# Run Evaluation with LMMs-Eval
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh \
--model_path checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200

Model performance for this ablation is shown in the table below:

| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | - |
| Qwen3-VL-4B-SFT-MixUp-CutMix | 66.27% | -7.25% |

Research Impact

This work contributes to understanding fine-grained visual recognition in multimodal models:

  1. Progressive Enhancement Strategy: Demonstrates a systematic approach to improving classification accuracy through SFT β†’ classification heads β†’ specialized base models

  2. Quantified Improvements: Shows concrete evidence that:

    • SFT provides the largest single improvement (+254% relative)
    • Classification heads offer consistent benefits across different base models
    • Model specialization amplifies classification head effectiveness
  3. Practical Applications: The methodology can be applied to other fine-grained recognition tasks beyond flowers (medical imaging, product classification, species identification, etc.)

Project Structure

β”œβ”€β”€ scripts/           # Training and evaluation scripts
β”‚   β”œβ”€β”€ bash/         # Convenient wrapper scripts
β”‚   └── *.py          # Core Python scripts
β”œβ”€β”€ src/              # Source code modules
β”‚   β”œβ”€β”€ models/       # Model implementations
β”‚   β”œβ”€β”€ datasets/     # Dataset handling
β”‚   └── evaluation/   # Evaluation utilities
β”œβ”€β”€ docs/             # Comprehensive documentation
β”œβ”€β”€ data/             # Dataset files
└── output/           # Trained models and results

Documentation

This project includes comprehensive documentation for every component:

For a complete list of documentation, see docs/scripts_documentation_audit.md.

Hardware Requirements

  • Recommended: 4x NVIDIA A100-SXM4-40GB for distributed training (used in this research)
  • Storage: ~50GB for dataset and models
  • RAM: 32GB+ system memory

Note: All experiments in this research were conducted using 4x NVIDIA A100-SXM4-40GB GPUs with distributed training configurations.

Citation

If you use this work, please cite:

@misc{sc4001-flowers102-finegrained-recognition,
  title={A Study on Multimodal Fine-Grained Visual Recognition and Classification on Oxford Flowers102},
  author={Oscar Qian and Suki Ng and Li You},
  year={2025},
  url={https://github.com/oscarqjh/SC4001-Group-Project}
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

All evaluations of large multimodal models (LMMs) in this project were performed using the lmms-eval framework: https://github.com/EvolvingLMMs-Lab/lmms-eval.

All supervised fine-tuning (SFT) of Qwen3-VL-Instruct models reported here was done using the lmms-engine training framework: https://github.com/EvolvingLMMs-Lab/lmms-engine.

Thanks to the NTU LMMs-Lab for these open-source tools which made the experiments reproducible and efficient.
