This project investigates multimodal models' fine-grained visual recognition and classification capabilities using the Flowers102 dataset. We explore how supervised fine-tuning (SFT) and custom classification heads can progressively improve classification accuracy on challenging visual recognition tasks.
Full paper: ./report/paper.pdf
Primary Research Question: How can we enhance multimodal vision-language models' fine-grained visual classification capabilities?
Research Hypotheses:
- SFT Hypothesis: Supervised fine-tuning can help a general-purpose multimodal model increase its fine-grained classification accuracy
- Classification Head Hypothesis: Adding a custom classification head can further improve classification accuracy beyond SFT alone
- Specialization Hypothesis: A more specialized (fine-tuned) base model will achieve better performance when combined with a classification head compared to using the general base model
Dataset: Flowers102 (102 flower categories, 7,169 training + 1,020 test samples) - A challenging fine-grained visual recognition benchmark
Our experiments validate all three hypotheses, showing progressive improvements in classification accuracy:
| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-8B-Instruct (baseline) | 16.08% | - |
| Qwen3-VL-4B-Instruct (baseline) | 20.78% | +4.70% |
| InstructBLIP-Flan-T5-XL (baseline) | 21.18% | +0.40% |
| Idefics2-8B (baseline) | 22.65% | +1.47% |
| Qwen3-VL-4B + Classification Head | 64.60% | +43.82%* |
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | +8.92%* |
| Qwen3-VL-4B-SFT + Classification Head | 95.19% | +21.67%* |
| ResNet50 (baseline) | 93.24% | -1.95% |
Key Findings:
- ✅ Hypothesis 1 Validated: SFT dramatically improved accuracy from 20.78% to 73.52% (+254% relative improvement)
- ✅ Hypothesis 2 Validated: Classification heads provide consistent improvements (base: +43.82%, SFT: +21.67%)
- ✅ Hypothesis 3 Validated: Specialized model + classifier (95.19%) significantly outperforms base + classifier (64.60%)
- CNN Comparison: We also trained a traditional CNN (ResNet50) as a comparison baseline; the specialized model + classifier (95.19%) outperforms the CNN baseline (93.24%)
All trained models and processed datasets are available on Hugging Face for reproducing the experimental results:
| Model | Description | Hugging Face ID | Accuracy |
|---|---|---|---|
| Base SFT Model | Fine-tuned Qwen3-VL on flowers domain | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA | 73.52% |
| Base + Classifier | Base model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier | 64.60% |
| Fine-tuned + Classifier | Fine-tuned model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier | 95.19% |
| Baseline ResNet50 model | Fine-tuned ResNet50 on Flowers102 dataset | sukinggg/resnet50-flowers102-classifier | 93.04% |
| Resource | Description | Hugging Face ID |
|---|---|---|
| Flowers102 Dataset | Processed dataset with prompts for all tasks (open-qa, closed-qa, closed-negative-qa, open-qa-mixcut) | oscarqjh/SC4001-flowers102 |
```bash
git clone --recurse-submodules https://github.com/oscarqjh/SC4001-Group-Project.git
cd SC4001-Group-Project

# Install dependencies
uv venv -p 3.11
source .venv/bin/activate
uv pip install -e . -e ./extern/lmms-engine -e ./extern/lmms-eval
```

Skip to Step 6: Evaluate Models to use our pre-trained models from Hugging Face. Follow Steps 1-6 to reproduce the complete training pipeline.
Download and process the Flowers102 dataset:

```bash
# Download dataset
python ./scripts/download_dataset.py --output-dir ./data/flowers102

# Process and resize images
python ./scripts/process_dataset.py --resize 448 --output-dir ./data/flowers102

# Generate prompts for different tasks
python ./scripts/generate_prompt.py \
    --task all \
    --input "data/flowers102/flowers102.jsonl" \
    --output "data/flowers102/prompts" \
    --data_dir "data/flowers102"

# Split into train/test sets
./scripts/bash/split_all_datasets.sh

# Optional: Offline MixUp/CutMix augmentation for ablation study
python scripts/apply_data_augmentation.py \
    --technique both \
    --input data/flowers102/prompts/train/flower-raw-open-qa.jsonl \
    --output data/flowers102/prompts/train/flower-raw-open-qa-mixup-cutmix.jsonl \
    --alpha 0.2 \
    --sample-ratio 0.22 \
    --seed 42 \
    --combine-original \
    --shuffle

# Convert to SFT message format
./scripts/bash/convert_all_to_messages.sh --formatter lmms_engine
```

Expected Result: Processed dataset with train/test splits in `data/flowers102/prompts/`

📖 Detailed Guide: See `docs/download_dataset.md`, `docs/process_dataset.md`, and `docs/generate_prompt.md` for comprehensive documentation.
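Conceptually, the offline MixUp/CutMix step above blends pairs of training images using a Beta-sampled mixing coefficient (the same role as the `--alpha` flag). The following numpy sketch illustrates both operations on plain float image arrays; it is an illustrative sketch, not the exact logic of `apply_data_augmentation.py`:

```python
import numpy as np

def mixup(x1, x2, alpha=0.2, rng=None):
    """MixUp: convex combination of two images with lam ~ Beta(alpha, alpha).

    The returned lam is also used to mix the two labels: lam*y1 + (1-lam)*y2.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

def cutmix(x1, x2, alpha=0.2, rng=None):
    """CutMix: paste a random rectangle from x2 into x1.

    lam is adjusted to the actual fraction of x1's area that was kept.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h, w = x1.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x0, x1b = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = x1.copy()
    mixed[y0:y1, x0:x1b] = x2[y0:y1, x0:x1b]
    lam_adj = 1.0 - (y1 - y0) * (x1b - x0) / (h * w)
    return mixed, lam_adj
```

With `alpha=0.2` the Beta distribution concentrates near 0 and 1, so most augmented samples stay close to one of the two source images.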
Evaluate the frozen base model's performance:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh
```

Expected Result: Baseline performance metrics for comparison

📖 Detailed Guide: See `docs/eval_qwen3vl.md`
Fine-tune the base model on the flowers domain to test Hypothesis 1:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh
```

Expected Result: Fine-tuned model saved to `output/qwen3_vl_4b_open_qa_sft/` with significantly improved accuracy

📖 Detailed Guide: See `docs/multi_gpu_training.md`
Train custom classification heads to test Hypotheses 2 & 3:

```bash
# Base model + custom classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_classifier_distributed.sh

# Fine-tuned model + custom classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/train_qwen3_vl_classifier_finetuned_distributed.sh
```

Expected Results:
- Base classifier: `output/qwen3_vl_4b_instruct_classifier/` (validates Hypothesis 2)
- Fine-tuned classifier: `output/qwen3_vl_finetuned_base_classifier/` (validates Hypothesis 3)

📖 Detailed Guide: See `docs/train_qwen3_vl_classifier_distributed.md` and `docs/ablation_study_guide.md`
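At its core, a classification head like the ones trained here maps a pooled feature vector from the backbone to logits over the 102 flower classes. A minimal numpy sketch of a linear head with a softmax output (the hidden size and initialization here are illustrative; the actual head architecture is defined in the training scripts):

```python
import numpy as np

class LinearHead:
    """Linear classification head over pooled backbone features."""

    def __init__(self, hidden_dim, num_classes=102, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(hidden_dim, num_classes))
        self.b = np.zeros(num_classes)

    def forward(self, feats):
        # feats: (batch, hidden_dim) pooled features from the (frozen) backbone
        logits = feats @ self.W + self.b
        z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Forward a dummy batch of 4 pooled feature vectors (hidden size is hypothetical).
head = LinearHead(hidden_dim=2560)
probs = head.forward(np.random.default_rng(1).normal(size=(4, 2560)))
```

Training only this head keeps the backbone's representations fixed, which is why the quality of the base model (general vs. SFT-specialized) shows up directly in the final accuracy.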
Train the CNN classifier using various augmentation modes. Run the consolidated wrapper (from the repository root):

```bash
bash scripts/bash/training/train_resnet_classifier.sh <aug_mode>
```

📖 Detailed Guide: See `docs/train_resnet_classifier.md`
Validate your trained models before evaluation:

```bash
python scripts/diagnostic_test.py
```

📖 Detailed Guide: See `docs/diagnostic_test.md`
Compare the performance to validate all three research hypotheses.

Using locally trained checkpoints:

```bash
# Base model + classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_4b_instruct_classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_finetuned_base_classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA

# Evaluate ResNet classifier (edit the configurations in this shell script)
bash ./scripts/bash/evaluation/eval_resnet_classifier.sh
```

Using our pre-trained models from Hugging Face:

```bash
# Base model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA
```

Expected Results: Progressive accuracy improvements validating the research hypotheses

📖 Detailed Guide: See `docs/evaluate_qwen3_vl_classifier_distributed.md`
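The headline metric reported by these evaluation scripts is top-1 accuracy over the 1,020-image test split, which reduces to comparing predicted class indices against the labels. A minimal sketch of the metric itself (the scripts may additionally handle distributed gathering and logging):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Top-1 accuracy: fraction of samples whose argmax prediction matches the label."""
    preds = np.asarray(logits).argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())

# Toy example with 3 classes and 4 samples: three predictions are correct.
logits = [[2.0, 0.1, 0.3],
          [0.2, 1.5, 0.1],
          [0.9, 0.2, 0.1],
          [0.1, 0.1, 3.0]]
acc = top1_accuracy(logits, [0, 1, 2, 2])  # → 0.75
```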
```bash
# Fine-tune Qwen3-VL-4B on the MixUp/CutMix augmented dataset
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh \
    --dataset_path training-configs/mixcut_config.yaml \
    --run_name qwen3_vl_4b_open_qa_mixcut_sft

# Build the model locally
./scripts/bash/push_to_hf.sh \
    --checkpoint-dir output/qwen3_vl_4b_open_qa_mixcut_sft \
    --training-config training-configs/mixcut_config.yaml \
    --local-deploy ./checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200 \
    --use-latest

# Run evaluation with LMMs-Eval
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh \
    --model_path checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200
```

Our model's performance is shown in the table below:

| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | - |
| Qwen3-VL-4B-SFT-MixUp-CutMix | 66.27% | -7.25% |
This work contributes to understanding fine-grained visual recognition in multimodal models:

- Progressive Enhancement Strategy: Demonstrates a systematic approach to improving classification accuracy through SFT → classification heads → specialized base models
- Quantified Improvements: Shows concrete evidence that:
  - SFT provides the largest single improvement (+254% relative)
  - Classification heads offer consistent benefits across different base models
  - Model specialization amplifies classification head effectiveness
- Practical Applications: The methodology can be applied to other fine-grained recognition tasks beyond flowers (medical imaging, product classification, species identification, etc.)
```
├── scripts/        # Training and evaluation scripts
│   ├── bash/       # Convenient wrapper scripts
│   └── *.py        # Core Python scripts
├── src/            # Source code modules
│   ├── models/     # Model implementations
│   ├── datasets/   # Dataset handling
│   └── evaluation/ # Evaluation utilities
├── docs/           # Comprehensive documentation
├── data/           # Dataset files
└── output/         # Trained models and results
```
This project includes comprehensive documentation for every component:

- Training: `docs/train_qwen3_vl_classifier_distributed.md`, `docs/train_resnet_classifier.md`
- Evaluation: `docs/evaluate_qwen3_vl_classifier_distributed.md`
- Data Processing: `docs/process_dataset.md`
- Model Upload: `docs/huggingface_upload_evaluation.md`
- Troubleshooting: `docs/diagnostic_test.md`
- Complete Guide: `docs/ablation_study_guide.md`

For a complete list of documentation, see `docs/scripts_documentation_audit.md`.
- Recommended: 4x NVIDIA A100-SXM4-40GB for distributed training (used in this research)
- Storage: ~50GB for dataset and models
- RAM: 32GB+ system memory
Note: All experiments in this research were conducted using 4x NVIDIA A100-SXM4-40GB GPUs with distributed training configurations.
If you use this work, please cite:

```bibtex
@misc{sc4001-flowers102-finegrained-recognition,
  title={A Study on Multimodal Fine-Grained Visual Recognition and Classification on Oxford Flowers102},
  author={Oscar Qian and Suki Ng and Li You},
  year={2025},
  url={https://github.com/oscarqjh/SC4001-Group-Project}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
All evaluations of large multimodal models (LMMs) in this project were performed using the lmms-eval framework: https://github.com/EvolvingLMMs-Lab/lmms-eval.
All supervised fine-tuning (SFT) of Qwen3-VL-Instruct models reported here was done using the lmms-engine training framework: https://github.com/EvolvingLMMs-Lab/lmms-engine.
Thanks to the NTU LMMs-Lab for these open-source tools, which made the experiments reproducible and efficient.