This project investigates multimodal models' fine-grained visual recognition and classification capabilities using the Flowers102 dataset. We explore how supervised fine-tuning (SFT) and custom classification heads can progressively improve classification accuracy on challenging visual recognition tasks.
Full paper: ./report/paper.pdf
Primary Research Question: How can we enhance multimodal vision-language models' fine-grained visual classification capabilities?
Research Hypotheses:
- SFT Hypothesis: Supervised fine-tuning can help a general-purpose multimodal model increase its fine-grained classification accuracy
- Classification Head Hypothesis: Adding a custom classification head can further improve classification accuracy beyond SFT alone
- Specialization Hypothesis: A more specialized (fine-tuned) base model will achieve better performance when combined with a classification head compared to using the general base model
Dataset: Flowers102 (102 flower categories, 7,169 training + 1,020 test samples) - A challenging fine-grained visual recognition benchmark
Our experiments validate all three hypotheses, showing progressive improvements in classification accuracy:
| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-8B-Instruct (baseline) | 16.08% | - |
| Qwen3-VL-4B-Instruct (baseline) | 20.78% | +4.70% |
| InstructBLIP-Flan-T5-XL (baseline) | 21.18% | +0.40% |
| Idefics2-8B (baseline) | 22.65% | +1.47% |
| Qwen3-VL-4B + Classification Head | 64.60% | +43.82%* |
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | +8.92%* |
| Qwen3-VL-4B-SFT + Classification Head | 95.19% | +21.67%* |
| ResNet50 (baseline) | 93.24% | -1.95% |
Key Findings:
- ✅ Hypothesis 1 Validated: SFT dramatically improved accuracy from 20.78% to 73.52% (+254% relative improvement)
- ✅ Hypothesis 2 Validated: Classification heads provide consistent improvements (base: +43.82%, SFT: +21.67%)
- ✅ Hypothesis 3 Validated: Specialized model + classifier (95.19%) significantly outperforms base + classifier (64.60%)
- CNN Comparison: We also trained a traditional CNN (ResNet50) as a comparison baseline; the specialized model + classifier (95.19%) outperforms the CNN baseline (93.24%)
All trained models and processed datasets are available on Hugging Face for reproducing the experimental results:
| Model | Description | Hugging Face ID | Accuracy |
|---|---|---|---|
| Base SFT Model | Fine-tuned Qwen3-VL on flowers domain | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA | 73.52% |
| Base + Classifier | Base model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier | 64.60% |
| Fine-tuned + Classifier | Fine-tuned model with classification head | oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier | 95.19% |
| Baseline ResNet50 model | Fine-tuned ResNet50 on Flowers102 dataset | sukinggg/resnet50-flowers102-classifier | 93.04% |
| Resource | Description | Hugging Face ID |
|---|---|---|
| Flowers102 Dataset | Processed dataset with prompts for all tasks (open-qa, closed-qa, closed-negative-qa, open-qa-mixcut) | oscarqjh/SC4001-flowers102 |
```bash
git clone --recurse-submodules https://github.com/oscarqjh/SC4001-Group-Project.git
cd SC4001-Group-Project

# Install dependencies
uv venv -p 3.11
source .venv/bin/activate
uv pip install -e . -e ./extern/lmms-engine -e ./extern/lmms-eval
```

Skip to Step 6: Evaluate Models to use our pre-trained models from Hugging Face. Follow Steps 1-6 to reproduce the complete training pipeline.
Download and process the Flowers102 dataset:

```bash
# Download dataset
python ./scripts/download_dataset.py --output-dir ./data/flowers102

# Process and resize images
python ./scripts/process_dataset.py --resize 448 --output-dir ./data/flowers102

# Generate prompts for different tasks
python ./scripts/generate_prompt.py \
    --task all \
    --input "data/flowers102/flowers102.jsonl" \
    --output "data/flowers102/prompts" \
    --data_dir "data/flowers102"

# Split into train/test sets
./scripts/bash/split_all_datasets.sh

# Optional: Offline MixUp/CutMix augmentation for ablation study
python scripts/apply_data_augmentation.py \
    --technique both \
    --input data/flowers102/prompts/train/flower-raw-open-qa.jsonl \
    --output data/flowers102/prompts/train/flower-raw-open-qa-mixup-cutmix.jsonl \
    --alpha 0.2 \
    --sample-ratio 0.22 \
    --seed 42 \
    --combine-original \
    --shuffle

# Convert to SFT message format
./scripts/bash/convert_all_to_messages.sh --formatter lmms_engine
```

Expected Result: Processed dataset with train/test splits in `data/flowers102/prompts/`

📖 Detailed Guide: See `docs/download_dataset.md`, `docs/process_dataset.md`, and `docs/generate_prompt.md` for comprehensive documentation.
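Conceptually, the offline MixUp/CutMix step above blends pairs of training images using a Beta-sampled mixing coefficient (the same role as the `--alpha` flag). The following numpy sketch illustrates both operations on plain float image arrays; it is an illustrative sketch, not the exact logic of `apply_data_augmentation.py`:

```python
import numpy as np

def mixup(x1, x2, alpha=0.2, rng=None):
    """MixUp: convex combination of two images with lam ~ Beta(alpha, alpha).

    The returned lam is also used to mix the two labels: lam*y1 + (1-lam)*y2.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1.0 - lam) * x2, lam

def cutmix(x1, x2, alpha=0.2, rng=None):
    """CutMix: paste a random rectangle from x2 into x1.

    lam is adjusted to the actual fraction of x1's area that was kept.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    h, w = x1.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x0, x1b = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = x1.copy()
    mixed[y0:y1, x0:x1b] = x2[y0:y1, x0:x1b]
    lam_adj = 1.0 - (y1 - y0) * (x1b - x0) / (h * w)
    return mixed, lam_adj
```

With `alpha=0.2` the Beta distribution concentrates near 0 and 1, so most augmented samples stay close to one of the two source images.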
Evaluate the frozen base model's performance:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh
```

Expected Result: Baseline performance metrics for comparison

📖 Detailed Guide: See `docs/eval_qwen3vl.md`
Fine-tune the base model on the flowers domain to test Hypothesis 1:

```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh
```

Expected Result: Fine-tuned model saved to `output/qwen3_vl_4b_open_qa_sft/` with significantly improved accuracy

📖 Detailed Guide: See `docs/multi_gpu_training.md`
Train custom classification heads to test Hypotheses 2 & 3:

```bash
# Base model + custom classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_classifier_distributed.sh

# Fine-tuned model + custom classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/train_qwen3_vl_classifier_finetuned_distributed.sh
```

Expected Results:
- Base classifier: `output/qwen3_vl_4b_instruct_classifier/` (validates Hypothesis 2)
- Fine-tuned classifier: `output/qwen3_vl_finetuned_base_classifier/` (validates Hypothesis 3)

📖 Detailed Guide: See `docs/train_qwen3_vl_classifier_distributed.md` and `docs/ablation_study_guide.md`
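At its core, a classification head like the ones trained here maps a pooled feature vector from the backbone to logits over the 102 flower classes. A minimal numpy sketch of a linear head with a softmax output (the hidden size and initialization here are illustrative; the actual head architecture is defined in the training scripts):

```python
import numpy as np

class LinearHead:
    """Linear classification head over pooled backbone features."""

    def __init__(self, hidden_dim, num_classes=102, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(hidden_dim, num_classes))
        self.b = np.zeros(num_classes)

    def forward(self, feats):
        # feats: (batch, hidden_dim) pooled features from the (frozen) backbone
        logits = feats @ self.W + self.b
        z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
        return np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

# Forward a dummy batch of 4 pooled feature vectors (hidden size is hypothetical).
head = LinearHead(hidden_dim=2560)
probs = head.forward(np.random.default_rng(1).normal(size=(4, 2560)))
```

Training only this head keeps the backbone's representations fixed, which is why the quality of the base model (general vs. SFT-specialized) shows up directly in the final accuracy.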
Train the CNN classifier using various augmentation modes. Run the consolidated wrapper (from the repository root):

```bash
bash scripts/bash/training/train_resnet_classifier.sh <aug_mode>
```

📖 Detailed Guide: See `docs/train_resnet_classifier.md`
Validate your trained models before evaluation:

```bash
python scripts/diagnostic_test.py
```

📖 Detailed Guide: See `docs/diagnostic_test.md`
Compare the performance to validate all three research hypotheses.

Using locally trained checkpoints:

```bash
# Base model + classification head (Hypothesis 2)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_4b_instruct_classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head (Hypothesis 3)
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path output/qwen3_vl_finetuned_base_classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA

# Evaluate ResNet classifier (edit the configurations in this shell script)
bash ./scripts/bash/evaluation/eval_resnet_classifier.sh
```

Using our pre-trained models from Hugging Face:

```bash
# Base model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Classifier \
    --base_model Qwen/Qwen3-VL-4B-Instruct

# Fine-tuned model + classification head
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/evaluate_qwen3_vl_classifier_distributed.sh \
    --model_path oscarqjh/Qwen3-VL-4B-Instruct-SFT-Flowers102-Classifier \
    --base_model oscarqjh/Qwen3-VL-4B-Instruct-Flowers102-Open-QA
```

Expected Results: Progressive accuracy improvements validating the research hypotheses

📖 Detailed Guide: See `docs/evaluate_qwen3_vl_classifier_distributed.md`
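The headline metric reported by these evaluation scripts is top-1 accuracy over the 1,020-image test split, which reduces to comparing predicted class indices against the labels. A minimal sketch of the metric itself (the scripts may additionally handle distributed gathering and logging):

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Top-1 accuracy: fraction of samples whose argmax prediction matches the label."""
    preds = np.asarray(logits).argmax(axis=1)
    return float((preds == np.asarray(labels)).mean())

# Toy example with 3 classes and 4 samples: three predictions are correct.
logits = [[2.0, 0.1, 0.3],
          [0.2, 1.5, 0.1],
          [0.9, 0.2, 0.1],
          [0.1, 0.1, 3.0]]
acc = top1_accuracy(logits, [0, 1, 2, 2])  # → 0.75
```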
```bash
# Fine-tune Qwen3-VL-4B on the MixUp/CutMix augmented dataset
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/training/qwen3_vl_4b_train.sh \
    --dataset_path training-configs/mixcut_config.yaml \
    --run_name qwen3_vl_4b_open_qa_mixcut_sft

# Build the model locally
./scripts/bash/push_to_hf.sh \
    --checkpoint-dir output/qwen3_vl_4b_open_qa_mixcut_sft \
    --training-config training-configs/mixcut_config.yaml \
    --local-deploy ./checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200 \
    --use-latest

# Run evaluation with LMMs-Eval
CUDA_VISIBLE_DEVICES=4,5,6,7 ./scripts/bash/evaluation/eval_qwen3vl.sh \
    --model_path checkpoints/qwen3_vl_4b_open_qa_mixcut_2e-5_1200
```

Our model's performance is shown in the table below:

| Model Configuration | Accuracy | Improvement |
|---|---|---|
| Qwen3-VL-4B-SFT (fine-tuned) | 73.52% | - |
| Qwen3-VL-4B-SFT-MixUp-CutMix | 66.27% | -7.25% |
This work contributes to understanding fine-grained visual recognition in multimodal models:

- Progressive Enhancement Strategy: Demonstrates a systematic approach to improving classification accuracy through SFT → classification heads → specialized base models
- Quantified Improvements: Shows concrete evidence that:
  - SFT provides the largest single improvement (+254% relative)
  - Classification heads offer consistent benefits across different base models
  - Model specialization amplifies classification head effectiveness
- Practical Applications: The methodology can be applied to other fine-grained recognition tasks beyond flowers (medical imaging, product classification, species identification, etc.)
```
├── scripts/        # Training and evaluation scripts
│   ├── bash/       # Convenient wrapper scripts
│   └── *.py        # Core Python scripts
├── src/            # Source code modules
│   ├── models/     # Model implementations
│   ├── datasets/   # Dataset handling
│   └── evaluation/ # Evaluation utilities
├── docs/           # Comprehensive documentation
├── data/           # Dataset files
└── output/         # Trained models and results
```
This project includes comprehensive documentation for every component:

- Training: `docs/train_qwen3_vl_classifier_distributed.md`, `docs/train_resnet_classifier.md`
- Evaluation: `docs/evaluate_qwen3_vl_classifier_distributed.md`
- Data Processing: `docs/process_dataset.md`
- Model Upload: `docs/huggingface_upload_evaluation.md`
- Troubleshooting: `docs/diagnostic_test.md`
- Complete Guide: `docs/ablation_study_guide.md`

For a complete list of documentation, see `docs/scripts_documentation_audit.md`.
- Recommended: 4x NVIDIA A100-SXM4-40GB for distributed training (used in this research)
- Storage: ~50GB for dataset and models
- RAM: 32GB+ system memory
Note: All experiments in this research were conducted using 4x NVIDIA A100-SXM4-40GB GPUs with distributed training configurations.
If you use this work, please cite:

```bibtex
@misc{sc4001-flowers102-finegrained-recognition,
  title={A Study on Multimodal Fine-Grained Visual Recognition and Classification on Oxford Flowers102},
  author={Oscar Qian and Suki Ng and Li You},
  year={2025},
  url={https://github.com/oscarqjh/SC4001-Group-Project}
}
```

This project is licensed under the MIT License. See the LICENSE file for details.
All evaluations of large multimodal models (LMMs) in this project were performed using the lmms-eval framework: https://github.com/EvolvingLMMs-Lab/lmms-eval.
All supervised fine-tuning (SFT) of Qwen3-VL-Instruct models reported here was done using the lmms-engine training framework: https://github.com/EvolvingLMMs-Lab/lmms-engine.
Thanks to the NTU LMMs-Lab for these open-source tools, which made the experiments reproducible and efficient.