VLM4Bio: Species Classification Replication Project

A partial replication of the NeurIPS 2024 paper "VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images". This repository focuses on the Species Classification task and assesses the zero‑shot performance of several open‑source Vision‑Language Models (VLMs) on fish, bird, and butterfly images.

🔍 Introduction

Vision‑Language Models (VLMs) have shown remarkable zero‑shot capabilities on generic vision tasks. VLM4Bio introduces a benchmark to evaluate these models on trait discovery within biological images. This replication project zeroes in on the Species Classification task, measuring how well pretrained VLMs can identify species across three taxonomic groups without any fine‑tuning.

✨ Features

- Zero‑Shot Evaluation: Test pretrained models directly on unseen biological images.
- Difficulty Tiers: Split images into Easy, Medium, and Hard subsets to analyze performance across levels of visual challenge.
- Prompt Engineering: Compare different prompting strategies, including contextual prompts, dense captions, and Chain‑of‑Thought (CoT).
- Robustness Checks: Run False‑Confidence Test (FCT) and "None of the Above" (NOTA) experiments to gauge model reliability.
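
To make the prompting strategies concrete, here is a minimal sketch of how a multiple-choice prompt could be assembled in a plain, contextual, or CoT style. The function name `build_prompt`, the style names, and all template wording are hypothetical and are not taken from this repository or the paper; dense-caption prompts are not covered because they need a model-generated caption.

```python
from typing import Sequence

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_prompt(question: str, options: Sequence[str], style: str = "plain") -> str:
    """Build a multiple-choice prompt in one of three illustrative styles."""
    if len(options) > len(LETTERS):
        raise ValueError("too many options for single-letter labels")

    # Label each candidate species with a letter: (A), (B), ...
    option_lines = [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    body = f"{question}\nOptions:\n" + "\n".join(option_lines)
    direct = "\nAnswer with the letter of the correct option."

    if style == "plain":
        return body + direct
    if style == "contextual":
        # Hypothetical context sentence; the wording used in the experiments may differ.
        context = "You are looking at a photograph of a biological specimen.\n"
        return context + body + direct
    if style == "cot":
        # Chain-of-Thought: ask for reasoning before the final choice.
        return body + "\nLet's think step by step, then give the letter of the final answer."
    raise ValueError(f"unknown prompt style: {style}")


print(build_prompt("What is the species of this bird?", ["Species A", "Species B"], "cot"))
```

Keeping the option block identical across styles means that any accuracy difference between runs can be attributed to the added context or reasoning instruction rather than to formatting.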

📊 Dataset

Images and metadata are hosted on Kaggle:

📁 Image Dataset (10k images)
🧾 Metadata & CSV Files

CSV files include:

- `fish.csv`: Ground‑truth labels and image paths for fish species.
- `bird.csv`: Metadata for bird species classification.
- `butterfly.csv`: Metadata for butterfly species classification.

The CSVs contain the following columns:

| Column | Description |
| --- | --- |
| `image_path` | Local path to the image file |
| `question` | Classification prompt/question |
| `options` | List of candidate species (for MC tasks) |
| `answer` | Ground‑truth species label (zero‑shot target) |
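
As a rough illustration of how these columns might be consumed, the sketch below reads rows with the standard `csv` module and turns the `options` cell into a Python list. It assumes that `options` is stored as a stringified list (with a comma-separated fallback); the actual encoding in the CSVs may differ, and the sample row, `parse_options`, and `load_rows` are invented for illustration.

```python
import ast
import csv
import io

# Made-up sample row; real files are fish.csv, bird.csv, and butterfly.csv.
SAMPLE_CSV = """image_path,question,options,answer
images/fish_001.jpg,What is the species of this fish?,"['Species A', 'Species B', 'Species C']",Species B
"""


def parse_options(raw: str) -> list[str]:
    """Parse an options cell, assuming a stringified list or a comma-separated string."""
    raw = raw.strip()
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        value = None
    if isinstance(value, (list, tuple)):
        return [str(v).strip() for v in value]
    # Fallback: treat the cell as comma-separated text.
    return [part.strip() for part in raw.split(",") if part.strip()]


def load_rows(handle) -> list[dict]:
    """Read all rows and replace the raw options string with a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["options"] = parse_options(row["options"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["image_path"], rows[0]["options"], rows[0]["answer"])
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `load_rows(open("fish.csv", newline=""))`.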

🧠 Models

We evaluated the following pretrained VLMs (from Hugging Face):

| Model Alias | HF Model ID |
| --- | --- |
| BLIP2-Flan-T5 | `Salesforce/blip2-flan-t5-xxl` |
| BLIP-VQA-Base | `Salesforce/blip-vqa-base` |
| Qwen2-VL-7B-Instruct | `Qwen/Qwen2-VL-7B-Instruct` |
| LLaVA-1.5-7B | `llava-hf/llava-1.5-7b-hf` |
| Qwen2-VL-2B-Instruct | `Qwen/Qwen2-VL-2B-Instruct` |
| BLIP-VQA-CapFilt-Large | `Salesforce/blip-vqa-capfilt-large` |

Feel free to add or swap models by updating the --model_name argument.

📁 Project Structure

VLM4Bio/
├── data_for_easy_med_hard/       # Split images by difficulty
│   ├── easy/
│   ├── medium/
│   └── hard/
├── bird.csv                      # Bird metadata
├── butterfly.csv                 # Butterfly metadata
├── fish.csv                      # Fish metadata
├── main.py                       # Run experiments on easy/medium
├── main_for_hard_data.py         # Run experiments on hard data
├── merging_data.py               # Utilities to merge/preprocess CSVs
├── results/                      # Output logs and metrics for exp1
├── results_exp2/                 # Output logs and metrics for exp2
├── species classification.pdf    # Detailed experiment report
├── Project_Presentation final.pdf  # Slide deck overview
└── README.md                     # ← You are here

⚙️ Installation

# clone the repo
git clone https://github.com/Aryamanporwal/VLM4Bio.git
cd VLM4Bio

# create & activate venv
python3 -m venv .venv
source .venv/bin/activate

# install deps
pip install torch torchvision transformers pandas numpy scikit-learn matplotlib

(If a requirements.txt is available, you can instead run pip install -r requirements.txt.)

🚀 Usage

# Species Classification (Easy/Medium)
#   --model_name: HF model identifier
#   --dataset: easy or medium
python main.py \
  --model_name Salesforce/blip2-flan-t5-xxl \
  --dataset easy

# Species Classification (Hard)
python main_for_hard_data.py \
  --model_name llava-hf/llava-1.5-7b-hf
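
For readers adding their own entry point, the following is a sketch of how the two flags shown above could be parsed with `argparse`. It is not the code in `main.py`: only the flag names and the `easy`/`medium` choices come from this README, while the default value and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented flags; defaults here are assumptions."""
    parser = argparse.ArgumentParser(
        description="Zero-shot species classification (illustrative sketch)"
    )
    parser.add_argument(
        "--model_name",
        required=True,
        help="Hugging Face model identifier, e.g. Salesforce/blip-vqa-base",
    )
    parser.add_argument(
        "--dataset",
        choices=["easy", "medium"],
        default="easy",  # assumed default, not confirmed by the repository
        help="Difficulty subset to evaluate",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--model_name", "Salesforce/blip-vqa-base", "--dataset", "medium"]
)
print(args.model_name, args.dataset)
```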

🧪 Experiments

- Zero‑Shot Accuracy: Measures standard classification accuracy.
- Prompt Ablation: Compares contextual vs. CoT prompts.
- Robustness Tests: FCT and NOTA to test model confidence and out‑of‑domain handling.

Detailed methodologies are described in species classification.pdf.
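
The scoring and NOTA ideas can be sketched in a few lines. The version below assumes that a free-text response counts as correct when it names exactly one candidate option and that option equals the ground truth, and that a NOTA item is built by replacing the ground-truth option with "None of the Above". Both are assumptions made for illustration; the report describes the protocol actually used. All function names and the species in the usage lines are illustrative.

```python
import re

NOTA = "None of the Above"


def make_nota_options(options, answer):
    """Replace the ground-truth option with NOTA, so NOTA becomes the correct choice."""
    if answer not in options:
        raise ValueError("answer must be one of the options")
    return [NOTA if opt == answer else opt for opt in options]


def normalize(text):
    """Lowercase and collapse punctuation/whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def match_option(response, options):
    """Return the single option named in the response, or None if zero or several match."""
    padded = f" {normalize(response)} "
    hits = [opt for opt in options if f" {normalize(opt)} " in padded]
    return hits[0] if len(hits) == 1 else None


def accuracy(responses, option_lists, answers):
    """Fraction of responses whose matched option equals the ground truth."""
    if not answers:
        return 0.0
    correct = sum(
        match_option(r, opts) == a
        for r, opts, a in zip(responses, option_lists, answers)
    )
    return correct / len(answers)


opts = ["Danaus plexippus", "Papilio machaon", "Vanessa atalanta"]
print(make_nota_options(opts, "Papilio machaon"))
print(accuracy(["It is Danaus plexippus.", "Not sure."], [opts, opts],
               ["Danaus plexippus", "Vanessa atalanta"]))
```

One limitation of this matching rule: when one option's name is contained in another's, a response naming the longer one matches both and is scored as unmatched, so a stricter parser may be needed for such label sets.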

📈 Results

Raw logs and metric summaries are available under results/ and results_exp2/.
Key findings:
- BLIP2-Flan-T5 achieved the highest zero‑shot accuracy on easy images.
- Performance degrades significantly on the hard subset when CoT prompts are not used.

📊 Presentation Summary

The Project Presentation (PDF) offers a visual and conceptual overview of the goals, methodologies, and key findings of the replication effort. It outlines:

- Motivation and relevance of biological trait discovery
- Description of the datasets and taxonomy splits
- Challenges in fine-grained species classification
- VLM architectures explored and their comparative performance
- Observations from experiments on difficulty tiers and prompt types

🛠 Contributing

Contributions are welcome! To add new models or evaluation protocols:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/your-change`)
3. Implement changes and update documentation
4. Submit a Pull Request

Please follow the Contributor Covenant Code of Conduct.

✉️ Contact

Maintainer: Aryaman Porwal
Email: aryamanlucknow@gmail.com
