DemoBias: Evaluating Demographic Biases in Large Vision Language Models for Biometric Face Recognition
Repository: https://github.com/Sufianlab/DemoBias
DemoBias is an empirical evaluation framework for investigating demographic biases in Large Vision Language Models (LVLMs) applied to biometric face recognition (FR) via textual token generation. The study focuses on three widely used LVLMs: LLaVA, BLIP-2, and PaliGemma. We analyze these models on a demographically balanced dataset to quantify and trace performance disparities across demographic groups such as ethnicity/race, gender, and age.
Note: Code and documentation for LLaVA, as well as the dataset, will be added soon.
- Bias Evaluation: Quantifies demographic biases in LVLMs for biometric face recognition with textual descriptions.
- Model Coverage: Fine-tuning and evaluation for BLIP-2 and PaliGemma (LLaVA coming soon).
- Fairness Metrics: Implements group-specific BERTScore and the Fairness Discrepancy Rate for thorough bias analysis.
- Reproducible Experiments: Jupyter notebooks for end-to-end fine-tuning and inference.
Currently, the repository includes:
- `BLIP_2_fine_tuneing.ipynb`: Fine-tuning BLIP-2 on the balanced dataset (a minimal sketch of such a loop appears below).
- `Blip_2_inference.ipynb`: Inference and evaluation for BLIP-2.
- `Paligemma_fine_tuneing.ipynb`: Fine-tuning PaliGemma.
- `Paligemma_inference.ipynb`: Inference and evaluation for PaliGemma.
LLaVA code and dataset will be available soon.
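For orientation, the following is a minimal sketch of the kind of fine-tuning loop the notebooks implement, assuming the Hugging Face base checkpoint `Salesforce/blip2-opt-2.7b` and a hypothetical iterable `train_pairs` of (image, caption) pairs; the notebooks remain the authoritative reference.

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed base checkpoint; the notebooks may start from a different one.
checkpoint = "Salesforce/blip2-opt-2.7b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint).to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# train_pairs is a hypothetical iterable of (PIL.Image, str) pairs
# drawn from the demographically balanced dataset.
for image, caption in train_pairs:
    inputs = processor(images=image, text=caption, return_tensors="pt").to(device)
    # With labels set to the caption tokens, the forward pass
    # returns a standard language-modeling loss.
    outputs = model(**inputs, labels=inputs["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```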
- Python 3.8+
- PyTorch
- HuggingFace Transformers
- Jupyter Notebook
- Required libraries as specified in each notebook
- Clone this repository:
  git clone https://github.com/Sufianlab/DemoBias.git
  cd DemoBias
- Install dependencies as per the requirements listed in each notebook.
- Run the notebooks for fine-tuning and inference:
- Open the relevant `.ipynb` file in Jupyter Notebook or JupyterLab.
- Follow the instructions in each notebook to reproduce the experiments or run your own evaluations (a minimal inference sketch follows this list).
- Dataset: Instructions for using the demographically balanced dataset will be provided in upcoming updates.
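Before diving into the notebooks, the following minimal inference sketch shows the general pattern, assuming the base checkpoint `Salesforce/blip2-opt-2.7b`, a local image `face.jpg`, and an illustrative prompt (all placeholders); the inference notebooks use the fine-tuned models and their own prompts.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

checkpoint = "Salesforce/blip2-opt-2.7b"  # assumed; swap in a fine-tuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained(checkpoint)
model = Blip2ForConditionalGeneration.from_pretrained(checkpoint).to(device)

image = Image.open("face.jpg")  # hypothetical input image
prompt = "Question: Describe the person in the image. Answer:"  # illustrative prompt
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```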
- Group-specific BERTScore: Measures model performance for each demographic group.
- Fairness Discrepancy Rate (FDR): Quantifies the performance disparity across demographic groups. A minimal sketch of both metrics follows below.
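To make these concrete, here is a minimal sketch, assuming per-group lists of generated and reference descriptions, the `bert-score` package, and a simple max-gap formulation of the discrepancy (the exact FDR formulation in the paper may differ).

```python
from bert_score import score  # pip install bert-score

def group_bertscore(candidates, references):
    """Mean BERTScore F1 over one demographic group's (candidate, reference) pairs."""
    _, _, f1 = score(candidates, references, lang="en")
    return f1.mean().item()

def fairness_discrepancy(group_scores):
    """Gap between the best- and worst-scoring groups (illustrative FDR proxy)."""
    return max(group_scores.values()) - min(group_scores.values())

# Toy data: each group maps to (generated descriptions, reference descriptions).
groups = {
    "group_A": (["a smiling woman wearing glasses"], ["a woman with glasses, smiling"]),
    "group_B": (["a man in a hat"], ["a man wearing a baseball cap"]),
}

scores = {g: group_bertscore(cands, refs) for g, (cands, refs) in groups.items()}
print(scores)
print("discrepancy:", fairness_discrepancy(scores))
```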
- PaliGemma and LLaVA (preliminary results) show higher bias for Hispanic/Latino, Caucasian, and South Asian groups.
- BLIP-2 demonstrates more consistent and fair performance across demographics.
- See respective notebooks for detailed results and analysis.
- LLaVA code and evaluation notebook
- Public release of the demographic-balanced dataset
- Additional scripts and documentation for streamlined workflow
If you use DemoBias in your research, please cite the following paper:
@inproceedings{Sufian2025DemoBias,
author={A. Sufian and A. Ghosh and D. Barman and M. Leo and C. Distante},
title={{DemoBias: An Empirical Study to Trace Demographic Biases in Vision Foundation Models}},
booktitle={2025 13th International Workshop on Biometrics and Forensics (IWBF)},
year={2025},
pages={01-06},
doi={10.1109/IWBF63717.2025.11113455},
keywords={Measurement;Deep learning;Analytical models;Foundation models;Face recognition;Biological system modeling;Forensics;Conferences;Authentication;Reliability;Biometric;Deep Learning;Demographic Bias;Face Fairness;Foundation Models;LLM;LVLM}
}

This project is licensed under the MIT License. See the LICENSE file for details.
For questions, suggestions, or collaborations, please open an issue or contact Sufianlab.