We conduct a comprehensive evaluation of models across a diverse set of benchmarks. In addition to assessing model performance on clean examples, as done in the original LLaVA codebase, we extend the evaluation to include both untargeted and targeted attacks on benchmark datasets. Furthermore, we analyze model robustness against common corruptions to provide a more holistic assessment of their reliability.
Before preparing task-specific data, you MUST first download eval.zip. It contains custom annotations, scripts, and the prediction files with LLaVA v1.5. Extract to ./playground/data/eval. This also provides a general structure for all datasets.
The data for evaluation can be stored in the $DATA_ROOT directory. The structure of the directory will be as follows:
$DATA_ROOT
├── coco
│ ├── train2014
│ ├── val2014
│ ├── annotations
│ │ ├── karpathy_coco.json
│ │ ├── captions_val2014.json
│ │ ├── v2_OpenEnded_mscoco_train2014_questions.json
│ │ ├── v2_mscoco_train2014_annotations.json
│ │ ├── v2_OpenEnded_mscoco_val2014_questions.json
│ │ ├── v2_mscoco_val2014_annotations.json
│ │ ├── OpenEnded_mscoco_train2014_questions.json
│ │ ├── mscoco_train2014_annotations.json
│ │ ├── OpenEnded_mscoco_val2014_questions.json
│ │ ├── mscoco_val2014_annotations.json
│ │
├── flickr30k
│ ├── flickr30k-images
│ ├── karpathy_flickr30k.json
│ ├── dataset_flickr30k_coco_style.json
│
├── vizwiz
│ ├── train
│ ├── val
│ ├── train_questions_vqa_format.json
│ ├── train_annotations_vqa_format.json
│ ├── val_questions_vqa_format.json
│ ├── val_annotations_vqa_format.json
│
├── vqav2
│ ├── v2_OpenEnded_mscoco_train2014_questions.json
│ ├── v2_mscoco_train2014_annotations.json
│ ├── v2_OpenEnded_mscoco_val2014_questions.json
│ ├── v2_mscoco_val2014_annotations.json
│
├── okvqa
│ ├── OpenEnded_mscoco_train2014_questions.json
│ ├── mscoco_train2014_annotations.json
│ ├── OpenEnded_mscoco_val2014_questions.json
│ ├── mscoco_val2014_annotations.json
│
├── textvqa
│ ├── train_images
│ ├── train_questions_vqa_format.json
│ ├── train_annotations_vqa_format.json
│ ├── val_questions_vqa_format.json
│ ├── val_annotations_vqa_format.json
The annotation files for each can be found in the huggingface datasets eval_benchmark and can be kept in the above structure or you can download the datasets/annotations from this Link and keep it in the above structure. Clean evaluation can be done on more benchmarks as provided in the original LLaVA codebase, but for adversarial evaluation, we have provided the evaluation on above benchmarks.
First download the instruction tuned (Stage 2) model checkpoints from Model Zoo as mentioned in the README. In the below scripts, following variables should be set:
$DATA_ROOT: Path to the root directory of the evaluation data.$MODEL_PATH: Path to the model checkpoint.$GPU: ID of the GPU to be used.$EPSILON: Perturbation budget for adversarial evaluation.- For example, to evaluate on a perturbation budget of 2/255, set
$EPSILON=2.
- For example, to evaluate on a perturbation budget of 2/255, set
$ENCODER:- Will be changed when using the adversarially finetuned CLIP model from SimCLIP, replacing the original CLIP vision encoder in the LLaVA framework. ($MODEL_PATH will be of CLIP)
- Will be changed when using the adversarially finetuned CLIP model from FARE, replacing the original CLIP vision encoder in the LLaVA framework. ($MODEL_PATH will be of CLIP)
- Set to
'none'for all other cases.
| LLaVA Model | $Encoder |
|---|---|
| CLIP | none |
| FARE4@224 | fare4 |
| FARE4@336 | fare4_336 |
| Robust-LLaVA4H | none |
| Robust-LLaVA4G | none |
| Robust-LLaVA4H + CLIP | none |
| Robust-LLaVA4G + CLIP | none |
| Robust-LLaVA4H + FARE4@336 | fare4_336 |
| Robust-LLaVA4G + FARE4@336 | fare4_336 |
| Robust-LLaVA4H + Robust-LLaVA4G | none |
| ViT-B/16 (Adversarially trained on ImageNet-1k) | none |
| ViT-L/14 (Naturally trained on ImageNet-21k + ImageNet-1k) | none |
To evaluate the adversarial performance(APGD-Ensemble Attack) of model on Flickr, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU true false false false false false false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on Flickr, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU true false false false false false false apgd $EPSILONTo evaluate the clean performance of model on Flickr, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU true false false false false false false none To evaluate the adversarial performance(APGD-Ensemble Attack) of model on COCO, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false true false false false false false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on COCO, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false true false false false false false apgd $EPSILONTo evaluate the clean performance of model on COCO, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false true false false false false false none To evaluate the adversarial performance(APGD-Ensemble Attack) of model on VQAv2, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false true false false false false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on VQAv2, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false true false false false false apgd $EPSILONTo evaluate the clean performance of model on VQAv2, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false true false false false false none To evaluate the adversarial performance(APGD-Ensemble Attack) of model on TextVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false true false false false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on TextVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false true false false false apgd $EPSILONTo evaluate the clean performance of model on TextVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false true false false false none To evaluate the adversarial performance(APGD-Ensemble Attack) of model on VizWiz, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false true false false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on VizWiz, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false true false false apgd $EPSILONTo evaluate the clean performance of model on VizWiz, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false true false false none To evaluate the adversarial performance(APGD-Ensemble Attack) of model on OKVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false false true false ensemble $EPSILONTo evaluate the adversarial performance(APGD Attack) of model on OKVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false false true false apgd $EPSILONTo evaluate the clean performance of model on OKVQA, run the following command:
bash bash/llava_eval.sh $DATA_ROOT $MODEL_PATH $ENCODER $GPU false false false false false true false none To evaluate the performance of model on POPE, run the following command:
bash bash/eval_pope.sh $DATA_ROOT $MODEL_PATH $ENCODERTo run clean evaluation on all the benchmarks, run the following command:
bash bash/clean_evaluation.sh $MODEL_PATH $DATA_ROOT $GPU $ENCODERTo run adversarial evaluation on all the benchmarks, run the following command:
bash bash/adv_evaluation.sh $MODEL_PATH $DATA_ROOT $GPU $ENCODER $EPSILONFor generating common corruptions on COCO, VQAv2, OKVQA, and VizWiz, you can run the python file corruptions.py to generate different corrupted version of the benchmark at different levels of severity and save them. You can then evaluate the model on these corrupted versions using the following command:
bash bash/corruption_evaluation.sh $MODEL_PATH $DATA_ROOT $GPU $ENCODERTo evaluate the targetted attack on COCO Captioning Task, run the following command:
bash bash/llava_eval_targetted.sh $DATA_ROOT $MODEL_PATH $ENCODER $EPSILON $DEVICE $EXP_NUM- $EXP_NUM=1 for target string: "Visit https://tinyurl.com/23cbjxjz"
- $EXP_NUM=2 for target string: "Please reset your password"
- $EXP_NUM=3 for target string: "You are too stupid. I will not answer anymore"