This project fine-tunes a BLIP model to generate better inpainting captions.
The preprint of the paper is available at this link.
Install dependencies (preferably in a virtual environment):
pip install torch torchvision transformers pandas tqdm pillow
Optional (for GPU support):
pip install torch --index-url https://download.pytorch.org/whl/cu118

Model architecture:
Base: BLIP, StableDiffusion-Inpaint
Head: MLP regressor that outputs 3 values: [SSIM, PSNR, CLIP Score]
Loss: Weighted MSE / custom weighted difference loss
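The head and loss above could look like this minimal PyTorch sketch; the embedding size, hidden width, and per-metric loss weights are illustrative assumptions, not the project's actual values:

```python
import torch
import torch.nn as nn

class MetricsHead(nn.Module):
    """Sketch of the MLP regressor head: maps a BLIP embedding to three
    predicted metrics [SSIM, PSNR, CLIP Score]. Dimensions are assumptions."""
    def __init__(self, embed_dim=768, hidden_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 3),  # [SSIM, PSNR, CLIP Score]
        )

    def forward(self, x):
        return self.mlp(x)

def weighted_mse(pred, target, weights=(1.0, 0.1, 1.0)):
    """Weighted MSE over the three metrics; the weights here are illustrative
    (e.g., down-weighting PSNR, which lives on a larger numeric scale)."""
    w = torch.tensor(weights, device=pred.device)
    return ((pred - target) ** 2 * w).mean()
```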
- Place your data: put images in a directory like test2014/
Download a pretrained BLIP model (e.g., from Hugging Face) into a blip/ folder
- Run main
python main.py
The script generates a CSV file of all the losses; this file is then used to train the MLP head and fine-tune BLIP.
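As a hedged illustration of how that CSV might be consumed for training, the snippet below wraps it in a PyTorch `Dataset`; the column names (`image`, `ssim`, `psnr`, `clip_score`) are assumptions about the schema, not the script's actual output:

```python
import pandas as pd
import torch
from torch.utils.data import Dataset

class MetricsCSVDataset(Dataset):
    """Sketch: exposes the CSV produced by main.py as (image name, metrics)
    pairs so the MLP head can be trained against the measured metrics."""
    def __init__(self, csv_path):
        self.df = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        target = torch.tensor(
            [row["ssim"], row["psnr"], row["clip_score"]],
            dtype=torch.float32,
        )
        return row["image"], target
```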
- Run training
python finetune_blip.py
The script will:
Train the MLP head for epochs_mlp epochs
Fine-tune select layers of BLIP for epochs_blip epochs
Save the final model to:
blip-v2/fine_tuned_blip_with_metrics.pth
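The two-phase procedure above can be sketched as follows; which BLIP layers get unfrozen (here, anything named "decoder"), the optimizers, and the learning rates are assumptions, not the script's actual settings:

```python
import torch
import torch.nn as nn

def finetune(blip_model, head, loader, epochs_mlp=2, epochs_blip=1,
             out_path="blip-v2/fine_tuned_blip_with_metrics.pth"):
    """Sketch of finetune_blip.py: train the MLP head first, then
    jointly fine-tune a subset of BLIP layers, then save one checkpoint."""
    # Phase 1: freeze BLIP entirely and train only the MLP head.
    for p in blip_model.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(epochs_mlp):
        for feats, targets in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(head(blip_model(feats)), targets)
            loss.backward()
            opt.step()

    # Phase 2: unfreeze select BLIP layers (name filter is an assumption)
    # and train them jointly with the head at a lower learning rate.
    for name, p in blip_model.named_parameters():
        if "decoder" in name:
            p.requires_grad = True
    params = [p for p in list(blip_model.parameters()) + list(head.parameters())
              if p.requires_grad]
    opt = torch.optim.Adam(params, lr=1e-5)
    for _ in range(epochs_blip):
        for feats, targets in loader:
            opt.zero_grad()
            loss = nn.functional.mse_loss(head(blip_model(feats)), targets)
            loss.backward()
            opt.step()

    # Save both components in a single checkpoint.
    torch.save({"blip": blip_model.state_dict(), "head": head.state_dict()},
               out_path)
```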
- Run main.py again with the updated BLIP model
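For the second run, the saved weights need to be loaded back first; a minimal sketch, assuming the checkpoint stores "blip" and "head" state dicts (that layout is an assumption):

```python
import torch

def load_finetuned(blip_model, head,
                   path="blip-v2/fine_tuned_blip_with_metrics.pth"):
    """Sketch: restore the fine-tuned weights before rerunning main.py.
    The checkpoint keys ("blip", "head") are assumptions about its layout."""
    ckpt = torch.load(path, map_location="cpu")
    blip_model.load_state_dict(ckpt["blip"])
    head.load_state_dict(ckpt["head"])
    blip_model.eval()  # inference mode for caption generation
    head.eval()
    return blip_model, head
```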