We train our models using the following three-stage process:
We provide unified scripts for projector pretraining and video fine-tuning of Mobile-VideoGPT for both the 0.5B and 1.5B models. Please follow the instructions below:
You can download the pretraining and fine-tuning datasets from here: https://huggingface.co/datasets/MBZUAI/VideoGPT-plus_Training_Dataset/tree/main/pretraining
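As one option, the pretraining data can be fetched with the `huggingface-cli` tool; the local target directory below is an arbitrary choice, and you may also download the files manually from the dataset page.

```shell
# Download only the pretraining subset of the VideoGPT-plus training dataset
# (the local directory name here is arbitrary; adjust as needed)
huggingface-cli download MBZUAI/VideoGPT-plus_Training_Dataset \
    --repo-type dataset \
    --include "pretraining/*" \
    --local-dir ./VideoGPT-plus_Training_Dataset
```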
You can download the utilized instruction-tuning datasets from (1) VideoChat2-IT, (2) PerceptionTest and Academic_sources from LLaVA-Video-178K, and (3) the instruction-tuning data of VideoGPT-plus.
Run the three-stage training pipeline using the script Mobile-VideoGPT-0.5B_training.sh. It performs (1) image projector pretraining, (2) video projector pretraining, and (3) video instruction fine-tuning for Mobile-VideoGPT-0.5B.
Run the three-stage training pipeline using the script Mobile-VideoGPT-1.5B_training.sh. It performs (1) image projector pretraining, (2) video projector pretraining, and (3) video instruction fine-tuning for Mobile-VideoGPT-1.5B.
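Assuming the scripts sit at the repository root (adjust the paths to match your checkout), the two pipelines are launched as follows; each script runs its three stages sequentially:

```shell
# Launch the full three-stage pipeline for the 0.5B model
bash Mobile-VideoGPT-0.5B_training.sh

# Or, for the 1.5B model
bash Mobile-VideoGPT-1.5B_training.sh
```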
