Fine-Tuning Vision-Language Models for Agentic Tasks

This repository provides a complete pipeline for fine-tuning Vision-Language Models (VLMs) using Reinforcement Learning (PPO) and Supervised Fine-Tuning (SFT) on agentic tasks that I set up and used during my MVA internship. My internship thesis can be found here. Without a quick review of the VLM section of the mentioned document, it would be a bit uneasy to go through this code since it covers many and different experiments. The system trains (at least tries to do so:) ) VLM agents to interact with visual environments by observing rendered frames and producing text-based actions.

Supported Models

Qwen2-VL-2B-Instruct / Qwen2.5-VL-3B-Instruct (primary)
LLaVA-Mistral-7B (legacy)

Supported Environments

Gym-Cards — Card game environments (NumberLine, Blackjack, EZPoints, Points24)
MiniGrid — Grid-world navigation tasks (DoorKey, Empty, etc.)
ALFWorld — Text+vision household tasks (via AI2-THOR)

Pipeline


 │  Data Collection  │────▶│  Data Preprocessing │────▶│    SFT     │────▶│  RL (PPO)     │────▶│  Eval    │
    (Qwen 32B)

Data Collection — Run a large VLM (Qwen2 32B) in the environment to collect labeled trajectories
Preprocessing — Convert trajectories into (image, conversation) pairs for SFT
SFT — Supervised fine-tuning with LoRA on the collected data
RL (PPO) — Further fine-tune with PPO using environment rewards
Evaluation — Test the trained agent and generate analysis artifacts

Repository Structure

VLM_finetune/
├── VLM_PPO/                  # Main codebase (RL, SFT, data, evaluation)
│   ├── main.py               #   PPO training — LLaVA on gym-cards
│   ├── main_qwen.py          #   PPO training — Qwen on gym-cards
│   ├── main_minigrid.py      #   PPO training — Qwen on MiniGrid
│   ├── train_sft.py          #   Supervised fine-tuning
│   ├── vlm_traj_label.py     #   Trajectory data collection & labeling
│   ├── vlm_traj_preprocess.py  # Data preprocessing for SFT
│   ├── eval_minigrid.py      #   Evaluation on MiniGrid
│   ├── a2c_ppo_acktr/        #   RL algorithm, policy models, env wrappers
│   ├── SFT/                  #   Custom HuggingFace trainer
│   ├── scripts/              #   Shell scripts for launching experiments
│   └── ...
├── VLM_PPO_ALF/              # ALFWorld variant (AI2-THOR environments)
├── LLaVA/                    # Forked LLaVA repo (patched for RL training)
├── gym-cards/                # Custom gym environments for card games
└── docs/notes/               # Developer notes and scratch files

See VLM_PPO/README.md for detailed documentation on all files, scripts, and configs.

Setup

Prerequisites

Python 3.10+
CUDA-capable GPU (A100/H100 recommended for large models)
Conda

Installation

# Clone the repo
git clone <repo-url>
cd VLM_finetune

# Create conda environment
conda create -n vlm python=3.10 -y
conda activate vlm

# Install dependencies
cd VLM_PPO
pip install -e .
pip install -r requirements.txt

# Install LLaVA (required for LLaVA-based training)
pip install -e ../LLaVA

# Install gym-cards environment
pip install -e ../gym-cards

# Install additional dependencies
pip install transformers accelerate deepspeed peft bitsandbytes
pip install qwen-vl-utils  # For Qwen models

ALFWorld Setup

See VLM_PPO_ALF/README.md for ALFWorld-specific installation.

Quick Start

Train with PPO on MiniGrid

cd VLM_PPO/scripts
bash run_minigrid_qwen.sh 1 "my_run" "/path/to/output" 29488

Train with SFT

cd VLM_PPO/scripts
bash run_minigrid_sft.sh /path/to/output

Evaluate

cd VLM_PPO/scripts
bash eval_minigrid_qwen.sh 1 "eval_run" "/path/to/output" 29488

License

See LICENSE.txt.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fine-Tuning Vision-Language Models for Agentic Tasks

Supported Models

Supported Environments

Pipeline

Repository Structure

Setup

Prerequisites

Installation

ALFWorld Setup

Quick Start

Train with PPO on MiniGrid

Train with SFT

Evaluate

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
LLaVA		LLaVA
VLM_PPO		VLM_PPO
VLM_PPO_ALF		VLM_PPO_ALF
docs/notes		docs/notes
gym-cards		gym-cards
imgs		imgs
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
dejavu.zip		dejavu.zip

Folders and files

Latest commit

History

Repository files navigation

Fine-Tuning Vision-Language Models for Agentic Tasks

Supported Models

Supported Environments

Pipeline

Repository Structure

Setup

Prerequisites

Installation

ALFWorld Setup

Quick Start

Train with PPO on MiniGrid

Train with SFT

Evaluate

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages