tinyvlm

A dual-encoder vision-language model with a shared embedding space, based on CLIP.

Set up the environment and install dependencies

python -m venv .venv

# Activate (Windows)
.venv/Scripts/activate
# Activate (Linux/macOS)
source .venv/bin/activate

pip install -r requirements.txt

If using a GPU, install the CUDA 11.8 build of PyTorch:

pip install torch==2.2.2+cu118 torchvision==0.17.2+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
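
To confirm that the CUDA build is the one actually in use, a quick check along these lines can help. This is a minimal sketch; the helper name `describe_device` is hypothetical and not part of tinyvlm.

```python
import torch


def describe_device() -> str:
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    return "cuda" if torch.cuda.is_available() else "cpu"


device = describe_device()
print(f"torch {torch.__version__}, running on: {device}")
```

If this reports `cpu` on a machine with an NVIDIA GPU, the CPU-only wheel is probably still installed, or the driver does not support CUDA 11.8.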

Image encoder: ResNet18 (pretrained on ImageNet, classification head removed).
Text encoder: BERT-base uncased, with the [CLS] token embedding used as the text representation.
Projection layers: linear layers that map the outputs of both encoders into the same latent space.
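
The projection step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual code: the class name `ProjectionHeads`, the shared dimension of 256, and the L2 normalization (borrowed from CLIP) are all assumptions. The input sizes follow from the encoders: ResNet18 yields 512-dimensional features once the classification head is removed, and BERT-base has a hidden size of 768. Random tensors stand in for the encoder outputs so that the example runs without downloading pretrained weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

IMAGE_DIM = 512  # ResNet18 feature size after the classification head is removed
TEXT_DIM = 768   # BERT-base hidden size (CLS token embedding)
EMBED_DIM = 256  # assumed size of the shared space; not stated in this README


class ProjectionHeads(nn.Module):
    """Hypothetical module: one linear projection per encoder."""

    def __init__(self, image_dim=IMAGE_DIM, text_dim=TEXT_DIM, embed_dim=EMBED_DIM):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # Assumption: L2-normalize so dot products are cosine similarities, as in CLIP.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img, txt


# Random tensors stand in for encoder outputs of a batch of 4 image-text pairs.
heads = ProjectionHeads()
img_emb, txt_emb = heads(torch.randn(4, IMAGE_DIM), torch.randn(4, TEXT_DIM))
print(img_emb.shape, txt_emb.shape)
```

Keeping the projections separate from the encoders makes it possible to freeze the pretrained backbones and train only the two linear layers, which is one common way to keep a small model cheap to train.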

Contrastive loss: the average of the image→text and text→image cross-entropy losses.
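
A minimal sketch of such a symmetric loss in PyTorch is shown below. It assumes the CLIP-style formulation, in which matching pairs lie on the diagonal of the batch similarity matrix. The function name `contrastive_loss` and the fixed temperature of 0.07 are assumptions for illustration; the repository may use a learned temperature or a different value.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Average of image->text and text->image cross-entropy (hypothetical helper)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # logits[i, j] is the scaled cosine similarity between image i and text j.
    logits = img_emb @ txt_emb.t() / temperature
    # The i-th image matches the i-th text, so the targets are the diagonal indices.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # rows: image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # columns: text -> image
    return (loss_i2t + loss_t2i) / 2


torch.manual_seed(0)
emb = torch.randn(4, 8)
aligned = contrastive_loss(emb, emb)                    # every pair matches
shuffled = contrastive_loss(emb, emb.roll(1, dims=0))   # every pair is mismatched
print(aligned.item(), shuffled.item())
```

The aligned batch should give a much lower loss than the shuffled one, which is a quick sanity check that the targets line up with the diagonal.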