This repository demonstrates how to integrate ColPali embeddings for multi-modal retrieval-augmented generation (RAG). A PDF is indexed for retrieval, and a Llama 3.2 vision-language model generates answers from the retrieved pages.
We use the ColPali embedding model from Hugging Face, specifically vidore/colpali-v1.2, which embeds both text and page images into a shared multi-vector space. Indexing and retrieval are handled through the RAGMultiModalModel class.
Install Requirements
pip install byaldi
sudo apt-get install -y poppler-utils
pip install huggingface_hub
pip install -q together
Log in to Hugging Face
Provide your HF_TOKEN to authenticate with Hugging Face.
from huggingface_hub import login
login(token="HF_TOKEN")
Initialize the Model
from byaldi import RAGMultiModalModel
model = RAGMultiModalModel.from_pretrained('vidore/colpali-v1.2')
The PDF file colpali.pdf is downloaded, then passed to model.index, which creates an index for retrieval. The index_name argument is set to 'colpali'.
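The indexing call can be sketched as below. The argument names (input_path, index_name, store_collection_with_index, overwrite) follow byaldi's RAGMultiModalModel.index API; the specific values are assumptions about this repository's setup, not code taken from it:

```python
# Arguments for building the page index over the downloaded PDF (assumed values).
index_args = dict(
    input_path="colpali.pdf",           # local path to the PDF to index
    index_name="colpali",               # name under which the index is stored
    store_collection_with_index=False,  # keep the index light; pages stay on disk
    overwrite=True,                     # replace an existing index of this name
)

# With byaldi installed and the model loaded as above, the index is built with:
# model.index(**index_args)
```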
After generating the index, we run a query such as:
query = "What is ColPali's late-interaction baseline score on DocQ and InfoQ?"
results = model.search(query, k=2)
The top retrieved pages are then used to produce the best possible answer from the colpali.pdf content.
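The generation step can be sketched as follows. This is an illustrative assumption rather than code from the repository: build_vlm_messages is a hypothetical helper, and the base64 data-URI message shape and model name follow the OpenAI-style chat-completions conventions used by Together's vision endpoints.

```python
import base64

def build_vlm_messages(question: str, page_image_bytes: bytes) -> list:
    """Package a text question plus a retrieved PDF page image into the
    OpenAI-style multimodal chat message format used by VLM endpoints."""
    image_b64 = base64.b64encode(page_image_bytes).decode("ascii")
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ]

messages = build_vlm_messages(
    "What is ColPali's late-interaction baseline score on DocQ and InfoQ?",
    b"\x89PNG...",  # in practice: the top retrieved page, rendered to PNG bytes
)

# With a TOGETHER_API_KEY set, the payload would be sent to Llama 3.2 Vision, e.g.:
# from together import Together
# client = Together()
# reply = client.chat.completions.create(
#     model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
#     messages=messages,
# )
```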
- User Query
- Text + Vision Embedding via ColPali
- Index -> Retrieve relevant pages
- Llama 3.2 VLM processes both text query and retrieved PDF content
- Generated Answer
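The retrieval step in the pipeline above relies on ColPali's late interaction (MaxSim) scoring: each query-token embedding is matched against every page-patch embedding, the best match per query token is kept, and the maxima are summed. A minimal sketch with toy 2-D vectors (the real model uses higher-dimensional multi-vector embeddings; the numbers here are purely illustrative):

```python
def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score: for each query-token vector, take
    its best dot-product match among the page's patch vectors, then sum
    those maxima over all query tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

# Toy example: a two-token query against two candidate pages.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.1, 0.9]]  # matches both query tokens well
page_b = [[0.5, 0.5], [0.4, 0.4]]  # matches neither token strongly

# Ranking pages by MaxSim puts page_a first.
scores = {"a": maxsim_score(query, page_a), "b": maxsim_score(query, page_b)}
```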