MultiModal RAG with ColPali

Overview

This repository demonstrates how to integrate ColPali embeddings for multi-modal retrieval-augmented generation (RAG). A PDF index is used for retrieval, and a Llama 3.2 vision-language model generates answers from the retrieved pages.

ColPali Model

We incorporate the ColPali Embedding model from Hugging Face, specifically vidore/colpali-v1.2, which provides robust embeddings for text and vision. The RAGMultiModalModel class is leveraged for indexing and retrieval.
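ColPali scores pages with late interaction (MaxSim): each query-token embedding is compared against every page-patch embedding, and the per-token maxima are summed. The following is a minimal pure-Python sketch of that scoring rule; the tiny hand-written vectors are made up purely for illustration (real ColPali embeddings are higher-dimensional and produced by the model):

```python
# Toy illustration of ColPali-style late-interaction (MaxSim) scoring.
# Real embeddings come from the model; these 2-D vectors are invented.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max similarity to any page patch."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

query_vecs = [[1.0, 0.0], [0.0, 1.0]]   # two query-token embeddings
page_a = [[0.9, 0.1], [0.2, 0.8]]       # patches matching both query tokens
page_b = [[0.1, 0.1], [0.0, 0.2]]       # weakly matching patches

scores = {name: maxsim_score(query_vecs, page)
          for name, page in [("page_a", page_a), ("page_b", page_b)]}
best = max(scores, key=scores.get)
print(best)  # page_a scores higher, so it is retrieved first
```

Because scoring is a sum of per-token maxima, a page only needs some patch that matches each query token well, which is what makes late interaction effective for visually rich documents.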

Installation Steps

  1. Install Requirements

    pip install byaldi
    sudo apt-get install -y poppler-utils
    pip install huggingface_hub
    pip install -q together
  2. Log in to Hugging Face
    Provide your HF_TOKEN to authenticate with Hugging Face.

    from huggingface_hub import login
    login(token="HF_TOKEN")  # replace "HF_TOKEN" with your actual access token
  3. Initialize the Model

    from byaldi import RAGMultiModalModel
    model = RAGMultiModalModel.from_pretrained('vidore/colpali-v1.2')

Index Creation

The PDF file colpali.pdf is downloaded, then passed to model.index, which creates an index for retrieval. The index_name argument is set to 'colpali'.
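The indexing step can be sketched as follows. Argument names follow byaldi's RAGMultiModalModel.index; setting store_collection_with_index=True keeps a base64 copy of each page inside the index so retrieved pages can later be sent directly to a vision model (check the byaldi documentation for your installed version):

```python
# Sketch of the indexing step. Assumes colpali.pdf has already been
# downloaded into the working directory.
from byaldi import RAGMultiModalModel

model = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2")
model.index(
    input_path="colpali.pdf",            # PDF to embed page by page
    index_name="colpali",                # name referenced at search time
    store_collection_with_index=True,    # keep base64 page images in the index
    overwrite=True,                      # replace any existing index of this name
)
```

Note that this downloads the ColPali weights from Hugging Face and embeds every page, so it requires the earlier login step and benefits from a GPU.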

Querying the Model

After generating the index, we run a query such as:

query = "What is ColPali's (late interaction) evaluation baseline score on DocQ and InfoQ?"
results = model.search(query, k=2)

The top-ranked pages are then passed to the vision-language model, which generates an answer grounded in the content of colpali.pdf.
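The generation step can be sketched as below. This assumes the index was built with store_collection_with_index=True (so each search result carries a base64 page image), that TOGETHER_API_KEY is set in the environment, and that the Llama 3.2 Vision model name shown is available on your Together account; the message format follows Together's OpenAI-compatible chat API:

```python
# Sketch: send the best-matching page image plus the text query to
# Llama 3.2 Vision through the Together API.
from together import Together

query = "What is ColPali's (late interaction) evaluation baseline score on DocQ and InfoQ?"
results = model.search(query, k=2)      # model and index from the steps above
page_image_b64 = results[0].base64      # top retrieved page as a base64 image

client = Together()
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{page_image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Passing the page as an image, rather than extracted text, is what lets the VLM use tables, figures, and layout when answering.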

MultiModal RAG Flow

  1. User Query
  2. Text + Vision Embedding via ColPali
  3. Index -> Retrieve relevant pages
  4. Llama 3.2 VLM processes both text query and retrieved PDF content
  5. Generated Answer
