A RAG system can be evaluated on two sides:
- retrieval evaluation: assess the accuracy and relevance of the information retrieved by the system
- response evaluation: assess the quality and appropriateness of the responses generated by the system based on the retrieved information.
This folder contains three scripts for retrieval evaluation. They compute three metrics (hit rate, Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG)) on automatically generated questions. The approach is based on an OpenAI cookbook by LlamaIndex.
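For reference, all three metrics can be computed per question from the ranked list returned by the retriever. The sketch below is illustrative (function and variable names are not taken from the scripts) and assumes a single relevant chunk per question:

```python
import math

def retrieval_metrics(retrieved_ids: list[str], expected_id: str, k: int = 10) -> dict:
    """Hit rate, MRR and NDCG for one query with a single relevant chunk.

    retrieved_ids: chunk ids returned by the retriever, best match first.
    expected_id:   id of the chunk the question was generated from.
    """
    hit, mrr, ndcg = 0.0, 0.0, 0.0
    if expected_id in retrieved_ids[:k]:
        rank = retrieved_ids.index(expected_id) + 1  # 1-based rank
        hit = 1.0
        mrr = 1.0 / rank
        # With one relevant document the ideal DCG is 1 (relevant doc at rank 1),
        # so NDCG reduces to the discount at the observed rank.
        ndcg = 1.0 / math.log2(rank + 1)
    return {"hit_rate": hit, "mrr": mrr, "ndcg": ndcg}
```

Averaging these per-question values over all generated questions gives the dataset-level scores reported by the third script.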
The scripts can be used to:
1. download an entire index (a collection of chunks) from Azure AI Search
2. use an LLM to generate a question for each chunk
3. run the retriever on the generated questions and compute the metrics
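As an illustration of step 1, downloading all chunks from an Azure AI Search index roughly amounts to paging through the index with the `azure-search-documents` SDK. Endpoint, index and field names below are placeholders, not the ones used by the scripts:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Placeholder endpoint, key, index and field names -- adjust to your setup.
search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-search-api-key>"),
)

# search_text="*" matches every document; the SDK pages through results lazily.
results = search_client.search(search_text="*", select=["id", "content"])
chunks = [{"id": doc["id"], "content": doc["content"]} for doc in results]
print(f"Downloaded {len(chunks)} chunks")
```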
Note: this repository assumes you're working with Azure AI Search and Azure OpenAI models. The idea remains the same if you're working with different vector databases, clouds or LLM APIs.
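For instance, step 2 calls an Azure OpenAI chat deployment to generate one question per chunk. A minimal sketch is shown below (deployment name, API version and prompt are illustrative); with a different LLM API, only this call would change:

```python
from openai import AzureOpenAI

# Placeholder endpoint, key, API version and deployment name.
client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<your-azure-openai-api-key>",
    api_version="2024-06-01",
)

def generate_question(chunk_text: str) -> str:
    """Ask the model for a question that the given chunk answers."""
    response = client.chat.completions.create(
        model="<your-chat-deployment>",  # name of your Azure OpenAI deployment
        messages=[
            {
                "role": "system",
                "content": "Generate one question that can be answered using only the provided context.",
            },
            {"role": "user", "content": f"Context:\n{chunk_text}\n\nQuestion:"},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()
```

Each generated question is paired with the id of the chunk it came from; these (question, expected chunk) pairs are what the metrics in step 3 are computed over.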
Install the dependencies like so:

```
uv add rag-evaluation
```

Then complete `config.py` with the required environment variables.
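`config.py` typically just reads those environment variables; a minimal sketch could look like the following (the variable names are hypothetical, check `config.py` for the ones the scripts actually expect):

```python
import os

# Hypothetical names -- adjust to the variables config.py actually defines.
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_INDEX = os.environ["AZURE_SEARCH_INDEX"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]
AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-06-01")
```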
Run the three scripts in sequence (note that you still need to enter the Azure AI Search API key in the first and third scripts):
```
uv run step_1_download_index.py
uv run step_2_generate_question_context_pairs.py
uv run step_3_calculate_metrics.py --search-type hybrid --use-reranker
```
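The `--search-type` and `--use-reranker` flags control how the retriever queries Azure AI Search. As a rough sketch, a hybrid (keyword + vector) query with the semantic reranker enabled looks like this with the `azure-search-documents` SDK; the vector field and semantic configuration names are placeholders:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<your-search-api-key>"),
)

def retrieve(question: str, question_embedding: list[float], top_k: int = 10) -> list[str]:
    """Hybrid retrieval: keyword + vector search, reranked by the semantic ranker."""
    results = search_client.search(
        search_text=question,  # keyword part of the hybrid query
        vector_queries=[
            VectorizedQuery(
                vector=question_embedding,
                k_nearest_neighbors=top_k,
                fields="embedding",  # placeholder vector field name
            )
        ],
        query_type="semantic",  # enables the semantic reranker
        semantic_configuration_name="<your-semantic-config>",
        top=top_k,
    )
    return [doc["id"] for doc in results]
```

The per-question rankings produced this way feed the hit rate, MRR and NDCG calculations in the third script.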