Assignment 2: LangChain - RAG

Pablo Tagarro-Melón

Step 3 (3 points)

Small error analysis and model improvements.

Project overview

The project implements the complete workflow of a Retrieval‑Augmented Generation (RAG) system, comparing a baseline LLM (no context) (evaluate_baseline.py) with a RAG-based, hybrid system that injects supporting passages into the prompt (evaluate_rag.py). This document corresponds to the third step (i.e., Small error analysis and model improvements) of the project, according to the instructions posted in eGela.

  • Evaluation set (dataset.jsonl) – 20 multiple‑choice questions on two topics related to the history of the city of León (my precious hometown :P):
      • Roman León: the legionary fortress of Legio VII Gemina (questions 1‑10)
      • Cortes of León 1188: the earliest documented parliament (questions 11‑20)
  • Knowledge corpus (corpus.jsonl) – Literal excerpts from two research PDFs (data/Text 1.pdf, data/Text 2.pdf) plus 50 distractor passages.
  • Local LLMs – Served by Ollama (tested with Llama 3, Gemma 2 × 2B, Mistral).
  • Exact‑match accuracy – Same metric for baseline and RAG so you can measure the retrieval improvement (if any!).
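A minimal sketch of how the exact‑match metric can be computed (function and variable names are mine, not necessarily those used in the evaluation scripts):

```python
def exact_match_accuracy(predictions, answer_indices):
    """Fraction of items whose predicted letter matches the gold option.

    predictions    -- list of single letters, e.g. ["A", "C", ...]
    answer_indices -- list of gold option indices, e.g. [0, 2, ...]
    """
    letters = "ABCD"
    correct = sum(
        pred.strip().upper() == letters[gold]
        for pred, gold in zip(predictions, answer_indices)
    )
    return correct / len(answer_indices)

print(exact_match_accuracy(["A", "B", "C", "D"], [0, 1, 2, 0]))  # → 0.75
```

Because the same function scores both setups, any accuracy difference is attributable to retrieval alone.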
Script                 Purpose
evaluate_baseline.py   Run the test without retrieval
evaluate_rag.py        Same test with BM25 retrieval
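To make the retrieval step concrete, here is a self‑contained Okapi BM25 ranker over a list of passages; this is an illustrative stdlib‑only sketch, not the code from evaluate_rag.py, which may use a library implementation instead:

```python
import math
from collections import Counter

def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with Okapi BM25 and return
    the passages sorted from most to least relevant."""
    docs = [p.lower().split() for p in passages]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    q_terms = query.lower().split()

    def score(d):
        tf = Counter(d)
        s = 0.0
        for t in q_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        return s

    ranked = sorted(range(N), key=lambda i: score(docs[i]), reverse=True)
    return [passages[i] for i in ranked]
```

The top‑ranked passages are then injected into the prompt ahead of the question, which is the only difference between the two scripts.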

This is the project layout...

assignment‑2‑pablo‑tagarro/
├── data/
│   ├── Text 1.pdf            # Roman camps monograph
│   ├── Text 2.pdf            # Masferrer, Decreta of León
│
├── corpus.jsonl              # 20 answer passages  + 50 distractors
├── dataset.jsonl             # 20 items (id, question, options, answer_index)
│
├── evaluate_baseline.py      # LLM‑only accuracy
├── evaluate_rag.py           # BM25 + LLM accuracy
│
├── requirements.txt          
└── README.md                 # You are here ;)
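For illustration, an item in dataset.jsonl would look like this (the field names come from the description above; the question text and options here are hypothetical):

```python
import json

# Hypothetical dataset.jsonl item following the documented schema:
# id, question, options, answer_index.
item = {
    "id": 1,
    "question": "Which Roman legion had its permanent fortress at León?",
    "options": [
        "Legio VI Victrix",
        "Legio VII Gemina",
        "Legio X Gemina",
        "Legio II Augusta",
    ],
    "answer_index": 1,
}
print(json.dumps(item, ensure_ascii=False))
```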

Error analysis

According to the results in Table 1, retrieval improves accuracy for every model; notably, the largest gain (+30 points) comes from the weakest baseline (Llava 13B), while the already reasonably accurate models (Gemma, Llama) gain +15 points each.

Model        Setup      Accuracy
Gemma 2B     Baseline   80 %
Gemma 2B     RAG        95 %
Llama 3:8B   Baseline   75 %
Llama 3:8B   RAG        90 %
Llava 13B    Baseline   65 %
Llava 13B    RAG        95 %

Table 1. Aggregate results

Looking at the wrong answers (n = 60 across all models tested), two error types dominate:

  1. Formatting – the model sometimes returns an explanation instead of a single letter.
  2. Hallucination – numeric facts are invented (particularly when RAG is not used).
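The formatting errors can be partly recovered in post‑processing by extracting the first standalone option letter from the model output instead of requiring a clean single‑letter reply. A sketch (the exact output format of the models is an assumption):

```python
import re

def extract_answer_letter(output, options="ABCD"):
    """Return the first standalone option letter found in the model
    output, or None when no letter can be recovered."""
    match = re.search(rf"\b([{options}])\b", output.strip(), re.IGNORECASE)
    return match.group(1).upper() if match else None
```

One caveat: with this naive regex the English article "a" can be matched as option A, so in practice the prompt should steer the model toward answers like "Answer: B".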

Model improvements

Sometimes the models return an explanation instead of a single letter, even though the prompt template uses one‑shot learning. It could be a good idea to fine‑tune the models to follow the instructions of this specific task. In a similar vein, results do not improve much after using a hybrid retriever (lexical retrieval with BM25 + semantic retrieval with embeddings), but this might be because the selected questions map directly onto the texts provided as grounding. In other words, a lexical retriever such as BM25 is more than enough for this dataset of questions. Another area of improvement could be to code a self‑verification system: the model generates two (or more) independent answer paths and, if they agree, the generated answer is accepted.
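The self‑verification idea could be sketched as a simple agreement check over independent generations; ask_model below is a hypothetical wrapper around the actual Ollama call, not part of the project code:

```python
def self_verify(question, ask_model, n_paths=2):
    """Query the model n_paths times independently and accept the answer
    only when all paths agree; otherwise signal the disagreement."""
    answers = [ask_model(question) for _ in range(n_paths)]
    if len(set(answers)) == 1:
        return answers[0]   # all paths agree -> accept
    return None             # disagreement -> fall back or re-ask

# Usage with a deterministic stub standing in for the real LLM call:
print(self_verify("Which emperor founded Legio VII Gemina?", lambda q: "C"))
```

With sampling temperature above zero the paths are genuinely independent, so agreement is a (weak) signal that the answer is not a one‑off hallucination.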
