The project implements the complete workflow of a Retrieval‑Augmented Generation (RAG) system, comparing a baseline LLM with no retrieved context (`evaluate_baseline.py`) against a hybrid RAG system that injects supporting passages into the prompt (`evaluate_rag.py`). This document corresponds to the third step of the project (i.e., small error analysis and model improvements), according to the instructions posted in eGela.
- Evaluation set (`dataset.jsonl`) – 20 multiple‑choice questions on two topics related to the history of the city of León (my precious hometown :P):
  - Roman León: the legionary fortress of Legio VII Gemina (questions 1‑10)
  - Cortes of León 1188: the earliest documented parliament (questions 11‑20)
- Knowledge corpus (`corpus.jsonl`) – Literal excerpts from two research PDFs (`data/Text 1.pdf`, `data/Text 2.pdf`) plus 50 distractor passages.
- Local LLMs – Served by Ollama (tested with Llama 3, Gemma 2 × 2B, Mistral).
- Exact‑match accuracy – The same metric for baseline and RAG, so you can measure the retrieval improvement (if any!).
| Script | Purpose |
|---|---|
| `evaluate_baseline.py` | Run the test without retrieval |
| `evaluate_rag.py` | Same test with BM25 retrieval |
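The BM25 step in `evaluate_rag.py` can be illustrated with a self‑contained Okapi BM25 sketch. This is an assumption about how the retrieval works (whitespace tokenisation, default `k1`/`b` parameters), not the actual implementation in the script:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    df = Counter()                      # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

def top_k(query, passages, k=3):
    """Return the k passages with the highest BM25 score for the query."""
    corpus_tokens = [p.lower().split() for p in passages]
    scores = bm25_scores(query.lower().split(), corpus_tokens)
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    return [passages[i] for i in ranked[:k]]
```

The top‑ranked passages are then prepended to the prompt so the model answers with the supporting evidence in its context window.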
This is the project layout:

```
assignment‑2‑pablo‑tagarro/
├── data/
│   ├── Text 1.pdf          # Roman camps monograph
│   └── Text 2.pdf          # Masferrer, Decreta of León
│
├── corpus.jsonl            # 20 answer passages + 50 distractors
├── dataset.jsonl           # 20 items (id, question, options, answer_index)
│
├── evaluate_baseline.py    # LLM‑only accuracy
├── evaluate_rag.py         # BM25 + LLM accuracy
│
├── requirements.txt
└── README.md               # You are here ;)
```
According to the results in Table 1, retrieval helps when the model is already reasonably accurate (Gemma, Llama).
| Model | Setup | Accuracy |
|---|---|---|
| Gemma 2B | Baseline | 80 % |
| Gemma 2B | RAG | 95 % |
| Llama 3 8B | Baseline | 75 % |
| Llama 3 8B | RAG | 90 % |
| Llava 13B | Baseline | 65 % |
| Llava 13B | RAG | 95 % |

Table 1. Aggregate results
As for the wrong answers (n = 60 across all models tested), two main error types appear:
- Formatting – the model sometimes returns an explanation instead of a single letter.
- Hallucination – numeric facts are invented (particularly when RAG is not used).
Sometimes the models return an explanation instead of a single letter, even though the prompt template uses one-shot learning. It could therefore be a good idea to fine-tune the models to follow the instructions of this specific task.

In a similar vein, results do not improve much after using a hybrid retriever (lexical retrieval with BM25 + semantic retrieval with embeddings), but this might be because the selected questions closely match the texts provided as grounding. This suggests that a lexical retriever such as BM25 is more than enough for this dataset of questions.

Another area of improvement could be a self-verification system: the model generates two (or more) independent answers and, only if they agree, the generated answer is accepted.
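The self-verification idea could be sketched as below. This is only a hypothetical design, not part of the existing scripts: `ask_model` stands in for a call to the Ollama-served model, and the retry limit is an arbitrary choice.

```python
from collections import Counter

def self_verify(ask_model, question, attempts=2, max_extra=3):
    """Sample independent answers; accept once two of them agree.

    `ask_model` is a hypothetical callable (e.g. wrapping an Ollama request)
    that returns a single option letter for the question.
    """
    answers = [ask_model(question) for _ in range(attempts)]
    for _ in range(max_extra):
        letter, votes = Counter(answers).most_common(1)[0]
        if votes >= 2:
            return letter          # at least two samples agree: accept
        answers.append(ask_model(question))  # disagreement: sample again
    return Counter(answers).most_common(1)[0][0]  # fall back to majority vote
```

Requiring agreement between independent generations trades extra inference calls for fewer hallucinated answers, which matches the error profile observed above.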