The project implements the complete workflow of a Retrieval‑Augmented Generation (RAG) system, comparing a baseline LLM with no retrieved context (`evaluate_baseline.py`) against a hybrid RAG system that injects supporting passages into the prompt (`evaluate_rag.py`). This document corresponds to the third step of the project (i.e., small error analysis and model improvements), according to the instructions posted in eGela.
- Evaluation set (`dataset.jsonl`) – 20 multiple‑choice questions on two topics related to the history of the city of León (my precious hometown :P):
  - Roman León: the legionary fortress of Legio VII Gemina (questions 1‑10)
  - Cortes of León 1188: the earliest documented parliament (questions 11‑20)
- Knowledge corpus (`corpus.jsonl`) – Literal excerpts from two research PDFs (`data/Text 1.pdf`, `data/Text 2.pdf`) plus 50 distractor passages.
- Local LLMs – Served by Ollama (tested with Llama 3, Gemma 2 × 2B, Mistral).
- Exact‑match accuracy – The same metric for baseline and RAG, so you can measure the retrieval improvement (if any!).
| Script | Purpose |
|---|---|
| `evaluate_baseline.py` | Run the test without retrieval |
| `evaluate_rag.py` | Same test with BM25 retrieval |
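The BM25 step in `evaluate_rag.py` can be illustrated with a self‑contained Okapi BM25 sketch. This is an assumption about how the retrieval works (whitespace tokenisation, default `k1`/`b` parameters), not the actual implementation in the script:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with Okapi BM25."""
    N = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / N
    df = Counter()                      # document frequency per term
    for doc in corpus_tokens:
        df.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            )
        scores.append(score)
    return scores

def top_k(query, passages, k=3):
    """Return the k passages with the highest BM25 score for the query."""
    corpus_tokens = [p.lower().split() for p in passages]
    scores = bm25_scores(query.lower().split(), corpus_tokens)
    ranked = sorted(range(len(passages)), key=lambda i: -scores[i])
    return [passages[i] for i in ranked[:k]]
```

The top‑ranked passages are then prepended to the prompt so the model answers with the supporting evidence in its context window.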
This is the project layout:

```
assignment‑2‑pablo‑tagarro/
├── data/
│   ├── Text 1.pdf          # Roman camps monograph
│   └── Text 2.pdf          # Masferrer, Decreta of León
│
├── corpus.jsonl            # 20 answer passages + 50 distractors
├── dataset.jsonl           # 20 items (id, question, options, answer_index)
│
├── evaluate_baseline.py    # LLM‑only accuracy
├── evaluate_rag.py         # BM25 + LLM accuracy
│
├── requirements.txt
└── README.md               # You are here ;)
```
According to the results in Table 1, retrieval helps when the model is already reasonably accurate (Gemma, Llama).
| Model | Setup | Accuracy |
|---|---|---|
| Gemma 2B | Baseline | 80 % |
| Gemma 2B | RAG | 95 % |
| Llama 3 8B | Baseline | 75 % |
| Llama 3 8B | RAG | 90 % |
| Llava 13B | Baseline | 65 % |
| Llava 13B | RAG | 95 % |

Table 1. Aggregate results
As for the wrong answers (n = 60 across all models tested), two main error types appear:
- Formatting – the model sometimes returns an explanation instead of a single letter.
- Hallucination – numeric facts are invented (particularly when RAG is not used).
Sometimes the models return an explanation instead of a single letter, even though the prompt template uses one-shot learning. It could therefore be a good idea to fine-tune the models to follow the instructions of this specific task.

In a similar vein, results do not improve much after using a hybrid retriever (lexical retrieval with BM25 + semantic retrieval with embeddings), but this might be because the selected questions closely match the texts provided as grounding. This suggests that a lexical retriever such as BM25 is more than enough for this dataset of questions.

Another area of improvement could be a self-verification system: the model generates two (or more) independent answers and, only if they agree, the generated answer is accepted.
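The self-verification idea could be sketched as below. This is only a hypothetical design, not part of the existing scripts: `ask_model` stands in for a call to the Ollama-served model, and the retry limit is an arbitrary choice.

```python
from collections import Counter

def self_verify(ask_model, question, attempts=2, max_extra=3):
    """Sample independent answers; accept once two of them agree.

    `ask_model` is a hypothetical callable (e.g. wrapping an Ollama request)
    that returns a single option letter for the question.
    """
    answers = [ask_model(question) for _ in range(attempts)]
    for _ in range(max_extra):
        letter, votes = Counter(answers).most_common(1)[0]
        if votes >= 2:
            return letter          # at least two samples agree: accept
        answers.append(ask_model(question))  # disagreement: sample again
    return Counter(answers).most_common(1)[0][0]  # fall back to majority vote
```

Requiring agreement between independent generations trades extra inference calls for fewer hallucinated answers, which matches the error profile observed above.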