Development of a chatbot using the Retrieval-Augmented Generation (RAG) technique, built during an internship at DataMantix s.r.l., Udine. The system was designed to assist researchers studying Italian vintage cinema by allowing them to query a large corpus of OCR-digitized journals through a conversational interface.
The system combines a vector database with a Large Language Model to retrieve relevant document chunks and generate grounded answers. Users can interact via a web portal, select the embedding model and prompt style, and inspect the source documents behind each answer.
- PDF documents are loaded, split into chunks, and embedded into a ChromaDB vector store.
- A user question is embedded using the same model and a similarity search retrieves the top-k relevant chunks.
- The chunks, the question, and the conversation history are passed to an LLM (via Ollama) to generate a response.
- The response, the raw document chunks, and their source identifiers are returned to the frontend.
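The sketch below traces this flow end to end with LangChain. It is a minimal illustration rather than the project code: the retrieval depth `k=5` and the prompt wording are assumptions, while the chunking parameters and the database path follow the notes at the end of this README.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFDirectoryLoader  # requires pypdf
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma

# 1. Load PDFs and split them into overlapping chunks.
documents = PyPDFDirectoryLoader("data/").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(documents)

# 2. Embed the chunks and persist them in a ChromaDB store.
embeddings = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
db = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_BAAI_Produzione")

# 3. Embed the question with the same model and retrieve the top-k chunks.
question = "Chi era Eitel Monaco?"
results = db.similarity_search_with_score(question, k=5)
context = "\n\n---\n\n".join(doc.page_content for doc, _score in results)

# 4. Ask the LLM for an answer grounded in the retrieved chunks.
prompt = f"Answer using only the following context:\n{context}\n\nQuestion: {question}"
print(Ollama(model="llama3").invoke(prompt))
```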
Project structure:

```
.
├── populate_database.py        # Loads PDFs, splits into chunks, embeds and stores in ChromaDB
├── get_embedding_function.py   # Returns the embedding model (BAAI or OpenAI)
├── query_data.py               # Core RAG query logic, called by the API
├── query_data_parser.py        # Standalone CLI version of the RAG query
├── query_data_test_prompt.py   # Utility to inspect generated prompts without calling the LLM
├── api_chatbot.py              # FastAPI backend exposing the /response endpoint
├── fastapi_prova.py            # FastAPI playground used during development
├── data_retrival.py            # Retrieves specific documents from ChromaDB by ID
└── test_rag.py                 # Automated evaluation tests for RAG responses
```
Requirements:

- Python 3.10+
- Ollama with `llama3` and/or `llama3.1` pulled locally
- ChromaDB
- LangChain
- FastAPI and Uvicorn
- Pydantic
- HuggingFace `sentence-transformers` (for BAAI embeddings)
- OpenAI Python SDK (optional, for OpenAI embeddings)
Install dependencies:
```
pip install langchain langchain-community chromadb fastapi uvicorn pydantic sentence-transformers openai
```

1. Add your documents
Place PDF files in the `data/` directory, organized into topic subfolders if needed.
2. Populate the database
```
python populate_database.py
```

To reset and rebuild the database:
```
python populate_database.py --reset
```

The database will be saved to a folder named `chroma_<engine>_<argument>` (e.g. `chroma_BAAI_Produzione`).
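For illustration, a reset flag like this is typically wired up with argparse. The sketch below is an assumption about the structure of populate_database.py, with a hypothetical hard-coded path; the real script derives the path from the engine and argument:

```python
import argparse
import shutil

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--reset", action="store_true",
                        help="Clear the existing database before rebuilding.")
    args = parser.parse_args()

    if args.reset:
        # Delete the persisted Chroma directory so all chunks are re-embedded.
        shutil.rmtree("chroma_BAAI_Produzione", ignore_errors=True)

    # ...load PDFs, split into chunks, embed, and store (see the pipeline sketch above)...

if __name__ == "__main__":
    main()
```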
3. Run the API
```
uvicorn api_chatbot:app --reload
```

POST /response

```
{
"answerType": "Precisa",
"argument": "Produzione",
"context": [
{ "content": "previous user message", "sentBy": "user" },
{ "content": "previous bot response", "sentBy": "bot" }
],
"engine": "BAAI",
"question": "Chi era Eitel Monaco?"
}
```

Response:

```
{
"answer": "...",
"text": ["chunk 1 content", "chunk 2 content", "..."],
"sources": ["data/file.pdf:3:12", "..."]
}
```
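The request body above maps naturally onto Pydantic models. This is a hedged sketch of what the models in api_chatbot.py could look like; the class names and the handler body are assumptions, not the actual source:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Turn(BaseModel):
    content: str
    sentBy: str  # "user" or "bot"

class QueryRequest(BaseModel):
    answerType: str    # "Precisa" or "Estesa/Generica"
    argument: str      # ChromaDB collection, e.g. "Produzione"
    context: list[Turn]
    engine: str        # "BAAI" or "OpenAi"
    question: str

@app.post("/response")
def respond(request: QueryRequest):
    # The real backend delegates to the RAG logic in query_data.py;
    # answer/text/sources mirror the response shape shown above.
    answer, chunks, sources = "", [], []  # placeholder
    return {"answer": answer, "text": chunks, "sources": sources}
```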
Parameters

| Field | Values | Description |
|---|---|---|
| `answerType` | `Precisa`, `Estesa/Generica` | Controls the prompt structure |
| `engine` | `BAAI`, `OpenAi` | Embedding model to use |
| `argument` | e.g. `Produzione` | Selects which ChromaDB collection to query |
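Once the server is running, a minimal client call looks like this. It assumes Uvicorn's default address (127.0.0.1:8000) and uses the requests library, which is not in the dependency list above:

```python
import requests

payload = {
    "answerType": "Precisa",
    "argument": "Produzione",
    "context": [],  # empty conversation history for the first turn
    "engine": "BAAI",
    "question": "Chi era Eitel Monaco?",
}
resp = requests.post("http://127.0.0.1:8000/response", json=payload)
data = resp.json()
print(data["answer"])
print(data["sources"])  # e.g. ["data/file.pdf:3:12", ...]
```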
Embedding models evaluated:

- Ollama Nomic-Embed-Text — baseline, used in early experiments
- TensorFlow Hub — second reference model
- spaCy it_core_news_lg — Italian-language model, underperformed
- BAAI/bge-large-en-v1.5 — best open-source result, used in production
- JinaAI jina-embeddings-v2-small-en — paid, good results on small DB
- OpenAI text-embedding-ada-002 — best overall quality, limited by cost
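Based on the two production options above, get_embedding_function.py plausibly looks like the following. This is a sketch under the assumption that it switches on the same engine values as the API; the actual file may differ:

```python
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, OpenAIEmbeddings

def get_embedding_function(engine: str):
    """Return the embedding model matching the API's `engine` field."""
    if engine == "BAAI":
        # Local sentence-transformers model, best open-source result.
        return HuggingFaceBgeEmbeddings(model_name="BAAI/bge-large-en-v1.5")
    if engine == "OpenAi":
        # Paid API model; requires OPENAI_API_KEY in the environment.
        return OpenAIEmbeddings(model="text-embedding-ada-002")
    raise ValueError(f"Unsupported engine: {engine!r}")
```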
LLMs used:

- Mistral 7B — early experiments
- LLaMA 3.0 — primary model throughout the project
- LLaMA 3.1 — used in the final phase; improved Italian language support and a 128k context window
Notes:

- The vector database path is constructed as `chroma_<engine>_<argument>`, so each combination of embedding model and topic has its own store.
- A chunk size of 300 characters with an overlap of 50 produced the best results for specific questions.
- Conversation history is injected into the prompt to maintain context across turns (see the sketch after this list).
- Source documents were OCR-scanned journals; OCR quality affects retrieval accuracy.
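To illustrate the history injection, here is a hedged sketch of a prompt builder. The real templates live in query_data.py and also vary with answerType, so the wording below is illustrative only:

```python
def build_prompt(question: str, context: str, history: list[dict]) -> str:
    """Assemble an LLM prompt from retrieved chunks and prior turns."""
    # Flatten prior turns ({"content": ..., "sentBy": ...}) into plain text.
    turns = "\n".join(f"{m['sentBy']}: {m['content']}" for m in history)
    return (
        "Answer the question using only the following context:\n"
        f"{context}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Question: {question}"
    )
```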
University of Udine — DMIF, Academic Year 2024-2025